bigdata - spark

Generic Dataflow Management

MapReduce is very simple and generic, but many more complex uses involve not just one, but a sequence of several MapReduce jobs. Furthermore, the MapReduce API is low-level, and most users need higher-level interfaces, either in the form of APIs or query languages. This is why, after MapReduce, another generation of distributed processing technologies was invented. The most popular one is the open-source Apache Spark.

A more general dataflow model

MapReduce consists of a map phase, followed by shuffling, followed by a reduce phase. In fact, the map phase and the reduce phase are not so different: both involve the computation of tasks in parallel on slots.

Resilient distributed datasets

The first idea behind generic dataflow processing is to allow the dataflow to be arranged in any directed acyclic graph (DAG).

[Figure: a dataflow arranged as a directed acyclic graph (DAG)]

All the rectangular nodes in the above graph correspond to intermediate data. They are called resilient distributed datasets, or in short, RDDs.

A major difference with MapReduce, though, is that RDDs need not be collections of pairs. In fact, RDDs can be (ordered) collections of just about anything: strings, dates, integers, objects, arrays, arbitrary JSON values, XML nodes, etc. The only constraint is that the values within the same RDD share the same static type, which does not exclude the use of polymorphism.

The RDD lifecycle

Creation

RDDs undergo a lifecycle. First, they get created. RDDs can be created by reading a dataset from the local disk, from cloud storage, from a distributed file system, or from a database source, or directly on the fly from a list of values residing in the memory of the client application using Apache Spark.
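
As an illustration, a minimal PySpark sketch of these creation options might look as follows (the HDFS path and S3 bucket are hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd_from_memory = sc.parallelize(["a", "b", "c"])       # from a list in the client's memory
rdd_from_hdfs = sc.textFile("hdfs:///data/logs.txt")    # from a distributed file system (hypothetical path)
rdd_from_s3 = sc.textFile("s3a://my-bucket/logs.txt")   # from cloud storage (hypothetical bucket)
```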

Transformation

Then, RDDs can be transformed into other RDDs. Mapping or reducing, in this model, become two very specific cases of transformations. However, Spark allows for many, many more kinds of transformations. This also includes transformations with several RDDs as input.

Action

RDDs can also undergo a final action that makes an output persistent. This can be done by outputting the contents of an RDD to the local disk, to cloud storage, to a distributed file system, to a database system, or directly to the screen of the user.

Lazy evaluation

Another important aspect of the RDD lifecycle is that the evaluation is lazy: in fact, creations and transformations on their own do nothing. It is only with an action that the entire computation pipeline is put into motion, leading to the computation of all the necessary intermediate RDDs all the way down to the final output corresponding to the action.
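
A small PySpark sketch of this laziness: the creation and the transformation return immediately, and work only happens at the action.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

numbers = sc.parallelize(range(1_000_000))      # creation: nothing is computed yet
evens = numbers.filter(lambda n: n % 2 == 0)    # transformation: still nothing computed
print(evens.count())                            # action: the whole pipeline runs now
```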

Transformations

Unary transformations

Let us start with unary transformations, i.e., those that take a single RDD as their input.

Binary transformations

There are also transformations that take two RDDs as input.

Pair transformations

Spark has transformations specifically tailored for RDDs of key-value pairs.
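
The following PySpark sketch illustrates the three kinds of transformations above (unary, binary, and pair transformations) on toy data:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

words = sc.parallelize(["a", "b", "a", "c"])
pairs = words.map(lambda w: (w, 1))              # unary transformation
other = sc.parallelize([("a", 10), ("b", 20)])

counts = pairs.reduceByKey(lambda x, y: x + y)   # pair transformation (groups by key)
merged = pairs.union(other)                      # binary transformation
joined = counts.join(other)                      # binary transformation on pair RDDs
print(counts.collect(), joined.collect())
```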

Actions

Gathering output locally

The collect action downloads all values of an RDD to the client machine and outputs them as a (local) list. It is important to only call this action on an RDD that is known to be small enough (e.g., after filtering) to avoid a memory overflow.

The count action computes (in parallel) the total number of values in the input RDD. This one is safe even for RDDs with billions of values, as it returns a simple integer to the client.

Writing to sharded datasets

There is also an action called saveAsTextFile that can save the entire RDD to a sharded dataset on Cloud storage (S3, Azure blob storage) or a distributed file system (HDFS).
Binary outputs can be saved with saveAsObjectFile.

Actions on Pair RDDs

Physical architecture

Narrow-dependency transformations

In a narrow-dependency transformation, the computation of each output value involves a single input value.

By default, if the transformation is applied to an RDD freshly created from reading a dataset from HDFS, each partition will correspond to an HDFS block. Thus, the computation of the narrow-dependency transformation mostly involves local reads by short-circuiting HDFS.

In fact, the way this works is quite similar to MapReduce: the sequential series of calls of the transformation function on each input value within a single partition is called a task. And just like in MapReduce, the tasks are assigned to slots. These slots correspond to cores within YARN containers. YARN containers used by Spark are called executors. The processing of the tasks is sequential within each executor, and tasks are executed in parallel across executors. And like in MapReduce, a queue of unprocessed tasks is maintained, and every time a slot is done, it gets a new task. When all tasks have been assigned, the slots that are done become idle and wait for all others to complete.

Chains of narrow-dependency transformations

In fact, on the physical level, the calls of the underlying map/filter/etc. functions are directly chained on each input value to directly produce the corresponding final output value, meaning that the intermediate RDDs are not even materialized anywhere and exist purely logically.

Such a chain of narrow-dependency transformations executed efficiently as a single set of tasks is called a stage, which would correspond to what is called a phase in MapReduce.

Physical parameters

Users can parameterize how many executors there are, how many cores there are per executor and how much memory per executor (remember that these then correspond to YARN container requests).
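
One common way to set these parameters is at session creation time; the sketch below is an arbitrary sizing example, not a recommendation:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sizing-example")
         .config("spark.executor.instances", "10")   # number of executors (YARN containers)
         .config("spark.executor.cores", "4")        # cores (slots) per executor
         .config("spark.executor.memory", "8g")      # memory per executor
         .getOrCreate())
```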

Shuffling

What about wide-dependency transformations? They involve a shuffling of the data over the network, so that the data necessary to compute each partition of the output RDD is physically at the same location. Thus, at the high level of a job, the physical execution consists of a sequence of stages, with shuffling happening every time a new stage begins.

Optimizations

Pinning RDDs

Every time an action is triggered, all the computations of the “reverse transitive closure” (i.e., all the way up the DAG through the reverted edges) are set into motion. In some cases, several actions might share subgraphs of the DAG. It makes sense, then, to “pin” the intermediate RDD by persisting it.
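
A minimal PySpark sketch of pinning, with hypothetical input and output paths; persist() marks the RDD so that the first action materializes it and later actions reuse it:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

parsed = sc.textFile("hdfs:///data/events.txt").map(lambda line: line.split(","))
parsed.persist(StorageLevel.MEMORY_AND_DISK)      # pin the shared intermediate RDD

print(parsed.count())                             # first action: computes and caches
parsed.saveAsTextFile("hdfs:///out/events-copy")  # second action: reuses the pinned RDD
```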

Pre-partitioning

Shuffle is needed to bring together the data that is needed to jointly contribute to individual output values. If, however, Spark knows that the data is already located where it should be, then shuffling is not needed.

DataFrames in Spark

Data independence

Unlike a relational database that has everything right off-the-shelf, with RDDs the user has to re-implement all the primitives they need. This is a breach of the data independence principle. The developers behind Spark addressed this issue in a subsequent version of Spark by extending the model with support for DataFrames and Spark SQL, bringing back a widely established and popular declarative, high-level language into the ecosystem.

A specific kind of RDD

A DataFrame can be seen as a specific kind of RDD: it is an RDD of rows (equivalently: tuples, records) that has relational integrity, domain integrity, but not necessarily atomic integrity.

Performance impact

DataFrames are stored column-wise in memory, meaning that the values that belong to the same column are stored together. Furthermore, since there is a known schema, the names of the attributes need not be repeated in every single row, as would be the case with raw RDDs. DataFrames are thus considerably more compact in memory than raw RDDs.

Generally, Spark converts Spark SQL to internal DataFrame transformations and eventually to a physical query plan. An optimizer known as Catalyst is then able to find many ways of making the execution faster.
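
For illustration, a small PySpark sketch: registering a DataFrame as a view, querying it with Spark SQL, and printing the plan chosen by Catalyst (the data is a toy example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Einstein", 142), ("Curie", 155)], ["name", "age"])
df.createOrReplaceTempView("people")

result = spark.sql("SELECT name FROM people WHERE age > 150")
result.explain()   # prints the physical plan produced by the Catalyst optimizer
result.show()
```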

Input formats

Note that Spark automatically infers the schema by discovering the JSON Lines file, which adds an upfront performance overhead (a pass over the input at load time) that does not exist for raw RDDs.
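
A minimal sketch of schema inference (the JSON Lines path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("hdfs:///data/people.jsonl")   # one JSON object per line
df.printSchema()                                    # schema inferred by scanning the input
```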

DataFrame transformations

It is also possible, instead of Spark SQL, to use a transformation API similar to (but distinct from) the RDD transformation API.

Unlike the RDD transformation API, there is no guarantee that the execution will happen as written, as the optimizer is free to reorganize the actual computations.
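
A small sketch of the DataFrame transformation API on toy data; the optimizer is free to reorder or fuse these steps:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Einstein", 142, "CH"), ("Curie", 155, "FR"), ("Gödel", 116, "AT")],
    ["name", "age", "country"])

result = (df.filter(F.col("age") >= 120)
            .groupBy("country")
            .agg(F.count("*").alias("n"))
            .orderBy(F.col("n").desc()))
result.show()
```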

DataFrame column types

DataFrames also support the three structured types: arrays, structs, and maps. As a reminder, structs have string keys and arbitrary value types and correspond to generic JSON objects, while in maps, all values have the same type. Thus, structs are more common than maps.

The Spark SQL dialect

Note that both GROUP BY and ORDER BY will trigger a shuffle in the system, even though this can be optimized when the grouping key and the sorting key are the same. The SORT BY clause sorts rows within each partition, but not across partitions, i.e., it does not induce any shuffling. The DISTRIBUTE BY clause forces a repartition by putting all rows with the same value (for the specified field(s)) into the same new partition.
Note that the SORT BY clause returns the result rows sorted within each partition in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered. This differs from the ORDER BY clause, which guarantees a total order of the output.
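
A sketch of these clauses in Spark SQL, on a toy view:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("Einstein", "CH"), ("Curie", "FR"), ("Gödel", "AT")],
    ["name", "country"]).createOrReplaceTempView("people")

# DISTRIBUTE BY shuffles rows so that equal countries land in the same partition;
# SORT BY then sorts within each partition only (no total order, no further shuffle).
spark.sql("""
    SELECT name, country
    FROM people
    DISTRIBUTE BY country
    SORT BY name
""").show()
# CLUSTER BY country would be shorthand for DISTRIBUTE BY country SORT BY country.
```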

A word of warning must be given on the SORT BY, DISTRIBUTE BY and CLUSTER BY clauses: they are, in fact, a breach of data independence, because they expose partitions.

Spark SQL also comes with language features to deal with nested arrays and objects. First, nested arrays can be navigated with the explode() function. Lateral views are more powerful and generic than just an explode() because they give more control, and they can also be used to go down several levels of nesting. A lateral view can be intuitively described this way: the array mentioned in the LATERAL VIEW clause is turned into a second, virtual table, which is then joined with the rest of the original table. The other clauses can then refer to columns in both the original and the second, virtual table.
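
A minimal sketch of a lateral view over a nested array column (toy data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("Einstein", ["030-123", "030-456"]), ("Curie", ["010-789"])],
    ["name", "phones"]).createOrReplaceTempView("people")

# each array element becomes its own row, joined back with the original row
spark.sql("""
    SELECT name, phone
    FROM people
    LATERAL VIEW explode(phones) p AS phone
""").show()
```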

exercise

RDD

Why RDDs should be immutable and lazy: immutability makes lineage-based recomputation (and thus fault tolerance) possible, and laziness lets Spark see the whole DAG before executing anything.
Why RDDs need partitioning: to enable parallel computation and to reduce shuffling.

DataFrame API

For nested arrays, use array_contains.
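
For example (toy data, DataFrame API):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Einstein", ["030-123", "030-456"]), ("Curie", ["010-789"])],
    ["name", "phones"])

df.filter(F.array_contains("phones", "010-789")).show()   # rows whose array holds the value
```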

spark SQL

In a Jupyter notebook, we can use “%load_ext sparksql_magic” directly.


Probabilistic Artificial Intelligence - Bayesian Deep Learning

SWAG (Stochastic Weight Averaging-Gaussian)

This paper proposes a different approach to Bayesian deep learning: they use the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network [1].

SWA

This paper shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training [2].
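
The following is a toy, self-contained sketch of the SWAG idea (diagonal variant) on a synthetic quadratic problem; the “model”, learning rate and snapshot schedule are arbitrary stand-ins, not the papers' setup:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([2.0, -1.0])            # optimum of the toy problem
w = np.zeros(2)
snapshots = []

for step in range(500):                   # SGD with a constant learning rate, as in SWA
    grad = (w - w_star) + 0.1 * rng.normal(size=2)   # noisy gradient of 0.5 * ||w - w_star||^2
    w -= 0.1 * grad
    if step >= 100 and step % 10 == 0:    # collect snapshots along the trajectory
        snapshots.append(w.copy())

W = np.stack(snapshots)
swa_mean = W.mean(axis=0)                 # SWA: average of the iterates
swag_var = W.var(axis=0)                  # SWAG (diagonal): per-weight variance of the iterates

# approximate posterior over weights: N(swa_mean, diag(swag_var));
# predictions would average the model output over a few sampled weight vectors
weight_samples = swa_mean + np.sqrt(swag_var) * rng.normal(size=(10, 2))
print(swa_mean, swag_var)
```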

cyclical learning rate schedule

Calibration of Modern Neural Networks

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration.
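
As a rough illustration of how calibration is commonly measured, here is a small sketch of the expected calibration error (ECE); the binning scheme is a simplification and the data is a toy example:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over confidence bins, of |accuracy - average confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

conf = np.array([0.9, 0.8, 0.95, 0.7, 0.85])   # predicted confidences
correct = np.array([1, 0, 1, 1, 0])            # whether each prediction was right
print(expected_calibration_error(conf, correct))
```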

references

[1] Maddox W J, Izmailov P, Garipov T, et al. A simple baseline for bayesian uncertainty in deep learning[J]. Advances in neural information processing systems, 2019, 32.
[2] Izmailov P, Podoprikhin D, Garipov T, et al. Averaging weights leads to wider optima and better generalization[J]. arXiv preprint arXiv:1803.05407, 2018.
[3] Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks[C]//International conference on machine learning. PMLR, 2017: 1321-1330.

bigdata - YARN

Resource management

The first version of MapReduce is inefficient in several respects. For this reason, the architecture was fundamentally changed by adding a resource management layer to the stack, adding one more level of decoupling between scheduling and monitoring. A resource management system, here YARN, is a very important building block not only for a better MapReduce, but also for many other technologies running on a cluster.

Limitations of MapReduce in its first version

The JobTracker has a lot on its shoulders! It has to deal with resource management, scheduling, monitoring, the job lifecycle, and fault tolerance. The first consequence of this is limited scalability. The second consequence is the bottleneck that this introduces at the JobTracker level, which slows down the entire system. The third issue is that it is difficult to design a system that does many things well. The fourth issue is that resources are statically allocated to the Map or the Reduce phase, meaning that parts of the cluster remain idle during both phases. The fifth issue is the lack of fungibility between the Map phase and the Reduce phase.

YARN

General architecture

YARN stands for Yet Another Resource Negotiator. It was introduced as an additional layer that specifically handles the management of CPU and memory resources in the cluster.

YARN, unsurprisingly, is based on a centralized architecture in which the coordinator node is called the ResourceManager, and the worker nodes are called NodeManagers. NodeManagers furthermore provide slots (equipped with exclusively allocated CPU and memory) known as containers.

When a new application is launched, the ResourceManager assigns one of the containers to act as the ApplicationMaster, which will take care of running the application. The ApplicationMaster can then communicate with the ResourceManager in order to book and use more containers in order to run jobs. This is a fundamental change from the initial MapReduce architecture, in which the JobTracker was also taking care of running the MapReduce job.

Thus, YARN cleanly separates between the general management of resources and bootstrapping new applications, which remains centralized on the coordinator node, and monitoring the job lifecycle, which is now delegated to one or more ApplicationMasters running concurrently. This means, in particular, that several applications can run concurrently in the same cluster. This ability, known as multi-tenancy, is very important for large companies or universities in order to optimize the use of their resources.

Resource management

In resource management, one abstracts away from hardware by distinguishing between four specific resources used in a distributed database system. These four resources are: memory, CPU, disk I/O, and network I/O.
ApplicationMasters can request and release containers at any time, dynamically. A container request is typically made by the ApplicationMasters with a specific demand. If the request is granted by the ResourceManager fully or partially, this is done indirectly by signing and issuing a container token to the ApplicationMaster that acts as proof that the resource was granted. The ApplicationMaster can then connect to the allocated NodeManager and send the token. The NodeManager will then check the validity of the token and provide the memory and CPU granted by the ResourceManager. The ApplicationMaster ships the code (e.g., as a jar file) as well as parameters, which then runs as a process with exclusive use of this memory and CPU.

Job lifecycle management and fault tolerance

Version 2 of MapReduce works on top of YARN by leaving the job lifecycle management to an ApplicationMaster. The ApplicationMaster requests containers for the Map phase, and sets these containers up to execute Map tasks. As soon as a container is done executing a Map task, the ApplicationMaster will assign a new Map task to this container from the remaining queue, until no Map tasks are left.

A container in the Map phase can contain several Map slots. Sharing memory and containers across slots in this way improves the overall efficiency, because setting up a container adds latency. It is thus more efficient to allocate 10 containers with 4 cores each than 40 containers with 1 core each.

In the event of a container crashing during the Map phase, the ApplicationMaster will handle this by re-requesting containers and restarting the failed tasks. In the case that some data is lost in the Reduce phase, it is possible that the entire job must be restarted, because the only way to recreate the intermediate data is to re-execute the Map tasks.

Scheduling

The ResourceManager keeps track of the list of available NodeManagers (who can dynamically come and go) and their status. Just like in HDFS, NodeManagers send periodic heartbeats to the ResourceManager to give a sign of life. The ResourceManager also offers an interface so that ApplicationMasters can register and send container requests. ApplicationMasters also send periodic heartbeats to the ResourceManager. It is important to understand that, unlike the JobTracker, the ResourceManager does not monitor tasks, and will not restart slots upon failure. This job is left to the ApplicationMasters.

Scheduling strategies

FIFO scheduling

In FIFO scheduling, there is one application at a time running on the entire cluster. When it is done, the next application runs again on the entire cluster, and so on.

Capacity scheduling

In capacity scheduling, the resources of the cluster are partitioned into several sub-clusters of various sizes. Each one of these sub-clusters has its own queue of applications running in a FIFO fashion within this queue. Capacity scheduling also exists in a more “dynamic flavour” in which, when a sub-cluster is not currently used, its resources can be temporarily lent to the other sub-clusters. This is also in the spirit of usage maximization.

Fair scheduling

Fair scheduling involves more complex algorithms that attempt to allocate resources in a way that is fair to all users of the cluster, based on the share they are normally entitled to. Fair scheduling consists in making dynamic decisions regarding which requests get granted and which requests have to wait. Fair scheduling combines several ways to compute cluster shares:
• Steady fair share: the share of the cluster officially allocated to each department.
• Instantaneous fair share: the fair share that a department should ideally be allocated (according to economic and game theory considerations) at any point in time. This is a dynamic number that changes constantly, based on departments being idle: if a department is idle, then the instantaneous fair share of the other departments becomes higher than their steady fair shares.
• Current share: the actual share of the cluster that a department effectively uses at any point in time. This is highly dynamic. The current share does not necessarily match the instantaneous fair share because there is some inertia in the process: a department might be using more resources while another is idle. When the other department later stops being idle, these resources are not immediately withdrawn from the first department; rather, the first department will stop getting more resources, and the second department will gradually recover these resources as they get released by the first department.

The easiest case of fair scheduling is when only one resource is considered: for example, only CPU cores, or only memory. Things become more complicated when several resources are considered, in practice both CPU cores and memory. This problem was solved game-theoretically with the Dominant Resource Fairness algorithm. The two (or more) dimensions are projected again to a single dimension by looking at the dominant resource for each user.
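
A toy sketch of the idea (not YARN code): one task at a time is granted to the user whose dominant share is currently smallest. The capacities and per-task demands below are numbers commonly used to illustrate DRF, not anything specific to YARN.

```python
capacity = {"cpu": 9.0, "mem": 18.0}
demand = {"A": {"cpu": 1.0, "mem": 4.0},     # per-task demand of user A
          "B": {"cpu": 3.0, "mem": 1.0}}     # per-task demand of user B
used = {user: {"cpu": 0.0, "mem": 0.0} for user in demand}

def dominant_share(user):
    return max(used[user][r] / capacity[r] for r in capacity)

def fits(user):
    return all(sum(used[u][r] for u in used) + demand[user][r] <= capacity[r]
               for r in capacity)

while True:
    user = min(demand, key=dominant_share)   # the most "starved" user gets the next task
    if not fits(user):
        break
    for r in capacity:
        used[user][r] += demand[user][r]

print(used)   # A ends up with 3 tasks, B with 2: both reach a dominant share of 2/3
```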

exercise

motivation

improve hadoop 1.x by adding a layer YARN to separate resource management from data processing.

architecture

The NodeManager is a generalized TaskTracker.
On the ResourceManager, there is an ApplicationManager responsible for admitting new jobs and collecting finished jobs and logs.
A scheduler has a global view and assigns an ApplicationMaster. The ApplicationMaster will compute how many resources are needed and send a request to the scheduler.

Fault tolerance is provided by the ApplicationMaster + NodeManager + HDFS.

bigdata - Map Reduce

Patterns of large-scale query processing

Textual input

We saw that the data we want to query can take many forms. First, it can be billions of lines of text. It can also be plenty of CSV lines. Some other formats (e.g., Parquet, …) can be binary. We also encountered HFiles in Chapter 6, which are lists of key-value pairs. In fact, Hadoop has another such key-value-based format called Sequence File, which is simply a list of key-value pairs, not necessarily sorted by key (although ordered) and with keys not necessarily unique.

Shards

There are several practical motivations for the many-files pattern even in HDFS.

Querying pattern

This is the motivation behind the standard MapReduce pattern: a map-like phase on the entire input, then a shuffle phase on the intermediate data, then another map-like phase (called reduce) producing the entire output.

MapReduce Model

In MapReduce, the input data, intermediate data, and output data are all made of a large collection of key-value pairs (with the keys not necessarily unique, and not necessarily sorted by key). The types of the keys and values are known at compile-time (statically), and they do not need to be the same across all three collections.

MapReduce architecture

On a cluster, the architecture is centralized, just like for HDFS and HBase. In the original version of MapReduce, the main node is called JobTracker, and the worker nodes are called TaskTrackers.

In fact, the JobTracker typically runs on the same machine as the NameNode (and HMaster) and the TaskTrackers on the same machines as the DataNodes (and RegionServers). This is called “bring the query to the data.”

As the map phase progresses, there is a risk that the memory becomes full. But we have seen this before with HBase: the intermediate pairs on that machine are then sorted by key and flushed to disk to a Sequence File. And as more flushes happen, these Sequence Files can be compacted into fewer of them, very similarly to HBase’s Log-Structured Merge Trees. When the map phase is over, each TaskTracker runs an HTTP server listening for connections, so that they can connect to each other and ship the intermediate data over to create the intermediate partitions, ensuring that the same keys are on the same machines. This is the phase called shuffling. Then, the reduce phase can start. When the reduce phase is completed, each output partition will be output to a shard (as we saw, a file named incrementally) in the output destination (HDFS, S3, etc.) and in the desired format.

Note that shuffling can start before the map phase is over, but the reduce phase can only start after the map phase is over.

MapReduce input and output formats

Impedance mismatch

MapReduce can read its input from files lying in a data lake as well as directly from a database system such as HBase or a relational database management system. MapReduce only reads and writes lists of key-value pairs, where keys may be duplicates and need not appear in order. However, the inputs we considered are not key-value pairs. So we need an additional mechanism that allows MapReduce to interpret this input as key-value pairs. For tables, whether relational or in a wide column store, this is relatively easy: indeed, tables have primary keys, consisting of either a single column or multiple columns. Thus, each tuple can be interpreted as a key-value pair.

Mapping files to pairs

How do we read a (possibly huge) text file as a list of key-value pairs? The most natural way to do so is to turn each line of text into a key-value pair: the value is the string corresponding to the entire line, while the key is an integer that expresses the position (as a number of characters), or offset, at which the line starts in the current file being read.

A few examples

Counting words

[Figures: the word count example]
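
A minimal local simulation of this job in Python (not actual Hadoop code), just to make the three phases explicit:

```python
from collections import defaultdict

def map_fn(key, value):             # key: offset in the file, value: one line of text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):         # key: a word, values: all intermediate counts for it
    yield (key, sum(values))

input_pairs = [(0, "the quick brown fox"), (20, "the lazy dog the end")]

intermediate = [kv for k, v in input_pairs for kv in map_fn(k, v)]   # map phase
groups = defaultdict(list)
for k, v in intermediate:                                            # shuffle phase
    groups[k].append(v)
output = [kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs)]  # reduce phase
print(output)   # [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Since summing is associative and commutative, the same reduce function could also serve as a combine function (discussed further below).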

Selecting

This is something easily done by having a map function that outputs a subset of its input, based on some predicate provided by the user.
[Figure: selection example]
Here we notice that the output of the map phase already gives us the desired result; we still need to provide a reduce function, which is taken trivially as the identity function. This is not unusual (and there are also examples where the map function is trivial, and the reduce function is doing the actual processing).

Projecting

The map function can project this object to an object with fewer attributes:
[Figure: projection example]

MapReduce and the relational algebra

As an exercise, try to figure out how to implement a GROUP BY clause and an ORDER BY clause. What about the HAVING clause? Naturally, executing these queries directly in MapReduce is very cumbersome because of the low-level code we need to write.

Combine functions and optimization

In addition to the map function and the reduce function, the user can supply a combine function. This combine function can then be called by the system during the map phase as many times as it sees fit to “compress” the intermediate key-value pairs. Strategically, the combine function is likely to be called at every flush of key-value pairs to a Sequence File on disk, and at every compaction of several Sequence Files into one.

However, there is no guarantee that the combine function will be called at all, and there is also no guarantee on how many times it will be called. Thus, if the user provides a combine function, it is important that they think carefully about a combine function that does not affect the correctness of the output data.

In fact, in most of the cases, the combine function will be identical to the reduce function, which is generally possible if the intermediate key-value pairs have the same type as the output key-value pairs, and the reduce function is both associative and commutative.

MapReduce programming API

Mapper classes

In Java, the user needs to define a so-called Mapper class that contains the map function, and a Reducer class that contains the reduce function. A map function takes in particular a key and a value. Note that it outputs key-value pairs via the call of the write method on the context, rather than with a return statement.

Reducer classes

A reduce function takes in particular a key and a list of values. Note that it outputs key-value pairs via the call of the write method on the context, rather than with a return statement.

Running the job

Finally, a MapReduce job can be created and invoked by supplying a Mapper and a Reducer instance to the job. A combine function can also be supplied with the setCombinerClass method. It is also possible to use Python rather than Java, via the so-called Streaming API. The Streaming API is the general way to invoke MapReduce jobs in languages other than Java.

Using correct terminology

Functions

A map function is a mathematical, or programmed, function that takes one input key-value pair and returns zero, one or more intermediate key-value pairs.

A reduce function is a mathematical, or programmed, function that takes one or more intermediate key-value pairs and returns zero, one or more output key-value pairs.

A combine function is a mathematical, or programmed, function that takes one or more intermediate key-value pairs and returns zero, one or more intermediate key-value pairs.

Note that the combine function is an optional local aggregation step that occurs before shuffling and sorting, and its purpose is to reduce the amount of data that needs to be transferred to the reducers. The reduce function, on the other hand, performs the final aggregation and processing based on keys.

Tasks

A map task is an assignment (or “homework”, or “TODO”) that consists in a (sequential) series of calls of the map function on a subset of the input.

A reduce task is an assignment that consists in a (sequential) series of calls of the reduce function on a subset of the intermediate input.

We insist that the calls within a task are sequential, meaning that there is no parallelism at all within a task.

There is no such thing as a combine task. Calls of the combine function are not planned as a task, but are made ad hoc during flushing and compaction.

Slots

The map tasks are processed thanks to compute and memory resources (CPU and RAM). These resources are called map slots. One map slot corresponds to one CPU core and some allocated memory. The number of map slots is limited by the number of available cores. Each map slot then processes one map task at a time, sequentially. The resources used to process reduce tasks are called reduce slots. So, there is no parallelism either within one map slot, or one reduce slot. In fact, parallelism happens across several slots.

Phases

The map phase thus consists of several map slots processing map tasks in parallel.

blocks vs. splits

HDFS blocks have a size of (at most) 128 MB. In every file, all blocks but the last one have a size of exactly 128 MB. Splits, however, only contain full records: a key-value pair will only belong to one split (and thus be processed by one map task).

Probabilistic Artificial Intelligence - Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods

The key idea of Markov chain Monte Carlo methods is to construct a Markov chain, which is efficient to simulate and has the stationary distribution p.

Markov Chains

Intuitively, the Markov property states that future behavior is independent of past states given the present state.

Markov chains can be classified into different types based on their behavior. These classifications include time-homogeneous and time-inhomogeneous Markov chains, irreducible Markov chains, and periodic and aperiodic Markov chains.

We restrict our attention to time-homogeneous Markov chains. That is, the transition probabilities do not change over time, which can be characterized by a transition function.

Irreducible Markov chains are those in which every state can be reached from any other state, while periodic Markov chains exhibit a repeating pattern in their states. On the other hand, aperiodic Markov chains do not exhibit any regular pattern in their states. If there is no fixed period at which the probabilities return to the starting values, the chain is classified as aperiodic. Aperiodic Markov chains often display a more complex and unpredictable behavior compared to periodic ones.

Stationarity

Convergence

A Markov Chain is called ergodic, if there exists a finite t such that every state can be reached from every state in exactly t steps.

Fundamental theorem of ergodic Markov chains

Detailed Balance Equation

Ergodic Theorem

The ergodic theorem is the fundamental reason why Markov chain Monte Carlo methods are possible. In practice, one observes that Markov chain Monte Carlo methods have a so-called “burn-in” time during which the distribution of the Markov chain does not yet approximate the posterior distribution well. Typically, the first t0 samples are therefore discarded. It is not clear in general how T and t0 should be chosen such that the estimator is unbiased; rather, they have to be tuned.

Elementary Sampling Methods

Metropolis-Hastings Algorithm

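As an illustration, a minimal random-walk Metropolis-Hastings sampler for an unnormalized target density might look like this (the target p_tilde, the step size and the burn-in length are arbitrary example choices):

```python
import numpy as np

def p_tilde(x):                      # unnormalized target density (example: standard normal)
    return np.exp(-0.5 * x ** 2)

def metropolis_hastings(n_samples=10_000, step=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()                # symmetric (random-walk) proposal
        alpha = min(1.0, p_tilde(proposal) / p_tilde(x))  # acceptance probability
        if rng.uniform() < alpha:
            x = proposal                                  # accept; otherwise keep the old state
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings()
print(samples[1_000:].mean(), samples[1_000:].std())      # discard a burn-in prefix
```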

Gibbs Sampling

A popular example of a Metropolis-Hastings algorithm is Gibbs sampling.

Intuitively, by re-sampling single coordinates according to the posterior distribution given the other coordinates, Gibbs sampling finds states that are successively “more” likely.
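
A toy sketch for a bivariate Gaussian with correlation rho, where both conditionals are known in closed form:

```python
import numpy as np

rho = 0.8                                        # correlation of the target bivariate Gaussian
rng = np.random.default_rng(0)
x, y, samples = 0.0, 0.0, []
for _ in range(10_000):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # re-sample x from p(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # re-sample y from p(y | x)
    samples.append((x, y))

chain = np.array(samples)[1_000:]                # discard a burn-in prefix
print(np.corrcoef(chain.T)[0, 1])                # close to rho
```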

Langevin Dynamics

Gibbs Distributions


bigdata - data model

data model

A data model is an abstract view over the data that hides the way it is stored physically.

The JSON Information Set

The appropriate abstraction for any JSON document is a tree. The nodes of that tree, which are JSON logical values, are naturally of six possible kinds: the six syntactic building blocks of JSON.

These are the four leaves corresponding to atomic values: strings, numbers, Booleans, and nulls, as well as two kinds of intermediate nodes: objects and arrays. These nodes are generally called information items and form the logical building blocks of the model, called the information set.

The XML Information Set

A fundamental difference between JSON trees and XML trees is that for JSON, the labels (object keys) are on the edges connecting an object information item to each one of its children information items. In XML, the labels (these would be element and attribute names) are on the nodes (information items) directly.

In XML, there are many more information items: document information items, elements, attributes, characters, comments, processing instructions, namespaces, unexpanded entity references, DTDs, unparsed entities, and notations. Here we only focus on documents, elements, attributes, and characters.

Validation

Once documents, JSON or XML, have been parsed and logically abstracted as a tree in memory, the natural next step is to check for further structural constraints.

In a relational database, the schema of a table is defined before any data is populated into the table. Thus, the data in the table is guaranteed, at all times, to fulfil all the constraints of the schema. The schema was enforced at write time (schema on write). A collection of JSON and XML documents out there can exist without any schema and contain arbitrary structures. Validation happens “ex post,” that is, only after reading the data (schema on read).

JSON and XML documents undergo two steps: a well-formedness check (attempt to parse the document and construct a tree representation in memory) and, if the first step succeeded, a validation check given a specific schema. Note that, unlike well-formedness, validation is schema-dependent: a given well-formed document can be valid against schema A and invalid against schema B.

Item types

A fundamental aspect of validation is the type system. A well-designed type system, in turn, allows for storing the data in much more efficient, binary formats tailored to the model.

Atomic types

Atomic types correspond to the leaves of a tree data model: these are types that do not contain any further nestedness. The kinds of atomic types available are also relatively standard and common to most technologies. Also, all atomic types have in common that they have a logical value space and a lexical value space.

An atomic type also has a (not necessarily injective) mapping from its lexical value space to its logical value space (e.g., mapping the hexadecimal literal x10 to the mathematical integer sixteen).
Atomic types can be in a subtype relationship: a type is a subtype of another type if its logical value space is a subset of the latter.

Strings

In “pure computer science” textbooks, strings are often presented as structured values rather than as atomic values because of their complexity on the physical layer. However, for us data scientists, strings are atomic values.

Numbers: integers

Integers correspond to finite cardinalities (counting) as well as their negative counterparts. In older programming languages, support for integers used to be bounded. However, in modern databases, it is customary to support unbounded integers. Engines can optimize computations for small integers, but might become less efficient with very large integers.

Numbers: decimals

Decimals correspond to real numbers that can be written as a finite sequence of digits in base 10, with an optional decimal period.

Numbers: floating-point

Support for the entire decimal value space can be costly in performance. In order to address this issue, a floating-point standard (IEEE 754) was invented and is still very popular today.

Floating-point numbers are limited both in precision and magnitude (both upper and lower) in order to fit on 32 bits (float) or 64 bits (double). Floats have about 7 digits of precision and their absolute value can be between roughly 10^−37 and 10^37, while doubles have 15 digits of precision and their absolute value can be between roughly 10^−307 and 10^308.
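
A quick numerical illustration of these limits (using NumPy):

```python
import numpy as np

print(np.float32(16_777_217))        # 16777216.0: beyond ~7 significant digits of precision
print(0.1 + 0.2)                     # 0.30000000000000004: decimals are not represented exactly
print(np.finfo(np.float32).max)      # about 3.4e38
print(np.finfo(np.float64).max)      # about 1.8e308
```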

Booleans

The logical value space for the Boolean type is made of two values: true and false. In NoSQL queries, two-valued logic is typically assumed.

Dates and times

Dates commonly use the Gregorian calendar (with some technologies possibly supporting more), with a year (BC or AD), a month, and a day of the month. Times are expressed in the sexagesimal (base 60) system with hours, minutes, and seconds, where the seconds commonly go all the way to microseconds (six digits after the decimal period). Datetimes are expressed with a year, a month, a day of the month, hours, minutes, and (decimal) seconds.

Timestamp values are typically stored as longs (64-bit integers) expressing the number of milliseconds elapsed since January 1, 1970 by convention.

XML Schema, JSound and JSONiq follow the ISO 8601 standard, where lexical values look like so (with many parts optional): 2022-08-07T14:18:00.123456+02:00.

Durations

The lexical representation can vary, but there is a standard defined by ISO 8601 as well, starting with a P and prefixing the sub-day parts with a T.
For example, 4 days, 3 hours, 2 minutes and 1.123456 seconds is written as P4DT3H2M1.123456S.

Binary data

Binary data is, logically, simply a sequence of bytes. There are two main lexical representations used in data: hexadecimal and base64. Hexadecimal expresses the data with two hexadecimal digits per byte. Base64, formally, does the same but in base 64, which “wastes” less lexical space in the text. It does so by encoding the bits six by six, encoding each sequence of six bits with one base-64 digit.
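
A small illustration of the two lexical representations, using Python's standard library:

```python
import base64

data = bytes([0xCA, 0xFE, 0xBA, 0xBE])        # four bytes of binary data
print(data.hex())                             # cafebabe  (two hex digits per byte)
print(base64.b64encode(data).decode())        # yv66vg==  (one digit per six bits, plus padding)
```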

Null

A schema can either allow, or disallow the null value.
XML also supports null values, but calls them “nil” and expresses them with a special attribute and no content, rather than with a lexical representation.

Structured types

Lists

Lists correspond to JSON arrays and are ordered sequences of (atomic or structured) values.

Records

Records, or structs, correspond to JSON objects and are maps from strings to values.

Maps

Maps (not to be confused with records, which are similar) are maps from any atomic value to any value, i.e., they generalize objects to keys that are not necessarily strings (e.g., numbers, dates, etc.).
With a schema, it is possible to restrict the type of the keys, as well as the type of the values. However, unlike records, the type of the values must be the same for all keys.

Sets

Sets are like lists, but without any specific ordering, and without duplicate values.

XML elements and attributes

XML Schema stands apart from most other technologies and formats, in that it does not offer specific support for records and maps; it offers some limited support for lists, but considers them to be simple types, which are “in-between” atomic types and structured types. In XML Schema, structure is obtained, instead, with elements and attributes, and the machinery for elements and attributes is highly specific to XML.

Type names

Sequence types

Cardinality

Many type systems give options regarding the number of occurrences of items in a sequence.

Collections vs. nested lists

A collection of items is on the outer level, and can be massively large (billions, trillions of items).

A list (or array) of items, however, usually refers to a nested structure, for example an array nested inside a document or object. Such lists of items are usually restricted in size for reasons of performance and scalability.

It is thus important to keep this subtle difference in mind, in particular, do not confuse a collection of integers with a collection that contains a single array of integers.

JSON validation

Validating flat objects

JSound is a schema language that was designed to be simple for 80% of the cases, making it particularly suitable in a teaching environment. It is independent of any programming language. JSON Schema is another technology for validating JSON documents. The available JSON Schema types are string, number, integer, boolean, null, array and object.

An example of a JSON document is:
{ "name" : "Einstein", "first" : "Albert", "age" : 142 }
The corresponding JSound schema and JSON Schema are as follows:
{ "name" : "string", "first" : "string", "age" : "integer" }

{ "type" : "object", "properties" : { "name" : { "type" : "string" }, "first" : { "type" : "string" }, "age" : { "type" : "number" } } }

The type system of JSON Schema is thus less rich than that of JSound, but extra checks can be done with so-called formats, which include date, time, duration, email, and so on including generic regular expressions.
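
For illustration, the document above can be checked against the JSON Schema with the third-party jsonschema package (assumed to be installed):

```python
import jsonschema

document = {"name": "Einstein", "first": "Albert", "age": 142}
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "first": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name", "first", "age"],
}
jsonschema.validate(instance=document, schema=schema)   # raises ValidationError if invalid
print("valid")
```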

Requiring the presence of a key

It is possible to require the presence of a key by adding an exclamation mark in JSound. The equivalent JSON Schema uses a “required” property associated with the list of required keys to express the same.

Open and closed object types

In the JSound compact syntax, extra keys are forbidden. The schema is said to be closed. There are ways to define JSound schemas to allow arbitrary additional keys (open schemas), with a more verbose syntax. Unlike JSound, in JSON Schema, extra properties are allowed by default. JSON Schema then allows to forbid extra properties with the “additionalProperties” property.

Nested structures

{ "numbers" : [ "integer" ] }
Every schema can be given a name, turning it into a type.
JSound allows for the definition not only of arbitrary array and object types, but also user-defined types.

Primary key constraints, allowing for null, default values

There are a few more features available in the compact JSound syntax (not in JSON Schema) with the special characters @, ? and =. The question mark (?) allows for null values. The at sign (@) indicates that one or more fields are primary keys for a list of objects that are members of the same array. The equal sign (=) is used to indicate a default value that is automatically populated if the value is absent.

Note that validation only checks whether lexical values are part of the type’s lexical space.

Accepting any values

Accepting any values in JSound can be done with the type “item”, which contains all possible values. In JSON Schema, in order to declare a field to accept any values, you can use either true or an empty object in lieu of the type. JSON Schema additionally allows to use false to forbid a field.

Type unions

In JSON Schema, it is also possible to combine validation checks with Boolean combinations. JSound schema allows defining unions of types with the vertical bar inside type strings, like so: “string|array”.

Type conjunction, exclusive or, negation

In JSON Schema only (not in JSound), it is also possible to do a conjunction (logical and), as well as exclusive or (xor), as well as negation.

XML validation

Simple types

All elements in an XML Schema are in a namespace, the XML Schema namespace. It is recommended to stick to the prefix xs, or xsd, which is also quite popular. We do not recommend declaring the XML Schema namespace as a default namespace, because it can create confusion in several respects.

The list of predefined atomic types is the same as in JSound, except that in XML Schema, all these predefined types live in the XML Schema namespace and thus bear the prefix xs as well.

Builtin types

XML Schema allows you to define user-defined atomic types, for example restricting the length of a string to 3 for airport codes, and then use it with an element.

Complex types

It is also possible to constrain structures and the element/attribute/text hierarchy with complex types applying to element nodes.
There are four main kinds of complex types:
• complex content: there can be nested elements, but there can be no text nodes as direct children.
• simple content: there are no nested elements, just text, but attributes are also possible.
• empty content: there are neither nested elements nor text, but attributes are also possible.
• mixed content: there can be nested elements, and they can be intermixed with text as well.

Attribute declarations

Finally, all types of content can additionally contain attributes. Attributes always have a simple type.

Anonymous types

Finally, it is not mandatory to give a name to all types. Be careful: if there is neither a type attribute nor a nested type declaration, then anything is allowed!

Miscellaneous

Finally, XML Schema documents are themselves XML documents, and can thus be validated against a schema of schemas, itself written as an XML Schema. This schema has the wonderful property of being valid against itself.

Data frames

Heterogeneous, nested datasets

The beauty of the JSON data model is that, unlike the relational model and the CSV syntax, it supports nested, heterogeneous datasets, while also supporting as a particular case flat, homogeneous datasets.

Dataframe visuals

There is a particular subclass of semi-structured datasets that are very interesting: valid datasets, which are collections of JSON objects valid against a common schema, with some requirements on the considered schemas. The datasets belonging to this particular subclass are called data frames, or dataframes.

Specifically, for the dataset to qualify as a data frame, firstly, we forbid schemas that allow for open object types. Secondly, we forbid schemas in which object or array values are too permissive and allow any values. We do, however, include schemas that allow for null values and/or absent values. Relational tables are data frames, while data frames are not necessarily relational tables. Data frames are a generalization of (normalized) relational tables allowing for (organized and structured) nestedness.

exercises

A complexType cannot contain character data by default, but with mixed="true" it can.

protobuf

Convert JSON-like (nested) data to a columnar representation (why we want this: it makes it more efficient to read only the relevant columns rather than the whole table).

Convert the columnar representation back to the original format, replacing missing fields with NULL. It is a “lossless” conversion.

Dremel

optional: 0 or 1 occurrence. repeated: 0 or more occurrences.

Probabilistic Artificial Intelligence - Variational Inference

Variational Inference

The fundamental idea behind variational inference is to approximate the true posterior distribution using a “simpler” posterior that is as close as possible to the true posterior.

Here, λ represents the parameters of the variational posterior q, also called variational parameters.

Laplace Approximation

A natural idea for approximating the posterior distribution is to use a Gaussian approximation (that is, a second-order Taylor approximation) around the mode of the posterior. Such an approximation is known as a Laplace approximation.

As an example, we will look at Laplace approximation in the context of Bayesian logistic regression. Bayesian logistic regression corresponds to Bayesian linear regression with a Bernoulli likelihood.

Intuitively, the Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often leading to extremely overconfident predictions.

Inference with a Variational

Information Theory

Surprise

The surprise about an event with probability u is defined as $S[u] = -\log u$.

Entropy

The entropy of a distribution p is the average surprise about samples from p. If the entropy of p is large, we are more uncertain about x ∼ p than if the entropy of p were low.

Cross-Entropy

Cross-entropy can also be expressed in terms of the KL-divergence. Quite intuitively, the average surprise in samples from p with respect to the distribution q is given by the inherent uncertainty in p and the additional surprise that is due to us assuming the wrong data distribution q. The “closer” q is to the true data distribution p, the smaller is the additional average surprise.

Variational Families

We can view variational inference more generally as a family of approaches aiming to approximate the true posterior distribution by one that is closest (according to some criterion) among a “simpler” class of distributions.

Kullback-Leibler Divergence

In words, KL(p∥q) measures the additional expected surprise when observing samples from p that is due to assuming the (wrong) distribution q and which is not inherent in the distribution p already. Intuitively, the KL-divergence measures the information loss when approximating p with q.
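
A small numerical sanity check of the relation between entropy, cross-entropy and KL-divergence for discrete distributions (the distributions are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # "true" distribution
q = np.array([0.4, 0.4, 0.2])     # assumed (wrong) distribution

entropy = -(p * np.log(p)).sum()
cross_entropy = -(p * np.log(q)).sum()
kl = (p * np.log(p / q)).sum()
print(np.isclose(cross_entropy, entropy + kl))   # True: H(p, q) = H(p) + KL(p || q)
```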

Forward and reverse KL-divergence

KL(p∥q) is also called the forward (or inclusive) KL-divergence. In contrast, KL(q∥p) is called the reverse (or exclusive) KL-divergence.

It can be seen that the reverse KL-divergence tends to greedily select the mode and underestimate the variance, which, in this case, leads to an overconfident prediction. The forward KL-divergence, in contrast, is more conservative and yields what one could consider the “desired” approximation.

The reverse KL-divergence is typically used in practice. This is for computational reasons. In principle, if the variational family Q is “rich enough”, minimizing the forward and reverse KL-divergences will yield the same result.

There is a nice interpretation of minimizing the forward Kullback-Leibler divergence of the approximation $q_\lambda$ with respect to the true distribution p as being equivalent to maximizing the marginal model likelihood on a sample of infinite size. This interpretation is not useful in the setting where p is a posterior distribution over model parameters θ, as a maximum likelihood estimate requires samples from p, which we cannot obtain in this case.

KL-divergence of Gaussians

Evidence Lower Bound

The evidence lower bound is a quantity whose maximization is equivalent to minimizing the KL-divergence. As its name suggests, the evidence lower bound is a (uniform) lower bound to the log-evidence $\log p(y_{1:n} \mid x_{1:n})$.

Note that this inequality lower bounds the logarithm of an integral by an expectation of a logarithm over some variational distribution q. Hence, the ELBO is a family of lower bounds — one for each variational distribution. Such inequalities are called variational inequalities.

Gradient of Evidence Lower Bound

Obtaining the gradient of the evidence lower bound is more difficult. This is because the expectation integrates over the measure $q_\lambda$, which depends on the variational parameters $\lambda$. Thus, we cannot move the gradient operator inside the expectation.

There are two main techniques which are used to rewrite the gradient in such a way that Monte Carlo sampling becomes possible. The first is to use score gradients. The second is the so-called reparameterization trick.

reparameterization trick

The procedure of approximating the true posterior using a variational posterior by maximizing the evidence lower bound using stochastic optimization is also called black box stochastic variational inference. The only requirement is that we can obtain unbiased gradient estimates from the evidence lower bound (and the likelihood).
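
A minimal sketch of this procedure for a one-dimensional Gaussian variational posterior, assuming PyTorch; log_joint is a hypothetical stand-in for the log-likelihood plus log-prior:

```python
import torch

def log_joint(theta):                    # hypothetical stand-in for log p(y | theta) + log p(theta)
    return -0.5 * (theta - 2.0) ** 2     # here it implies a N(2, 1) "posterior"

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(2000):
    eps = torch.randn(1)
    theta = mu + log_sigma.exp() * eps                  # reparameterization: theta = g(eps; lambda)
    q = torch.distributions.Normal(mu, log_sigma.exp())
    elbo = log_joint(theta) - q.log_prob(theta)         # single-sample Monte Carlo ELBO estimate
    loss = -elbo.mean()
    opt.zero_grad()
    loss.backward()                                     # unbiased gradient estimate of the ELBO
    opt.step()

print(mu.item(), log_sigma.exp().item())                # roughly recovers mean 2 and std 1
```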

bigdata - wide column stores

wide column stores

Wide column stores were invented to provide more control over performance and in particular, in order to achieve high-throughput and low latency for objects ranging from a few bytes to about 10 MB, which are too big and numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and numerous to be efficiently accessed in a distributed file system.

Why wide column stores

A wide column store will be more tightly integrated with the parallel data processing systems.
wide column stores have a richer logical model than the simple key-value model behind object storage. wide column stores also handle very small values (bytes and kBs) well thanks to batch processing.

different from RDBMS

It does not have any data model for values, which are just arrays of bytes; since it efficiently handles values up to 10 MB, the values can be nested data in various formats, which breaks the first normal form; tables do not have a schema; there is no language like SQL, instead the API is on a lower level and more akin to that of a key-value store; tables can be very sparse, allowing for billions of rows and millions of columns at the same time; this is another reason why data stored in HBase is denormalized.

BigTable and HBase

Rationale

HBase is an open-source equivalent to the BigTable as part of the Hadoop ecosystem.
The data model of HBase is based on the realization that joins are expensive, and that they should be avoided or minimized on a cluster architecture.

The second design principle underlying HBase is that it is efficient to store together what is accessed together. In the big picture, this is a flavor of batch processing, one of the overarching principles in Big Data. Batch processing reduces the impact of latency.

Tables and row IDs

From an abstract perspective, HBase can be seen as an enhanced key-value store, in the sense that: a key is compound and involves a row, a column and a version; values can be larger (clobs, blobs), up to around 10 MB; keys are sortable.

A row ID is logically an array of bytes, although there is a library to easily create row ID bytes from specific primitive values. In HBase, the key identifying every cell consists of: the row ID, the column family, the column qualifier, the version.

Column families

The other attributes, called columns, are split into so-called column families. This is a concept that does not exist in relational databases and that allows scaling the number of columns.

Column qualifiers

Columns in HBase have a name (in addition to the column family) called column qualifier, however unlike traditional RDBMS, they do not have a particular type. Column qualifiers are arrays of bytes (rather than strings), and as for row IDs, there is a library to easily create column qualifiers from primitive values. Unlike the values which can be large arrays of bytes (blobs), it is important to keep column families and column qualifiers short, because as we will see, they are repeated a gigantic number of times on the physical layer.

Versioning

HBase generally supports versioning, in the sense that it keeps track of the past versions of the data. As we will see, this is implemented by associating any value with a timestamp, also called version, at which it was created (or deleted). Users can also override timestamps with a value of their choice to have more control over versions.

Logical queries

HBase supports four kinds of low-level queries: get, put, scan and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.
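
For illustration, these four queries could be issued from Python with the third-party happybase client, which talks to HBase through its Thrift gateway; the host, table and column names below are hypothetical:

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift gateway host
table = connection.table("products")                     # hypothetical table

table.put(b"row-42", {b"details:name": b"Kettle"})                   # put
print(table.row(b"row-42"))                                          # get
for key, data in table.scan(row_start=b"row-0", row_stop=b"row-9"):  # scan a range of row IDs
    print(key, data)
table.delete(b"row-42", columns=[b"details:name"])                   # delete
```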

HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.

Physical architecture

Partitioning

A table in HBase is physically partitioned in two ways: on the rows and on the columns. The rows are split in consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded. A partition is called a store and corresponds to the intersection of a region and of a column family.

Network topology

HBase has exactly the same centralized architecture as HDFS. The HMaster and the RegionServers should be understood as processes running on the nodes, rather than the nodes themselves. The HMaster assigns responsibility of each region to one of the RegionServers. There is no need to attribute the responsibility of a region to more than one RegionServer at a time because, as we will see soon, fault tolerance is already handled on the storage level by HDFS. If a region grows too big, for example because of many writes in the same row ID interval, then the region will be automatically split by the responsible RegionServer. If a RegionServer has too many regions compared to other RegionServers, then the HMaster can reassign regions to other RegionServers.

Physical storage

The store is, physically, nothing less than a set of cells. Each cell is thus handled physically as a key-value pair where the key is a (row ID, column family, column qualifier, version) tuple. All the cells within a store are eventually persisted on HDFS, in files that we will call HFiles.

An HFile is, in fact, nothing else than a (boring) flat list of KeyValues, one per cell. What is important is that, in an HFile, all these KeyValues are sorted by key in increasing order, meaning, first sorted by row ID, then by column family (trivially unique for a given store), then by column qualifier, then by version (in decreasing order, recent to old). This means that all versions of a given cell that are in the same HFile are located together, and one of the cells (within this HFile) is the latest. If we zoom in at the bit level, a KeyValue consists of four parts:
• the length of the key in bits (this length is encoded on a constant, known number of bits);
• the length of the value in bits (this length is encoded on a constant, known number of bits);
• the actual key (of variable length);
• the actual value (of variable length).
Why do we not just store the key and the value? Because their lengths can vary, and if we do not know these lengths, it is impossible to tell where the key and the value stop just by looking at the bits.
Alt text
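
To see why the two length fields matter, here is an illustrative sketch (not HBase's exact binary layout) of encoding and decoding a length-prefixed KeyValue; the field sizes and helper names are assumptions for illustration only.

```python
# Length-prefixing lets a reader know where a variable-length key/value stops
# when parsing a flat stream of bytes. This mimics, but does not reproduce,
# the real HFile KeyValue format.
import struct

def encode_keyvalue(key: bytes, value: bytes) -> bytes:
    # two fixed-size length fields, then the variable-size key and value
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_keyvalue(buf: bytes, offset: int = 0):
    key_len, val_len = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + val_len]
    return key, value, start + key_len + val_len   # also return the next offset

encoded = encode_keyvalue(b"row-1/cf:qualifier/42", b"some cell value")
print(decode_keyvalue(encoded))
```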

KeyValues, within an HFile, are organized in blocks. To avoid confusing them with HDFS blocks, we will call them HBlocks. HBlocks have a size of 64 kB, but this size is not strict: if the last KeyValue goes beyond this boundary, then the HBlock is simply longer and stops wherever the last KeyValue stops. The HFile then additionally contains an index of all HBlocks with their key boundaries. This separate index is loaded in memory prior to reading anything from the HFile.
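
The role of this block index can be sketched in a few lines: keep the first key and the byte offset of every HBlock in memory, and use binary search to find the only HBlock that can contain a lookup key. The keys and offsets below are invented for illustration.

```python
# Sketch of an in-memory HBlock index: binary search over the first key of each block.
import bisect

first_keys = [b"row-00", b"row-10", b"row-20", b"row-30"]   # first key of each HBlock
offsets    = [0, 65536, 131072, 196608]                     # byte offset of each HBlock

def find_hblock_offset(key: bytes) -> int:
    i = bisect.bisect_right(first_keys, key) - 1
    return offsets[max(i, 0)]

print(find_hblock_offset(b"row-15"))   # 65536: the key, if present, is in the second HBlock
```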

Log-structured merge trees

When accessing data, HBase generally needs to look everywhere for cells that could potentially be returned: in every HFile, and in memory. As long as there is room in memory, freshly created cells are added in memory. At some point, the memory becomes full (or some other limits are reached). When this happens, all the cells need to be flushed to a brand new HFile. Upon flushing, all cells are written sequentially to a new HFile in ascending key order, HBlock by HBlock, concurrently building the index structure. When cells are added to memory, they are added inside a data structure that maintains them in sorted order (such as a tree map), so that flushing is a simple linear traversal of the tree.

What happens if the machine crashes and we lose everything in memory? We have a so-called write-ahead log for this. Before any fresh cells are written to memory, they are written in sequential order (append) to an HDFS file called the HLog. There is one HLog per RegionServer. A full write-ahead log also triggers a flush of all cells in memory to a new HFile. If there is a problem and the memory is lost, the HLog can be retrieved from HDFS and “played back” in order to repopulate the memory and recreate the sorted tree structure.

After many flushes, the number of HFiles to read from grows and becomes impractical. For this reason, there is an additional process called compaction that takes several HFiles and outputs a single, merged HFile. Since the cells within each HFile are already sorted, this can be done in linear time, as it is essentially the merge part of the merge-sort algorithm. Compaction is not done arbitrarily but follows a regular, logarithmic pattern. When the memory is flushed again, a standard-size HFile is written and the two standard-size HFiles are immediately compacted into a double-size HFile.
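
The linear-time claim is just the merge step of merge sort: since each HFile is already sorted, a single pass over all inputs produces the compacted output. A minimal sketch, with HFiles modeled as plain sorted lists of keys:

```python
# Compaction sketch: merging already-sorted "HFiles" in one linear pass.
import heapq

hfile_1 = [b"row-01", b"row-04", b"row-07"]
hfile_2 = [b"row-02", b"row-04", b"row-09"]

# heapq.merge streams through both inputs once and yields one sorted sequence
compacted = list(heapq.merge(hfile_1, hfile_2))
print(compacted)
```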

Additional design aspects

Bootstrapping lookups

In order to know which RegionServer a client should communicate with to receive cells corresponding to a specific region, there is a main, big lookup table that lists all regions of all tables together with the coordinates of the RegionServer in charge of this region as well as additional metadata.

Caching

In order to improve latency, cells that are normally persisted to HFiles (and thus no longer in memory) can be cached in a separate memory region, with the idea of keeping in the cache those cells that are frequently accessed.

Bloom filters

HBase has a mechanism to avoid looking for cells in every HFile. This mechanism is called a Bloom filter. It is basically a black box that can tell with absolute certainty that a certain key does not belong to an HFile, while it can only say with high probability (but not with certainty) that a key does belong to it.
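
A minimal sketch of the idea (not HBase's implementation): a bit array of size m and k hash functions; membership tests can return false positives but never false negatives.

```python
# Toy Bloom filter: "no" is always correct, "maybe" can be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not in the set"; True means "possibly in the set".
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add(b"row-42")
print(bf.might_contain(b"row-42"), bf.might_contain(b"row-99"))
```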

Data locality and short-circuiting

When a RegionServer flushes cells to a new HFile, a replica of each (HDFS) block of the HFile is written, by the DataNode process living on the same machine as the RegionServer process, to the local disk. This makes accessing the cells in future reads by the RegionServer extremely efficient, because the RegionServer can read the data locally without going through the DataNode process: this is known as short-circuiting in HDFS.

using HBase

After installing HBase, we can use the HBase shell. The shell provides commands such as get and scan.
We can combine scan with filters to query and filter data.

exercises

bloom filter

A perfect hash function should map keys uniformly over the bit array.
A false positive occurs when, for a key that was never inserted, all the hash functions happen to hit bits that are already set to 1, i.e., they collide with bits set by other keys.
In HBase, deletion only physically happens during compaction; until then, only a delete marker is recorded.
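
For completeness, the standard estimate of the false-positive probability (assuming uniform, independent hash functions; the notation m for the number of bits, k for the number of hash functions and n for the number of inserted keys is not from the original notes) is

$$P(\text{FP}) = \left(1 - \left(1 - \tfrac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$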

references

https://ghislainfourny.github.io/big-data-textbook/
https://datakaresolutions.com/hbase-quick-guide-to-key-commands/
https://www.datapotion.io/blog/hbase-shell-column-filters/

Probabilistic Artificial Intelligence - Bayesian Linear Regression

Bayesian Linear Regression

Linear Regression

Given a set of (x, y) pairs, linear regression aims to find a linear model that fits the data optimally.
Given the linear model $y=w^Tx$, we want to find optimal weights $w$.
There are many ways of estimating $w$ from data, the most common being the least squares estimator:

$$\hat{w}_{\mathrm{ls}} = \arg\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = (X^\top X)^{-1} X^\top y$$

A slightly different estimator is used for ridge regression:

$$\hat{w}_{\mathrm{ridge}} = \arg\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y$$

As the formula shows, the squared $l_2$ regularization term $\lambda\|w\|_2^2$ penalizes large $w$ and thus reduces the complexity of the resulting model, so ridge regression is more robust than standard linear regression in the presence of multicollinearity. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly sensitive to small input changes. The regularization of ridge regression reduces this sensitivity by introducing a bias on the weights towards 0.
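
A tiny numpy sketch of this effect, with invented data: two almost identical (collinear) inputs make the ordinary least-squares weights unstable, while the ridge weights remain well-behaved.

```python
# Compare ordinary least squares with ridge regression on nearly collinear inputs.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 1e-3 * rng.normal(size=100)          # almost a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=100)

lam = 1.0
w_ols   = np.linalg.solve(X.T @ X, X.T @ y)                    # (X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # (X^T X + lambda I)^{-1} X^T y
print(w_ols)     # typically large, opposite-signed weights
print(w_ridge)   # small, stable weights
```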

uncertainty

In practice, our data D is merely a sample of the process we are modeling. In these cases, we are looking for models that generalize to unseen data.
Therefore, it is useful to express the uncertainty about our model due to the lack of data. This uncertainty is commonly referred to as the epistemic uncertainty.
Usually, there is another source of uncertainty called aleatoric uncertainty, which originates directly from the process that we are modeling. This uncertainty is the noise in the labels that cannot be explained by the inputs.

Weight-space View

The most immediate and natural Bayesian interpretation of linear regression is to simply impose a prior on the weights $w$.

Assume that the prior $w\sim\mathcal{N}(0,\sigma_p^2I)$ and the likelihood $y_i\mid x_i,w\sim\mathcal{N}(w^\top x_i,\sigma_n^2)$ are both Gaussian; we then get the posterior distribution over the weights as:

$$p(w \mid x_{1:n}, y_{1:n}) \propto p(w)\,\prod_{i=1}^{n} p(y_i \mid x_i, w) \propto \exp\!\left(-\frac{1}{2\sigma_p^2}\|w\|_2^2 - \frac{1}{2\sigma_n^2}\sum_{i=1}^{n}(y_i - w^\top x_i)^2\right)$$

As the exponent above is a quadratic form in $w$, it follows that the posterior distribution is a Gaussian.
This also shows that Gaussians with known variance and linear likelihood are self-conjugate. A distribution is said to be self-conjugate (or a conjugate prior is self-conjugate) if, when used as a prior distribution, it results in a posterior distribution that belongs to the same family of distributions. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood. For general distributions the posterior will not be closed-form. This is a very special property of Gaussians.
We can compute the MAP estimate for the weights,

$$\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; p(w \mid x_{1:n}, y_{1:n}) = \arg\min_{w}\; \sum_{i=1}^{n}(y_i - w^\top x_i)^2 + \frac{\sigma_n^2}{\sigma_p^2}\|w\|_2^2$$

We find that this is simply the MLE loss with an additional $l_2$ regularization term, and that it coincides with the optimization objective of ridge regression with weight decay $\lambda=\frac{\sigma_n^2}{\sigma_p^2}$. Also, recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean. (The mode of a probability distribution is the value where its density reaches its maximum; in the context of the posterior distribution, the mode corresponds to the most probable value of the parameter given the observed data.)
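
The correspondence can be checked numerically. The sketch below (synthetic data, assumed noise and prior scales) computes the Gaussian posterior and verifies that its mean equals the ridge solution with $\lambda=\sigma_n^2/\sigma_p^2$.

```python
# Posterior of Bayesian linear regression vs. ridge regression.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma_n, sigma_p = 0.1, 1.0
y = X @ w_true + sigma_n * rng.normal(size=50)

# posterior covariance and mean for prior N(0, sigma_p^2 I) and Gaussian noise
Sigma_post = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(3) / sigma_p**2)
mu_post = Sigma_post @ (X.T @ y) / sigma_n**2

lam = sigma_n**2 / sigma_p**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.allclose(mu_post, w_ridge))   # True: posterior mean (= MAP) equals the ridge estimate
```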

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which uses $l_1$ regularization. It turns out that lasso can also be viewed as Bayesian learning, using a Laplace prior.

Note that using point estimates like the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode. This is especially harmful when we are unsure about the best model.

Recursive Bayesian Updates

As data arrives online (i.e., in “real-time”), we can obtain the new posterior and use it to replace our prior.
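
A small sketch of this recursion (synthetic data, same Gaussian model as above): processing the data in two batches, with the first posterior reused as the prior for the second batch, yields the same posterior mean as processing everything at once.

```python
# Recursive Bayesian updates for linear regression with Gaussian prior and noise.
import numpy as np

def posterior(X, y, mu0, Sigma0, sigma_n):
    # Gaussian conditioning: new precision = old precision + data precision
    Sigma = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma_n**2)
    mu = Sigma @ (np.linalg.inv(Sigma0) @ mu0 + X.T @ y / sigma_n**2)
    return mu, Sigma

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.5, -1.0]) + 0.1 * rng.normal(size=40)
mu0, Sigma0 = np.zeros(2), np.eye(2)

mu_all, _ = posterior(X, y, mu0, Sigma0, sigma_n=0.1)                 # all data at once
mu_1, Sigma_1 = posterior(X[:20], y[:20], mu0, Sigma0, sigma_n=0.1)   # first batch
mu_2, _ = posterior(X[20:], y[20:], mu_1, Sigma_1, sigma_n=0.1)       # second batch, prior = first posterior
print(np.allclose(mu_all, mu_2))   # True
```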

Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a non-linear transformation $\phi:\mathbb{R}^d\to\mathbb{R}^e$.
In polynomial regression, to learn polynomials of degree $m$ in $d$ input dimensions, we need to apply the nonlinear transformation whose components are all monomials of the inputs up to degree $m$:

$$\phi(x) = \left[1,\; x_1, \ldots, x_d,\; x_1 x_2, \ldots, x_{d-1} x_d,\; x_1^2, \ldots, x_d^m\right]^\top$$

The dimension of the feature space grows exponentially in the degree of the polynomial and the number of input dimensions. Even for relatively small $m$ and $d$, this becomes completely unmanageable.
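
To make the growth concrete: the number of monomials of total degree at most $m$ in $d$ variables is $\binom{d+m}{m}$. A two-line check (the specific values of $d$ and $m$ are arbitrary examples):

```python
# Count polynomial features of degree <= m in d input dimensions.
from math import comb

def num_poly_features(d: int, m: int) -> int:
    return comb(d + m, m)   # stars-and-bars count of monomials of total degree <= m

for d, m in [(5, 3), (20, 3), (100, 5)]:
    print(d, m, num_poly_features(d, m))
```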

Function-space View

Previously, we interpreted Bayesian linear regression as a distribution over the weights $w$ of a linear function $\hat{f}=\boldsymbol{\Phi}w$. We can equivalently consider a distribution directly over the estimated function values. Instead of considering a prior over the weights $w\sim\mathcal{N}(0,\sigma_p^2I)$, we now impose a prior directly on the values of our model at the observations. Using that Gaussians are closed under linear maps, we obtain the equivalent prior:

$$\hat{f} = \boldsymbol{\Phi}w \sim \mathcal{N}\!\left(0,\; \sigma_p^2\,\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\right) = \mathcal{N}(0, K)$$

$K$ is the kernel matrix. The kernel function is
$k(x,x^{\prime})\doteq\sigma_p^2\cdot\phi(x)^\top\phi(x^{\prime})$
The kernel matrix is a covariance matrix, and the kernel function measures the covariance of the function values: $k(x,x^{\prime})=\mathrm{Cov}\big[f(x),f(x^{\prime})\big]$.
Moreover, note that we have reformulated the learning algorithm such that the feature space is now implicit in the choice of kernel, and the kernel is defined by inner products of (nonlinearly transformed) inputs. In other words, the choice of kernel implicitly determines the class of functions that f is sampled from, and encodes our prior beliefs. This is known as the kernel trick.
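
A minimal illustration of the kernel trick (with $\sigma_p=1$ and the classic quadratic feature map, both chosen here purely for illustration): the kernel value $(x^\top x')^2$ equals the inner product of the explicitly transformed features, so the feature vectors never need to be materialized.

```python
# Kernel trick for the quadratic feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # explicit features: 1.0
print((x @ xp) ** 2)      # kernel evaluation:  1.0
```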

reference

[1] A. Krause, “Probabilistic Artificial Intelligence”.

hexo upgrade

hexo

what is hexo

“Hexo is a fast, simple and powerful blog framework. You write posts in Markdown (or other markup languages) and Hexo generates static files with a beautiful theme in seconds.”

how to use hexo

Hexo depends on Git and Node.js. After installing the two, if you are lucky, you can start to use Hexo as the official documentation (https://hexo.io/docs/) describes. Here I will focus on the problems I have met. I accidentally uninstalled Node.js and installed the newest Node.js on my laptop; this caused a series of problems that took me three hours to fix. I will take this opportunity to document the issues I encountered and their solutions, and at the same time clarify the dependencies among these software tools.

problems

Here we assume that we have installed Git and Node.js successfully. We can check this by running

```bash
git --version
node -v
```

The first issue I encountered was when I ran

```bash
hexo deploy
```

I got “TypeError [ERR_INVALID_ARG_TYPE]: The “mode” argument must be integer”.

As https://github.com/hexojs/hexo/issues/4281 suggests, I upgraded Hexo, but it didn’t work, and there was another problem: after running

```bash
npm install -g hexo-cli
npm install -g hexo
```

Hexo was still the older version. After I modified package.json under the Hexo path and used a command like the following, it upgraded:

```bash
npm i hexo@5.2.0
```

Unfortunately, even after I upgraded Hexo to the newest version, I still got the error, so I decided to downgrade Node.js. This is when I learned about nvm, a command-line tool that allows you to easily manage multiple versions of Node.js on a single machine. With nvm, you can install, switch between, and manage different Node.js versions for your development projects. Remember to uninstall the existing Node.js first; I didn’t, so when I used nvm to install an older version, I still got the previously installed one.
After finally getting the older version, I got “YAMLException: Specified list of YAML types (or a single Type object) contains a non-Type object.”. Then, as https://github.com/hexojs/hexo/issues/4917 suggests, I downgraded Node.js further and installed yaml-js.

Running Hexo again, I got “TypeError: Cannot read property ‘length’ of undefined” and “Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath ‘./lib/js-yaml/type’ is not defined by “exports””. I then upgraded the theme I use and ran npm install, which fixed part of the problem, but there were still some issues with the theme, so I reinstalled it from https://github.com/ppoffice/hexo-theme-icarus. As the theme upgrades, more dependencies are needed; just install them as the error messages indicate.

reference

https://hexo.io/docs/
https://github.com/ppoffice/hexo-theme-icarus