bigdata - Cube Data

On-Line Analytic Processing (OLAP)

On-Line Analytic Processing generally involves highly complex queries that use one or more aggregations.

OLAP and Data Warehouses

Data from many separate databases may be integrated into the warehouse. The warehouse is usually only updated overnight. Data warehouses play an important role in OLAP applications. First, the warehouse is necessary to organize and centralize data in a way that supports OLAP queries. Second, OLAP queries are usually complex, touch much of the data, and would take too long to execute in a transaction-processing system with high throughput requirements.

A Multidimensional View of OLAP Data

In typical OLAP applications there is a central relation or collection of data called the fact table. Often, it helps to think of the objects in the fact table as arranged in a multidimensional space. Two broad directions have been taken by specialized systems that support cube-structured data for OLAP: ROLAP and MOLAP. ROLAP is Relational OLAP: data is stored in relations with a specialized structure called a "star schema". MOLAP is Multidimensional OLAP: a specialized structure, the "data cube", is used to hold the data, including its aggregates.

Star Schemas

A star schema consists of the schema for the fact table, which links to several other relations, called “dimension tables”.

Slicing and Dicing

A choice of partition for each dimension "dices" the cube. The result is that the cube is divided into smaller cubes that represent groups of points whose statistics are aggregated by a query that performs this partitioning in its "group by" clause. Through the "where" clause, a query has the option of focusing on particular partitions along one or more dimensions (i.e., on a particular "slice" of the cube).

Data Cubes

The formal data cube precomputes all possible aggregates in a systematic way.

The Cube Operator

Given a fact table F, we can define an augmented table CUBE(F) that adds an additional value, denoted *, to each dimension. The * has the intuitive meaning "any," and it represents aggregation along the dimension in which it appears.

The Cube Operator in SQL

SQL gives us a way to apply the cube operator within queries. If we add the term “WITH CUBE” to a group-by clause, then we get not only the tuple for each group, but also the tuples that represent aggregation along one or more of the dimensions along which we have grouped. These tuples appear in the result with NULL where we have used *.
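
A minimal sketch following the syntax described above (the fact table Sales(model, color, units) and its column names are assumptions for illustration, not from the source):

```sql
-- Hypothetical fact table Sales(model, color, units).
-- WITH CUBE adds aggregate tuples for every subset of the
-- grouped dimensions; NULL plays the role of *.
SELECT model, color, SUM(units) AS total_units
FROM Sales
GROUP BY model, color WITH CUBE;
-- The result contains the (model, color) groups as well as
-- (model, NULL), (NULL, color), and the grand total (NULL, NULL).
```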

A related variant, WITH ROLLUP, only aggregates along prefixes of the listed dimensions. A table SalesRollup built this way would therefore not contain tuples in which an earlier dimension has been aggregated away while a later one has not.

bigdata - Graph Database

Why graphs

We already know a way to avoid joins, and we studied it at length: denormalizing the data into (homogeneous) collections of trees is a way of "pre-computing" the joins statically, so that the data is already joined (via nesting) at runtime.

Now, why is this efficient? Because going down a tree only necessitates following pointers in memory. But trees cannot have cycles. Graph databases provide a way of generalizing the use of in-memory pointers for traversing data to the general case in which cycles are present: this is called "index-free adjacency."

Kinds of graph databases

There are many different graph database system products on the market, and they can be classified along several dimensions: labeled property graph model vs. triple stores; read-intensive vs. write-intensive; local vs. distributed; native vs. non-native.

Graph data models

Labeled property graphs

Computer scientists need to go one step further and also design how to store graphs physically. One way of doing so is to create an associative array mapping each node to the list of nodes that it connects to via an edge (adjacency lists). Another storage form is with an adjacency matrix: each row and each column represent a node, and a 0 or a 1 indicate the absence or presence of an edge between the row node and the column node.
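
A minimal sketch of the two storage forms (node names are made up):

```python
# Adjacency lists: map each node to the nodes it has an edge to.
adjacency_lists = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

# Adjacency matrix: rows and columns are nodes ("a", "b", "c");
# a 1 indicates an edge from the row node to the column node.
adjacency_matrix = [
    [0, 1, 1],  # edges leaving "a"
    [0, 0, 1],  # edges leaving "b"
    [0, 0, 0],  # edges leaving "c"
]
```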

Now, this does not quite work for us, because labeled property graphs enhance mathematical graphs with extra ingredients: properties, and labels.
Here is how to "convert" a relational table to a labeled property graph: labels can be seen as table names, nodes as records, and properties as the attribute values of the records. This shows that relational tables can be physically stored as labeled property graphs. Of course, this does not work the other way round: given a graph, it will often not be possible to convert it "back" to a table in this specific way.

Triple stores

Triple stores use a different and simpler model: the graph is viewed as nodes and edges that all have labels, but without any properties. The graph is then represented as a list of edges, where each edge is a triple with the label of the origin node (called the subject), the label of the edge (called the property), and the label of the destination node (called the object).

Triple stores typically provide SPARQL capabilities to query and reason about stored RDF data.

Querying graph data

We will now have a look at query languages for querying graphs, with a focus on Cypher, which is neo4j’s query language.

Cypher Philosophy

Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Colloquially, we ask the database to “find things like this.” And the way we describe what “things like this” look like is to draw them, using ASCII art.

Like most query languages, Cypher is composed of clauses. The simplest queries consist of a MATCH clause followed by a RETURN clause. There are other clauses we can use in a Cypher query: WHERE, WITH...AS..., CREATE, MERGE, DELETE, SET, UNION, FOREACH, and so on.
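
A minimal sketch (the Person label, name property, and KNOWS relationship are assumptions):

```cypher
// Find the people that Alice knows, drawn as ASCII art.
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person)
RETURN b.name
```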

A Comparison of Relational and Graph Modeling

Relational databases—with their rigid schemas and complex modeling characteristics—are not an especially good tool for supporting rapid change. What we need is a model that is closely aligned with the domain, but that doesn’t sacrifice performance, and that supports evolution while maintaining the integrity of the data as it undergoes rapid change and growth. That model is the graph model.

creating a graph

In practice, we tend to use CREATE when we’re adding to the graph and don’t mind duplication, and MERGE when duplication is not permitted by the domain.
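
A sketch contrasting the two (label and property are made up):

```cypher
// CREATE always adds a new node, even if an identical one exists.
CREATE (:Person {name: 'Alice'})

// MERGE first tries to match the pattern and only creates it
// if it is absent, so no duplicate is introduced.
MERGE (:Person {name: 'Alice'})
```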

Beginning a Query

In Cypher we always begin our queries from one or more well-known starting points in the graph—what are called bound nodes. Cypher uses any labels and property predicates supplied in the MATCH and WHERE clauses, together with metadata supplied by indexes and constraints, to find the starting points that anchor our graph patterns.

INDEXES AND CONSTRAINTS

To support efficient node lookup, Cypher allows us to create indexes per label and property combinations. For unique property values we can also specify constraints that assure uniqueness.
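
A sketch in the Neo4j syntax of the era of the cited book (recent Neo4j versions use CREATE INDEX FOR (p:Person) ON (p.name) and CREATE CONSTRAINT ... REQUIRE instead):

```cypher
// Index on the name property of nodes labeled Person.
CREATE INDEX ON :Person(name)

// Uniqueness constraint on the email property.
CREATE CONSTRAINT ON (p:Person) ASSERT p.email IS UNIQUE
```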

Neo4j

Neo4j is a graph database with native processing capabilities as well as native graph storage.

Native Graph Processing

Of the many different engine architectures, we say that a graph database has native processing capabilities if it exhibits a property called index-free adjacency.

A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node, therefore, acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched. A nonnative graph database engine, in contrast, uses (global) indexes to link nodes together.

Proponents for native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals. To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

Native Graph Storage

Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., there are separate stores for nodes, relationships, labels, and properties). The division of storage responsibilities—particularly the separation of graph structure from property data—facilitates performant graph traversals, even though it means the user’s view of their graph and the actual records on disk are structurally dissimilar.

Like most of the Neo4j store files, the node store is a fixed-size record store, where each record is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file. If we have a node with id 100, then we know its record begins 900 bytes into the file. Based on this format, the database can directly compute a record’s location, at cost O(1), rather than performing a search, which would be cost O(log n).

Transactions

Transactions in Neo4j are semantically identical to traditional database transactions. Writes occur within a transaction context, with write locks being taken for consistency purposes on any nodes and relationships involved in the transaction. On successful completion of the transaction, changes are flushed to disk for durability, and the write locks released. These actions maintain the atomicity guarantees of the transaction.

CORE API, TRAVERSAL FRAMEWORK, OR CYPHER?

The Core API allows developers to fine-tune their queries so that they exhibit high affinity with the underlying graph. A well-written Core API query is often faster than any other approach. The downside is that such queries can be verbose, requiring considerable developer effort. Moreover, their high affinity with the underlying graph makes them tightly coupled to its structure. When the graph structure changes, they can often break. Cypher can be more tolerant of structural changes—things such as variable-length paths help mitigate variation and change.

The Traversal Framework is both more loosely coupled than the Core API (because it allows the developer to declare informational goals), and less verbose, and as a result a query written using the Traversal Framework typically requires less developer effort than the equivalent written using the Core API. Because it is a general-purpose framework, however, the Traversal Framework tends to perform marginally less well than a well-written Core API query.

exercise

Neo4j's system design differs from MongoDB's with respect to consistency: MongoDB is eventually consistent, whereas Neo4j is strongly consistent and obeys the ACID rules.

Cypher: note that labels and properties are case-sensitive, but clauses are not.

RDF: Turtle Syntax
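
A minimal Turtle sketch (prefix and resource names are made up):

```turtle
@prefix ex: <http://example.org/> .

# Each line encodes subject-property-object triples.
ex:Einstein ex:discovered ex:Relativity ;
            ex:bornIn     ex:Ulm .
```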

Manual grouping and grouping sets: one needs to know how to translate between them.
Grouping with rollup: ROLLUP(c1, c2, c3) is equivalent to GROUPING SETS ((c1, c2, c3), (c1, c2), (c1), ()). The ordering is important.
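
A sketch of the translation (table and column names are made up):

```sql
-- These two queries are equivalent:
SELECT c1, c2, c3, SUM(x) AS s
FROM t
GROUP BY ROLLUP (c1, c2, c3);

SELECT c1, c2, c3, SUM(x) AS s
FROM t
GROUP BY GROUPING SETS ((c1, c2, c3), (c1, c2), (c1), ());
```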

Differences between Neo4j and RDF:
Data Model:
Neo4j: Neo4j is a graph database that uses the property graph data model. In this model, nodes represent entities, relationships represent connections between entities, and both nodes and relationships can have key-value pairs as properties.
RDF: RDF is a data model for representing knowledge in the form of subject-predicate-object triples. Each triple represents a statement, and these triples can be used to build a graph of linked data. RDF is more abstract and can be implemented using various storage formats.

Query Language:
Neo4j: Neo4j uses the Cypher query language, which is specifically designed for querying graph databases. Cypher allows users to express graph patterns and relationships in a concise and readable manner.
RDF: RDF data is typically queried using SPARQL (SPARQL Protocol and RDF Query Language). SPARQL is a query language for querying and manipulating RDF data, and it provides powerful capabilities for navigating the graph structure.

Graph Structure:
Neo4j: In Neo4j, the graph is explicit, with nodes, relationships, and properties forming a connected graph structure. The focus is on relationships between entities and their properties.
RDF: RDF represents a graph, but the graph structure is more implicit. Resources are identified by URIs, and relationships are expressed through triples, allowing for the creation of a distributed and linked data web.

Schema:
Neo4j: Neo4j supports a flexible schema where nodes and relationships can have dynamic properties. While it provides some level of schema flexibility, users can define constraints and indexes to enforce certain structures.
RDF: RDF is schema-agnostic, allowing for more dynamic and extensible data representation. Schemas can be defined using RDF vocabularies and ontologies, such as RDFS and OWL.

Use Cases:
Neo4j: Neo4j is often used for applications where relationships and graph structures are central, such as social networks, recommendation engines, and network analysis.
RDF: RDF is commonly used in the context of the Semantic Web for representing and linking diverse data sources, allowing for interoperability and knowledge representation.

references

Robinson, I. et al. (2015). Graph Databases (2nd ed.)

pai - Model-based Approximate Reinforcement Learning

model-based reinforcement learning

We face three main challenges in model-based reinforcement learning. First, given a fixed model, we need to perform planning to decide on which actions to play. Second, we need to learn models f and r accurately and efficiently. Third, we need to effectively trade exploration and exploitation.

Planning

Deterministic Dynamics

To begin with, let us assume that our dynamics model is deterministic and known. That is, given a state-action pair, we know the subsequent state.

pai - Model-free Approximate Reinforcement Learning

Model-free Approximate Reinforcement Learning

Tabular Reinforcement Learning as Optimization

In particular, in the tabular setting (i.e., over a discrete domain), we can parameterize the value function exactly by learning a separate parameter for each state.

Now, we cannot compute this derivative because we cannot compute the expectation. Firstly, the expectation is over the true value function, which is unknown to us. Secondly, the expectation is over the transition model, which we are trying to avoid in model-free methods. To resolve the first issue, analogously to TD-learning, instead of learning the true value function vπ, which is unknown, we learn the bootstrapping estimate Vπ. To resolve the second issue, we use a Monte Carlo estimate based on a single sample. Recall that this is only possible because the transitions are conditionally independent given the state-action pair.

Therefore, TD-learning is essentially performing stochastic gradient descent using the TD-error as an unbiased gradient estimate.
Stochastic gradient descent with a bootstrapping estimate is also called stochastic semi-gradient descent.
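
A minimal semi-gradient TD(0) step for a linear value function V(x) = w · φ(x) (the names and the linear parameterization are illustrative, not the source's notation):

```python
import numpy as np

def td_semi_gradient_step(w, phi, r, phi_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update for V(x) = w @ phi(x).

    The bootstrapped target r + gamma * V(x') is treated as a constant
    (no gradient flows through it), hence "semi"-gradient.
    """
    target = r + gamma * (w @ phi_next)   # bootstrapping estimate
    td_error = target - (w @ phi)
    return w + alpha * td_error * phi     # gradient of V w.r.t. w is phi
```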

Value Function Approximation

Our goal for large state-action spaces is to exploit the smoothness properties of the value function to condense the representation.

Heuristics

The vanilla stochastic semi-gradient descent is very slow.
There are mainly two problems.
As we are trying to learn an approximate value function that depends on the bootstrapping estimate, the optimization target is "moving" between iterations. In practice, moving targets lead to stability issues. One technique aiming to "stabilize" the optimization targets is called neural fitted Q-iteration or deep Q-networks (DQN). DQN updates the neural network used for the approximate bootstrapping estimate infrequently to maintain a constant optimization target across multiple episodes. One approach is to clone the neural network: we maintain one changing neural network (the "online network"), parameterized by θ, for the most recent estimate of the Q-function, and one fixed neural network (the "target network"), parameterized by θold, which is used as the target and updated infrequently.
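
A rough sketch of the cloning idea (the class, names, and update period are illustrative):

```python
import copy

class QNetwork:
    """Stand-in for a neural network; only holds parameters here."""
    def __init__(self):
        self.theta = [0.0] * 8

online = QNetwork()               # "online network", updated every step
target = copy.deepcopy(online)    # "target network" with parameters theta_old

UPDATE_EVERY = 1000
for step in range(10_000):
    # ... compute targets r + gamma * max_a Q(x', a) with `target`,
    # then take a gradient step on `online` toward those targets ...
    if step % UPDATE_EVERY == 0:
        target = copy.deepcopy(online)   # infrequent hard update
```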

The second problem is that consecutive samples are strongly correlated; this is addressed by storing past transitions in a buffer and training on batches sampled from it. This technique is known as experience replay. Another approach to stabilizing the target is Polyak averaging, where the target network is gradually "nudged" by the neural network used to estimate the Q-function.
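
In symbols (with a small step size ρ; notation assumed): after each update of the online parameters θ, the target parameters are moved slightly:

$$\theta_{\text{old}} \leftarrow (1 - \rho)\,\theta_{\text{old}} + \rho\,\theta$$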

Now, observe that the estimates Q⋆ are noisy estimates of q⋆. The fact that the update rules can be affected by inaccuracies (i.e., noise in the estimates) of the learned Q-function is known as the “maximization bias”. Double DQN (DDQN) is an algorithm that addresses this maximization bias. Instead of picking the optimal action with respect to the old network, it picks the optimal action with respect to the new network. Intuitively, this change ensures that the evaluation of the target network is consistent with the updated Q-function, which makes it more difficult for the algorithm to be affected by noise.
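
In commonly used notation (an assumption, since the source's equation was an image), the DDQN target selects the action with the online network θ and evaluates it with the target network θold:

$$y = r + \gamma\, Q_{\theta_{\text{old}}}\big(x',\, \arg\max_{a'} Q_{\theta}(x', a')\big)$$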

Policy Approximation

Methods that find an approximate policy are also called policy search methods or policy gradient methods. Policy gradient methods use randomized policies for encouraging exploration.

Estimating Policy Values

The policy value function measures the expected discounted payoff of policy π.
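
In standard notation (assumed, as the source's formula was an image):

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$$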

Reinforce Gradient

In this context, however, we cannot apply the reparameterization trick. Fortunately, there is another way of estimating this gradient.
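
That other way is the score-function (REINFORCE) estimator; in commonly used notation (assumed), with G_t the discounted return from time t:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid x_t)\, G_t\right]$$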

When using a neural network for the parameterization of the policy π, we can use automatic differentiation to obtain unbiased gradient estimates. However, it turns out that the variance of these estimates is very large. Using so-called baselines can reduce the variance dramatically.

The baseline essentially captures the expected or average value, providing a reference point. By subtracting this reference point, the updates become more focused on the deviations from the expected values, which can reduce the variance in these deviations.
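
With a baseline b(x_t) (notation assumed), the estimator becomes:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid x_t)\, \big(G_t - b(x_t)\big)\right]$$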

Typically, policy gradient methods are slow due to the large variance in the score gradient estimates. Because of this, they need to take small steps and require many rollouts of a Markov chain. Moreover, we cannot reuse data from previous rollouts, as policy gradient methods are fundamentally on-policy.

Actor-Critic Methods

Actor-Critic methods reduce the variance of policy gradient estimates by using ideas from value function approximation. They use function approximation both to approximate value functions and to approximate policies.

Advantage Function

Intuitively, the advantage function is a shifted version of the state-action value function q that is relative to 0. It turns out that using this quantity instead has numerical advantages.
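
In commonly used notation (assumed):

$$A^{\pi}(x, a) = Q^{\pi}(x, a) - V^{\pi}(x)$$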

Policy Gradient Theorem

Intuitively, ρθ(x) measures how often we visit state x when following policy πθ. It can be thought of as a “discounted frequency”. Importantly, ρθ is not a probability distribution, as it is not normalized to integrate to 1. Instead, ρθ is what is often called a finite measure. Therefore, eq. (12.57) is not a real expectation!

On-policy Actor-Critics

OAC

Due to the use of TD-learning for learning the critic, this algorithm is fundamentally on-policy.

A2C

Note that the Q-function is an absolute quantity, whereas the advantage function is a relative quantity, where the sign is informative for the gradient direction. Intuitively, an absolute value is harder to estimate than the sign. Actor-Critic methods are therefore often implemented with respect to the advantage function rather than the Q-function.

GAE/GAAE

Taking a step back, observe that the policy gradient methods such as REINFORCE generally have high variance in their gradient estimates. However, due to using Monte Carlo estimates of Gt, the gradient estimates are unbiased. In contrast, using a bootstrapped Q-function to obtain gradient estimates yields estimates with a smaller variance, but those estimates are biased. We are therefore faced with a bias-variance tradeoff. A natural approach is therefore to blend both gradient estimates to allow for effectively trading bias and variance. This leads to algorithms such as generalized advantage estimation (GAE/GAAE).

Improving sample efficiency (TRPO/PPO)

Actor-Critic methods generally suffer from low sample efficiency. One well-known variant that slightly improves the sample efficiency is trust-region policy optimization (TRPO).

Intuitively, taking the expectation with respect to the previous policy πθk means that we can reuse data from rollouts within the same iteration. TRPO allows reusing past data as long as it can still be "trusted". This makes TRPO "somewhat" off-policy. Fundamentally, though, TRPO is still an on-policy method. Proximal policy optimization (PPO) is a heuristic variant of TRPO that often works well in practice.

Off-policy Actor-Critics

These algorithms use reparameterization gradient estimates instead of score gradient estimators.

DDPG

As our method is off-policy, a simple idea in continuous action spaces is to add Gaussian noise to the action selected by πθ — also known as Gaussian noise “dithering”. This corresponds to an algorithm called deep deterministic policy gradients.
This algorithm is essentially equivalent to Q-learning with function approximation (e.g., DQN), with the only exception that we replace the maximization over actions with the learned policy πθ.
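
A sketch of the action-selection step with Gaussian dithering (function and parameter names are made up):

```python
import numpy as np

def act(policy, x, sigma=0.1, low=-1.0, high=1.0):
    """Deterministic action pi_theta(x) plus exploration noise."""
    a = np.asarray(policy(x))
    a = a + np.random.normal(0.0, sigma, size=a.shape)  # Gaussian dithering
    return np.clip(a, low, high)  # keep the action within valid bounds
```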

Twin delayed DDPG (TD3) is an extension of DDPG that uses two separate critic networks for predicting the maximum action and evaluating the policy. This addresses the maximization bias akin to Double-DQN. TD3 also applies delayed updates to the actor network, which increases stability.

Off-Policy Actor Critics with Randomized Policies

The algorithm that uses eq. (12.81) to obtain gradients for the critic and reparameterization gradients for the actor is called stochastic value gradients (SVG).

Off-policy Actor-Critics with Entropy Regularization

In practice, algorithms like SVG often do not explore enough. A key issue with relying on randomized policies for exploration is that they might collapse to deterministic policies.

A simple trick that encourages a little bit of extra exploration is to regularize the randomized policies “away” from putting all mass on a single action. This approach is known as entropy regularization and it leads to an analogue of Markov decision processes called entropy-regularized Markov decision process, where suitably defined regularized state-action value functions (so-called soft value functions) are used.
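
In commonly used notation (assumed), with λ > 0 trading reward against the policy entropy H, the entropy-regularized objective reads:

$$J_{\lambda}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\big(r_t + \lambda\, \mathrm{H}(\pi(\cdot \mid x_t))\big)\right]$$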

soft actor critic (SAC)

references

https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665
https://github.com/tsmatz/reinforcement-learning-tutorials/blob/master/06-sac.ipynb
https://spinningup.openai.com/en/latest/algorithms/sac.html
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

bigdata - JSONiq

Querying denormalized data

imperative vs declarative

Imperative programming is a programming paradigm that expresses computation as a series of statements that change a program’s state. In imperative programming, the focus is on describing how a program operates step by step. Common imperative languages include C, C++, Java, and Python.

In contrast to imperative programming, declarative programming focuses on describing what the program should accomplish rather than how to achieve it. In a declarative host language, the emphasis is on specifying the desired outcome or properties, and the language itself takes care of the underlying implementation details. Examples include SQL and HTML.

motivation

Denormalized data

It is characterized by two features: nestedness and heterogeneity. In fact, denormalized datasets should not be seen as "broken tables pushed to their limits", but rather as collections of trees.

Features of a query language

A query language for datasets has three main features: Declarative, Functional, and Set-based. First, it is declarative. This means that the users do not focus on how the query is computed, but on what it should return. Being functional means that the query language is made of composable expressions that nest with each other, like a Lego game. Finally, it is set-based, in the sense that the values taken and returned by expressions are not only single values (scalars), but are large sequences of items (in the case of SQL, an item is a row).

Query languages for denormalized data

For denormalized data though, sadly, the number of languages keeps increasing: the oldest ones being XQuery, JSONiq, but then now also JMESPath, SpahQL, JSON Query, PartiQL, UnQL, N1QL, ObjectPath, JSONPath, ArangoDB Query Language (AQL), SQL++, GraphQL, MRQL, Asterix Query Language (AQL), RQL.

JSONiq as a data calculator

It can perform arithmetic, but also comparison and logic. It is, however, more powerful than a common calculator and supports more complex constructs, for example variable binding.
It also supports all JSON values. Any copy-pasted JSON value literally returns itself.
And, unlike a calculator, it can access storage.
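
A few one-liners illustrating this (results in comments):

```jsoniq
1 + 1                       (: returns 2 :)
let $x := 2 return $x * 3   (: variable binding; returns 6 :)
{ "foo" : [ 1, 2, 3 ] }     (: any JSON value is a valid query :)
```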

The JSONiq Data Model

Every expression of the JSONiq "data calculator" returns a sequence of items. An item can be either an object, an array, an atomic item, or a function item. A sequence can also be empty. Caution: the empty sequence is not logically the same as a null item.

Navigation

The general idea of navigation is that it is possible to “dive” into the nested data with dots and square brackets (originally, these were slashes with XPath) – all in parallel: starting with an original collection of objects (or, possibly, a single document), each step (i.e., for each dot and each pair of square brackets) goes down the nested structure and returns another sequence of nested items.

Object lookups (dot syntax)

json-doc("file.json").o

Array unboxing (empty square bracket syntax)

We can unbox the array, meaning extract its members as a sequence of seven object items, with empty square brackets: json-doc("file.json").o[].

Parallel navigation

The dot syntax, in fact, works on sequences, too and will extract the value associated with a key in every object of the sequence (anything else than an object is ignored and thrown away).

Filtering with predicates (simple square bracket syntax)

It is possible to filter any sequence with a predicate in simple square brackets, for example json-doc("file.json").o[][$$.c = 3], where $$ denotes the current item being tested. It is also possible to access the item at position n in a sequence with this same notation: json-doc("file.json").o[].a.b[][5].

Array lookup (double square bracket syntax)

To access the n-th member of an array, you can use double square brackets, like so:
json-doc("file.json").o[[2]].a.

A common pitfall: Array lookup vs. Sequence predicates

Do not confuse sequence positions (single square brackets) with array positions (double square brackets)!
([1, 2], [3, 4])[2] -> [ 3, 4 ]
([1, 2], [3, 4])[[2]] -> 2 4

Schema discovery

Collections

Many datasets are in fact found in the form of large collections of smaller objects (as in document stores). Such collections are accessed with a function call together with a name or (if reading from a data lake) a path.

Getting all top-level keys

The keys function retrieves all keys. It can be called on the entire sequence of objects and will return all unique keys found (at the top level) in that collection.

Getting unique values associated with a key

With distinct-values, it is then possible to eliminate duplicates and look at unique values:
distinct-values(collection("https://www.rumbledb.org/samples/git-archive.jsonl").type)

Aggregations

Aggregations can be made on entire sequences with a single function call. The five basic functions are count, sum, avg, min, max. For example:
count(distinct-values(collection("https://www.rumbledb.org/samples/git-archive.jsonl").type))

Construction

Construction of atomic values

Atomic values that are core to JSON can be constructed with exactly the same syntax as JSON.

Construction of objects and arrays

In fact, one can copy-paste any JSON value, and it will always be recognized as a valid JSONiq query returning that value.

Construction of sequences

Sequences can be constructed (and concatenated) using commas. Increasing sequences of integers can also be built with the to keyword; for example, 1 to 5 returns 1 2 3 4 5.

Scalar expressions

Sequences of items can have any number of items. A few JSONiq expressions (arithmetic, logic, value comparison, ...) work on the particular case where a sequence has zero or one items.

Arithmetic

JSONiq supports basic arithmetic: addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulo (mod). If the data types are different, then conversions are made automatically. The empty sequence enjoys special treatment: if one of the sides (or both) is the empty sequence, then the arithmetic expression returns the empty sequence (no error). If one of the two sides is null (and the other side is not the empty sequence), then the arithmetic expression returns null. If one of the sides (or both) is not a number, null, or the empty sequence, then a type error is thrown.

String manipulation

String concatenation is done with a double vertical bar: "foo" || "bar".
Most other string manipulation primitives are available from the rich JSONiq builtin function library:
concat("foo", "bar"), string-join(("foo", "bar", "foobar"), "-"), substring("foobar", 4, 3),
string-length("foobar").

Value comparison

Sequences of one atomic item can be compared with eq (equal), ne (not equal), le (less than or equal), ge (greater than or equal), lt (less than), and gt (greater than).

Logic

JSONiq supports the three basic logic expressions and, or, and not. not has the highest precedence, then and, then or.
JSONiq also supports universal and existential quantification:
every $i in 1 to 10 satisfies $i gt 0, some $i in 1 to 10 satisfies $i gt 5.

If one of the two sides is not a sequence of a single Boolean item, then implicit conversions are made. This mechanism is called the Effective Boolean Value (EBV). For example, an empty sequence, a sequence of one empty string, or a sequence of one zero integer is considered false. A sequence of one non-empty string, a sequence of one non-zero integer, or a sequence starting with an object (or array) is considered true.

General comparison

JSONiq has a shortcut for existential quantification on value comparisons. This is called general comparison.

some $i in (1, 2, 3, 4, 5) satisfies $i eq 1 is equivalent to (1, 2, 3, 4, 5) = 1.

Composability

Data flow

A few expressions give some control over the data flow by picking the output of this or that expression based on the value of another expression. This includes conditional (if-then-else) expressions as well as try-catch expressions.

Binding variables with cascades of let clauses

Variables in JSONiq start with a dollar sign. It is important to understand that this is not a variable assignment that would change the value of a variable. This is only a declarative binding.

FLWOR expressions

One of the most important and powerful features of JSONiq is the FLWOR expression. It corresponds to SELECT-FROM-WHERE queries in SQL, however, it is considerably more expressive and generic than them in several aspects. In JSONiq the clauses can appear in any order with the exception of the first and last clause. JSONiq supports a let clause, which does not exist in SQL.
In SQL, when iterating over multiple tables in the FROM clause, they “do not see each other”. In JSONiq, for clauses (which correspond to FROM clauses in SQL), do see each other, meaning that it is possible to iterate in higher and higher levels of nesting by referring to a previous for variable.
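
A sketch of a FLWOR expression (the collection name and keys are made up):

```jsoniq
for $s in collection("scientists")
let $century := $s.century
where $century eq 20
order by $s.last
return { "name" : $s.last, "century" : $century }
```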

For clauses

It can thus be seen that the for clause is akin to the FROM clause in SQL, and the return is akin to the SELECT clause. Projection in JSONiq can be made with a project() function call, with the keys to keep. It is possible to implement a join with a sequence of two for clauses and a predicate. Note that allowing empty can be used to perform a left outer join, i.e., to account for the case when there are no matching records in the second collection.
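
A join sketch with two for clauses and a predicate in the where clause (collection names and keys are made up):

```jsoniq
for $p in collection("people")
for $c in collection("countries")
where $c.code eq $p.country
return { "name" : $p.name, "country" : $c.name }
```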

Let clauses

A let clause outputs exactly one outgoing tuple for each incoming tuple (think of a map transformation in Spark). Let clauses also allow for joining the two datasets.

Where clauses

Where clauses are used to filter variable bindings (tuples) based on a predicate on these variables. They are the equivalent to a WHERE clause in SQL.

Order by clauses

Order by clauses are used to reorganize the order of the tuples, but without altering them. They are the same as ORDER BY clauses in SQL. It is also possible, like in SQL, to specify an ascending or a descending order. In case of ties between tuples, the order is arbitrary, but it is possible to sort on another variable in case there is a tie on the first one (compound sorting keys). It is possible to control what to do with empty sequences: they can be considered smallest or greatest.

Group by clauses

Group by clauses organize tuples in groups based on matching keys, and then output only one tuple for each group, aggregating other variables (count, sum, max, min…). It is also possible to opt out of aggregating other (non-grouping-key) variables.

Types

Variable types

Since every value in JSONiq is a sequence of items, a sequence type consists of two parts: an item type and a cardinality.
Item types can be any of the builtin atomic types, as well as "object", "array", and the most generic item type, "item".
Cardinality can be one of the following four: any number of items (suffix *), for example object*; one or more items (suffix +), for example array+; zero or one item (suffix ?), for example integer?; exactly one item (no suffix), for example boolean.

Type expressions

An “instance of” expression checks whether a sequences matches a sequence type, and returns true or false. A “cast as” expression casts single items to an expected item type.
A “castable as” expression tests whether a cast would succeed (in which case it returns true) or not (false).
A “treat as” expression checks whether its input sequence matches an expected type (like a type on a variable); if it does, the input sequence is returned unchanged. If not, an error is raised.
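
A few illustrative checks (results in comments):

```jsoniq
"foo" instance of string    (: true :)
"42" cast as integer        (: the integer 42 :)
"foo" castable as integer   (: false :)
```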

Types in user-defined functions

JSONiq supports user-defined functions. Parameter types can be optionally specified, and a return type can also be optionally specified.

Validating against a schema

It is possible to declare a schema, associating it with a user-defined type, and to validate a sequence of items against this user-defined type.

Architecture of a query engine

Static phase

When a query is received by an engine, it is text that needs to be parsed. The output of this is a tree structure called an Abstract Syntax Tree. An Abstract Syntax Tree, even though it already has the structure of a tree, is tightly tied to the original syntax. Thus, it needs to be converted into a more abstract Intermediate Representation called an expression tree. Every node in this tree corresponds to either an expression or a clause in the JSONiq language, making the design modular. At this point, static typing takes place, meaning that the engine infers the static type of each expression, that is, the most specific type possible expected at runtime (but without actually running the program). Engines like RumbleDB perform their optimization round on this Intermediate Representation. Once optimizations have been done, RumbleDB decides the mode with which each expression and clause will be evaluated (locally, sequentially, in parallel, in DataFrames, etc). The resulting expression tree is then converted to a runtime iterator tree; this is the query plan that will actually be evaluated by the engine.

Dynamic phase

During the dynamic phase, the root of the tree is asked to produce a sequence of items, which is to be the final output of the query as a whole. Then, recursively, each node in the tree will ask its children to produce sequences of items (or tuple streams). Each node then combines the sequences of items (or tuple streams) it receives from its children in order to produce its own sequence of items according to its semantics, and pass it to its parent.

Materialization

When a sequence of items is materialized, it means that an actual List (or Array, or Vector), native to the language of implementation (in this case Java) is stored in local memory, filled with the items.

Streaming

When a sequence of items (or tuple stream) is produced and consumed in a streaming fashion, it means that the items (or tuples) are produced and consumed one by one, iteratively. But the whole sequence of items (or tuple stream) is never stored anywhere. The classical pattern for doing so is known as the Volcano iterator architecture.

However, there are two problems with this: first, it can take a lot of time to go through the entire sequence (imagine doing so with billions or trillions of items). Second, there are expressions or clauses that are not compatible with streaming (consider, for example, the group by or order by clause, which cannot be implemented without materializing their full input).

Parallel execution (with RDDs)

When a sequence becomes unreasonably large, RumbleDB switches to a parallel execution, leveraging Spark capabilities: the sequences of items are passed and processed as RDDs of Item objects.

Parallel execution (with DataFrames)

The RDD implementation supports heterogeneous sequences by leveraging the polymorphism of Item objects. However, this is not efficient in the case that Items in the same sequence happen to have a regular structure. Thus, if the Items in a sequence are valid against a specific schema, or even against an array type or an atomic type, the underlying physical storage in memory relies on Spark DataFrames instead of RDDs.

To summarize, homogeneous sequences of the most common types are stored in DataFrames, and RDDs are used in all other cases.

Parallel execution (with Native SQL)

In some cases (more in every release), RumbleDB is able to evaluate the query using only Spark SQL, compiling JSONiq to SQL directly instead of packing Java runtime iterators in UDFs. This leads to faster execution, because UDFs are slower than a native execution in SQL. This is because, to a SQL optimizer, UDFs are opaque and prevent automatic optimizations.

bigdata - MongoDB

Document stores

Can we rebuild a similar system for collections of trees, in the sense that we drop all three constraints: relational integrity, domain integrity, and atomic integrity? Document stores bring us one step in this direction.

Challenges

Schema on read

In a relational database management system, it is not possible to populate a table without having defined its schema first. However, when encountering such denormalized data, in the real world, there is often no schema. In fact, one of the important features of a system that deals with denormalized data is the ability to discover a schema.

Making trees fit in tables

Several XML elements (or, likewise, several JSON objects) can be naturally mapped to a relational table with several rows if the collection is flat and homogeneous, but semi-structured data can generally be nested and heterogeneous. If we map nested and heterogeneous data into a table, such a mapping will at best have to be done for every single dataset, and requires in most cases a schema, whereas we are looking for a generic solution for semistructured data with no a-priori schema information.

Document stores

Document stores provide a native database management system for semi-structured data. Document stores work on collections of records, generalizing the way that relational tables can be seen as collections of rows. It is important to understand that document stores are optimized for the typical use cases of many records of small to medium sizes. Typically, a collection can have millions or billions of documents, while each single document weighs no more than 16 MB (or a size in a similar magnitude). Finally, a collection of documents need not have a schema: it is possible to insert random documents that have dissimilar structures with no problem at all. Most document stores, however, do provide the ability to add a schema. Document stores can generally do selection, projection, aggregation and sorting quite well, but many of them are typically not (yet) optimized for joining collections. In fact, often, their language or API does not offer any joining functionality at all, which pushes the burden to reimplement joins in a host language to the users. This is a serious breach of data independence.

Implementations

There is a very large number of products in the document store space for both JSON and XML, let us mention for example MongoDB, CouchDB, ElasticSearch, Cloudant, existDB, ArangoDB, BaseX, MarkLogic, DocumentDB, CosmosDB, and so on. We will focus, as an example, on MongoDB.

Physical storage

Just like the storage format is optimized for tabular data in a relational database management system, it is optimized for tree-like data in a document store. In MongoDB, the format is a binary version of JSON called BSON. BSON is basically based on a sequence of tokens that efficiently encode the JSON constructs found in a document. The immediate benefit of BSON is that it takes less space in storage than JSON stored as a text file: for example, null, true and false literals need four or five bytes in text format at best, while they can be efficiently encoded as single bytes in BSON. Furthermore, BSON supports additional types that JSON does not have, such as dates.

Querying paradigm (CRUD)

The API of MongoDB, like many document stores, is based on the CRUD paradigm. CRUD means Create, Read, Update, Delete and corresponds to low-level primitives similar to those for HBase. MongoDB supports several host languages to query collections via an API. This includes in particular JavaScript and Python, but many other languages are supported via drivers. We will use JavaScript here because this is the native host language of MongoDB. It is important to note that these APIs are not query languages. MongoDB also provides access to the data via a shell called mongo or, newly, mongosh. This is a simple JavaScript interface wrapped around MongoDB's node.js driver.

Populating a collection

To create a collection, one can simply insert a document in it, and it will be automatically created if it does not exist.

MongoDB automatically adds to every inserted document a special field called "_id", associated with a value called an Object ID, which has a type of its own. Object IDs are convenient for deleting or updating a specific document with no ambiguity.

Querying a collection

Scan a collection

Asking for the contents of an entire collection is done with a simple find() call on the previous object: db.collection.find(). This function does not in fact return the entire collection; rather, it returns some pointer, called a cursor, to the collection; the user can then iterate on the cursor in an imperative fashion in the host language.

Selection

It is possible to perform a selection on a collection by passing a parameter to find() that is a JSON object:
db.collection.find({ "Theory" : "Relativity" })

A disjunction (OR) uses a special MongoDB keyword, prefixed with a dollar sign:
db.collection.find( { "$or" : [ { "Theory" : "Relativity" }, { "Last" : "Einstein" } ] } )

MongoDB offers many other keywords, for example for comparison other than equality:
db.collection.find( { "Publications" : { "$gte" : 100 } } )

Projection

Projections are made with the second parameter of this same find() method, in the form of a JSON object associating all the desired keys in the projection with the value 1: db.scientists.find( { "Theory" : "Relativity" }, { "First" : 1, "Last" : 1 } )

It is also possible to project fields away in the same way with 0s; however, 1s and 0s cannot be mixed in the projection parameter, except in the specific case of projecting away the object ID.

Counting

Counting can be done by chaining a count() method call:
db.scientists.find( { "Theory" : "Relativity" } ).count()

Sorting

Sorting can be done by chaining a sort() method call.
db.scientists.find( { "Theory" : "Relativity" }, { "First" : 1, "Last" : 1 } ).sort( { "First" : 1, "Name" : -1 } )

1 is for ascending order and -1 for descending order. It is also possible to add limits and offsets to paginate results, also by chaining more method calls:
db.scientists.find( { "Theory" : "Relativity" }, { "First" : 1, "Last" : 1 } ).sort( { "First" : 1, "Name" : -1 } ).skip(30).limit(10)

Note that, contrary to intuition, the order of the calls does not matter, as this is really just the creation of a query plan by providing parameters (in any order).

Duplicate elimination

It is possible to obtain all the distinct values for one field with a distinct() call: db.scientists.distinct("Theory")

Querying for heterogeneity

Absent fields

Absent fields can be filtered with: db.scientists.find( { "Theory" : null } )

Filtering for values across types

db.collection.find( { "$or" : [ { "Theory" : "Relativity" }, { "Theory" : 42 }, { "Theory" : null } ] } )

db.scientists.find( { "Theory" : { "$in" : [ "Relativity", 42, null ] } } )

MongoDB is also able to sort on fields that have heterogeneous data types. It does so by first ordering by type in some (arbitrary, but documented) order, and then within each type.

Querying for nestedness

Nestedness in MongoDB is handled in several ad-hoc ways for specific use cases.

Values in nested objects

We saw how to select documents based on values associated with top-level keys. What about values that are not at the top level, but in nested objects?
The first solution that might come to mind is something like this: db.scientists.find({ "Name" : { "First" : "Albert" } }). However, this query will not have the behavior many would expect: instead of finding documents that have a value "Albert" associated with the key "First" in an object itself associated with the top-level key "Name", this query looks for an exact match of the entire object.
In order to include such documents, MongoDB uses a dot syntax: db.scientists.find({ "Name.First" : "Albert" })

Values in nested arrays

MongoDB allows filtering documents based on whether a nested array contains a specific value, like so: db.scientists.find({ "Theories" : "Special relativity" })

Deleting objects from a collection

Objects can be deleted from a collection either one at a time with deleteOne(), or several at a time with deleteMany(): db.scientists.deleteMany( { "century" : "15" } )

Updating objects in a collection

Documents can be updated with updateOne() and updateMany() by providing both a filtering criterion (with the same syntax as the first parameter of find()) and an update to apply. The command looks like so: db.scientists.updateMany( { "Last" : "Einstein" }, { $set : { "Century" : "20" } } )

The granularity of updates is per document: a single document can be updated by at most one query at the same time. However, within the same collection, several different documents can be modified concurrently by different queries in parallel.

Complex pipelines

For grouping and such more complex queries, MongoDB provides an API in the form of aggregation pipelines.
db.scientists.aggregate( { $match : { "Century" : 20 } }, { $group : { "_id" : "$Year", "Count" : { "$sum" : 1 } } }, { $sort : { "Count" : -1 } }, { $limit : 5 } )

Limitations of a document store querying API

Simple use cases are straightforward to handle, however more complex use cases require a lot of additional code in the host language, be it JavaScript or Python. An example is that joins must be taken care of by the end user in the host language: the burden of implementing more complex use cases is pushed to the end user.

Architecture

The architecture of MongoDB follows similar principles to what we covered before: scaling out the hardware to multiple machines, and sharding as well as replicating the data.

Sharding collections

Collections in MongoDB can be sharded. Shards are determined by selecting one or several fields. (Lexicographically-ordered) intervals over these fields then determine the shards. The fields used to shard must be organized in a tree index structure.

Replica sets

A replica set is a set of several nodes running the MongoDB server process. The nodes within the same replica set all have a copy of the same data.

Each shard of each collection is assigned to exactly one replica set. Note that this architecture is not the same as that of HDFS, in which the replicas are spread over the entire cluster with no notion of “walls” between replica sets and no two DataNodes having the exact same block replicas.

Write concerns

When writing (be it delete, update or insert) to a collection, more exactly, to a specific shard of a collection, MongoDB checks that a specific minimum number of nodes (within the replica set that is responsible for the shard) have successfully performed the update.

Motivation

A document store, unlike a data lake, manages the physical data layout. This has a cost: the need to import (ETL) data before it is possible to query it, but this cost comes with a nice benefit: index support, just like relational database management systems.

Hash indices

Hash indices are used to optimize point queries and more generally query that select on a specific value of a field. The general idea is that all the values that a field takes in a specific collection can be hashed to an integer. The value, together with pointers to the corresponding documents, is then placed in a physical array in memory, at the position corresponding to this integer.

Tree indices

Hash indices are great and fast, but have limitations: first, they consume space. The more one wants to avoid hash collisions, the larger the array needs to be. But more importantly, hash indices cannot support range queries. This is because hashes do not preserve the order of the values and distribute them “randomly” in the index array structure.

Range queries are supported with tree indices. Instead of an array, tree indices use some sort of tree structure in which they arrange the possible values of the indexed field, such that the values are ordered when traversing the tree in a depth-first-search manner. More precisely, the structure is called a B+-tree. Unlike a simple binary tree, nodes have a large number of children.

Secondary indices

By default, MongoDB always builds a tree index for the _id field. Users can request to build hash and tree indices for more fields. These indices are called secondary indices.

The command for building a hash index looks like so: db.scientists.createIndex({ "Name.Last" : "hashed" })

And for a tree index (1 means ascending order, -1 would be descending): db.scientists.createIndex({ "Name.Last" : 1 })

When are indices useful

When building indices, it is important to get a feeling for whether a query will be faster or not with this index.

index types

Single Field Indexes

By default, all collections have an index on the _id field. You can add additional indexes to speed up important queries and operations. You can create a single-field index on any field in a document, including: top-level document fields, embedded documents, and fields within embedded documents. When you create an index on an embedded document, only queries that specify the entire embedded document use the index; queries on a specific field within the document do not. In order for a dot-notation query to use an index, you must create an index on the specific embedded field you are querying, not the entire embedded object.

Compound Indexes

Compound indexes collect and sort data from two or more fields in each document in a collection. Data is grouped by the first field in the index and then by each subsequent field.

The order of the indexed fields impacts the effectiveness of a compound index. Compound indexes contain references to documents according to the order of the fields in the index. To create efficient compound indexes, follow the ESR (Equality, Sort, Range) rule. The ESR (Equality, Sort, Range) Rule is to place fields that require exact matches first in your index. Sort follows equality matches because the equality matches reduce the number of documents that need to be sorted. Sorting after the equality matches also allows MongoDB to do a non-blocking sort. “Range” filters scan fields. The scan doesn’t require an exact match, which means range filters are loosely bound to index keys. To improve query efficiency, make the range bounds as tight as possible and use equality matches to limit the number of documents that must be scanned. MongoDB cannot do an index sort on the results of a range filter. Place the range filter after the sort predicate so MongoDB can use a non-blocking index sort.
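
A hedged sketch of the ESR rule (the orders collection and its fields are made up):

```javascript
// Query: equality on status, sort on date, range on amount.
db.orders.find({ status: "A", amount: { $lt: 100 } }).sort({ date: 1 })

// ESR: equality field first, then the sort field, then the range field.
db.orders.createIndex({ status: 1, date: 1, amount: 1 })
```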

Compound indexes cannot support queries where the sort order does not match the index or the reverse of the index.

Index prefixes are the beginning subsets of indexed fields. Compound indexes support queries on all fields included in the index prefix. Index fields are parsed in order; if a query omits an index prefix, it is unable to use any index fields that follow that prefix.

Multikey Indexes

Multikey indexes collect and sort data from fields containing array values. Multikey indexes improve performance for queries on array fields.

In a compound multikey index, each indexed document can have at most one indexed field whose value is an array.

exercise

data model

Embedded Data Models

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models make it possible to update related data in a single atomic write operation.

Normalized Data Models

Normalized data models describe relationships using references between documents.

references

https://www.mongodb.com/docs/manual/core/data-model-design/

pai - Tabular Reinforcement Learning

Tabular Reinforcement Learning

The Reinforcement Learning Problem

Reinforcement learning is concerned with probabilistic planning in unknown environments. In this chapter, we will begin by considering reinforcement learning with small state and action spaces. This setting is often called the tabular setting, as the value functions can be computed exhaustively for all states and stored in a table.

Clearly, the agent needs to trade exploring and learning about the environment with exploiting its knowledge to maximize rewards. In fact, Bayesian optimization can be viewed as reinforcement learning with a fixed state and a continuous action space: In each round, the agent plays an action, aiming to find the action that maximizes the reward. Another key challenge of reinforcement learning is that the observed data is dependent on the played actions.

Trajectories

Crucially, the newly observed states xt+1 and the rewards rt (across multiple transitions) are conditionally independent given the previous states xt and actions at. This independence property is crucial for being able to learn about the underlying Markov decision process. Notably, this implies that we can apply the law of large numbers (1.83) and Hoeffding’s inequality (1.87) to our estimators of both quantities.

The collection of data is commonly classified into two settings. In the episodic setting, the agent performs a sequence of "training" rounds (called episodes). In the beginning of each round, the agent is reset to some initial state. In contrast, in the continuous setting (also called the non-episodic or online setting), the agent learns online.

control

Another important distinction in how data is collected is the distinction between on-policy and off-policy control. As the names suggest, on-policy methods are used when the agent has control over its own actions; in other words, the agent can freely choose to follow any policy. In contrast, off-policy methods can be used even when the agent cannot freely choose its actions. Off-policy methods are therefore able to make use of observational data, and are thus more sample-efficient than on-policy methods. This is crucial, especially in settings where conducting experiments (i.e., collecting new data) is expensive.

On-policy learning algorithms evaluate and improve the same policy that is being used to select actions. Off-policy learning algorithms evaluate and improve a policy that is different from the policy used for action selection.

To understand the difference between on-policy and off-policy learning, one must first understand the difference between the behavior policy (i.e., the sampling policy) and the update policy. The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. The update policy is the policy whose value the agent estimates when updating the Q-function. On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions, and therefore learn the value of the policy carried out by the agent. Off-policy algorithms learn the value of the optimal policy and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policies are the same or different tells us whether the algorithm is on-policy or off-policy.

Model-based Approaches

Approaches to reinforcement learning are largely categorized into two classes. Model-based approaches aim to learn the underlying Markov decision process. In contrast, model-free approaches learn the value function directly.

Learning the Underlying Markov Decision Process

A natural first idea is to use maximum likelihood estimation to approximate the transition and reward functions.

With $N(x' \mid x, a)$ the number of observed transitions from $x$ to $x'$ under action $a$, and $N(x, a)$ the number of visits to the pair $(x, a)$, the estimates are

$$\hat{p}(x' \mid x, a) = \frac{N(x' \mid x, a)}{N(x, a)}, \qquad \hat{r}(x, a) = \frac{1}{N(x, a)} \sum_{t \,:\, x_t = x,\, a_t = a} r_t.$$

ε-greedy Algorithm

The ε-greedy algorithm: at each time step, with probability ε play an action chosen uniformly at random (explore), and with probability 1 − ε play the best action according to the current estimates (exploit).

The key problem of ε-greedy is that it explores the state space in an uninformed manner. In other words, it explores ignoring all past experience. It thus does not eliminate clearly suboptimal actions.
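A minimal sketch of the action-selection step (the table layout and names are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy_action(Q, x, n_actions, eps, rng):
    """Pick an action in state x from a Q-table of shape (n_states, n_actions)."""
    if rng.random() < eps:              # explore: uniform random action
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[x]))         # exploit: greedy w.r.t. current estimates

rng = np.random.default_rng(0)
Q = np.zeros((5, 3))                    # toy table: 5 states, 3 actions
a = epsilon_greedy_action(Q, x=0, n_actions=3, eps=0.1, rng=rng)
```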

Rmax Algorithm

A key principle in effectively trading exploration and exploitation is “optimism in the face of uncertainty”. Let us apply this principle to the reinforcement learning setting. The key idea is to assume that the dynamics and rewards model “work in our favor” until we have learned “good estimates” of the true dynamics and rewards.

The Rmax algorithm implements this principle by adding a “fairy-tale” state: for every state–action pair that has not yet been observed sufficiently often, it assumes the maximal reward $R_{\max}$ and a transition to the fairy-tale state, in which the agent keeps obtaining $R_{\max}$ forever. The agent computes the optimal policy for this optimistic model, follows it, and updates the model estimates as transitions are observed.

How many transitions are “enough”? We can use Hoeffding’s inequality to get a rough idea!

For i.i.d. samples $Z_1, \dots, Z_n$ bounded in $[0, C]$ with mean $\mu$ and empirical mean $\bar{Z}_n$, Hoeffding’s inequality states

$$\mathbb{P}\left( |\bar{Z}_n - \mu| > \varepsilon \right) \le 2 \exp\left( -\frac{2 n \varepsilon^2}{C^2} \right),$$

so on the order of $\frac{C^2}{\varepsilon^2} \log \frac{1}{\delta}$ samples suffice for an $\varepsilon$-accurate estimate with probability at least $1 - \delta$.

challenges

Model-free Approaches

A significant benefit of model-based reinforcement learning is that it is inherently off-policy: any trajectory, regardless of the policy used to obtain it, can be used to improve the model of the underlying Markov decision process. In the model-free setting, this is not necessarily true.

On-policy Value Estimation

The Bellman expectation equation suggests the Monte Carlo approximation

$$v^\pi(x) = r(x, \pi(x)) + \gamma \, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[ v^\pi(x') \right] \approx r + \gamma \, v^\pi(x'),$$

using a single observed transition $(x, \pi(x), r, x')$.

Note that to estimate this expectation we use a single(!) sample. However, there is one significant problem with this approximation: our approximation of $v^\pi$ in turn depends on the (unknown) true value of $v^\pi$. The key idea is to use a bootstrapping estimate of the value function instead. That is, in place of the true value function $v^\pi$, we use a “running estimate” $V^\pi$. In other words, whenever we observe a new transition, we use our previous best estimate of $v^\pi$ to obtain a new estimate $V^\pi$.

Crucially, using a bootstrapping estimate generally results in biased estimates of the value function. Moreover, due to relying on a single sample, the estimates tend to have very large variance.

TD-learning

The variance of the estimate is typically reduced by mixing new estimates of the value function with previous estimates using a learning rate $\alpha_t$. This yields the temporal-difference learning algorithm.

$$V^\pi(x) \leftarrow (1 - \alpha_t) \, V^\pi(x) + \alpha_t \left( r + \gamma \, V^\pi(x') \right)$$

TD-learning is a fundamentally on-policy method. That is, for the estimates $V^\pi$ to converge to the true value function $v^\pi$, the transitions used for the estimation must follow policy $\pi$.

SARSA

SARSA applies the same bootstrapped update to the Q-function, using the observed transition $(x, a, r, x', a')$ (hence the name):

$$Q^\pi(x, a) \leftarrow (1 - \alpha_t) \, Q^\pi(x, a) + \alpha_t \left( r + \gamma \, Q^\pi(x', a') \right)$$

Off-policy Value Estimation

In its adapted form, the update rule draws the next action from the policy $\pi$ being evaluated rather than taking it from the observed trajectory:

$$Q^\pi(x, a) \leftarrow (1 - \alpha_t) \, Q^\pi(x, a) + \alpha_t \left( r + \gamma \, Q^\pi(x', a') \right), \qquad a' \sim \pi(\cdot \mid x').$$
This adapted update rule explicitly chooses the subsequent action a′ according to policy π, whereas SARSA absorbs this choice into the Monte Carlo approximation. The algorithm has convergence guarantees analogous to those of SARSA. Crucially, this algorithm is off-policy. As noted, the key difference to the on-policy TD-learning and SARSA is that our estimate of the Q-function explicitly keeps track of the next-performed action, and it does so for any action in any state.

Q-learning

Q-learning replaces the expectation over the next action with a maximum, directly estimating the optimal Q-function:

$$Q(x, a) \leftarrow (1 - \alpha_t) \, Q(x, a) + \alpha_t \left( r + \gamma \max_{a'} Q(x', a') \right)$$
Crucially, the Monte Carlo approximation of eq. (11.21) does not depend on the policy. Thus, Q-learning is an off-policy method.
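As a minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    """One Q-learning update on a Q-table of shape (n_states, n_actions).

    Off-policy: the target takes the max over next actions, independent of
    the behavior policy that generated the transition (x, a, r, x_next).
    """
    target = r + gamma * Q[x_next].max()
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```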

Optimistic Q-learning

Optimistic Q-learning applies the optimism principle in the model-free setting: the Q-function is initialized optimistically (to the largest attainable value), so that actions which have not yet been explored sufficiently often remain attractive and are eventually tried.

Challenges

We have seen that both the model-based Rmax algorithm and the model-free Q-learning take time polynomial in the number of states |X| and the number of actions |A| to converge. While this is acceptable in small grid worlds, it is completely unacceptable for large state and action spaces.

references

https://core-robotics.gatech.edu/2022/02/28/bootcamp-summer-2020-week-4-on-policy-vs-off-policy-reinforcement-learning/

pai - Markov Decision Processes

Markov Decision Processes

Planning deals with the problem of deciding which action an agent should play in a (stochastic) environment. (An environment is stochastic, as opposed to deterministic, when the outcome of actions is random.) A key formalism for probabilistic planning in known environments is the Markov decision process.
A (finite) Markov decision process is specified by a set of states $X$, a set of actions $A$, transition probabilities $p(x' \mid x, a)$, and a reward function $r(x, a)$; the Markov property states that the next state depends only on the current state and action.

Our fundamental objective is to learn how the agent should behave to optimize its reward. In other words, given its current state, the agent should decide (optimally) on the action to play. Such a decision map — whether optimal or not — is called a policy.

Formally, a stationary, deterministic policy is a map $\pi : X \to A$ assigning to each state the action $a_t = \pi(x_t)$ to play; more generally, a randomized policy specifies a distribution $\pi(a \mid x)$ over actions for each state.

For the purpose of our discussion of Markov decision processes and reinforcement learning, we will focus on a very common reward called discounted payoff.

Starting at time $t$, the discounted payoff with discount factor $\gamma \in [0, 1)$ is

$$G_t = \sum_{m=0}^{\infty} \gamma^m \, R_{t+m}.$$

Because we assumed stationary dynamics, rewards, and policies, the discounted payoff starting from a given state x will be independent of the start time t.

Bellman Expectation Equation

Let us now see how we can compute the value function:

For a deterministic policy $\pi$,

$$v^\pi(x) = r(x, \pi(x)) + \gamma \sum_{x'} p(x' \mid x, \pi(x)) \, v^\pi(x').$$

This equation is known as the Bellman expectation equation, and it shows a recursive dependence of the value function on itself. The intuition is clear: the value of the current state corresponds to the reward from the next action plus the discounted sum of all future rewards obtained from the subsequent states.

Policy Evaluation

Bellman’s expectation equation tells us how we can find the value function $v^\pi$ of a fixed policy $\pi$ using a system of linear equations.

In matrix notation, with $r^\pi$ the vector of rewards and $P^\pi$ the transition matrix under $\pi$,

$$v^\pi = r^\pi + \gamma P^\pi v^\pi \quad \Longleftrightarrow \quad v^\pi = (I - \gamma P^\pi)^{-1} r^\pi,$$

which can be solved exactly in $O(|X|^3)$ time.

Fixed-point Iteration

Instead of solving the linear system exactly, we can iterate the Bellman expectation operator, which is a contraction for $\gamma < 1$ and hence converges to the unique fixed point $v^\pi$:

$$v_{k+1} = r^\pi + \gamma P^\pi v_k.$$

Policy Optimization

Greedy Policies

Given any value function estimate $v$, the greedy policy picks in each state the action that maximizes the corresponding state-action value:

$$\pi_v(x) = \operatorname*{arg\,max}_{a} \; r(x, a) + \gamma \sum_{x'} p(x' \mid x, a) \, v(x').$$

Bellman Optimality Equation

$$v^*(x) = \max_{a} \left[ r(x, a) + \gamma \sum_{x'} p(x' \mid x, a) \, v^*(x') \right]$$

$$q^*(x, a) = r(x, a) + \gamma \sum_{x'} p(x' \mid x, a) \max_{a'} q^*(x', a')$$
These equations are also called the Bellman optimality equations. Intuitively, the Bellman optimality equations express that the value of a state under an optimal policy must equal the expected return for the best action from that state. Bellman’s theorem is also known as Bellman’s optimality principle, which is a more general concept.

Policy Iteration

Policy iteration alternates between policy evaluation (computing $v^\pi$ for the current policy $\pi$) and policy improvement (replacing $\pi$ by the greedy policy with respect to $v^\pi$), until the policy no longer changes.

It can be shown that policy iteration converges to an exact solution in a polynomial number of iterations. Each iteration requires computing the value function of the current policy, which, as we have seen, has cubic complexity in the number of states.
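A minimal sketch in NumPy, under the illustrative assumption that the dynamics and rewards are given as arrays `p[x, a, x']` and `r[x, a]`:

```python
import numpy as np

def policy_iteration(p, r, gamma):
    """Policy iteration for a known tabular MDP.

    p: transition tensor of shape (n_states, n_actions, n_states)
    r: reward matrix of shape (n_states, n_actions)
    """
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi  (cubic in |X|)
        P_pi = p[np.arange(n_states), pi]
        r_pi = r[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v
        q = r + gamma * np.einsum("xay,y->xa", p, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v                             # greedy policy is stable
        pi = pi_new
```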

Value Iteration

Value iteration instead repeatedly applies the Bellman optimality operator:

$$v_{k+1}(x) = \max_{a} \left[ r(x, a) + \gamma \sum_{x'} p(x' \mid x, a) \, v_k(x') \right].$$
Value iteration converges to an ε-optimal solution in a polynomial number of iterations; unlike policy iteration, value iteration does not in general converge to an exact solution. An iteration of value iteration can be performed in (virtually) constant time in sparse Markov decision processes. (Sparsity refers to the interconnectivity of the state space: when only few states are reachable from any state, we call an MDP sparse.)

Partial Observability

In this section, we consider how Markov decision processes can be extended to a partially observable setting, where the agent can only access noisy observations $Y_t$ of its state $X_t$.

A partially observable Markov decision process (POMDP) extends an MDP with a set of observations $Y$ and observation probabilities $o(y \mid x)$ of observing $y$ in the (hidden) state $x$. Without actions and rewards, this reduces to a hidden Markov model.

Observe that a Kalman filter can be viewed as a hidden Markov model with conditional linear Gaussian motion and sensor models and a Gaussian prior on the initial state.
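Concretely, the Kalman filter corresponds to the linear Gaussian model

$$X_{t+1} = A X_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \Sigma_x), \qquad Y_t = C X_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \Sigma_y),$$

with a Gaussian prior on $X_0$.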

POMDPs can be reduced to a Markov decision process with an enlarged state space: the belief states, i.e., probability distributions over the hidden states, which the agent updates as it acts and observes.

pai - Bayesian Optimization

Bayesian Optimization

In Bayesian optimization, we want to maximize an unknown function $f^⋆$ over a domain $\mathcal{X}$, given only noisy evaluations $y_t = f^⋆(x_t) + \varepsilon_t$ at sequentially chosen points $x_t$. Evaluations are assumed to be expensive, so we want to get away with as few as possible.

Exploration-Exploitation Dilemma

In Bayesian optimization, we want to learn a model of $f^⋆$ and use this model to optimize $f^⋆$ simultaneously. These goals are somewhat contrary. Learning a model of $f^⋆$ requires us to explore the input space, while using the model to optimize $f^⋆$ requires us to focus on the most promising well-explored areas. This trade-off is commonly known as the exploration-exploitation dilemma.

It is common to use a so-called acquisition function to greedily pick the next point to sample based on the current model.

Online Learning and Bandits


Multi-Armed Bandits (MAB)

The “multi-armed bandits” (MAB) problem is a classical, canonical formalization of the exploration-exploitation dilemma. In the MAB problem, we are provided with k possible actions (arms) and want to maximize our reward online within the time horizon T. We do not know the reward distributions of the actions in advance, however, so we need to trade learning the reward distribution with following the most promising action.

Bayesian optimization can be interpreted as a variant of the MAB problem where there can be a potentially infinite number of actions (arms), but their rewards are correlated (because of the smoothness of the Gaussian process prior). Smoothness means that nearby points in the input space are likely to have similar function values. Because of the smoothness property inherent in the Gaussian process prior, Bayesian optimization can make informed decisions about where to explore next in the input space. The model can leverage information from previously evaluated points to predict the behavior of the objective function at unexplored points more effectively.

Regret

The key performance metric in online learning is the regret.

The cumulative regret over time horizon $T$ compares the attained rewards to the static optimum:

$$R_T = \sum_{t=1}^{T} \left( \max_{x} f^⋆(x) - f^⋆(x_t) \right).$$

An algorithm is called no-regret if $R_T / T \to 0$ as $T \to \infty$, i.e., if the regret grows sublinearly in $T$.

Achieving sublinear regret requires balancing exploration with exploitation. Typically, online learning (and Bayesian optimization) consider stationary environments, hence the comparison to the static optimum.

Acquisition Functions

Throughout our description of acquisition functions, we will focus on a setting where we model $f^⋆$ using a Gaussian process which we denote by f. The methods generalize to other means of learning $f^⋆$ such as Bayesian neural networks.


One possible acquisition function is uncertainty sampling. However, this acquisition function does not at all take into account the objective of maximizing $f^⋆$ and focuses solely on exploration.

Suppose that our model f of $f^⋆$ is well-calibrated, in the sense that the true function lies within its confidence bounds. Consider the best lower bound, that is, the maximum of the lower confidence bound. Now, if the true function is really contained in the confidence bounds, it must hold that the optimum is somewhere above this best lower bound.

Therefore, we only really care what the function looks like in the regions where the upper confidence bound is larger than the best lower bound. The key idea behind the methods that we will explore is to focus exploration on these plausible maximizers.

Upper Confidence Bound

The upper confidence bound (UCB) acquisition function picks the maximizer of the upper confidence bound, with posterior mean $\mu_t$, posterior standard deviation $\sigma_t$, and a confidence parameter $\beta_t$:

$$x_{t+1} = \operatorname*{arg\,max}_{x} \; \mu_t(x) + \beta_t \, \sigma_t(x).$$
This acquisition function naturally trades off exploitation (preferring a large posterior mean) against exploration (preferring a large posterior variance).
This optimization problem is non-convex in general. However, we can use approximate global optimization techniques like Lipschitz optimization (in low dimensions) and gradient ascent with random initialization (in high dimensions). Another widely used option is to sample some random points from the domain, score them according to this criterion, and simply take the best one.

For GP-UCB, one can show a regret bound of the form $R_T = \mathcal{O}\big(\sqrt{T \, \beta_T \, \gamma_T}\big)$, where $\gamma_T$ denotes the maximum information gain after $T$ rounds.
Observe that if the information gain is sublinear in T then we achieve sublinear regret and, in particular, converge to the true optimum. The term “sublinear” refers to a growth rate that is slower than linear.

Intuitively, to ensure the method works even if the unknown function $f^⋆$ is not contained in the confidence bounds, $\beta_t$ re-scales the confidence bounds so that they enclose $f^⋆$.

Probability of Improvement

With $\hat{f}$ the best function value observed so far, probability of improvement picks the point that is most likely to improve upon it. Under the Gaussian process posterior this has the closed form

$$\mathrm{PI}(x) = \mathbb{P}\big( f(x) > \hat{f} \big) = \Phi\!\left( \frac{\mu_t(x) - \hat{f}}{\sigma_t(x)} \right),$$

where $\Phi$ is the standard normal CDF.

Probability of improvement tends to be biased in favor of exploitation, as it prefers points with large posterior mean and small posterior variance.

Expected Improvement

Expected improvement instead weighs improvements by their magnitude, $\mathrm{EI}(x) = \mathbb{E}\big[ \max\{ f(x) - \hat{f},\, 0 \} \big]$, which under the Gaussian process posterior has the closed form

$$\mathrm{EI}(x) = \big( \mu_t(x) - \hat{f} \big) \, \Phi(z) + \sigma_t(x) \, \varphi(z), \qquad z = \frac{\mu_t(x) - \hat{f}}{\sigma_t(x)},$$

with $\varphi$ the standard normal PDF.

Intuitively, expected improvement seeks a large expected improvement (exploitation) while also preferring states with a large variance (exploration).

Thompson Sampling

Probability matching selects each point with the probability that it is optimal under the posterior:

$$\pi(x \mid \mathcal{D}_t) = \mathbb{P}\big( f(x) \ge f(x') \;\; \forall x' \mid \mathcal{D}_t \big).$$

Probability matching is exploratory as it prefers points with larger variance (as they automatically have a larger chance of being optimal), but at the same time exploitative as it effectively discards points with low posterior mean and low posterior variance. Unfortunately, it is generally difficult to compute π analytically given a posterior. Instead, it is common to use a sampling-based approximation of π.

Thompson sampling implements this by drawing one function realization from the posterior and picking its maximizer:

$$\tilde{f}_{t+1} \sim p(f \mid \mathcal{D}_t), \qquad x_{t+1} = \operatorname*{arg\,max}_{x} \; \tilde{f}_{t+1}(x).$$

In many cases, the randomness in the realizations of $\tilde{f}_{t+1}$ is already sufficient to effectively trade exploration and exploitation.

Model Selection

Selecting a model of $f^⋆$ is much harder than in the i.i.d. data setting of supervised learning. There are mainly the two following dangers:

- the data sets collected in active learning and Bayesian optimization are small; and
- the data points are selected dependent on prior observations.

This leads to a specific danger of overfitting. In particular, due to feedback loops between the model and the acquisition function, one may end up sampling the same point repeatedly.

Another approach that often works fairly well is to occasionally (according to some schedule) select points uniformly at random instead of using the acquisition function. This tends to prevent getting stuck in suboptimal parts of the state space.

difference between active learning and Bayesian optimization

Problem Setting:
Active Learning: Active learning typically deals with supervised learning tasks where there is a large pool of unlabeled instances, and the algorithm decides which instances to query for labels to improve the model.
Bayesian Optimization: Bayesian optimization deals with optimization problems where the objective function is unknown, expensive to evaluate, and possibly noisy. It aims to find the global optimum with as few evaluations as possible.

Nature of Queries:
Active Learning: In active learning, the queries are often in the form of “Which instance should be labeled next?” The goal is to select instances that will most benefit the model’s learning process.
Bayesian Optimization: In Bayesian optimization, the queries are in the form of “What point should be evaluated next in the input space to maximize/minimize the objective function?” The goal is to efficiently explore the input space and find the optimal configuration.

Algorithmic Approaches:
Active Learning: Active learning involves various strategies such as uncertainty sampling, query-by-committee, and diversity sampling to select informative instances for labeling.
Bayesian Optimization: Bayesian optimization employs probabilistic surrogate models (often Gaussian processes) to model the unknown objective function. Acquisition functions guide the search to balance exploration and exploitation efficiently.

pai - active learning

Active learning

Active learning is a machine learning paradigm in which a model is trained on a dataset that is not entirely labeled. Instead of relying solely on a fully labeled dataset for training, active learning involves an iterative process where the model actively selects the most informative instances from an unlabeled pool, queries an oracle (typically a human annotator), and adds the newly labeled instances to the training set. The model is then retrained on the expanded labeled dataset.

The key idea behind active learning is to strategically choose the most valuable instances for labeling, with the goal of improving the model’s performance while minimizing the number of labeled instances needed. This process is especially beneficial when obtaining labeled data is expensive or time-consuming.

Active learning strategies vary in how they select instances for labeling. Common strategies include uncertainty sampling (select instances where the model is uncertain), query-by-committee (select instances where different model hypotheses disagree), and diversity sampling (select instances to ensure diverse coverage of the input space).
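As a small sketch of the first strategy (assuming a scikit-learn-style classifier exposing `predict_proba`; all names illustrative):

```python
import numpy as np

def uncertainty_sampling_query(model, pool_X):
    """Return the index of the pool instance with maximal predictive entropy."""
    proba = model.predict_proba(pool_X)                      # shape (n, n_classes)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)   # per-instance entropy
    return int(entropy.argmax())                             # query this one next
```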

Conditional Entropy

For discrete random variables $X$ and $Y$, the conditional entropy is

$$H(X \mid Y) = \mathbb{E}_{y \sim p(y)}\big[ H(X \mid Y = y) \big] = -\sum_{x, y} p(x, y) \log p(x \mid y).$$

Intuitively, the conditional entropy of X given Y describes our average surprise about realizations of X given a particular realization of Y, averaged over all such possible realizations of Y. In other words, conditional entropy corresponds to the expected remaining uncertainty in X after we observe Y.

The chain rule of entropy states

$$H(X, Y) = H(X) + H(Y \mid X).$$
That is, the joint entropy of X and Y is given by the uncertainty about X and the additional uncertainty about Y given X.

Mutual Information

The mutual information of $X$ and $Y$ is defined as

$$I(X; Y) = H(X) - H(X \mid Y).$$

In words, we subtract the uncertainty left about X after observing Y from our initial uncertainty about X. This measures the reduction in our uncertainty in X (as measured by entropy) upon observing Y.

Equivalently, mutual information can be written as a KL divergence:

$$I(X; Y) = \mathrm{KL}\big( p(x, y) \,\|\, p(x) \, p(y) \big).$$

Thus, the mutual information between X and Y can be understood as the approximation error (or information loss) when assuming that X and Y are independent.

Conditioning on a third variable $Z$ yields the conditional mutual information

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$$

Thus, the conditional mutual information corresponds to the reduction of uncertainty in X when observing Y, given we already observed Z.

Following our introduction of mutual information, it is natural to answer the question “where should I collect data?” by saying “wherever mutual information is maximized”.

Submodularity of Mutual Information

A set function $F : 2^{\mathcal{X}} \to \mathbb{R}$ is submodular iff for all $A \subseteq B \subseteq \mathcal{X}$ and all $x \in \mathcal{X} \setminus B$,

$$F(A \cup \{x\}) - F(A) \;\ge\; F(B \cup \{x\}) - F(B).$$

That is, “adding” x to the smaller set A yields more marginal gain than adding x to the larger set B. In other words, the function F has “diminishing returns”. In this way, submodularity can be interpreted as a notion of “concavity” for discrete functions.

This matters algorithmically: for monotone submodular objectives such as mutual information, greedily adding the point with the largest marginal gain yields a $(1 - 1/e)$-approximation to the optimal set of a given size (Nemhauser et al., 1978).

Maximizing Mutual Information

Uncertainty Sampling

For a Gaussian process model with homoscedastic Gaussian noise $\sigma_n^2$, the marginal gain in mutual information from sampling at $x$ is $\frac{1}{2} \log\big( 1 + \sigma_t^2(x) / \sigma_n^2 \big)$, which is monotone in the posterior variance $\sigma_t^2(x)$.

Therefore, if $f$ is modeled by a Gaussian process and we assume homoscedastic noise, greedily maximizing mutual information corresponds to simply picking the point $x$ with the largest posterior variance:

$$x_{t+1} = \operatorname*{arg\,max}_{x} \; \sigma_t^2(x).$$

This strategy is also called uncertainty sampling.

Heteroscedastic Noise

Uncertainty sampling is clearly problematic if the noise is heteroscedastic. If there are a particular set of inputs with a large aleatoric uncertainty dominating the epistemic uncertainty, uncertainty sampling will continuously choose those points even though the epistemic uncertainty will not be reduced substantially.

In the heteroscedastic case, a natural criterion is instead to maximize the ratio of epistemic to aleatoric uncertainty:

$$x_{t+1} = \operatorname*{arg\,max}_{x} \; \frac{\sigma_t^2(x)}{\sigma_n^2(x)}.$$

Thus, we choose locations that trade large epistemic uncertainty with large aleatoric uncertainty. Ideally, we find a location where the epistemic uncertainty is large, and the aleatoric uncertainty is low, which promises a significant reduction of uncertainty around this location.

Classification

Here, uncertainty sampling corresponds to selecting samples that maximize the entropy of the predicted label $y_x$.
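In symbols, with $p(y \mid x, \mathcal{D}_t)$ the model’s predictive distribution:

$$x_{t+1} = \operatorname*{arg\,max}_{x} \; H\big[ y_x \mid \mathcal{D}_t \big] = \operatorname*{arg\,max}_{x} \; -\sum_{y} p(y \mid x, \mathcal{D}_t) \log p(y \mid x, \mathcal{D}_t).$$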