Probabilistic Artificial Intelligence - Gaussian Process

Gaussian Process

Gaussian distribution

The univariate Gaussian distribution is simple; here we focus on the multivariate Gaussian distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$. The covariance matrix is always symmetric and positive semi-definite. Why is the Gaussian distribution so important? Because, under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization: the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes since it allows Bayesian inference.

Gaussian process

what is GP

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.

Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.

Mean and covariance functions

The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.

The covariance function $k(x,x')$ computes the covariance $\mathrm{cov}[f(x),f(x')]$ between the corresponding function values by evaluating k at the corresponding inputs $x, x'$ (the kernel trick). Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.
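As a small illustration, here is a sketch of the widely used RBF (squared-exponential) covariance function in numpy; the function name rbf_kernel and its lengthscale and signal_var arguments are illustrative choices, not taken from the notes above:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2     # pairwise squared distances
    return signal_var * np.exp(-0.5 * sq_dist / lengthscale**2)

X = np.linspace(0, 5, 6)
K = rbf_kernel(X, X)    # 6 x 6 covariance matrix: symmetric, positive semi-definite
```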

How GP works

For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. We will denote the training data as Y and the test data as X. As mentioned before, the key idea of Gaussian processes is to model the underlying distribution of X together with Y as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability $P(X∣Y)$.

In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $∣X∣=N$ test points, the corresponding multivariate Gaussian distribution is also N-dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the i-th component of the resulting vector as the function value corresponding to the i-th test point.
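To make the conditioning step concrete, the following numpy sketch conditions a zero-mean GP prior (with the rbf_kernel from the sketch above) on a few noisy observations; the toy data, noise level and jitter term are assumptions for illustration only:

```python
import numpy as np

X_train = np.array([0.5, 1.8, 3.2, 4.4])        # toy training inputs
y_train = np.sin(X_train)                        # toy training targets
X_test = np.linspace(0, 5, 50)                   # test inputs where we predict
noise_var = 1e-2

K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
K_s = rbf_kernel(X_train, X_test)
K_ss = rbf_kernel(X_test, X_test)

K_inv = np.linalg.inv(K)
post_mean = K_s.T @ K_inv @ y_train                                   # posterior mean
post_cov = K_ss - K_s.T @ K_inv @ K_s + 1e-8 * np.eye(len(X_test))    # plus jitter

# A "prediction" is a draw from this N-dimensional conditional Gaussian
sample = np.random.multivariate_normal(post_mean, post_cov)
```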

Marginal likelihood and GP training

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

The likelihood function represents the probability of observing the given data X given a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component in frequentist statistics. It is used to estimate the maximum likelihood estimates (MLE) of the parameters.

The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.

To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters, i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.

Maximizing the marginal likelihood behaves much better than finding the maximum likelihood estimate $\operatorname*{argmax}_{f(\mathbf{X}),\sigma_n} p(\mathbf{y}\mid f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or the maximum a-posteriori point estimate $\operatorname*{argmax}_{f(\mathbf{X}),\sigma_n} p(\mathbf{y}\mid f(\mathbf{X}),\sigma_n^2\mathbf{I})\,p(f(\mathbf{X})\mid\theta)$.
These two approaches would lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations y and letting the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out. By averaging (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity. Choose a model that is too inflexible, and the marginal likelihood $p(y∣X,θ)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y∣X,θ)$ will also be low.
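For reference, the quantity being maximized for a zero-mean GP with Gaussian noise is $\log p(\mathbf{y}\mid X,\theta) = -\tfrac12\mathbf{y}^\top(K_\theta+\sigma_n^2 I)^{-1}\mathbf{y} - \tfrac12\log|K_\theta+\sigma_n^2 I| - \tfrac{n}{2}\log 2\pi$. A minimal numpy sketch of a Cholesky-based evaluation (not a full training loop):

```python
import numpy as np

def log_marginal_likelihood(y, K, noise_var):
    """log p(y | X, theta) for a zero-mean GP; K = k_theta(X, X)."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                            # Ky = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (Ky)^-1 y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                  # 0.5 * log det(Ky)
            - 0.5 * n * np.log(2 * np.pi))
```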

what is kernel

If an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by k(x, x′); this is sometimes called the kernel trick. This technique is particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.

The kernel k, which is often also called the covariance function, is evaluated pairwise on all the points. The kernel receives two points $t,t' \in \mathbb{R}^n$ as input and returns a similarity measure between those points in the form of a scalar.

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.

The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.

A kernel function can be stationary or isotropic. A kernel function is stationary if $k(x,x')=k(x-x')$. A kernel function is isotropic if $k(x,x')=k(\|x-x'\|_2)$. Stationarity implies that the covariance function only depends on distances $\|x−x'\|$ of the corresponding inputs, and not on the location of the individual data points. This means that if the inputs are close to each other, the corresponding function values are strongly correlated.

Interpretation of the hyperparameters

Stationary covariance functions typically contain the term $\frac{\tau}{l}=\frac{\|\mathbf{x}-\mathbf{x}'\|}{l}$, where $l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. Short lengthscales thus allow functions to vary strongly and display more flexibility in the range of the data.

The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.

training tips

The marginal likelihood is non-convex with potentially multiple local optima. Therefore, we may end up in (bad) local optima when we choose a gradient-based optimization method. In order to initialize these parameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume we have training inputs X and training targets y. The signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable initialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.
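A minimal sketch of such an initialization, assuming 1-D numpy arrays X_train and y_train as in the earlier sketch; the 0.1 factor and the median-distance heuristic for the lengthscale are illustrative choices, not a universal recipe:

```python
import numpy as np

signal_var_init = np.var(y_train)              # empirical variance of the targets
noise_var_init = 0.1 * signal_var_init         # noise assumed smaller than the signal
dists = np.abs(X_train[:, None] - X_train[None, :])
lengthscale_init = np.median(dists[dists > 0]) # a scale typical of the input spacing
```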

Local optima are the largest problem preventing good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:

Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.

Which optimum we end up in depends on the initialization of our lengthscale, as we are likely to end up in the local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the situation is a somewhat plausible explanation of the data. In practice, it is harder to get out of a long-lengthscale solution, since the optimizer often struggles to get beyond the huge plateau that is typical for very long lengthscales.

How to choose a kernel

The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model. In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference; just as you don't know what the true parameters are, you also don't know what the true kernel is. You should probably try out at least a few different kernels and compare their marginal likelihoods on your training data.

others

The GP does not require any input normalization, but it can make sense to do so for numerical reasons.

reference

https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/
https://infallible-thompson-49de36.netlify.app/
A. Krause, “Probabilistic Artificial Intelligence”.

bigdata - Syntax

syntax

CSV

CSV is a textual format, in the sense that it can be opened in a text editor. CSV means comma-separated values. The main challenge with CSV files is that, in spite of a standard (RFC 4180), in practice there are many different dialects and variations, which limits interoperability. For example, another character can be used instead of the comma (tabulation, semi-colons, etc). Also, when a comma (or the special character used in its stead) needs to actually appear in a value, it needs to be escaped.
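A quick sketch with Python's csv module showing two such dialects (the sample rows are made up): the same logical record is comma-separated in one case and semicolon-separated in the other, and in the first case the comma inside the value has to be escaped with quotes.

```python
import csv, io

comma = 'name,city\n"Doe, Jane",Zurich\n'        # comma dialect, quoted value
semicolon = 'name;city\nDoe, Jane;Zurich\n'      # semicolon dialect, no quoting needed

print(list(csv.reader(io.StringIO(comma))))                     # [['name', 'city'], ['Doe, Jane', 'Zurich']]
print(list(csv.reader(io.StringIO(semicolon), delimiter=';')))  # [['name', 'city'], ['Doe, Jane', 'Zurich']]
```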

Data denormalization

Data denormalization makes a lot of sense in read-intensive scenarios in which not having to join brings a significant performance improvement.

The difference with CSV is that, in JSON, the attributes appear in every tuple, while in CSV they do not appear except in the header line. JSON is appropriate for data denormalization because including the attributes in every tuple allows us to drop the identical support requirement.

The generic name for denormalized data (in the sense of heterogeneous and nested) is "semi-structured data". Textual formats such as XML and JSON have the advantage that they can both be processed by computers, and can also be read, written and edited by humans. Another very important and characterizing aspect of XML and JSON is that they are standards: XML is a W3C standard. W3C, also known as the World Wide Web Consortium, is the same body that also standardizes HTML, HTTP, etc. JSON is now an ECMA standard, which is the same body that also standardizes JavaScript.

Whichever syntax is used, they have in common the concept of well-formedness. A string is said to be well-formed if it belongs to the language. Concretely, when a document is well-formed XML, it means that it can be successfully opened by an editor as XML with no errors.

JSON

JSON stands for JavaScript Object Notation because the way it looks originates from JavaScript syntax; however, it now lives its own life completely independently of JavaScript.

JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays. Strings are simply text. In JSON, strings always appear in double quotes. Obviously, strings could contain quotes, and in order not to confuse them with the surrounding quotes, they need to be differentiated. This is called escaping and, in JSON, escaping is done with the backslash character (\).

JSON generally supports numbers, without explicitly naming any types nor making any distinction between numbers apart from how they appear in syntax. The way a number appears in syntax is called a lexical representation, or a literal. JSON places a few restrictions: a leading + is not allowed. Also, a leading 0 is not allowed except if the integer part is exactly 0.

There are two Booleans, true and false. Arrays are simply lists of values. The concept of list is abstract and mathematical; the concept of array is the syntactic counterpart of a list. Objects are simply maps from strings to values. The concept of object is the syntactic counterpart of a map, i.e., an object is a physical representation of an abstract map that explicitly lists all string-value pairs. The keys of an object must be strings. The JSON standard recommends for keys to be unique within an object.
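A small sketch with Python's json module, using one made-up document that exercises all six building blocks:

```python
import json

# strings, numbers, Booleans, null, arrays, and objects in a single document
doc = '{"name": "Einstein", "age": 42, "active": true, "nickname": null, "tags": ["a", "b"], "address": {"city": "Zurich"}}'

value = json.loads(doc)             # parsing fails if the string is not well-formed JSON
print(value["address"]["city"])     # Zurich
print(json.dumps(value, indent=2))  # serialize it back, pretty-printed
```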

XML

XML stands for eXtensible Markup Language. It resembles HTML, except that it allows for any tags and that it is stricter in what it allows.

XML's most important building blocks are elements, attributes, text and comments. XML is a markup language, which means that content is "tagged". Tagging is done with XML elements. An XML element consists of an opening tag and a closing tag; what is "tagged" is everything in between the opening tag and the closing tag. Tags consist of a name surrounded with angle brackets < … >, and the closing tag has an additional slash in front of the name. We use a convenient shortcut to denote an empty element with a single tag and a slash at the end: for example, <person/> is equal to <person></person>.
Unlike JSON keys, element names can repeat at will.

Attributes appear in any opening element tag and are basically key-value pairs. Values can be either double-quoted or single-quoted; the key is never quoted, and it is not allowed to have unquoted values. Within the same opening tag, there cannot be duplicate keys. Attributes can also appear in an empty element tag, but they can never appear in a closing tag. It is not allowed to create attributes that start with XML or xml, or any case combination thereof.

Text, in XML syntax, is simply freely appearing in elements and without any quotes (attribute values are not text!). Within an element, text can freely alternate with other elements. This is called mixed content and is unique to XML.

Comments in XML look like so: <!-- comment -->. XML documents can be identified as such with an optional text declaration containing a version number and an encoding, like <?xml version="1.0" encoding="UTF-8"?>. The version is either 1.0 or 1.1. Another tag that might appear is the doctype declaration, like <!DOCTYPE person>.

Remember that in JSON, it is possible to escape sequences with a backslash character. In XML, this is done with an ampersand (&) character. There are exactly five escape sequences pre-defined in XML: &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. Escape sequences can be used anywhere in text and in attribute values. At other places (element names, attribute names, inside comments), they will not be recognized.
There are a few places where they are mandatory: & and < MUST be escaped. " and ' should also be escaped in quoted attribute values.

Namespaces are an extension of XML that allows users to group their elements and attributes in packages, similar to Python modules, Java packages or C++ namespaces. A namespace is identified with a URI. A point of confusion is that XML namespaces often start with http://, but are not meant to be entered as an address into a browser. A (default) namespace declaration is an attribute in an opening tag, like xmlns="http://www.example.com/my-namespace". If you remember, we saw that attributes starting with xml are forbidden, and this is because they are reserved for namespace declarations. What about documents that use multiple namespaces? This is done by associating namespaces with prefixes, which act as shorthands for a namespace, declared like xmlns:pre="http://www.example.com/my-namespace". Then, we can use the prefix shorthand in every element that we want to have in this namespace.
So, given any element, it is possible to find its local name, its (possibly absent) prefix, and its (possibly absent) namespace. The triplet (namespace, prefix, local name) is called a QName.

Attributes can also live in namespaces, that is, attribute names are generally QNames. However, there are two very important aspects to consider. First, unprefixed attributes are not sensitive to default namespaces: unlike elements, the namespace of an unprefixed attribute is always absent even if there is a default namespace. Second, it is possible for two attributes to collide if they have the same local name, and different prefixes but associated with the same namespace (but again, we told you: do not do that!).
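A small sketch with Python's xml.etree.ElementTree showing how namespaces, prefixes and QNames surface in practice (the URIs and element names are made-up examples); ElementTree expands a QName into the form {namespace}localname:

```python
import xml.etree.ElementTree as ET

doc = """<catalog xmlns="http://example.com/shop"
                  xmlns:m="http://example.com/meta">
           <item m:id="1">Book</item>
         </catalog>"""

item = ET.fromstring(doc)[0]
print(item.tag)      # {http://example.com/shop}item -> the default namespace applies to the element
print(item.attrib)   # {'{http://example.com/meta}id': '1'} -> the prefixed attribute keeps its namespace
```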

references

https://ghislainfourny.github.io/big-data-textbook/

bigdata - Distributed file systems

Distributed file systems

requirements of a distributed file system

Going back to our capacity-throughput-latency view of storage, a distributed file system is designed so that, in cruise mode, its bottleneck will be the data flow (throughput), not the latency. We saw that capacity increased much faster than throughput, and that this can be solved with parallelism. We saw that throughput increased much faster than latency decreased, and that this can be solved with batch processing. Distributed file systems support both parallelism and batch processing natively, forming the core part of the ideal storage system accessed by MapReduce or Apache Spark. The origins of such a system go back to the design of GoogleFS, the Google File System. Later on, an open-source version of it was released as part of the Hadoop project, initiated by Doug Cutting at Yahoo, and called HDFS, for Hadoop Distributed File System.

HDFS

HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system. Unlike in S3, HDFS files are furthermore not stored as monolithic black boxes: HDFS exposes them as lists of blocks. As for the block size: HDFS blocks are typically 64 MB or 128 MB large, and are thus considerably larger than blocks on a local hard drive (around 4 kB). HDFS is designed to run on a cluster of machines.

architecture

HDFS is implemented on a fully centralized architecture, in which one node is special and all others are interchangeable and connected to it.
In the case of HDFS, the central node is called the NameNode and the other nodes are called the DataNodes. Every file is divided into chunks called blocks. All blocks have a size of exactly 128 MB, except the last one which is usually smaller. Each one of the blocks is then replicated and stored on several DataNodes. How many times? This is a parameter called the replication factor. By default, it is 3.

The NameNode is responsible for the system-wide activity of the HDFS cluster. It stores in particular three things:
• the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information, similar to Unix-based systems;
• a mapping from each file to the list of its blocks. Each block, in this list, is represented with a 64-bit identifier; the content of the blocks is not on the NameNode;
• a mapping from each block, represented with its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.
The DataNodes store the blocks themselves, on their local disks. DataNodes send regular heartbeats to the NameNode. The frequency of these heartbeats is configurable and is by default a few seconds (e.g., 3 s, but this value may change across releases). This is a way to let the NameNode know that everything is alright. Finally, the DataNode also sends, every couple of hours (e.g., 6 h, but this value may change across releases), a full report including all the blocks that it contains. A NameNode never initiates a connection to a DataNode.

Finally, DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts. When a replication pipeline is ongoing and a new block is being written to the cluster, the content of the block is not sent in one single 128 MB packet. Rather, it is sent in smaller packets (e.g., 64 kB) in a streaming fashion via a network protocol.

replicas

Having this in mind, the first replica of the block, by default, gets written to the same machine that the client is running on. The second replica is written to a DataNode sitting in a different rack than the client; we call this rack B. The third replica is written to another DataNode on the same rack B. Further replicas are written mostly at random, but respecting two simple rules for resilience: at most one replica per node, and at most two replicas per rack.
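The placement rules above can be summarized with a toy sketch (not the actual HDFS code); it assumes the client runs on a DataNode and that every rack contains at least two nodes:

```python
import random

def place_replicas(nodes, client_node, replication_factor=3):
    """nodes and client_node are (node_id, rack_id) pairs."""
    replicas = [client_node]                                                 # 1st: the writer's node
    second = random.choice([n for n in nodes if n[1] != client_node[1]])     # 2nd: a different rack
    replicas.append(second)
    third = random.choice([n for n in nodes
                           if n[1] == second[1] and n[0] != second[0]])      # 3rd: same rack as 2nd
    replicas.append(third)
    while len(replicas) < replication_factor:                                # further replicas: mostly random
        cand = random.choice(nodes)
        if cand not in replicas and sum(r[1] == cand[1] for r in replicas) < 2:
            replicas.append(cand)                                            # <= 1 per node, <= 2 per rack
    return replicas

nodes = [("n1", "rackA"), ("n2", "rackA"), ("n3", "rackB"),
         ("n4", "rackB"), ("n5", "rackC"), ("n6", "rackC")]
print(place_replicas(nodes, ("n1", "rackA")))
```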

Fault tolerance

HDFS has a single point of failure: the NameNode. If the metadata stored on it is lost, then all the data on the cluster is lost, because it is not possible to reassemble the blocks into files any more. For this reason, the metadata is backed up. More precisely, the file namespace, containing the directory and file hierarchy as well as the mapping from files to block IDs, is backed up to a so-called snapshot. Updates to the file system arriving after the snapshot has been made are instead stored in a journal, called the edit log, that lists the updates sorted by time of arrival. The snapshot and edit log are stored either locally or on a network-attached drive (not HDFS itself).

Logging and importing data

Two tools are worth mentioning: Apache Flume lets you collect, aggregate and move log data to HDFS. Apache Sqoop lets you import data from a relational database management system to HDFS.

references

https://ghislainfourny.github.io/big-data-textbook/
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

bigdata - introduction

big data

Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the issue of the growing discrepancy between capacity, throughput and latency.

data independence

Data independence means that the logical view on the data is cleanly separated, decoupled, from its physical storage.

architecture

Stack: storage, compute, model, language

data model

what data looks like and what you can do with it.

data shapes

tables, trees, cubes

table

Row (Tuple), Column (Attribute), Primary Key, Value.

relational tables

schema: A set of attributes.
extension: A set/bag/list of tuples.
Three constraints: Relational integrity, domain integrity and atomic integrity.
Superkey, Candidate key(minimal superkey).

Database Normalization

In database management systems (DBMS), normal forms are a series of guidelines that help to ensure that the design of a database is efficient, organized, and free from data anomalies.
First Normal Form (1NF): In 1NF, each table cell should contain only a single (atomic) value, and each column should have a unique name.
Second Normal Form (2NF): 2NF eliminates redundant data by requiring that each non-key attribute be fully dependent on the entire primary key (no partial dependencies).
Third Normal Form (3NF): 3NF builds on 2NF by requiring that non-key attributes do not depend on other non-key attributes (no transitive dependencies).

Denormalization

Denormalization is a database optimization technique in which we add redundant data to one or more tables. This can help us avoid costly joins in a relational database. Note that denormalization does not mean ‘reversing normalization’ or ‘not to normalize’. It is an optimization technique that is applied after normalization.

SQL

SQL was originally named SEQUEL, for Structured English QUEry Language. SQL is a declarative language, which means that the user specifies what they want, and not how to compute it: it is up to the underlying system to figure out how to best execute the query.

View and Table

A view is the result of an SQL query and is a virtual table, whereas a table is made up of rows and columns that store the information of some object and can be used to retrieve that data whenever required. A view contains no data of its own; it is like a 'window' through which data from tables can be viewed or changed. The view is stored as a SELECT statement in the data dictionary. Creating a view fulfills a requirement without storing a separate copy of the data; because a view always takes its data from a base table, the information it shows is accurate and up to date.

SQL:1999 added the WITH clause to define "statement-scoped views". They are not stored in the database schema: instead, they are only valid in the query they belong to. This makes it possible to improve the structure of a statement without polluting the global namespace. WITH is not a stand-alone command like CREATE VIEW is: it must be followed by SELECT.
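A small sketch using Python's built-in sqlite3 module (the table and data are made up) to show such a statement-scoped view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100), ("EU", 50), ("US", 70)])

# 'totals' exists only inside this one statement; nothing is added to the schema
query = """
WITH totals AS (
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
)
SELECT region, total FROM totals WHERE total > 60
"""
print(conn.execute(query).fetchall())   # [('EU', 150), ('US', 70)]
```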

Natural Join and Inner Join

A natural join joins two tables based on attributes that have the same name and data type; the resulting table contains all the attributes of both tables but keeps only one copy of each common column. An inner join joins two tables on the basis of the column that is explicitly specified in the ON clause; the resulting table contains all the attributes from both tables, including the common columns.

data storage

Stack: Storage, Encoding, Syntax, Data models, Validation, Processing, Indexing, Data stores, Querying, User interfaces.

database and data lake

There are two main paradigms for storing and retrieving data: the database and the data lake. Data can be imported into the database (this is called ETL, for Extract-Transform-Load; ETL is often used as a verb). The data is internally stored in a proprietary format that is optimized to make queries faster. This includes, in particular, building indices on the data.
On the other hand, data can also just be stored on some file system. This paradigm is called the data lake paradigm and has gained a lot of popularity in the past two decades. It is slower; however, users can start querying their data without the effort of ETLing.

scaling up and scaling out

First, one can buy a bigger machine: more memory, more or faster CPU cores, a larger disk, etc. This is called scaling up. Second, one can buy more, similar machines and share the work across them. This is called scaling out.

Object stores

Amazon’s object storage system is called Simple Storage Service, abbreviated S3. From a logical perspective, S3 is extremely simple: objects are organized in buckets. Buckets are identified with a bucket ID, and each object within a bucket is identified with an Object ID.

CAP theorem

Consistency: at any point in time, the same request to any server returns the same result; in other words, all nodes see the same data.
Availability: the system is available for requests at all times with very high availability.
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned.
The CAP theorem is basically an impossibility triangle: a system cannot guarantee all three properties at the same time; real systems are usually CP, AP or CA.

REST APIs

REST (Representational State Transfer) is an architectural style for designing networked applications. RESTful services often use HTTP as the communication protocol. A client and a server communicating with the HTTP protocol interact in terms of methods applied to resources. A resource is referred to with what is called a URI, which stands for Uniform Resource Identifier. A client can act on resources by invoking methods, with an optional body. The most important methods are GET, PUT, DELETE and POST.

REST is not a standard or a protocol; it is an approach to, or architectural style for, writing APIs. REST is the architectural style, and RESTful is an interpretation of it: if your back-end server has a REST API and you make client-side requests (from a website or application) to this API, then your client is RESTful. Every request you make has an HTTP status code. There are a lot of them, and they are divided into 5 classes. The first number indicates which class a code belongs to:
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
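A sketch of what a RESTful client looks like in Python with the requests library; the server URL and the resource are hypothetical:

```python
import requests

BASE = "https://api.example.com"                 # hypothetical RESTful server

resp = requests.get(f"{BASE}/users/42")          # GET: read the resource
print(resp.status_code)                          # e.g. 200 (success) or 404 (client error)

requests.put(f"{BASE}/users/42", json={"name": "Alice"})   # PUT: create or replace it
requests.post(f"{BASE}/users", json={"name": "Bob"})       # POST: let the server create a new one
requests.delete(f"{BASE}/users/42")                        # DELETE: remove it
```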

Amazon S3 and Azure Blob Storage


Key-value store

A key-value store differs from a typical relational database in three aspects:
• Its API is considerably simpler than that of a relational database (which comes with query languages).
• It does not ensure atomic consistency; instead, it guarantees eventual consistency, which we covered earlier in this chapter.
• A key-value store scales out well, in that it is very fast also at large scales.

Amazon Dynamo

It is itself based (with some modifications) on the Chord protocol, which is a distributed hash table. On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles. The first design principle is incremental stability: new nodes can join the system at any time, and nodes can leave the system at any time, sometimes gracefully, sometimes in a sudden crash. The second principle is symmetry: no node is special in any way. The third principle is decentralization: there is no "central node" that orchestrates the others. The fourth principle is heterogeneity: the nodes may have different CPU power, amounts of memory, etc.

A central aspect of the design of a distributed hash table, and in particular of the Chord protocol, is that every logical key is hashed to bits that we will call IDs. In the case of Dynamo, the hash is made of 128 bits (16 bytes). In the Chord protocol, a technology called a finger table is used: each node knows the next node clockwise, the second node, the 4th node, the 8th node, and so on. Dynamo changes this design to so-called "preference lists": each node knows, for every key (or key range), which node(s) are responsible for (and hold a copy of) it. This is done by associating every key (or key range) with a list of nodes, by decreasing priority (going down the ring clockwise).
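A toy sketch of the ring logic: keys and nodes are hashed onto the same ring, and the preference list is built by walking clockwise from the key's position. MD5 is used here only because it conveniently yields 128-bit IDs, and the node names are made up.

```python
import hashlib
from bisect import bisect_right

def ring_id(name):
    """Hash a key or node name to a 128-bit ID on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

ring = sorted((ring_id(n), n) for n in ["node-A", "node-B", "node-C", "node-D"])
ids = [i for i, _ in ring]

def preference_list(key, n_replicas=3):
    """The first n_replicas nodes found clockwise from the key's position."""
    idx = bisect_right(ids, ring_id(key)) % len(ring)
    return [ring[(idx + i) % len(ring)][1] for i in range(n_replicas)]

print(preference_list("my-key"))
```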

Distributed hash tables, including Dynamo, are typically AP. A fundamental conceptual tool in AP systems is the use of vector clocks. Vector clocks are a way to annotate versions when they follow a DAG structure. A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally.

vector clock
Lamport’s logical clock

Lamport's logical clock was created by Leslie Lamport. It is a procedure to determine the order in which events occur. It provides a basis for the more advanced vector clock algorithm. A Lamport logical clock is needed because of the absence of a global clock in a distributed operating system.

Implementation Rules:
[IR1]: If a -> b ('a' happened before 'b' within the same process), then Ci(b) = Ci(a) + d.
[IR2]: Cj = max(Cj, tm + d), i.e., when process j receives a message carrying the timestamp tm = Ci(a), it sets its clock to the maximum of Cj and tm + d.
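A minimal Python sketch of these two rules (with d = 1; the process identity is implicit in which object you call):

```python
class LamportClock:
    def __init__(self, d=1):
        self.c = 0
        self.d = d

    def tick(self):            # IR1: local event, or just before sending a message
        self.c += self.d
        return self.c

    def receive(self, tm):     # IR2: tm is the timestamp carried by the incoming message
        self.c = max(self.c, tm + self.d)
        return self.c

p, q = LamportClock(), LamportClock()
t = p.tick()                   # p sends a message stamped with t
print(q.receive(t))            # q's clock jumps past t
```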

Vector Clocks in Distributed Systems

Vector Clock is an algorithm that generates partial ordering of events and detects causality violations in a distributed system.
How does the vector clock algorithm work :

Initially, all the clocks are set to zero.
Every time an internal event occurs in a process, the value of the process's logical clock in the vector is incremented by 1.
Every time a process receives a message, the value of the process's logical clock in the vector is incremented by 1, and moreover, each element is updated by taking the maximum of the value in its own vector clock and the value in the vector of the received message (for every element).

To sum up, vector clock algorithms are used in distributed systems to provide a causally consistent ordering of events, but the entire vector is sent to each process for every message sent, in order to keep the vector clocks in sync.

partial order relation

Note that vector clocks can be compared to each other with a partial order relation ≤. A partial order relation is any relation that is reflexive, antisymmetric, and transitive. A total order relation is a partial order in which every element of the set is comparable with every other element of the set. All total order relations are partial order relations, but the converse is not always true.
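A minimal sketch of the update rules above and of the partial order ≤, with clocks represented as Python dicts from node name to counter (the node names are made up):

```python
def increment(clock, node):
    """Internal event or send: bump this node's own entry."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(local, received, node):
    """Receive: element-wise maximum with the sender's clock, then bump own entry."""
    keys = set(local) | set(received)
    return increment({k: max(local.get(k, 0), received.get(k, 0)) for k in keys}, node)

def leq(a, b):
    """Partial order: a <= b iff every entry of a is <= the corresponding entry of b."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))

a = {"n1": 2, "n2": 1}
b = {"n1": 1, "n2": 3}
print(leq(a, b), leq(b, a))    # False False -> the two versions are concurrent
```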

references

https://www.geeksforgeeks.org/normal-forms-in-dbms/
https://www.geeksforgeeks.org/denormalization-in-databases/
https://www.geeksforgeeks.org/difference-between-view-and-table/
https://modern-sql.com/feature/with
https://www.geeksforgeeks.org/sql-natural-join/
https://mlsdev.com/blog/81-a-beginner-s-tutorial-for-understanding-restful-api
https://ghislainfourny.github.io/big-data-textbook/
https://www.geeksforgeeks.org/lamports-logical-clock/

opencv

basic functions

reading and writing

cv2.VideoWriter_fourcc('M', 'P', '4', 'V')
cv2.VideoWriter(filename, fourcc, fps, frameSize[, isColor])
cv2.VideoWriter.write(image)
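A minimal usage sketch (writing a few black frames to a made-up file name); note that frameSize is (width, height), while the numpy frame is (height, width, channels):

```python
import cv2
import numpy as np

fourcc = cv2.VideoWriter_fourcc('M', 'P', '4', 'V')
writer = cv2.VideoWriter('out.mp4', fourcc, 25.0, (640, 480))   # fps=25, frameSize=(w, h)

frame = np.zeros((480, 640, 3), dtype=np.uint8)                 # one black BGR frame
for _ in range(30):
    writer.write(frame)
writer.release()
```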

object tracking

object tracking

Multiple Object Tracking (MOT) is the task of detecting various objects of interest in a video, tracking these detected objects in subsequent frames by assigning them a unique ID, and maintaining these unique IDs as the objects move around in successive frames. Generally, multiple object tracking happens in two stages: object detection and object association. Object detection is the process of identifying all potential objects of interest in the current frame using object detectors such as Faster-RCNN or YOLO. Object association is the process of linking objects detected in the current frame with their corresponding objects from previous frames, referred to as tracklets. Object or instance association is usually done by predicting the object's location in the current frame based on the previous frames' tracklets using a Kalman filter, followed by one-to-one linear assignment, typically using the Hungarian algorithm, to minimise the total differences between the matching results.
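A sketch of the assignment step using scipy's Hungarian-algorithm implementation; the cost matrix here is made up and stands for 1 − IoU between Kalman-predicted tracklet boxes (rows) and current detections (columns):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.1, 0.9, 0.8],     # tracklet 0 vs detections 0..2
                 [0.7, 0.2, 0.9]])    # tracklet 1 vs detections 0..2

rows, cols = linear_sum_assignment(cost)   # one-to-one assignment minimizing total cost
for t, d in zip(rows, cols):
    if cost[t, d] < 0.7:                   # gate out implausible matches
        print(f"tracklet {t} -> detection {d}")
```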

Metrics

MOTP (Multiple Object Tracking Precision)

MOTP (Multi-Object Tracking Precision) expresses how well exact positions of the object are estimated. It is the total error in estimated position for matched ground truth-hypothesis pairs over all frames, averaged by the total number of matches made. This metric is not responsible for recognizing object configurations and evaluating object trajectories.

MOTA (Multiple Object Tracking Accuracy)

MOTA (Multi-Object Tracking Accuracy) shows how many errors the tracker system has made in terms of Misses, False Positives, Mismatch errors, etc. Therefore, it can be derived from three error ratios: the ratio of Misses, the ratio of False positives, and the ratio of Mismatches over all the frames.

IDF1 score (IDF1)

IDF1 score (IDF1) is the ratio of correctly identified detections over the average of ground truth and predicted detections.
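In symbols (the standard definitions from the MOT literature, stated here for reference): $\mathrm{MOTP}=\frac{\sum_{i,t} d_{i,t}}{\sum_t c_t}$, where $d_{i,t}$ is the localization error of matched pair $i$ in frame $t$ and $c_t$ is the number of matches in frame $t$; $\mathrm{MOTA}=1-\frac{\sum_t(\mathrm{FN}_t+\mathrm{FP}_t+\mathrm{IDSW}_t)}{\sum_t \mathrm{GT}_t}$; and $\mathrm{IDF1}=\frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP}+\mathrm{IDFP}+\mathrm{IDFN}}$.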

Benchmarks

OTB

KITTI

MOT16

Methods(models)

IOU tracker

The Intersection-Over-Union (IOU) tracker uses the IOU values among the detector's bounding boxes between two consecutive frames to perform the association between them, or assigns a new target ID if no match is found.
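A minimal IoU sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14
```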

Simple Online And Realtime Tracking (SORT)

Simple Online And Realtime Tracking (SORT) is a lean implementation of a tracking-by-detection framework. SORT uses the position and size of the bounding boxes for both motion estimation and data association through frames. SORT combines location and motion cues by adopting a Kalman filter to predict the location of the tracklets in the new frame, then computes the IoU between the detection boxes and the predicted boxes as the similarity.

DeepSORT

DeepSORT replaces the association metric with a more informed metric that combines motion and appearance information. In particular, a "deep appearance" distance metric is added. The core idea is to obtain a vector that can be used to represent a given image. DeepSORT adopts a stand-alone re-ID model to extract appearance features from the detection boxes. After the similarity computation, a matching strategy assigns identities to the objects. This can be done by the Hungarian algorithm or greedy assignment.

FairMOT

FairMOT is a new tracking approach built on top of the anchor-free object detection architecture CenterNet. It has a simple network structure that consists of two homogeneous branches for detecting objects and extracting re-ID features.

TransMOT

TransMOT is a new spatial-temporal graph Transformer that solves all these issues. It arranges the trajectories of all the tracked objects as a series of sparse weighted graphs that are constructed using the spatial relationships of the targets. TransMOT then uses these graphs to create a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial transformer decoder layer to model the spatial-temporal relationships of the objects.

ByteTrack

BYTE is an effective association method that utilizes all detection boxes, from high scores to low ones, in the matching process. BYTE is built on the premise that the similarity with tracklets provides a strong cue to distinguish the objects and the background in low-score detection boxes. BYTE first matches the high-score detection boxes to the tracklets based on motion similarity. It uses a Kalman filter to predict the location of the tracklets in the new frame, and the motion similarity is computed as the IoU of the predicted box and the detection box. Then, it performs a second matching between the unmatched tracklets and the low-score detection boxes.

The primary innovation of ByteTrack is keeping the non-background low-confidence detection boxes, which are typically discarded after the initial filtering of detections, and using these low-score boxes for a secondary association step. Typically, occluded detection boxes have confidence scores lower than the threshold but still contain some information about the objects, which makes their scores higher than those of purely background boxes. Hence, these low-confidence boxes are still meaningful to keep track of during the association stage.
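A schematic Python sketch of this two-stage idea (not the official ByteTrack implementation): it reuses the iou helper above, represents detections as dicts with 'box' and 'score', tracklets as dicts with a Kalman-predicted 'pred_box', and substitutes a simple greedy matcher for the actual assignment step.

```python
def greedy_iou_match(tracklets, detections, iou_thresh=0.3):
    """Hypothetical greedy matcher: pair each tracklet with its best remaining box."""
    matches, unmatched, used = [], [], set()
    for t in tracklets:
        best, best_iou = None, iou_thresh
        for j, d in enumerate(detections):
            if j not in used and iou(t["pred_box"], d["box"]) > best_iou:
                best, best_iou = j, iou(t["pred_box"], d["box"])
        if best is None:
            unmatched.append(t)
        else:
            used.add(best)
            matches.append((t, detections[best]))
    return matches, unmatched

def byte_associate(tracklets, detections, high_thresh=0.6):
    high = [d for d in detections if d["score"] >= high_thresh]
    low = [d for d in detections if d["score"] < high_thresh]
    matches, unmatched = greedy_iou_match(tracklets, high)            # first association
    matches_low, still_unmatched = greedy_iou_match(unmatched, low)   # second association
    return matches + matches_low, still_unmatched
```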

Comparison of DeepSort and ByteTrack

DeepSORT uses a pre-trained object detection model to detect objects in each frame and a separate re-identification network to match the detected objects based on their appearance features; it also uses Kalman filters to predict the locations of the objects in the next frame. ByteTrack, on the other hand, does not add an appearance model: it relies on the detection boxes themselves, associating them with Kalman-predicted tracklet locations by IoU and keeping low-score boxes for a second association step, which keeps it simple and fast while remaining effective on challenging datasets.

using ByteTrack

ByteTracker initiates a new tracklet only if a detection is not matched with any previous tracklet and the bounding box score is higher than a threshold.

references

https://www.datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box
https://pub.towardsai.net/multi-object-tracking-metrics-1e602f364c0c
https://learnopencv.com/object-tracking-and-reidentification-with-fairmot/
https://medium.com/augmented-startups/top-5-object-tracking-methods-92f1643f8435
https://medium.com/@pedroazevedo6/object-tracking-state-of-the-art-2022-fe9457b77382

Neural Networks

training neural networks

suggestions

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data.

Our next step is to set up a full training + evaluation skeleton and gain trust in its correctness via a series of experiments. At this stage it is best to pick some simple model that you couldn’t possibly have screwed up somehow.
Tips & tricks for this stage:
Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome.
Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage.
When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.
Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it.
Train an input-independent baseline, (e.g. easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?

The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss). The reason I like these two stages is that if we are not able to reach a low error rate with any model at all that may again indicate some issues, bugs, or misconfiguration.

In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.
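For instance, in PyTorch (the model here is just a stand-in for whatever network you are training):

```python
import torch

model = torch.nn.Linear(10, 2)   # stand-in for your actual network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```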

references

http://karpathy.github.io/2019/04/25/recipe/

transformer

Introduction

The Transformer is a deep learning architecture introduced in the paper “Attention is All You Need” by Vaswani et al., published in 2017.
The Transformer is based on the self-attention mechanism, which allows it to capture long-range dependencies in sequences more effectively than traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The key components of the Transformer are: the Self-Attention Mechanism, the Encoder-Decoder Architecture, Multi-Head Attention, Positional Encoding, and Feed-Forward Neural Networks.

Self-Attention Mechanism

The self-attention mechanism allows the model to weigh the importance of different words in a sentence while encoding the sequence. It computes the attention scores for each word in the input sequence based on its relationships with other words. By attending to relevant words, the model can focus on the most informative parts of the sequence.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients, and other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they are all positive and add up to 1.

The fifth step is to multiply each value vector by the softmax score.

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
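The six steps collapse into a few matrix operations when done for all positions at once. A numpy sketch of a single attention head (the toy matrices are random; in a real model WQ, WK, WV are learned during training):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # step 1: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # steps 2-3: scaled dot-product scores
    weights = softmax(scores, axis=-1)            # step 4: softmax over the positions
    return weights @ V                            # steps 5-6: weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                       # 3 tokens, embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (3, 8)
```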

Multi-Head Attention

To capture different types of dependencies and relationships, the Transformer uses multi-head attention. It performs self-attention multiple times with different learned projection matrices, allowing the model to attend to various aspects of the input.

With multi-headed attention we have not just one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace. If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices. We concatenate these matrices and then multiply them by an additional weight matrix WO to condense the eight down into a single matrix.

sequence-to-sequence model

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. The model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item. The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks. By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called "word embedding" algorithms. The context vector turned out to be a bottleneck for these types of models: it made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called "attention", which allows the model to focus on the relevant parts of the input sequence as needed.

attention

An attention model differs from a classic sequence-to-sequence model in two main ways. First, the encoder passes a lot more data to the decoder: instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder. Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following: look at the set of encoder hidden states it received, give each hidden state a score, and multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

Encoder-Decoder Architecture

The Transformer architecture consists of two main components: the encoder and the decoder. The encoder takes an input sequence and processes it, while the decoder generates an output sequence based on the encoded representation.

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on appropriate places in the input sequence. The following steps repeat the process until a special symbol is reached, indicating that the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.

The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer, which is followed by a Softmax layer. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

Layer Normalization

In traditional normalization techniques like Batch Normalization, the activations of a layer are normalized by computing the mean and variance over a batch of examples. This normalization helps stabilize and accelerate the training process, especially for deeper networks. However, it introduces a dependency on the batch size during training, which can be problematic in scenarios where batch sizes vary or during inference when processing individual samples.

Layer Normalization addresses this dependency by computing the mean and variance across all the units within a single layer for each training example. This means that normalization is done independently for each sample and does not rely on batch statistics.
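A numpy sketch of this per-sample normalization (gamma and beta are the learned scale and shift; here they are just scalars for simplicity):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each sample over its own features; no batch statistics involved."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 16)          # batch of 4 samples, 16 features each
print(layer_norm(x).mean(axis=-1))  # ~0 for every sample, independent of the batch
```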

Positional Encoding

Since Transformers do not inherently have positional information like RNNs, positional encodings are added to the input embeddings. These positional encodings provide the model with information about the order of the elements in the input sequence, or the distance between different words in the sequence.
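A numpy sketch of the sinusoidal encodings used in the original paper (one row per position, sine on even dimensions, cosine on odd ones):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                       # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                         # dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dims: cosine
    return pe

print(positional_encoding(50, 512).shape)    # (50, 512), added to the input embeddings
```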

Training

loss function

cross entropy

The cross-entropy loss calculates the negative log-likelihood of the true class’s predicted probability.

Kullback–Leibler divergence

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from another. KL divergence measures the average amount of information lost when using Q to approximate P. It is not symmetric.
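In symbols, for a true distribution $p$ and a model distribution $q$ over classes: cross entropy is $H(p,q)=-\sum_i p_i\log q_i$ (with a one-hot $p$ this reduces to $-\log q_{\text{true class}}$), and KL divergence is $D_{KL}(p\|q)=\sum_i p_i\log\frac{p_i}{q_i}=H(p,q)-H(p)$, which is why minimizing cross entropy with a fixed $p$ is equivalent to minimizing the KL divergence.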

decoding

greedy decoding

In greedy decoding, at each step of sequence generation, the model selects the most likely output token based on its predicted probability distribution. It chooses the token with the highest probability without considering the impact on future decisions. This means that the model makes locally optimal choices at each step without considering the global context of the entire sequence. For example, in machine translation, a model using greedy decoding will predict each target word one at a time, selecting the word with the highest probability given the source sentence and previously generated words. The process continues iteratively until an end-of-sentence token is generated.

Beam search

In beam search, instead of selecting only the most likely token at each step, the algorithm maintains a fixed-size list, known as the “beam,” containing the most promising candidate sequences. The beam size determines how many candidate sequences are considered at each decoding step.
At the beginning of the decoding process, the beam is initialized with a single token representing the start of the sequence. At each step, the model generates the probabilities for the next possible tokens and expands the beam with the top-k most likely candidate sequences based on their cumulative probabilities. The k represents the beam size, and higher values of k result in a more diverse exploration of possibilities.
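A toy sketch of the bookkeeping, assuming for simplicity that each step's log-probabilities do not depend on the prefix (a real model would recompute them for every candidate prefix):

```python
import numpy as np

def beam_search(step_log_probs, k=2):
    beams = [([], 0.0)]                              # (token sequence, cumulative log-prob)
    for log_probs in step_log_probs:
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(log_probs)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]                       # keep only the k best prefixes
    return beams

step_log_probs = np.log([[0.6, 0.3, 0.1],            # step 1 over a 3-token vocabulary
                         [0.2, 0.5, 0.3]])           # step 2
print(beam_search(step_log_probs, k=2))
```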

references

http://fancyerii.github.io/2019/03/09/transformer-illustrated/#%E6%A6%82%E8%BF%B0
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://jalammar.github.io/illustrated-transformer/
https://colah.github.io/posts/2015-09-Visual-Information/
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

segmentation

Image segmentation

Image segmentation is a sub-domain of computer vision and digital image processing which aims at grouping similar regions or segments of an image under their respective class labels.

Semantic segmentation

Semantic segmentation refers to the classification of pixels in an image into semantic classes.

Instance segmentation

Instance segmentation models classify pixels into categories on the basis of “instances” rather than classes.

Panoptic segmentation

Panoptic segmentation can be expressed as the combination of semantic segmentation and instance segmentation where each instance of an object in the image is segregated and the object’s identity is predicted.

Neural networks that perform segmentation typically use an encoder-decoder structure where the encoder is followed by a bottleneck and a decoder or upsampling layers directly from the bottleneck (like in the FCN).

Regular Expressions

Regular expressions

A formal language for specifying text strings

rules

Disjunctions:
Letters inside square brackets []: [A-Z]
Pipe |: a|b|c
Negation in a disjunction: [^Ss] (anything that is neither 'S' nor 's')

?: When placed after a character or a group, the question mark makes it optional, meaning that the character or group can occur zero or one time.
When placed after a quantifier, such as *, +, or ?, it modifies the quantifier to be non-greedy or lazy. A non-greedy quantifier matches as few characters as possible, while a greedy quantifier matches as many characters as possible.
*: 0 or more of the previous char
+: 1 or more of the previous char
.: any char

Anchors:
^: The beginning. $: The end.
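A few of these rules exercised with Python's re module (the sample strings are made up):

```python
import re

print(re.findall(r"[A-Z]", "Hello World"))     # bracket disjunction -> ['H', 'W']
print(re.findall(r"cat|dog", "cat dog cow"))   # pipe disjunction -> ['cat', 'dog']
print(re.findall(r"[^Ss]", "Sos"))             # negation -> ['o']
print(re.search(r"colou?r", "color").group())  # ? optional -> 'color'
print(re.findall(r"a+", "caaat"))              # + one or more -> ['aaa']
print(re.findall(r"be.n", "begin bean"))       # . any char -> ['bean']
print(re.findall(r"<.+>", "<a><b>"))           # greedy -> ['<a><b>']
print(re.findall(r"<.+?>", "<a><b>"))          # non-greedy -> ['<a>', '<b>']
print(re.findall(r"^The", "The end"))          # ^ anchors at the beginning -> ['The']
```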