Posted 2023-10-17Updated 2023-11-04database12 minutes read (About 1848 words)

wide column stores

Wide column stores were invented to provide more control over performance and in particular, in order to achieve high-throughput and low latency for objects ranging from a few bytes to about 10 MB, which are too big and numerous to be efficiently stored as so-called clobs (character large objects) or blobs (binary large objects) in a relational database system, but also too small and numerous to be efficiently accessed in a distributed file system.

Why wide column stores

A wide column store will be more tightly integrated with the parallel data processing systems.
wide column stores have a richer logical model than the simple key-value model behind object storage. wide column stores also handle very small values (bytes and kBs) well thanks to batch processing.

different from RDBMS

It does not have any data model for values, which are just arrays of bytes; since it efficiently handles values up to 10 MB, the values can be nested data in various formats, which breaks the first normal form; tables do not have a schema; there is no language like SQL, instead the API is on a lower level and more akin to that of a key-value store; tables can be very sparse, allowing for billions of rows and millions of columns at the same time; this is another reason why data stored in HBase is denormalized.

BigTable and HBase

Rationale

HBase is an open-source equivalent to the BigTable as part of the Hadoop ecosystem.
The data model of HBase is based on the realization that joins are expensive, and that they should be avoided or minimized on a cluster architecture.

The second design principle underlying HBase is that it is efficient to store together what is accessed together. In the big picture, this is a flavor of batch processing, one of the overarching principles in Big Data. Batch processing reduces the impact of latency.

Tables and row IDs

From an abstract perspective, HBase can be seen as an enhanced keyvalue store, in the sense that:a key is compound and involves a row, a column and a version;values can be larger (clobs, blobs), up to around 10 MB; keys are sortable.

A row ID is logically an array of bytes, although there is a library to easily create row ID bytes from specific primitive values. In HBase, the key identifying every cell consists of: the row ID, the column family, the column qualifier, the version.

Column families

The other attributes, called columns, are split into so-called column families. This is a concept that does not exist in relational databases and that allows scaling the number of columns.

Column qualifiers

Columns in HBase have a name (in addition to the column family) called column qualifier, however unlike traditional RDBMS, they do not have a particular type. Column qualifiers are arrays of bytes (rather than strings), and as for row IDs, there is a library to easily create column qualifiers from primitive values. Unlike the values which can be large arrays of bytes (blobs), it is important to keep column families and column qualifiers short, because as we will see, they are repeated a gigantic number of times on the physical layer.

Versioning

HBase generally supports versioning, in the sense that it keeps track of the past versions of the data. As we will see, this is implemented by associating any value with a timestamp, also called version, at which it was created (or deleted).Users can also override timestamps with a value of their choice to have more control about versions.

Logical queries

HBase supports four kinds of low-level queries: get, put, scan and delete. Unlike a traditional key-value store, HBase also supports querying ranges of row IDs and ranges of timestamps.

HBase offers a locking mechanism at the row level, meaning that different rows can be modified concurrently, but the cells in the same row cannot: only one user at a time can modify any given row.

Physical architecture

Partitioning

A table in HBase is physically partitioned in two ways: on the rows and on the columns. The rows are split in consecutive regions. Each region is identified by a lower and an upper row key, the lower row key being included and the upper row key excluded. A partition is called a store and corresponds to the intersection of a region and of a column family.
Alt text

Network topology

HBase has exactly the same centralized architecture as HDFS. The HMaster and the RegionServers should be understood as processes running on the nodes, rather than the nodes themselves. The HMaster assigns responsibility of each region to one of the RegionServers. There is no need to attribute the responsibility of a region to more than one RegionServer at a time because, as we will see soon, fault tolerance is already handled on the storage level by HDFS. If a region grows too big, for example because of many writes in the same row ID interval, then the region will be automatically split by the responsible RegionServer. If a RegionServer has too many regions compared to other RegionServers, then the HMaster can reassign regions to other RegionServers.

Physical storage

The store is, physically, nothing less than a set of cells. Each cell is thus handled physically as a key-value pair where the key is a (row ID, column family, column qualifier, version) tuple. All the cells within a store are eventually persisted on HDFS, in files that we will call HFiles.

An HFile is, in fact, nothing else than a (boring) flat list of KeyValues, one per cell. What is important is that, in an HFile, all these KeyValues are sorted by key in increasing order, meaning, first sorted by row ID, then by column family (trivially unique for a given store), then by column qualifier, then by version (in decreasing order, recent to old). This means that all versions of a given cell that are in the same HFile are located together, and one of the cells (within this HFile) is the latest. If we zoom in at the bit level, a KeyValue consists in four parts: The length of the keys in bits (this length is encoded on a constant, known number of bits) • The length of the value in bits (this length is encoded on a constant, known number of bits) • The actual key (of variable length) • The actual value (of variable length). Why do we not just store the key and the value? This is because their length can vary. If we do not know their length, then it is impossible to know when they stop just looking at the bits.
Alt text

KeyValues, within an HFile, are organized in blocks. But to not confuse them with HDFS blocks, we will call them HBlocks. HBlocks have a size of 64 kB, but this size is variable: if the last KeyValue goes beyond this boundary, then the HBlock is simply longer and stops whenever the last KeyValue stops. The HFile then additionally contains an index of all blocks with their key boundaries. This separate index is loaded in memory prior to reading anything from the HFile.

Log-structured merge trees

When accessing data, HBase needs to generally look everywhere for cells to potentially return: in every HFile, and in memory. As long as there is room in memory, freshly created cells are added in memory. At some point, the memory becomes full (or some other limits are reached). When this happens, all the cells need to be flushed to a brand new HFile. Upon flushing, all cells are written sequentially to a new HFile in ascending key order, HBlock by HBlock, concurrently building the index structure. When cells are added to memory, they are added inside a data structure that maintains them in sorted order (such as tree maps) and then flushing is a linear traversal of the tree.

What happens if the machine crashes and we lose everything in memory? We have a so-called write-ahead-log for this. Before any fresh cells are written to memory, they are written in sequential order (append) to an HDFS file called the HLog. There is one HLog per RegionServer. A full write-ahead-log also triggers a flush of all cells in memory to a new HFile. If there is a problem and the memory is lost, the HLog can be retrieved from HDFS and “played back” in order to repopulate the memory and recreate the sorting tree structure.

After many flushes, the number of HFiles to read from grows and becomes impracticable. For this reason, there is an additional process called compaction that takes several HFiles and outputs a single,merged HFile. Since the cells within each HFile are already sorted, this can be done in linear time, as this is essentially the merge part of the merge-sort algorithm. Compaction is not done arbitrarily but follows a regular, logarithmic pattern. When the memory is flushed again, an standard-size HFile is written and the two standard-size HFiles are immediately compacted to a double-size HFile.

Additional design aspects

Bootstrapping lookups

In order to know which RegionServer a client should communicate with to receive cells corresponding to a specific region, there is a main, big lookup table that lists all regions of all tables together with the coordinates of the RegionServer in charge of this region as well as additional metadata.

Caching

In order to improve latency, cells that are normally persisted to HFiles (and thus no longer in memory) can be cached in a separate memory region, with the idea of keeping in the cache those cells that are frequently accessed.

Bloom filters

HBase has a mechanism to avoid looking for cells in every HFile. This mechanism is called a Bloom filter. It is basically a black box that can tell with absolute certainty that a certain key does not belong to an HFile, while it only predicts with good probability (albeit not certain) that it does belong to it.

Data locality and short-circuiting

When a RegionServer flushes cells to a new HFile, a replica of each (HDFS) block of the HFile is written, by the DataNode process living on the same machine as the RegionServer process, to the local disk. This makes accessing the cells in future reads by the RegionServer extremely efficient, because the RegionServer can read the data locally without communicating with the NameNode: this is known as short-circuiting in HDFS.

using Habse

After installing Hbase, we can use Hbase shell. There are some commands in hbase shell: get, scan.
We can use filters with scan to query and filter data.

exercises

bloom filter

perfect hash function should have uniform probability.
all hash functions set a bit to 1 = collide at the same place: probability of a FP case
deleting only happens when compacting.

references

https://ghislainfourny.github.io/big-data-textbook/
https://datakaresolutions.com/hbase-quick-guide-to-key-commands/
https://www.datapotion.io/blog/hbase-shell-column-filters/

Posted 2023-10-13Updated 2023-10-15theory8 minutes read (About 1166 words)

Probabilistic Artificial Intelligence - Bayesian Linear Regression

Bayesian Linear Regression

Linear Regression

Given a set of (x, y) pairs, linear regression aims to find a linear model that fits the data optimally.
Given the linear model $y=w^Tx$, we want to find optimal weights $w$.
There are many ways of estimating w from data, the most common being the least squares estimator:

$\hat{w}_\mathrm{ls}\doteq\arg\min_{\boldsymbol{w}\in\mathbb{R}^d}\sum_{i=1}^n(y_i-\boldsymbol{w}^\top\boldsymbol{x}_i)^2=\arg\min_{\boldsymbol{w}\in\mathbb{R}^d}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w}\|_2^2,$

A slightly different estimator is used for ridge regression:

$\hat{w}_{\mathrm{ridge}}\doteq\arg\min_{w\in\mathbb{R}^d}\|y-Xw\|_2^2+\lambda\|w\|_2^2$

As the formula shows, the squared $l_2$ regularization term $\lambda|w|_2^2$ penalizes large $w$ and thus reduces the complexity of the resulting model,so Ridge regression is more robust than standard linear regression in the presence of multicollinearity. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly volatile to small input changes. The regularization of ridge regression reduces this volatility by introducing a bias on the weights towards 0.

uncertainty

In practice, our data D is merely a sample of the process we are modeling. In these cases, we are looking for models that generalize to unseen data.
Therefore, it is useful to express the uncertainty about our model due to the lack of data. This uncertainty is commonly referred to as the epistemic uncertainty.
Usually, there is another source of uncertainty called aleatoric uncertainty, which originates directly from the process that we are modeling. This uncertainty is the noise in the labels that cannot be explained by the inputs.

Weight-space View

The most immediate and natural Bayesian interpretation of linear regression is to simply impose a prior on the weights $w$.

Assume that prior $w\sim\mathcal{N}(0,\sigma_p^2I)$ and likelihood $y_i\mid x_i,w\sim\mathcal{N}(w^\top x_i,\sigma_n^2).$ are both Gaussian, we will get the posterior distribution over the weights as:

$\begin{aligned} &\log p(w\mid x_{1:n},y_{1:n}) \\ &=\log p(w)+\log p(y_{1:n}\mid x_{1:n},w)+\mathrm{const} \\ &=\log p(w)+\sum_{i=1}^n\log p(y_i\mid x_i,w)+\mathrm{const}\\ &=-\frac1{2\sigma_p^2}\|w\|_2^2-\frac1{2\sigma_n^2}\sum_{i=1}^n(y_i-w^\top x_i)^2+\mathrm{const} \\ &=-\frac1{2\sigma_p^2}\|w\|_2^2-\frac1{2\sigma_n^2}\|y-Xw\|_2^2+\mathrm{const} \\ &=-\frac1{2\sigma_p^2}w^\top w-\frac1{2\sigma_n^2}\Big(w^\top X^\top Xw-2y^\top Xw+y^\top y\Big)+\mathrm{const.}\\ &=-\frac12(w-\mu)^\top\Sigma^{-1}(w-\mu)+\mathrm{const} \end{aligned}$

As the above is a quadratic form in $w$, it follows that the posterior distribution is a Gaussian.
This also shows that Gaussians with known variance and linear likelihood are self-conjugate. A distribution is said to be self-conjugate (or a conjugate prior is self-conjugate) if, when used as a prior distribution, it results in a posterior distribution that belongs to the same family of distributions. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood. For general distributions the posterior will not be closed-form. This is a very special property of Gaussians.
We can compute the MAP estimate for the weights,

$\begin{gathered} \hat{w}_{\mathrm{MAP}} =\underset{w}{\operatorname*{\arg\max}}\log p(y_{1:n}\mid x_{1:n},w)+\log p(w) \\ =\arg\min_w\|y-Xw\|_2^2+\frac{\sigma_n^2}{\sigma_p^2}\|w\|_2^2. \end{gathered}$

we can find that this is simply the MLE loss with an additional $l_2$ regularization term and this coincides with the optimization objective of ridge regression with weight decay $\lambda=\frac{\sigma_n^2}{\sigma_p^2}$. Also, recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean.(The mode of a probability distribution is the value where the distribution reaches its maximum point. In the context of the posterior distribution, the mode corresponds to the most probable value of the parameter given the observed data.).

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which uses $l_1$ regularization.It turns out that lasso can also be viewed as Bayesian learning, using a Laplace prior.

Note that using point estimates like the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode.This is especially harmful when we are unsure about the best model.

Recursive Bayesian Updates

As data arrives online (i.e., in “real-time”), we can obtain the new posterior and use it to replace our prior.

Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a non-linear transformation $\phi:\mathbb{R}^d\to\mathbb{R}^e$.
In Polynomial regression, to learn polynomials of degree m in d input dimensions, we need to apply the nonlinear transformation

$\begin{aligned} \boldsymbol{\phi}(x)& =[1,x_1,\ldots,x_d,x_1^2,\ldots,x_d^2,x_1\cdot x_2,\ldots,x_{d-1}\cdot x_d, \\ &x_{d-m+1}\cdot\cdot\cdot\cdot\cdot x_d]. \end{aligned}$

the dimension of the feature space grows exponentially in the degree of polynomials and input dimensions. Even for relatively small m and d, this becomes completely unmanageable.

Function-space View

Previously, we have been interpreting it as a distribution over the weights $w$ of a linear function $\hat{f}=\boldsymbol{\Phi}w.$, we can equivalently consider a distribution directly over the estimated function values. Instead of considering a prior over the weights${w\sim\mathcal{N}}(0,\sigma_p^2I)$, we now impose a prior directly on the values of our model at the observations.Using that Gaussians are closed under linear maps, we obtain the equivalent prior:

$f\mid X\sim\mathcal{N}(\boldsymbol{\Phi}\mathbb{E}[\boldsymbol{w}],\boldsymbol{\Phi}\mathrm{Var}[\boldsymbol{w}]\boldsymbol{\Phi}^\top)=\mathcal{N}(\boldsymbol{0},\underbrace{\sigma_p^2\boldsymbol{\Phi}\boldsymbol{\Phi}^\top}_{\boldsymbol{\kappa}})$

K is the kernel matrix.The kernel function is:
$k(x,x^{\prime})\doteq\sigma_p^2\cdot\phi(x)^\top\phi(x^{\prime})$
The kernel matrix is a covariance matrix and the kernel function measures the covariance of the function values $k(x,x^{\prime})=\mathrm{Cov}\big[f(x),f(x^{\prime})\big].$
Moreover, note that we have reformulated the learning algorithm such that the feature space is now implicit in the choice of kernel, and the kernel is defined by inner products of (nonlinearly transformed) inputs. In other words, the choice of kernel implicitly determines the class of functions that f is sampled from, and encodes our prior beliefs. This is known as the kernel trick.

reference

[1] A. Krause, “Probabilistic Artificial Intelligence”.

Posted 2023-10-12Updated 2023-10-13others3 minutes read (About 462 words)

hexo upgrade

hexo

what is hexo

“Hexo is a fast, simple and powerful blog framework. You write posts in Markdown (or other markup languages) and Hexo generates static files with a beautiful theme in seconds.”

how to use hexo

Hexo is dependent on Git and Node.js. After installed the two, if you are lucky, you can start to use Hexo as the official documentation(https://hexo.io/docs/) does. Here I will focus on the problems I have met. I accidently uninstalled Node.js and installed newest Node.js on my laptop, however, this action caused a series of problems. I spent 3 hours to fix the problem. In this opportunity, I will document the encountered issues and their solutions. At the same time, I will clarify the dependencies among these software tools.

problems

Here we assume that we have installed Git and Node.js successfully. We can check this by running

1	git --version

node -v

The first issue I encountered was when I ran

1	hexo deploy

I got “TypeError [ERR_INVALID_ARG_TYPE]: The “mode” argument must be integer”.

As https://github.com/hexojs/hexo/issues/4281 suggests, I upgraded Hexo, but it didn’t work, and there was another problem, I ran

1 2	npm install -g hexo-cli npm install -g hexo

Hexo is still older version. After I modified package.json under the Hexo path and using command like this, it upgraded.

1	npm i hexo@5.2.0

Unfortunately, even I upgraded Hexo to the newest version, I still got the error, so I decided to degraded Node.js. This is where I knew about nvm, a command-line tool that allows you to easily manage multiple versions of Node.js on a single machine. With nvm, you can install, switch between, and manage different Node.js versions for your development projects. Remember to uninstall the Node.js first, I didn’t, so when I use nvm to install an older version, I still got the installed version.
Finally got the older version, I got “YAMLException: Specified list of YAML types (or a single Type object) contains a non-Type object.”, then as https://github.com/hexojs/hexo/issues/4917 suggests, I degraded Node.js and installed yaml-js.

Running hexo again, I got “TypeError: Cannot read property ‘length’ of undefined” and “Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath ‘./lib/js-yaml/type’ is not defined by “exports””, then I upgraded the theme I use and ran npm install to fix the problem, but there are still some problems with the theme. I reinstalled the theme as https://github.com/ppoffice/hexo-theme-icarus. As the theme upgrading, there are more dependencies needed, just install them as the message shows.

reference

https://hexo.io/docs/
https://github.com/ppoffice/hexo-theme-icarus

Posted 2023-10-12Updated 2024-02-21theory14 minutes read (About 2055 words)

Probabilistic Artificial Intelligence - Gaussian Process

Gaussian Process

Gaussian distribution

Univariate Gaussian Distribution is simple, here we focus on multivariate graussian distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$. The covariance matrix is always symmetric and positive semi-definite. why is Gaussian distribution so important? because under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes since it allows Bayesian inference.

Grassian process

what is GP

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.

Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.

Mean and covariance functions

The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.

The covariance function $k(x,x’)$ computes the covariance $cov[f(x),f(x′)]$ between the corresponding function values by evaluating the covariance function
k at the corresponding inputs x,x′(kernel trick ).Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability. Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.

How GP works

For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. Respective to the test data X, we will denote the training data as Y. As we have mentioned before, the key idea of Gaussian processes is to model the underlying distribution of
X together with Y as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability
$P(X∣Y)$.

In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $∣X∣=N$ test points, the corresponding multivariate Gaussian distribution is also
N -dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the i-th component of the resulting vector as the function value corresponding to the i-th test point.

Marginal likelihood and GP training

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

The likelihood function represents the probability of observing the given data X given a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component in frequentist statistics. It is used to estimate the maximum likelihood estimates (MLE) of the parameters.

The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.

To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters,i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.

$\begin{aligned} p(\mathbf{y}|\mathbf{X},\theta)& =\int p(\mathbf{y}|f,\mathbf{X})p(f|\mathbf{X})df \\ &=\int\mathcal{N}(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})\mathcal{N}(f(\mathbf{X})|\mathbf{0},\mathbf{K})df(\mathbf{X}) \\ &=\mathcal{N}(\mathbf{y}|\mathbf{0},\mathbf{K}+\sigma_n^2\mathbf{I}), \end{aligned}$

Maximizing the marginal likelihood behaves much better than finding maximum likelihood $\operatorname*{argmax}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or maximum a-posteriori point estimates $\mathop{\mathrm{argmax}}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})p(f(\mathbf{X})|\theta)$.
These two approaches would lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations y and letting the the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out.By averaging (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity.Choose a model that is too inflexible, and the marginal likelihood
$p(y∣X,θ)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y∣X,θ)$ will also be low.

what is kernel

If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by k(x, x′); this is sometimes called the kernel trick. This technique is kernel trick particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.

The kernel k, which is often also called covariance function, pairwise on all the points. The kernel receives two points
$t,t’ \in \mathbb{R}^n$ as an input and returns a similarity measure between those points in the form of a scalar.

$k: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R},\quad \Sigma = \text{Cov}(X,X’) = k(t,t’)$

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.

The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.

A kernel function coulde be stationary or isotropic. A kernel function is stationary if $k(x,x’)=k(x-x’)$. A kernel function is isotropic is $k(x,x’)=k(||x-x’||_2)$. Stationarity implies that the covariance function only depends on distances
$∥x−x’∥$ of the corresponding inputs, and not on the location of the individual data points. This means that if the inputs are close to each other, the corresponding function values are strongly correlated.

Interpretation of the hyperparameters

Stationary covariance functions typically contain the term $\frac\tau l=\frac{|\mathbf{x}-\mathbf{x}^{\prime}|}l$. where
$l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. This allows functions to vary strongly and display more flexibility in the range of the data.

The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.

training tips

The marginal likelihood is non-convex with potentially multiple local optima. Therefore, we may end up in (bad) local optima when we choose a gradient-based optimization method. In order to initialize these parameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume, we have training inputs
X and training targets y. We will see that the signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable intialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.

Local optima are the largest problem that prevent good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:

Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.

Which optimum we end up in, depends on the initialization of our lengthscale as we are likely to end up in a local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the situations are a somewhat plausible explanations of the data. In practice, it is harder to get out of a long lengthscale situation since the optimizer often struggles to get beyond the (typically) huge plateau that is typical for very long lengthscales.

How to choose a kernel

The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model.
In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference - and just as you don’t know what the true parameters are, you also don’t know what the true kernel is. Probably, you should try out a few different kernels at least, and compare their marginal likelihood on your training data.

others

The GP does not require any input normalization, but it can make sense to do so for numerical reasons.

reference

https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/ \
https://infallible-thompson-49de36.netlify.app/ \
A. Krause, “Probabilistic Artificial Intelligence”.

Posted 2023-10-11Updated 2023-10-17database8 minutes read (About 1203 words)

bigdata - Syntax

syntax

CSV

CSV is a textual format, in the sense that it can be opened in a text editor. CSV means comma-separated values. The main challenge with CSV files is that, in spite of a standard (RFC 4180), in practice there are many different dialects and variations, which limits interoperability. For example, another character can be used instead of the comma (tabulation, semi-colons, etc). Also, when a comma (or the special character used in its stead) needs to actually appear in a value, it needs to be escaped.

Data denormalization

Data denormalization makes a lot of sense in read-intensive scenarios in which not having to join brings a significant performance improvement.

The difference with CSV is that, in JSON, the attributes appear in every tuple, while in CSV they do not appear except in the header line. JSON is appropriate for data denormalization because including the attributes in every tuple allows us to drop the identical support requirement.

The generic name for denormalized data (in the same of heterogeneous and nested) is “semi-structured data”. Textual formats such as XML and JSON have the advantage that they can both be processed by computers, and can also be read, written and edited by humans. Another very important and characterizing aspect of XML and JSON is that they are standards: XML is a W3C standard. W3C, also known as the World Wide Web consortium, is the same body that also standardizes HTML, HTTP, etc. JSON is now an ECMA standard, which is the same body that also standardizes JavaScript.

Whichever syntax is used, they have in common the concept of well-formedness. A string is said to be well-formed if it belongs to the language. Concretely, when a document is well-formed XML, it means that it can be successfully opened by an editor as XML with no errors.

JSON

JSON stands for JavaScript Object Notation because the way it looks like originates from JavaScript syntax, however it is now living its own life completely independently of JavaScript.

JSON is made of exactly six building blocks: strings, numbers, Booleans, null, objects, and arrays. Strings are simply text. In JSON, strings always appear in double quotes. Obviously, strings could contain quotes and in order not to confuse them with the surrounding quotes, they need to be differentiated. This is called escaping and, in JSON, escaping is done with backslash characters ().

JSON generally supports numbers, without explicitly naming any types nor making any distinction between numbers apart from how they appear in syntax. The way a number appears in syntax is called a lexical representation, or a literal. JSON places a few restrictions: a leading + is not allowed. Also, a leading 0 is not allowed except if the integer part is exactly 0.

There are two Booleans, true and false. Arrays are simply lists of values. The concept of list is abstract and mathematical. The concept of array is the syntactic counterpart of a list. Objects are simply maps from strings to values. The concept of object is the syntactic counterpart of a map,i.e., an object is a physical representation of an abstract map that explicitly lists all string-value pairs. The keys of an object must be strings. The JSON standard recommends for keys to be unique within an object.

XML

XML stands for eXtensible Markup Language. It resembles HTML, except that it allows for any tags and that it is stricter in what it allows.

XML’s most important building blocks are elements, attributes, text and comments. XML is a markup language, which means that content is “tagged”. Tagging is done with XML elements. An XML element consists of an opening tag, and a closing tag. What is “tagged” is everything inbetween the opening tag and the closing tag. ags consist of a name surrounded with angle brackets < … >, and the closing tag has an additional slash in front of the name. We use a convenient shortcut to denote the empty element with a single tag and a slash at the end. For example, \ is equal to
\\.
Unlike JSON keys, element names can repeat at will.

Attributes appear in any opening elements tag and are basically keyvalue pairs. Values can be either double-quoted or single-quoted. The key is never quoted, and it is not allowed to have unquoted values. Within the same opening tag, there cannot be duplicate keys. Attributes can also appear in an empty element tag. Attributes can never appear in a closing tag. It is not allowed to create attributes that start with XML or xml, or any case combination.

Text, in XML syntax, is simply freely appearing in elements and without any quotes (attribute values are not text!). Within an element, text can freely alternate with other elements. This is called mixed content and is unique to XML.

Comments in XML look like so: \. XML documents can be identified as such with an optional text declaration containing a version number and an encoding, like \<?xml version=”1.0” encoding=”UTF-8”?>. The version is either 1.0 or 1.1. Another tag that might appear is the doctype declaration, like \<!DOCTYPE person>.

Remember that in JSON, it is possible to escape sequences with a backslash character. In XML, this is done with an ampersand (&) character. There are exactly five possible escape sequences pre-defined in XML:
Alt text . Escape sequences can be used anywhere in text, and in attribute values. At other places (element names, attribute names, inside comments), they will not be recognized.
There are a few places where they are mandatory:& and < MUST be escaped. ” and ‘ should also be escaped in quoted qttribute values.

Namespaces are an extension of XML that allows users to group their elements and attributes in packages, similar to Python modules, Java packages or C++ namespaces. A namespace is identified with a URI. A point of confusion is that XML namespaces often start with http://, but are not meant to be entered as an address into a browser. A namespace declaration is like: \. If you remember, we saw that attributes starting with xml are forbidden, and this is because this is reserved for namespace declarations. What about documents that use multiple namespaces? This is done by associating namespaces with prefixes, which act as shorthands for a namespace. Then, we can use the prefix shorthand in every element that we want to have in this namespace.
Alt text
So, given any element, it is possible to find its local name, its (possibly absent) prefix, and its (possibly absent) namespace. The triplet (namespace, prefix, localname) is called a QName

Attributes can also live in namespaces, that is, attribute names are generally QNames. However, there are two very important aspects to consider. First, unprefixed attributes are not sensitive to default namespaces: unlike elements, the namespace of an unprefixed attribute is always absent even if there is a default namespace. Second, it is possible for two attributes to collide if they have the same local name, and different prefixes but associated with the same namespace (but again, we told you: do not do that!).

references

https://ghislainfourny.github.io/big-data-textbook/

Posted 2023-09-19Updated 2024-10-25database13 minutes read (About 1882 words)

bigdata - introduction

big data

Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine while accommodat-ing for the issue of growing discrepancy between capacity, throughput and latency.
Alt text

data independence

Data independence means that the logical view on the data is cleanly separated, decoupled, from its physical storage.

architecture

Stack: storage, compute, model ,language

data model

what data looks like and what you can do with it.

data shapes

tables, trees, cubes

table

Row(Tuple), Column(Attribute), Primary Key,Value.

relational tables

schema: A set of attributes.
extension: A set/bag/list of tuples.
Three constraints: Relational integrity, domain integrity and atomic integrity.
Superkey, Candidate key(minimal superkey).

Database Normalization

In database management systems (DBMS), normal forms are a series of guidelines that help to ensure that the design of a database is efficient, organized, and free from data anomalies.
First Normal Form (1NF):In 1NF, each table cell should contain only a single value, and each column should have a unique name.
Second Normal Form (2NF): 2NF eliminates redundant data by requiring that each non-key attribute be dependent on the primary key.
Third Normal Form (3NF): 3NF builds on 2NF by requiring that all non-key attributes are independent of each other.

Denormalization

Denormalization is a database optimization technique in which we add redundant data to one or more tables. This can help us avoid costly joins in a relational database. Note that denormalization does not mean ‘reversing normalization’ or ‘not to normalize’. It is an optimization technique that is applied after normalization.

SQL

SQL was originally named SEQUEL, for Structured English QUEry Language. SQL is a declarative language, which means that the user specifies what they want, and not how to compute it: it is up to the underlying system to figure out how to best execute the query.

View and Table

The view is a result of an SQL query and it is a virtual table, whereas a Table is formed up of rows and columns that store the information of any object and be used to retrieve that data whenever required. A view contains no data of its own but it is like a ‘window’ through which data from tables can be viewed or changed. The view is stored as a SELECT statement in the data dictionary. Creating a view fulfills the requirement without storing a separate copy of the data because a view does not store any data of its own and always takes the data from a base table. as the data is taken from the base table, accurate and up-to-date information is required.

SQL:1999 added the with clause to define “statement scoped views”. They are not stored in the database schema: instead, they are only valid in the query they belong to. This makes it possible to improve the structure of a statement without polluting the global namespace.With is not a stand alone command like create view is: it must be followed by select.

Natural Join and Inner Join

Natural Join joins two tables based on the same attribute name and datatypes. The resulting table will contain all the attributes of both the table but keep only one copy of each common column while Inner Join joins two tables on the basis of the column which is explicitly specified in the ON clause. The resulting table will contain all the attributes from both tables including the common column also.

data storage

Stack: Storage, Encoding, Syntax, Data models, Validation, Processing, Indexing, Data stores,Querying, User interfaces.

database and data lake

two main paradigms for storing and retrieving data:database and data lake. Data can be imported into the database (this is called ETL, for Extract-Transform-Load. ETL is often used as a verb).The data is internally stored as a proprietary format that is optimized to make queries faster. This includes in particular building indices on the data.
On the other hand, data can also just be stored on some file system.This paradigm is called the data lake paradigm and gained a lot of popularity in the past two decades. It is slower, however users can start querying their data without the effort of ETLing.

scaling up and scaling out

First, one can buy a bigger machine: more memory, more or faster CPU cores, a larger disk, etc. This is called scaling up. Second, one can buy more, similar machines and share the work across them. This is called scaling out.

Object stores

Amazon’s object storage system is called Simple Storage Service, abbreviated S3. From a logical perspective, S3 is extremely simple: objects are organized in buckets. Buckets are identified with a bucket ID, and each object within a bucket is identified with an Object ID.

CAP theorem

Consistency: at any point in time, the same request to any server returns the same result, in order words, all nodes see the same data;
Availability: the system is available for requests at all times with very high availability.
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned.
The CAP theorem is basically an impossibility triangle: a system cannot guarantee at the same time: usually are CP,AP or AC.

REST APIs

REST (Representational State Transfer) is an architectural style for designing networked applications.RESTful services often use HTTP as the communication protocol. A client and server communicated with the HTTP protocol interact in terms of methods applied to resources.A resource is referred to with what is called a URI. URI stands for Uniform Resource Identifier. A client can act on resources by invoking methods, with an optional body. The most important methods are: GET, PUT,DELETE,POST.

REST is not a standard or protocol, this is an approach to or architectural style for writing API.REST is an architectural style, and RESTful is the interpretation of it. That is, if your back-end server has REST API and you make client-side requests (from a website/application) to this API, then your client is RESTful. All requests you make have their HTTP status codes. There are a lot of them and they are divided into 5 classes. The first number indicates which of them a code belongs to:
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error

Amazon S3 and Azure Blob Storage

Alt text

Key-value store

A key-value store differs from a typical relational database in three aspects: • Its API is considerably simpler than that of a relational database (which comes with query languages) • It does not ensure atomic consistency; instead, it guarantees eventual consistency, which we covered earlier in this Chapter. • A key-value store scales out well, in that it is very fast also at large scales.

Amazon Dynamo

It is itself based (with some modifications) on the Chord protocol, which is a Distributed Hash Table.On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles. The first design principle is incremental stability. This means that new nodes can join the system at any time, and nodes can leave the system at any time, sometimes gracefully, sometimes in a sudden crash.The second principle is symmetry: no node is particular in any way The third principle is decentralization: there is no “central node” that orchestrates the others.The fourth principle is heterogeneity: the nodes may have different CPU power, amounts of memory, etc.

A central aspect of the design of a distributed hash table, and part in particular of the Chord protocol, is that every logical key is hashed to bits that we will call IDs. In the case of Dynamo, the hash is made of 128 bits (7 bytes).In the chord protocol, a technology called a finger table is used. Each node knows the next node clockwise, and the second node, and the 4th node, and the 8th node. Dynamo changes this design to so-called “preference lists”: each node knows, for every key (or key range), which node(s) are responsible (and hold a copy) of it. This is done by associating every key (key range) with a list of nodes, by decreasing priority (going down the ring clockwise).

Distributed hash tables, including Dynamo, are typically AP. A fundamental conceptual tool in AP systems is the use of vector clocks.Vector clocks are a way to annotate the versions when they follow a DAG structure.A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally.

vector clock

Lamport’s logical clock

Lamport’s Logical Clock was created by Leslie Lamport. It is a procedure to determine the order of events occurring. It provides a basis for the more advanced Vector Clock Algorithm. Due to the absence of a Global Clock in a Distributed Operating System Lamport Logical Clock is needed.

Implementation Rules：
[IR1]: If a -> b [‘a’ happened before ‘b’ within the same process] then, Ci(b) =Ci(a) + d
[IR2]: Cj = max(Cj, tm + d) [If there’s more number of processes, then tm = value of Ci(a), Cj = max value between Cj and tm + d]

Vector Clocks in Distributed Systems

Vector Clock is an algorithm that generates partial ordering of events and detects causality violations in a distributed system.
How does the vector clock algorithm work :

Initially, all the clocks are set to zero.
Every time, an Internal event occurs in a process, the value of the processes’s logical clock in the vector is incremented by 1.
Every time, a process receives a message, the value of the processes’s logical clock in the vector is incremented by 1, and moreover, each element is updated by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).

To sum up, Vector clocks algorithms are used in distributed systems to provide a causally consistent ordering of events but the entire Vector is sent to each process for every message sent, in order to keep the vector clocks in sync.

partial order relation

Note that vector clocks can be compared to each other with a partial order relation ≤. A partial order relation is any relation that is reflexive, antisymmetric, and transitive. A total order relation is a partial order in which every element of the set is comparable with every other element of the set. All total order relations are partial order relations, but the converse is not always true.

references

https://www.geeksforgeeks.org/normal-forms-in-dbms/
https://www.geeksforgeeks.org/denormalization-in-databases/
https://www.geeksforgeeks.org/difference-between-view-and-table/
https://modern-sql.com/feature/with
https://www.geeksforgeeks.org/sql-natural-join/
https://mlsdev.com/blog/81-a-beginner-s-tutorial-for-understanding-restful-api
https://ghislainfourny.github.io/big-data-textbook/
https://www.geeksforgeeks.org/lamports-logical-clock/

Posted 2023-09-19Updated 2024-10-25database6 minutes read (About 927 words)

bigdata - Distributed file systems

Distributed file systems

requirements of a distributed file system

Going back to our capacity-throughput-latency view of storage, a distributed file system is designed so that, in cruise mode, its bottleneck will be the data flow (throughput), not the latency. We saw that capacity increased much faster than throughput, and that this can be solved with parallelism. We saw that throughput increased much faster than latency decreased, and that this can be solved with batch processing. Distributed file systems support both parallelism and batch processing natively, forming the core part of the ideal storage system accessed by MapReduce or Apache Spark.The origins of such a system come back to the design of GoogleFS, the Google File System. Later on, an open source version of it was released as part of the Hadoop project, initiated by Doug Cutting at Yahoo, and called HDFS, for Hadoop Distributed File System.

HDFS

HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system. Unlike in S3, HDFS files are furthermore not stored as monolithic blackboxes, but HDFS exposes them as lists of blocks. As for the block size: HDFS blocks are typically 64 MB or 128 MB large, and are thus considerably larger than blocks on a local hard drive (around 4 kB).HDFS is designed to a run on a cluster of machines.

architecture

HDFS is implemented on a fully centralized architecture, in which one node is special and all others are interchangeable and connected to it.
In the case of HDFS, the central node is called the NameNode and the other nodes are called the DataNodes. Every file is divided into chunks called blocks. All blocks have a size of exactly 128 MB, except the last one which is usually smaller. Each one of the blocks is then replicated and stored on several DataNodes. How many times? This is a parameter called the replication factor. By default, it is 3.

The NameNode is responsible for the system-wide activity of the HDFS cluster. It store in particular three things: • the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information similar to Unix-based systems. • a mapping from each file to the list of its blocks. Each block, in this list, is represented with a 64-bit identifier; the content of the blocks is not on the NameNode. • a mapping from each block, represented with its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block. The DataNodes store the blocks themselves. These blocks are stored on their local disks. DataNodes send regular heartbeats to the NameNode. The frequency of these heartbeats is configurable and is by default a few seconds (e.g., 3s, but this value may change across releases). This is a way to let the NameNode know that everything is alright.Finally, the DataNode also sends, every couple of hours (e.g., 6h, but this value may change across releases), a full report including all the blocks that it contains. A NameNode never initiates a connection to a DataNode.

Finally, DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts. When a replication pipeline is ongoing and a new block is being written to the cluster, the content of the block is not sent in one single 128 MB packet. Rather, it is sent in smaller packets (e.g., 64 kB) in a streaming fashion via a network protocol.

replicas

Having this in mind, the first replica of the block, by default, gets written to the same machine that the client is running on. The second replica is written on a DataNode sitting in a different rack than the client, that we call B. The third replica is written to another DataNode on the same rack B.And further replicas are written mostly at random, but respecting two simple rules for resilience: at most one replica per node, and at most two replicas per rack.

Fault tolerance

HDFS has a single point of failure: the NameNode. If the metadata stored on it is lost, then all the data on the cluster is lost, because it is not possible to reassemble the blocks into files any more.For this reason, the metadata is backed up. More precisely, the file namespace containing the directory and file hierarchy as well as the mapping from files to block IDs is backed up to a so-called snapshot. What is done is that updates to the file system arriving after the snapshot has been made are instead stored in a journal, called edit log, that lists the updates sorted by time of arrival. The snapshot and edit log are stored either locally or on a networkattached drive (not HDFS itself).

Logging and importing data

Two tools are worth mentioning: Apache Flume lets you collect, aggregate and move log data to HDFS. Apache Sqoop lets you import data from a relational database management system to HDFS.

references

https://ghislainfourny.github.io/big-data-textbook/
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

Posted 2023-08-10Updated 2023-08-10a few seconds read (About 22 words)

opencv

basic functions

reading and writing

cv2.VideoWriter_fourcc(‘M’, ‘P’, ‘4’, ‘V’)
cv2.VideoWriter(filename,fourcc,fps,frameSize[,isColor])
cv2.VideoWriter.write(image)

Posted 2023-08-03Updated 2023-08-276 minutes read (About 969 words)

object tracking

Multiple Object Tracking(MOT) is the task of detecting various objects of interest in a video, tracking these detected objects in subsequent frames by assigning them a unique ID, and maintaining these unique IDs as the objects move around in a video in successive frames.Generally, multiple object tracking happens in two stages: object detection and object association. Object detection is the process of identifying all potential objects of interest in the current frame using object detectors such as Faster-RCNN or YOLO. Object association is the process of linking objects detected in the current frame with its corresponding objects from previous frames, referred to as tracklets. Object or instance association is usually done by predicting the object’s location at the current frame based on previous frames’ tracklets using the Kalman Filter followed by one-to-one linear assignment typically using the Hungarian Algorithm to minimise the total differences between the matching results.

Metrics

MOTP (Multiple Object Tracking Precision)

MOTP (Multi-Object Tracking Precision) expresses how well exact positions of the object are estimated. It is the total error in estimated position for matched ground truth-hypothesis pairs over all frames, averaged by the total number of matches made. This metric is not responsible for recognizing object configurations and evaluating object trajectories.

MOTA (Multiple Object Tracking Accuracy)

MOTA (Multi-Object Tracking Accuracy) shows how many errors the tracker system has made in terms of Misses, False Positives, Mismatch errors, etc. Therefore, it can be derived from three error ratios: the ratio of Misses, the ratio of False positives, and the ratio of Mismatches over all the frames.

IDF1 score (IDF1)

IDF1 score (IDF1) is the ratio of correctly identified detections over the average of ground truth and predicted detections.

Benchmarks

OTB

KITTI

MOT16

Methods(models)

IOU tracker

The Intersection-Over-Union (IOU) tracker uses the IOU values among the detector’s bounding boxes between the two consecutive frames to perform the association between them or assign a new target ID if no match found.

Simple Online And Realtime Tracking (SORT)

Simple Online And Realtime Tracking (SORT) is a lean implementation of a tracking-by detection framework.SORT uses the position and size of the bounding boxes for both motion estimation and data association through frames. SORT combines location and motion cues by adopting a Kalman filter to predict the location of the tracklets in the new frame, then computes the IoU between the detection boxes and the predicted boxes as the similarity.

DeepSORT

DeepSORT replaces the association metric with a more informed metric that combines motion and appearance information. In particular, a “deep appearance” distance metric is added. The core idea is to obtain a vector that can be used to represent a given image. DeepSort adopts a stand-alone RE-ID model to extract appearance features from the detection boxes. After similarity computation matching strategy assigns identities to the objects. This can be done by the Hungarian Algorithm or greedy assignment.

FairMOT

FairMOT is a new tracking approach built on top of the anchor-free object detection architecture CenterNet.It has a simple network structure that consists of two homogeneous branches for detecting objects and extracting re-ID features.

TransMOT

TransMOT is a new spatial-temporal graph Transformer that solves all these issues. It arranges the trajectories of all the tracked objects as a series of sparse weighted graphs that are constructed using the spatial relationships of the targets. TransMOT then uses these graphs to create a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial transformer decoder layer to model the spatial-temporal relationships of the objects.

ByteTrack

BYTE is an effective association method that utilizes all detection boxes from high scores to low ones in the matching process.BYTE is built on the premise that the similarity with tracklets provides a strong cue to distinguish the objects and background in low score detection boxes. BYTE first matches the high score detection boxes to the tracklets based on motion similarity. It uses Kalman Filter to predict the location of the tracklets in the new frame. The motion similarity is computed by the IoU of the predicted box and the detection box. Then, it performs the second matching between the unmatched tracklets.

The primary innovation of BYTETrack is keeping non-background low confidence detection boxes which are typically discarded after the initial filtering of detections and use these low-score boxes for a secondary association step. Typically, occluded detection boxes have lower confidence scores than the threshold, but still contain some information about the objects which make their confidence score higher than purely background boxes. Hence, these low confidence boxes are still meaningful to keep track of during the association stage.

Comparison of DeepSort and ByteTrack

DeepSort uses a pre-trained object detection model to detect objects in each frame and a Siamese network to match the detected objects based on their appearance features. It also uses Kalman filters to predict the locations of the objects in the next frame. ByteTrack, on the other hand, uses a lightweight Siamese network architecture that takes in two input frames and outputs a similarity score. It also uses a simple but effective data augmentation technique to improve its performance on challenging datasets.

using ByteTrack

ByteTracker initiates a new tracklet only if a detection is not matched with any previous tracklet and the bounding box score is higher than a threshold.

references

https://www.datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box
https://pub.towardsai.net/multi-object-tracking-metrics-1e602f364c0c
https://learnopencv.com/object-tracking-and-reidentification-with-fairmot/
https://medium.com/augmented-startups/top-5-object-tracking-methods-92f1643f8435
https://medium.com/@pedroazevedo6/object-tracking-state-of-the-art-2022-fe9457b77382

Posted 2023-07-20Updated 2023-07-272 minutes read (About 360 words)

Neural Networks

training nerual networks

suggestions

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data.

Our next step is to set up a full training + evaluation skeleton and gain trust in its correctness via a series of experiments. At this stage it is best to pick some simple model that you couldn’t possibly have screwed up somehow.
Tips & tricks for this stage:
Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome.
Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage.
When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.
Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it.
Train an input-independent baseline, (e.g. easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?

The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss). The reason I like these two stages is that if we are not able to reach a low error rate with any model at all that may again indicate some issues, bugs, or misconfiguration.

In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.

references

http://karpathy.github.io/2019/04/25/recipe/