Turing Machines and Undecidability

mathematical model of computation

Turing Machines

Turing machines are primarily a mathematical construct. A Turing machine (TM) consists of: a finite set of states, including an initial state (the state the TM is in when it starts) and accepting states (if the TM halts in one of these states, it answers "yes"); transition rules that determine what to do based on the current state and the content of the memory tape at the current position; and an alphabet, the set of symbols that can be stored on the memory tape.
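To make this concrete, here is a minimal Python sketch of a TM simulator; the transition table at the bottom (a machine that appends a 1 to a unary string) is a made-up example, not part of the definition above.

```python
# Minimal Turing machine simulator (sketch).
# A transition maps (state, symbol) -> (new_state, symbol_to_write, head_move).

def run_tm(transitions, accept_states, tape, state="q0", max_steps=10_000):
    tape = dict(enumerate(tape))   # sparse tape; the blank symbol is "_"
    head = 0
    for _ in range(max_steps):
        if state in accept_states:
            return True            # machine halted in an accepting state: "yes"
        key = (state, tape.get(head, "_"))
        if key not in transitions:
            return False           # no applicable rule: halt and reject
        state, write, move = transitions[key]
        tape[head] = write
        head += 1 if move == "R" else -1
    raise RuntimeError("no decision within the step limit")

# Hypothetical example: append a '1' to a unary string, then accept.
transitions = {
    ("q0", "1"): ("q0", "1", "R"),   # skip over existing 1s
    ("q0", "_"): ("q1", "1", "R"),   # write one more 1 at the end
}
print(run_tm(transitions, accept_states={"q1"}, tape="111"))  # True
```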

The Church-Turing Hypothesis

Anything that can be computed by any automatic method can also be computed by a Turing machine.

If a model can also do everything that a Turing machine can, then it is called a Turing-complete model of computation.

The RAM Model

Models the way modern computers operate, with registers and memory cells that can be accessed directly (in constant time).

Comparison-Based Sorting

We will prove that any comparison-based sorting algorithm must make $\Omega(n \log n)$ comparisons (and thus take $\Omega(n \log n)$ time).
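A sketch of the standard counting argument (in the decision-tree model): a comparison sort on $n$ elements corresponds to a binary decision tree whose leaves must distinguish all $n!$ input orderings, so the tree height $h$ (the worst-case number of comparisons) satisfies $2^h \ge n!$, hence $h \ge \log_2(n!) = \sum_{i=1}^{n}\log_2 i \ge \frac{n}{2}\log_2\frac{n}{2} = \Omega(n \log n)$.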

Undecidability

decision problems

The answer is just a single bit of information: "yes" or "no". A decision problem is decidable if there exists an algorithm that solves it, without any efficiency considerations.

The Halting Problem

Given a Turing machine M and an input x to M, does M halt when run on input x?
The Halting Problem is undecidable.

Halting On Empty Input

It is undecidable.

Halting on All Inputs

It is undecidable.

Recursively Enumerable Problems

There exists an "algorithm" which terminates whenever the answer is "yes" but does not terminate on "no" instances. Problems that admit such an algorithm are called recursively enumerable.

Turing reduction

A Turing reduction (also called Cook reduction) from a problem
X to a problem Y is an efficient algorithm for problem X which is
allowed to assume an algorithm for Y as a black box.

Vertex Cover

Independent Set

Set Cover

3-Sat

Karp Reductions

$X \le_p Y$ if, given any instance A of problem X, we can in polynomial time construct an instance f(A) of problem Y such that the answer to A is the same as the answer to f(A).
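As a small illustration (my own sketch, not from the notes): the classic Karp reduction from Independent Set to Vertex Cover maps an instance (G, k) to (G, n − k), since a set S is an independent set exactly when V \ S is a vertex cover.

```python
# Sketch: Karp reduction from Independent Set to Vertex Cover.
# Independent Set instance: (graph with n vertices, edge list, integer k)
#   "does G contain an independent set of size >= k?"
# Vertex Cover instance:    (same graph, integer n - k)
#   "does G contain a vertex cover of size <= n - k?"
# Both instances have the same yes/no answer because S is an independent
# set exactly when V \ S is a vertex cover.

def reduce_independent_set_to_vertex_cover(n, edges, k):
    """Map an Independent Set instance to a Vertex Cover instance in O(1)."""
    return n, edges, n - k   # f(A): same graph, complementary budget
```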

A Turing reduction is the more general of the two notions. A problem A is Turing-reducible to a problem B if there is an algorithm that solves A by making calls to a black-box subroutine for B; when the algorithm runs in polynomial time and makes polynomially many calls, this is also called a Cook reduction. In particular, if A is polynomial-time Turing-reducible to B and B can be solved in polynomial time, then so can A. Note, however, that Turing reductions do not in general preserve membership in classes such as NP: being allowed to call a subroutine for an NP problem does not by itself place A in NP.

A Karp reduction, also known as a polynomial-time many-one reduction, is a more restrictive special case. A problem A is Karp-reducible to a problem B if there is a polynomial-time algorithm that transforms any instance of A into an instance of B with the same yes/no answer; equivalently, it is a Turing reduction that makes exactly one call to the subroutine for B and returns its answer unchanged. Karp reductions do preserve membership in NP: if A is Karp-reducible to B and B is in NP, then A is in NP. This is why Karp reductions are the standard tool for proving NP-completeness.

Good reference

https://www.cs.rochester.edu/u/nelson/courses/csc_173/computability/undecidable.html
http://www2.lawrence.edu/fast/GREGGJ/CMSC515/chapt05/Reducibility.html
https://www.cs.princeton.edu/courses/archive/spring05/cos423/lectures.php

Randomized Algorithms

Randomized Algorithms and Approximation Algorithms

A randomized algorithm is an algorithm that uses randomness to
make some of its choices.

Las Vegas Algorithms

A Las Vegas algorithm is a randomized algorithm that always finds
the correct answer, but the running time of the algorithm might
vary significantly depending on the random choices made.

Monte Carlo Algorithms

A Monte Carlo algorithm is a randomized algorithm where the
output of the algorithm may be incorrect, but we have a guarantee
that this only happens with small probability.

Random Number Generator

True RNG: gets randomness from measuring physical phenomena (e.g. background radiation) that we believe are sufficiently random.
Pseudo-RNG: starting from a small random seed, new numbers are generated in a completely deterministic way.

Randomized Min-Cut Algorithm

While G has more than 2 vertices: pick a uniformly random edge of G and contract it.
A single run of the contraction algorithm finds a particular minimum cut with probability at least $2/(n(n-1))$, so it is repeated $O(n^2)$ times (keeping the best cut found), giving a total runtime of $O(n^2 m)$. However, this algorithm can be refined and the running time improved to $O(n^2 \log n)$ (a relatively simple algorithm) or $O(m \log^3 n)$ (a more complicated algorithm).
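A sketch of a single run of the contraction algorithm, assuming a connected graph given as an edge list; a full implementation would repeat this many times and keep the smallest cut found.

```python
import random

def karger_one_run(n, edges):
    """One run of the randomized contraction algorithm.
    n: number of vertices (labelled 0..n-1); edges: list of (u, v) pairs.
    Assumes the graph is connected. Returns the size of the cut found
    (always >= the true minimum cut)."""
    # Union-find keeps track of which super-vertex each vertex belongs to.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    vertices_left = n
    pool = list(edges)                         # remaining non-self-loop edges
    while vertices_left > 2:
        u, v = random.choice(pool)             # pick a uniformly random edge
        parent[find(u)] = find(v)              # contract the edge
        vertices_left -= 1
        pool = [(a, b) for (a, b) in pool if find(a) != find(b)]  # drop self-loops

    # Edges between the two remaining super-vertices form the cut.
    return sum(1 for (a, b) in edges if find(a) != find(b))
```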

approximation algorithm

The Maximum Cut Problem

Partition the vertices of G into two non-empty sets A and B such that the number of edges between A and B is maximized. This is an NP-hard problem. Let us lower our ambition and try to find a fast algorithm that finds a "reasonably good" cut. Randomized algorithm: put each vertex in either A or B independently with probability 1/2. The random assignment algorithm cuts, in expectation, at least half of all the edges in the graph. This is an example of an approximation algorithm.
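A minimal sketch of the random-assignment algorithm; since each edge is cut with probability 1/2, the expected number of cut edges is |E|/2, which is at least Opt/2.

```python
import random

def random_max_cut(vertices, edges):
    """Put each vertex on side A or B independently with probability 1/2.
    Each edge (u, v) is cut with probability 1/2, so E[#cut edges] = |E| / 2."""
    side = {v: random.random() < 0.5 for v in vertices}
    cut = sum(1 for (u, v) in edges if side[u] != side[v])
    return side, cut
```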

finding “good but maybe not optimal” solutions

Heuristic Algorithms

Work well in some or even many cases, but with no guarantees
about how well they perform.

Approximation Algorithms

Algorithms with provable guarantee that the solution found is relatively good compared to the optimum solution, for all instances.
For a minimization problem, an algorithm has approximation ratio $\alpha \ge 1$ if for every instance it holds that $\mathrm{Alg} \le \alpha \cdot \mathrm{Opt}$.
For a maximization problem, the inequality goes the other way: the algorithm has approximation ratio $\alpha \le 1$ if $\mathrm{Alg} \ge \alpha \cdot \mathrm{Opt}$.

alpha-approximation algorithm

approximation ratio: $\alpha$.

Minimum Load Balancing

An NP-hard problem.
Given lengths $t_1, \ldots, t_n$ of n jobs to be run on m machines, find the minimum possible makespan (the maximum total load on any machine) of a schedule for the n jobs.
Greedy approximation algorithm for Load Balancing: consider the jobs one by one and assign each job to the machine with the smallest current load.
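A sketch of the greedy rule, using a heap to find the least-loaded machine; the function name and interface are my own.

```python
import heapq

def greedy_load_balancing(job_lengths, m):
    """Assign each job to the machine with the smallest current load.
    Returns the makespan (load of the most loaded machine).
    This greedy rule is a 2-approximation; sorting the jobs in decreasing
    order of length first improves the guarantee."""
    loads = [(0, machine) for machine in range(m)]   # (current load, machine id)
    heapq.heapify(loads)
    for t in job_lengths:
        load, machine = heapq.heappop(loads)         # least-loaded machine
        heapq.heappush(loads, (load + t, machine))
    return max(load for load, _ in loads)
```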

how to prove it

The main difficulty in analyzing approximation algorithms is to get some handle on the value of Opt. We would like to compare against Opt, but computing Opt is NP-hard, so instead we work with lower bounds on Opt (for minimization problems).

Minimum Vertex Cover

A vertex cover in a graph G is a set of vertices that "touches" every edge of G. What is the size of a minimum vertex cover of G? Approximation algorithm for Minimum Vertex Cover: while there exists an edge e = (u, v) such that u ∉ S and v ∉ S, add both u and v to S. This algorithm is a 2-approximation algorithm.
Assuming the "Unique Games Conjecture", it is known that Vertex Cover cannot be approximated to within a factor better than 2.
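A sketch of the 2-approximation; the chosen edges form a matching, and any vertex cover must contain at least one endpoint of each matched edge, which gives the factor 2.

```python
def vertex_cover_2_approx(edges):
    """Matching-based 2-approximation for Minimum Vertex Cover.
    While some edge (u, v) is uncovered, add *both* endpoints to the cover."""
    cover = set()
    for (u, v) in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover
```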

Minimum Set Cover

Given a collection of sets $S_1, \ldots, S_m \subseteq U$ over some universe U, what is the minimum number of $S_i$'s whose union equals U?

Analysis of Greedy Set Cover: the greedy algorithm, which repeatedly picks the set covering the largest number of still-uncovered elements, achieves an approximation ratio of $H_n \approx \ln n$.
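A sketch of the greedy algorithm; the small example at the end is made up.

```python
def greedy_set_cover(universe, sets):
    """Greedy Set Cover: repeatedly pick the set covering the most
    uncovered elements.  Achieves an H_n ~ ln(n) approximation ratio."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the largest number of uncovered elements.
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Example: an optimal cover of {1..5} uses two sets; greedy also finds two here.
print(greedy_set_cover({1, 2, 3, 4, 5}, [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]))
```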

Elasticsearch

ELK Stack

Elasticsearch, Logstash and Kibana

Elasticsearch

Elasticsearch is a NoSQL database. When you feed data into Elasticsearch, the data is placed into Apache Lucene indexes.
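A minimal sketch using the official Elasticsearch Python client, assuming a cluster at localhost:9200 and an 8.x client; the index name "podcasts" and the document fields are placeholders, and the exact keyword arguments can differ between client versions.

```python
# Sketch: index a document and search it with the Elasticsearch Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

# Feeding data in: the document ends up in a Lucene index under the hood.
es.index(index="podcasts", id="1",
         document={"title": "Turing machines", "duration_min": 42})

# Full-text search against the same index.
resp = es.search(index="podcasts", query={"match": {"title": "turing"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```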

Apache Lucene

Apache Lucene™ is a high-performance, full-featured search engine library written entirely in Java.

API

Logstash

Using more than 50 input plugins for different platforms, databases and applications, Logstash can be configured to collect and process data from these sources and send it to other systems for storage and analysis.

project

https://trecpodcasts.github.io/
https://doc.yonyoucloud.com/doc/mastering-elasticsearch/chapter-2/21_README.html
https://cloud.tencent.com/developer/article/1600163
https://www.elastic.co/cn/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries
https://www.elastic.co/guide/en/app-search/current/relevance-tuning-guide.html
https://medium.com/mlearning-ai/enhancing-information-retrieval-via-semantic-and-relevance-matching-64973ff81818
https://bigdataboutique.com/blog/optimizing-elasticsearch-relevance-a-detailed-guide-c9efd3
NDCG:
https://www.javatips.net/api/MyMediaLiteJava-master/src/org/mymedialite/eval/measures/NDCG.java

speech

signal

spectrogram: A spectrogram of a time signal is a special two-dimensional representation that displays time in its horizontal axis and frequency in its vertical axis.
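A minimal SciPy sketch of computing and plotting a spectrogram; the chirp test signal and the window parameters are arbitrary choices for illustration.

```python
# Sketch: spectrogram of a synthetic signal (time on x-axis, frequency on y-axis).
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16_000                                   # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)
x = signal.chirp(t, f0=100, f1=4000, t1=1.0)  # synthetic test signal

f, times, Sxx = signal.spectrogram(x, fs=fs, window="hann",
                                   nperseg=400, noverlap=240)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12))   # dB scale
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.show()
```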

short-time Fourier analysis

Why use it? Speech signals are not stationary over long stretches, but some regions shorter than 100 milliseconds often appear to be periodic, so over such short segments we can use the exact definition of the Fourier transform. This is why Fourier analysis is applied to short windows of the signal rather than to the whole signal at once.

spectral leakage

When a periodic signal is analyzed over a finite window, energy from each harmonic spreads into neighbouring frequencies. This phenomenon is called spectral leakage because the amplitude of one harmonic leaks over the rest and masks their values.

feature extraction

Representation of speech signals in the frequency domain is especially useful because the frequency structure of a phoneme is generally unique.

Sinusoids are important because speech signals can be decomposed as sums of sinusoids.

For voiced sounds there is typically more energy at low frequencies
than at high frequencies, also called roll-off. To make the spectrograms easier to read, sometimes the signal is first preemphasized (typically with a first-order difference FIR filter) to boost the high frequencies
to counter the roll-off of natural speech.
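A sketch of the usual first-order preemphasis filter; the coefficient 0.97 is a common choice, not something mandated by the text above.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order difference FIR filter y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to counter the spectral roll-off of voiced speech."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])
```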

Digital Systems

Linear Time-Invariant Systems and Linear Time-Varying Systems.

The Fourier Transform

Z-Transform

digital filter

filterbank

A filterbank is a collection of filters that span the whole frequency spectrum.
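As an illustrative sketch, here is a mel-style triangular filterbank (a common choice in speech processing, though the definition above covers any collection of filters spanning the spectrum):

```python
import numpy as np

def triangular_filterbank(num_filters, n_fft, fs, f_min=0.0, f_max=None):
    """Build num_filters triangular band-pass filters spaced on the mel scale.
    Returns an array of shape (num_filters, n_fft // 2 + 1) that can be
    multiplied with a power spectrum to obtain filterbank energies."""
    f_max = f_max or fs / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)        # Hz -> mel
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # mel -> Hz

    # Filter edge frequencies, equally spaced on the mel scale.
    mel_points = np.linspace(mel(f_min), mel(f_max), num_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)

    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb
```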

short-time analysis

NP and reduction

P

P is the set of all decision problems that can be solved in polynomial time.

Extended Church-Turing Thesis

Any realistic model of computation can be efficiently simulated by a Turing machine.
(The ECT is probably false! A probable counter-example is quantum computing. However, the ECT is true for the classical models of computation underlying our laptops.)

NP problem

For many important problems we do not know of polynomial-time algorithms, yet we also cannot prove that no polynomial-time algorithm exists.

NP stands for Nondeterministic Polynomial time.

NP is the set of decision problems for which there is a polynomial-time algorithm that verifies whether a proposed solution (a certificate) is correct.
Note that showing that a problem is in NP does not mean that the problem can be solved in polynomial time. It only means that a proposed solution can be verified in polynomial time.
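For example, here is a sketch of a polynomial-time verifier for Vertex Cover, where the certificate is the proposed cover itself; this is what places Vertex Cover in NP.

```python
def verify_vertex_cover(edges, k, certificate):
    """Polynomial-time verifier: check that `certificate` is a vertex cover
    of size at most k.  Runs in O(|certificate| + |edges|) time."""
    cover = set(certificate)
    return len(cover) <= k and all(u in cover or v in cover for (u, v) in edges)
```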

All problems in P are also in NP, but we do not know whether P = NP.

NP-hardness

We say that a problem Y is NP-hard if every problem in NP can be Karp-reduced to Y. To prove NP-hardness for Y we only have to find
one other NP-hard problem X and reduce X to Y.

NP-complete problem

A large class of problems in this “gray area” has been characterized,
and it has been proved that they are equivalent in the following sense: a polynomial-time algorithm for any one of them would imply the existence of a polynomial-time algorithm for all of them. These are the NP-complete problems.

X is itself in NP, and every problem in NP can be reduced to X. We call such an X an NP-complete problem.

To prove that a problem is NP-complete, we need to prove that it lies in NP and that it is NP-hard. Membership in NP is proved by showing that a proposed solution can be verified in polynomial time. NP-hardness is proved by a Karp reduction from a known NP-hard problem.

the NP-complete problems are the hardest problems in NP

The Cook-Levin Theorem

The Sat problem is NP-hard (and, since Sat is clearly in NP, NP-complete).

Hamiltonian Cycle

Reduction from Sat to Hamiltonian Cycle.

Travelling Salesman

NP-hardness follows by a reduction from Hamiltonian Cycle.
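A sketch of the standard construction (my own formulation): edges of G get distance 1, non-edges get distance 2, and G has a Hamiltonian cycle exactly when there is a tour of total length at most n.

```python
def reduce_hamiltonian_cycle_to_tsp(n, edges):
    """Karp reduction from Hamiltonian Cycle to Travelling Salesman.
    Returns (distance matrix, budget): G has a Hamiltonian cycle
    iff there is a TSP tour of total length <= n."""
    edge_set = {frozenset(e) for e in edges}
    dist = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u != v:
                dist[u][v] = 1 if frozenset((u, v)) in edge_set else 2
    return dist, n
```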

Graph Coloring

Graph Coloring (with three or more colors) is NP-hard.
2-coloring is not NP-hard: it can be solved in polynomial time, since a graph is 2-colorable exactly when it is bipartite.
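A sketch of why 2-coloring is easy: a breadth-first search either 2-colors the graph or detects an odd cycle, in linear time.

```python
from collections import deque

def two_color(n, adj):
    """Try to 2-color a graph given as an adjacency list (vertices 0..n-1).
    Returns a list of colors (0/1) or None if the graph is not bipartite."""
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None        # odd cycle: not 2-colorable
    return color
```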

Vertex Cover

We say that a set of nodes S is a vertex cover if every edge e has at least one endpoint in S.

Set Cover

Given a set U of n elements, a collection S1,…, Sm of subsets of U, and
a number k, does there exist a collection of at most k of these sets whose
union is equal to all of U?

CoNP

The complexity class CoNP consists of all problems where a “no” answer can be efficiently verified.
For every NP-complete problem there is a
corresponding CoNP-complete problem.

We do not know if NP=CoNP.

PSPACE

The complexity class PSPACE consists of all problems that can
be solved by an algorithm using at most a polynomial amount of
space.
It is strongly believed that NP != PSPACE, but we do not even
know with certainty whether P != PSPACE or not!

For many 2-player games like Geography, deciding if there is a winning strategy from a given position is a PSPACE problem.

BPP and ZPP

BPP

BPP (Bounded-error Probabilistic Polynomial time) consists of all decision problems for which there is a polynomial-time randomized algorithm that is correct with probability at least 2/3 on every instance.

ZPP

ZPP (Zero-error Probabilistic Polynomial Time)
consists of all decision problems for which there is a Las Vegas
algorithm running in expected polynomial time.
It is widely believed that P = ZPP = BPP but all three could be
different.

toolnotes

Background

Notes on tools I use in everyday study and development work.

Notes

In vim, the right mouse button enters visual mode instead of pasting. Fix: :set mouse-=a

Remote Jupyter configuration:
https://juejin.cn/post/7026371559971520525

For Debian / Ubuntu: .deb packages installed by apt and dpkg
For Rocky / Fedora / RHEL: .rpm packages installed by yum
For FreeBSD: .txz packages installed by pkg

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Kill PyTorch distributed training processes:
kill $(ps aux | grep YOUR_TRAINING_SCRIPT.py | grep -v grep | awk '{print $2}')

git reset --soft HEAD^   # undo the last commit, keeping the changes staged
git reset --hard HEAD^   # undo the last commit and discard its changes

Tokenizing in Unix: the "tr" command.
Tokenize a file, then sort and count the tokens:
tr -sc 'A-Za-z' '\n' < $file_name | sort | uniq -c

conda:
conda create -n yourenv python=3.7 pip

git: add all Python files: find . -name "*.py" | xargs git add --

Export project dependencies:
pip install pipreqs
pipreqs ./ --encoding=utf-8 --force

Inside a Docker container, apt install cannot locate packages (Debian stretch has been archived):
echo "deb http://archive.debian.org/debian stretch main" > /etc/apt/sources.list

https://unix.stackexchange.com/questions/743839/apt-get-update-failed-to-fetch-debian-amd64-packages-while-building-dockerfile-f

PlotNeuralNet

https://pub.towardsai.net/creating-stunning-neural-network-visualizations-with-chatgpt-and-plotneuralnet-adab37589e5

remote develop

https://devblogs.microsoft.com/python/remote-python-development-in-visual-studio-code/

references

https://leimao.github.io/blog/Kill-PyTorch-Distributed-Training-Processes/

PCA

Why PCA requires centering (zero mean)

Without zero-meaning the data, you cannot really do PCA at all. In terms of linear transformations, PCA performs a rotation of the space (right-multiplying the data matrix by the matrix of eigenvectors of the covariance matrix) and then replaces the original points by their projections onto a low-dimensional subspace (the subspace spanned by the first n_components eigenvectors), thereby achieving dimensionality reduction. Note that this is only a rotation, with no translation, so the points must first be distributed around the origin; that is exactly what zero-meaning achieves. Moreover, if you have implemented PCA by hand you will know how explained_variance and explained_variance_ratio are computed: explained_variance is each eigenvalue of the covariance matrix divided by the number of samples, and explained_variance_ratio is each eigenvalue divided by the sum of all eigenvalues. Why is it that simple? This too relies on the zero mean; if you prove PCA via the maximum-projection-length argument, this falls out naturally along the way, which I will not expand on here.

Source: https://www.zhihu.com/question/40956812/answer/848527057

Centering is not strictly required for PCA, but this conclusion holds only if one strictly uses the definition of the covariance matrix, $S = XX^T - n\mu\mu^T$, rather than treating $XX^T$ itself as the covariance matrix.

Centering is an important pre-processing step because it ensures that the resulting components are only looking at the variance within the dataset, and not capturing the overall mean of the dataset as an important variable (dimension). Without mean-centering, the first principal component found by PCA might correspond with the mean of the data instead of the direction of maximum variance.
https://towardsdatascience.com/tidying-up-with-pca-an-introduction-to-principal-components-analysis-f876599af383

Why is minimizing squared residuals equivalent to maximizing variance?
Consider a datapoint $x_i$ (a row of $X$). Its contribution to the total variance is its squared Euclidean length $\|x_i\|^2$ (up to a constant factor $1/n$). By the Pythagorean theorem, $\|x_i\|^2 = \|\hat{x}_i\|^2 + \|x_i - \hat{x}_i\|^2$, where $\hat{x}_i$ is the projection onto the principal subspace: the total variance equals the variance remaining plus the variance lost (the squared residual). Thus maximizing the remaining variance and minimizing the lost variance are equivalent ways to find the principal components.

http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/

PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.
https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia

PCA focuses on "explaining" the data matrix using the sample means plus the eigencomponents. When the column mean is far from the origin, the first right singular value is usually quite highly correlated with the column mean - thus using PCA concentrates on the second, third and sometimes higher order singular vectors. This is a loss of information when the mean is informative for the process under study. On the other hand, when the scatterplot of the data is roughly elliptical, the PCs typically align with the major axes of the ellipse. Due to the uncorrelatedness constraint, if the mean is far from the origin, the first singular vector will be close to the mean and the others will be tilted away from the major axes of the ellipse. Thus the first singular vector will not be informative about the spread of the data, and the second and third singular values will not be in the most informative directions. Generally, PCA will be more informative, particularly as a method for plotting the data, than uncentered SVD.
https://online.stat.psu.edu/stat555/node/94/

Since X is zero centered we can think of them as capturing the spread of the data around the mean in a sense reminiscent of PCA.
https://intoli.com/blog/pca-and-svd/

The reconstruction error is minimized by taking as columns of W some k orthonormal vectors maximizing the total variance of the projection.
https://stats.stackexchange.com/questions/130721/what-norm-of-the-reconstruction-error-is-minimized-by-the-low-rank-approximation

PCA is a regression model without an intercept. Thus, principal components inevitably pass through the origin. If you forget to center your data, the 1st principal component may pierce the cloud not along the main direction of the cloud, and will be (for statistical purposes) misleading.
https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca

Centering brings in a big difference. PCA with centering maximizes SS deviations from the mean (i.e. variance); PCA on raw data maximizes SS deviations from the zero point.
https://stats.stackexchange.com/questions/489037/principal-components-with-and-without-centering?noredirect=1&lq=1

SVD and PCA

https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
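A minimal numpy sketch of the relationship, assuming rows of X are samples: center the columns, take the SVD, and read off the principal components and explained variances.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """PCA computed through the SVD of the *centered* data matrix.
    X: (n_samples, n_features).  Returns (scores, components, explained_variance)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # centering is the crucial step
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions (rows)
    scores = Xc @ components.T                   # projections of the data
    explained_variance = s[:n_components] ** 2 / (n - 1)
    return scores, components, explained_variance
```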

singular value decomposition

SVD is a matrix factorization technique that decomposes any matrix into three matrices: $A = U \Sigma V^T$, where U and V have orthonormal columns and $\Sigma$ is diagonal with the (non-negative) singular values on its diagonal.

Eigenvalues and Eigenvectors

The concept of eigenvectors is applicable only to square matrices. The vector space spanned by the eigenvectors for a given eigenvalue is called an eigenspace.
A square matrix is called diagonalizable if it can be written in the form $A = PDP^{-1}$, where D is a diagonal matrix whose diagonal entries are the eigenvalues.
A symmetric matrix is a matrix that is equal to its own transpose.
Special properties of a symmetric matrix with respect to eigenvalues and eigenvectors: it has only real eigenvalues; it is always diagonalizable; its eigenvectors can be chosen to be orthogonal.
A matrix is called an orthogonal matrix if its transpose is its inverse.
Since the eigenvectors of a symmetric matrix are orthogonal to each other, the matrix P in the diagonalization of A is an orthogonal matrix. So we say that any symmetric matrix is orthogonally diagonalizable.

https://towardsdatascience.com/singular-value-decomposition-and-its-applications-in-principal-component-analysis-5b7a5f08d0bd
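A short numpy check of the symmetric-matrix properties above, on a random symmetric matrix (an assumed example, using numpy's eigh for symmetric matrices):

```python
import numpy as np

# A random symmetric matrix: real eigenvalues, orthogonal eigenvectors.
A = np.random.randn(4, 4)
A = (A + A.T) / 2                       # symmetrize

eigvals, P = np.linalg.eigh(A)          # eigh is for symmetric/Hermitian matrices
D = np.diag(eigvals)

print(np.allclose(P @ D @ P.T, A))      # A = P D P^{-1} with P^{-1} = P^T
print(np.allclose(P.T @ P, np.eye(4)))  # eigenvectors are orthonormal
```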

% For the PCA derived from maximal preserved variance \cite{lee2007nonlinear}, we have the covariance
% of $\mathbf{y}$, which is
% \begin{equation}
% \mathbf{C}_{\mathbf{y}\mathbf{y}} = E\left\{ \mathbf{y}\mathbf{y}^T \right\}
% \end{equation}
% This equation is valid only when $\mathbf{y}$ is centered.
The goal of PCA is to maximize the variance of the data along each of the principal components. Centering is an important step because it ensures that the resulting components are only looking at the variance of features, and not capturing the means of the features as important. Without mean-centering, the first principal component found by PCA might correspond with the mean of the data instead of the direction of maximum variance.

Network Representation Learning

background

Notes recorded while studying at KTH. This is the first blog post about network representation learning, a project in the Machine Learning, Advanced Course.

LINE

Reproduce paper “LINE: Large-scale Information Network Embedding”.

Alias Table Method

It is a method for efficiently drawing samples from a discrete distribution (a sketch follows after the references below).
reference:
https://www.keithschwarz.com/darts-dice-coins/
https://blog.csdn.net/haolexiao/article/details/65157026
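A compact sketch of building and sampling from an alias table, following the scheme described at the first link; the variable names are my own.

```python
import random

def build_alias_table(probs):
    """Pre-process a discrete distribution into (prob, alias) tables in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # give the leftover mass to l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # leftovers equal 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """Draw one sample in O(1): pick a column, then flip a biased coin."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
samples = [alias_draw(prob, alias) for _ in range(10_000)]
```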

Negative Sampling

word2vec

Original paper:
Efficient estimation of word representations in vector space.
reference:
word2vec Explained: Deriving Mikolov et al.’s
Negative-Sampling Word-Embedding Method

Skip-Gram Model

Original paper: Distributed Representations of Words and Phrases and their Compositionality.
The idea behind the word2vec models is that words that appear in the same context (near each other) should have similar word vectors. Therefore, we should build some notion of similarity into the objective when training the model. This is done using the dot product, since when vectors are similar, their dot product is larger (a sketch of the negative-sampling objective follows after the reference below).
reference:
https://www.baeldung.com/cs/nlps-word2vec-negative-sampling
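A numpy sketch of the negative-sampling objective for a single (center, context) pair with a few random negative vectors; the embedding dimension and the sampled vectors here are placeholders.

```python
import numpy as np

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram negative sampling loss for one (center, context) pair:
    -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c).
    The true context word should get a large dot product with the center
    word, the sampled negatives a small one."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(context_vec @ center_vec))
    for neg in negative_vecs:
        loss -= np.log(sigmoid(-neg @ center_vec))
    return loss

# Hypothetical 50-dimensional embeddings and 5 random negatives.
rng = np.random.default_rng(0)
v_c, u_o = rng.normal(size=50), rng.normal(size=50)
negatives = rng.normal(size=(5, 50))
print(sgns_loss(v_c, u_o, negatives))
```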

GraphSAGE

thoughts

Machine Learning (5): Imbalanced Classification

What is an imbalanced classification problem?

In machine learning, an imbalanced classification problem is one in which the samples are unevenly distributed across the classes, with one class having far more samples than another. This imbalance can affect model training and performance evaluation, because the model may tend to predict the class with more samples and give poor predictions for the class with fewer samples.

How to address it

To address an imbalanced classification problem, the following methods can be considered: increase the number of minority-class samples or decrease the number of majority-class samples to balance the class distribution, i.e. oversampling (adding minority-class samples) and undersampling (removing majority-class samples); or adjust the classification threshold so that the model is more inclined to identify the minority class, which can be done by tuning the probability threshold applied to the model output.
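A minimal numpy sketch of random oversampling and of lowering the decision threshold; the threshold 0.3 is just an illustrative value.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate random minority-class rows until both classes have equal counts.
    Assumes the minority class really is the smaller one."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def predict_with_threshold(probs, threshold=0.3):
    """Lowering the threshold below 0.5 makes the model more willing
    to predict the (positive) minority class."""
    return (probs >= threshold).astype(int)
```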