speech embedding

Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) learns self-supervised representations by using a powerful autoregressive model to predict the future in latent space. The model is trained with a probabilistic contrastive loss, which induces the latent space to capture the information that is maximally useful for predicting future samples. It is a form of unidirectional modeling in the feature space: the model learns to predict near-future frames in an acoustic sequence while contrasting them with frames from other sequences or frames from a more distant time.
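
The contrastive objective can be illustrated with a small NumPy sketch. It assumes an InfoNCE-style loss in which row i of `targets` is the positive for row i of `predictions` and every other row serves as a negative. The function name, the plain dot-product score, and the random toy data are assumptions made for illustration; CPC itself scores pairs with a learned transformation of the context vector.

```python
import numpy as np

def info_nce_loss(predictions, targets):
    """InfoNCE-style contrastive loss (illustrative sketch).

    predictions: (N, D) predicted future latents.
    targets:     (N, D) true future latents; row i is the positive for
                 prediction i, all other rows act as negatives.
    """
    logits = predictions @ targets.T                      # (N, N) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 64))                               # toy "true" latents
loss_matched = info_nce_loss(z, z)                         # perfect predictions
loss_random = info_nce_loss(rng.normal(size=(8, 64)), z)   # uninformed predictions
```

Predictions that match their own targets give a much lower loss than random ones, which is the signal that pushes the encoder toward features that are predictive of the future.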

Autoregressive Predictive Coding

The APC approach uses an autoregressive model to encode the temporal information of the past acoustic sequence; the model then predicts future frames, like a recurrent LM, while conditioning on past frames.
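
As a rough sketch of how such a training target can be set up, the snippet below pairs each frame with the frame `shift` steps ahead and measures the prediction error with an L1 distance. The helper names, the shift of 3, the L1 distance, and the toy frames are assumptions for illustration, not details taken from these notes. The "prediction" is a naive copy of the current frame rather than an actual model.

```python
import numpy as np

def apc_pairs(frames, shift=3):
    """Split (T, D) frames into inputs and the targets `shift` steps ahead."""
    return frames[:-shift], frames[shift:]

def l1_loss(pred, target):
    """Mean absolute error between predicted and true frames."""
    return np.mean(np.abs(pred - target))

# Toy "acoustic" sequence: 10 frames of 2 features each.
frames = np.arange(20, dtype=float).reshape(10, 2)
inputs, targets = apc_pairs(frames, shift=3)

# Naive baseline: predict that the future frame equals the current frame.
baseline = l1_loss(inputs, targets)
```

A trained autoregressive model would replace the copy baseline with its own output at each time step and would be optimized to drive this loss down.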

TERA

TERA, which stands for Transformer Encoder Representations from Alteration, is a self-supervised speech pre-training
method.

experiment design

Compare the amount of labeled data needed to perform well with and without pre-training.

speech

signal

spectrogram: A spectrogram of a time signal is a two-dimensional representation that displays time on its horizontal axis and frequency on its vertical axis, with the energy at each time-frequency point shown as intensity.
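
A minimal NumPy sketch of computing a spectrogram follows. The 16 kHz sample rate, the 400-sample frame (25 ms), the 160-sample hop (10 ms), and the Hamming window are assumed values chosen for illustration. The input is a synthetic 1 kHz tone, not real speech.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Power spectrogram with shape (frequency bins, time frames)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)   # one FFT per frame
    return (np.abs(spectrum) ** 2).T         # frequency on axis 0, time on axis 1

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate     # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone

spec = spectrogram(tone)
peak_bin = spec[:, 0].argmax()               # strongest bin in the first frame
peak_hz = peak_bin * sample_rate / 400       # convert bin index to Hz
```

Each column of `spec` is the power spectrum of one short frame, so plotting the array as an image gives the time-frequency picture described above.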

short-time Fourier analysis

Why use it?
Speech signals are not periodic over long spans, but regions shorter than 100 milliseconds often appear to be periodic, so within such a region we can use the exact definition of the Fourier transform.

spectral leakage

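When a harmonic does not fall exactly on one of the analysis frequencies of a finite window, its energy spreads into neighboring frequencies. This phenomenon is called spectral leakage because the amplitude of one harmonic leaks over the rest and masks their values.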
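
Leakage is easy to reproduce numerically. The sketch below, with an assumed 256-sample window and assumed test frequencies, compares a sinusoid that completes a whole number of cycles in the window with one that does not. It also shows how a tapered (Hamming) window reduces the leakage far from the peak. The helper `far_leakage` is a hypothetical name introduced here.

```python
import numpy as np

n = 256
k = np.arange(n)
on_bin = np.sin(2 * np.pi * 16.0 * k / n)    # exactly 16 cycles in the window
off_bin = np.sin(2 * np.pi * 16.5 * k / n)   # 16.5 cycles: falls between two bins

def far_leakage(x, window):
    """Largest magnitude, relative to the peak, in bins far from the tone."""
    mag = np.abs(np.fft.rfft(x * window))
    mag = mag / mag.max()
    return mag[40:].max()                    # bins 40 and above are far from bin 16

rect = np.ones(n)                            # rectangular window (plain truncation)
hamming = np.hamming(n)

leak_on_rect = far_leakage(on_bin, rect)         # essentially zero
leak_off_rect = far_leakage(off_bin, rect)       # clearly non-zero
leak_off_hamming = far_leakage(off_bin, hamming) # smaller than the rectangular case
```

The on-bin tone puts all of its energy in one bin, while the off-bin tone spreads energy across the whole spectrum. Tapering the frame lowers that spread at the cost of a wider main lobe.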

feature extraction

Representation of speech signals in the frequency domain is especially useful because the frequency structure of a phoneme is generally unique.

Sinusoids are important because speech signals can be decomposed as sums of sinusoids.

For voiced sounds there is typically more energy at low frequencies than at high frequencies, a property called roll-off. To make spectrograms easier to read, the signal is sometimes first pre-emphasized (typically with a first-order difference FIR filter) to boost the high frequencies and counter the roll-off of natural speech.
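
A first-order difference filter of this kind can be written as y[n] = x[n] - alpha * x[n-1]. The sketch below assumes alpha = 0.97, a commonly used value that is not given in these notes, and passes the first sample through unchanged. The function name is illustrative.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is kept as is."""
    signal = np.asarray(signal, dtype=float)
    return np.concatenate(([signal[0]], signal[1:] - alpha * signal[:-1]))

low = preemphasize([1.0, 1.0, 1.0, 1.0])      # constant (lowest-frequency) input
high = preemphasize([1.0, -1.0, 1.0, -1.0])   # alternating (highest-frequency) input
```

After the first sample, the constant input is attenuated to 0.03 while the alternating input grows to magnitude 1.97, which is the high-frequency boost the filter is meant to provide.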

Digital Systems

Linear Time-Invariant Systems and Linear Time-Varying Systems.

The Fourier Transform

Z-Transform

digital filter

filterbank

A filterbank is a collection of filters that span the whole frequency spectrum.
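
As an illustration, the sketch below builds a bank of overlapping triangular filters over the bins of a magnitude spectrum. The uniform spacing, the triangular shape, and the function name are assumptions chosen for simplicity; speech front ends often space the filters on a mel scale instead.

```python
import numpy as np

def triangular_filterbank(n_filters, n_bins):
    """Return an (n_filters, n_bins) matrix of overlapping triangular filters."""
    # n_filters centers plus two outer edge points, evenly spaced over the bins.
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bins - left) / (center - left)      # ramp up to the center
        falling = (right - bins) / (right - center)   # ramp down after the center
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = triangular_filterbank(n_filters=4, n_bins=101)
coverage = bank.sum(axis=0)          # total weight each frequency bin receives

# Applying the bank to a spectrum gives one energy value per filter.
flat_spectrum = np.ones(101)
band_energies = bank @ flat_spectrum
```

With 50% overlap between neighbors, the filter weights sum to one for every bin between the first and last center, so no part of that range is dropped or counted twice.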

short-time analysis