basic ideas
Zero-Shot Learning
Zero-shot learning is a setting in which a model learns to classify classes that it has not seen during training.
Contrastive Language-Image Pretraining (CLIP)
Just like traditional supervised models, CLIP has two stages: the training stage (learning) and the inference stage (making predictions).
In the training stage, CLIP learns about images by “reading” auxiliary text (i.e. sentences) corresponding to each image. CLIP aims to minimize the difference between the encoding of an image and the encoding of its corresponding text.
In the inference stage, we set up the typical classification task by first obtaining a list of all possible labels. Each label is then encoded by the pretrained text encoder from the training stage. Now that we have the label encodings, T₁ to Tₙ, we can take the image that we want to classify, feed it through the pretrained image encoder, and compute how similar the image encoding is to each text label encoding using a similarity measure called cosine similarity. The label with the highest similarity is the prediction.
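The inference step can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the three-dimensional vectors below are hand-written stand-ins for encoder outputs, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    scores = {label: cosine_similarity(image_embedding, emb)
              for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values) standing in for text-encoder outputs T1..Tn.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy embedding standing in for the image-encoder output.
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

In a real setup the embeddings would come from the pretrained image and text encoders, and the label list can be changed freely without retraining.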
contrastive learning
Contrastive learning is a machine learning technique used to learn the general features of a dataset without labels by teaching the model which data points are similar or different. It looks at which pairs of data points are “similar” and “different” in order to learn higher-level features about the data, before even having a task such as classification or segmentation.
SimCLRv2
The entire process can be described concisely in three basic steps:
For each image in our dataset, we apply two different combinations of augmentations (e.g. crop + resize + recolor, resize + recolor, crop + recolor, etc.). We want the model to learn that the two resulting images are “similar” since they are essentially different versions of the same image.
To do so, we can feed these two images into our deep learning model (Big-CNN such as ResNet) to create vector representations for each image. The goal is to train the model to output similar representations for similar images.
Lastly, we try to maximize the similarity of the two vector representations by minimizing a contrastive loss function.
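The last step can be illustrated with a small loss computation. The notes above do not name the loss; this sketch assumes the NT-Xent (normalized temperature-scaled cross-entropy) loss used in the SimCLR papers, and the tiny two-dimensional vectors are made-up representations rather than real model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2N vectors, where z[2k] and z[2k+1] are two
    augmented views of the same image (the positive pair)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        positive = math.exp(cosine(z[i], z[j]) / temperature)
        # All other vectors in the batch (including the positive) form the denominator.
        denom = sum(math.exp(cosine(z[i], z[k]) / temperature)
                    for k in range(n) if k != i)
        total += -math.log(positive / denom)
    return total / n

# Views of the same image agree: low loss.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Views of the same image disagree: high loss.
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(nt_xent_loss(aligned), nt_xent_loss(misaligned))
```

Minimizing this loss pulls the two views of each image together and pushes them away from the other images in the batch.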
Meta-learning
The idea of meta-learning is to learn the learning process.
In-context Learning
During in-context learning, we give the LM a prompt that consists of a list of input-output pairs that demonstrate a task. At the end of the prompt, we append a test input and allow the LM to make a prediction just by conditioning on the prompt and predicting the next tokens.
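A prompt of this shape can be assembled with plain string formatting. The "Input:"/"Output:" template and the function name below are illustrative assumptions; any consistent format for the demonstrations would do.

```python
def build_few_shot_prompt(demonstrations, test_input):
    """Concatenate input-output demonstrations, then append the test input
    with an empty output slot for the LM to complete."""
    lines = []
    for example_input, example_output in demonstrations:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    lines.append(f"Input: {test_input}")
    lines.append("Output:")
    return "\n".join(lines)

demonstrations = [
    ("I loved this movie.", "positive"),
    ("The plot was dull and slow.", "negative"),
]
prompt = build_few_shot_prompt(demonstrations, "What a wonderful soundtrack!")
print(prompt)
```

The model's parameters are not updated; the task is conveyed entirely through the prompt.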
Instruction learning
Instruction learning is an idea proposed by the team led by Quoc V. Le at Google Research in the 2021 paper ‘Finetuned Language Models Are Zero-Shot Learners’. Both instruction learning and prompt learning aim to draw out the knowledge inherent in language models. The difference is that prompts aim to stimulate the completion ability of the language model, such as generating the second half of a sentence based on the first half or filling in the blanks, while instructions aim to stimulate its understanding ability by stating the task more explicitly, enabling the model to take the correct action. The advantage of instruction learning is that after fine-tuning on multiple tasks, the model can also perform zero-shot learning on other tasks. Prompt learning, by contrast, is specific to one task, so it does not generalize as well as instruction learning.
Diffusion Model
In machine learning, the term diffusion model can refer to a class of algorithms that use diffusion processes for tasks such as data clustering, image segmentation, or graph-based learning. In that graph-based sense, the basic principle is to propagate information or labels through the edges of a graph or network: the process starts with initial information or labels assigned to some nodes, which gradually spread to and influence neighboring nodes according to certain rules. Generative diffusion models such as Stable Diffusion use the term differently: they are trained to reverse a gradual noising process, and generate data by iteratively denoising random noise.
Stable Diffusion
Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions.
Prompt engineering
Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering focuses on crafting the optimal textual input by selecting the appropriate words, phrases, sentence structures, and punctuation.
RLHF (Reinforcement Learning from Human Feedback)
generation
Auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. The length T of the word sequence is usually determined on-the-fly and corresponds to the timestep t = T at which the EOS token is generated from the probability distribution.
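The decomposition can be made concrete with a toy model. The table below is a hypothetical bigram model with made-up probabilities; a real language model conditions each word on the whole prefix, not only on the previous word.

```python
import math

# Hypothetical conditional distributions P(next word | previous word).
cond_probs = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def sequence_log_prob(words):
    """Sum the log of each conditional next-word probability; this equals
    the log of the product of the conditionals."""
    log_p = 0.0
    previous = "<bos>"
    for word in words:
        log_p += math.log(cond_probs[previous][word])
        previous = word
    return log_p

# P(the, cat, <eos>) = P(the | <bos>) * P(cat | the) * P(<eos> | cat)
sequence_prob = math.exp(sequence_log_prob(["the", "cat", "<eos>"]))
print(sequence_prob)
```

Here the sequence ends at T = 3, the step at which the EOS token is produced.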
decoding methods
Greedy search
Greedy search is the simplest decoding method. It selects the word with the highest probability as its next word.
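A minimal sketch of greedy search over a toy next-word table follows. The table and its probabilities are made up for illustration and only cover two steps after the word "The".

```python
# Hypothetical next-word distributions, keyed by the words generated so far.
next_word_probs = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def greedy_search(prefix, steps):
    """At every step, append the single most probable next word."""
    sequence = tuple(prefix)
    probability = 1.0
    for _ in range(steps):
        distribution = next_word_probs[sequence]
        word = max(distribution, key=distribution.get)
        probability *= distribution[word]
        sequence = sequence + (word,)
    return sequence, probability

greedy_seq, greedy_prob = greedy_search(["The"], steps=2)
print(greedy_seq, greedy_prob)
```

In this table greedy search commits to "nice" at the first step and therefore never reaches "The dog has", which has a higher overall probability (0.4 × 0.9) because the very likely word "has" is hidden behind the less likely word "dog".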
Beam search
Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Beam search typically finds an output sequence with a probability at least as high as the one found by greedy search, but it is not guaranteed to find the most likely output.
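The following sketch runs beam search on a small toy table. The table and probabilities are made up for illustration and cover only two steps; the function is a simplified stand-in, not the transformers implementation.

```python
# Hypothetical next-word distributions, keyed by the words generated so far.
next_word_probs = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def beam_search(prefix, steps, num_beams):
    """Keep the num_beams most probable hypotheses after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            for word, p in next_word_probs[sequence].items():
                candidates.append((sequence + (word,), probability * p))
        # Sort all expansions by probability and prune to the beam width.
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

beams = beam_search(["The"], steps=2, num_beams=2)
best_sequence, best_probability = beams[0]
print(best_sequence, best_probability)
```

With two beams, the hypothesis starting with "dog" survives the first step, so the search finds "The dog has", which a search that only follows the single best word at each step would miss in this table.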
The most common n-gram penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0. Nevertheless, n-gram penalties have to be used with care. An article generated about the city New York should not use a 2-gram penalty; otherwise, the name of the city would only appear once in the whole text!
When using the transformers library:
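The bookkeeping behind such a penalty can be sketched as follows. The function name is hypothetical and this is a simplified illustration, not the transformers implementation: it returns the words that would complete an n-gram already present in the generated text, whose probability would then be set to 0.

```python
def banned_next_tokens(generated, n):
    """Return the set of tokens that would repeat an existing n-gram
    if appended to `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens form the start of the n-gram being built.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

generated = ["New", "York", "is", "big", ".", "New"]
print(banned_next_tokens(generated, n=2))
```

With a 2-gram penalty, "York" is banned after the second "New", which is exactly the New York problem described above.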
beam_output = model.generate(**model_inputs, max_new_tokens=40, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.
In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!
sampling
In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution.
temperature
Sampling can use a temperature parameter to adjust the output probability distribution. The larger the value, the smoother the distribution: the gap between high and low probabilities narrows, and the model is less sure about the output. The smaller the value, the more pronounced the gap, and the model is more sure about the output. As the temperature tends to 0, sampling becomes equivalent to greedy search.
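The effect can be seen by dividing the logits by the temperature before applying softmax. The logits below are made-up values and the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=2.0)
print(sharp, neutral, smooth)
```

A low temperature concentrates the probability mass on the top word, while a high temperature spreads it more evenly across the candidates.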
Top-K
In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
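A minimal sketch of the filtering step, using a made-up distribution over four words (the function name is illustrative):

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered_k = top_k_filter(probs, k=2)
print(filtered_k)
```

The next word is then sampled from the renormalized distribution, so words outside the top K can never be chosen.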
Top-P
Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words.
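The same idea as a short sketch, again over a made-up distribution (the function name is illustrative). Unlike Top-K, the number of words kept depends on the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative
    probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered_p = top_p_filter(probs, p=0.9)
print(filtered_p)
```

With p = 0.9 three words are kept here; a more peaked distribution would keep fewer and a flatter one would keep more.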
While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.
sample_outputs = model.generate(**model_inputs, max_new_tokens=40, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=3)
models:
LLaMA
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
FastChat
other models
https://github.com/baichuan-inc/baichuan-7B
LLM benchmarks
MMLU
The MMLU benchmark covers 57 general knowledge areas such as “Humanities”, “Social Sciences”, and “STEM”. Each question in it contains four possible options, and each question has only one correct answer.
There are two main ways to get information from a model in order to evaluate it:
Get the output probabilities for a particular set of tokens (for example, the option letters) and compare them to the alternatives in the sample;
Take the text generated by the model (generated iteratively, token by token, with one of the decoding methods described above), and compare this text with the alternatives in the sample.
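The two approaches can be contrasted with a small sketch. The log-probabilities and generated strings are made-up values, the function names are hypothetical, and the parsing rule for generated text is one simple assumption among many that evaluation harnesses use.

```python
def predict_from_option_scores(option_log_probs):
    """Approach 1: pick the option letter the model scores highest."""
    return max(option_log_probs, key=option_log_probs.get)

def predict_from_generated_text(generated, options=("A", "B", "C", "D")):
    """Approach 2: parse the option letter from the start of the generated
    text; return None if the text does not begin with a valid option."""
    words = generated.split()
    if not words:
        return None
    first = words[0].strip(".):,")
    return first if first in options else None

# Assumed log-probabilities of the tokens "A".."D" after the question prompt.
option_log_probs = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9}
score_prediction = predict_from_option_scores(option_log_probs)

text_prediction = predict_from_generated_text(" B. Paris")
unparsed_prediction = predict_from_generated_text("The answer is B")
print(score_prediction, text_prediction, unparsed_prediction)
```

The last call shows why the two approaches can give different scores for the same model: a correct answer phrased in an unexpected format is counted as wrong by a strict text parser.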
C-Eval
A Chinese knowledge and reasoning test set covering four major fields: humanities, social sciences, natural sciences, and other disciplines. It consists of 52 subjects, including calculus, linear algebra, and more, covering topics from secondary school to university-level studies, graduate studies, and professional examinations. The test set comprises a total of 13,948 questions.
code generation benchmarks
HumanEval
HumanEval was proposed to evaluate functional correctness on a set of 164 handwritten programming problems with unit tests.
Functional correctness is measured for programs synthesized from docstrings. Each problem includes a function signature, docstring, body, and several unit tests. The pass@k metric is used: k code samples are generated per problem, and the problem counts as solved if any sample passes the unit tests.
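In practice, pass@k is usually not computed by drawing exactly k samples. The sketch below assumes the unbiased estimator described in the HumanEval paper (Chen et al., 2021): generate n ≥ k samples per problem, count the c samples that pass, and estimate pass@k as 1 − C(n−c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given n generated samples of which c passed the unit tests."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the average of this value over all problems.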
MBPP (Mostly Basic Python Programming)
The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases.
APPS (Automated Programming Progress Standard)
The APPS dataset consists of 5,000 training and 5,000 test examples of coding problems. Most APPS problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions.
MultiPL-E
MultiPL-E is a multi-programming-language benchmark for evaluating the code generation performance of large language models (LLMs).
DS-1000
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas.
references
https://towardsdatascience.com/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
http://ai.stanford.edu/blog/understanding-incontext/
https://www.8btc.com/article/6813626
https://en.wikipedia.org/wiki/Stable_Diffusion
LLaMA: Open and Efficient Foundation Language Models
https://www.promptingguide.ai/
https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76
https://nl2code.github.io/
https://yaofu.notion.site/C-Eval-6b79edd91b454e3d8ea41c59ea2af873
https://huggingface.co/blog/zh/evaluating-mmlu-leaderboard
https://github.com/datawhalechina/hugging-llm/blob/main/content/ChatGPT%E5%9F%BA%E7%A1%80%E7%A7%91%E6%99%AE%E2%80%94%E2%80%94%E7%9F%A5%E5%85%B6%E4%B8%80%E7%82%B9%E6%89%80%E4%BB%A5%E7%84%B6.md
https://huggingface.co/blog/how-to-generate