Pretrained large language models aren’t particularly good at responding in concise, coherent sentences out of the box. At a minimum, they have to be finetuned in order to respond in a particular style. General users expect LLMs to respond in the style of a general-purpose “AI assistant,” with polite, paragraph-length responses to open-ended queries. Researchers and developers, on the other hand, want LLMs to do well on benchmarks such as MMLU and BigBench that require a different style with a multiple-choice, terse format. How should LLMs be instruction finetuned to respond in two completely different styles?
While there are many ways to finetune LLMs for general-purpose question answering, one particularly intriguing approach is that of supervised finetuning on a small number of high-quality samples. The recent LIMA (“less is more for alignment”) study boldly claimed that general-purpose instruction following could be achieved by simply finetuning on 1,000 diverse, high-quality question-answering pairs; several other contemporaneous studies also argue that this type of “style alignment” can be achieved with a small number of high-quality samples (e.g. Alpaca, Vicuna, Alpagasus, and Tülü, but see also The False Promise of Imitating Proprietary LLMs). Throughout this blog post, we use the term “style alignment” to refer to the general observation that LLMs can be finetuned on a small number of samples. While the more widespread term “instruction finetuning” can include approaches like LIMA, it also encompasses datasets such as FLANv2, which contains more than 15 million examples of question-answer pairs extracted from a wide swath of traditional NLP datasets and organized into instruction templates for 1,836 tasks.
A key consideration is how to evaluate LLM finetuning. The two styles highlighted above (conversational vs. concise) are evaluated using two completely different approaches. NLP evaluation benchmarks such as MMLU contain short, academic-style questions and expect an exact token match. However, it is difficult to evaluate the quality and style of a model’s responses to more general, open-ended questions when using traditional perplexity-based NLP benchmarks. With the advent of easily accessible high-quality LLMs like LLaMA and ChatGPT, it became possible to evaluate model quality by using another LLM as a judge (e.g. AlpacaEval and MTBench). In this model-based evaluation paradigm, an LLM is prompted with an instruction and is asked to judge a pair of corresponding responses. Given the success of low-sample finetuning, we asked whether it was possible to optimize for two different styles AND do well on both perplexity-based and model-based evaluation paradigms?
In a paper we presented at a NeurIPS workshop in December, we considered the style alignment approach from LIMA and examined whether a small amount of high-quality instruction finetuning samples could improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation. Despite differences between perplexity-based and LLM-based evaluation paradigms, we found that careful construction of finetuning datasets can boost performance on both paradigms.
In our research, we finetuned open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We found that subsets of 1k-6k instruction finetuning samples were able to achieve good performance on both traditional NLP benchmarks and model-based evaluation (aka “LLM-as-a-judge”). Finally, we found that mixing textbook-style and open-ended QA finetuning datasets optimized performance on both evaluation paradigms.
So what should you do if you want to effectively and cheaply “instruction finetune” an LLM?
- Finetune on a small number of diverse, high-quality samples for style alignment. When instruction finetuning your LLM, make sure to include diverse samples with different styles and from different sources that also match your evaluation paradigm. For example, we found it particularly helpful to combine trivia-like multiple choice QA samples (in a style similar to benchmarks like MMLU) with longer, more open-ended “AI-assistant” style QA examples (which are favored by LLMs “as-a-judge”). We found that mixing 2,000 – 6,000 samples was surprisingly effective at the 7B and 30B model scale.
- Use multiple evaluation paradigms. Make sure to evaluate your model using both traditional NLP evaluation paradigms such as MMLU as well as newer paradigms like LLM-as-a-judge.
Doing these two things will optimize your LLM for standard NLP benchmarks as well as more fluid “AI assistant”-like conversation evaluation methods. This work is extensively detailed in our arXiv paper LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms (Jha et al. 2022). For a technical summary of our experiments, keep on reading!
Technical Deep Dive
We chose to focus on two open-source models from the Mosaic MPT family, MPT-7B and MPT-30B, as well as three open-source instruction tuning datasets: Instruct-v1 (59.3k samples, also called dolly_hhrlhf) and Instruct-v3 (56.2k samples), which are the corresponding instruction finetuning datasets for MPT-7B-Instruct and MPT-30B-Instruct, and the LIMA dataset (1k samples). We evaluate model performance using (1) Mosaic’s efficient, open-source Eval Gauntlet, which is based on traditional NLP benchmarks such as MMLU and BIG-bench, as well as (2) AlpacaEval’s suite for model-based evaluation using GPT-4 as the judge.
ML practitioners sometimes use the phrases “finetuning” and “instruction finetuning” interchangeably. However, we would like to disentangle the idea of introducing new knowledge to a pretrained model via finetuning on new data from the process of enabling a base model to do general question answering by instruction finetuning. There is ample evidence that “more data is better” when it comes to introducing new knowledge to a model. In our work, we were concerned more specifically with the question of “style alignment” by finetuning on question answering examples with various instructions.
Instruction Finetuning Datasets
We used three publicly available finetuning datasets. The LIMA training and test sets have high-quality samples of open-ended questions and multi-paragraph answers written in the tone of a general-purpose AI assistant. The MPT Instruct-v1 and MPT Instruct-v3 training (and test sets) contain trivia-like questions and answers that tend to be shorter than one paragraph. In our paper, we explored the differences between these datasets in detail; each of the three datasets is briefly described below.
LIMA Dataset: The LIMA training set contains 1,000 samples (750,000 tokens) curated from Reddit, Stack Overflow, wikiHow, Super-NaturalInstructions, and examples manually written by the paper authors. The examples were selected after strict filtering to ensure diversity and quality. LIMA’s authors sampled an equal number of prompts from various categories within StackExchange (programming, math, English, cooking, etc.) and selected the top answer for each prompt, which then went through additional filtering based on length and writing style. In this study, we only used the single-turn examples.
You can get a sense of what kind of questions and responses are in this dataset by looking at a clustered version of maintained by Lilac AI:
Browsing this dataset reveals creative writing prompts, such as “Write a poem from the perspective of a dog,” marketing questions such as “How to make your LinkedIn profile stand out?” as well as programming questions, relationship advice, fitness and wellness tips, etc. Most of the curated responses span multiple paragraphs.
MPT Instruct-v1 Dataset (a.k.a “Dolly-HHRLHF”): This training set was used to train the MPT-7BInstruct model. The MPT Instruct-v1 dataset contains the Databricks Dolly-15k dataset and a curated subset of Anthropic’s Helpful and Harmless (HH-RLHF) datasets, both of which are open source and commercially licensed. Mosaic’s MPT-7B-Instruct model was finetuned using this dataset. It contains 59.3k examples; 15k are derived from Dolly-15k and the rest are from Anthropic’s HH-RLHF dataset. Dolly-15k contains several classes of prompts including classification, closed-book question answering, generation, information extraction, open QA, and summarization. Anthropic’s HH-RLHF dataset contains crowd-sourced conversations of workers with Anthropic’s LLMs. Only the first turn of multi-turn conversations was used, and chosen samples were restricted to be helpful and instruction-following in nature (as opposed to harmful).
MPT Instruct-v3 Dataset: This training set was used to train the MPT-30B-Instruct model. It contains a filtered subset of MPT Instruct-v1, as well as several other publicly available datasets: Competition Math, DuoRC, CoT GSM8k, Qasper, SQuALITY, Summ Screen FD and Spider. As a result, Instruct-v3 has a large number of reading comprehension examples, where the instructions contain a long passage of text followed by questions related to the text (derived from DuoRC, Qasper, Summ Screen FD, SQuALITY). It also contains math problems derived from CompetitionMath and CoT GSM8K, as well as text-to-SQL prompts derived from Spider. Instruct-v3 has a total of 56.2k samples. Both Instruct-v1 and Instruct-v3 were purpose-built to improve performance on traditional NLP benchmarks.
You can also get a sense of how different instruct-v3 is from the LIMA dataset by looking at topic clusters here:
This dataset contains thousands of math problems such as “Solve for x: 100^3 = 10^x” as well as questions on plot summaries, sports and entertainment trivia, and history trivia such as “What year did the first cold war start?” When skimming through the examples, it is clear that multiple examples are often included in the instructions (i.e. a form of in-context learning) and that correct responses are usually shorter than a single sentence.
We first asked whether LIMA’s 1,000 sample dataset could deliver on its promise: could we finetune a MPT base model on LIMA that delivers optimal performance on traditional NLP benchmarks AND performs well when evaluated by an LLM? We finetuned MPT-7B and MPT-30B base models on LIMA and evaluated the resulting models using Mosaic’s Eval Gauntlet as well as AlpacaEval’s model-based evaluation suite with GPT-4 as the “judge” (Figures 3 and 4). While the resulting models were judged favorably by GPT-4, we found that they did not perform on par with MPT-7B and MPT-30B trained on much larger instruction finetuning datasets (Instruct-v1 and Instruct-v3, respectively).
We suspected that the LIMA dataset was slightly out-of-domain with respect to MMLU and BIGbench, so we investigated whether a random subset of 1,000-5,000 “in-domain” samples from Instruct-v1 and Instruct-v3 could reach parity on the Eval Gauntlet with the full datasets. We were pleasantly surprised to find that this small subset of finetuning data had similar performance on the Eval Gauntlet, corroborating the general small-sample approach of LIMA (Figure 3). However, these same models did poorly when evaluated by GPT-4 (Figure 4).
Finally, we considered whether we could get the best of both worlds—i.e. get good performance on both evaluation paradigms—by finetuning on a subset of a few thousand Instruct and LIMA samples (Figure 5). We found that this indeed led to good performance across both paradigms. While we were initially skeptical that effective finetuning could be achieved with less than 1,000 samples, our results replicated LIMA and built on the “less is more” approach to style alignment.
The shift away from instruction finetuning on larger and larger datasets was catalyzed by the open-source release of the LLaMA models and by the closed-source launch of GPT-3 and chatGPT. AI researchers quickly realized that open source LLMs such as LLaMA-7B could be effectively finetuned on high-quality instruction following data generated by state-of-the-art GPT models.
For example, Alpaca is a 7 billion parameter LLaMa model finetuned on 56,000 examples of question-response samples generated by GPT-3 (text-davinci-003). The authors of this study found that Alpaca responded in a similar style to the much larger GPT-3 model. While the methods used in the Alpaca study were somewhat questionable (the human preference evaluation was done by the 5 authors themselves), further finetuning studies such as Vicuna, MPT-7B-chat, Tülü, Baize, Falcon-40B arrived at similar conclusions. Unfortunately, results across many of these papers are based on different evaluation paradigms, making them difficult to compare side-by-side.
Fortunately, additional research has begun to fill out the full picture. The Tülü paper “How Far Can Camels Go? Exploring the State of INstruction Tuning on Open Resources” (Wang et al. 2023) argued that finetuned LLMs should be tested using traditional fact-recall capabilities with benchmarks like MMLU, along with model-based evaluation (using GPT-4) and crowd-sourced human evaluation. Both the Tülü and the follow up Tülü 2 paper found that finetuning LLaMA models over different datasets promotes specific skills and that no one dataset improves performance over all evaluation paradigms.
In “The false promise of imitating proprietary LLMs” (Gudibande et al. 2023), the authors showed that while finetuning small LLMs with “imitation” data derived from ChatGPT conversations can improve conversational style, this method does not lead to improved performance on traditional fact-based benchmarks like MMLU and Natural Questions. However, they did note that training on GPT-4-derived “imitation” data in the domain of Natural-Questions-like queries improves performance on the Natural Questions benchmark. Finally, the “AlpaGasus: Training A Better Alpaca with Fewer Data” (Chen et al. 2023) and “Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning” (AlShikh et al. 2023) directly addressed the question of how many finetuning examples are necessary for good downstream performance. Our results align nicely with the above studies.
Acknowledgements and Code
This line of work was inspired by LIMA as well as the Tulu papers from AI2 (How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources and Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2).
LLM Model Weights
This research was led by Aditi Jha and Jacob Portes with advice from Alex Trott. Sam Havens led the development of Instruct-v1 and Instruct-v3 datasets, and Jeremy Dohmann developed the Mosaic evaluation harness.
The work detailed in this blog was performed for research purposes, and not for Databricks product development.