Thoughts, summaries and notes on recently read research papers. A research tinkering place, if you will. Summaries are below in the README, and the notes for each paper are in the papers directory.
- Read: Feb 2024
- Institution: Pennsylvania State University and Temple University
- Mental Reference: Aggregation of math based approaches for using LLMs to solve math.
- Link: https://arxiv.org/pdf/2402.00157.pdf
There has not been a comprehensive study of which methods are currently used to help LLMs do math, and how effective they are.
They checked multiple methods such as:
- Raw prompting frozen SOTA LLMs: GPT-3, ChatGPT, GPT-4, GPT-4V and Bard.
- Strategies enhancing frozen LLMs: Methods such as Chain of Thought, using words instead of numbers (An et al., 2023a) and using external tools.
- Fine-tuned LLMs: (1) providing in-context examples to assist models such as GPT-3 that struggle without them, (2) generating intermediate steps using a "scratchpad", (3) learning from an enhanced dataset, and (4) teacher-student knowledge distillation, where, in short, a teacher model assesses what the student model is lacking and generates examples to improve that area.
My takeaways:
- Although current methods such as using adversarial samples in few-shot prompts can make predictions more robust, they are still not grounded, especially for longer or more complex problems.
- They mention that "token frequency in pre-training and the method of tokenization are key to arithmetic proficiency", which seems to me to mean you would need specialised tokenization to do math. That might help if math is the only task the LLM should do, but from an engineering/deployment perspective it doesn't feel very efficient.
- Including code/LaTeX in pre-training apparently gives a decent improvement in arithmetic, but the results are still not grounded.
- The thing that annoys me most about these results is that prompting makes the models more robust, which inherently means you need to guide them. Reference: "The nature of input prompts greatly affects LLMs’ arithmetic performance (Liu et al., 2023a; Lou et al., 2023). Without prompts, performance drops (Yuan et al., 2023). Models like Chat-GPT, which respond well to instructional system-level messages, demonstrate the importance of prompt type. Instruction tuning in pre-training also emerges as a significant factor (Yue et al., 2023)."
- This paper supports my thesis that we need an external tool to do math rather than relying on the model's cognitive ability to do arithmetic.
- Although the models can become more robust through some of the specified techniques, they are still not grounded; my hypothesis is that this changes when we introduce calculators into the equation instead of relying on the LLM's non-deterministic cognitive ability.
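To make that thesis concrete, here is a minimal sketch (my own, not from the paper) of offloading arithmetic to a deterministic tool: the LLM is only asked to translate the word problem into an expression, and Python does the actual calculation.

```python
import ast
import operator

# Hypothetical illustration: the LLM emits an arithmetic expression,
# and we evaluate it deterministically instead of trusting the model's arithmetic.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression like '12 * (7 + 3)'."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression: {expr}")
    return _eval(ast.parse(expr, mode="eval"))

# The model translates "he had 4 apples and now has 2" into "(4 - 2)";
# the calculator does the rest.
print(safe_eval("(4 - 2) * 13 + 7"))  # 33
```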
- Read: Jan 2024
- Published: Feb 2023
- Institution: Meta AI
- Mental Reference:
- Link:
To give LLMs the ability to use tools in a reliable manner, previous approaches rely on human-annotated examples or limit tool use to task-specific settings.
There is a need for something that learns to use tools in a:
- Self-supervised way: removing the need for large amounts of human-annotated examples, which helps with cost, and also because what humans find useful might not be useful to an LLM for the task.
- General way: deciding for itself when and why to use a tool, as this enables a more comprehensive use of tools that are not tied solely to one task.
The result is an LLM fine-tuned, in a self-supervised way, on data it has annotated itself with tool/API calls.
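The in-text annotation format from the paper looks like `[Calculator(400 / 1400) → 0.29]`. Below is a toy sketch (my own code, not the paper's) of how such inline calls could be parsed and executed so the model can condition on the tool result when continuing the text.

```python
import re

def execute_calculator_calls(text: str) -> str:
    """Replace '[Calculator(expr)]' markers with '[Calculator(expr) -> result]'
    so the model can condition on the tool output."""
    def _run(match: re.Match) -> str:
        expr = match.group(1)
        result = eval(expr, {"__builtins__": {}})  # toy only; never eval untrusted input
        return f"[Calculator({expr}) -> {round(result, 2)}]"
    return re.sub(r"\[Calculator\(([^)]*)\)\]", _run, text)

annotated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test."
print(execute_calculator_calls(annotated))
# Out of 1400 participants, 400 [Calculator(400 / 1400) -> 0.29] passed the test.
```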
Paper: "Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis"
- Read: Jan 2024
- Published: Oct 2022
- Institution: Cambridge, MIT, Stanford, Carnegie Mellon
- Mental Reference:
- Link:
We need to know and understand when we can reliably use the prediction, and when we can't.
- They look at (minimising) the calibration error in the LLM pipeline. Calibration error is the difference between the predicted confidence and the actual accuracy of the prediction; minimising it therefore gives us more reliable predictions.
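A common way to quantify that gap is the expected calibration error (ECE): bin predictions by confidence and take the weighted average of |accuracy - confidence| per bin. A minimal sketch of the standard formulation (my own code, not the paper's):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Perfectly calibrated predictions would give an ECE close to 0.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```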
They give 4 recommendations based on their work:
- Use ELECTRA for encodings:
- Use larger models if possible:
- Use Temp Scaling as the uncertainty quantifier:
- Use Focal Loss for fine-tuning:
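Temperature scaling (recommendation 3) is a one-parameter fix: learn a scalar T on a validation set and divide the logits by it before the softmax. A rough PyTorch sketch of that idea (my own, hedged; the paper evaluates it among other quantifiers):

```python
import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    """Post-hoc calibration: divide logits by a single learned temperature."""
    def __init__(self):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(1))  # T = exp(log_t) stays positive

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return logits / self.log_t.exp()

def fit_temperature(scaler: TemperatureScaler, val_logits, val_labels, max_iter: int = 200):
    """Fit T by minimising NLL on held-out validation logits."""
    optimizer = torch.optim.LBFGS([scaler.log_t], lr=0.1, max_iter=max_iter)
    loss_fn = nn.CrossEntropyLoss()
    def closure():
        optimizer.zero_grad()
        loss = loss_fn(scaler(val_logits), val_labels)
        loss.backward()
        return loss
    optimizer.step(closure)
    return scaler
```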
- Incorporating, analysing and minimising the calibration error seems to be a good way of understanding the uncertainty of our predictions.
- The 2nd point, use larger models if possible, made sense in 2022. I would probably still keep the newer, relatively smaller models from Mistral in the loop at evaluation, but compare them to an adjusted GPT-4 model.
- Read: Jan 2024
- Institution: Microsoft Research
- Mental Reference: Open source framework for Agents to converse, with or without humans in the loop, to solve tasks. Can have a hierarchical structure with a "boss agent", or simply back-and-forth chat.
- Link:
- Single-agent setups have issues with divergent thinking (Liang et al., 2023), factuality and reasoning (Du et al., 2023), and providing validation (Wu et al., 2023).
- The authors ask: how can we facilitate the development of LLM applications that could span a broad spectrum of domains and complexities based on the multi-agent approach?
A multi-agent framework in which agents converse with each other to solve tasks. They highlight that this works well due to recent developments:
- LLMs (such as GPT-4) that have been chat-optimised have shown they can incorporate feedback. Usually this feedback comes from humans in a chat-based format, but why wouldn't it work to have another agent provide the feedback instead, to provide and seek reasoning, observations, critiques, and validation?
- LLMs have been shown (with the right prompting) to perform well on many types of domain tasks, making them flexible conversation partners.
- LLMs have been shown to be better at digesting smaller sub-tasks (like humans :) ), which multi-agent partitioning can help with.
Link: example below (more examples are available in the AutoGen repo).
```python
import autogen

# Assumes an OAI_CONFIG_LIST file/env var with your API keys (AutoGen convention).
config_list_gpt4 = autogen.config_list_from_json("OAI_CONFIG_LIST")
llm_config = {"config_list": config_list_gpt4, "cache_seed": 42}

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={"last_n_messages": 2, "work_dir": "groupchat"},
    human_input_mode="TERMINATE",
)
coder = autogen.AssistantAgent(
    name="Coder",
    llm_config=llm_config,
)
pm = autogen.AssistantAgent(
    name="Product_manager",
    system_message="Creative in software product ideas.",
    llm_config=llm_config,
)
groupchat = autogen.GroupChat(agents=[user_proxy, coder, pm], messages=[], max_round=12)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Start the group chat; type "exit" to terminate.
user_proxy.initiate_chat(
    manager,
    message="Find a latest paper about gpt-4 on arxiv and find its potential applications in software.",
)
```
- This might be a step in the right direction: you don't leave full autonomy over which agents will solve the problem; instead we specify the types of "helpers" (such as a PM agent + analyst) and then let them converse and solve it. This also helps with troubleshooting, as it shows which agent is not doing its job, similar to how we analyse human performance.
- Might be a good thing to benchmark against, as this performs better using the built-in agents from AutoGen compared to Multi-Agent Debate (Liang et al., 2023), LangChain ReAct (LangChain, 2023), vanilla GPT-4, and commercial products ChatGPT + Code Interpreter and ChatGPT + Plugin (Wolfram Alpha), on the MATH (Hendrycks et al., 2021) dataset.
- Be wary of inference costs here. Endless chats can cost a significant amount (at least if using API calls). Could work when hosting the model on AWS, but would take up a fair amount of resources, so need to check whether the actual output is better than simpler alternatives.
Paper: "Medprompt: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine"
- Read: Jan 2024
- Institution: Microsoft, OpenAI
- Mental Reference: Making foundation models into specialist models using general prompt engineering techniques.
- Link:
Retraining and fine-tuning foundation models is expensive. Researchers are looking for ways to take on less up-front cost while making models applicable to specific domain use-cases, such as medicine in this paper.
Medprompt = three techniques combined into one:
- Dynamic few-shot selection: showing a few demonstrations helps the model adapt to a specific domain and learn to follow the task format, but hand-crafted examples are difficult to migrate across multiple types of problems. Medprompt uses k-NN in the embedding space (using `text-embedding-ada-002`) to take k examples from the training set. Basically they pick the examples for the model automatically, without any hand-curated examples (see the sketch after this list).
- Self-generated chain of thought: an automated way of creating chain of thought, by asking GPT-4 to generate CoT using the prompt: `## Question: {{question}} \n {{answer_choices}} \n ## Answer \n model generated chain of thought explanation. \n Therefore, the answer is [final model answer (e.g. A,B,C,D)]`. This seems to be better than hand-curated examples by experts, as GPT-4 gives longer and finer-grained step-by-step reasoning, which is in alignment with other recent research findings that foundation models write better prompts than experts.
- Choice shuffle ensembling: multiple-choice questions can suffer from position bias, where the model favours certain options regardless of their content. This technique shuffles the answer options and checks that the model is consistent across orderings: choice shuffle plus self-consistency prompting.
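A rough sketch of the dynamic few-shot selection step (my own code; `embed_fn` is a hypothetical stand-in for a call to `text-embedding-ada-002`):

```python
import numpy as np

def select_few_shot(question: str, train_questions: list[str], train_examples: list[dict],
                    embed_fn, k: int = 5) -> list[dict]:
    """Pick the k training examples whose questions are closest to the test
    question in embedding space (cosine similarity). embed_fn is assumed to map
    a list of strings to an (n, d) array, e.g. via text-embedding-ada-002."""
    train_emb = np.asarray(embed_fn(train_questions))          # (n, d)
    q_emb = np.asarray(embed_fn([question]))[0]                # (d,)
    train_emb = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q_emb = q_emb / np.linalg.norm(q_emb)
    scores = train_emb @ q_emb                                 # cosine similarities
    top_k = np.argsort(-scores)[:k]
    return [train_examples[i] for i in top_k]
```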
- Challenge 1: Section 4.4 shows that all of the proposed techniques carry significant inference costs, so high-frequency API use cases would have a hard time adopting all three of them.
- Challenge 2: This looks good for multiple choice specifically; for instance, the choice shuffle ensembling technique is tailored towards optimising multiple-choice performance. It needs to be tested on more cognitive tasks, though I can imagine that the self-generated CoT and the dynamic few-shot approaches might generally help on other tasks.
- Benchmark against agent-based approaches: it would be worth testing whether this gives better results than agent-based approaches, as those carry a fair bit of inference cost.
- Read: Jan 2024
- Institution: UC Berkeley, Open Philanthropy
- Mental Reference: Summary of current post-training enhancements to improve LLM performance (such as CoT), quantified on a compute to performance basis.
- Link: https://arxiv.org/pdf/2312.07413.pdf
It is difficult to say what an 8% improvement on GSM8K means for domains outside this benchmark. As the authors say: "It is hard to meaningfully compare the benefits of post-training enhancements that apply to different domains. For example, how does 10% greater accuracy on the MATH benchmark (Hendrycks et al., 2021) compare to 10% greater accuracy in a multiple choice knowledge test, or to 10% lower perplexity in a language modeling task?"
They translate performance gains from different benchmarks into a "common currency": Common Compute Gain: "how much additional training compute would have been needed to improve benchmark performance by as much as the post-training enhancement".
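A toy illustration of that conversion (my own sketch with made-up numbers, assuming a simple log-compute scaling fit; the paper derives its estimates from published results):

```python
import numpy as np

def compute_equivalent_gain(compute, score, enhanced_score):
    """Toy CEG estimate: fit score ~ a + b*log10(compute) from observed
    (compute, score) points, then ask how much compute the baseline recipe
    would need to reach the enhanced score."""
    b, a = np.polyfit(np.log10(compute), score, deg=1)
    baseline_compute = compute[-1]
    needed_compute = 10 ** ((enhanced_score - a) / b)
    return needed_compute / baseline_compute  # e.g. 5x = "worth 5x more training compute"

# Hypothetical numbers: accuracy grows with log-compute; an enhancement adds +8 points.
compute = np.array([1e20, 1e21, 1e22])
score = np.array([40.0, 50.0, 60.0])
print(compute_equivalent_gain(compute, score, enhanced_score=68.0))  # ~6.3x
```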
- Great visualisation of how the different approaches compare and which solutions are more cost-efficient.
- Helps to show that current methodologies with improved results might carry a significant compute cost, which might not be worth the performance gain depending on budget and performance requirements.
- Look and verify whether you can apply and ensemble some cost-efficient solutions together in the LLM production projects and research.
- Agents show an ~80x inference cost (LATS on HumanEval). Take this into budget considerations for agent-based solutions.
- For programming tasks (SQL/python) check out "Parcel", quote: "The model decomposes a complex task into natural language function descriptions, generates modular implementations for each, and searches over combinations of these implementations by testing against constraints." https://arxiv.org/abs/2212.10561
- Few-shot, LATS and CoT are great in terms of adding no extra training compute. However, it is important to notice that the additional runtime cost goes up 10-100x, so for production environments these might not be the optimal long-term solutions.
- Majority voting is both high in added compute as well as additional runtime cost compared to performance improvements. Need to deep dive why.
- Look at Category 3 for Agent enhancements to verify your Agent projects and the current methodologies:
- Tool enhancements: teaching an AI system to use new tools, like a web browser.
- Prompting enhancements: changing the text-based input to the model to steer its behavior and reasoning, e.g. including an example response to a similar question.
- Scaffolding enhancements: programs that structure the model’s reasoning and the flow of information between different copies of the model (e.g. producing AI agents).
- Solution choice enhancements: techniques for generating and then choosing between multiple candidate solutions to a problem.
- Data enhancements: techniques for generating more, higher-quality data for fine-tuning.
- On the 2nd page they mention they have not conducted these experiments themselves but instead relied on the results from other research papers -> they don't have high confidence in this metric they have created. Quote: "we don’t have high confidence in each individual CEG estimate, but we think that in aggregate they are informative about the typical benefit produced by an enhancement." -> Take these with a pinch of salt, as some might not have accurate compute estimates.
- Read: Jan 2024
- Institution: Google Deepmind
- Summary: Using LLMs to figure out what the optimal prompt is > prompt engineering.
People today have to figure out themselves what the optimal prompt is by doing prompt engineering for a specific task, which is time consuming and slightly random.
Intuitive example for prompt optimization, quote: "the initial instruction is “Let’s solve the problem” with a (approximated, and same below) training accuracy of 60.5. We observe that the optimization curve shows an overall upward trend with several leaps throughout the optimization process, for example:
- “Let’s think carefully about the problem and solve it together.” at Step 2 with the training accuracy 63.2;
- “Let’s break it down!” at Step 4 with training accuracy 71.3;
- “Let’s calculate our way to the solution!” at Step 5 with training accuracy 73.9;
- “Let’s do the math!” at Step 6 with training accuracy 78.2.
Prompt Optimization experimentation setup and eval on GSM8K: sample 3.5% random questions of the GSM8K training set, and evaluate on full test set.
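A stripped-down sketch of the optimization loop as I understand it (the `optimizer_llm` and `score_on_train_set` callables are hypothetical stand-ins, not the paper's code):

```python
def optimize_prompt(optimizer_llm, score_on_train_set, n_steps: int = 8,
                    candidates_per_step: int = 4):
    """OPRO-style loop: show the optimizer LLM previous instructions with their
    scores, ask for new candidates, score them, repeat, and keep the best."""
    scored = [("Let's solve the problem.", score_on_train_set("Let's solve the problem."))]
    for _ in range(n_steps):
        history = "\n".join(f"text: {p}\nscore: {s:.1f}"
                            for p, s in sorted(scored, key=lambda x: x[1]))
        meta_prompt = (
            "Here are instructions with their training accuracies, lowest to highest:\n"
            f"{history}\n"
            "Write a new instruction that is different from the ones above "
            "and achieves a higher accuracy."
        )
        for _ in range(candidates_per_step):
            candidate = optimizer_llm(meta_prompt)
            scored.append((candidate, score_on_train_set(candidate)))
    return max(scored, key=lambda x: x[1])
```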
- Semantic Optimisation Paper: smart that they first try it on the travelling salesman problem before testing on prompt optimisation, to see its real optimisation capabilities. Consider adding this to the paper.
- Production environments: if it's a repetitive task that will occur >10M times a week, finding the most optimal prompt by itself could potentially be really helpful and reduce time on prompt engineering.
- Tested on GSM8K; the 8% improvement can serve as a benchmark for how much our approach should improve the score.
- Improvement: the generated prompts feel a bit lackluster ("Take a deep breath and work on this problem step-by-step.") compared to their baseline, the classic "Let's think step by step." They need to be tested on more unique problems to see the benefit.
- Was there any prompt that was "novel" at any point? Or was their best optimisation simply generating "Take a deep breath and work on this problem step-by-step.", which doesn't feel too novel compared to simply giving this directly to the model?
- In section 5.1 for models, why did they pick specifically these models for the optimizer and scorer? Reference: "• Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4. • Scorer LLM: Pre-trained PaLM 2-L and text-bison. With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer.
- Read: Oct 2023
- Institution: Microsoft Research
- Summary: How to create Small Language Models (SLMs) efficiently with high quality data generated by LLMs.
- Link: https://arxiv.org/pdf/2311.11045.pdf
Task diversity and data scaling: models trained on ChatGPT data often capture the style of the teacher, but the imitation is less about logic/reasoning and more about surface form.
The key contributions of the paper are:
- Explanation Tuning: they basically do the same ⟨query, response⟩ style, but make GPT-4 give detailed explanations of the logic/reasoning as well, for instance: "explain like I'm five, think step by step and justify your response" (a toy record is sketched after this list). Quote: "We augment ⟨query, response⟩ pairs with detailed responses from GPT-4 that explain the reasoning process of the teacher as it generates the response. These provide the student with additional signals for learning. We leverage system instructions (e.g.., explain like I’m five, think step-by-step and justify your response, etc.) to elicit such explanations. This is in contrast to vanilla instruction tuning, which only uses the prompt and the LFM response for learning, providing little opportunity for mimicking the LFM’s “thought” process."
- Scaling and Task Instructions: they use the FLAN-v2 collection (paper link) and, out of tens of millions of instructions, sample from the task collection to form a diverse mix of tasks. They collect 5 million ChatGPT responses and sample 1 million of them to acquire GPT-4 responses, demonstrating how ChatGPT as a teacher assistant helps in progressive learning. Quote: "We utilize the Flan 2022 Collection [19] as it provides an extensive public assortment of tasks and instructions. Particularly, we use FLAN-v2, supplemented with high-quality templates, advanced formatting patterns, and data augmentations. Even though FLAN holds tens of millions of instructions, we selectively sample from the task collection to form a diverse mixture of tasks, which we then further sub-sample to generate complex prompts. These prompts are used to query LFMs like ChatGPT and GPT-4, thus creating a rich and diverse training set. We collect 5 million ChatGPT responses, from which 1 million is further sampled to acquire GPT-4 responses. We demonstrate how ChatGPT as a teacher assistant helps in progressive learning."
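To make the explanation-tuning idea concrete, a hypothetical training record in that style (the format and content are my own illustration, not the paper's released data):

```python
# Hypothetical explanation-tuning record: the system instruction elicits the
# teacher's step-by-step reasoning, which the student is then trained to imitate.
example = {
    "system_instruction": "You are a helpful assistant. Think step-by-step and justify your response.",
    "query": "If a train travels 120 km in 2 hours, what is its average speed?",
    "response": (
        "Average speed is total distance divided by total time. "
        "The train covers 120 km in 2 hours, so 120 / 2 = 60. "
        "Therefore, the average speed is 60 km/h."
    ),
}
```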
Not applicable directly, as we're not dealing with Small Language Models (SLMs) at the moment. But, good for future reference on when developing SLMs further.
- Read: Nov 2023
- Mental Reference: HuggingFace making a smaller but more efficient model. Zephyr is Mistral with AIF + dDPO.
distilled Supervised Fine-Tuning (dSFT) models do not respond well to "natural prompts". dSFT models (like Alpaca and Vicuna) are trained on instruction-based datasets, and because they have been trained/fine-tuned on instruction-style data, they respond less well to "natural" prompts. For instance, we rarely phrase a request to ChatGPT as a formal instruction; we simply ask: "How do I select the 5th column in a pandas dataframe?"
link to Alpaca data: link
Alpaca-style example fine-tuning data (instruction/input/output format):
{
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
}
Use preference data from AI Feedback to improve dSFT.
What does this actually mean/do? From the paper: "The main step is to utilize AI Feedback (AIF) from an ensemble of teacher models as preference data, and apply distilled direct preference optimization as the learning objective (Rafailov et al., 2023). We refer to this approach as dDPO. Notably, it requires no human annotation and no sampling compared to using other approaches like proximal preference optimization (PPO) (Schulman et al., 2017). Moreover, by utilizing a small base LM, the resulting chat model can be trained in a matter of hours on 16 A100s (80GB)" = ~10k per GPU.
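For reference, the dDPO objective is the standard DPO loss applied to AI-generated preference pairs. A minimal PyTorch sketch of that loss (my own, following Rafailov et al., 2023; not the Zephyr training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO: push the policy to prefer the chosen response over the rejected one
    relative to the frozen reference model, scaled by beta."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Log-probabilities of whole responses under the policy and reference models.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```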
Know how to apply (direct) distilled Supervised Fine-Tuning and fine-tune w. preference data from AI feedback.
Paper: “Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation”
- Read: Sep 2023
- Institution: Oxford OATML
- Summary: Oxford making sure that semantically equivalent answers lower the entropy, since "Sweden" has the same meaning as "It's Sweden".
Hard to evaluate LLMs due to "semantic equivalence": "It's Paris" and "Paris" are not the same answer for regular LLM uncertainty estimates, so the entropy stays high even when the meaning is the same.
Proposes measuring uncertainty with "semantic entropy", which incorporates linguistic invariances created by shared meanings. So basically: "It's Paris" and "Paris" count as the same answer, lowering the entropy.
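A rough sketch of the idea (my own code; `same_meaning` is a hypothetical stand-in for the bidirectional entailment/NLI check the paper uses): cluster sampled answers by meaning, then compute entropy over meaning clusters instead of over surface strings.

```python
import math

def same_meaning(a: str, b: str) -> bool:
    """Hypothetical stand-in for a bidirectional entailment check (NLI model)."""
    a_core, b_core = a.strip().lower().rstrip("."), b.strip().lower().rstrip(".")
    return a_core in b_core or b_core in a_core

def semantic_entropy(samples: list[str]) -> float:
    """Group sampled answers into meaning clusters, then compute entropy over clusters."""
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            if same_meaning(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    probs = [len(c) / len(samples) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# "Paris" and "It's Paris." land in one cluster -> entropy 0,
# unlike a naive estimate that treats them as different answers.
print(semantic_entropy(["Paris", "It's Paris.", "Paris"]))
```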
This feels very generally useful for most applications, but still need to understand the implementation of the code better to see how complex this is to apply for production environments.
- Read: Sep 2023
- Institution: Oxford OATML
Users give ambiguous questions that are hard for LLMs to answer with certainty, as the model doesn't know what is being asked, so it gives the most likely answer instead of asking a question back (humans ask vague questions such as "when did he land on the moon?" when meaning to ask "when did Alan Bean land on the moon?").
An LLM framework for asking clarifying questions in response to ambiguous questions.
In applied operations you want something to be able to handle this possible ambiguity in the user input, especially in a production environment.
- Read: Oct 2023
Improve the concept of "Chain of Thought" prompting for LLM outputs, as CoT is not grounded in the external world and uses only the model's internal representation to generate reasoning. This limits its ability to reactively explore and reason, or to update its knowledge.
Basically gives a prompt template to the LLM to show a pattern of Reasoning first, then Actioning it. It also helps with interpretability as you can see how the model is reasoning through to the answer.
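The interleaved format looks roughly like this (paraphrased from the paper's HotpotQA-style examples; the `Search`/`Lookup` tools are from that setting, and wrapping it as a Python string is my own choice):

```python
# Illustrative ReAct-style prompt: the model alternates free-form reasoning
# ("Thought") with tool calls ("Action") and their results ("Observation").
react_prompt = """Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny and find the area that the eastern sector extends into.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building in Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector, so I need to look it up.
Action 2: Lookup[eastern sector]
Observation 2: The eastern sector extends into the High Plains.
Thought 3: So I need to find the elevation range of the High Plains.
Action 3: Search[High Plains]
...
"""
```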
- Together with LangChain, this is useful when reasoning is of high priority/importance.
- It helps make sure the model picks up the reasoning and aligns it with something closer to how we humans solve problems: reasoning about what to do before acting on it.
- Read: Oct 2023
Main goal: Improving outputs from LLMs
Two concepts as baseline:
- Few-shot learning = give a few examples in the prompt of what the output should look like, i.e. in-context few-shot learning via prompting. That is, instead of fine-tuning a separate language model checkpoint for each new task, one can simply “prompt” the model with a few input–output exemplars demonstrating the task → but this doesn't work that well when the task requires a bit of reasoning.
- Arithmetic reasoning = spell out the logic of how to solve the problem. It has been shown that arithmetic reasoning can benefit from natural language rationales (“He had 4 apples, now he has 2, how many did he lose? Well, it was 4 and now it's 2, so he lost 4 - 2 = 2.”) → but it is costly to develop these high-quality rationales.
- CoT: combines the two by showing a few examples that include the thought process.
- Hand-hold the LLM by giving it examples of both the reasoning and how it would look in different scenarios.
Template:
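For instance, the often-quoted tennis-ball exemplar from the paper; wrapping it as a reusable Python string is my own choice:

```python
# Chain-of-thought exemplar in the style of Wei et al. (2022): the few-shot
# example demonstrates the reasoning steps, not just the final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {new_question}
A:"""
```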
- Read:
- Institution:
- Mental Reference:
- Link: