llm_comparison


A simple comparison of various LLMs.

The models

  • tiiuae/falcon-7b-instruct
  • mosaicml/mpt-7b-chat
  • meta-llama/Llama-2-7b-chat-hf
  • openai/gpt-3.5-turbo

The infrastructure

We used openai/gpt-3.5-turbo as a reference model, but keep in mind that it is not a 7B model and was tested via the commercial API.

On the other hand, tiiuae/falcon-7b-instruct, mosaicml/mpt-7b-chat and meta-llama/Llama-2-7b-chat-hf are all 7B models and were tested on a Google Colab instance with a Tesla V100 GPU.
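As an illustration, a 7B model can be loaded in half precision on such a GPU with the Hugging Face transformers library. The snippet below is a minimal sketch, not the exact notebook code; the generation parameters are assumptions.

```python
# Minimal sketch: loading one of the 7B models in half precision on a single GPU
# (e.g. a Colab Tesla V100). The generation settings are illustrative, not the
# exact values used in the notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # or tiiuae/falcon-7b-instruct, mosaicml/mpt-7b-chat

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 keeps a 7B model within the V100's 16 GB of VRAM
    device_map="auto",          # place the weights on the available GPU
    trust_remote_code=True,     # required for MPT's custom model code
)

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                            do_sample=True, temperature=0.7)
    # Return only the newly generated tokens, not the echoed prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```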

The prompts

  • tiiuae/falcon-7b-instruct:

""" You are agent_2, a teenager. Answer agent_1 with an open-ended question and try to use the word 'orange'. Context: \n{context}agent_2:"""

  • mosaicml/mpt-7b-chat:

""" You are agent_2, a teenager. Answer agent_1 with an open-ended question and try to use the word 'orange'. Context: \n{context}agent_2:"""

  • meta-llama/Llama-2-7b-chat-hf:

""" You are agent_2, a teenager. Answer agent_1 with an open-ended question and try to use the word 'orange'. No explanation, no code, no note. Context: \n{context}agent_2:"""

  • openai/gpt-3.5-turbo:

""" You are a teenager. Answer with an open-ended question and try to use the word 'orange'.""" followed by context.

The test_set

To test the models, we extracted 50 dialogues from the Topical-Chat dataset (a rough selection sketch is shown after the list below).

  • all selected dialogues ended with a question
    • (so that we can more accurately see if the model follows the unnatural request to answer with another question)
  • all selected dialogues were cut after 3 turns
    • (user/assistant/user)
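The sketch below shows roughly how such a selection can be done; it assumes the Topical-Chat conversations JSON layout (a dict of conversations, each with a "content" list of turns carrying "message" and "agent" fields), and the file path and function name are illustrative.

```python
# Sketch of the test-set extraction. It assumes the Topical-Chat JSON layout
# (conversations keyed by id, each with a "content" list of turns that have
# "message" and "agent" fields); the path and exact filtering may differ.
import json

def extract_dialogues(path: str, n_dialogues: int = 50, n_turns: int = 3):
    with open(path) as f:
        conversations = json.load(f)

    selected = []
    for conv in conversations.values():
        turns = [(t["agent"], t["message"]) for t in conv["content"][:n_turns]]
        # keep only dialogues whose third (last kept) turn ends with a question
        if len(turns) == n_turns and turns[-1][1].strip().endswith("?"):
            selected.append(turns)
        if len(selected) == n_dialogues:
            break
    return selected

test_set = extract_dialogues("Topical-Chat/conversations/train.json")  # illustrative path
```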

@inproceedings{Gopalakrishnan2019,
  author    = {Karthik Gopalakrishnan and Behnam Hedayatnia and Qinlang Chen and Anna Gottardi and Sanjeev Kwatra and Anu Venkatesh and Raefer Gabriel and Dilek Hakkani-Tür},
  title     = {{Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations}},
  year      = {2019},
  booktitle = {Proc. Interspeech 2019},
  pages     = {1891--1895},
  doi       = {10.21437/Interspeech.2019-3079},
  url       = {http://dx.doi.org/10.21437/Interspeech.2019-3079}
}

The (simple) metrics

  • mean_inference_time: the average time (in seconds) needed to return an answer
  • mean_response_size: the average length of the answers
  • has_question: the total number of answers containing a question mark (the prompt required answering with a question)
  • has_orange: the total number of answers containing the word 'orange' (the prompt only asked to try to use it)
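All four metrics can be computed directly from the recorded answers and timings. The sketch below is a minimal version; the record structure and field names are assumptions, and response size is measured in characters here (the notebook may count differently).

```python
# Sketch of the metric computation; the record structure is an assumption.
from statistics import mean

def compute_metrics(records: list[dict]) -> dict:
    """records: one dict per test dialogue, with 'answer' (str) and 'inference_time' (s)."""
    n = len(records)
    has_question = sum("?" in r["answer"] for r in records)
    has_orange = sum("orange" in r["answer"].lower() for r in records)
    return {
        "mean_inference_time": mean(r["inference_time"] for r in records),
        "mean_response_size": mean(len(r["answer"]) for r in records),  # length in characters
        "has_question": has_question,
        "has_question_%": 100.0 * has_question / n,
        "has_orange": has_orange,
        "has_orange_%": 100.0 * has_orange / n,
    }
```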

The results

| model | mean_inference_time (s) | mean_response_size | has_question | has_question_% | has_orange | has_orange_% |
|---|---|---|---|---|---|---|
| mosaicml/mpt-7b-chat | 6.272755 | 115.62 | 26 | 52.0% | 3 | 6.0% |
| tiiuae/falcon-7b-instruct | 9.881873 | 100.22 | 26 | 52.0% | 4 | 8.0% |
| meta-llama/Llama-2-7b-chat-hf | 2.369238 | 130.74 | 34 | 68.0% | 36 | 72.0% |
| openai/gpt-3.5-turbo | 1.636220 | 142.56 | 50 | 100.0% | 8 | 16.0% |

Llama 2 seems very promising according to these metrics:

  • it has a good inference time (≈2.37 s) considering the provided infrastructure,
  • it successfully included an open-ended question in most of the answers (68%),
  • it successfully included the word 'orange' in most of the answers (72%), although the resulting sentences can sound a little odd,
  • the answers seem to take the provided context into account (human review),
  • the answers seem to be grammatically correct (human review).

Llama 2 is available for commercial use in several versions (7B, 13B, 70B; base and chat variants), and it would be interesting to try the larger models.

But don't let these metrics fool us: they are simple demo metrics and are not sufficient to really evaluate the capabilities of such models. The other 7B models have lower has_question and has_orange scores, but Llama 2's answers containing the word 'orange' often sound unnatural, so the model may be misusing the word (and we would need other metrics to determine this).