5th of July Updates
alexandersimoes commented
5th of July Updates
alebjanes commented
Updates on my side:
- Evaluation set: the 100-question evaluation set we'll use across all approaches is ready, along with the correct answers and the correct values that should appear in those answers.
- New content in the corpus: for the RAG to work, I added the content needed to answer these questions to the corpus (for all available years). This includes content for broad product categories (like dairy, salmon, chicken, etc.) that are composed of multiple HS codes.
- RAG evaluation: I'm now running the RAG evaluation, which takes each of these 100 questions, fetches the top k results using similarity search, and passes them as context to an LLM. I'll try a few combinations, varying the number of top k results (5 or 10), the embedding model, and the final LLM (to evaluate the cost of using gpt-4 versus gpt-3.5 here). As an initial result, the first evaluation got 81/100 questions correct.
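For reference, a minimal sketch of what such an evaluation loop could look like. This is not the actual pipeline: `toy_embed` and `toy_answer` are hypothetical stand-ins for the real embedding model and LLM call, the corpus rows and their values are invented toy data, and "correct" is assumed to mean that the expected value appears in the reply.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, corpus: List[Tuple[str, Vector]], k: int) -> List[str]:
    """Return the k corpus texts most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def evaluate(
    questions: List[Tuple[str, str]],
    corpus: List[Tuple[str, Vector]],
    embed: Callable[[str], Vector],
    answer: Callable[[str, List[str]], str],
    k: int,
) -> float:
    """Fraction of questions whose expected value appears in the reply."""
    correct = 0
    for question, expected_value in questions:
        context = top_k(embed(question), corpus, k)
        reply = answer(question, context)
        if expected_value in reply:
            correct += 1
    return correct / len(questions)


# --- Hypothetical stand-ins so the sketch runs without any external service ---
VOCAB = ["salmon", "dairy", "2020"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a tiny vocabulary (placeholder for a real embedding model)."""
    tokens = re.findall(r"\w+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def toy_answer(question: str, context: List[str]) -> str:
    """Echo the best-ranked chunk (placeholder for the LLM call)."""
    return context[0]


# Invented toy data, not real trade values.
TEXTS = ["Salmon exports in 2020 were 111", "Dairy exports in 2020 were 222"]
CORPUS = [(text, toy_embed(text)) for text in TEXTS]
QUESTIONS = [
    ("How much salmon was exported in 2020?", "111"),
    ("How much dairy was exported in 2020?", "222"),
]

accuracy = evaluate(QUESTIONS, CORPUS, toy_embed, toy_answer, k=1)
print(f"accuracy: {accuracy:.2f}")
```

Keeping `embed`, `answer`, and `k` as parameters makes it easy to loop over the combinations mentioned above (top k of 5 or 10, different embedding models, different final LLMs) and compare the resulting accuracies.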
pippo-sci commented
Fine-tuning results with both a random sample and Ale's test set (on the RAG-only questions):
Model | Accuracy |
---|---|
TinyLlama, 1 epoch | 0% |
TinyLlama, 10 epochs | 0% |
TinyLlama, 50 epochs | 0% |
Llama2, 1 epoch | 0% |
The main issue is that the model learns the text around the numbers but gets the numbers themselves wrong. In fact, it returns a different number every time it is queried. As a side effect, the TinyLlama models lost their ability to answer other inputs.
Next steps:
- Apply another metric that measures how far off the values are, to check whether there is any difference between models
- Test fine-tuning of API URLs
- Test with a stripped-down version of the multilayer approach
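One possible shape for the "how far off" metric from the first item, as a rough sketch only: relative error between the first number found in the model output and the expected value. The function names and the number-extraction regex are assumptions for illustration, not something already in the codebase, and taking the first number in the text is a simplification (a year such as 2020 appearing before the value would be picked up instead).

```python
import re
from typing import List, Optional

# Matches integers or decimals, with optional thousands separators, e.g. "1,250.5".
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_number(text: str) -> Optional[float]:
    """Return the first number found in the text, or None if there is none."""
    match = NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def relative_error(predicted_text: str, expected: float) -> Optional[float]:
    """|predicted - expected| / |expected|, or None if no number was produced."""
    value = extract_number(predicted_text)
    if value is None:
        return None
    if expected == 0:
        # Relative error is undefined for a zero target; treat any miss as infinite.
        return 0.0 if value == 0 else float("inf")
    return abs(value - expected) / abs(expected)


def mean_relative_error(outputs: List[str], expected: List[float]) -> Optional[float]:
    """Average relative error over the outputs that contain a number."""
    errors = [relative_error(out, exp) for out, exp in zip(outputs, expected)]
    scored = [err for err in errors if err is not None]
    return sum(scored) / len(scored) if scored else None


# Invented example outputs, not real model results.
print(relative_error("Exports were about 90 million", 100.0))
print(relative_error("I do not know", 100.0))
```

Unlike exact-match accuracy, which is 0% for every model in the table, an average of this error per model could show whether longer training moves the numbers closer to the targets even when they are never exactly right.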