28th of June Update
Fine-tuning:
- 4 models are working on Ollama (3 TinyLlama versions trained for 1, 10, and 50 epochs)
- I was able to train a Llama2 model (1 epoch only)
- Llama.cpp deprecated some of the functionality used to convert safetensors to GGUF format
- Running evaluation on a small set locally (a query sketch follows this list)
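For reference, here is a minimal sketch of querying one of these models locally with Ollama's Python client; the model tag and the sample question are hypothetical placeholders, since the actual tags aren't listed above:

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

# "tinyllama-ft-10" is a hypothetical tag for the 10-epoch TinyLlama model.
response = ollama.chat(
    model="tinyllama-ft-10",
    messages=[{"role": "user", "content": "How much did Chile export in 2021?"}],
)
print(response["message"]["content"])
```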
Company Names: Clustering using Semi Supervised Learning
I did a proof of concept of a clustering model that uses semi-supervised learning to group company names by looking at the name and address (further variables could be included in the future).
To test it, I used the dataset that we have been manually cleaning in the spreadsheet.
I trained the model on a subset of the manually validated names:
- The companies that show up most often
- For each company, only the top 15 most repeated matches
Overall, the model saw 3910 rows of data such as:
raw_name_id | raw_name | raw_address |
---|---|---|
58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
These rows correspond to 276 manually identified companies.
Then, out of those 3910 rows (which produce 15,288,100 pairs), I trained the model in a semi-supervised setting by manually reviewing 130 pairs of rows and marking them as "the same company" (114) or "not the same company" (17).
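The write-up above doesn't name the library, but the workflow (labeling candidate pairs, clustering with confidence scores, optional Postgres backend) matches Python's dedupe package, so here is a hedged sketch under that assumption; the input file name is hypothetical:

```python
import csv

import dedupe

# Hypothetical input file: the 3910 manually validated rows described above.
with open("validated_subset.csv") as f:
    data = {row["raw_name_id"]: row for row in csv.DictReader(f)}

# Compare records on company name and address, as in the proof of concept.
variables = [
    dedupe.variables.String("raw_name"),
    dedupe.variables.String("raw_address"),
]
deduper = dedupe.Dedupe(variables)
deduper.prepare_training(data)

# Active learning: the console prompts for candidate pairs to label as
# "same company" / "not the same company" (the ~130 pairs mentioned above).
dedupe.console_label(deduper)
deduper.train()

# Cluster all rows; each cluster is a (record_ids, confidence_scores) tuple.
clusters = deduper.partition(data, threshold=0.5)
```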
After this, the model clustered the rows and output a cluster ID for each one. It found 331 clusters. The following table shows the model's output, with the cluster ID and confidence score.
Cluster ID | confidence_score | raw_name_id | raw_name | raw_address |
---|---|---|---|---|
80 | 0.859248 | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
80 | 0.859248 | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
80 | 0.859263 | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
80 | 0.859264 | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
80 | 0.859250 | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
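Assuming the dedupe sketch above, rows like the ones in this table can be produced by flattening the partition output:

```python
# Flatten dedupe's partition output into (cluster_id, confidence, raw_name_id)
# rows like the table above; the original fields can then be joined back in.
rows = [
    (cluster_id, float(score), record_id)
    for cluster_id, (record_ids, scores) in enumerate(clusters)
    for record_id, score in zip(record_ids, scores)
]
```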
Now, to assess the result, the precision and recall of the process were calculated against the ground truth we manually created in the spreadsheet (a sketch of one common pairwise definition follows the numbers).
- precision: 0.9169
- recall: 0.9452
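The exact metric definition isn't spelled out above; a common choice for scoring a clustering is pairwise precision/recall, sketched here as a plain-Python reference:

```python
from itertools import combinations


def pairwise_precision_recall(predicted: dict, truth: dict) -> tuple[float, float]:
    """Score a clustering against ground truth; both arguments map
    row_id -> cluster_id. A pair counts as positive when both rows
    share a cluster. Fine at this scale; use sparse counting for
    much larger datasets."""

    def same_cluster_pairs(labels: dict) -> set:
        return {
            (a, b)
            for a, b in combinations(sorted(labels), 2)
            if labels[a] == labels[b]
        }

    predicted_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall
```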
To further test the model and check that it isn't overfitting, the same test was applied to held-out data that was not in the training set: 9000 raw company names that had been manually identified as 30 distinct companies, which the model grouped into 61 clusters. Here precision stayed essentially perfect while recall dropped: splitting 30 true companies across 61 clusters separates many true pairs (lowering recall), but the pairs that do get grouped are almost always correct.
- precision: 0.9999
- recall: 0.7227
The goal now is to make this scale so it finds more clusters. This run was done in RAM from a CSV, but the library can also connect to a Postgres database and work with many more rows.
Keep in mind that this only looked at 9000 company names, while the entire dataset has around 10,000,000 company names for consignee names alone. For reference, here is what the manually validated ground truth in the spreadsheet looks like:
true_id | true_name | validated_name | raw_name_id | raw_name | raw_address | count_value |
---|---|---|---|---|---|---|
63 | MELISSA & DOUG, LLC | Melissa & Doug | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 607 |
63 | MELISSA & DOUG, LLC | Melissa & Doug | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US | 497 |
63 | MELISSA & DOUG, LLC | Melissa & Doug | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA | 295 |
63 | MELISSA & DOUG, LLC | Melissa & Doug | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA | 249 |
63 | MELISSA & DOUG, LLC | Melissa & Doug | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 170 |
- On the embedding evaluation: Llama3 was taking a long time and its results were not improving, so I stopped it early. For the other 4 models (the ones with better results), I ran them again, this time tracking some additional metrics we’ll use when evaluating other approaches (a sketch of the top-1 accuracy check follows the tables). Updated results for the embeddings:
1.1 Question ID = 1 (6,231 questions of type "How much did [Exporter Country] export in [Year]?")
Model | Question ID | Correct matches | Accuracy (%) |
---|---|---|---|
Mixtral | 1 | 333 | 5.3 |
Llama3 | 1 | 398 | 6.4 |
all-mpnet-base-v2 | 1 | 4425 | 71.0 |
multi-qa-MiniLM-L6-cos-v1 | 1 | 4260 | 68.4 |
multi-qa-mpnet-base-cos-v1 | 1 | 4858 | 78.0 |
all-MiniLM-L12-v2 | 1 | 4027 | 64.6 |
1.2 Question ID = 2 (20,000 out of 46,872 questions of type "How much did [Exporter Country] export of [HS] in 2022?")
Model | Question ID | Correct matches | Accuracy (%) |
---|---|---|---|
Mixtral | 2 | 288 | 1.4 |
Llama3 | 2 | 516 / 10325 | ~5.0 |
all-mpnet-base-v2 | 2 | 19734 | 98.7 |
multi-qa-MiniLM-L6-cos-v1 | 2 | 19639 | 98.2 |
multi-qa-mpnet-base-cos-v1 | 2 | 19782 | 98.9 |
all-MiniLM-L12-v2 | 2 | 19700 | 98.5 |
1.3 Question ID = 3 (15,000 out of 34,955 questions of type "How much [HS] was traded in [Year]?")
Model | Question ID | Correct matches | Accuracy (%) |
---|---|---|---|
Mixtral | 3 | 732 | 4.9 |
Llama3 | 3 | 646 / 8052 | 8.0 |
all-mpnet-base-v2 | 3 | 13499 | 90.0 |
multi-qa-MiniLM-L6-cos-v1 | 3 | 13862 | 92.4 |
multi-qa-mpnet-base-cos-v1 | 3 | 14435 | 96.2 |
all-MiniLM-L12-v2 | 3 | 12860 | 85.7 |
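For context on how these accuracy numbers can be computed, here is a hedged sketch of a top-1 retrieval check with sentence-transformers; the corpus strings and expected indices are placeholders, not the real evaluation data:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

# Placeholder stand-ins for the real corpus passages and generated questions.
corpus = ["<passage: exports of country X in year Y>", "<passage: trade in HS code Z>"]
questions = ["How much did [Exporter Country] export in [Year]?"]
expected = torch.tensor([0])  # index of the corpus entry each question should hit

corpus_emb = model.encode(corpus, convert_to_tensor=True)
question_emb = model.encode(questions, convert_to_tensor=True)

# A question counts as a correct match when its nearest corpus entry
# (by cosine similarity) is the expected one.
top1 = util.cos_sim(question_emb, corpus_emb).argmax(dim=1)
accuracy = (top1 == expected).float().mean().item()
print(f"Top-1 accuracy: {accuracy:.1%}")
```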
- Now I’m finishing up a new evaluation set of 100 questions we’ll use across all approaches. It contains simple questions that are answerable directly from the corpus, plus some more complex ones (growth, top exporters, etc.) that the models might still be able to answer. We want to keep this evaluation set small in order to get fast results, keep costs low, and make it feasible to manually evaluate the models’ answers. The approaches we’ll evaluate with it are: RAG, multi-layer, a simple GPT API call (like asking ChatGPT), and fine-tuning.
For next week (on my side):
- Add more context to the corpus to answer these new questions
- Evaluate RAG
Then for the week after:
- Evaluate the multi-layer approach we were working on initially