Datawheel/template-chatbot

28th of June Update


Fine-tuning:

  • 4 models are running on Ollama (3 TinyLlama versions trained for 1, 10, and 50 epochs)
  • I was able to train a Llama2 model (1 epoch only)
  • Llama.cpp deprecated some functionality, which complicated the conversion from safetensors to GGUF format
  • Running evaluation on a small set locally

Company Names: Clustering using Semi-Supervised Learning

I built a proof of concept of a clustering model that uses semi-supervised learning to group company names by looking at their name and address (further variables could be included in the future).

I used the data set that we have been manually cleaning on the spreadsheet to perform a test.

I trained the model on a subset of the manually validated names:

  • Only the companies that appear most often
  • For each company, only the 15 most repeated matches
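The subset selection above could be sketched roughly as follows. This is only an illustration: it assumes each validated row carries a `true_id` and a `count_value` (as in the validated spreadsheet rows), that "appear most often" means the highest total count, and the `top_companies` cut-off is a hypothetical parameter.

```python
from collections import defaultdict


def select_training_rows(rows, top_companies, matches_per_company=15):
    """Keep the most frequent companies and, for each, its most repeated raw names."""
    by_company = defaultdict(list)
    for row in rows:
        by_company[row["true_id"]].append(row)

    # Rank companies by total occurrences (assumed meaning of "appear most often").
    ranked = sorted(
        by_company,
        key=lambda cid: sum(r["count_value"] for r in by_company[cid]),
        reverse=True,
    )[:top_companies]

    selected = []
    for cid in ranked:
        # Within a company, keep the raw names with the highest counts.
        matches = sorted(by_company[cid], key=lambda r: r["count_value"], reverse=True)
        selected.extend(matches[:matches_per_company])
    return selected


rows = [
    {"true_id": 63, "raw_name_id": 58517298, "count_value": 607},
    {"true_id": 63, "raw_name_id": 34062848, "count_value": 497},
    {"true_id": 63, "raw_name_id": 20411, "count_value": 295},
    {"true_id": 7, "raw_name_id": 1, "count_value": 3},  # hypothetical rare company
]
subset = select_training_rows(rows, top_companies=1, matches_per_company=2)
```

With these toy rows, only company 63 is kept, and only its two most frequent raw names.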

Overall, the model saw 3910 rows of data such as:

| raw_name_id | raw_name | raw_address |
|---|---|---|
| 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
| 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
| 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
| 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
| 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |

These 3910 rows correspond to 276 manually identified companies.

Then, from those 3910 rows (3910 × 3910 = 15,288,100 row pairs), I trained the model in a semi-supervised setting by manually reviewing 130 pairs of rows and marking each as "the same company" (114) or "not the same company" (17).
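The pair count quoted above equals 3910 squared, so it appears to count ordered pairs, including each row paired with itself. The number of distinct unordered pairs is about half of that. A quick check of the arithmetic, which also shows how small the manually labeled share is:

```python
from math import comb

n_rows = 3910

ordered_pairs = n_rows * n_rows   # every (row_a, row_b), including self-pairs
distinct_pairs = comb(n_rows, 2)  # unordered pairs of two different rows

labeled_pairs = 130               # pairs reviewed by hand, as reported above
labeled_share = labeled_pairs / distinct_pairs

print(ordered_pairs)   # 15288100
print(distinct_pairs)  # 7642095
print(f"{labeled_share:.6%}")
```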

After this, the model ran the clustering and assigned each row a cluster ID. It found 331 clusters. The following table shows the model output, with the cluster ID and confidence score for each row.

| Cluster ID | confidence_score | raw_name_id | raw_name | raw_address |
|---|---|---|---|---|
| 80 | 0.859248 | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US |
| 80 | 0.859248 | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US |
| 80 | 0.859263 | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA |
| 80 | 0.859264 | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA |
| 80 | 0.859250 | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US |

To assess the result, I calculated the precision and recall of the process against the ground truth that we created manually in the spreadsheet.

  • precision: 0.9169
  • recall: 0.9452
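The update does not say exactly how precision and recall were computed. One common convention for clustering is pairwise: a pair of rows counts as a true positive when both the model and the ground truth put the two rows together. The sketch below assumes that convention; the function name and the toy ids are illustrative only.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, truth):
    """predicted: row id -> cluster id; truth: row id -> true company id."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1  # model and ground truth both join the pair
        elif same_pred:
            fp += 1  # model joins rows from different companies
        elif same_true:
            fn += 1  # model separates rows of the same company
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): rows 1-3 are company A, rows 4-5 are company B.
truth = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
predicted = {1: 80, 2: 80, 3: 81, 4: 81, 5: 82}
precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)  # 0.5 0.25
```

Enumerating all pairs is fine for a few thousand rows but grows quadratically, so a count-based version would be preferable on larger inputs.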

To further test the model and check that it is not overfitting, I applied the same test to data outside the training set: 9000 company names, which the model grouped into 61 clusters but which actually correspond to 30 companies. Here recall was noticeably lower while precision was higher, which is still not bad.

  • precision: 0.9999
  • recall: 0.7227
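Finding 61 clusters where there are 30 companies suggests the model is splitting some companies into several clusters. Under a pairwise definition of the metrics (an assumption, since the update does not state the definition), over-splitting leaves precision untouched and lowers recall, which matches the pattern above. A small count-based illustration with made-up data:

```python
from collections import Counter
from math import comb


def same_group_pairs(labels):
    """Number of row pairs that share a label."""
    return sum(comb(size, 2) for size in Counter(labels).values())


# Hypothetical example: one true company with 6 rows,
# split by the model into two clusters of 3 rows each.
truth = ["A"] * 6
predicted = [1, 1, 1, 2, 2, 2]

true_pairs = same_group_pairs(truth)      # pairs joined in the ground truth
pred_pairs = same_group_pairs(predicted)  # pairs joined by the model
# Pairs joined by both: rows must share the true label AND the cluster id.
both_pairs = same_group_pairs(list(zip(truth, predicted)))

precision = both_pairs / pred_pairs
recall = both_pairs / true_pairs
print(precision, recall)  # 1.0 0.4
```

Every pair the model joins is correct, so precision stays at 1.0, but 9 of the 15 true pairs are missed.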

The goal now is to make this scale so that it finds more clusters. This test ran in memory (RAM) from a CSV file, but the library can also connect to a PostgreSQL database and work with more rows.

Keep in mind that this test only looks at 9000 company names, while the entire dataset has around 10,000,000 company names for consignees alone.

| true_id | true_name | validated_name | raw_name_id | raw_name | raw_address | count_value |
|---|---|---|---|---|---|---|
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 58517298 | MELISSA & DOUG, LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 607 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 34062848 | MELISSA & DOUG, LLC | 141 DANBURY RD WILTON CT 06897-441 US | 497 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 20411 | MELISSA & DOUG LLC | 141 DANBURY ROAD WILTON CT 068 USA | 295 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 819001 | MELISSA & DOUG, LLC | 141 DANBURY ROAD WILTON CT 06897 USA | 249 |
| 63 | MELISSA & DOUG, LLC | Melissa & Doug | 68400195 | MELISSA & DOUG LLC | 10 WESTPORT ROAD WILTON CT 06897 US | 170 |

  • In terms of the embedding evaluation, Llama3 was taking a lot of time and results were not improving, so I decided to stop it early. I ran the other 4 models (the ones with better results) again, this time tracking some additional metrics we’ll use when evaluating other approaches. Updated results for the embeddings:

1.1 Question ID = 1 (6,231 questions of type "How much did Exporter Country export in Year?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 1 | 333 | 5.3 |
| Llama3 | 1 | 398 | 6.4 |
| all-mpnet-base-v2 | 1 | 4425 | 71.0 |
| multi-qa-MiniLM-L6-cos-v1 | 1 | 4260 | 68.4 |
| multi-qa-mpnet-base-cos-v1 | 1 | 4858 | 78.0 |
| all-MiniLM-L12-v2 | 1 | 4027 | 64.6 |

1.2 Question ID = 2 (20,000 out of 46,872 questions of type "How much did Exporter Country export of HS in 2022?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 2 | 288 | 1.4 |
| Llama3 | 2 | 516 / 10325 | ~5.0 |
| all-mpnet-base-v2 | 2 | 19734 | 98.7 |
| multi-qa-MiniLM-L6-cos-v1 | 2 | 19639 | 98.2 |
| multi-qa-mpnet-base-cos-v1 | 2 | 19782 | 98.9 |
| all-MiniLM-L12-v2 | 2 | 19700 | 98.5 |

1.3 Question ID = 3 (15,000 out of 34,955 questions of type "How much HS was traded in Year?")

| Model | Question ID | Correct matches | Accuracy (%) |
|---|---|---|---|
| Mixtral | 3 | 732 | 4.9 |
| Llama3 | 3 | 646 / 8052 | 8.0 |
| all-mpnet-base-v2 | 3 | 13499 | 90.0 |
| multi-qa-MiniLM-L6-cos-v1 | 3 | 13862 | 92.4 |
| multi-qa-mpnet-base-cos-v1 | 3 | 14435 | 96.2 |
| all-MiniLM-L12-v2 | 3 | 12860 | 85.7 |
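The accuracy column in the tables above is consistent with correct matches divided by the number of questions evaluated, times 100. For the Llama3 rows written as `516 / 10325`, the sketch assumes the number after the slash is how many questions were evaluated before the run was stopped. A quick way to reproduce the figures (the helper name is illustrative):

```python
def accuracy_pct(correct, evaluated):
    """Accuracy as a percentage, rounded to one decimal like the tables."""
    return round(100 * correct / evaluated, 1)


# (correct matches, questions evaluated), taken from the tables above.
best_model = {
    1: (4858, 6231),
    2: (19782, 20000),
    3: (14435, 15000),
}
mpnet_cos = {qid: accuracy_pct(c, n) for qid, (c, n) in best_model.items()}

# Llama3 partial runs: the denominator is the assumed number evaluated.
llama3_q2 = accuracy_pct(516, 10325)
llama3_q3 = accuracy_pct(646, 8052)

print(mpnet_cos)             # {1: 78.0, 2: 98.9, 3: 96.2}
print(llama3_q2, llama3_q3)  # 5.0 8.0
```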
  • I’m now finishing a new evaluation set of 100 questions that we’ll use across all approaches. It includes simple questions that are present in the corpus as well as more complex ones (such as growth or top exporters) that the models might be able to answer. We want to keep this evaluation set small to get fast results, keep costs low, and be able to manually evaluate the models’ answers. The approaches we’ll evaluate with it are: RAG, multi-layer, a simple GPT API call (like asking ChatGPT), and fine-tuning.

For next week (on my side):

  1. Add more context to the corpus to answer these new questions
  2. Evaluate RAG

Then for the week after:
  3. Evaluate the multi-layer approach we were working on initially