Task: Develop a model capable of extracting product names from furniture store websites.
- Input: a list of URLs from furniture store sites.
- Output: a list of product names extracted from each URL.
Veridion provides the most comprehensive database of company data, gathered by AI with human precision.
After downloading a data sample, I needed to clarify whether the product names to extract were specific ("Hamar Plant Stand") or generic ("Plant Stand"). Inspecting the sample led to the conclusion that the generic form, "Plant Stand", is the target.
[Screenshot: Veridion Data Sample - Data Dictionary, Products & Services]
[Screenshot: Veridion Data Sample - Products & Services Sample]
This challenge offers an opportunity to improve the extraction process, as some product names are currently not captured correctly.
[Screenshot: Veridion Data Sample - Products & Services Sample, example of a wrongly extracted product name]
Veridion Entity Recognizers served as the basis for building the model that identifies 'PRODUCT' entities.
- Create a NER (Named Entity Recognition) model.
- Train the NER model to find 'PRODUCT' entities.
- Use ~100 pages from the URLs list for training data.
- Develop a method to tag sample products.
- Use the model to extract product names from unseen pages.
- Showcase the solution.
URL Verification:
- Checked that every URL in the list was functional with verify_urls.py, producing valid_urls.csv and invalid_urls.csv.
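
A minimal sketch of what this step might look like, assuming the input is a one-column CSV of URLs; the function name, paths, and fall-back-to-GET heuristic are illustrative, not the actual contents of verify_urls.py:

```python
import csv

import requests

def verify_urls(in_path: str = "urls.csv") -> None:
    """Split URLs into reachable and unreachable lists (illustrative)."""
    with open(in_path, newline="", encoding="utf-8") as f:
        urls = [row[0] for row in csv.reader(f) if row]

    valid, invalid = [], []
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET.
            resp = requests.head(url, timeout=10, allow_redirects=True)
            if resp.status_code >= 400:
                resp = requests.get(url, timeout=10)
            (valid if resp.status_code < 400 else invalid).append(url)
        except requests.RequestException:
            invalid.append(url)

    for path, rows in (("valid_urls.csv", valid), ("invalid_urls.csv", invalid)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([[u] for u in rows])
```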
Data Scraping:
- Used scraper.py to scrape data from the valid URLs, resulting in extracted_product_data.csv.
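
The scraping itself could be as simple as the sketch below; the real scraper.py may use different selectors and output columns, so the "url"/"text" schema here is an assumption:

```python
import csv

import requests
from bs4 import BeautifulSoup

def scrape(urls: list[str], out_path: str = "extracted_product_data.csv") -> None:
    """Fetch each page and store its visible text (illustrative schema)."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "text"])  # assumed columns
        for url in urls:
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            soup = BeautifulSoup(html, "html.parser")
            # Drop script/style noise before extracting visible text.
            for tag in soup(["script", "style", "noscript"]):
                tag.decompose()
            text = " ".join(soup.get_text(separator=" ").split())
            writer.writerow([url, text])
```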
Data Cleaning:
- Cleaned the scraped data with clean_data.py.
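
The exact cleaning rules in clean_data.py are not spelled out here, but a typical pass over the scraped CSV might look like this (column names and the output file name are assumed):

```python
import pandas as pd

def clean(in_path: str = "extracted_product_data.csv",
          out_path: str = "cleaned_product_data.csv") -> None:
    df = pd.read_csv(in_path)
    # Drop empty pages and duplicate URLs.
    df = df.dropna(subset=["text"]).drop_duplicates(subset=["url"])
    # Strip leftover HTML entities and collapse whitespace.
    df["text"] = (
        df["text"]
        .str.replace(r"&[a-z]+;", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    df.to_csv(out_path, index=False)
```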
Data Organization:
- Automated the labeling process in an unorthodox way to avoid manual annotation, using organize_data.py to produce organized_product_data.csv. There's a long story behind it.
- Converted the organized data into a plain list of product names with to_list.py, resulting in product_names.txt.
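
The conversion step is straightforward; this sketch assumes a "product_name" column in organized_product_data.csv, which may not match the real schema:

```python
import pandas as pd

df = pd.read_csv("organized_product_data.csv")
# "product_name" is an assumed column name.
names = sorted(set(df["product_name"].dropna().str.strip()))
with open("product_names.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(names))
```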
Text Annotation:
- Annotated the text using product_names.txt and extracted_product_data.csv with ner_tags.py, following the tagging scheme of the wnut17 dataset.
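
The core idea of this kind of gazetteer-style annotation is to match known product names against the page tokens and emit BIO tags, as in wnut17. The matcher below is a simplified stand-in for ner_tags.py, not a copy of it:

```python
def bio_tag(tokens: list[str], product_names: list[str]) -> list[str]:
    """Tag tokens that match a known product name with B-/I-PRODUCT."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for name in product_names:
        name_toks = name.lower().split()
        n = len(name_toks)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == name_toks:
                tags[i] = "B-PRODUCT"
                for j in range(i + 1, i + n):
                    tags[j] = "I-PRODUCT"
    return tags

# ["Buy", "a", "plant", "stand", "today"] with ["plant stand"]
# -> ["O", "O", "B-PRODUCT", "I-PRODUCT", "O"]
print(bio_tag(["Buy", "a", "plant", "stand", "today"], ["plant stand"]))
```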
Data Splitting:
- Split the annotated data into training and validation sets (80%/20%) using split_data.py, resulting in train_data.json and val_data.json.
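
A minimal version of the split, assuming the annotated records live in a single JSON file (the input file name is hypothetical):

```python
import json
import random

# "annotated_data.json" is a hypothetical name for the tagging output.
with open("annotated_data.json", encoding="utf-8") as f:
    records = json.load(f)

random.seed(42)
random.shuffle(records)
cut = int(0.8 * len(records))

for path, chunk in (("train_data.json", records[:cut]),
                    ("val_data.json", records[cut:])):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunk, f, ensure_ascii=False, indent=2)
```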
Model Training:
- Fine-tuned `distilbert-base-uncased` on the dataset in Fine_tune_distilbert_NER_Furniture.ipynb.
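
Condensed from what a token-classification fine-tuning notebook typically contains; the label set, hyperparameters, output directory, and the assumption that ner_tags are stored as tag strings are all illustrative, not lifted from Fine_tune_distilbert_NER_Furniture.ipynb:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PRODUCT", "I-PRODUCT"]  # assumed label set
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

raw = load_dataset(
    "json",
    data_files={"train": "train_data.json", "validation": "val_data.json"},
)

def tokenize_and_align(batch):
    """Align word-level BIO tags to DistilBERT subword tokens."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        prev, lab = None, []
        for wid in word_ids:
            if wid is None:
                lab.append(-100)  # special tokens are ignored by the loss
            elif wid != prev:
                lab.append(labels.index(tags[wid]))
            else:
                lab.append(-100)  # only label the first subword of a word
            prev = wid
        all_labels.append(lab)
    enc["labels"] = all_labels
    return enc

tokenized = raw.map(tokenize_and_align, batched=True,
                    remove_columns=raw["train"].column_names)

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={lbl: i for i, lbl in enumerate(labels)},
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-furniture", learning_rate=2e-5,
                           per_device_train_batch_size=16, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```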
Model Testing and Solution Showcase:
- Used the fine-tuned model to extract product names from the valid URLs and plotted some statistics about the extracted products in testing_ner.ipynb.
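
For inference, the Transformers pipeline API with subword aggregation is enough to get whole product spans back; the model path below is a placeholder for the published checkpoint:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ner-furniture",          # placeholder for the fine-tuned checkpoint
    aggregation_strategy="simple",  # merge subwords into whole entity spans
)

text = "Shop our new oak coffee table and velvet armchair collection."
for ent in ner(text):
    if ent["entity_group"] == "PRODUCT":
        print(ent["word"], round(float(ent["score"]), 3))
```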
The model and the dataset can be found on Hugging Face.

Takeaways:
- created my first dataset from scratch;
- fine-tuned my first LLM;
- deployed both on Hugging Face;
- applied to my first machine learning internship;
- gained confidence in working regularly with bash, vim, Hugging Face, and different types of data;
- understood how fine-tuning works for NER;
- understood how LLMs process data.