Yolantele/ML-data-clasifier

Jupyter Notebook

Tabular Waste Data Classifications Model POCs

Fast AI multi-label classification POC - EN
SpaCy & Scikit-learn multi-label classifications POC - EN, NL
SpaCy Entity Analysis POC - EN, NL

See the ML model here

Uses Dutch Waste Data

Construction and demolition waste
Packaging waste and recyclables
Electronic and electrical equipment
Vehicle and oily wastes
Healthcare and related wastes

Data omitted from the repo for data sensitivity reasons

Training of the Fast AI Machine Learning classification model:

This project uses Fast AI tabular neural nets for the ML classification model:

Using neural nets for analyzing tabular data
Loading data into Pandas DataFrame
Using categorical variables for entity embeddings (more on embeddings)
Using continuous variables (numeric values) for neural nets
Using 3 data sets: train, validation and test data
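As a rough illustration of the categorical/continuous distinction above, here is a minimal pandas sketch. The sample rows and the column names `euralCode`, `weightKg` and `category` are hypothetical (the real data is not in the repo), and treating every non-numeric column as categorical is an assumption, not necessarily how the notebooks do it.

```python
import pandas as pd

# Hypothetical sample rows; the real waste data is not included in the repo.
df = pd.DataFrame({
    "wasteDescription": ["mixed rubble", "cardboard", "old laptops"],
    "euralCode": ["17 01 07", "15 01 01", "16 02 14"],
    "pureOrMixed": [0, 1, 1],
    "weightKg": [1200.0, 350.5, 80.0],
    "category": ["construction", "packaging", "electronics"],
})

target = "category"  # hypothetical label column
features = df.drop(columns=[target])

# Numeric columns are treated as continuous inputs.
cont_names = features.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical (candidates for entity embeddings).
cat_names = [col for col in features.columns if col not in cont_names]

print("categorical:", cat_names)
print("continuous:", cont_names)
```

Lists like `cat_names` and `cont_names` are the kind of input a tabular learner typically expects when it builds embeddings for categorical columns and feeds numeric columns in directly.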

*Unfortunately, for data privacy reasons, the required data is not included in this repo. Please reach out or message if you will

Treating The Data:

1. Translation services

Google Translate API and service account
A client was set up to provide translations from Dutch (nl) to English (en)

2. Augmenting data

Overriding Boolean-like field values: fields with two possible string values become the integers 0 and 1
For example, the pureOrMixed string values pure and mixed become the integers 1 and 0, to be set later as continuous variables in the tabular learner
Prefilling fields where possible: for example, the waste description field is prefilled with euralCodeDescription when it is under-defined
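The two augmentation steps above could be sketched in pandas roughly as follows. The sample rows are made up, the column name `wasteDescription` is a guess at how the waste description field is named, and the choices that pure maps to 1 and that "under-defined" means missing or blank are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; the real waste data is not included in the repo.
df = pd.DataFrame({
    "pureOrMixed": ["pure", "mixed", "pure"],
    "wasteDescription": ["clean timber", None, ""],
    "euralCodeDescription": [
        "wood",
        "mixed construction waste",
        "paper and cardboard packaging",
    ],
})

# Two-option string field -> integers (assumed mapping: pure = 1, mixed = 0).
df["pureOrMixed"] = df["pureOrMixed"].map({"pure": 1, "mixed": 0})

# Treat missing or blank descriptions as under-defined (an assumption)
# and prefill them from euralCodeDescription.
description = df["wasteDescription"].fillna("").str.strip()
df["wasteDescription"] = description.where(
    description != "", df["euralCodeDescription"]
)

print(df)
```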

3. Creating 3 sets of data: train, validation and test data

Loaded into pandas DataFrames
Training of the ML model uses train and validation data with rich fields
Testing of the ML model uses test data with missing fields
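One possible reading of the split above is that complete ("rich") rows feed training and validation, while rows with missing fields form the test set. The sketch below follows that reading. The sample rows, the column names other than `pureOrMixed`, and the 25% validation share are all hypothetical.

```python
import pandas as pd

# Hypothetical rows; None marks a missing field.
df = pd.DataFrame({
    "wasteDescription": ["rubble", None, "cardboard", "laptops", None,
                         "engine oil", "syringes", "bricks", "glass", "cables"],
    "pureOrMixed": [0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    "category": ["construction", "packaging", "packaging", "electronics",
                 "vehicle", "vehicle", "healthcare", "construction",
                 "packaging", "electronics"],
})

# Rows with any missing field go to the test set; complete rows are "rich".
has_missing = df.isna().any(axis=1)
test_df = df[has_missing]
rich_df = df[~has_missing]

# Shuffle the rich rows with a fixed seed, then hold out 25% for validation.
shuffled = rich_df.sample(frac=1, random_state=42)
n_valid = len(shuffled) // 4
valid_df = shuffled.iloc[:n_valid]
train_df = shuffled.iloc[n_valid:]

print(len(train_df), len(valid_df), len(test_df))
```

Fixing `random_state` keeps the train/validation assignment reproducible between notebook runs.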