This repo is a collection of scripts we use at Xeneta to qualify sales leads with machine learning. Read more about this project in the Medium article Boosting Sales With Machine Learning.
Our simple scikit-learn Random Forest algorithm obtains an accuracy of 86,4% on the testing data.
Unfortunately, we can not provide you with the raw text dataset we've used to train our algorithm, as it contains sensitive company information (who our customers are). However, we have added our own vectorized & transformed data in the xeneta_dataset folder (which is the final input to the RF algorithm). Feel free to play around with this data and see if you can beat our accuracy.
To run create your own lead qualifier, you'll need to run these scripts on your own customer data, which you can do using the FullContact API.
Start off by running the following command:
pip install -r requirements.txt
You'll also need to download the stopword from the nltk package. Run the Python interpreter and type the following:
import nltk
nltk.download('stopwords')
This script trains an algorithm on your input data. It expects two excel sheets named qualified and disqualified in the input folder. These sheets need to contain two columns:
- URL
- Description
Run the script:
python run.py
It'll dump three files to your disc:
- algorithm
- vectorizer
- tfidf_vectorizer
Copy paste these into the 'qualify_leads/algorithms' folder, and it'll be ready to start predicting.
This is the script that actually predicts the quality of your leads. Add an excel sheet named data in the input folder. Use the same format as the example file that's already there.
Run the script:
python run.py
Got questions? Email me at per@xeneta.com.