nlp-categorize uses a natural-language analyzer to categorize free-form texts into 3 levels of categories with the help of keywords and keyword-chains.
The keyword file must be a CSV file (exported from LibreOffice, for example) with the following structure:
category-level1, category-level2, category-level3, keyword(s), keyword(s), ..., keyword(s)
E.g.:
Technology,IT,Web Programming,php,js,javascript
Technology,IT,Low-Level C Programming,c compiler,linux kernel
Technology,IT,Programmer,programmer,compiler,programming
(Note: Here "c compiler" (or "linux kernel") is a keyword-chain. A match only counts if both words are present and the natural-language analyzer sees them in a direct relation.)
The file must be stored as keywords.csv in the directory where the script is run.
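The keyword file layout described above can be read with Python's standard csv module. The following is only a minimal sketch, not the script's actual parser: the function name load_keywords and the choice to store each keyword-chain as a tuple of lowercase words are assumptions made for illustration.

```python
import csv
import io


def load_keywords(lines):
    """Parse keyword rows into {(level1, level2, level3): [keyword chains]}.

    Hypothetical helper: each keyword(-chain) is stored as a tuple of
    lowercase words, so "c compiler" becomes ("c", "compiler").
    """
    categories = {}
    for row in csv.reader(lines):
        cells = [cell.strip() for cell in row]
        # Skip blank rows and rows without three category levels plus a keyword.
        if len(cells) < 4 or not all(cells[:3]):
            continue
        key = tuple(cells[:3])
        chains = [tuple(kw.lower().split()) for kw in cells[3:] if kw]
        categories.setdefault(key, []).extend(chains)
    return categories


# In-memory stand-in for keywords.csv, using rows from the example above.
sample = io.StringIO(
    "Technology,IT,Web Programming,php,js,javascript\n"
    "Technology,IT,Low-Level C Programming,c compiler,linux kernel\n"
)
keywords = load_keywords(sample)
```

To read the real file instead, an open file object can be passed in place of the StringIO, for example open("keywords.csv", newline="", encoding="utf-8").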
The input texts are stored in a CSV file as well (input.csv), with just one column per line, containing the text to be analyzed. For example:
"I like programming, C++ and C. I love the Linux Kernel."
would match the 'Technology -> IT -> Low-Level C Programming' and 'Technology -> IT -> Programmer' categories.
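To illustrate the keyword-chain rule (all words present and directly related), here is a rough sketch that works on an already-parsed sentence. It assumes that "direct relation" means one word is the syntactic head of the other in a dependency parse; that interpretation, the function name chain_matches, and the (lemma, head index) token representation are assumptions for illustration and may differ from what the script does.

```python
# A token is (lemma, index of its syntactic head); the root points to itself.
# This mimics what a dependency parser provides, without requiring one here.


def chain_matches(chain, tokens):
    """Return True if every word of `chain` occurs in `tokens` and
    consecutive chain words are linked head-to-child (in either direction)."""
    positions = {}
    for index, (lemma, _head) in enumerate(tokens):
        positions.setdefault(lemma, []).append(index)

    # Every word of the chain must be present at least once.
    if any(word not in positions for word in chain):
        return False

    # Consecutive words must be directly related in the parse.
    for first, second in zip(chain, chain[1:]):
        linked = any(
            tokens[i][1] == j or tokens[j][1] == i
            for i in positions[first]
            for j in positions[second]
        )
        if not linked:
            return False
    return True


# Hand-written parse of "I love the Linux Kernel" (lowercased lemmas).
tokens = [("i", 1), ("love", 1), ("the", 4), ("linux", 4), ("kernel", 1)]

linux_kernel_found = chain_matches(("linux", "kernel"), tokens)
c_compiler_found = chain_matches(("c", "compiler"), tokens)
```

With this hand-written parse, "linux kernel" matches because "linux" depends directly on "kernel", while "c compiler" does not match because neither word occurs in the sentence.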
For the moment a simple output, again in a CSV format, is generated:
Original Text, Match_CAT0, Match_CAT1, ..., CATN, detail cat, ...
The aggregated first column of the keywords-file is considered the main column. The above example would produce:
text,Technology
"I like programming, C++ and C. I love the Linux Kernel.",1,IT - Low-Level C Programming, IT - Programmer
By default a French language package is used (fr_core_news_sm).
Keywords must be stored in their infinitive and masculine form. For example, permis conduire becomes permettre conduire, and realisatrice has to be stored (and will be referenced) as realisateur. When using the output, be careful to go back to the raw text to check the original gender.
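One way to check which form a keyword should be stored in is to run it through the same language package and look at the lemmas. The helper below is a small sketch under that assumption: normalize_keyword is a hypothetical name, and it only relies on the pipeline being callable and on each token exposing a lemma_ attribute, as spaCy tokens do.

```python
def normalize_keyword(nlp, keyword):
    """Return the space-joined lowercase lemmas of `keyword`.

    `nlp` is expected to be a spaCy-style pipeline: calling it on a string
    yields tokens that each carry a `lemma_` attribute.
    """
    return " ".join(token.lemma_.lower() for token in nlp(keyword))
```

With spaCy and the language package installed, the pipeline can be created with spacy.load("fr_core_news_sm") and passed as the first argument. The lemmas returned depend on the installed model version, so it is worth checking the result for each keyword before adding it to keywords.csv.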
You need a working python3 environment on your PC/Mac.
On a Mac, refer to this guide. The section "Doing it right" should be enough. You need git as well:
brew install git
Then git clone this repository to your PC:
cd place/to/store/project
git clone https://github.com/pboettch/nlp-categorize.git
Install the dependencies:
cd nlp-categorize
pip3 install -r requirements.txt
Install spaCy's language package:
python3 -m spacy download fr_core_news_sm
Create a keywords.csv and an input.csv as described above. Then run the script:
./nlp-categorize.py
It will take some time and print no further output. Once it finishes, an output.csv file should have been generated.