Machine Learning System to classify a Patent Application based on its title and abstract (resumo), using INPI-Brazil RPI patent text data to train the model.
- download - downloads the latest available RPIs from http://revistas.inpi.gov.br;
- import - parses all .txt or .TXT files in ./input and generates ./output/<RPI>.csv, ./output/<RPI>_parsed.txt and ./output/full.csv; also writes ./import.log;
- pre_processing - generates dataset_ipc_first.csv from full.csv, containing title|resume|ipc; the text is stemmed with nltk.stem.RSLPStemmer() and only the first IPC label is kept for multi-label patent applications.
- ./input/*.txt or ./input/*.TXT - RPI (Revista da Propriedade Industrial) files, one RPI per file;
- ./import.log - records execution information;
- ./output/<RPI>_parsed.txt - processed RPI file (debug);
- ./output/<RPI>.csv - records/observations extracted from one .txt file;
- ./output/full.csv - all records/observations extracted from all .txt files;
- ./imported/<RPI>.txt - archives all imported .txt files, one file per RPI;
- ./output/dataset.csv - pre-processed records/observations.
- http://revistas.inpi.gov.br - source for downloading the Patent section of each RPI (.txt files)
- http://dados.gov.br/dataset/revista-da-propriedade-industrial-rpi/resource/4288c07c-f9bd-45d7-8fc0-56b4fc1f5c82 - information on how to get RPI files
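
The pre_processing step described above (stem the text, keep only the first IPC label, write title|resume|ipc) could look roughly like the sketch below. It is not the project's actual code. The function names, the ";" separator between IPC labels and the input column names title, resume and ipc are all assumptions made for illustration. The stemmer is passed in as a callable so the sketch runs without NLTK; with NLTK installed and the "rslp" resource downloaded, RSLPStemmer().stem can be supplied instead.

```python
import csv
import io
import re


def first_ipc(ipc_field, sep=";"):
    """Return the first IPC label of a multi-label field, or '' if there is none.

    The ';' separator is an assumption about how full.csv stores multiple labels.
    """
    labels = [part.strip() for part in ipc_field.split(sep) if part.strip()]
    return labels[0] if labels else ""


def stem_text(text, stem):
    """Lowercase the text, split it into word tokens, and stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(stem(token) for token in tokens)


def preprocess(rows, stem=lambda token: token):
    """Yield (title, resume, ipc) tuples with stemmed text and a single IPC label."""
    for row in rows:
        ipc = first_ipc(row["ipc"])
        if not ipc:
            continue  # skip records that carry no IPC label at all
        yield stem_text(row["title"], stem), stem_text(row["resume"], stem), ipc


def write_dataset(rows, out, stem=lambda token: token):
    """Write the pre-processed rows to `out` as pipe-delimited CSV."""
    writer = csv.writer(out, delimiter="|")
    writer.writerow(["title", "resume", "ipc"])
    writer.writerows(preprocess(rows, stem))


if __name__ == "__main__":
    # To use the Portuguese stemmer named in this README (requires NLTK):
    #   import nltk; nltk.download("rslp")
    #   from nltk.stem import RSLPStemmer
    #   stem = RSLPStemmer().stem
    sample = [
        {"title": "Maquina de Lavar", "resume": "Uma maquina.", "ipc": "D06F 39/00; A47L 15/00"},
    ]
    buffer = io.StringIO()
    write_dataset(sample, buffer)  # identity stemmer: tokens are only lowercased
    print(buffer.getvalue())
```

Keeping only the first label turns a multi-label problem into a single-label one, which simplifies training at the cost of discarding secondary classifications.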