KeBaSiC
A tool for keyword extraction from webpages
Installation
KeBaSiC is written in Python 3 and requires at least Python 3.5. To install the required dependencies, run:
pip install -r requirements.txt
The tool also requires installing the following programs:
- Stanford CoreNLP Server. It requires several files:
  - The server from https://stanfordnlp.github.io/CoreNLP/;
  - The language-specific jar file from https://stanfordnlp.github.io/CoreNLP/download.html;
  - The language properties file from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties
  Place the language-specific jar and properties files in the server folder.
- TreeTagger from http://www.cis.uni-muenchen.de/%7Eschmid/tools/TreeTagger/
- WEKA from https://www.cs.waikato.ac.nz/ml/weka/downloading.html; then install the LibSVM package from WEKA's package manager.
Usage
Keyword Extraction
- Edit the config.json file with the input and output files
- In the composed.py file, change the reader and writer classes according to the desired input/output (see the kebasicio module for all available IO classes and interfaces)
- Start the CoreNLP server with:
java -Xmx12g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -timeout 60000
- Run
python application.py -input_path="../data/test.json" [-out_path="../output/keywords_extraction/results_test.json" -std_out -log -log_path="../log"]
The arguments in square brackets are optional. The arguments are:
- out_path: the path where results are saved; if not specified, no results are saved to file.
- std_out: prints the results to standard output; if not specified, no results are printed.
- log: enables logging; if not specified, logging is disabled.
- log_path: the path where logs are saved; if not specified, logs are only printed to standard output.
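The command-line options above can be sketched with Python's argparse module. This is an illustrative sketch mirroring the documented flags, not the actual parser in application.py, whose names and defaults may differ:

```python
import argparse

# Hypothetical parser mirroring the documented CLI flags;
# the real application.py may define these differently.
parser = argparse.ArgumentParser(description="KeBaSiC keyword extraction")
parser.add_argument("-input_path", required=True,
                    help="Input JSON file of webpages")
parser.add_argument("-out_path", default=None,
                    help="Where to save results; omit to skip saving")
parser.add_argument("-std_out", action="store_true",
                    help="Print results to standard output")
parser.add_argument("-log", action="store_true",
                    help="Enable logging")
parser.add_argument("-log_path", default=None,
                    help="Directory for log files; omit for stdout-only logging")

# Parse a sample invocation instead of sys.argv for demonstration
args = parser.parse_args(["-input_path", "../data/test.json", "-std_out"])
```

With this sketch, omitting `-out_path` leaves `args.out_path` as `None`, matching the "no saved results" behavior described above.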
NOTES:
- The CoreNLP server requires at least 8GB of RAM; more memory allows processing larger webpages.
- Set the CoreNLP timeout to at least one minute to allow the processing of large webpages.
Dataset Creation
- Create the queries by running the scripts file with the following command:
python scripts.py query --template "../resources/ES/query/templates/bing.txt" --keys "../resources/ES/query/keywords.txt" --out "result_query.txt"
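Conceptually, this step combines each query template with each keyword. A minimal sketch of that cross-product, assuming a simple placeholder syntax; the actual template format used by scripts.py may differ:

```python
def build_queries(template_lines, keywords, placeholder="{keyword}"):
    """Fill every template with every keyword (placeholder syntax is assumed)."""
    queries = []
    for template in template_lines:
        for keyword in keywords:
            queries.append(template.replace(placeholder, keyword))
    return queries

# Two hypothetical templates crossed with two keywords yield four queries
queries = build_queries(["{keyword} site:es", "comprar {keyword}"],
                        ["zapatos", "libros"])
```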
- Use Bing to search for web links relevant to the queries via https://github.com/NikolaiT/GoogleScraper. Run:
GoogleScraper -m http --keyword-file keywords.txt --num-workers 10 --proxy-file proxies.txt --search-engines "bing" --output-filename output.json
- Collect the webpages, preprocess them, and generate the files for model training. For best results, check that the following classes are present in the preprocessing configuration:
"textprocessing.cleaner.MailCleaner",
"textprocessing.cleaner.URLCleaner",
"textprocessing.cleaner.WordsWithNumbersCleaner",
"textprocessing.cleaner.NonPunctuationSymbolsCleaner",
"textprocessing.cleaner.CommonWords",
"textprocessing.cleaner.LocationsCleaner",
"textprocessing.cleaner.StopWordsCleaner",
"textprocessing.cleaner.PunctuationSpacesCleaner",
"textprocessing.cleaner.MultipleSpacesCleaner",
"textprocessing.cleaner.Clean4SQL",
"textprocessing.cleaner.PunctuationCleaner",
"textprocessing.stemmer.Stemmer"
Then run:
python scripts.py dataset --input "../data/GoogleScraper_test_results.json" --out "../data/train_dataset" --workers 8
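The cleaners listed above are applied in sequence, each transforming the text and passing it on. A minimal sketch of such a chain follows; the class names and `clean` interface here are illustrative assumptions, not the actual textprocessing API:

```python
import re

class URLCleaner:
    """Removes http/https URLs from the text (illustrative)."""
    def clean(self, text):
        return re.sub(r"https?://\S+", " ", text)

class MultipleSpacesCleaner:
    """Collapses runs of whitespace into single spaces (illustrative)."""
    def clean(self, text):
        return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text, cleaners):
    """Apply each cleaner in order, feeding its output to the next."""
    for cleaner in cleaners:
        text = cleaner.clean(text)
    return text

cleaned = run_pipeline("visit https://example.com   for   more",
                       [URLCleaner(), MultipleSpacesCleaner()])
```

Ordering matters: here the URL is stripped first, and the space-collapsing cleaner then tidies the gaps it leaves behind.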
Classification Training
Train the model:
python scripts.py train --input "../data/train_dataset_lvl1.csv" --out "../data/model_lvl1_test.model"
Modules Description
Basic description for the different modules. See docs for more info.
kebasic.domain
Contains the classes that represent the domain specific objects
kebasic.execution
Contains the different pipelines used in the program
kebasic.feature
Contains the different algorithms for keyword extraction and score merging
kebasic.kebasicio
Contains the classes responsible for the IO
kebasic.scraper
Contains the scrapy code for crawling webpages
kebasic.textprocessing
Contains the different algorithms for cleaning and processing of the text
kebasic.utils
Contains various utility code used for heterogeneous tasks
Resources
This section describes the editable resources that allow customization of the tool. These files can be language-dependent or language-independent. The language-specific files are placed in the language folder.
Stopwords
The stopwords file contains common words of the specific language. It can be modified to prevent unwanted words from being extracted as keywords.
NOTE: The tool uses stopwords as keyword separators, so a word's presence in this file does not imply that it will be absent from the results.
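The note above means stopwords delimit candidate keyword phrases rather than being filtered out of longer matches. A toy illustration of this splitting behavior, not the tool's actual implementation:

```python
def split_on_stopwords(tokens, stopwords):
    """Split a token stream into candidate phrases at stopword positions."""
    candidates, current = [], []
    for token in tokens:
        if token.lower() in stopwords:
            # A stopword closes the current candidate phrase
            if current:
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        candidates.append(" ".join(current))
    return candidates

phrases = split_on_stopwords(
    "extraccion de palabras clave en paginas web".split(),
    {"de", "en"})
```

The stopwords "de" and "en" never appear inside a candidate, but they determine where one candidate ends and the next begins.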
Commonwords
The common words file contains undesired words and words that are common on websites (e.g. Facebook, share, login). These words are removed before preprocessing with the other algorithms and will not appear in the keywords.
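Unlike stopwords, common words are simply dropped from the text. A minimal sketch of this filtering, under the assumption that matching is case-insensitive and token-based:

```python
def remove_common_words(tokens, common_words):
    """Drop tokens found in the common-words list (case-insensitive match assumed)."""
    return [t for t in tokens if t.lower() not in common_words]

tokens = remove_common_words("login to share this page on Facebook".split(),
                             {"facebook", "share", "login"})
```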
Location Names
This file contains the names of locations that are undesired as keywords.