WikiDoMiner: Wikipedia Domain-specific Miner

WikiDoMiner is a tool that automatically generates domain-specific corpora by crawling Wikipedia.

Installation

Clone the repository and install the required libraries:

git clone https://github.com/SNTSVV/WikiDoMiner.git
cd WikiDoMiner
pip install -r requirements.txt

Usage example

CLI:

python WikiDoMiner.py --doc Xfile.txt --output-path ../research/nlp --wiki-depth 1

List the available arguments with:

python WikiDoMiner.py --help
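
To process several documents in one go, the CLI call can be wrapped in a small script. The following is a minimal sketch, not part of WikiDoMiner: it uses only the three flags shown in the example above, and the helper names (`build_command`, `run_batch`) and the one-folder-per-document output layout are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_command(doc, output_path, wiki_depth=1, script="WikiDoMiner.py"):
    """Build the argument list for one run, using the flags from the CLI example."""
    return [
        sys.executable, script,
        "--doc", str(doc),
        "--output-path", str(output_path),
        "--wiki-depth", str(wiki_depth),
    ]


def run_batch(docs, output_root, wiki_depth=1, dry_run=True):
    """Run one command per input document; with dry_run=True, only print them."""
    commands = []
    for doc in docs:
        # Hypothetical layout: one output folder per document, named after the file stem.
        out_dir = Path(output_root) / Path(doc).stem
        cmd = build_command(doc, out_dir, wiki_depth)
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_batch(["spec_a.txt", "spec_b.txt"], "corpora")` prints the two commands without executing them; pass `dry_run=False` from inside the cloned repository to actually run them.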

Notebook (can also be run in Colab):

# extract keywords
keywords = getKeywords(document, spacy_pipeline)

# query Wikipedia to get your corpus
corpus = getCorpus(keywords, depth=1)

# save your corpus locally
saveCorpus(corpus, parent_dir='Documents', folder='Corpus')
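
This README does not spell out what `depth` (or `--wiki-depth`) controls. One plausible reading is that it bounds how many levels of Wikipedia subcategories are followed from the categories of the initially matched articles. The sketch below illustrates that reading only; it runs on a made-up in-memory category graph, makes no network calls, and `categories_within_depth` is a hypothetical helper, not a WikiDoMiner function.

```python
from collections import deque

# Toy category graph (made-up data, not real Wikipedia content):
# each category maps to its direct subcategories.
SUBCATEGORIES = {
    "Satellites": ["Communications satellites", "Earth observation satellites"],
    "Communications satellites": ["Satellite television"],
    "Earth observation satellites": [],
    "Satellite television": [],
}


def categories_within_depth(seeds, depth, subcategories=SUBCATEGORIES):
    """Return every category reachable from `seeds` in at most `depth` hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        category, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for child in subcategories.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
    return seen
```

Under this reading, depth 0 keeps only the seed categories, depth 1 adds their direct subcategories, and each further level widens the corpus at the cost of more crawling and a looser match to the domain.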

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT