wikipedia-corpus
There are 30 repositories under the wikipedia-corpus topic.
howl-anderson/chinese-wikipedia-corpus-creator
Corpus creator for Chinese Wikipedia
GermanT5/wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
todd-cook/ML-You-Can-Use
Practical ML and NLP with examples.
ayushidalmia/Wikipedia-Search-Engine
Builds a search engine over the 43 GB Wikipedia data dump of 2013. Search results are returned in real time.
macbre/mediawiki-dump
Python package for working with MediaWiki XML content dumps
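Several repos in this list, including mediawiki-dump, work with the MediaWiki export XML that Wikipedia dumps use, where each article sits in a `<page>` element. A minimal sketch using only the Python standard library (not the mediawiki-dump package's own API, and with the real dump's XML namespace omitted for brevity):

```python
import xml.etree.ElementTree as ET

# Tiny sample in the MediaWiki export shape; real dumps are multi-gigabyte
# and carry an XML namespace, which this sketch leaves out.
SAMPLE = """<mediawiki>
  <page>
    <title>Corpus linguistics</title>
    <revision><text>A corpus is a collection of texts.</text></revision>
  </page>
</mediawiki>"""

def iter_pages(xml_text):
    """Yield (title, wikitext) pairs from a MediaWiki export document."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        yield title, text

pages = list(iter_pages(SAMPLE))
```

For real dumps, an incremental parser (`ET.iterparse`) avoids loading the whole file into memory, which is what dedicated packages like mediawiki-dump handle for you.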
kohjiaxuan/Wikipedia-Article-Scraper
A complete Python text analytics package that allows users to search for a Wikipedia article, scrape it, conduct basic text analytics, and integrate it into a data pipeline without writing excessive code.
OlehOnyshchak/pyWikiMM
Collects a multimodal dataset of Wikipedia articles and their images
wolfgarbe/WikipediaExport
Convert Wikipedia XML dump files to JSON or Text files
kylemin/DeViSE
Implementation of DeViSE (NIPS 2013), including WordNet word2vec using the gensim library
ksipos/polysemy-assessment
Code and data for the paper 'Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings'
LeviMatheus/tcc-readability-score-level
Repository providing preprocessed Wikipedia and Simple Wikipedia datasets, along with Python scripts for preprocessing and dataset generation.
quqixun/ReadWiki-ZH
Converts Chinese Wikipedia XML dumps to human-readable documents in Markdown and plain text.
TomerAberbach/wikipedia-ngrams
📚 A Kotlin project which extracts ngram counts from Wikipedia data dumps.
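Extracting ngram counts, as this Kotlin project does, amounts to sliding a window of length n over a token stream and tallying each window. A minimal Python sketch of the idea (not taken from the project itself):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the free encyclopedia that anyone can edit the free encyclopedia".split()
bigrams = ngram_counts(tokens, 2)  # e.g. ("the", "free") appears twice
```

At Wikipedia scale the same tallying is typically done in a streaming pass over the dump rather than on an in-memory token list.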
ArisPan/wiki-query
A desktop application that searches through a set of Wikipedia articles using Apache Lucene.
bashkirtsevich-llc/wiki-dump-parser
Wiki dump parser (jupyter)
OmerCohen71/IR-Wikipedia-Search-Engine
Information-retrieval search engine app for Wikipedia
vikash212000yadav/Basic-Chatbot
Interactive chatbot using Python :)
Affenmilchmann/lingwiki
(Module in ongoing development) Retrieves the parsed content of Wikipedia articles. Created for building text corpus data quickly and easily, but can be freely used for other purposes too
afuschetto/wiki-extractor
Command line tool to extract plain text from Wikipedia database dumps
etcetra7n/wikibot
RNN model trained on a Wikipedia corpus
IDS-Mannheim/Wikipedia-Corpus-Builder
Builds Wikipedia corpora in I5 (a TEI-based format)
jksware/ai-spanish-wikipedia-clustering
Clustering of Spanish Wikipedia articles.
moodser/splitter-transliteration
Python script to split the text generated by 'wikipedia parallel title extractor' into separate text files (one file per language)
PJ-Duo/wiki-corpus
Create a wiki corpus using a wiki dump file for Natural Language Processing
rajatyadav1994/Wise--WikiPedia-Search-Engine
A search engine built on a 75 GB Wikipedia dump. Creates an index file and returns search results in real time
Triansh/Wiki-Searcher
A search engine built over a corpus of Wikipedia articles to provide efficient query results.
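The search-engine repos above (Wise, Wiki-Searcher, and the other index-based entries) all rest on the same core structure: an inverted index mapping each term to the documents containing it. A minimal sketch of that structure, independent of any of the listed projects:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return doc ids containing every query term (boolean AND retrieval)."""
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

docs = {
    1: "Wikipedia is a free encyclopedia",
    2: "A corpus of Wikipedia articles",
    3: "Search engines build inverted indexes",
}
index = build_index(docs)
```

Real engines layer tokenization, stemming, and relevance ranking (e.g. TF-IDF or BM25, as in Apache Lucene) on top of this basic lookup.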
macbre/faroese-corpus
Some Faroese language statistics taken from fo.wikipedia.org content dump