A curated list of free resources dedicated to Hungarian Natural Language Processing
Maintainers - György Orosz
- Tools
- Datasets
- Journals / Conferences / Institutes / Events
- Courses / Tutorials
- Blogs / Communities
- Other Hungarian related resource collections
Notations:
- π Easy to install and use
- π Commercial-friendly license
- π― Pretrained models are available or not needed
- huntoken πππ― Hungarian word and sentence splitter
- quntoken πππ― New Hungarian tokenizer based on quex, huntoken
- emMorph (Humor) π― Hungarian morphological analyzer based on Humor
- emMorphPy ππ― A wrapper, a lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer
- hunmorph ππ― is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages.
- hunmorph-foma ππ― Hungarian morphological analyzer and generator based on hunmorph.
- hunspell πππ― is an open-source spell-checker, stemmer and morphological analyzer
- lara-hungarian-nlp πππ― LARA is a lightweight Python NLP library for ChatBots in Hungarian.
- Lemmagen πππ― project aims to provide a standardized open source multilingual platform for lemmatisation. (Python package for v2 | C# project for v3)
- hunpos πππ― Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants.
- PurePos ππ Open source morphological tagger based on HunPos
- purepos.py ππ Python wrapper for PurePos
- HunTag ππ A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
- HunTag3 ππ Improved version of the original HunTag
- SzegedNER πππ― Named Entity Recognition tool for Hungarian and English
- DBpedia Spotlight πππ― DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. Docker image
- emBERT πππ― is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package.
- magyarlanc ππ― A toolkit for the basic linguistic processing of Hungarian
- magyarlanc_spark ππ― Spark wrapper for magyarlanc
- spaCy πππ― Industrial-strength Natural Language Processing (NLP) with Python and Cython (Hungarian models)
- huNLP ππ― Unified Java and REST API for magyarlanc and szegedNER
- hunlp-GATE π― GATE plugin containing Hungarian NLP tools as GATE processing resources
- Trendminer Hungarian Processing Pipeline π Hungarian NLP pipeline for social media text analysis (TrendMiner project)
- Google Syntaxnet ππ― Neural Models of Syntax
- UDPipe πππ― is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
- polyglot πππ― is a natural language pipeline that supports massive multilingual applications.
- emtsv ππ― is a text processing system with inter-module communication via tsv + REST API
- StanfordNLP ππ― is a Python NLP Library for Many Human Languages including Hungarian
- spaCy StanfordNLP ππ― wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline
- trankit πππ― A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
- whatlies πππ― Pretrained language models and word embeddings for Scikit-Learn. Also supports Hungarian backends.
- hunpars ππ― A rule based Hungarian syntactical analyzer
- HunParse ππ― An NLTK-based parser using KR-style morphological annotation
- Anagramma Parser A parser based on psycholinguistics principles
- benepar A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.
- SentimentAnalysisHUN πππ― is an open-source sentiment analysis tool for Hungarian language, written in Python.
- hun-date-parser πππ― A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text.
- emLam πππ― Preprocessing scripts for Hungarian Language Modeling
- pywnxml πππ― Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)
- Hun-appointment-chatbot πππ― A simple Hungarian chatbot for booking an appointment using the Rasa framework.
- neural-punctuator πππ― Automatic punctuation restoration with BERT models for English and Hungarian
- hunaccent πππ― Small Footprint Diacritic Restoration for Hungarian
- Hungarian Webcorpus With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.
- Hungarian Webcorpus 2.0 The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.
- OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)
- emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.
- Leipzig corpora contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.
- web2corpus Automatically create multilingual web corpus
- CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec
- OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
- The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.
- Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
- Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.
- Universal Dependencies
- Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
- KorKorpusz is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations
- NerKor is a gold standard named entity annotated corpus containing 1 million tokens.
- hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
- Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
- PrevCons is a database of 21K hapaxes of verbs with verbal prefixes
- Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
- HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.
- Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.
- SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
- HunOr A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.
- CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl
- CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian
- CC-100 Monolingual Datasets from Web Crawl Data
- Hungarian-Russian Prisoner of War Database
- Hungarian sentiment corpus (HuSent) is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage [http://divany.hu/]
- TED talks transcripts parallel corpus Sentence-aligned TED talk transcripts, including Hungarian.
- TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database
- fastText Wikipedia pre-trained word vectors for 90 languages, trained on Wikipedia using fastText.
- fastText Common Crawl & Wikipedia pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model.
- FastText_multilingual Multilingual word vectors in 78 languages
- polyglot vectors polyglot embeddings trained on Wikipedia
- wordvectors Pre-trained word2vec and fasttext word vectors on wikipedia of 30+ languages
- hunembed0.0 A word2vec word embedding trained on the concatenation of the Hungarian Webcorpus and the Hungarian National Corpus in 600 dimensions with a cut-off of 10 words.
- Szeged word vectors Word embeddings (word2vec & fasttext) for Hungarian trained on 4.3 billion tokens
- questions-words-hu Hungarian analogical questions following Mikolov et al.
- Conceptnet Numberbatch ConceptNet Numberbatch multi- and cross-lingual semantic word embeddings
- Multi-sense word embeddings
- BytePair Embeddings pretrained Subword Embeddings, downloadable in many formats
- ELMo Representations Deep contextualized word representation trained for many languages
- huBERT Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia
- HIL* Transformer models Pretrained transformer models provided by HILANCO
- morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
- huwn Hungarian Wordnet
- Hungarian Sentiment Lexicon The dictionaries were manually created on the basis of Wordnet-Affect lexicons.
- 4lang Concept dictionary using Eilenberg machines
- Named Entity lists for Hungarian
- Mazsola ISZ lists 500K verb frames extracted from the Mazsola database
- Manocska merges verb frames from existing databases
- PrevLex List of phrasal verbs
- panmorph Tagsets and description of Hungarian morphological analysers.
- hun_ner_checklist CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition
- Wikipedia dumps
- DBPedia dumps
- huwn.rdf Hungarian WordNet in RDF format for the Linked Open Data cloud
- Conceptnet An open, multilingual knowledge graph (with partial Hungarian support)
- MSZNY (Conference on Hungarian Computational Linguistics) 2018 2017 2016 2015 2014 2013 2011 2010 2009
- Natural Language Processing Group of the Pázmány Péter Catholic University Faculty of Information Technology and Bionics
- Department of Language Technology and Applied Linguistics, RIL-MTA
- Human Language Technology Research Group of the Budapest University of Technology and Economics
- Natural Language Processing Group of the University of Szeged
- BME - Laboratory of Speech Acoustics
TBD
- Kereső világ Official blog of Precognox Inc.
- Hungarian NLP Meetup
- Deep Learning Reading Seminar Meetup
- EENLP The broad index of NLP resources for Eastern European languages.
- European Language Grid
- Hugging Face Dataset
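
Several resources listed above (UDPipe, Universal Dependencies, the CoNLL 2017 data) distribute annotations as CoNLL-U files. As a rough orientation, the sketch below reads that format using only the Python standard library. The function name `read_conllu` and the sample sentence are hypothetical and hand-written for illustration, not taken from any of the listed corpora or tools. The sample leaves most annotation columns as `_`.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns of a CoNLL-U token line.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Hand-written illustrative sample ("The dog barks."), not real corpus data.
SAMPLE = "\n".join([
    "# text = A kutya ugat.",
    "\t".join(["1", "A", "a", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "kutya", "kutya", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "ugat", "ugat", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE):
        print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

For real projects, the dedicated tools in the list are likely a better choice than a hand-rolled reader. This sketch does not treat multiword token ranges or empty nodes specially.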