List PyThaiNLP 2.0
wannaphong opened this issue · 1 comments
wannaphong commented
- Since there will be some API changes, this will be released as PyThaiNLP 2.0 (#154)
- Thai Text Classification Benchmarks https://github.com/PyThaiNLP/classification-benchmarks
New evaluation corpus
- Prachatai newspaper, from https://prachatai.com - with news category tags
- Wisesight-Sentiment Corpus, Thai Social Media Sentiment Dataset
- truevoice-intent, Intent Dataset from Customer Service Phone Calls Transcribed by TrueVoice's Mari
New features
- thai2fit 0.3 (formerly thai2vec) (started with commit 0a8a60d)
  - Use fastai 1.0.22
  - Pretrained model and inference will now use the same frozen set of words, for better accuracy
- `pythainlp.transliterate.transliterate` grapheme to phoneme (#139)
- New `NorvigSpellChecker` class - can be initialized with custom dictionary (#119, #137)
- `pythainlp.util.thai_strftime` for date and time formatting (uses standard `datetime.strftime` directives) (#160)
- Installation options for extra dependency packages (#153, #157)
  - Run `pip install pythainlp` for minimum dependencies, just enough to run the core functions of PyThaiNLP. Run `pip install pythainlp[full]` to install every package required for extended functions (like the machine-learned named-entity recognizer that relies on keras).
- `pythainlp.util.thaicheck` - Thai check #171
- Add Orchid to Universal Dependencies 166b671
Bug fixes
- Fix `metasound` soundex to work as described in the Snae & Brückner (2009) paper. (#135)
- Fix Peter Norvig's spell checker probability of candidate words (#90)
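To illustrate what "probability of candidate words" refers to in a Norvig-style spell checker, here is a minimal sketch. It is not PyThaiNLP's implementation: the toy word counts and the names `WORD_COUNTS`, `probability`, and `best_candidate` are assumptions made up for this example. The idea is that each candidate correction is scored by its relative frequency in a word-frequency dictionary, and the most frequent candidate wins.

```python
from collections import Counter

# Hypothetical toy word-frequency dictionary. A real checker would load
# counts from a corpus or from a user-supplied custom dictionary.
WORD_COUNTS = Counter({"spelling": 40, "spewing": 5, "spieling": 1})
TOTAL = sum(WORD_COUNTS.values())


def probability(word: str) -> float:
    """Relative frequency of `word` in the dictionary (0.0 if unseen)."""
    # Counter returns 0 for missing keys, so unseen words get 0.0.
    return WORD_COUNTS[word] / TOTAL


def best_candidate(candidates):
    """Return the candidate with the highest probability."""
    return max(candidates, key=probability)
```

With these toy counts, `best_candidate(["spewing", "spelling"])` picks `"spelling"` because it has the larger count.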
Other improvements and optimizations
- Upgrade ULMFiT-related code to fastai 1.0 (#136)
- Frequently used regular expressions are now precompiled [should be faster, need benchmark here] (#124, #133, #138)
- Consolidate documentation files (#128, #129)
- Remove Python 2 compatibility code (deprecated in 1.7 - #107) (#134)
- Refactoring: reduce redundant and unused code, merged common code (#125, #132, #146, #148, #149)
- Remove temporary files, experiment files, and obsolete files (#126, #140, #143)
- More consistent indentation in source code
- Handling None, empty values, errors, and unexpected cases:
  - Check for None and empty values and make an appropriate return when necessary (#151, etc.)
  - Raise `ImportError` if there is an import error, instead of calling `sys.exit()`
  - Functions like `tokenize`, `summarize`, etc. will always return something even if the specified engine is not found (will fall back to the default engine) (#131)
- More and improved examples (#122, #127)
- Improved test coverage with more test cases (#147, #156)
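The error-handling behaviour described in the list above (empty input returns an empty result, an unknown engine falls back to the default, and a missing optional package raises `ImportError` rather than exiting the interpreter) could look roughly like the sketch below. Everything here is hypothetical and only illustrates the pattern: the engine names, the `ENGINES` registry, and the `hypothetical_optional_backend` module are invented for the example and are not PyThaiNLP's actual code.

```python
import re

DEFAULT_ENGINE = "whitespace"


def _whitespace_tokenize(text):
    """Default engine: split on runs of whitespace."""
    return re.findall(r"\S+", text)


def _fancy_tokenize(text):
    """Engine that depends on an optional (here: nonexistent) package."""
    try:
        import hypothetical_optional_backend  # assumption: not installed
    except ImportError as err:
        # Raise instead of calling sys.exit(), so callers can recover.
        raise ImportError(
            "The 'fancy' engine needs an optional package; install it first."
        ) from err
    return hypothetical_optional_backend.split(text)


# Hypothetical engine registry: engine name -> tokenizer function.
ENGINES = {"whitespace": _whitespace_tokenize, "fancy": _fancy_tokenize}


def tokenize(text, engine=DEFAULT_ENGINE):
    """Tokenize `text`, tolerating None/empty input and unknown engines."""
    if not text:
        return []  # None or empty string: return an empty list
    # Unknown engine name: fall back to the default engine.
    func = ENGINES.get(engine, ENGINES[DEFAULT_ENGINE])
    return func(text)
```

Because the failure surfaces as an exception, a caller can catch `ImportError` and retry with the default engine, which is not possible when a library calls `sys.exit()`.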
Name changes in API
- Rearrangement of utility functions. Most of them, like `rank`, `find_keyword`, `collate`, and functions related to date and time, are now in the `pythainlp.util` module. (#160)
- Some class and function names are changed from 1.7 to align them with PEP8 (Style Guide for Python Code), to make them more explicit about what they do, or to make them more consistent with other related classes/functions. For example:
  - `thainer` and `thai2rom` classes are now `ThaiNameTagger` and `ThaiTransliterator` (CapWords for class names)
  - `pythainlp.soundex.LK82`, `pythainlp.soundex.Udom83`, and `pythainlp.MetaSound` functions are now `pythainlp.soundex.lk82`, `pythainlp.soundex.udom83`, and `pythainlp.soundex.metasound` (lowercase for function names; metasound also moved to the soundex module)
  - `collation`, `correction`, and `romanization` functions are now `collate`, `correct`, and `romanize`, in a verb (action) form and in line with the `tokenize` and `summarize` functions.
- `pythainlp.corpus.alphabets`, `pythainlp.corpus.tone`, etc. constants are now `pythainlp.thai_consonants`, `pythainlp.thai_tonemarks`, etc.
  - They are also now `str` instead of `set`.
  - This is to follow the example of `string.ascii_letters`, etc. `str` also iterates a little bit faster in the one-character-per-member use cases that these constants are usually used for.
- These changes will break your code if it directly invokes those classes/functions. In general, the change should be only at the level of the class or function name; there should be no change to the arguments passed to the class or function. Please refer to the API doc.
- Internally, there are also name changes of corpus files (#141), but this should not have any effect on the API.
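As a rough illustration of the `str`-instead-of-`set` change for the character constants, the sketch below uses a local stand-in string rather than importing the library. `THAI_CONSONANTS` here is an assumption that mirrors what a constant like `pythainlp.thai_consonants` would contain; the helper names are likewise made up. Per-character membership tests read the same way as with `string.ascii_letters`.

```python
import string

# Assumption: local stand-in for a str constant such as
# pythainlp.thai_consonants (the 44 Thai consonant letters).
THAI_CONSONANTS = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ"


def count_consonants(text: str) -> int:
    """Count characters of `text` that are Thai consonants."""
    return sum(1 for ch in text if ch in THAI_CONSONANTS)


def is_ascii_letters(text: str) -> bool:
    """Same membership pattern, using the stdlib str constant."""
    return bool(text) and all(ch in string.ascii_letters for ch in text)


# Code that still needs set semantics can convert explicitly.
THAI_CONSONANT_SET = set(THAI_CONSONANTS)
```

One thing to watch when migrating: `in` on a `str` is a substring test, so a multi-character string such as `"กข"` is also "in" the constant, which was not the case with a `set`. Converting with `set(...)` as shown restores the old behaviour where that matters.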
wannaphong commented
PyThaiNLP 2.0 documentation #178