PyThaiNLP/pythainlp

List PyThaiNLP 2.0

wannaphong opened this issue · 1 comment

New evaluation corpus

New features

  • thai2fit 0.3 (formerly thai2vec) (started with commit 0a8a60d)
    • Use fastai 1.0.22
    • Pretrained model and inference will now use the same frozen set of words, for better accuracy
  • pythainlp.transliterate.transliterate for grapheme-to-phoneme conversion (#139)
  • New NorvigSpellChecker class, which can be initialized with a custom dictionary (#119, #137)
  • pythainlp.util.thai_strftime for date and time formatting (uses standard datetime.strftime directives) (#160)
  • Installation options for extra dependency packages (#153, #157)
    • Run pip install pythainlp for the minimum dependencies, just enough to run the core functions of PyThaiNLP. Run pip install pythainlp[full] to install every package required for the extended functions (such as the machine-learned named-entity recognizer, which relies on keras).
  • pythainlp.util.thaicheck - Thai check (#171)
  • Add Orchid to Universal Dependencies (commit 166b671)
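To illustrate what initializing a Norvig-style spell checker with a custom dictionary can look like, here is a minimal, self-contained sketch. It is not PyThaiNLP's NorvigSpellChecker: the class name TinySpellChecker, its constructor arguments, and the sample English dictionary are all hypothetical, chosen only to show the general technique (generate candidates one edit away, keep the known ones, pick the most probable).

```python
from collections import Counter


class TinySpellChecker:
    """Hypothetical Norvig-style checker; NOT PyThaiNLP's NorvigSpellChecker."""

    def __init__(self, word_freqs, alphabet):
        # word_freqs: iterable of (word, count) pairs, i.e. the custom dictionary.
        self.freqs = Counter(dict(word_freqs))
        # Guard against an empty dictionary to avoid division by zero.
        self.total = sum(self.freqs.values()) or 1
        self.alphabet = alphabet

    def prob(self, word):
        # Relative frequency of the word in the dictionary (0 if unknown).
        return self.freqs[word] / self.total

    def known(self, words):
        # Keep only the words that appear in the dictionary.
        return {w for w in words if w in self.freqs}

    def edits1(self, word):
        # All strings one delete, transpose, replace, or insert away from word.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in self.alphabet]
        inserts = [a + c + b for a, b in splits for c in self.alphabet]
        return set(deletes + transposes + replaces + inserts)

    def correct(self, word):
        # Prefer the word itself if known, then known words one edit away,
        # and fall back to returning the input unchanged.
        candidates = self.known([word]) or self.known(self.edits1(word)) or {word}
        return max(candidates, key=self.prob)


# Sample (made-up) dictionary with word counts.
checker = TinySpellChecker(
    [("spelling", 10), ("spewing", 2), ("hello", 5)],
    alphabet="abcdefghijklmnopqrstuvwxyz",
)
print(checker.correct("speling"))
```

Both "spelling" and "spewing" are one edit away from "speling"; the sketch picks "spelling" because it has the higher count in the sample dictionary.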

Bug fixes

  • Fix metasound soundex to work as described in the Snae & Brückner (2009) paper. (#135)
  • Fix the probability calculation for candidate words in the Peter Norvig spell checker (#90)

Other improvements and optimizations

  • Upgrade ULMFiT-related code to fastai 1.0 (#136)
  • Frequently used regular expressions are now precompiled (expected to be faster; benchmark still needed) (#124, #133, #138)
  • Consolidate documentation files (#128, #129)
  • Remove Python 2 compatibility code (deprecated in 1.7 - #107) (#134)
  • Refactoring: reduce redundant and unused code, merged common code (#125, #132, #146, #148, #149)
  • Remove temporary files, experiment files, and obsoleted files (#126, #140, #143)
  • More consistent indentation in source code
  • Handling of None, empty values, errors, and unexpected cases:
    • Check for None and empty values and return an appropriate value when necessary (#151, etc.)
    • Raise ImportError when an import fails, instead of calling sys.exit()
    • Functions like tokenize, summarize, etc. will always return something, even if the specified engine is not found (they fall back to the default engine) (#131)
  • More and improved examples (#122, #127)
  • Improved test coverage with more test cases (#147, #156)
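The error-handling behaviour listed above (empty-input checks, falling back to the default engine, raising ImportError rather than exiting, and precompiled regular expressions) can be sketched roughly as follows. This is an illustration of the pattern only, not PyThaiNLP's code: the engine names "whitespace" and "char", the helper load_optional, and the wording of the error message are assumptions made for the example.

```python
import importlib
import re

# Compiled once at import time instead of on every call.
_WORD_RE = re.compile(r"\S+")


def _whitespace_tokenize(text):
    return _WORD_RE.findall(text)


def _char_tokenize(text):
    return [ch for ch in text if not ch.isspace()]


# Hypothetical engine registry; these names are not PyThaiNLP engines.
_ENGINES = {"whitespace": _whitespace_tokenize, "char": _char_tokenize}
_DEFAULT_ENGINE = "whitespace"


def tokenize(text, engine=_DEFAULT_ENGINE):
    # None or empty input gets a sensible empty result instead of an error.
    if not text:
        return []
    # An unknown engine name falls back to the default engine.
    tokenizer = _ENGINES.get(engine, _ENGINES[_DEFAULT_ENGINE])
    return tokenizer(text)


def load_optional(module_name):
    # Raise ImportError so the caller can decide what to do;
    # calling sys.exit() here would terminate the whole host program.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            "try: pip install pythainlp[full]"
        ) from exc


print(tokenize("hello wide world", engine="no-such-engine"))
```

Raising ImportError keeps the decision with the caller, who can catch it and degrade gracefully, which matters when the library is embedded in a larger application.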

Name changes in API

  • Rearrangement of utility functions. Most of them, like rank, find_keyword, collate, and the functions related to date and time, are now in the pythainlp.util module. (#160)
  • Some class and function names have changed from 1.7 to align them with PEP8 (Style Guide for Python Code), to make them more explicit about what they do, or to make them more consistent with other related classes/functions. For example:
    • thainer and thai2rom classes are now ThaiNameTagger and ThaiTransliterator (CapWords for class names)
    • pythainlp.soundex.LK82, pythainlp.soundex.Udom83, and pythainlp.MetaSound functions are now pythainlp.soundex.lk82, pythainlp.soundex.udom83, and pythainlp.soundex.metasound (lowercase for function names; metasound also moved to the soundex module)
    • collation, correction, and romanization functions are now collate, correct, and romanize, in a verb (action) form and in line with the tokenize and summarize functions.
  • pythainlp.corpus.alphabets, pythainlp.corpus.tone, etc. constants are now pythainlp.thai_consonants, pythainlp.thai_tonemarks, etc.
    • They are also now str instead of set.
    • This follows the example of string.ascii_letters, etc. A str also iterates slightly faster in the use case these constants are usually used for, where each member is a single character.
  • These changes will break your code if it directly invokes those classes/functions. In general, the change should be only at the level of the class or function name; there should be no change in the arguments passed to the class or function. Please refer to the API doc.
  • Internally, there are also name changes to corpus files (#141), but this should not have any effect on the API.
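As a rough illustration of what the set-to-str change for character constants means for calling code, the sketch below uses a local stand-in for the consonant constant. THAI_CONSONANTS is built here from Unicode code points and is an assumption for the example; the library's actual value may differ, and the helper count_consonants is hypothetical.

```python
import string

# Stand-in for the library constant (assumption): Thai consonants
# U+0E01..U+0E2E, skipping U+0E24 and U+0E26.
THAI_CONSONANTS = "".join(
    chr(cp) for cp in range(0x0E01, 0x0E2F) if cp not in (0x0E24, 0x0E26)
)


def count_consonants(text, consonants=THAI_CONSONANTS):
    # Single-character membership works the same for a str as for a set.
    return sum(1 for ch in text if ch in consonants)


# Same style as the standard library's str constants.
ascii_count = sum(1 for ch in "abc123" if ch in string.ascii_letters)

# Caveat: on a str, `in` is a substring test, so a multi-character
# string can match; on a set it would not.
pair = THAI_CONSONANTS[:2]
matches_in_str = pair in THAI_CONSONANTS
matches_in_set = pair in set(THAI_CONSONANTS)

# Code that relied on set operations needs an explicit conversion.
consonant_set = set(THAI_CONSONANTS)
overlap = consonant_set & set("ภาษาไทย")

print(count_consonants("ภาษาไทย"))
```

Code that only tests single characters with `in` or iterates over the constant should keep working; code that called set methods such as union or intersection needs the explicit set(...) conversion shown above.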

PyThaiNLP 2.0 documentation #178