jbesomi/texthero

๐Ÿ‘ฉโ€๐Ÿ’ป API next steps: checklist

jbesomi opened this issue · 1 comment

The following is a high-level view of the next main enhancement steps. This document will be kept up to date and improved frequently. The work will mainly be carried out by @mk2510 and @henrifroese as part of their SummerOfCode project.


  1. Version 1.10

    • Every representation function to receive as input a TokenSeries #44
    • Decouple TF-IDF L2-normalization and TF-IDF #76
    • Rename term_frequency to count() + add function term_frequency #61
    • Introduce HeroSeries
    • Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
    • Can we avoid the use of VectorSeries/TokenSeries?
    • All representation functions to deal with HeroSeries + (DocumentTermDF) #43
    • Update README + getting-started.md
    • Push a new version to PyPI
  2. Performance: speed-up the library

    • Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
    • Make spaCy functions faster + Dask vs spaCy #65
    • Depending on the previous task, evaluate whether spaCy should be the default tokenizer: #131
  3. Software development:

    • Integrate checking for correct Series types (#60, #55, ...)
    • Check hero functions work with np.nan #86
  4. Support Embeddings through Flair

    • Add hero.embed(s, flairEmbedding)
  5. Add Topic Modeling

    • Add topic modeling support under representation #42
      This also includes "topic modeling visualization" to get insights out of it
    • Add a blog article on how topic modeling with Texthero works
  6. Extra

    1. test coverage
    2. expand multilingual support: more languages; recognize the language and select the correct one
    3. (low priority) Text summarization (#38) and characteristic terms (#2)
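
To make the Version 1.10 items on count() and a separate norm step more concrete, here is a minimal pure-Python sketch of the idea. It is not Texthero's implementation: the function names `count` and `norm` only mirror the names used in the checklist, and the signatures, return types, and handling of all-zero rows are assumptions made for illustration. The point is that the representation step returns raw values and normalization is applied afterwards as its own call.

```python
import math
from collections import Counter


def count(token_docs):
    """Return (vocabulary, rows); each row holds raw term counts for one document.

    Hypothetical stand-in for a count-style representation function.
    """
    vocabulary = sorted({token for doc in token_docs for token in doc})
    rows = []
    for doc in token_docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def norm(rows, kind="l2"):
    """Scale each row to unit L1 or L2 length.

    Assumption: all-zero rows are returned unchanged to avoid division by zero.
    """
    if kind not in ("l1", "l2"):
        raise ValueError(f"unknown norm: {kind!r}")
    normalized = []
    for row in rows:
        if kind == "l1":
            length = sum(abs(x) for x in row)
        else:
            length = math.sqrt(sum(x * x for x in row))
        normalized.append([x / length for x in row] if length else list(row))
    return normalized


# Tokenized input in the list-of-lists shape mentioned in the checklist.
docs = [["a", "document"], ["another", "document", "document"]]

vocabulary, counts = count(docs)  # representation step: raw counts only
l1 = norm(counts, "l1")           # normalization is a separate, optional step
l2 = norm(counts, "l2")
```

Keeping the two steps apart means the same norm call could be reused after any representation function (counts, term frequency, TF-IDF), instead of each function baking in its own normalization.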

Merge-Plan

  • #156 (representation series to multicolumn). Branched from Texthero Master
  • #157 (hero types in representation & DocumentTermDF in _types). Branched from #156
    #158 (add pandas setitem support for DocumentTermDF). Branched from #156
  • #174 (Fix type checks). Branched from Texthero Master
  • #117 (getting-started), #118 (README), #135 (getting-started hero-types). Branched from Texthero Master
  • RELEASE NEW VERSION
  • #146 (Flair Embeddings). Branched from Texthero Master
  • #160 (Travis Clean-Up). Branched from Texthero Master
  • #161 (Pre-commit hook). Branched from Texthero Master
  • #162 (Speed-Up Preprocessing+NLP). Branched from Texthero Master
  • #163 (Topic Modelling w/ Visualizations). Branched from Texthero Master
  • #165 (Fix term_frequency). Branched from Texthero Master
  • #167 (Train-Test Split). Branched from Texthero Master
  • #168 (Describe DF). Branched from Texthero Master
  • #169 (filter extremes). Branched from Texthero Master
  • #170 (ClusterSeries Type). Branched from Texthero Master
  • #175 (Visualization Tutorial). Branched from Texthero Master
  • #176 (NLP Tutorial). Branched from Texthero Master
  • #177 (Show DataFrame). Branched from Texthero Master
  • #178 (Visualize Describe DF). Branched from #168