👩‍💻 API next steps: checklist
jbesomi opened this issue · 1 comment
jbesomi commented
The following is a high-level view of the next main enhancement steps. This document will be kept up to date and improved frequently. The work will be conducted mainly by @mk2510 and @henrifroese as part of their SummerOfCode project.
- Version 1.10
  - Every representation function to receive a TokenSeries as input #44
  - Decouple TF-IDF L2-normalization and TF-IDF #76
  - Rename term_frequency to count() + add function term_frequency #61
  - Introduce HeroSeries
  - Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
  - Can we avoid the use of VectorSeries/TokenSeries?
  - All representation functions to deal with HeroSeries + (DocumentTermDF) #43
  - Update README + getting-started.md
  - Push a new version to PyPI
- Performance: speed up the library
  - Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
  - Make spaCy function faster + Dask vs spaCy #65
  - Depending on the previous task, evaluate whether we want spaCy as the default tokenizer: #131
- Software development:
- Support Embeddings through Flair
  - Add hero.embed(s, flairEmbedding)
- Add Topic Modeling
  - Add topic modeling support under representation #42
    This also includes "topic modeling visualization" to get insights out of it
  - Add a blog article on how topic modeling with Texthero works
- Extra
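To make the Version 1.10 items on count(), term_frequency and decoupling L2-normalization a bit more concrete, here is a minimal sketch in plain Python. It is not Texthero's implementation: the signatures, the dict-per-document output and the per-document definition of term frequency are all assumptions made for illustration. The point is only that counting, relative frequency and normalization can live in separate functions that compose.

```python
import math
from collections import Counter


def count(docs):
    """Raw number of occurrences of each term, one dict per document."""
    return [dict(Counter(tokens)) for tokens in docs]


def term_frequency(docs):
    """Counts divided by document length (assumed definition), so each
    non-empty document's values sum to 1."""
    result = []
    for tokens in docs:
        total = len(tokens)
        if total == 0:
            result.append({})
            continue
        result.append({t: c / total for t, c in Counter(tokens).items()})
    return result


def l2_normalize(vectors):
    """Scale each document vector to unit Euclidean length.

    Kept as its own step so any representation (counts, term frequency,
    TF-IDF) can be normalized or left as is.
    """
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec.values()))
        normalized.append({t: v / norm for t, v in vec.items()} if norm else dict(vec))
    return normalized


docs = [["a", "document"], ["another", "document", "document"]]
counts = count(docs)
tf = term_frequency(docs)
unit = l2_normalize(counts)
```

With this split, normalization becomes an explicit choice of the caller, which is roughly what a hero.norm(..., "l1"/"l2") style function would offer.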
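For the parallelization question under Performance, one possible starting point is to split the list of tokenized documents into chunks and map a per-chunk function over them. The sketch below uses only the standard library; lowercase_tokens, split_into_chunks and map_in_chunks are hypothetical names, not Texthero or spaCy functions. It runs on a thread pool so that it works as a plain script; for CPU-bound steps one would more likely pass a ProcessPoolExecutor (created under an `if __name__ == "__main__":` guard, with a picklable top-level function).

```python
from concurrent.futures import ThreadPoolExecutor


def lowercase_tokens(chunk):
    """Stand-in per-document step: lowercase every token in a chunk of documents."""
    return [[token.lower() for token in doc] for doc in chunk]


def split_into_chunks(docs, n_chunks):
    """Split docs into at most n_chunks contiguous, order-preserving chunks."""
    size = max(1, -(-len(docs) // n_chunks))  # ceiling division
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def map_in_chunks(func, docs, executor, n_chunks=4):
    """Apply func to each chunk on the executor and stitch results back in order."""
    chunks = split_into_chunks(docs, n_chunks)
    results = executor.map(func, chunks)  # Executor.map preserves input order
    return [doc for chunk in results for doc in chunk]


docs = [["A", "Document"], ["Another", "Document"], ["One", "More"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = map_in_chunks(lowercase_tokens, docs, pool, n_chunks=2)
```

Working on chunks rather than single documents keeps the per-task overhead small, which matters most when the work is sent to separate processes.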
henrifroese commented
Merge-Plan
- #156 (representation series to multicolumn). Branched from Texthero Master
- #157 (hero types in representation & DocumentTermDF in _types). Branched from #156
- #158 (add pandas setitem support for DocumentTermDF). Branched from #156
- #174 (Fix type checks). Branched from Texthero Master
- #117 (getting-started), #118 (README), #135 (getting-started hero-types). Branched from Texthero Master
- RELEASE NEW VERSION
- #146 (Flair Embeddings). Branched from Texthero Master
- #160 (Travis Clean-Up). Branched from Texthero Master
- #161 (Pre-commit hook). Branched from Texthero Master
- #162 (Speed-Up Preprocessing+NLP). Branched from Texthero Master
- #163 (Topic Modelling w/ Visualizations). Branched from Texthero Master
- #165 (Fix term_frequency). Branched from Texthero Master
- #167 (Train-Test Split). Branched from Texthero Master
- #168 (Describe DF). Branched from Texthero Master
- #169 (filter extremes). Branched from Texthero Master
- #170 (ClusterSeries Type). Branched from Texthero Master
- #175 (Visualization Tutorial). Branched from Texthero Master
- #176 (NLP Tutorial). Branched from Texthero Master
- #177 (Show DataFrame). Branched from Texthero Master
- #178 (Visualize Describe DF). Branched from #168