The master branch holds the stable version (the latest CRAN release). Active development happens in the 0.4 branch.
To learn how to use this package, see the package vignettes.
- Text vectorization:
vignette("text-vectorization", package = "text2vec")
- GloVe word embeddings:
vignette("glove", package = "text2vec")
See also the text2vec articles on my blog.
text2vec is a package that provides an efficient framework with a concise API for text analysis and natural language processing (NLP) in R. It is inspired by gensim, an excellent Python library for NLP.
The core functionality currently includes:
- Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
- State-of-the-art GloVe word embeddings.
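As a rough illustration of the two vectorization routes listed above, here is a minimal sketch. It assumes a recent CRAN release of text2vec (0.4 or later); function names in older releases differ, so treat the calls as version-dependent. The toy corpus `docs` is made up for this example.

```r
# Minimal sketch; assumes text2vec 0.4 or later is installed.
library(text2vec)

# Hypothetical toy corpus, used only for illustration.
docs <- c("The quick brown fox",
          "jumps over the lazy dog",
          "the dog barks")

# Iterator over tokens: lowercase each document, then split it into words.
it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

# Route 1: vocabulary-based vectorization on unigrams and bigrams.
vocab <- create_vocabulary(it, ngram = c(1L, 2L))
dtm_vocab <- create_dtm(it, vocab_vectorizer(vocab))

# Route 2: feature hashing. No vocabulary pass is needed;
# terms are hashed into a fixed number of columns (2^10 here).
dtm_hash <- create_dtm(it, hash_vectorizer(hash_size = 2^10, ngram = c(1L, 2L)))

# Both results are sparse document-term matrices with one row per document.
dim(dtm_vocab)
dim(dtm_hash)
```

The vocabulary route keeps readable term names as column names, while the hashing route trades that for a fixed memory footprint and a single pass over the data.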
The core of this package is carefully written in C++, which makes text2vec fast and memory-friendly. Some parts (GloVe training) are fully parallelized using the excellent RcppParallel package, so parallel processing works on OS X, Linux, Windows and Solaris (x86) without any additional hacks or tricks. In addition, text vectorization and vocabulary construction have a higher-level parallelization built on top of the foreach package. The package also has a streaming API, so users don't have to load all of the data into RAM.
The API is built around the iterator abstraction. It is concise, providing only a few functions that do their job well. The package does not (and probably will not in the future) provide trivial, very high-level functions, but other packages can build on top of the framework that text2vec provides.
The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.
Contributors are welcome. You can help by:
- testing and leaving feedback on the GitHub issue tracker (preferably) or directly by e-mail.
- forking and contributing. Vignettes, docs, tests, and use cases are very welcome.
- giving me a star on the project page :-)