Tokenizers package
lmullen opened this issue · 19 comments
-
- What does this package do? (explain in 50 words or less)
Provides tokenizers for natural language.
-
- Paste the full DESCRIPTION file inside a code block below.
Package: tokenizers
Type: Package
Title: Tokenize Text
Version: 0.1.1
Date: 2016-04-03
Description: Convert natural language text into tokens. The tokenizers have a
consistent interface and are compatible with Unicode, thanks to being built
on the 'stringi' package. Includes tokenizers for shingled n-grams, skip
n-grams, words, word stems, sentences, paragraphs, characters, lines, and
regular expressions.
License: MIT + file LICENSE
LazyData: yes
Authors@R: c(person("Lincoln", "Mullen", role = c("aut", "cre"),
email = "lincoln@lincolnmullen.com"),
person("Dmitriy", "Selivanov", role = c("ctb"),
email = "selivanov.dmitriy@gmail.com"))
URL: https://github.com/lmullen/tokenizers
BugReports: https://github.com/lmullen/tokenizers/issues
RoxygenNote: 5.0.1
Depends:
R (>= 3.1.3)
Imports:
stringi (>= 1.0.1),
Rcpp (>= 0.12.3),
SnowballC (>= 0.5.1)
LinkingTo: Rcpp
Suggests: testthat
-
- URL for the package (the development repository, not a stylized html page)
https://github.com/lmullen/tokenizers
-
- What data source(s) does it work with (if applicable)?
Natural language text
-
- Who is the target audience?
Users of R packages for NLP
-
- Are there other R packages that accomplish the same thing? If so, what is different about yours?
Virtually every R package for NLP implements a couple of tokenizers. The point of this package is to collect all the tokenizers that one could conceivably want to use in a single package, and make sure that all the packages have a consistent interface. The package also aims to have fast and correct tokenizers implemented on top of stringi and Rcpp.
-
- Check the box next to each policy below, confirming that you agree. These are mandatory.
- This package does not violate the Terms of Service of any service it interacts with.
- The repository has continuous integration with Travis CI and/or another service
- The package contains a vignette
- The package contains a reasonably complete README with
devtools
install instructions - The package contains unit tests
- The package only exports functions to the NAMESPACE that are intended for end users
-
- Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
- Are there any package dependencies not on CRAN?
- Do you intend for this package to go on CRAN?
- Does the package have a CRAN accepted license?
- Did
devtools::check()
produce any errors or warnings? If so paste them below.
-
- Please add explanations below for any exceptions to the above:
The package does not contain a vignette, but it does contain an extensive README. Since all the tokenizers work in the same basic way, a vignette seems unnecessary. But if rOpenSci wants one, I can easily adapt the README into a standalone vignette.
This package is already on CRAN.
-
- If this is a resubmission following rejection, please explain the change in circumstances.
Thanks for the submission! Looking for reviewer
Reviewers: @kevinushey
Due: 2016-04-25
General Comments
Overall, everything is neat, tidy, and well put together. Nice! Most of my comments are stylistic, especially re: the C++ code included within. The near-100% test coverage is great to see indeed!
I can't comment as well as to the utility / scope of the package since I'm not as familiar with NLP, but overall the package feels quite well focused.
Tests
I ran R CMD check
on a locally built R-devel (r70486) with El Capitan + Apple Clang 7.3.0. I can also confirm that tests run fine with lldb
attached (ie, I saw no memory errors or anything like that reported from running the unit tests).
Comments
Please add a vignette, for potential users like me who are less familiar with NLP in general. :)
- https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L2: Generally,
using namespace std;
is considered somewhat bad practice; mainly you just want to double-check that you don't get burned with name collisions with things instd
vs other symbols, and prefer just usingstd::
as necessary. - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L21: I don't think
RCPP_UNORDERED_SET
andRCPP_UNORDERED_MAP
should really be considered part of the public Rcpp API; since you're already opting into C++11, you can likely just usestd::unordered_map
andstd::unordered_set
. - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L24: This is mostly stylistic, but if you want to pass an argument by reference, and mutate it within that function, you should consider passing it using a raw pointer. This might seem strange, but the idea is that at the call site it's much more obvious that a function parameter might get mutated. (This is a habit I picked up from https://google.github.io/styleguide/cppguide.html#Reference_Arguments)
- https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L53: Missing braces for
else
? (Or, prefer being consistent when using braces within each block within anif-else
block) - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L9: If you want to support long vectors in R, you likely want to return and use
R_xlen_t
for indexing. - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L12: Is it possible for both
n
andi
to be large? (Ie, could the result overflow anint
?) If so, you might want to use along
instead, or check for overflow explicitly. (similar: https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L21) - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/utils.R#L3: Prefer using scalar
&&
inif
conditions -- it expresses the intent slightly more clearly. (Similarly for|
vs.||
at e.g. https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/ngram-tokenizers.R#L66) - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/tests/testthat/data-for-tests.R: Consider just naming this e.g.
helper-data.R
;testthat
will then automatically source that file for you before running each test unit. - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L21: Missing
.
at the end of description? (also https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L22). - https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/DESCRIPTION#L3: Consider a slightly more evocative title?
@kevinushey: Thanks very much for this detailed code review. As you doubtless surmised, I'm at the very beginning of learning C++ and Rcpp, so I appreciate the clear guidance about how to improve that code. I didn't know that testthat automatically sourced helper-data.R
: very useful!
I agree with all the changes listed here, and I'll make these changes soon. (Unfortunately, probably not till the first week of May.) Then I'll ping this issue again.
@kevinushey thank you for review. Probably I can say that I'm an author of shingle_ngrams.cpp
. I'm not a professional c++ developer so find your comments related to c++ part very useful. Especially comments related to exporting std
namespace and passing mutable objects by raw pointer.
One thing related to RCPP_UNORDERED_MAP
and RCPP_UNORDERED_SET
. This came from my experience with text2vec package. I noticed that std::unordered_set
can't be compiled on windows - I checked it on win builder after several bug reports. This was with old gcc 4.6.3 toolchain, pretty sure it will work with new 4.9 toolchain. But interest thing is that there were no problems with std::unordered_map
!
After googling a little bit I found post on Rcpp mailing list which suggests to use RCPP_UNORDERED_SET
instead of std::unordered_set
.
Thanks for the ping, @sckott. I just need to get around to fixing @kevinushey's very straightforward list of changes (thanks again!). I'm a bit out of my depth with some of the C++. But let me take a stab at most of them, and I'll check in with @dselivanov on the C++ parts that he wrote.
thanks, sounds good
I've made fixes for most of the issues that @kevinushey pointed out. Here is a line by line accounting.
Please add a vignette, for potential users like me who are less familiar with NLP in general. :)
Added a vignette: ropensci/tokenizers@4a2ce76
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L2: Generally, using namespace std; is considered somewhat bad practice; mainly you just want to double-check that you don't get burned with name collisions with things in std vs other symbols, and prefer just using std:: as necessary.
Fixed: ropensci/tokenizers@023f801
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L53: Missing braces for else? (Or, prefer being consistent when using braces within each block within an if-else block)
Fixed: ropensci/tokenizers@b16b146
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L12: Is it possible for both n and i to be large? (Ie, could the result overflow an int?) If so, you might want to use a long instead, or check for overflow explicitly. (similar: https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L21)
I don't think we would ever overflow an int. In skip grams n and k (which becomes i) are usually in the single digits.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/utils.R#L3: Prefer using scalar && in if conditions -- it expresses the intent slightly more clearly. (Similarly for | vs. || at e.g. https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/ngram-tokenizers.R#L66)
Fixed: ropensci/tokenizers@b7be1d4
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/tests/testthat/data-for-tests.R: Consider just naming this e.g. helper-data.R; testthat will then automatically source that file for you before running each test unit.
I didn't know testthat automatically sourced helper-data.R
. Very useful! Fixed: ropensci/tokenizers@fa4b89e
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L21: Missing . at the end of description? (also https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L22).
Fixed: ropensci/tokenizers@efa0605
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/DESCRIPTION#L3: Consider a slightly more evocative title?
More descriptive title: ropensci/tokenizers@4c49a62
This issue @dselivanov already responded to.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L21: I don't think RCPP_UNORDERED_SET and RCPP_UNORDERED_MAP should really be considered part of the public Rcpp API; since you're already opting into C++11, you can likely just use std::unordered_map and std::unordered_set.
That leaves these two issues. Since they deal with code that @dselivanov wrote, can you take a look at them please, Dmitriy?
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L24: This is mostly stylistic, but if you want to pass an argument by reference, and mutate it within that function, you should consider passing it using a raw pointer. This might seem strange, but the idea is that at the call site it's much more obvious that a function parameter might get mutated. (This is a habit I picked up from https://google.github.io/styleguide/cppguide.html#Reference_Arguments)
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L9: If you want to support long vectors in R, you likely want to return and use R_xlen_t for indexing.
Assuming that the fixes above and Dmitriy's response to the unordered sets and maps questions is adequate, that leaves one stylistic question and one substantive question about long vectors.
Is the package ready, or do you want to wait till Dmitry looks at those two? I'm happy either way.
Thanks @lmullen - For those 2 outstanding issues, can you open issues for those in your repo, and we can consider these approved. I think you know what to do from here :)
thanks!
I transfered ownership to rOpenSci. Hope that's okay.
@lmullen I will have a look on Sunday-Monday.
sounds good, thanks, closing now, thanks again for submitting!
would you be interested in doing a blog post on our blog about this?
Sure, sounds like a good idea. I will likely be pretty short. How about I send it to you by the end of next week, to give me time to get a new version on CRAN?
Sounds good. And feel free to say no
It would be nice if more packages relied on tokenizers, so a little
publicity might help.
On Fri, Aug 12, 2016 at 1:02 PM, Scott Chamberlain <notifications@github.com
wrote:
sounds good, thanks, closing now, thanks again for submitting!
would you be interested in doing a blog post on our blog about this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#33 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AALNeIGY8hBmllGzCXxfMw1Ro4-wRLmvks5qfKdCgaJpZM4H-wBl
.
Lincoln Mullen
Assistant Professor, Department of History & Art History
George Mason University
true, would get more eyes on it, and hopefully uses in pkgs