Tokenizers package

Question

Tokenizers package

lmullen opened this issue 9 years ago · 19 comments

1. What does this package do? (explain in 50 words or less)

Provides tokenizers for natural language.

1. Paste the full DESCRIPTION file inside a code block below.

Package: tokenizers
Type: Package
Title: Tokenize Text
Version: 0.1.1
Date: 2016-04-03
Description: Convert natural language text into tokens. The tokenizers have a
    consistent interface and are compatible with Unicode, thanks to being built
    on the 'stringi' package. Includes tokenizers for shingled n-grams, skip
    n-grams, words, word stems, sentences, paragraphs, characters, lines, and
    regular expressions.
License: MIT + file LICENSE
LazyData: yes
Authors@R: c(person("Lincoln", "Mullen", role = c("aut", "cre"),
        email = "lincoln@lincolnmullen.com"),
        person("Dmitriy", "Selivanov", role = c("ctb"),
        email = "selivanov.dmitriy@gmail.com"))
URL: https://github.com/lmullen/tokenizers
BugReports: https://github.com/lmullen/tokenizers/issues
RoxygenNote: 5.0.1
Depends:
  R (>= 3.1.3)
Imports:
  stringi (>= 1.0.1),
  Rcpp (>= 0.12.3),
  SnowballC (>= 0.5.1)
LinkingTo: Rcpp
Suggests: testthat

1. URL for the package (the development repository, not a stylized html page)

https://github.com/lmullen/tokenizers

1. What data source(s) does it work with (if applicable)?

Natural language text

1. Who is the target audience?

Users of R packages for NLP

1. Are there other R packages that accomplish the same thing? If so, what is different about yours?

Virtually every R package for NLP implements a couple of tokenizers. The point of this package is to collect all the tokenizers that one could conceivably want to use in a single package, and make sure that all the packages have a consistent interface. The package also aims to have fast and correct tokenizers implemented on top of stringi and Rcpp.

1. Check the box next to each policy below, confirming that you agree. These are mandatory.
- This package does not violate the Terms of Service of any service it interacts with.
- The repository has continuous integration with Travis CI and/or another service
- The package contains a vignette
- The package contains a reasonably complete README with devtools install instructions
- The package contains unit tests
- The package only exports functions to the NAMESPACE that are intended for end users
1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
- Are there any package dependencies not on CRAN?
- Do you intend for this package to go on CRAN?
- Does the package have a CRAN accepted license?
- Did devtools::check() produce any errors or warnings? If so paste them below.
1. Please add explanations below for any exceptions to the above:

The package does not contain a vignette, but it does contain an extensive README. Since all the tokenizers work in the same basic way, a vignette seems unnecessary. But if rOpenSci wants one, I can easily adapt the README into a standalone vignette.

This package is already on CRAN.

1. If this is a resubmission following rejection, please explain the change in circumstances.

sckott commented 8 years ago

thanks!

Answer 1 · 2016-04-04T16:46:03.000Z

Thanks for the submission! Looking for reviewer

Answer 2 · 2016-04-04T16:57:40.000Z

Reviewers: @kevinushey
Due: 2016-04-25

Answer 3 · 2016-04-18T04:41:34.000Z

General Comments

Overall, everything is neat, tidy, and well put together. Nice! Most of my comments are stylistic, especially re: the C++ code included within. The near-100% test coverage is great to see indeed!

I can't comment as well as to the utility / scope of the package since I'm not as familiar with NLP, but overall the package feels quite well focused.

Tests

I ran R CMD check on a locally built R-devel (r70486) with El Capitan + Apple Clang 7.3.0. I can also confirm that tests run fine with lldb attached (ie, I saw no memory errors or anything like that reported from running the unit tests).

Comments

Please add a vignette, for potential users like me who are less familiar with NLP in general. :)

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L2: Generally, using namespace std; is considered somewhat bad practice; mainly you just want to double-check that you don't get burned with name collisions with things in std vs other symbols, and prefer just using std:: as necessary.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L21: I don't think RCPP_UNORDERED_SET and RCPP_UNORDERED_MAP should really be considered part of the public Rcpp API; since you're already opting into C++11, you can likely just use std::unordered_map and std::unordered_set.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L24: This is mostly stylistic, but if you want to pass an argument by reference, and mutate it within that function, you should consider passing it using a raw pointer. This might seem strange, but the idea is that at the call site it's much more obvious that a function parameter might get mutated. (This is a habit I picked up from https://google.github.io/styleguide/cppguide.html#Reference_Arguments)
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L53: Missing braces for else? (Or, prefer being consistent when using braces within each block within an if-else block)
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L9: If you want to support long vectors in R, you likely want to return and use R_xlen_t for indexing.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L12: Is it possible for both n and i to be large? (Ie, could the result overflow an int?) If so, you might want to use a long instead, or check for overflow explicitly. (similar: https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L21)
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/utils.R#L3: Prefer using scalar && in if conditions -- it expresses the intent slightly more clearly. (Similarly for | vs. || at e.g. https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/ngram-tokenizers.R#L66)
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/tests/testthat/data-for-tests.R: Consider just naming this e.g. helper-data.R; testthat will then automatically source that file for you before running each test unit.
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L21: Missing . at the end of description? (also https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L22).
https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/DESCRIPTION#L3: Consider a slightly more evocative title?

Answer 4 · 2016-04-20T13:49:51.000Z

@kevinushey: Thanks very much for this detailed code review. As you doubtless surmised, I'm at the very beginning of learning C++ and Rcpp, so I appreciate the clear guidance about how to improve that code. I didn't know that testthat automatically sourced helper-data.R: very useful!

I agree with all the changes listed here, and I'll make these changes soon. (Unfortunately, probably not till the first week of May.) Then I'll ping this issue again.

Answer 5 · 2016-04-21T08:43:23.000Z

@kevinushey thank you for review. Probably I can say that I'm an author of shingle_ngrams.cpp. I'm not a professional c++ developer so find your comments related to c++ part very useful. Especially comments related to exporting std namespace and passing mutable objects by raw pointer.

One thing related to RCPP_UNORDERED_MAP and RCPP_UNORDERED_SET. This came from my experience with text2vec package. I noticed that std::unordered_set can't be compiled on windows - I checked it on win builder after several bug reports. This was with old gcc 4.6.3 toolchain, pretty sure it will work with new 4.9 toolchain. But interest thing is that there were no problems with std::unordered_map!
After googling a little bit I found post on Rcpp mailing list which suggests to use RCPP_UNORDERED_SET instead of std::unordered_set.

Answer 6 · 2016-08-10T17:47:04.000Z

@lmullen Anything we can help with?

Answer 7 · 2016-08-11T14:16:36.000Z

Thanks for the ping, @sckott. I just need to get around to fixing @kevinushey's very straightforward list of changes (thanks again!). I'm a bit out of my depth with some of the C++. But let me take a stab at most of them, and I'll check in with @dselivanov on the C++ parts that he wrote.

Answer 8 · 2016-08-11T19:25:28.000Z

thanks, sounds good

Answer 9 · 2016-08-11T23:03:12.000Z

I've made fixes for most of the issues that @kevinushey pointed out. Here is a line by line accounting.

Please add a vignette, for potential users like me who are less familiar with NLP in general. :)

Added a vignette: ropensci/tokenizers@4a2ce76

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L2: Generally, using namespace std; is considered somewhat bad practice; mainly you just want to double-check that you don't get burned with name collisions with things in std vs other symbols, and prefer just using std:: as necessary.

Fixed: ropensci/tokenizers@023f801

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L53: Missing braces for else? (Or, prefer being consistent when using braces within each block within an if-else block)

Fixed: ropensci/tokenizers@b16b146

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L12: Is it possible for both n and i to be large? (Ie, could the result overflow an int?) If so, you might want to use a long instead, or check for overflow explicitly. (similar: https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L21)

I don't think we would ever overflow an int. In skip grams n and k (which becomes i) are usually in the single digits.

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/utils.R#L3: Prefer using scalar && in if conditions -- it expresses the intent slightly more clearly. (Similarly for | vs. || at e.g. https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/ngram-tokenizers.R#L66)

Fixed: ropensci/tokenizers@b7be1d4

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/tests/testthat/data-for-tests.R: Consider just naming this e.g. helper-data.R; testthat will then automatically source that file for you before running each test unit.

I didn't know testthat automatically sourced helper-data.R. Very useful! Fixed: ropensci/tokenizers@fa4b89e

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L21: Missing . at the end of description? (also https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/R/basic-tokenizers.R#L22).

Fixed: ropensci/tokenizers@efa0605

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/DESCRIPTION#L3: Consider a slightly more evocative title?

More descriptive title: ropensci/tokenizers@4c49a62

This issue @dselivanov already responded to.

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L21: I don't think RCPP_UNORDERED_SET and RCPP_UNORDERED_MAP should really be considered part of the public Rcpp API; since you're already opting into C++11, you can likely just use std::unordered_map and std::unordered_set.

That leaves these two issues. Since they deal with code that @dselivanov wrote, can you take a look at them please, Dmitriy?

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/shingle_ngrams.cpp#L24: This is mostly stylistic, but if you want to pass an argument by reference, and mutate it within that function, you should consider passing it using a raw pointer. This might seem strange, but the idea is that at the call site it's much more obvious that a function parameter might get mutated. (This is a habit I picked up from https://google.github.io/styleguide/cppguide.html#Reference_Arguments)

https://github.com/lmullen/tokenizers/blob/d240cddbb1d91146b3a30b90f6fc25abd6919edf/src/skip_ngrams.cpp#L9: If you want to support long vectors in R, you likely want to return and use R_xlen_t for indexing.

Assuming that the fixes above and Dmitriy's response to the unordered sets and maps questions is adequate, that leaves one stylistic question and one substantive question about long vectors.

Is the package ready, or do you want to wait till Dmitry looks at those two? I'm happy either way.

Answer 10 · 2016-08-12T16:09:27.000Z

Thanks @lmullen - For those 2 outstanding issues, can you open issues for those in your repo, and we can consider these approved. I think you know what to do from here :)

Answer 11 · 2016-08-12T16:24:41.000Z

@sckott: I've added them as issues.

Answer 12 · 2016-08-12T16:34:30.000Z

I transfered ownership to rOpenSci. Hope that's okay.

Answer 13 · 2016-08-12T16:36:50.000Z

@lmullen I will have a look on Sunday-Monday.

Answer 14 · 2016-08-12T17:02:57.000Z

sounds good, thanks, closing now, thanks again for submitting!

would you be interested in doing a blog post on our blog about this?

Answer 15 · 2016-08-12T18:02:27.000Z

Sure, sounds like a good idea. I will likely be pretty short. How about I send it to you by the end of next week, to give me time to get a new version on CRAN?

Answer 16 · 2016-08-12T18:03:57.000Z

Sounds good. And feel free to say no

Answer 17 · 2016-08-12T18:05:32.000Z

It would be nice if more packages relied on tokenizers, so a little
publicity might help.

On Fri, Aug 12, 2016 at 1:02 PM, Scott Chamberlain <notifications@github.com

wrote:

sounds good, thanks, closing now, thanks again for submitting!

would you be interested in doing a blog post on our blog about this?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#33 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AALNeIGY8hBmllGzCXxfMw1Ro4-wRLmvks5qfKdCgaJpZM4H-wBl
.

Lincoln Mullen
Assistant Professor, Department of History & Art History
George Mason University

Answer 18 · 2016-08-12T18:11:50.000Z

true, would get more eyes on it, and hopefully uses in pkgs