ropensci/software-review

gutenbergr -- download and process public domain works from Project Gutenberg

dgrtwo opened this issue Β· 13 comments

    1. gutenbergr offers tools to download and process public domain works in the Project Gutenberg collection. Includes metadata for all Project Gutenberg works, so that they can be searched and retrieved.
    2. Paste the full DESCRIPTION file inside a code block below.
Package: gutenbergr
Type: Package
Title: Download and Process Public Domain Works from Project Gutenberg
Version: 0.1
Date: 2016-05-02
Authors@R: person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre"))
Description: Download and process public domain works in the Project
    Gutenberg collection. Includes metadata for all Project Gutenberg works,
    so that they can be searched and retrieved.
License: MIT + file LICENSE
LazyData: TRUE
Maintainer: David Robinson <admiral.david@gmail.com>
VignetteBuilder: knitr
Depends: R (>= 2.10)
Imports:
    dplyr,
    readr,
    purrr,
    urltools,
    rvest,
    xml2,
    stringr
RoxygenNote: 5.0.1
Suggests: knitr,
    rmarkdown,
    testthat,
    tidytext,
    ggplot2,
    tidyr
    3. URL for the package (the development repository, not a stylized html page)

https://github.com/dgrtwo/gutenbergr

    4. What data source(s) does it work with (if applicable)?

Project Gutenberg

    5. Who is the target audience?

People interested in text mining and analysis. The package is especially interesting to those analyzing historical and literary works, but the texts can also serve as example datasets for text mining problems.

    6. Are there other R packages that accomplish the same thing? If so, what is different about yours?

No.

    7. Check the box next to each policy below, confirming that you agree. These are mandatory.
    • [x] This package does not violate the Terms of Service of any service it interacts with.
    • [x] The repository has continuous integration with Travis CI and/or another service
    • [x] The package contains a vignette
    • [x] The package contains a reasonably complete README with devtools install instructions
    • [x] The package contains unit tests
    • [x] The package only exports functions to the NAMESPACE that are intended for end users
    8. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
    • Are there any package dependencies not on CRAN?
    • [x] Do you intend for this package to go on CRAN?
    • [x] Does the package have a CRAN accepted license?
    • Did devtools::check() produce any errors or warnings? If so paste them below.
    9. Please add explanations below for any exceptions to the above:
    10. If this is a resubmission following rejection, please explain the change in circumstances.

Thanks for your submission @dgrtwo - seeking reviewers now

Reviewers: @masalmon
Due date: 2016-05-24

General comments

This package offers access to a playground of data for text mining, which is awesome, and the outputs are data.frames that one can use with dplyr, tidytext, etc., which makes the package even cooler. Nice job! 👍 😸

The package works as expected on my Windows PC and provides tools for filtering works and then downloading them. The examples and vignette give a good overview of the functionality. The package is already very good, so I feel very picky writing comments!

R CHECK

When running R CMD check I get a note, "found 13591 marked UTF-8 strings", but this is already documented in cran-comments.md.

spell check

Using devtools::spell_check() I found no spelling issues.

Continuous integration

Why not add AppVeyor CI too?

Unit tests

  • I would recommend using codecov.io via covr; it's quite nice to be able to see which parts of the code are tested. I ran covr::package_coverage() and the results were
gutenbergr Coverage: 89.47%
R/gutenberg_download.R: 86.05%
R/gutenberg_works.R: 100.00%
R/utils.R: 100.00%

so why not add a badge to show this?

  • Maybe you could add tests for the data.frames, testing whether they are data.frames, the class of columns, etc.

data-raw folder, data preparation

In the readme of this folder you mention "This set of scripts produces the three .rda files in the data directory. (...) There are no guarantees about how often the metadata will be updated in the package."

  • Do you know how often Project Gutenberg itself updates its metadata?
  • Would there be a way to automatically run the scripts every, say, week?
  • And could you state more explicitly in the description and vignette that the available metadata in the package is "all metadata up to the "? (I see you do so in the readme.) You might also want to write a bit about it near the installation instructions of the readme (instead of at the bottom), because this makes the GitHub version more attractive if it has the most recent metadata.

Functionality

  • Why export read_zip_url?
  • Maybe the gutenberg_download function could have a verbose argument, so that when it's true you print the selected mirror? You have a message saying the mirror is being determined so a message saying which mirror has been selected could be nice.
  • I am not sure whether it's intuitive to write gutenberg_works(author == "Austen, Jane") instead of gutenberg_works(author = "Austen, Jane"). Moreover gutenberg_works(grepl("Austen", author)) works well. Maybe the user could input anything and a function would do the regexp work for them? E.g. the user would write either gutenberg_works(author = "Austen, Jane") or gutenberg_works(author = "Austen"). I have the same questions for other arguments of gutenberg_works. They really are questions, I have no opinion.
  • For language the input format is "Language code, separated by / if multiple". It might be more user-friendly to have language as a character vector and then do the reformatting with paste inside the function? Moreover, the current format gives different results when looking for languages = "fr/en" versus languages = "en/fr". Having c("fr", "en") would be easier. Having "language1 OR language2" might involve two calls to your function, but I guess it's ok: you provide basic filters with the function and for the rest the users can use dplyr themselves, as you mention in the examples. Maybe in the vignette you could state more clearly what kind of queries can be done with the shortcuts and what kind cannot?
  • In gutenberg_works.R and gutenberg_download.R you write "utils::data("gutenberg_metadata", package = "gutenbergr", envir = environment())". Couldn't you have globalVariables("gutenberg_metadata") in e.g. utils.R or a globalVariables.R file instead? Then I think you could use gutenberg_metadata without further effort.
  • Why do you use filter instead of filter_? Please note I have no idea whether it would make a difference here.

Documentation

  • When describing arguments or the columns of the output data.frame, always/never put a period after the definition (this was something Julia Silge asked me to do in the opencage review). Also you could mention the class of the argument, e.g. logical, although one sees it in the examples.
  • In gutenberg_works.Rd there is a missing period in "Get a table of Gutenberg work metadata that has been filtered by some common (settable) defaults, along with the option to add additional filters"
  • Could you add an example for gutenberg_strip in the Rd? Or should it actually be exported?
  • What do you mean by "For space only metadata from people that have been the single author of a work (not multiple authors, contributors, etc) are included."?
  • When mentioning tbl_df maybe you should mention this is a data.frame (dplyr tbl_df)?
  • If the language code is an ISO code, could you mention which ISO nomenclature it follows and put a link to the corresponding Wikipedia page?
  • In the Rd of gutenberg_metadata you write "note that gutenberg_works() function is a shortcut to some of the above": what part of the example could not be done directly with gutenberg_works?
  • In gutenberg_subjects you mention "Library of Congress Classifications (lcc) and Library of Congress Subject Headings (lcsh)". Could you provide links to these classifications in the Rd? I see you have them in the vignette.
  • In "gutenberg_downloads.Rd" you write "load dplyr first to display tbl_dfs appropriately" but one needs dplyr for using %>% too.

Vignette

  • Should the title be title case?
  • I think it'd be nicer to have message and warning chunk options = FALSE.
  • At the beginning of the readme you've got a useful "includes" part. You could start the vignette with the same small list. Then I'd find it more logical to start with the functions for searching metadata and then to explain how to download texts, because this would probably be the workflow of users: first selecting IDs based on metadata, and then downloading.
  • In the vignette and in the readme you show data.frames by printing them. Why not use knitr::kable() in combination with head()? Then you would no longer have "Variables not shown".
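For instance, a minimal sketch of what I mean (using the bundled gutenberg_metadata dataset):

```r
library(gutenbergr)

# render only the first rows as a markdown table instead of printing the tbl_df
knitr::kable(head(gutenberg_metadata))
```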

Links to other packages

At the end of the vignette/readme you mention tidytext. You could add links to other tools:

  • The CRAN task view Natural Language Processing
  • The gutenberg_authors data.frame contains a wikipedia column; maybe you could mention the possible use of the wikipediatrends package to assess an author's popularity?
  • Regarding the columns with the name of the author, you could mention the humanparser package and the rOpenSci gender package as tools to play with the author names (and dates of birth).
  • Maybe you know a good link for people who would use your package and run into encoding issues? This is a very vague question, but I wondered whether this might be a problem depending on the locale and put some people off if they don't have experience with it.

I sincerely thank Maëlle both for the helpful comments and the kind words about the gutenbergr package - I've learned a lot from this advice! I'm also grateful it was received so quickly! 👍 👍

It's worth noting I made one major change not in response to feedback: switching the package license from MIT to GPL-2. (I realized the Project Gutenberg catalog is released under the GPL, so I should distribute it only under a GPL-compatible license.) All this is in the NEWS.md file.

Continuous Integration

  • I've added AppVeyor CI

Unit tests

  • I've added covr. (For some reason it shows high coverage when I run it locally but only 42% coverage on the badge; I've asked about it on Twitter, but it shouldn't hold us up.)
  • I've added tests for the class of the returned tbl_df and its columns in test-download.R
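For illustration, the new tests look roughly like this (a sketch with hypothetical expectations; the real ones are in tests/testthat/test-download.R):

```r
library(testthat)
library(gutenbergr)

test_that("gutenberg_download returns a tbl_df with the expected columns", {
  result <- gutenberg_download(768)  # Wuthering Heights
  expect_s3_class(result, "tbl_df")
  expect_true(all(c("gutenberg_id", "text") %in% colnames(result)))
  expect_type(result$text, "character")
})
```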

data-raw folder, data preparation

Do you know how often Project Gutenberg itself updates its metadata?

I don't know how often the metadata of old books changes (I would guess quite rarely, and very rarely for classic books). I do know that books are added to the collection every day, along with their metadata; for example, as of this writing, about 23 books were published yesterday, May 3. While there is a feed for books added yesterday, there is no such feed for additions over a longer period or for edits, nor does it contain the full metadata I would need for an update.

Would there be a way to automatically run the scripts every, say, week?

The problem is that the scripts take about 30 minutes to run (on my machine) and require downloading a ~60 MB file and unpacking it to ~750 MB. This isn't a small ask in resources, so I can't think of a system I could use to automate it.

The main advantage of regular updates would be including the new e-books released each day. The reason I don't see this as a priority is that I think the vast majority of analyses will be of a small fraction of very popular books, most of which were added long ago. Just as any dataset is incomplete to some degree, I'm OK treating this one as "frozen" between updates every few months. Of course, I'm open to suggestions for easy automation!

And could you state more explicitly in the description and vignette that the available metadata in the package is "all metadata up to the "? (I see you do so in the readme.) You might also want to write a bit about it near the installation instructions of the readme (instead of at the bottom), because this makes the GitHub version more attractive if it has the most recent metadata.

Great point about making the dataset's vintage explicit: I've added an attribute to each dataset, like attr(gutenberg_metadata, "date_updated"), that contains the date the data was updated. This lets me include that information in the README and vignette automatically, e.g. with text like:

As of now, these were generated from the Project Gutenberg catalog on `r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y')`.

This fact is also noted in data-raw/README.md and in the documentation of each metadata object.
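For illustration, reading the attribute back is a one-liner (the actual value depends on when the data was regenerated):

```r
library(gutenbergr)

# the catalog snapshot date stored with the bundled dataset
attr(gutenberg_metadata, "date_updated")
```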

I'm not sure about mentioning it near the installation instructions, however, simply because I don't know how much more often the dataset will be updated on GitHub than on CRAN. I'm sure I'll update it before each CRAN release.

Functionality

Why export read_zip_url?

read_zip_url isn't exported. It does have an .Rd file, but that's to help with internal package development.

Maybe the gutenberg_download function could have a verbose argument, so that when it's true you print the selected mirror? You have a message saying the mirror is being determined so a message saying which mirror has been selected could be nice.

A verbose argument has been added, which controls both the existing "Determining mirror" message and a new "Using mirror" message.
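A sketch of the resulting calls (assuming verbose defaults to TRUE):

```r
library(gutenbergr)

# default: messages about determining and using a mirror are printed
books <- gutenberg_download(c(768, 1260))

# silenced:
books <- gutenberg_download(c(768, 1260), verbose = FALSE)
```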

I am not sure whether it's intuitive to write gutenberg_works(author == "Austen, Jane") instead of gutenberg_works(author = "Austen, Jane"). Moreover gutenberg_works(grepl("Austen", author)) works well. Maybe the user could input anything and a function would do the regexp work for them? E.g. the user would write either gutenberg_works(author = "Austen, Jane") or gutenberg_works(author = "Austen"). I have the same questions for other arguments of gutenberg_works. They really are questions, I have no opinion.

This was a hard UI decision for me as well. The main issue is that I noticed I was doing

gutenberg_works() %>%
  filter(author == "Austen, Jane")

on almost every call, which is why I allowed filtering conditions to be passed in directly.

I considered having gutenberg_works(author = "Austen") behave as a regular expression match, but I really wanted to keep the ability to pass any arbitrary expression (for example, author %in% c('Dickens, Charles', 'Austen, Jane') or birthdate < 1800). At that point I was concerned I was listing too many special cases in the documentation (it does this with named arguments and that with unnamed arguments, and so on).

To help with this ambiguity, I have made two adjustments:

  • I added an example with str_detect(author, "Austen") to the vignette (str_detect examples already occur in most of the documentation)
  • I set up gutenberg_works to give the following informative error when given a named argument other than the defaults: Use == expressions, not named arguments, as extra arguments to gutenberg_works. E.g. use gutenberg_works(author == 'Dickens, Charles') not gutenberg_works(author = 'Dickens, Charles').
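As a quick illustration of the intended usage (bare filtering expressions rather than named arguments):

```r
library(gutenbergr)

# supported: arbitrary filtering expressions
gutenberg_works(author == "Austen, Jane")
gutenberg_works(author %in% c("Dickens, Charles", "Austen, Jane"))

# now caught with the informative error above:
# gutenberg_works(author = "Austen, Jane")
```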

For language the input format is "Language code, separated by / if multiple". It might be more user-friendly to have language as a character vector and then do the reformatting with paste inside the function? Moreover, the current format gives different results when looking for languages = "fr/en" versus languages = "en/fr". Having c("fr", "en") would be easier. Having "language1 OR language2" might involve two calls to your function, but I guess it's ok: you provide basic filters with the function and for the rest the users can use dplyr themselves, as you mention in the examples. Maybe in the vignette you could state more clearly what kind of queries can be done with the shortcuts and what kind cannot?

Great point. I have made two changes:

  • The order of the languages is now always alphabetized within each field, so that filtering for "en/fr" returns the French + English results and "fr/en" returns nothing.

  • I've added two arguments: all_languages and only_languages. These allow the user much greater control of language filtering. For example, the user can request works in either English or French (or both) with:

    gutenberg_works(languages = c("en", "fr"))

and request works in both English and French with:

    gutenberg_works(languages = c("en", "fr"), all_languages = TRUE)

The examples and tests give the full details of the behavior. It's possible that there are corner cases (e.g. either English or French, but not both) that will require additional, minimal processing on the part of the user.
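As a sketch of how one such corner case could be handled with the shortcut plus a dplyr post-filter (assuming combined editions use a slash-separated language field like "en/fr"):

```r
library(gutenbergr)
library(dplyr)

# works in English or French, but not bilingual editions
either_not_both <- gutenberg_works(languages = c("en", "fr")) %>%
  filter(language %in% c("en", "fr"))  # drops combined "en/fr" entries
```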

In gutenberg_works.R and gutenberg_download.R you write "utils::data("gutenberg_metadata", package = "gutenbergr", envir = environment())". Couldn't you have globalVariables("gutenberg_metadata") in e.g. utils.R or a globalVariables.R file instead? Then I think you could use gutenberg_metadata without further effort.

This is true, but I'm trying to follow the "Good practice" section of ?data, and in any case I think utils::data("gutenberg_metadata", package = "gutenbergr", envir = environment()) is especially explicit about where the data comes from.
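For reference, a minimal sketch of the two approaches under discussion (the function name here is hypothetical):

```r
# approach used in the package: load the dataset explicitly into the
# function's own environment, per the "Good practice" section of ?data
lookup_metadata <- function() {
  utils::data("gutenberg_metadata", package = "gutenbergr",
              envir = environment())
  gutenberg_metadata
}

# suggested alternative: declare the symbol once (e.g. in a globals.R)
# and rely on lazy-loaded data
# utils::globalVariables("gutenberg_metadata")
```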

Why do you use filter instead of filter_? Please note I have no idea whether it would make a difference here.

filter_ would require putting the filter condition inside quote(), e.g. filter_(ret, quote(language %in% languages)). The only reason I use distinct_ later in the function is that it is so similar and saves me from putting those column names in globals.R - that may be misleading, but I think it's a small decision.
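For reference, a toy comparison of the two styles (dplyr's underscored SE verbs were current at the time and have since been deprecated):

```r
library(dplyr)

ret <- tibble(language = c("en", "fr", "en/fr"))
languages <- c("en", "fr")

filter(ret, language %in% languages)          # NSE style used in the package
filter_(ret, quote(language %in% languages))  # SE equivalent
```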

Documentation

When describing arguments or the columns of the output data.frame, always/never put a period after the definition (this was something Julia Silge asked me to do in the opencage review). Also you could mention the class of the argument, e.g. logical, although one sees it in the examples.

I've removed all periods at the end of these lines. I generally shy away from stating the class of every argument, since I think it clutters the documentation, and anyone would look at the class before doing anything interesting.

In gutenberg_works.Rd there is a missing period in "Get a table of Gutenberg work metadata that has been filtered by some common (settable) defaults, along with the option to add additional filters"

Fixed

Could you add an example for gutenberg_strip in the Rd? Or should it actually be exported?

I have added an example. The main reason I export it is to be especially transparent about how text stripping is done. This also helps users who want to use strip = FALSE, check or preprocess some of the texts, and then apply gutenberg_strip to the others.
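A sketch of that workflow (assuming gutenberg_strip operates on the text column as a character vector):

```r
library(gutenbergr)

# keep the Project Gutenberg header/footer for inspection...
raw <- gutenberg_download(768, strip = FALSE)

# ...then strip them manually once satisfied
raw$text <- gutenberg_strip(raw$text)
```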

What do you mean by "For space only metadata from people that have been the single author of a work (not multiple authors, contributors, etc) are included."?

I've now clarified this: "Although the Project Gutenberg raw data also includes metadata on contributors, editors, illustrators, etc., this dataset contains only people who have been the single author of at least one work."

When mentioning tbl_df maybe you should mention this is a data.frame (dplyr tbl_df)?

Agreed, each time I mention tbl_df I have now changed it to tbl_df (see tibble or dplyr packages)

If the language code is an ISO code, could you mention which ISO nomenclature it follows and put a link to the corresponding Wikipedia page?

Done: added info and link to ISO-639.

In the Rd of gutenberg_metadata you write "note that gutenberg_works() function is a shortcut to some of the above": what part of the example could not be done directly with gutenberg_works?

I've clarified this to:

#' # note that the gutenberg_works() function filters for English
#' # non-copyrighted works and does de-duplication by default:

In gutenberg_subjects you mention "Library of Congress Classifications (lcc) and Library of Congress Subject Headings (lcsh)". Could you provide links to these classifications in the Rd? I see you have them in the vignette.

Done, added to Details in the gutenberg_subjects Rd.

In "gutenberg_downloads.Rd" you write "load dplyr first to display tbl_dfs appropriately" but one needs dplyr for using %>% too.

The comment has been removed.

Vignette

Should the title be title case?

Not in the examples I've seen; e.g. none of the dplyr vignettes, or the vignettes of rOpenSci packages I've looked at.

I think it'd be nicer to have message and warning chunk options = FALSE.

Absolutely, fixed

At the beginning of the readme you've got a useful "includes" part. You could start the vignette with the same small list. Then I'd find it more logical to start with the functions for searching metadata and then to explain how to download texts, because this would probably be the workflow of users: first selecting IDs based on metadata, and then downloading.

Done: I added the includes section and switched the order; I agree that it makes sense. The exception is that I left the gutenberg_subjects and gutenberg_authors instructions after gutenberg_download, because I wanted to start with the two most universally applicable needs, and I suspect most users won't need author and subject information.

In the vignette and in the readme you show data.frames by printing them. Why not use knitr::kable() in combination with head()? Then you would no longer have "Variables not shown".

I considered this, but some of the tables end up being so wide that they don't look good in HTML output anyway. Furthermore, almost every line would require knitr::kable, and if users aren't familiar with the kable function they may think it's necessary that they use it as well, or at the very least get distracted from the package functions.

Links to other packages

At the end of the vignette/readme you mention tidytext. You could add links to other tools: ...

I've added links to the CRAN task view and the wikitrends/WikipediR packages to the README and the vignette. One issue with humanparser is that the columns are in "Lastname, Firstname" order, which in my understanding it can't necessarily handle. (I didn't include gender for a similar reason.)

Maybe you know a good link for people that would use your package and get encoding issues? This is a very vague question but I wondered whether this might be a problem depending on the locale and put some people off if they do not have experience with this.

I think this is a general text analysis question that falls outside the scope of an API package like this. What encoding issues they'd run into would depend on what they were doing :-)

Well, I've learnt a lot too! 👌

This all looks good to me. 👍 I have been watching your repo and find all the changes well done. The warning for = instead of == seems useful! Thanks for all your answers; I now agree with everything. ☺

As regards names, it's clearly not important, but I saw today that this parser can reverse names: https://github.com/Ironholds/humaniformat/blob/master/vignettes/Introduction.Rmd

For covr, I saw that Jim will have a look at it, so I'm confident your badge will soon be green!

Having a quick look over this now...

@dgrtwo Looks great. Just a minor thing:

gutenberg_get_mirror() throws a warning due to xml2, at this line https://github.com/dgrtwo/gutenbergr/blob/master/R/gutenberg_download.R#L213

Warning message:
In node_find_one(x$node, x$doc, xpath = xpath, nsMap = ns) :
  101 matches for .//a: using first

Wonder if it's worth a suppressWarnings() there?
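Something like this minimal sketch, assuming the warning comes from an xml2 lookup that matches many <a> nodes (URL hypothetical; xml_find_one was later renamed xml_find_first):

```r
library(xml2)

# the mirror-list page has many links; xml_find_one() warns and uses the first
doc <- read_html("https://www.gutenberg.org/MIRRORS.ALL")
first_link <- suppressWarnings(xml_find_one(doc, ".//a"))
```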

@sckott Done!

I'm ready for any next steps.

Great!

  • Add the footer to your README:
[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)
  • Could you add URL and BugReports entries to DESCRIPTION, so people know where to get sources and report bugs/issues
  • Update installation of dev versions to ropenscilabs/gutenbergr and any URLs for the GitHub repo to ropenscilabs instead of dgrtwo
  • Go to the Repo Settings --> Transfer Ownership and transfer to ropenscilabs - note that all our newer pkgs go to ropenscilabs first; when more mature, we'll move them to ropensci

Done!

(Just needs Travis + AppVeyor activated for its new repo I suppose)

Once it's settled I'm planning to submit 0.1.1 to CRAN (it's currently on CRAN as 0.1) including the new URLs.

Nice, builds are on at Travis: https://travis-ci.org/ropenscilabs/gutenbergr/ - you can keep AppVeyor builds under your acct, or I can start one on mine; let me know

@sckott Happy to transfer to you/ropensci! That's where the badge is currently pointing

Updated the badge link and started an AppVeyor account with ropenscilabs as the account name - sent a PR - though the build is failing, something about getting the current Gutenberg URL: https://ci.appveyor.com/project/sckott/gutenbergr/build/1.0.1#L650