GL-Li/totalcensus

Verifying completeness after download_census()

sheffe opened this issue · 2 comments

I've just noticed that download_census() can fail and require a restart, generally after a 503 Service Unavailable response. Restarting works, but the download begins again from the start of the query, which is inefficient at the scale totalcensus enables. (Half a terabyte and counting!)

Is there any way to verify the completeness of the downloaded records? I can't find anything directly in the package, and I'm not even sure that the file structure of the Census enables a programmatic check.
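To make the question concrete, here is a rough sketch of the kind of check I have in mind: compare a list of expected files against what is actually on disk. find_missing_files() and the file names are hypothetical, not part of totalcensus, and the real directory layout would have to be substituted. It only catches absent or zero-byte files; a truncated download would need a known size or checksum to detect.

```r
# Hypothetical helper, not part of totalcensus: report which expected files
# are absent or empty under a data directory.
find_missing_files <- function(data_dir, expected_files) {
  paths <- file.path(data_dir, expected_files)
  sizes <- file.size(paths)              # NA for files that do not exist
  incomplete <- is.na(sizes) | sizes == 0
  expected_files[incomplete]
}

# Example with a temporary directory and made-up file names.
demo_dir <- tempfile("census_demo_")
dir.create(demo_dir)
writeLines("some data", file.path(demo_dir, "state_a.csv"))
file.create(file.path(demo_dir, "state_b.csv"))  # zero-byte file, e.g. an interrupted download

missing_files <- find_missing_files(
  demo_dir,
  c("state_a.csv", "state_b.csv", "state_c.csv")
)
print(missing_files)
```

Running a check like this before a batch job starts would let a pipeline fail early instead of stalling partway through.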

(A related question: would you accept a PR adding a Sys.sleep() of a user-specified duration between download calls? As it stands, it's possible to hit the download server quite hard at a time when interest in the 2017 release is likely to cause high baseline traffic.)
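A minimal sketch of what such a PR might look like, assuming the downloads go through utils::download.file(). download_politely() and its arguments are hypothetical names, not existing totalcensus functions. The fetch argument exists only so the function can be exercised without a network connection.

```r
# Hypothetical wrapper, not part of totalcensus: download files one at a time,
# pausing between requests and retrying a failed request a limited number of times.
download_politely <- function(urls, destfiles, pause_seconds = 1,
                              max_tries = 3, fetch = utils::download.file) {
  stopifnot(length(urls) == length(destfiles))
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    for (attempt in seq_len(max_tries)) {
      status <- tryCatch(
        fetch(urls[i], destfiles[i], mode = "wb", quiet = TRUE),
        error = function(e) 1L         # treat any error (e.g. HTTP 503) as a failure
      )
      if (identical(as.integer(status), 0L)) {
        ok[i] <- TRUE
        break
      }
      Sys.sleep(pause_seconds)         # wait before retrying the same file
    }
    if (i < length(urls)) Sys.sleep(pause_seconds)  # pause between files
  }
  ok                                   # TRUE for each file that was fetched
}

# Usage (placeholder URLs, not real Census endpoints):
# done <- download_politely(c("https://example.org/a.zip"), c("a.zip"),
#                           pause_seconds = 5)
```

The returned logical vector shows which files succeeded, so a restart can be limited to the failures rather than the whole query.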

GL-Li commented

You are welcome to submit PRs to improve the package.

I am thinking of turning download_census() into an internal function, as the downloading can be done automatically in the read_xxxx() functions. For example, if you want to download all 2017 ACS 5-year data (just added in the new version), simply run read_acs5year(2017, c(states_DC, "US", "PR")). You will be asked whether to download data for any states that are not yet on your computer. If the connection drops partway through, you can resume by running read_acs5year(2017, c(states_DC, "US", "PR")) again, and only the states not yet downloaded will be fetched. The download_census() function in the old version can do this kind of check, but I am not sure how useful it is.

sheffe commented

That's an interesting design change. I think it could make the package easier to use for newcomers, with one hitch. I'm likely to be a weird user: I use totalcensus to pull large batches of data (often taking many hours) and build it into data pipelines for separate projects. If the package waits until a read request to discover missing data and then prompts for a download, it becomes harder to use outside interactive sessions, which is why I was thinking about a pre-verification step that confirms all required files are present. Perhaps it would be possible to add an argument meaning "download any file I ask for automatically" (defaulting to FALSE) when converting download_census() to an internal function?
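A rough sketch of how such an argument might behave, under the assumption that the read functions already know which files are missing. resolve_missing(), auto_download, and downloader are hypothetical names for illustration, not the totalcensus API.

```r
# Hypothetical sketch, not the totalcensus API: decide what to do about
# missing files without prompting when auto_download is TRUE, and fail
# loudly instead of prompting in a non-interactive session.
resolve_missing <- function(missing_files, auto_download = FALSE,
                            downloader = function(f) message("downloading ", f)) {
  if (length(missing_files) == 0) return(invisible(character(0)))
  if (!auto_download) {
    if (!interactive()) {
      stop("Missing files: ", paste(missing_files, collapse = ", "),
           ". Set auto_download = TRUE to fetch them.")
    }
    answer <- readline("Download missing files? [y/N] ")
    if (!tolower(answer) %in% c("y", "yes")) return(invisible(character(0)))
  }
  for (f in missing_files) downloader(f)   # fetch each missing file in turn
  invisible(missing_files)                 # report what was requested
}

# In a pipeline: no prompt, every missing file is passed to the downloader.
resolve_missing(c("state_a", "state_b"), auto_download = TRUE)
```

With the default of FALSE, interactive users keep the confirmation prompt, while a batch job either opts in explicitly or gets an immediate error listing what is missing.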