NEONScience/NEON-utilities

stackEddy fails when folder contains .gz files

rfiorella opened this issue · 14 comments

Function
stackEddy

Describe the bug
NEON has started gzipping the EC files available through the API - if a user downloads a zip file from the API using zipsByProduct, for example, stackEddy will crash if the unzipped archive contains .h5.gz files instead of .h5 files.

To Reproduce
zipsByProduct("DP4.00200.001", site = "ONAQ",
startdate = "2019-01-01",
enddate = "2019-12-31",
savepath = "~/Desktop/NEON/",
check.size = FALSE)

lhflux <- stackEddy("~/Desktop/NEON/filesToStack00200", avg = 30)

Expected behavior
stackEddy to look through the files and bind all of the dp04 data to a list.

System (please complete the following information):

  • OS: OSX 10.15.7
  • OS Version: 10.15.7
  • R Version: 4.0.2

Additional context
I've already developed a candidate fix that's available on my fork of NEON-utilities, under the gzip-stackeddy-fix - feel free to open a PR. I tried to open one, but since the feature branch doesn't exist in the NEONScience repo, I only open a PR to the master branch, which I would guess would not be the preferred approach initially.

The fix I've uploaded will likely require moving R.utils from suggests to imports in DESCRIPTION, and corresponding updates to NAMESPACE

@rfiorella Thanks for the alert about this! It should be an easy fix, thanks for sharing the code on your fork. There are some other pending changes to stackEddy() in the stack-from-store branch (which isn't fully functional yet), so I just need to decide the best way to integrate this fix. I'll let you know.

No problem @cklunch! There's an additional issue that also comes up that's a little more difficult to fix from the user side: current zip files will often contain multiple versions of EC data for a given site and month (e.g., HARV in 2019 has 3 versions of the monthly basic file with different processing dates wrapped up in the zip file).

I think it would be pretty straightforward to look for 'duplicates' here, (optionally?) have stackEddy gzip the old ones, and then stack the most recent version of each site month. I'm happy to take a crack at this but it might be a little bit before I can get to it.

@rfiorella Yeah, the duplicate files are a known issue, the cyberinfrastructure team is working on fixing it. So in theory, it will be fixed and become a non-issue soon, but you make a good point that it could also be handled pretty easily in the stacking code. I'll take a look at doing that along with the other code changes.

I found the similar error while using "stackEddy()" function.
Uploading image.png…

@rfiorella @libonancaesar
Fix is in on the stack-from-store branch. Other functions aren't ready to go on that branch, but the stackEddy() function there is handling the .gz files and the duplicate files. I'm hoping to be able to merge the branch in later this week.

cc @cflorian1

@cklunch Great, thanks! I'll checkout the stack-from-store branch and test it out.

@rfiorella There was a bug in the duplicate file handling, now fixed.

@cklunch can confirm that stackEddy now runs in the cases where it did not previously, and only stacks one file per month. however, it seems to be picking the oldest file instead of the newest where there are duplicates - based on your comments yesterday though, not sure this is worth fixing if CI is already working on a fix for the duplicates.

EDIT, 10/30/20: My error! Not entirely sure how, but the new code hadn't been pulled into my clone of NEON-utilities. Re fetched, merged, and rebuilt - and it works just fine now.

@rfiorella @cklunch

I'll know what the CI plan is for fixing the duplicate files after a 1pm meeting today, I can report back in a couple hours.

@rfiorella @cklunch

Sounds like this is a bug, CI had a working solution when it was tested at the start of reprocessing. I hope it will be an easy fix but don't have any information about timeline yet.

@rfiorella @cflorian1 These changes are now merged to master. I have 3 more changes to make to the package before I send to CRAN, but they're fairly simple, hoping to knock those out in the next day or two.

@cklunch @rfiorella Great, thanks! FYI this should be resolved on the data portal side pretty soon.

Turns out the h5.gz extension broke the script that removes old versions of a file from the portal because it changed the position of the timestamp in the file name.

@rfiorella @cflorian1 v1.3.8 is at CRAN now.