harvard-lil/capstone

Dataverse upload practice

kilbergr opened this issue · 6 comments

Given what we’ve learned here: #2134, we're going to practice getting data into Dataverse!
The structure of what we put up there will likely depend in part on the folder structure we're currently developing, but for now we'll settle for learning more about the upload process.

This ticket entails the following work:

  • Create a demo account on demo Dataverse
  • Manually upload dummy data to demo
  • Use the Dataverse uploader (DVUploader) to upload a few dummy data files
  • Find out whether it's possible to upload directly from S3
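For the manual-upload step, Dataverse's native API can also add a file to an existing dataset with a single request. A sketch, using the demo server and the DOI from this experiment; the API token is a placeholder, and the request is echoed rather than executed here (drop the leading `echo` to actually send it):

```shell
# Sketch: add a file to an existing dataset via the Dataverse native API.
# API_TOKEN is a placeholder; the command is printed, not run.
SERVER="https://demo.dataverse.org"
API_TOKEN="XXXXXXXX"
PID="doi:10.70122/FK2/P9CA9Z"

echo curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  -F "file=@dataverse_files_dm2.zip" \
  "$SERVER/api/datasets/:persistentId/add?persistentId=$PID"
```

This is the same add-file endpoint DVUploader uses under the hood, so it's a reasonable fallback for one-off uploads.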

Created a state demo template and set it as the default. (Make sure to add custom terms in a separate panel if we keep those.) Uploaded the IL dataset zip file.

Created a DRAFT version of the single-state data. Here is a link to the Demo CAP Dataverse, Demo 1 dataset:
https://demo.dataverse.org/privateurl.xhtml?token=c00f70f8-7ced-4e1a-8989-b66e6fa7c5b9
I'll delete at the end of this experiment.

Adding content to: doi:10.70122/FK2/P9CA9Z
Using server: https://demo.dataverse.org
Request to upload: ./states
List Only Mode

***Starting to Process Upload Requests:***


PROCESSING(D): ./states
              Found as: doi:10.70122/FK2/P9CA9Z

PROCESSING(F): ./states/dataverse_files_dm2.zip
               Does not yet exist on server.

PROCESSING(F): ./states/dataverse_files_dm3.zip
               Does not yet exist on server.

PROCESSING(F): ./states/dataverse_files_dm4.zip
               Does not yet exist on server.

***Execution Complete.***
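The "List Only Mode" banner in the run above appears to correspond to DVUploader's -listonly flag, which walks the local tree and reports what would be uploaded without transferring anything. Assuming the same layout, the dry run would look like this (key is a placeholder; the command is echoed since the jar isn't available here):

```shell
# Dry run: -listonly reports matches without uploading anything.
echo java -jar DVUploader-v1.1.0.jar -key=XXXXXXXX \
  -did=doi:10.70122/FK2/P9CA9Z \
  -server=https://demo.dataverse.org -listonly ./states
```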

File layout:
cap_files/
-> states/
  -> dataverse_files_dm2.zip
  -> dataverse_files_dm3.zip
  -> dataverse_files_dm4.zip

I ran this command from the cap_files dir: java -jar DVUploader-v1.1.0.jar -key=XXXXXXXX -did=doi:10.70122/FK2/P9CA9Z -server=https://demo.dataverse.org ./states

Adding content to: doi:10.70122/FK2/P9CA9Z
Using server: https://demo.dataverse.org
Request to upload: ./states

***Starting to Process Upload Requests:***


PROCESSING(D): ./states
              Found as: doi:10.70122/FK2/P9CA9Z

PROCESSING(F): ./states/dataverse_files_dm2.zip
               Does not yet exist on server.
               UPLOADED as: MD5:781a1406d58f062ddf9cf9bbb96b47ea
CURRENT TOTAL: 1 files :973834935 bytes

PROCESSING(F): ./states/dataverse_files_dm3.zip
               Does not yet exist on server.
               UPLOADED as: MD5:781a1406d58f062ddf9cf9bbb96b47ea
CURRENT TOTAL: 2 files :1947669870 bytes

PROCESSING(F): ./states/dataverse_files_dm4.zip
ZIP(1L)

Each of the zipped dirs contained 2 files. Running the command unzipped the dirs, and the individual files inside were uploaded to Dataverse.

I had to create the dataset they were uploaded to prior to upload.
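Creating the dataset up front can also be scripted through the native API. A sketch; "capstone" is a hypothetical collection alias, dataset.json is a metadata file you'd write first, the token is a placeholder, and the request is echoed rather than executed:

```shell
# Sketch: create a dataset in a collection via the Dataverse native API.
# ALIAS is a hypothetical collection alias; dataset.json holds the metadata.
SERVER="https://demo.dataverse.org"
API_TOKEN="XXXXXXXX"
ALIAS="capstone"

echo curl -H "X-Dataverse-key:$API_TOKEN" -X POST \
  "$SERVER/api/dataverses/$ALIAS/datasets" --upload-file dataset.json
```

The response includes the new dataset's persistent ID, which is what DVUploader's -did flag needs.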

I zipped the states/ dir into a new archive inside a new folder. Now the structure is:
cap_files/
-> dbl_zip/
  -> states.zip
    -> dataverse_files_dm2.zip
    -> dataverse_files_dm3.zip

(I had to remove one file because it was over the 2.5 GB limit.)
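To catch oversize files before zipping (the limit that bit here was 2.5 GB), a quick scan of the source dir works; `./states` is the layout from above:

```shell
# List files larger than ~2.5 GB (find's +2500M means "larger than 2500 MiB")
# so they can be removed or split before building the archive.
find ./states -type f -size +2500M -print
```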

I ran this command from the cap_files dir (note addition of -recurse to account for subfolders): java -jar DVUploader-v1.1.0.jar -key=XXXXXXXX -did=doi:10.70122/FK2/P9CA9Z -server=https://demo.dataverse.org -recurse ./dbl_zip

Happy to report this created the 2 zipped dirs (dataverse_files_dm2.zip and dataverse_files_dm3.zip) on the server, rather than 4 unzipped files.
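The "UPLOADED as: MD5:..." lines in the log give a way to verify transfers; note the log reports the same MD5 for dm2 and dm3, which suggests the two dummy zips were identical copies. Checking locally, assuming the `./states` layout from above:

```shell
# Compare local checksums against the MD5s DVUploader reported on upload.
md5sum ./states/dataverse_files_dm2.zip ./states/dataverse_files_dm3.zip
```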

After some experimentation with S3, we decided to wrap up the experiment; we'll continue only if we decide that the format of the files in S3 mirrors what we'll ultimately upload to our new data homes.