harvard-lil/capstone

Scope Dataverse dump plan

kilbergr opened this issue · 1 comments

We know we will be putting the CAP dataset into Dataverse when our contract expires in 2024. We're not sure whether that will include all file formats, and we also don't know exactly what the process would entail (Kelly Fitzpatrick did the one that's in there). Let's find out!
Here's some background information about Dataverse that we learned last year.

The plan:
Reach out to Sonia Barbosa and Ceilyn Boyd to start the process. Sign up at https://dataverse.harvard.edu/dataverseuser.xhtml?editMode=CREATE&redirectPage=%2Fdataverse.xhtml and follow these instructions: https://support.dataverse.harvard.edu/getting-started
I know that Sonia said there are certain settings we'd have to reach out about. I will likely try to get through it all with a test dataset. Is there anything I can't use?
Info to find out: if we want to split the data multiple ways, what's the best way to do that? How do we transfer bigger data (e.g. if we want to include PDFs)? What metadata do we need to develop?

Ideally, the result is a design doc/step-by-step guide for how to do this migration.

Information so far:

Dataverse can accept datasets up to 2.5 GB, but Sonia Barbosa can increase that limit on request.
Datasets can be added either individually or using this bulk upload tool, which is not maintained by Harvard but which Dataverse recommends.
To speed up uploads, we can double-zip our files. This is faster because the individual files contained in the zip are then not examined.
We can start uploading datasets and leave them in “unpublished” or “drafted” state if we can’t make them publicly available. They'll only be visible to us as the data owners while they're in that state unless we generate a private URL from the Edit menu for each dataset.
Once you publish the datasets, they'll be visible to the public, the DOI number will become active, and their private URLs will stop working.
Regarding replacement with unredacted/decrypted versions: if we make future edits to published datasets, new drafts will be created, and publishing those will replace the previous versions with those drafts.
This is all to say I see no reason why we can't start uploading in draft state once we determine our desired data upload format.
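For the double-zip step mentioned above, here is a minimal sketch of what the packaging could look like, using only the Python standard library. The function names (double_zip, fits_limit), the "_wrapped" suffix, and the directory layout are hypothetical, and reading "2.5G" as 2.5 GiB is an assumption; the actual limit and upload behavior should be confirmed with the Dataverse team before relying on this.

```python
import zipfile
from pathlib import Path

# Assumption: the "2.5G" limit from the notes above, read as 2.5 GiB.
SIZE_LIMIT_BYTES = int(2.5 * 1024**3)


def double_zip(source_dir: Path, out_dir: Path, name: str) -> Path:
    """Zip source_dir, then wrap that zip in a second, outer zip.

    out_dir should live outside source_dir so the archives being written
    are not picked up while walking the source tree.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    inner = out_dir / f"{name}.zip"
    outer = out_dir / f"{name}_wrapped.zip"  # hypothetical naming scheme

    # Inner archive: the real data files, compressed.
    with zipfile.ZipFile(inner, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir).as_posix())

    # Outer archive: holds only the inner zip, stored without
    # recompressing data that is already compressed.
    with zipfile.ZipFile(outer, "w", zipfile.ZIP_STORED) as zf:
        zf.write(inner, inner.name)

    inner.unlink()  # keep only the wrapped archive
    return outer


def fits_limit(archive: Path, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Report whether an archive is at or under the assumed size limit."""
    return archive.stat().st_size <= limit
```

Something like double_zip(Path("exports/cases"), Path("dataverse_staging"), "cases") followed by fits_limit on the result would tell us up front which archives need the raised limit. Both paths here are placeholders.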