Convert Eprints XML to DataCite XML and mint DOIs. Only tested on Caltech repositories.
- caltech_thesis - Generate DataCite metadata and DOIs from CaltechTHESIS
- caltech_authors_tech_report - Generate DataCite metadata and DOIs from CaltechAUTHORS tech reports
- caltech_authors_to_data - Make DataCite metadata for data files in CaltechAUTHORS
You need to have Python 3.7 or greater on your machine (Miniconda is a great installation option).
Test whether you have Python installed by opening a terminal or Anaconda Prompt window and typing python -V, which should print version 3.7 or greater.
It's best to download this software using git. To install git, type
conda install git
in your terminal or anaconda prompt window.
Find where you want the epxml_to_datacite folder to live on your computer in File Explorer or Finder (this could be the Desktop or Documents folder, for example).
Type cd in Anaconda Prompt or terminal and drag the location from the file browser into the terminal window.
The path to the location will show up, so your terminal will show a command like
cd /Users/tmorrell/Desktop
Hit enter. Then type
git clone https://github.com/caltechlibrary/epxml_to_datacite.git
Once you hit enter you'll see an epxml_to_datacite folder. Type
cd epxml_to_datacite
Now that you're in the epxml_to_datacite folder, type
python setup.py install
to install dependencies.
If you're on a Mac, you'll need to authorize the underlying eputil application.
Open the epxml_to_datacite directory in Finder, open the epxml_support directory, then right-click on eputil and select 'Open'.
Agree that you authorize the executable. This is a one-time installation step.
If you will be minting DOIs, you need to use a text editor to create a file called pw that contains your DataCite password.
The username is hardcoded in the script, since non-Caltech users will have to modify the script to work with their Eprints installation.
If you don't have a text editor on your machine, type
conda install -c swc nano
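As an alternative to a text editor, the pw file could also be written from Python. This is only a sketch: the helper name write_password_file is hypothetical, and it assumes the scripts expect the bare password with no trailing newline in a file named pw in the folder you run them from, none of which is confirmed here.

```python
import os
from pathlib import Path


def write_password_file(password, path="pw"):
    """Write the password to `path` and restrict access to the current user.

    Assumption: the scripts read the file contents as the password verbatim.
    """
    pw_path = Path(path)
    pw_path.write_text(password, encoding="utf8")
    # Owner read/write only; on Windows, chmod has only a limited effect.
    os.chmod(pw_path, 0o600)
    return pw_path
```

Calling it with the result of getpass.getpass("DataCite password: ") keeps the password from being echoed in the terminal.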
When there is a new version of the software, go to the epxml_to_datacite folder in Anaconda Prompt or terminal and type git pull.
You shouldn't need to redo the installation steps unless there are major updates.
There are three different scripts:
- caltech_thesis.py
- caltech_authors_to_data.py (Prepares metadata from CaltechAUTHORS for submission to CaltechDATA)
- caltech_authors_tech_report.py (Prepares metadata from CaltechAUTHORS tech reports with monograph item type (Report or Paper))
In this documentation we use caltech_thesis.py as the example script, but in most cases you can substitute one of the other scripts.
If you have Eprints XML files (from thesis.library.caltech.edu/rest/eprint/1234.xml, for example), put them in the epxml_to_datacite folder. Type
python caltech_thesis.py
and you'll get a '_datacite.xml' file for each XML file in the folder.
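If you want to fetch an Eprints XML file yourself before running the script, the REST path mentioned above can be built from an Eprints id. The following is a minimal sketch, not part of the package: the function names are hypothetical, the https scheme is an assumption, and the download helper assumes the endpoint is reachable without a login, which may not hold for your repository.

```python
import urllib.request


def eprint_rest_url(eprint_id, host="thesis.library.caltech.edu"):
    """Build the REST URL for one Eprints record (https is an assumption)."""
    return f"https://{host}/rest/eprint/{eprint_id}.xml"


def download_eprint_xml(eprint_id, host="thesis.library.caltech.edu"):
    """Save <id>.xml in the current folder; assumes no login is required."""
    url = eprint_rest_url(eprint_id, host)
    filename = f"{eprint_id}.xml"
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        out.write(response.read())
    return filename
```

Running download_eprint_xml(1234) from inside the epxml_to_datacite folder would leave 1234.xml where the script looks for input files.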
You can use Eprints ids (e.g. 9690) to download Eprints XML files by adding the -ids option to any command.
python caltech_thesis.py -ids 9690
Alternatively, you can provide a TSV file, where the first column is the Eprints id, using the -id_file option.
python caltech_thesis.py -id_file ids.tsv
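An id file like the one used above can be produced with Python's csv module. This sketch writes one Eprints id per row in the first column; the helper name write_id_file is hypothetical, and whether the script expects a header row is not stated, so none is written here.

```python
import csv


def write_id_file(eprint_ids, path="ids.tsv"):
    """Write one Eprints id per row, in the first column of a TSV file."""
    with open(path, "w", newline="", encoding="utf8") as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        for eprint_id in eprint_ids:
            writer.writerow([eprint_id])


# The two ids used as examples elsewhere in this documentation.
write_id_file([9690, 10271])
```

Extra columns after the first could be added to each row if you want to keep notes alongside the ids, since only the first column is described as being read.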
You can also have the script submit the metadata to DataCite and add the DOI to the source repository.
Add the -mint option, and if you want to make test DOIs, also add the -test option to the command line.
python caltech_thesis.py -mint -ids 9690
caltech_authors_tech_report.py has support for alternative DOI prefixes.
By adding the -prefix option you can mint a DOI for any of the DataCite prefixes controlled by the library.
python caltech_authors_tech_report.py -prefix 10.26206 -ids 99015
Custom prefixes can also trigger metadata changes. For example, the publisher for prefix 10.26206 is the Keck Institute for Space Studies.
You can also import the metadata transformation function into another Python script by including from caltech_thesis import epxml_to_datacite at the top of your new script.
Then you will be able to call epxml_to_datacite(eprint), where eprint is an XML file parsed by something like:
import xmltodict
from caltech_thesis import epxml_to_datacite

with open('10271.xml', encoding="utf8") as infile:
    eprint = xmltodict.parse(infile.read())['eprints']['eprint']
datacite = epxml_to_datacite(eprint)