
Extract bibliographic references from (High-Energy Physics) articles.

Primary language: Python. License: GNU General Public License v2.0 (GPL-2.0).

refextract

Small library for extracting references used in scholarly communication.

Originally exported from Invenio https://github.com/inveniosoftware/invenio.

Dependencies

Installation

pip install refextract

Usage

To get structured information from a single journal reference string:

>>> from refextract import extract_journal_reference
>>> reference = extract_journal_reference("J.Phys.,A39,13445")
>>> print(reference)
{
    'extra_ibids': [],
    'is_ibid': False,
    'misc_txt': u'',
    'page': u'13445',
    'title': u'J. Phys.',
    'type': 'JOURNAL',
    'volume': u'A39',
    'year': ''
}
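
As a minimal sketch of how the returned dictionary might be consumed, the helper below joins the title, volume, page and (when present) year into one display string. The function name `format_journal_reference` is hypothetical and not part of refextract; the sample dictionary simply copies the fields shown above, so the snippet runs without the library installed.

```python
def format_journal_reference(ref):
    """Build a one-line display string from a parsed journal reference.

    Hypothetical helper (not part of refextract). It assumes the keys
    'title', 'volume', 'page' and 'year' shown in the example output.
    """
    parts = [ref.get("title", ""), ref.get("volume", ""), ref.get("page", "")]
    text = " ".join(part for part in parts if part)
    year = ref.get("year", "")
    if year:
        # The year is an empty string when the input carries none.
        text += " ({})".format(year)
    return text


# Sample copied from the example output above.
sample = {
    "extra_ibids": [],
    "is_ibid": False,
    "misc_txt": "",
    "page": "13445",
    "title": "J. Phys.",
    "type": "JOURNAL",
    "volume": "A39",
    "year": "",
}

print(format_journal_reference(sample))  # J. Phys. A39 13445
```

In real use, `sample` would be replaced by the value returned from `extract_journal_reference`.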

To extract references from a publication full-text PDF:

>>> from refextract import extract_references_from_file
>>> reference = extract_references_from_file("some/fulltext/1503.07589v1.pdf")
>>> print(reference)
[
        {'author': [u'F. Englert and R. Brout'],
         'doi': [u'10.1103/PhysRevLett.13.321'],
         'journal_page': [u'321'],
         'journal_reference': ['Phys.Rev.Lett.,13,1964'],
         'journal_title': [u'Phys.Rev.Lett.'],
         'journal_volume': [u'13'],
         'journal_year': [u'1964'],
         'linemarker': [u'1'],
         'title': [u'Broken symmetry and the mass of gauge vector mesons'],
         'year': [u'1964']}, ...
]
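
In the output above, every field of an extracted reference is a list, even when it holds a single value. The sketch below shows one way to post-process such records, assuming that shape holds for all entries. Both helper names (`first_values`, `collect_dois`) are hypothetical and not part of refextract, and the second sample record is a made-up placeholder that only illustrates a reference without a DOI.

```python
def first_values(record):
    """Collapse each list-valued field to its first element.

    Hypothetical helper; fields with an empty list are dropped.
    """
    return {key: values[0] for key, values in record.items() if values}


def collect_dois(references):
    """Gather every DOI found across a list of extracted references."""
    dois = []
    for record in references:
        dois.extend(record.get("doi", []))
    return dois


# First record abridged from the example output above;
# the second is an invented placeholder with no DOI.
sample_references = [
    {
        "author": ["F. Englert and R. Brout"],
        "doi": ["10.1103/PhysRevLett.13.321"],
        "journal_title": ["Phys.Rev.Lett."],
        "journal_volume": ["13"],
        "journal_page": ["321"],
        "year": ["1964"],
    },
    {"author": ["Hypothetical entry without a DOI"], "year": []},
]

print(collect_dois(sample_references))
print(first_values(sample_references[0])["journal_title"])
```

Taking only the first element discards any additional values, so this is suitable for display purposes rather than for lossless storage.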

You can also extract directly from a URL:

>>> from refextract import extract_references_from_url
>>> reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
>>> print(reference)
[
         {'author': [u'F. Englert and R. Brout'],
          'doi': [u'10.1103/PhysRevLett.13.321'],
          'journal_page': [u'321'],
          'journal_reference': ['Phys.Rev.Lett.,13,1964'],
          'journal_title': [u'Phys.Rev.Lett.'],
          'journal_volume': [u'13'],
          'journal_year': [u'1964'],
          'linemarker': [u'1'],
          'title': [u'Broken symmetry and the mass of gauge vector mesons'],
          'year': [u'1964']}, ...
]
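
Since the file and URL entry points take the same kind of argument, a small dispatcher can choose between them. The following is only a sketch: `pick_extractor_name` and `extract_references` are hypothetical wrappers, not part of refextract, and treating only `http` and `https` sources as URLs is an assumption. The library is imported lazily, so the selection logic can be tried without refextract installed.

```python
from urllib.parse import urlparse


def pick_extractor_name(source):
    """Return the name of the refextract function suited to `source`.

    Assumption: anything with an http or https scheme is a URL,
    everything else is treated as a local file path.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "extract_references_from_url"
    return "extract_references_from_file"


def extract_references(source):
    """Hypothetical wrapper that calls the matching refextract function."""
    import refextract  # lazy import: only needed when extraction is run

    extractor = getattr(refextract, pick_extractor_name(source))
    return extractor(source)


print(pick_extractor_name("http://arxiv.org/pdf/1503.07589v1.pdf"))
print(pick_extractor_name("some/fulltext/1503.07589v1.pdf"))
```

Checking the scheme against an explicit list, rather than testing for any non-empty scheme, keeps Windows paths such as `C:\papers\x.pdf` from being mistaken for URLs.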