Conversion of PDF with neutron scattering length and cross section data to CSV format.

Neutron_PDF_to_CSV

Purpose: convert a PDF of neutron scattering lengths and cross sections to CSV format.

About

A comma-separated values (CSV) file is a text file that usually uses a comma to separate each value [1]. CSV files are often used to store and tabulate data.

The PDF in question is a 10-page, 110 kB file provided by the Vienna University of Technology [2] (an English version of the page is available at [3]). The webpage hosting this file was last updated on 02/14/2001 [2].

The majority of the column headers are described in the first paragraph of [2] and in Table 1 in [5], but to reiterate:

ZSymbA: nuclide charge number Z, element symbol Symb, mass number A
p or T_{1/2}: natural abundance (percent) or half-life (MIN: minutes, Y: years)
I: nuclear spin
b_{c}: bound coherent scattering length (fm; 1 femtometer = 1e-15 m)
b_{+}: spin-dependent scattering length for I + 1/2 (fm)
b_{-}: spin-dependent scattering length for I - 1/2 (fm)
c: ?? (if you know this, please contact me.)
sigma_{coh}: coherent cross-section (barns; 1 barn = 1e-24 cm^2)
sigma_{inc}: incoherent cross-section (barns)
sigma_{scatt}: scattering cross-section (barns)
sigma_{abs}: absorption cross-section (barns)

The purpose of converting this PDF to CSV format is to obtain the exact information contained in the PDF in a more machine-readable form. The quickest way to perform the conversion was to look for existing Python packages that already had this capability. The first package that came up was tabula (distributed as tabula-py [4]). tabula has two methods that were relevant for this task: read_pdf and convert_into [4]. Converting the PDF file into CSV took two lines of code: the first imports tabula, and the second calls convert_into.
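The two-line conversion described above might look roughly like the following sketch. The wrapper function name `convert_pdf_to_csv` and the file names in the usage note are hypothetical, since the original script is not reproduced here; only `tabula.convert_into` and its `output_format` and `pages` parameters come from tabula-py.

```python
def convert_pdf_to_csv(pdf_path, csv_path, pages="1"):
    """Write the tables found on `pages` of `pdf_path` to `csv_path`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Imported inside the function because tabula-py needs a Java runtime;
    # this keeps the module importable on machines without one.
    import tabula

    # pages="1" converts only the first page; pages="all" converts every page.
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
```

A call such as `convert_pdf_to_csv("neutron_data.pdf", "neutron_data.csv", pages="1")` (placeholder file names) would write the first page only, which is convenient for a trial run before converting the whole document.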

Initially, the pages parameter of convert_into was set to "1" to test the efficacy of the method. The output CSV was confirmed to match the input PDF by comparing the numbers in each cell of five randomly selected rows. As a demonstration, the first six rows of the PDF file are:

ZSymbA p or T_{1/2} I b_{c} b_{+} b_{-} c \sigma_{coh} \sigma_{inc} \sigma_{scatt} \sigma_{abs}
0-N-1 10.3 MIN 1/2 -37.0(6) 0 -37.0(6) 43.01(2) 43.01(2) 0
1-H -3.7409(11) 1.7568(10) 80.26(6) 82.02(6) 0.3326(7)
1-H-1 99.985 1/2 -3.7423(12) 10.817(5) -47.420(14) +/- 1.7583(10) 80.27(6) 82.03(6) 0.3326(7)
1-H-2 0.0149 1 6.674(6) 9.53(3) 0.975(60) 5.592(7) 2.05(3) 7.64(3) 0.000519(7)
1-H-3 12.26 Y 1/2 4.792(27) 4.18(15) 6.56(37) 2.89(3) 0.14(4) 3.03(5) < 6.0E-6

and the first six rows of the CSV file are:

ZSymbA,p or T1/2,I,bc,b+,b-,c,σcoh,σ inc,σscatt,σabs
0-N-1,10.3 MIN,1/2,-37.0(6),0,-37.0(6),,43.01(2),,43.01(2),0
1-H,,,-3.7409(11),,,,1.7568(10),80.26(6),82.02(6),0.3326(7)
1-H-1,99.985,1/2,-3.7423(12),10.817(5),-47.420(14),+/-,1.7583(10),80.27(6),82.03(6),0.3326(7)
1-H-2,0.0149,1,6.674(6),9.53(3),0.975(60),,5.592(7),2.05(3),7.64(3),0.000519(7)
1-H-3,12.26 Y,1/2,4.792(27),4.18(15),6.56(37),,2.89(3),0.14(4),3.03(5),< 6.0E-6
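Cells such as -3.7409(11) are still strings after conversion. The sketch below shows one way to split such a cell into a value and an uncertainty. It assumes the parentheses follow the usual concise notation, in which the bracketed digits give the uncertainty in the last quoted digits of the value; that reading is an assumption here, not something the CSV itself states. The function name `parse_measurement` is hypothetical.

```python
import re
from decimal import Decimal

# Matches cells like "-3.7409(11)" or "43.01(2)": a number followed by
# uncertainty digits in parentheses.
_MEASUREMENT = re.compile(r"^([+-]?\d+(?:\.\d+)?)\((\d+)\)$")


def parse_measurement(cell):
    """Return (value, uncertainty) as Decimals, or None if the cell does not match.

    Assumes concise notation: "-3.7409(11)" means -3.7409 with uncertainty 0.0011.
    """
    match = _MEASUREMENT.match(cell.strip())
    if match is None:
        return None  # empty cells, "+/-", "< 6.0E-6", plain numbers, etc.
    value_text, digits = match.groups()
    value = Decimal(value_text)
    # The exponent of the value's last digit (-4 for "-3.7409") scales the
    # bracketed digits onto the same decimal place.
    last_digit_exponent = value.as_tuple().exponent
    uncertainty = Decimal(digits).scaleb(last_digit_exponent)
    return value, uncertainty
```

Decimal is used rather than float so that the digits printed in the table are kept exactly. Cells that do not fit the pattern return None and are left for the caller to handle.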

At this point, pages was changed to "all", and the final CSV file was obtained. Once again, the final CSV file was checked against the original PDF file. There is no obvious method, other than manually checking the numbers row by row, to verify that all the values on all 10 pages of the PDF were carried over exactly. After verifying that the values in five randomly selected rows of the final CSV file exactly matched their counterparts in the PDF file, it was assumed that the rest of the CSV file copied all the information correctly. Empty cells in the PDF are empty in the corresponding CSV file, preserving the dimensions of the data structure. If there is a more rigorous way to approach this problem, please contact me.
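The spot check can be supplemented with a structural check, sketched below. It does not replace comparing values against the PDF. It only flags rows whose field count differs from the 11 column headers, which would indicate a dropped or shifted cell, and it picks the rows for manual comparison reproducibly. The function names, the seed, and the inline excerpt are illustrative assumptions.

```python
import csv
import io
import random

EXPECTED_COLUMNS = 11  # number of column headers in the CSV header row


def rows_with_wrong_width(rows, expected=EXPECTED_COLUMNS):
    """Return (row_number, row) pairs whose field count differs from `expected`."""
    return [(n, row) for n, row in enumerate(rows, start=1) if len(row) != expected]


def pick_rows_to_check(rows, k=5, seed=0):
    """Reproducibly choose up to k data rows (header excluded) to compare by hand."""
    data_rows = rows[1:]
    rng = random.Random(seed)  # fixed seed so the same rows can be re-checked later
    return rng.sample(data_rows, min(k, len(data_rows)))


# A short excerpt of the converted CSV stands in for the full file here.
excerpt = """ZSymbA,p or T1/2,I,bc,b+,b-,c,σcoh,σ inc,σscatt,σabs
0-N-1,10.3 MIN,1/2,-37.0(6),0,-37.0(6),,43.01(2),,43.01(2),0
1-H,,,-3.7409(11),,,,1.7568(10),80.26(6),82.02(6),0.3326(7)
1-H-3,12.26 Y,1/2,4.792(27),4.18(15),6.56(37),,2.89(3),0.14(4),3.03(5),< 6.0E-6
"""
rows = list(csv.reader(io.StringIO(excerpt)))
malformed = rows_with_wrong_width(rows)  # an empty list means every row has 11 fields
to_compare = pick_rows_to_check(rows, k=5)
```

For the full file, the rows would be read with `csv.reader` from the converted CSV instead of from the inline excerpt.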

This project concludes with a reflection: consider storing experimental data both in a PDF format, for final copies, and a machine-readable format, like a CSV, to be used in data science applications.

References

  • [1] Definition of the CSV Format. Internet Engineering Task Force. Retrieved September 25, 2020.
  • [2] Bound Coherent Neutron Scattering Lengths. Vienna University of Technology. Retrieved September 25, 2020.
  • [3] Neutron Scattering Lengths. Vienna University of Technology. Retrieved September 26, 2020.
  • [4] Ariga A. (2020) tabula-py. github.com/chezou. Retrieved September 25, 2020.
  • [5] Sears V. F. (1992) Neutron scattering lengths and cross sections. Neutron News, 3:3, 26-37. Retrieved September 27, 2020.
  • Parenthetical:

    For how to use tabula's method "read_pdf": https://stackoverflow.com/a/49562555
    For how to resolve the issue "module 'tabula' has no attribute 'read_pdf'": https://stackoverflow.com/a/60532664
    For why method 'read_pdf' was not included in tabula: https://stackoverflow.com/a/49997114
    For another reason why 'read_pdf' was not working with tabula: https://stackoverflow.com/a/54123725
    For how to respond to 'y/n' prompts in Jupyter Notebook: https://stackoverflow.com/a/39841757