The ever-increasing selection of microcontrollers brings the challenge of porting embedded software to new devices, which requires much manual work, while code generators are used only in special cases. Since, in practice, usable machine-readable data is scarce and the substantial amount of technical documentation is difficult to access due to the print-oriented nature of PDF, we identify the need for a processor that accesses the PDFs and extracts high-quality data to enable more code generation of embedded software.
In this paper, we design and implement a modular processor for extracting detailed data sets from technical documentation using deterministic table processing for thousands of microcontrollers: device identifiers, interrupt tables, packages and pinouts, pin functions, and register maps. Our evaluation on STMicro documentation compares the completeness and correctness of these data sets against existing machine-readable sources, yielding a weighted average match of 96.5% across almost 6 million data points, while also uncovering several issues in both sources. We show that our tool yields very accurate data with only limited manual effort and can enable and enhance a significant number of existing and new code generation use cases in the embedded software domain that are currently limited by a lack of machine-readable data sources.
The paper is published in the Journal of Systems Research (JSys) and is available free of charge.
@article{HP23,
author = {Hauser, Niklas and Pennekamp, Jan},
title = {{Automatically Extracting Hardware Descriptions from PDF Technical Documentation}},
journal = {Journal of Systems Research},
year = {2023},
volume = {3},
number = {1},
publisher = {eScholarship Publishing},
month = {10},
doi = {10.5070/SR33162446},
code = {https://github.com/salkinium/pdf-data-extraction-jsys-artifact},
code2 = {https://github.com/modm-io/modm-data},
}
Please note that this repository is archived for reproducibility. Any future development will happen in the modm-io/modm-data repository.
We thank the JSys reviewers for their remarks that improved our manuscript. We are grateful to Eduard Vlad for testing the artifacts and improving their documentation as well as to Roman Matzutt for proof-reading the manuscript.
This repository contains the exact same code that passed the artifact evaluation by the Journal of Systems Research (JSys).
This repository contains the entire code for the tool licensed as MPLv2:
- The conversion pipelines are implemented in the `modm_data` folder and are orchestrated by the `tools/scripts` files.
- The HTML patches are in the `patches` folder.
- The evaluation code and data are in the `tools/eval` folder.
The input and output data is zipped as a separate file, which we are not allowed to distribute publicly due to the copyright of the STMicro PDF documentation. Please contact @salkinium, who can provide you with a private copy of the input sources to proceed.
Please extract or symlink the artifact into the `ext/` folder, so that the code artifact has this structure:
jsys-artifact-code
├── ext
│ ├── cache
│ │ ├── stmicro-html
│ │ ├── stmicro-owl
│ │ ├── stmicro-pdf
│ │ └── stmicro-svd
│ ├── cmsis
│ └── modm-devices
├── modm_data
├── patches
└── tools
There are two artifact versions:
- A tiny version of the data that can be used to test all pipelines quickly with the individual commands described in each pipeline. However, it does not allow for the full evaluation to run.
- A complete version, containing all input data required to run all pipelines completely and perform the evaluation on the output data.
This is a Python 3.11 project making use of these libraries:
- `pypdfium2` for C-bindings to `pdfium`, a PDF manipulation library.
- `anytree` for a tree data structure.
- `owlready2` for working with knowledge graphs via OWL.
- `dashtable` for formatting tables in debug mode.
- `BeautifulSoup4` as a dependency of dashtable, unfortunately.
- `numpy` for working with transformation matrices.
- `lxml` for working with HTML.
- `pillow` for debug renders and image manipulation.
- `patch_ng` for applying unified diff patches.
- `deepdiff` for diffing data structures.
- `CppHeaderParser` for parsing C headers.
- `pygount` for counting source lines, similar to `cloc`.
- `matplotlib` for drawing graphs.
- `jinja2` for templating as part of `modm-devices`.
Install the project dependencies with the following command:
pip install -r requirements.txt
You also need `g++` installed and callable in your path.
The implemented pipelines are available as Python modules inside the `modm_data` folder. The data pipelines have the following structure:
┌──────┐ ┌──────────┐
┌────────►│CubeMX├─[modm-devices]─►│XML Format├─────────[modm-devices]──────┐
│ └──────┘ └──────────┘ ▼
┌───┴───┐ ┌────────────┐ ┌───────────┐ ┌───────────┐ ┌─────────┐
│STMicro├─►│PDF Document├─[pdf2html]─►│HTML Folder├───────────[html2py]─►│Python Data│◄─[owlready2]─►│OWL Graph│
└───┬───┘ └────────────┘ └───────────┴──[html2svd]─┐ └─────────┬─┘ └─────────┘
├────────────────────────────────────────────────┐ ▼ ▲ │ ┌──────────┐
│ ┌────────────┐ │ ┌─────────┐ │ └───────────────►│Evaluation│
└────────────────────►│CMSIS Header├─[header2svd]┴─►│CMSIS-SVD├─[cmsis-svd]─┘ └──────────┘
└────────────┘ └─────────┘
Not all pipelines are implemented directly in this project. For example, accessing the (7) STM32CubeMX database is already implemented by the `ext/modm-devices` project, so we just call their Python code directly. Similarly, parsing the (6) CMSIS-SVD files is already implemented by the `ext/cmsis/svd` project. Therefore, some pipelines just involve calling a single library function and are simply part of the evaluation rather than callable on their own. However, all novel pipelines are individually callable as described here.
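Conceptually, each pipeline in the diagram above is a chain of conversion stages. A minimal sketch of this orchestration idea, with hypothetical stand-in functions (the real stages live in the `modm_data` package):

```python
# Hypothetical sketch: pipeline stages compose as plain function chains.
# The function names and formats here are illustrative stand-ins only.
def pdf2html(pdf_name):
    return f"<html>{pdf_name}</html>"   # stands in for the PDF-to-HTML stage

def html2py(html):
    return {"source": html}             # stands in for the HTML-to-Python-data stage

def pipeline(pdf_name):
    # Each output format feeds the next stage, as in the diagram.
    return html2py(pdf2html(pdf_name))

print(pipeline("DS11581-v6"))
```

The evaluation then compares the Python data produced by different chains against each other.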
Conversion from PDF to HTML can be performed either selectively or for all PDF files from STMicro. Both ways are presented below.
Examples of accessing STMicro PDFs with the `tools/scripts/pdf2html.py` script:
# show the raw AST of the first page
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --page 1 --ast
# show the normalized AST of the first 20 pages
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --range :20 --tree
# Overlay the graphical debug output on top of the input PDF
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --page 1 --pdf --output test.html
# Convert a single PDF page into HTML
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --page 1 --html --output test.html
# Convert the whole PDF into a single (!) HTML
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --html --output test.html
# Convert the whole PDF into a folder with multiple HTMLs using multiprocessing
python3 tools/scripts/pdf2html.py --document ext/cache/stmicro-pdf/DS11581-v6.pdf --parallel --output DS11581
We recommend using the Makefile to convert all PDFs. This can take 1-2 hours! The parallelism depends on the number of CPU cores and the amount of RAM. We recommend using 4-8 jobs at most. The Makefile also redirects the output of every conversion into the `log/` folder.
# Conversion of a single datasheet
make ext/cache/stmicro-html/DS11581-v6
# or multiple PDFs
make ext/cache/stmicro-html/DS11581-v6 ext/cache/stmicro-html/RM0432-v9
# Convert all PDFs (Datasheets, Reference Manuals)
make convert-html -j4
# Clean all PDFs
make clean-html
Selective conversion of PDFs is also possible:
# Data Sheets only
make convert-html-ds
# Reference Manuals only
make convert-html-rm
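Since pages convert independently, the parallel conversion behaves like a fixed-size worker pool over pages. A stdlib sketch of that idea, with a dummy converter function (names hypothetical, not the tool's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def convert_page(page_number):
    # Stand-in for the real per-page PDF-to-HTML conversion.
    return f"page_{page_number}.html"

# Pages are independent, so a fixed-size pool (4-8 workers, matching the
# recommendation above) processes them in parallel; map() preserves order.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(convert_page, range(1, 6)))
print(pages)
```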
The resulting knowledge graphs are found in `ext/cache/stmicro-owl`.
Sadly, owlready2 does not sort the XML serialization, so the graphs change with every call, making diffs impractical. The conversion only takes a few minutes.
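If diffable output is needed, a deterministic serialization can be imposed after the fact. A stdlib sketch (not part of the tool) that recursively sorts sibling elements so identical graphs always serialize to identical text:

```python
import xml.etree.ElementTree as ET

def canonical(elem):
    # Recursively sort sibling elements by tag, attributes, and text so
    # that two serializations of the same graph produce identical output.
    for child in elem:
        canonical(child)
    elem[:] = sorted(elem, key=lambda e: (e.tag, sorted(e.attrib.items()),
                                          (e.text or "").strip()))

root = ET.fromstring('<g><b x="2"/><a/><b x="1"/></g>')
canonical(root)
print(ET.tostring(root, encoding="unicode"))
# → <g><a /><b x="1" /><b x="2" /></g>
```

Note that reordering siblings is only safe for formats where sibling order carries no meaning, which holds for RDF/XML triples but not for XML in general.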
# Convert a single HTML folder to OWL using table processing
python3 tools/scripts/html2owl.py --document ext/cache/stmicro-html/DS11581-v6
# Convert ALL HTML folders using multiprocessing with #CPUs jobs
python3 tools/scripts/html2owl.py --all
To perform the steps automatically, you may also use `make`:
# Generate all owl files
make convert-html-owl
# Remove all generated OWL Graphs
make clean-owl
The resulting SVD files are found in `ext/cache/stmicro-svd`.
Only takes a few minutes.
# Convert a single HTML folder to SVD using table processing
python3 tools/scripts/html2svd.py --document ext/cache/stmicro-html/RM0432-v9
# Convert ALL HTML folders using multiprocessing
python3 tools/scripts/html2svd.py --all
To perform the steps automatically, you may also use `make`:
# Conversion using make
make convert-html-svd
# Remove all svd files generated for rms
make clean-html-svd
The resulting SVD files are found in `ext/cache/stmicro-svd`.
Only takes a few minutes.
# Convert a group of devices into SVD files
python3 tools/scripts/header2svd.py --device stm32f030c6t6 --device stm32f030f4p6 --device stm32f030k6t6
# Convert all CMSIS headers into SVD files
python3 tools/scripts/header2svd.py --all
To perform the steps automatically, you may also use `make`:
# Using make
make convert-header-svd
# Remove all svd files
make clean-svd
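The core idea of the header-to-SVD pipeline is to recover register-level structure from CMSIS-style C headers. A toy sketch of that idea using a regex over `#define` lines (the pattern and header content are illustrative; the real pipeline uses `CppHeaderParser`):

```python
import re

# Illustrative CMSIS-style header fragment, not taken from a real device header.
header = """
#define GPIOA_BASE (0x48000000UL)
#define GPIOB_BASE (0x48000400UL)
"""

# Extract peripheral names and their base addresses from #define lines.
pattern = re.compile(r"#define\s+(\w+)_BASE\s+\((0x[0-9A-Fa-f]+)UL\)")
peripherals = {m.group(1): int(m.group(2), 16) for m in pattern.finditer(header)}
print(peripherals)
```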
The evaluation scripts reside in the `tools/eval` folder, including their output as `.txt` files. Some evaluations are split into two or three steps, since the actual comparison code is quite slow and the subsequent statistical computation is done separately. The intermediary data is stored as JSON files in the same folder.
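This two-phase pattern, a slow comparison that caches its result, followed by a fast statistics pass, can be sketched as follows (file names and numbers are illustrative, not the actual evaluation data):

```python
import json
import pathlib
import tempfile

# Phase 1 (slow): compute the raw comparison once and cache it as JSON.
comparison = {"matched": 42, "mismatched": 3}  # illustrative numbers
cache = pathlib.Path(tempfile.mkdtemp()) / "comparison.json"
cache.write_text(json.dumps(comparison))

# Phase 2 (fast): reload the cached data and compute statistics, so that
# tweaking the formatting never re-runs the slow comparison.
data = json.loads(cache.read_text())
accuracy = data["matched"] / (data["matched"] + data["mismatched"])
print(f"{accuracy:.1%}")
```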
To successfully render the charts, some dependencies are required. Specifically, a LaTeX distribution such as `texlive` is needed, along with `texlive-science` or at least the `siunitx.sty` style file.
To install the dependencies use the following command:
# Arch Linux
pacman -S texlive-bin texlive-science
# Ubuntu 22.04 (untested)
apt install texlive-base texlive-science
To perform the automatic evaluation for all the steps described below, execute the following command:
make evaluation-all
Assessed manually. Click around in the HTML archive to see for yourself. Also see the `patches/stmicro` folder for an understanding of what needed to be fixed manually.
Data for Table 4 is in `tools/eval/output_eval_identifiers.txt`.
# Check if all documents are uniquely identifiable
# Then check whether the identifiers are subsets of each other
python3 tools/eval/compare_identifiers.py > tools/eval/output_eval_identifiers.txt
Alternatively, you may use the `make` command:
make evaluation-did
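The uniqueness and subset checks boil down to set and prefix operations over identifier strings. A hypothetical sketch (the identifiers shown are examples, not the evaluation data):

```python
# Illustrative device identifiers, not the actual evaluation input.
ids = ["STM32F030C6", "STM32F030F4", "STM32F030"]

# Uniqueness: every document must map to a distinct identifier.
assert len(ids) == len(set(ids)), "duplicate identifiers"

# Subset check: flag pairs where one identifier is a prefix of another,
# since such overlaps make document-to-device matching ambiguous.
overlaps = [(a, b) for a in ids for b in ids if a != b and b.startswith(a)]
print(overlaps)
```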
Data is part of the section text from `tools/eval/output_eval_interrupts.txt`.
# Compiles the comparison data (slow)
python3 tools/eval/compare_interrupts.py > tools/eval/output_compare_interrupts.txt
# Computes and formats the comparison data nicely
python3 tools/eval/compare_interrupts.py --eval > tools/eval/output_eval_interrupts.txt
Alternatively, you may use the `make` command:
make evaluation-ivt
This is a lot of data to compare, so compiling the initial comparison takes about 10 minutes. The eval formatting afterwards is faster.
See `manual_eval_packages.txt` for the data that sources Appendix Tables 9 and 10.
# Compiles the comparison data (very slow!)
python3 tools/eval/compare_packages.py > tools/eval/output_compare_packages.txt
# Computes and formats the comparison data
python3 tools/eval/compare_packages.py --eval > tools/eval/output_eval_packages.txt
Alternatively, you may use the `make` command:
make evaluation-pap
Again, this is a lot of data and relatively slow. The data appears in the section text and in Appendix Tables 11 and 12.
# Compiles the comparison data (very slow!)
python3 tools/eval/compare_signals.py > tools/eval/output_compare_signals.txt
# Computes and formats the comparison data
python3 tools/eval/compare_signals.py --eval > tools/eval/output_eval_signals.txt
# Outputs charts
python3 tools/eval/compare_signals.py --charts
Alternatively, you may use the `make` command:
make evaluation-pf
This eval takes 30-40 minutes due to the sheer amount of data to evaluate. The data appears in the section text and in Tables 5, 6, 7, and 13; the charts correspond to Figures 5, 6, 7, and 8.
# Compiles the SVD comparison data (very slow!)
python3 tools/eval/compare_svds.py --compare > tools/eval/output_compare_svds.txt
# Computes and formats the comparison data
python3 tools/eval/compare_svds.py --eval > tools/eval/output_eval_svds.txt
# Outputs charts
python3 tools/eval/compare_svds.py --charts
Alternatively, you may use the `make` command:
make evaluation-rd
The tables in the appendix have been manually curated from the evaluation data.
Appendix Tables 9 and 10 are sourced from the `manual_eval_packages.txt` file, which contains a filtered and annotated version of the `output_eval_packages.txt` data produced by the 5.4.3 evaluation.
Appendix Table 11 is sourced from the `output_eval_signals.txt` file created by the 5.4.4 evaluation.
Appendix Table 12 is a filtered and annotated version of the same `output_eval_signals.txt` file, resulting in the `manual_eval_signals.txt` file.
Appendix Table 13 is sourced from the `output_eval_svds.txt` file created by the 5.4.5 evaluation.