A set of tools to transform transcript data from the Trump criminal case (indictment # 71543-2023), published on the New York State Unified Court System website.
Data produced from these tools can be found at the caughtlistening website.
The diagram below depicts the entire transformation process from HTML to MP3, with optional PDF and TXT outputs too.
All steps in the process (except for the audio generation and PDF to text) can be executed by the following command:
convert.sh <inputDir>
┌───────┐
│ HTML │
└───────┘
|
|
(html2png)
|
V
┌───────┐
│ PNG │
└───────┘
|
|
(png2txt)
|
V
┌───────┐ ┌───────┐ ┌───────┐
│ PDF │──(pdf2txt)──>│ TXT │──(png2txt)──>│ PDF │
└───────┘ └───────┘ └───────┘
|
|
(txt2json)
|
V
┌───────┐ ┌───────┐
│ JSON │──(json2txt)──>│ TXT │
└───────┘ └───────┘
|
|
(json2mp3)
|
V
┌───────┐
│ MP3 │
└───────┘
The original HTML transcripts are slowly being replaced by PDF versions. This tool converts a PDF into the same text format that the OCR process does, so that the rest of the pipeline can use it.
yarn pdf2txt --help
E.g.
yarn pdf2txt -o ./output input.pdf
A utility to extract image(s) from html file(s) and creates an image file for each html file provided as an argument.
yarn html2png --help
E.g.
yarn html2png -o ./output *.html
A utility to convert the text in image file(s) and creates a text file for each image file provided as an argument.
yarn png2txt --help
E.g.
yarn png2txt -o ./output *.png
A utility to convert text file(s) into a PDF file.
yarn text2pdf --help
E.g.
yarn txt2pdf -o ./output/transcript.pdf *.txt
Coming soon!
The JSON is an array of objects, where each object represents the words spoken by a character in the transcript along with some metadata.
Example:
[
{
"page": "Page 818",
"lineNumber": "1",
"character": "THE COURT:",
"voice": "Drew",
"text": "Good morning."
}
]
A utility to convert JSON file(s) into a single formatted plain text file.
yarn json2txt --help
E.g.
yarn json2txt -o ./output/transcript.txt transcript.json
A utility to convert JSON file(s) into mp3 file(s). An mp3 file is created for each object in the JSON file provided as an argument. See JSON structure for information about the JSON object.
yarn json2mp3 --help
E.g.
yarn json2mp3 -o ./output transcript.json