Simple Python package for converting the VRT format (VeRticalized Text, a.k.a. "verticalized XML" or IMS CWB format, the input format of the IMS Open Corpus Workbench), as used by the Language Bank of Finland ("Kielipankki"), into plain text.
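For orientation, VRT stores one token per line with tab-separated annotations, and marks structure with XML-like tags on their own lines. The sketch below uses a made-up fragment and a naive reader to show the general idea; the tag names, the column layout and the helper `iter_sentence_tokens` are assumptions for illustration only, and this is not how vrt2txt is implemented.

```python
# Illustrative only: a made-up VRT fragment and a naive reader.
# Tag names and columns are assumptions; real corpora may differ.
SAMPLE_VRT = (
    '<text title="Esimerkki">\n'
    "<paragraph>\n"
    "<sentence>\n"
    "Tämä\t1\ttämä\n"
    "on\t2\tolla\n"
    "lause\t3\tlause\n"
    ".\t4\t.\n"
    "</sentence>\n"
    "</paragraph>\n"
    "</text>\n"
)


def iter_sentence_tokens(vrt: str):
    """Yield one list of word forms per <sentence> element (hypothetical helper)."""
    tokens = []
    for line in vrt.splitlines():
        if line.startswith("</sentence"):
            # End of a sentence: hand over the collected tokens.
            yield tokens
            tokens = []
        elif line.startswith("<") or not line.strip():
            continue  # structural tag or blank line
        else:
            # Assume the first tab-separated column is the word form.
            tokens.append(line.split("\t")[0])


sentences = list(iter_sentence_tokens(SAMPLE_VRT))
print(sentences)
```

Turning such token lists back into readable sentences is the harder part, which is what the package takes care of.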
Once cloned, install with
python -m pip install -e <root>
where <root> is the path to the folder containing pyproject.toml.
from pathlib import Path

from vrt2txt import iter_vrt_xml

raw_folder_root = Path(__file__).parent / "raw"
extracted_text_folder = Path(__file__).parent / "extracted_text"


def process_folder(folder: Path, folder_out: Path, paragraphs=False):
    folder_out.mkdir(exist_ok=True, parents=True)
    for file in folder.glob("*.VRT"):
        print("Processing", file)
        outfile = folder_out / file.with_suffix(".txt").name
        with open(outfile, "w") as f:
            for text in iter_vrt_xml(contents=file.read_text(), paragraphs=paragraphs):
                f.write(text)


if __name__ == "__main__":
    process_folder(
        raw_folder_root / "wikipedia-fi-2017-src",
        folder_out=extracted_text_folder / "wikipedia",
        paragraphs=True,
    )
    process_folder(
        raw_folder_root / "opensub-fi-2017-src",
        folder_out=extracted_text_folder / "opensub",
        paragraphs=False,
    )
I wrote this as part of my keyboard layout optimization project, where I created an English+Finnish+Coding optimized layout called Granite. This package is alpha-level quality, but it has some unit tests.
Some functionality is still missing. For example, only a few types of quotes are handled, and URLs are not handled correctly. There are probably many other edge cases that could be handled better, but this extracts sentences perfectly in more than 99% of cases, so it was good enough for me. I'm not currently planning to work on this package further. Feel free to fork and modify it to your needs.
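To illustrate why quotes are a tricky edge case: since VRT stores one token per line, a converter has to decide where spaces go when joining tokens back into a sentence. The following is a minimal sketch with a hypothetical helper `join_tokens` and made-up rule sets; it is not the logic used in this package, just an illustration of the problem.

```python
# Hypothetical helper for illustration; not part of vrt2txt.
NO_SPACE_BEFORE = {".", ",", "!", "?", ":", ";", ")"}
NO_SPACE_AFTER = {"("}


def join_tokens(tokens):
    """Join tokens with single spaces, except around simple punctuation."""
    out = ""
    for tok in tokens:
        # Add a space unless the token attaches to the left,
        # or the previous character attaches to the right.
        if out and tok not in NO_SPACE_BEFORE and out[-1] not in NO_SPACE_AFTER:
            out += " "
        out += tok
    return out


# Brackets and end punctuation work with fixed rules.
plain = join_tokens(["Hän", "sanoi", "(", "hiljaa", ")", ":", "hei", "!"])
print(plain)   # Hän sanoi (hiljaa): hei!

# A straight double quote can be opening or closing, so fixed rules fail.
quoted = join_tokens(["Hän", "sanoi", '"', "hei", '"', "."])
print(quoted)  # Hän sanoi " hei ".
```

Brackets have distinct opening and closing characters, so a lookup table is enough. A straight quote looks the same in both roles, so handling it correctly requires tracking whether a quotation is currently open, and each additional quote style needs its own treatment.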
Run the unit tests with
python -m pytest
- You can download VRT data from: kielipankki.fi/download/