zimtextextractor

Text extractor for ZIM HTML dumps such as Wikipedia.

Primary language: C++. License: Apache-2.0.

Early-stage C++ tool that reads a ZIM file (such as the Kiwix top 100 Wikipedia articles), parses each article's HTML with lexbor, and extracts the text content for further analysis.
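
To make the goal of the extraction step concrete, here is a minimal sketch of turning an HTML fragment into plain text. It is not this project's code and it does not use lexbor: `extract_text` is a hypothetical helper that naively skips anything between `<` and `>` and collapses whitespace. A real parser such as lexbor also deals with entities, comments, script and style content, and malformed markup, none of which this stand-in handles.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only (not part of this repository,
// and not a substitute for a real HTML parser such as lexbor).
// Drops everything between '<' and '>', treats each tag as a word boundary,
// and collapses runs of whitespace into a single space.
std::string extract_text(std::string_view html) {
    std::string out;
    bool in_tag = false;         // currently between '<' and '>'
    bool pending_space = false;  // a separator was seen since the last text char

    for (char c : html) {
        if (c == '<') {
            in_tag = true;
            pending_space = true;  // a tag separates the words around it
            continue;
        }
        if (in_tag) {
            if (c == '>') in_tag = false;
            continue;              // skip tag names and attributes
        }
        if (std::isspace(static_cast<unsigned char>(c))) {
            pending_space = true;
            continue;
        }
        // Emit at most one space between words, and none at the start.
        if (pending_space && !out.empty()) out += ' ';
        pending_space = false;
        out += c;
    }
    return out;
}
```

Under these assumptions, calling `extract_text` on `<h1>Title</h1>\n<p>Hello   <b>world</b></p>` yields `Title Hello world`. Treating every tag as a word boundary keeps headings and paragraphs from running together, at the cost of an extra space where inline tags touch punctuation.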

A small (1.2 MB) ZIM file based on the top 100 English Wikipedia articles: wikipedia_en_100_mini_2021-01.zim

The build assumes libzim and lexbor are compiled in subdirectories of this directory rather than installed system-wide. The build setup is fragile and should be scrapped and redone.

Barely tested, use at your own risk.

# after adding the deps, see libzim/README.md
cd libzim
meson . build
ninja -C build
cd ..

# See lexbor/INSTALL.md
cd lexbor
# TODO: use a build dir instead
cmake . -DLEXBOR_BUILD_TESTS=OFF -DLEXBOR_BUILD_EXAMPLES=OFF -DLEXBOR_BUILD_SEPARATELY=ON
make -j11
cd ..