KWARC/llamapun

Modality purification

dginev opened this issue · 1 comments

Similarly to KWARC/deprecated-LLaMaPUn#2, we should port the old modality purification to Rust, and run it prior to doing any NLP analyses.

For instance, the naive arXMLiv token model I just generated shows 75,000 unique words that contain "mathformula". A purification step can denoise that.
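To make the idea concrete, here is a minimal sketch of what such a purification pass could look like. It is not the old LLaMaPUn implementation; the splitting rule and the names `purify_token` and `purify` are assumptions made up for illustration. The only thing taken from this issue is the "mathformula" placeholder. The sketch detaches the placeholder from whatever characters are glued to it, so that variants like "xmathformula" stop showing up as separate vocabulary entries.

```rust
/// Placeholder string mentioned in this issue. Everything else below
/// (function names, the splitting rule) is a hypothetical illustration.
const PLACEHOLDER: &str = "mathformula";

/// Splits one token into pieces so that every occurrence of the
/// placeholder becomes a token of its own. Tokens without the
/// placeholder are returned unchanged as a single piece.
fn purify_token(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = token;
    while let Some(pos) = rest.find(PLACEHOLDER) {
        let (before, after) = rest.split_at(pos);
        if !before.is_empty() {
            out.push(before.to_string());
        }
        out.push(PLACEHOLDER.to_string());
        // Skip past the placeholder and keep scanning the remainder.
        rest = &after[PLACEHOLDER.len()..];
    }
    if !rest.is_empty() {
        out.push(rest.to_string());
    }
    out
}

/// Applies `purify_token` to a whole token stream, preserving order.
fn purify(tokens: &[&str]) -> Vec<String> {
    tokens.iter().flat_map(|t| purify_token(t)).collect()
}

fn main() {
    let noisy = ["the", "mathformula-dimensional", "xmathformula", "space"];
    println!("{:?}", purify(&noisy));
}
```

With this rule, the four noisy tokens above become six, and the placeholder appears only in its bare form. A real port would likely need more heuristics than this (the old module's rules are not described here), so treat it as a starting point only.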

... but we have seen that good results can be obtained without going the heuristic preprocessing route, given enough data, so maybe this can be allowed to rest without a new Rust implementation.