Modality purification
dginev opened this issue · 1 comment
dginev commented
Similarly to KWARC/deprecated-LLaMaPUn#2, we should port the old modality purification to Rust and run it prior to doing any NLP analyses.
For instance, the naive arXMLiv token model I just generated shows 75,000 unique words that contain "mathformula". A purification step can denoise that.
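To make the kind of denoising concrete, here is a minimal sketch in Rust. It is not the old LLaMaPUn purification heuristic. It only assumes that the noise comes from the "mathformula" placeholder being fused with neighbouring characters in a single token, and it splits such tokens apart. The names `split_fused` and `unique_with_placeholder` are hypothetical and exist only for this illustration.

```rust
use std::collections::HashSet;

/// Placeholder standing in for a formula in the token model.
/// The spelling is taken from the issue text; how it is actually
/// emitted into the corpus is an assumption.
const PLACEHOLDER: &str = "mathformula";

/// Split one token so that every placeholder occurrence becomes its
/// own token. Text fused to the placeholder is kept as separate
/// fragments; empty fragments are dropped.
fn split_fused(token: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut rest = token;
    while let Some(pos) = rest.find(PLACEHOLDER) {
        if pos > 0 {
            // Text that precedes the placeholder.
            parts.push(rest[..pos].to_string());
        }
        parts.push(PLACEHOLDER.to_string());
        rest = &rest[pos + PLACEHOLDER.len()..];
    }
    if !rest.is_empty() {
        // Trailing text after the last placeholder (or the whole token).
        parts.push(rest.to_string());
    }
    parts
}

/// Count the unique tokens that contain the placeholder.
fn unique_with_placeholder<'a, I>(tokens: I) -> usize
where
    I: IntoIterator<Item = &'a str>,
{
    tokens
        .into_iter()
        .filter(|t| t.contains(PLACEHOLDER))
        .collect::<HashSet<_>>()
        .len()
}
```

Running `split_fused` over every token before building the vocabulary would collapse all fused variants into the bare placeholder plus ordinary text fragments. Whether that matches what the original purification did is an open question.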
dginev commented
... but we have seen that good results can be obtained without going the heuristic preprocessing route, given enough data, so maybe this can be left to rest without a new Rust implementation.