ShayHill/docx2python

Processing of a special characters

slarionoff opened this issue · 1 comments

This is not an issue but the feature request/question.

It is known that MS Word has a lot of special characters, like, non-breakable spaces, soft returns, soft hyphens etc. Is it possible to add a new element (dictionary?) to the module which will automatically replace all elements listed as keys to an elements listed as values? Word built-in symbols can be used for that, for example, this dictionary may look like this:
{
"^l": "^p",
"^s", " ",
"^-", "",
...
}

This will allow to prepare a document before it is actually parsed.

On the other hand, I saw the content of an zipped docx file and I know that those elements are not visible in the XMLs, so the document is to be rendered first, then these replaces are to be done, then it is to be saved and only then to be opened. Possibly, another module is to be created for that...

The elements can be inferred from the xml. I've done this already for some forms of checkboxes.

I have not been able to find a master list of special characters, but I'd like to support a subset (too much support of anything and the output will be as difficult to parse as the input). What characters do you suggest? I will likely incorporate these as a dictionary to make it simpler to extend the module.