Dirty HTML from Word leads to unwanted results
peterkaptein commented
When content from Word is pasted, a lot of extra Word-specific HTML comes along with it.
This can lead to words being strung together, like: "thequick brownfoxjumps over the lazydog".
I fixed this (to a large extent) in a local project by stripping the unwanted markup from the HTML.
function sanitizeHtml(html) {
  // Remove the <span> tags Word uses to set fonts and other styling on
  // parts of the text (the text inside them is kept)
  html = html.replace(/<span[^>]*>|<\/span>/g, '');
  // Remove Word's namespaced Office tags such as <o:p>; requiring the "o:"
  // prefix keeps ordinary tags like <ol> intact
  html = html.replace(/<\/?o:[^>]*>/g, '');
  // Strip attributes from the remaining tags by dropping everything after
  // "<tag" up to, but excluding, ">". Note that this also removes href and src.
  html = html.replace(/(<\w+)([^>]*)/g, '$1');
  // Clean up bold / italics mess that can lead to conversion issues
  // This covers only the basic cases
  html = html.replace(/<\/b><b>/g, '');
  html = html.replace(/<\/i><i>/g, '');
  html = html.replace(/\s+<\/b>/g, '</b> ');
  html = html.replace(/\s+<\/i>/g, '</i> ');
  html = html.replace(/\s+<\/p>/g, '</p>');
  return html;
}
Using it:
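pastebin.addEventListener('paste', function() {
  // Wait for the browser to insert the pasted content into pastebin
  setTimeout(function() {
    // Sanitize the pasted HTML before converting it
    var html = sanitizeHtml(pastebin.innerHTML);
    output.value = convert(html);
    output.focus();
    output.select();
  }, 200);
});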
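The 200 ms timeout in the handler above waits for the browser to put the pasted content into pastebin. As a possible alternative, the HTML could be read from the paste event itself via clipboardData.getData('text/html'). This is only a sketch: readPastedHtml and attachPasteHandler are hypothetical names, and it assumes the browser exposes clipboardData on the paste event.

```javascript
// Sketch: read the pasted HTML from the paste event itself instead of
// waiting for the browser to insert it into the element.
function readPastedHtml(event) {
  var data = event.clipboardData;
  if (!data) {
    return ''; // no clipboard access; caller can fall back to the timeout
  }
  // getData returns an empty string when no 'text/html' flavour is present
  return data.getData('text/html');
}

// Hypothetical wiring. The names mirror the snippet above, but they are
// passed in as parameters so the function does not depend on globals.
function attachPasteHandler(pastebin, output, sanitizeHtml, convert) {
  pastebin.addEventListener('paste', function (event) {
    var html = readPastedHtml(event);
    if (!html) {
      return; // nothing usable: leave the default paste behaviour alone
    }
    event.preventDefault();
    output.value = convert(sanitizeHtml(html));
    output.focus();
    output.select();
  });
}
```

It would be called once as attachPasteHandler(pastebin, output, sanitizeHtml, convert). When no HTML is available the handler does nothing, so the existing timeout-based handler could stay in place as a fallback.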
The solution is not tested for all cases.
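One case worth checking is links and images: the attribute-stripping step in sanitizeHtml removes every attribute, so href and src are lost as well. A possible variant keeps a small allow-list instead. This is a sketch, not part of the fix above: stripAttributes and KEPT_ATTRIBUTES are hypothetical names, the allow-list is an assumption, and only quoted attribute values are recognised.

```javascript
// Hypothetical helper: strips attributes from opening tags but keeps
// those on a small allow-list (assumed here: href, src, alt).
var KEPT_ATTRIBUTES = ['href', 'src', 'alt'];

function stripAttributes(html, kept) {
  kept = kept || KEPT_ATTRIBUTES;
  // Match an opening tag: "<name", then everything up to (excluding) ">".
  // Like the original regex, this breaks if an attribute value contains ">".
  return html.replace(/<(\w+)([^>]*)/g, function (match, tag, rest) {
    var survivors = [];
    // Match name="value" or name='value' pairs; unquoted values are dropped
    var attrPattern = /([\w:-]+)\s*=\s*("[^"]*"|'[^']*')/g;
    var attr;
    while ((attr = attrPattern.exec(rest)) !== null) {
      if (kept.indexOf(attr[1].toLowerCase()) !== -1) {
        survivors.push(attr[1] + '=' + attr[2]);
      }
    }
    return '<' + tag + (survivors.length ? ' ' + survivors.join(' ') : '');
  });
}

var sample = '<p class="MsoNormal" style=\'margin:0\'>Hi ' +
  '<a href="http://example.com" target="_blank">link</a></p>';
console.log(stripAttributes(sample));
// prints: <p>Hi <a href="http://example.com">link</a></p>
```

Inside sanitizeHtml, this call would take the place of the single replace that strips attributes; the other steps stay the same.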
euangoddard commented
Feel free to make a pull request for this fix. It looks good to me 👍