euangoddard/clipboard2markdown

Dirty HTML from Word leads to unwanted results

Opened this issue · 1 comments

When content from Word is pasted, a lot of extra HTML markup comes along with it.

This can lead to words being strung together, like: "thequick brownfoxjumps over the lazydog".

I fixed this (to a large extent) in a local project by simply stripping all the unwanted markup from the HTML.

function sanitizeHtml(html) {
    // Remove the (useless) <span> and <o:p> tags that Word uses
    // to set fonts and other styling on parts of the text.
    // The "o:" prefix is matched explicitly so that <ol> is left alone.
    html = html.replace(/<span[^>]*>|<\/span>/g, '');
    html = html.replace(/<\/?o:[^>]*>/g, '');

    // Clean the remaining tags by removing everything after "<tag" up to,
    // but excluding, the closing ">". This strips all attributes per tag
    // (note: that includes href on links and src on images).
    html = html.replace(/(<\w+)([^>]*)/g, '$1');

    // Clean up the bold / italics mess that can lead to conversion issues.
    // This covers only the basic cases.
    html = html.replace(/<\/b><b>/g, '');
    html = html.replace(/<\/i><i>/g, '');
    // Move whitespace before a closing tag to just after it
    html = html.replace(/\s+<\/b>/g, '</b> ');
    html = html.replace(/\s+<\/i>/g, '</i> ');
    html = html.replace(/\s+<\/p>/g, '</p>');
    return html;
}

Using it:

pastebin.addEventListener('paste', function() {
    // Wait briefly so the pasted content is available in pastebin
    setTimeout(function() {
        // Sanitize the pasted HTML before converting it
        var html = sanitizeHtml(pastebin.innerHTML);

        output.value = convert(html);

        output.focus();
        output.select();
    }, 200);
});

The solution is not tested for all cases.
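
One case that may be worth covering: the attribute-stripping regex above turns `<a href="...">` into a bare `<a>`, so link targets (and image sources) are lost before conversion. Below is a rough sketch of a gentler variant that keeps a small whitelist of attributes. The names `stripAttributes` and `KEEP` are made up for illustration and are not part of clipboard2markdown. It assumes double-quoted attribute values and does not handle a `>` inside an attribute value.

```javascript
// Hypothetical whitelist: attributes to keep, per tag name (illustrative only).
var KEEP = { a: ['href'], img: ['src', 'alt'] };

// Sketch of an attribute stripper that preserves whitelisted attributes.
// Assumes double-quoted values and no ">" inside attribute values.
function stripAttributes(html) {
    // Match each opening tag: tag name in group 1, raw attribute text in group 2.
    return html.replace(/<([a-zA-Z][\w:]*)([^>]*)>/g, function (match, tag, attrs) {
        var name = tag.toLowerCase();
        var allowed = Object.prototype.hasOwnProperty.call(KEEP, name) ? KEEP[name] : [];
        var attrPattern = /([\w:-]+)\s*=\s*"([^"]*)"/g;
        var kept = '';
        var m;

        // Walk over name="value" pairs and keep only the whitelisted ones.
        while ((m = attrPattern.exec(attrs)) !== null) {
            if (allowed.indexOf(m[1].toLowerCase()) !== -1) {
                kept += ' ' + m[1].toLowerCase() + '="' + m[2] + '"';
            }
        }
        return '<' + tag + kept + '>';
    });
}
```

With this sketch, `stripAttributes('<a href="https://example.com" style="color:red">link</a>')` gives `<a href="https://example.com">link</a>`, while a tag such as `<p class="MsoNormal">` is still reduced to `<p>`. It could stand in for the single attribute-stripping `replace` call inside `sanitizeHtml`, though that combination is equally untested.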

Feel free to make a pull request for this fix. It looks good to me 👍