output has hard spaces which are rejected in pandoc

Question

colcord opened this issue 7 years ago · 6 comments

Hi, I really appreciate this project. I've just had some trouble when using the output. When I put my text files through pandoc, I get a complaint about hard spaces which have been inserted. I think it's LaTeX which is rejecting them. Let me know if you want more details. Some of the spaces are represented as \20.

Answer 1 · 2018-03-27T14:57:45.000Z

I'm not sure what the cause of this issue could be. It may be that the text in the clipboard contains these characters, and the markdown conversion preserves them. It could also be the library I used to actually convert the HTML to markdown.

Someone previously contributed a patch that added a lot of support for pandoc and I notice there are a lot of replacements done on the stream. Perhaps you'd like to add one that replaces this \20 with a space in a pull request?

Answer 2 · 2018-03-27T15:00:24.000Z

Wow. What a fast response. I'm not a programmer, so I can't write a patch, unfortunately. Let me find another example and post it here, just so we have a clear test. Thanks for your prompt response.

Answer 3 · 2018-03-28T19:53:18.000Z

Hi, I've tested this again. This is the latest web page which caused problems:

https://www.quora.com/What-is-the-best-textbook-for-Category-theory

thanks, Frank

Answer 4 · 2018-03-29T07:50:55.000Z

Hi Frank,

I'll see what I can do. This project is pretty much unmaintained, so I'll need to find some spare time to look into this.

Answer 5 · 2018-04-01T13:54:27.000Z

Hi Euan, just noticed another related bug. When text is in italics, the space before the first asterisk is a hard space. I've just looked at the JavaScript, and I don't see where it returns a single asterisk as the replacement for italics. I see that it would return an underscore, but I haven't seen that in my results. Is most of the conversion done by to-markdown?
When I look around, I see that Dom Christie has updated his project to-markdown to turndown
https://github.com/domchristie/turndown
It looks as if he is maintaining it. I don't see a project which uses that code in a manner which is as easy to use as yours.
I wish I could make the changes myself. kind regards, Frank

    {
      filter: ['em', 'i'],
      replacement: function (content) {
        return '_' + content + '_'
      }
    },

Answer 6 · 2018-04-03T07:41:15.000Z

@colcord, try replacing this part:

              .replace(/[ ]+\n/g, '\n')

with:

              .replace(/[ ]+\n/g, '\n')
              .replace(/\u00a0/g, ' ')
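
For anyone who wants to check the effect of that change outside the extension, here is a minimal standalone sketch of the same two replacements. The function name `normalizeSpaces` and the sample string are made up for illustration and are not part of the project's source; only the two `.replace` calls come from the suggestion above.

```javascript
// Hypothetical helper for illustration only; not taken from the extension's code.
function normalizeSpaces(markdown) {
  return markdown
    .replace(/[ ]+\n/g, '\n')  // drop ordinary trailing spaces before a newline
    .replace(/\u00a0/g, ' ');  // turn each non-breaking space (U+00A0) into a plain space
}

// Made-up sample: hard spaces around an italic span, plus trailing spaces.
const sample = 'some\u00a0*italic*\u00a0text  \nnext line';
console.log(JSON.stringify(normalizeSpaces(sample)));
// prints "some *italic* text\nnext line"
```

Note that with this ordering, a hard space sitting directly before a newline would become an ordinary trailing space and stay in the output, because the trailing-space cleanup has already run. Swapping the two calls would avoid that, if it turns out to matter for pandoc.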