croqaz/clean-mark

If the code of the article has Chinese comments, weird encoding can appear

SilenceZhou opened this issue · 7 comments

If the code in an article contains Chinese comments, garbled characters can appear in the output.

Try this URL (a Chinese blog):

clean-mark "https://juejin.im/post/5e916011e51d4547153d15c7"

I have the same problem. Example:
{@link 包名.类名#方法名(参数类型)} -->
{@link 包名.类名#方法名(参数类型)}

Hi guys, thank you for raising this issue!
I implemented an encoding feature some time ago: #2
In the case of the website you mention, the encoding cannot be detected from the meta charset.

I will implement a new command line flag so you can specify the encoding manually, e.g. --encoding gb2312. I will probably implement this in the next few days.
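
For anyone following along, here is a minimal sketch of the fallback order described above: a manually specified encoding first, then the meta charset, then a default. This is not clean-mark's actual code. The helper names `detectMetaCharset` and `chooseEncoding` are hypothetical, and the UTF-8 default is an assumption.

```typescript
// Hypothetical helpers for illustration only; not part of clean-mark.

// Look for a charset declaration in a <meta> tag. Handles both
// <meta charset="utf-8"> and
// <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
function detectMetaCharset(html: string): string | null {
  const match = /<meta[^>]*charset\s*=\s*["']?\s*([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Pick an encoding: an explicit user choice wins, then the meta tag,
// then UTF-8 (the default here is an assumption).
function chooseEncoding(html: string, userEncoding?: string): string {
  if (userEncoding) {
    return userEncoding.toLowerCase();
  }
  return detectMetaCharset(html) ?? "utf-8";
}
```

With something like this, a page that declares no charset would still be decoded as requested when the user passes an encoding by hand.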

That would be great for Chinese users. I just ran into the same issue. Thank you very much.

Guys, I haven't had time to look at this issue in depth, sorry about that.
But I did find something, and there's good news and bad news.
The good news is that the encoding is handled correctly all the way through the HTML stage.
The bad news is that the breakdance library, which converts the HTML into Markdown, breaks the encoding inside code blocks.

You can actually check this on your own like this:

clean-mark 'https://juejin.im/post/5e916011e51d4547153d15c7' -t html

You'll see that the HTML is correct. At least it looks correct to me, but I don't understand the language...
So I'll look into this more and see if there's anything I can do.

In the worst case, I'll have to look at alternative libraries for converting the HTML into Markdown. If there are any...

I have just checked the HTML generated by the command above, and it is correct.
Thank you for doing this for us. Hopefully it will be solved one day.

Hi guys, I believe I fixed the issue in the latest commit.
I replaced "breakdance" with "turndown" to convert the HTML into Markdown, and it works much better.
I haven't made a release yet because the tests are still broken, but it would be amazing if you could clone the repo and check a few websites. I'm thinking of adding a few pages to the tests too, just to make sure the app keeps working.
Would you mind giving me 2-3 links to articles that you think are most important?

Thanks! Thanks! Thanks! I have cloned the repo and checked a few websites, and it generally works. For example:
https://blog.csdn.net/weixin_33743248/article/details/88733044
😄

However, in this article (https://blog.csdn.net/NextStand/article/details/59535555), some code comments, such as “//输出 test.js”, are lost.