mika-cn/maoxian-web-clipper

Links in my Markdown file are not hidden because of "redundant spaces"

Golddouble opened this issue · 8 comments

It looks like my Markdown file created with maoxian-web-clipper has too many spaces after "[" and before "]" in links.

I have clipped this page:
image
Source: https://www.ebay.de/sch/i.html?_nkw=dummies+statistik&_sacat=0&_sop=15

This gave me the following file:
2023-08-07 22-09-15.zip

It looks like this in reading view:
image

Only after deleting the spaces after "[" and before "]" in the links do I get a better layout:
image

Question:
Shouldn't maoxian-web-clipper delete these spaces automatically?
Can I do anything?

I would appreciate an answer. Thank you.

Thanks for the feedback :)

I can reproduce this problem. It happens because Turndown (the JS library that MaoXian uses to convert HTML to Markdown) converts block elements into \n\n XXX \n\n, and the given page wraps images in a <div> (which is a block element) inside link tags <a>, like this:

<a href="https://example.org/index">
  <div class="a-image-wrapper">
    <img src="a-image.png">
  </div>
</a>

So it'll be converted to markdown like this:

[

![](a-image.png)

](https://example.org/index)

Shouldn't maoxian-web-clipper delete these spaces automatically?

Yes, it should delete these spaces.
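
As a rough illustration of what "deleting these spaces" could look like, here is a minimal sketch that post-processes the generated Markdown. It is not MaoXian's actual fix and not part of Turndown; the function name `tightenImageLinks` is hypothetical. It assumes the link wraps exactly one image and that neither URL contains spaces or parentheses, and it deliberately leaves every other kind of link untouched.

```javascript
// Hypothetical helper (an assumption for illustration, not MaoXian or Turndown code).
// Removes whitespace between a link's brackets and the single image it wraps.
function tightenImageLinks(markdown) {
  // "[" + whitespace + image + whitespace + "](" + url + ")"
  // Simplification: URLs must not contain spaces or ")" characters.
  const imageLink = /\[\s*(!\[[^\]]*\]\([^)\s]*\))\s*\]\(([^)\s]*)\)/g;
  return markdown.replace(imageLink, (_match, image, url) => `[${image}](${url})`);
}

const clipped = "[\n\n![](a-image.png)\n\n](https://example.org/index)";
console.log(tightenImageLinks(clipped));
// [![](a-image.png)](https://example.org/index)
```

Because the pattern only matches when the link text is an image, plain text links and links that wrap other block content pass through unchanged.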

Maybe we can unwrap the image link and put the link below the image, like the Markdown below. What do you think?

![](i-am-a-image.png)

[image link](https://example.org/index)
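
The unwrap idea could be sketched roughly as follows. This is only an illustration under assumptions: `unwrapImageLinks` and its `fallbackText` parameter are hypothetical names, the pattern handles only a link that wraps a single image, and URLs are assumed to contain no spaces or parentheses. When the image has alt text, the sketch reuses it as the link text; otherwise it falls back to "image link".

```javascript
// Hypothetical helper (an assumption for illustration, not MaoXian or Turndown code).
// Rewrites "[ ![alt](img) ](url)" as the image followed by a separate text link.
function unwrapImageLinks(markdown, fallbackText = "image link") {
  // Group 1: the whole image, group 2: its alt text, group 3: the link URL.
  const imageLink = /\[\s*(!\[([^\]]*)\]\([^)\s]*\))\s*\]\(([^)\s]*)\)/g;
  return markdown.replace(imageLink, (_match, image, alt, url) => {
    const label = alt.trim() || fallbackText;
    return `${image}\n\n[${label}](${url})`;
  });
}

const clipped = "[\n\n![](i-am-a-image.png)\n\n](https://example.org/index)";
console.log(unwrapImageLinks(clipped));
// ![](i-am-a-image.png)
//
// [image link](https://example.org/index)
```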

I'm not sure I understand 100% correctly.

Do you mean this:
![Statistik für Wirtschafts- und Sozialwissenschaftler für Dummies A2 Thomas ...](asset/s-l300.webp)

[](https://www.ebay.de/itm/195524360773?epid=3042165208&hash=item2d8628fa45:g:Nr0AAOSwIf5joDmX&amdata=enc%3AAQAIAAAAwHFxi1KBPwUIcCk8tKLXL1LAN4KLWdr%2FP12IZNSb8zy66kfAwvL2Dz4x1MJgQ7IrKfQkHSPvU2m%2FBVmuG770YL0y5%2F4k%2FRFXli9ZbPAojdLW1Znou66D3v%2BkyoMFK3CahaAkhduhAfzMeTOZlphoShl0VSmXM%2Fdz3Mc2KJcr7GPnzSpcgveGrF5w5T83CRHH4bbie4ftZuQkOeadnNNv3ypR%2Fot5eOAczhzNyksLI4xLo6rt9xcZjwdsWrm0cD6prg%3D%3D%7Ctkp%3ABk9SR7Ko_dW1Yg)

or this:
![](asset/s-l300.webp)

[Statistik für Wirtschafts- und Sozialwissenschaftler für Dummies A2 Thomas ...](https://www.ebay.de/itm/195524360773?epid=3042165208&hash=item2d8628fa45:g:Nr0AAOSwIf5joDmX&amdata=enc%3AAQAIAAAAwHFxi1KBPwUIcCk8tKLXL1LAN4KLWdr%2FP12IZNSb8zy66kfAwvL2Dz4x1MJgQ7IrKfQkHSPvU2m%2FBVmuG770YL0y5%2F4k%2FRFXli9ZbPAojdLW1Znou66D3v%2BkyoMFK3CahaAkhduhAfzMeTOZlphoShl0VSmXM%2Fdz3Mc2KJcr7GPnzSpcgveGrF5w5T83CRHH4bbie4ftZuQkOeadnNNv3ypR%2Fot5eOAczhzNyksLI4xLo6rt9xcZjwdsWrm0cD6prg%3D%3D%7Ctkp%3ABk9SR7Ko_dW1Yg)

But maybe this is an issue that should rather be solved by the Turndown developer.
Or am I missing something?

Sorry for the delayed reply.

I was trying to fix it (remove these unneeded spaces), but I can't stop thinking of other cases where a link <a> wraps other block elements, so I haven't fixed it yet.


But maybe this is an issue that should rather be solved by the Turndown developer.

I searched Turndown's issues, and there is indeed an issue about this. See: https://github.com/mixmark-io/turndown/issues/332.

The author of Turndown said this:

It's a little more complicated. We need to introduce Markdown contexts to do it both universally and efficiently. This isolated case has indeed a simpler solution, but I don't like the idea to provide case-by-case fixes for these, especially when there is no universally correct solution. With contexts, users would be able to choose what to do with block elements nested in inline elements. It's simple for a div without semantics. But for e.g. a table inside a link, something valid means either discarding user's data or just keeping it in HTML (which denies Turndown itself). So in this cases, users have to choose. Now they can only choose what to do with these cases using HTML preprocessing - unfortunately.

That issue was opened on 2020-06-14, so I don't expect Turndown to fix it in the near future. But this problem needs to be solved, even if we can't come up with a very good solution that handles every case of a block element inside an anchor element.


We will also consider the unwrap solution in the future.

Thank you for taking care of the problem.

I do not know much about Markdown and HTML tags. But it is important that a fix does not create new conversion problems that did not exist before. So I think it is important to delete the spaces only in a specific context.

But it is important that a fix does not create new conversion problems that did not exist before. So I think it is important to delete the spaces only in a specific context.

I agree with you. We can handle these common cases first. I've fixed this specific case in the new version and published it. Please update and send feedback if the problem still exists.

Thank you.

I have tested it.
It works great for the "picture" case.

But in the example above, there is a second case. It's case 2 in the following picture:
image

Of course, this is another case.

Is there no safe way to solve this second case as well?

Thanks for the feedback.

This new case is not easy to solve. The problem is that Markdown has no corresponding format for block links (in this case, multiple lines of text). So how do we convert these block links to Markdown?

As this issue is for the image links specifically, I've created a new issue for the discussion about general block links.