mozilla/bleach

bug: Linkify does link "M.Sc." in text

barsch opened this issue · 5 comments

Describe the bug

Linkify links text that shouldn't be linked ...

Python and bleach versions:

  • Python Version: 3.7.9
  • Bleach Version: 4.1.0

To Reproduce

Steps to reproduce the behavior:

>>> import bleach
>>> value = "XXX has an M.Sc. in YYY"
>>> bleach.linkify(value)
'XXX has an <a href="http://M.Sc" rel="nofollow">M.Sc</a>. in YYY'
>>> bleach.__version__
'4.1.0'

Expected behavior

>>> import bleach
>>> value = "XXX has an M.Sc. in YYY"
>>> bleach.linkify(value)
'XXX has an M.Sc. in YYY'

Linkify links text that shouldn't be linked ...

Debatable. It's obviously not possible to tell deterministically whether something is supposed to be a URL or not. For example, consider the sentences:

I like hot chocolate.It makes me happy.

Here it's impossible to tell whether a space is missing between two sentences, or whether the intention is to link to the Italian domain chocolate.It, which should be linkified.

So why does the following work, then:

>>> bleach.linkify("yyy M.Sac. xxx")
'yyy M.Sac. xxx'
>>> bleach.linkify("yyy M.bla. xxx")
'yyy M.bla. xxx'
>>> bleach.linkify("yyy M.bl. xxx")
'yyy M.bl. xxx'
>>> bleach.linkify("yyy M.B. xxx")
'yyy M.B. xxx'

That seems inconsistent ...

Anyway, I found the documented way to prevent links to certain domains, so I will just forbid linking of the .sc domain.
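
The documented mechanism referred to here is presumably a linkify callback. The following is a minimal sketch, assuming the callback signature `(attrs, new=False)` and the `_text` key described in the bleach documentation; the function name `skip_sc_abbreviations` is hypothetical and not part of bleach.

```python
def skip_sc_abbreviations(attrs, new=False):
    """Hypothetical linkify callback: refuse to create bare '.sc' links."""
    # Leave links that already exist in the input untouched.
    if not new:
        return attrs

    # '_text' holds the text bleach is about to turn into a link.
    text = attrs.get("_text", "").lower()

    # Still allow explicit URLs such as https://example.sc.
    if text.endswith(".sc") and not text.startswith(("http://", "https://")):
        return None  # returning None tells linkify not to create the link

    return attrs


# Usage (requires bleach; passing callbacks replaces the defaults,
# so the default list is kept in front to preserve rel="nofollow"):
#   import bleach
#   from bleach.linkifier import DEFAULT_CALLBACKS
#   bleach.linkify(value, callbacks=DEFAULT_CALLBACKS + [skip_sc_abbreviations])
```

Note that this also suppresses genuine bare `.sc` domains written without a scheme, which is the trade-off accepted in the comment above.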

Because .sac, .bla, etc. aren't in bleach's list of top-level domains (TLDs), whereas .sc is:

https://github.com/mozilla/bleach/blob/main/bleach/linkifier.py
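
To illustrate why a TLD list produces this behavior, here is a simplified stand-in written with Python's `re` module. It is not bleach's actual regex: the names `TOY_TLDS`, `TOY_URL_RE`, and `looks_like_domain` are hypothetical, and the TLD list is cut down to four entries for the example.

```python
import re

# Hypothetical, heavily shortened TLD list; bleach ships a much longer one.
TOY_TLDS = ["com", "org", "it", "sc"]

# Simplified stand-in for a TLD-driven matcher, not bleach's actual regex:
# one or more dot-separated labels, then a known TLD, then a word boundary.
TOY_URL_RE = re.compile(
    r"\b(?:[a-z0-9-]+\.)+(?:%s)\b" % "|".join(TOY_TLDS),
    re.IGNORECASE,
)


def looks_like_domain(text):
    """Return the substrings the toy matcher would treat as domains."""
    return TOY_URL_RE.findall(text)


print(looks_like_domain("XXX has an M.Sc. in YYY"))
print(looks_like_domain("yyy M.Sac. xxx"))
print(looks_like_domain("I like hot chocolate.It makes me happy."))
```

With this toy list, "M.Sc" and "chocolate.It" are picked up because "sc" and "it" are listed TLDs, while "M.Sac" is not, since "sac" is not in the list. The matcher has no way to know that "M.Sc." is an abbreviation.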

I actually tried a .shop domain before and it did not work, so I assumed it was some regex. Anyway, thanks for the explanation.