jfilter/clean-text

URLs are not matched

lemon234071 opened this issue · 1 comment

import cleantext

text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
cleantext.replace_urls(text, "XXX")

output:

郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈哈http://t.c

Expected:

郭麒麟打卡,且听他分享防疫小知识XXX哈哈XXX哈哈哈哈XXX

Hey @lemon234071, thanks for reporting. I'm not sure how to handle this. Right now, a URL has to be separated from the surrounding tokens in some way (e.g. by a preceding space). In your string, the URLs could be detected by looking for the runs of ASCII characters. Maybe that could be the basis for special handling of Chinese text? I would not change the current URL regex, which is meant for English and similar languages. https://github.com/jfilter/clean-text/blob/master/cleantext/constants.py#L62
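
A rough sketch of the ASCII-based idea, in case it helps the discussion. This is not part of clean-text: `ASCII_URL` and `replace_ascii_urls` are hypothetical names, and the pattern is an assumption that only suits text where URLs are surrounded by non-ASCII characters. The colon after the scheme is optional so that the malformed `http//` in the report is caught too.

```python
import re

# Hypothetical pattern (not clean-text's regex): an http/https scheme, with the
# colon optional, followed by a run of ASCII characters that are legal in URLs.
# The match ends at the first character outside that set, e.g. a Chinese one.
ASCII_URL = re.compile(r"https?:?//[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def replace_ascii_urls(text, replace_with="<URL>"):
    """Replace every ASCII-only URL-like run in `text` with `replace_with`."""
    return ASCII_URL.sub(replace_with, text)


text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
print(replace_ascii_urls(text, "XXX"))
```

Because the match stops only at a character outside the URL set, ASCII words or punctuation written directly after a URL would be swallowed as well, which is one reason to keep this separate from the regex used for space-delimited languages.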