spamscanner/url-regex-safe

Extra characters at the end

Closed this issue · 3 comments

Hey

I used this code to extract URLs from an HTML file.

const urlRegex = require("url-regex-safe");
function urlsFromText(text) {
    if (!text) {
        return [];
    }

    const matchingUrls = text.match(urlRegex({
        localhost: true,
    }));
    return matchingUrls || [];
}

When I run it, I get the following matches (the HTML contains the same links in several places). As you can see, some matches are correct, while others keep extra characters at the end. NOTE: I changed the hostname, but that is not the part that is failing.

[
"http://thisdomain.com/c/eJwtjsEKwyAQRL8muUU2qzF68NBLf6NsNGkEbYJryO_XQmHgwTDMTHBLmLzEPjoERBhBgVYGjcAwWw12lAq8DOg7BZliEr7kclAQ_sj97qzXejVWBW2lwcnONM4G1DLJAITr1ie313pyJx8dPpvu-xb3mhJvVN7Hr6WZZ1mZG9u--gMQphdnSmlYLo6fFhiOs8YcOffF8U47lfZpScS1UIgXC4pf7dg_Yw",
"http://thisdomain.com/c/eJwtjsEKwyAQRL8muUU2qzF68NBLf6NsNGkEbYJryO_XQmHgwTDMTHBLmLzEPjoERBhBgVYGjcAwWw12lAq8DOg7BZliEr7kclAQ_sj97qzXejVWBW2lwcnONM4G1DLJAITr1ie313pyJx8dPpvu-xb3mhJvVN7Hr6WZZ1mZG9u--gMQphdnSmlYLo6fFhiOs8YcOffF8U47lfZpScS1UIgXC4pf7dg_Yw>.",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19DmzTkQO2hOfz77XO47T2bIMc3fVT4uKWvVpzZJxLCO-2BnMr4_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKSfXMcQhi6XbDKFSpeCfAX2BSplJFosHqoO-2B47y56WQ-2BMAjh5TPyYzCTBsVurHpCTeYNo17KesLVQSfiE4yBkMNN-2BlStPCGUbntKRMrf-2BnL0cbPriBj1FSi86bbTY6q6vT2wXwB-2BognImKofq803zMLG2JNz6lR1-2Bo7ms72uVRfaNP2xuG3hM2hDzfXDhcTuJXCMrdnKreeZEhSHuS77-2FYXZ1IP35IVzKn6H8MD05V758Ig6FB5GALPf6RS7g7aV-2Fw7U-2FxFGrxjg6QgEWdYh1Jg-3D",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19DmzTkQO2hOfz77XO47T2bIMc3fVT4uKWvVpzZJxLCO-2BnMr4_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKSfXMcQhi6XbDKFSpeCfAX2BSplJFosHqoO-2B47y56WQ-2BMAjh5TPyYzCTBsVurHpCTeYNo17KesLVQSfiE4yBkMNN-2BlStPCGUbntKRMrf-2BnL0cbPriBj1FSi86bbTY6q6vT2wXwB-2BognImKofq803zMLG2JNz6lR1-2Bo7ms72uVRfaNP2xuG3hM2hDzfXDhcTuJXCMrdnKreeZEhSHuS77-2FYXZ1IP35IVzKn6H8MD05V758Ig6FB5GALPf6RS7g7aV-2Fw7U-2FxFGrxjg6QgEWdYh1Jg-3D>;",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19LQ-2FdstCwdNG97aq-2BoKXcUNnhvG3KpLkcq0oeyJNtaudeZ0V_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKY4AYzg-2BbyFJle44p2Nwr3WIWW3AiLXnesEuTNuz17FZAbx6h2oWpO8I-2FbW4LJl88L6h6QCn5mnYgDikeWl-2FKWL-2BrgosEqEoH-2FskquLIQktySB1kz6M-2FT-2BhXu8C2DdXlfI3ahSRNjQIvkwp-2FzFbTdlxJ32vRnbdSrmTJS97orQlk0q2wr9jr9QMYq4hKUIrjNuyEO7AFhK7N8pzPq-2FNbR4BJEauwBP33v7NWzR-2BQ4VFbdI-2B7E4t04555TlbB0ndkLaGJ2hyI1o1YECwmiqWkceI-3D>[image:",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19PFHnXZ1cLWsRvhx9RaY-2BQAS5Vos-2BHGFwfuQwfhpbU-2FZ7JjEkayk2WmqvPwVmk2DWQ-3D-3DI3QJ_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKcCDbuMBMgW5oVfHk-2BCaODfQayFCp9YHYhzQAPKJJbSmYqbtbTZ98nj0XgwxLBsj8NSfuVuXc1KqTFvvMKzlByWqJSDk7JWWOhJEoG3D9NphMRpU69JGqsu-2BDnC8c4XQxC-2BeSx-2FvQ1J0C0dEMt1kAQilciDJK926NIxxyok4LZSp-2FVoIe4H3LLTYGv1H8MSN1R4REVk8n6uCvjmox0-2Blq-2FOUFtwLCOQF-2BkqqM9gbAPhWSBnVBZcwHalHIktdK2pN-2BmXznQQ4R0yYRELFY2-2BcMrI-3D>[image:",
]

Is this a bug, or is it something we can work around without changing the library?

Thanks!

More info: this comes from the text/plain version of an email that was originally HTML. Some clients convert the message to a text version so that it renders correctly in any other client.

It seems to be caused by the ">" character.

As a workaround, we added a preprocessing step, text.replace(/>/g, " > "), to put spaces around each ">" in the text, but it is suboptimal.
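For anyone stuck on an older version, another option is to clean up the matches afterwards instead of rewriting the input text. The sketch below is only an illustration: trimAtAngleBracket is a hypothetical helper, not part of url-regex-safe, and it assumes the unwanted tail always starts at a "<" or ">", which cannot appear unencoded in a valid URL.

```javascript
// Hypothetical helper (not part of url-regex-safe): cut each match at the
// first "<" or ">", since neither character may appear unencoded in a URL.
function trimAtAngleBracket(urls) {
    return urls.map((url) => {
        const cut = url.search(/[<>]/); // index of the first angle bracket, or -1
        return cut === -1 ? url : url.slice(0, cut);
    });
}

// Shortened stand-ins for the matches shown above.
const cleaned = trimAtAngleBracket([
    "http://thisdomain.com/c/abc",
    "http://thisdomain.com/c/abc>.",
    "https://thisdomain.com/ls/click?upn=xyz>[image:",
]);
console.log(cleaned);
```

This could be applied to the result of urlsFromText(text) without touching the original text, so it does not shift any other content the way inserting spaces does.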

If you can submit a PR to fix this and/or add failing tests, that would be great!

v4.0.0 released with this fixed

release notes @ https://github.com/spamscanner/url-regex-safe/releases/tag/v4.0.0

Note: this version now requires Node.js v14+.