FGRibreau/node-truncate

Limit URL size when truncating


Hi there,

I noticed that there's no limit on the size of a URL that might appear at the end of a string someone intends to truncate:

URL_REGEX = /(((ftp|https?):\/\/)[\-\w@:%_\+.~#?,&\/\/=]+)|((mailto:)?[_.\w-]{1,300}@(.{1,300}\.)[a-zA-Z]{2,3})/g;

So for example:

const truncate = require('truncate')
input = "hello http://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
truncate(input, 7)
// => 'hello http://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

As far as I can tell, if someone fed a huge URL to any of the popular libraries that depend on truncate, nothing too bad would happen, but it also looks like they don't expect this URL-preserving-at-all-costs behavior.

I understand there's not really an RFC-defined maximum URL size, but could we implement one as a precaution? It could be something huge like 32k, or perhaps 10 times the maxLength?
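For illustration, one way to do that (just a sketch, not something the library has adopted, and the 3000 below is an arbitrary cap for the example) would be to swap the unbounded + in the URL branch of URL_REGEX for a bounded quantifier, so anything past the cap stops counting as part of the URL:

// Sketch only: cap the matched URL at an assumed 3000 characters after the scheme
URL_REGEX = /(((ftp|https?):\/\/)[\-\w@:%_\+.~#?,&\/\/=]{1,3000})|((mailto:)?[_.\w-]{1,300}@(.{1,300}\.)[a-zA-Z]{2,3})/g;

With that change, only the first 3000 characters of a longer URL are protected, so the truncation point can land in the excess instead of the whole URL being preserved.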

Alternatively, the URL-preserving feature could be moved behind an option and disabled by default. That would be a breaking change for users other than the libraries I mentioned, but potentially safer.
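Purely as a hypothetical sketch of what that could look like (this option does not exist in the current API and the name is made up):

// Hypothetical, not the real API: URL preservation becomes an opt-in flag
truncate(input, 7)                    // plain length cut, URLs treated like any other text
truncate(input, 7, { keepUrl: true }) // today's URL-preserving behaviour, behind an option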

@mac-chaffee agreed, will accept a PR along with some updated tests for this and a maximum size of 80000 (Safari's highest limit) :)
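A rough idea of the kind of test that could cover this (just a sketch using Node's built-in assert; the 80000 figure is the cap mentioned above):

const assert = require('assert');
const truncate = require('truncate');

// Build a URL well past the proposed 80000-character cap
const hugeUrl = 'http://' + 'a'.repeat(100000);
const output = truncate('hello ' + hugeUrl, 7);

// Once a cap exists, the huge URL should no longer escape the length limit
assert(output.length < hugeUrl.length);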

How about 3000?

I grabbed a Common Crawl index (which contains a ~1/3000 sample of URLs, totaling 1.2 million) and found that aside from 48 outliers, every URL in the sample was under 3000 characters (not including the http[s]:// prefix):

$ curl -SsL https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2021-31/indexes/cluster.idx -O
$ cat cluster.idx | wc -l
 1238717
$ cat cluster.idx | awk '{print length($1)}' | sort --reverse --numeric | head -n 50
221327
31706
17715
16417
14269
13970
13967
11065
11003
10651
9300
8551
7054
6302
6091
5177
4820
4802
4669
4136
4070
3952
3950
3845
3812
3602
3518
3392
3375
3351
3349
3329
3319
3287
3245
3196
3195
3161
3145
3142
3078
3054
3048
3033
3028
3027
3018
3004
2995
2978

That's even higher than what this person found: https://www.supermind.org/blog/740/average-length-of-a-url-part-2

Since many of the top libraries that depend on node-truncate are truncating for display purposes, a 3000-character URL at least only takes up about half a page, whereas VS Code (where I was testing this) lags like crazy above 10k.