ArchiveTeam/terroroftinytown

Psuedo-random one-to-one mapping function

Opened this issue · 2 comments

Sep_18_2014_17:13:53 chfoo anyone know a good algorithm to turn a number into a randomly looking shortcode? it has to be one-to-one mapping. it seems like using using bytes from md5 digest doesn't seem idea because we'll never know when it cycles.
Sep_18_2014_17:18:14 aaaaaaaaa base 36 conversion or not random looking enough?
Sep_18_2014_17:19:55 aaaaaaaaa or heck a different, less used, base would look odder.
Sep_18_2014_17:23:55 aaaaaaaaa A coworker suggested using some simple cipher.
Sep_18_2014_17:42:49 aaaaaaaaa heck, maybe just xor it with a big enough key
Sep_18_2014_17:45:05 Smiley rot13 ?
Sep_18_2014_18:21:03 chfoo a combination of those might work. the reason is that the auto queue works only by incrementing a integer and it doesn't work well for those shorteners with random, fixed-length shortcodes
Sep_18_2014_18:23:09 chfoo so i figured if there was a function that converted the auto queue integer into a random number, it would give a better distribution of random chance of hitting a url
Sep_18_2014_18:23:38 chfoo rather than starting from 000000 and slowly going to ZZZZZ for example
Sep_18_2014_18:24:20 xmc gray code

soultcer: You could also just scan them randomly until you have a statistically high enough chance of hitting each code at least once, or try some sort of sequence
soultcer: As I have said earlier, I actually like the "random" approach because then you get some codes twice. If some evil terroroftinytown-client returns wrong results, you will find out about it sooner or later ;-)
soultcer: chfoo: You don't have to give out every job twice. You just need to check one or two codes from each job and see if they are correct. So for say each 10 jobs, you hand out 1 verification job that includes 20 shortcodes
soultcer: But so far I have never detected any problems. The only times I had different results was when the owner of a url shortener decided that he needed a specific shortcode and just deleted the old URL (is.gd), or with tinyurl which displays advertisements when an URL returns a 404 and sometimes has encoding issues.

I didn't see this until now:

soultcer: What you are probably looking for is a maximum length sequences https://en.wikipedia.org/wiki/Maximum_length_sequence
soultcer: Just FYI, for the old database I used something similar but a bit more simple, basically just taking the md5 hash of each previous shortcode: https://github.com/ArchiveTeam/tinyback/blob/master/tinyback/generators.py#L43-77
soultcer: This obviously means that the same code will sometimes be hit twice, but this was intended. If a bad person somehow tried to return wrong results, sooner or later another user would hit the same shortcode at random and we would discover the wrong data
aaaaaaaaa: The problem with hashes is that it isn't guaranteed to hit all of them and may get stuck in a loop
aaaaaaaaa: But I'm not familiar with the MLS you suggested, hopefully the devs are.


But to update:

Right now, we're just doing random codes for bitly and isgd. It's the easiest to implement and some URLs are stuck behind a dead server so we'll never get 100% of any major shortener regardless.