/lexnum

efficient lexicographic encoding of numbers

Primary LanguageGoMIT LicenseMIT

lexnum

lexnum provides a lexicographic encoding of numbers. It is based off of the ideas presented in Peter Seymour's Efficient Lexicographic Encoding of Numbers. The paper's original source can be found at http://www.zanopha.com/docs/elen.pdf, but it is re-hosted here for posterity.

Numbers are ordered. That's part of their whole thing, what with them being numbers and all. 1 comes before 2, and 2 comes before 3, and so on, and so forth. If you sort a list of integers like [1, 9, 4, 10, 13, 31] you'll get them back in their normal ordering: [1, 4, 9, 10, 13, 31]. If each of those were to be strings, such that the input is ['1', '9', '4', '10', '13', '31'], sorting them lexically would not produce a proper numerical sorting, you would wind up with ['1', '10', '13', '31', '4', '9']. That is quite annoying when doing things like appending numbers to identical file names or basically anything where you wish to put a number in the middle of what is otherwise a string.

You could always just pad your numbers with zeroes, so the prior example would have an input more like ['01', '09', '04', '10', '13', '31']. This works fine, but it requires that you know something about your input in advance. Namely, it requires that you know the range of the input, which may not always be the case. More subtly, it also requires that you know that you're only dealing with positive integers; even with zero padding, negative numbers would lexically sort backwards (e.g., '+1' lexically precedes '+2', which makes sense because 1 is less than 2, but it's also the case that '-1' precedes '-2', which makes no sense, since -2 is less than -1). Doubly subtle is the fact that the only reason a negative number string precedes a positive number string is that the negative character precedes the zero character in the ascii table. Triply subtle is that if you just always put a sign in front of your numbers, the positive numbers will precede the negative numbers because the + character precedes the - character in ascii. The end result is that a list like ['+1', '+4', '-9', '+10', '-13', '+31'] would lexically sort to ['+1', '+10', '+31', '+4', '-13', '-9'].

For a full description of how the problem is solved, read the white paper.

example

The following program would count from -20 to 20 and print their lexnum strings:

package main

import (
    "fmt"
    "github.com/jordanorelli/lexnum"
)

func main() {
    e := lexnum.NewEncoder('=', '-')
    for i := -20; i <= 20; i++ {
        fmt.Printf("%-12s%d\n", e.EncodeInt(i), i)
    }
}

Running it would produce the following output:

--779       -20
--780       -19
--781       -18
--782       -17
--783       -16
--784       -15
--785       -14
--786       -13
--787       -12
--788       -11
--789       -10
-0          -9
-1          -8
-2          -7
-3          -6
-4          -5
-5          -4
-6          -3
-7          -2
-8          -1
0           0
=1          1
=2          2
=3          3
=4          4
=5          5
=6          6
=7          7
=8          8
=9          9
==210       10
==211       11
==212       12
==213       13
==214       14
==215       15
==216       16
==217       17
==218       18
==219       19
==220       20