/parsing-int-series

Parse multiple decimal integers separated by arbitrary number of delimiters

Primary LanguageC++BSD 2-Clause "Simplified" LicenseBSD-2-Clause

Parsing series of integers with SIMD

Sample programs for article Parsing series of integers with SIMD

Parsers extract integer numbers from strings. A number can be prepended by a sign character. The numbers are separated by arbitrary sequences of separator chars. All other characters are invalid and the parsers detects them and raise exception.

This repository contains:

  • scalar reference implementation;
  • two variants of SSE parsers; there are also separate variants designed solely of parsing unsigned numbers;
  • scalar hybrid that combines ideas from SIMD parsing with scalar conversion procedures.

Requires: C++11 compiler (tested with GCC 7.3) and Python 2.7.

Usage

Type make to build all programs.

Type make run-unittests to build all unit tests and then run them. Some tests are time consuming, be patient.

Type make microbenchmarks.rst to run microbenchmarks.

Type make report-overall.rst to run performance benchmarks.

Type make spanmaskhistogram.rst to produce runtime analysis report for SSE implementation.

Programs

There are several programs available in bin subdirectory.

  • benchmark --- test performance of given procedure
  • benchmark-cpuclocks --- measure performance of all procedures; display CPU clocks
  • benchmark-all --- compare performance of different procedures
  • compare-singed and comapre-unsigned --- are used to validate if parsers produces the same results as the reference parser
  • compare-avx512 --- the same as above, but tests only AVX512BW implementation
  • statistics --- gather execution statistics from SSE parsers

Apart from these programs, there are several verify_* executables that runs various unit tests; they are invoked by make run-tests.

Common arguments

All the programs generates random data which is then parsed, following command line arguments can be used to control data characteristics:

--size=NUMBER input size (in bytes)
--loops=NUMBER how many times a test must be repeated [default: 1]
--seed=NUMBER seed for random number generator [default: 0]
--num=DISTRIBUTION
 distribution of lengths of numbers
--sep=DISTRIBUTION
 distribution of lengths of gaps between numbers [default: '1']
--separators=string
 list of separator characters [default: ",; "]
--sign=DISTRIBUTION
 distribution of sign in front of number [default: '1']

DISTRIBUTION is a list of weights separated with commas, which defines distribution of items.

In case of --num it's the count of decimal digits in a random number. For instance --num=1,1,1,1 will produce one-, two-, three- or four-digit numbers with the same probability; --num=0,0,0,1,5,1 will produce four-, five- or six-digit numbers, but five-digits numbers with probability 5/7.

In case of --sep it is the distribution of numbers of separator characters between the generated numbers. The default --sep=1 means there's always exactly one character; --sep=0,0,1,1,1 would put form 2 to 4 separator chars.

The --sign defines distribution of set: no-character-sign, '+' and '-'. Thus the default --sign=1 forces just unsigned numbers; --sign=0,0,1 will force all numbers negative.

TODO

  • Complete AVX512 implementation to handle scalar fallback.

License

BSD