Parsing series of integers with SIMD
While conversion from a string into an integer value is feasible with SIMD
instructions, this application is unpractical. For typical cases, when a single
value is parsed, scalar procedures — like the standard atoi or
strtol — are faster than any fancy SSE code.
However, SIMD procedures can be really fast and convert in parallel several
numbers. There is only one “but”: the input data has to be regular and valid,
i.e. the input string must contain only ASCII digits. Recently, I updated
article about SSE parsing with the benchmark results. The
speed-ups are really impressive, for example the SSSE3 parser is 7 to 9 times
faster than a naive, scalar code.
The obvious question is how these powerful SIMD procedures can be used to
convert real data? By real I mean possibly broken inputs that contain series
of numbers of different length separated with characters from a predefined set.
In this text I try to answer that question; major contributions of this
article are:

  • Methods to efficiently parse and validate such strings using SSE
    instructions. There is a special variant that handles only unsigned
    numbers and also a fully featured variant for signed numbers.
  • Boosting the SSE procedure with AVX2 or AVX512 instructions when
    possible.
  • A way to combine some of SIMD techniques with scalar conversions.

The article starts with unsigned conversion, because it is easier than a signed
one. The signed conversion shares the core idea, it just adds some extra steps.
The text is accompanied with BSD-licensed software, that includes fully
functional implementations alongside the programs which validate and benchmark
the procedures.
The input is defined as follows:

  • It is a string of arbitrary length; but it should be large, say a few
    kilobytes or megabytes.
  • The input contains integer numbers. A number is an optional sign character
    ‘+’ or ‘-‘ followed by a non-empty sequence of digits ‘0’..’9′.
  • Numbers are separated with characters from a user-defiend set;
    for instance, it might be the comma and blank characters (space, new
    line, etc.).
  • All characters other than digits, separators and signs are considered
    invalid.
  • Parser must be able to check validity of the input, i.e. lack of invalid
    characters and the proper input format (numbers are separated with at
    least one character, sign characters are present only at the beginning
    of a number).

For the input string “123; -52, +432424 -999; 1234568, +879”, and three
separators ‘,’, ‘;’ and ‘ ‘, an algorithm is expected to extract six integers:
[123, -52, 432424, -999, 1234568, 879]. The order of converted numbers must be
exactly the same as in the input string.
A simplified parser is also tested. This parser simply considers all
non-digit and non-sign characters as delimiters.
The conversion algorithm works in a loop, in each iteration exactly one input
vector is loaded from memory and processed.

  • The vector is validated.
  • Depending on the layout of the digits, the input is normalized to the
    format required by the SIMD procedures. At this step we choose: how many
    numbers from the input are converted and which SIMD procedure is suitable
    for conversion.
  • After conversion, the input pointer is advanced and the loop continues.

Since SIMD procedures impose specific layout of digits within a vector, an
arbitrary input has to be properly aligned. In order to do this, we need to
identify spans of digits in the input and move each span onto certain
subarrays of a vector. Let’s assume at this point that the input contains either
digits or separators (denoted with _).
For instance, the vector with one-digit numbers 1___2_3__4__5_6_ must be
transformed into 123456__________. Then an SSE procedure will convert in
parallel all the numbers into the array [1, 2, 3, 4, 5, 6]. Likewise, the vector
with two-digit numbers _12__34___56_78_ must be transformed into
12345678________ and then the result will be [12, 34, 56, 78].
Let’s consider a more complicated case, when a string has numbers with
different count of digits. There are a few ways to convert the input
_1_2_34_567_89__:

  • If we choose conversion of one-digit numbers, then just the two first spans
    can be converted — because we need to keep the order of numbers from the input.
    After normalization the input into 12______________ just two values [1,
    2] will be produced. The input’s tail _34_567_89__ remain untouched.
  • If we choose conversion of two-digit numbers, then the three first spans
    can be converted. Shorter numbers are completed with zeros, then normalized
    input is 010234__________. The result is [1, 2, 34]; this time
    a bit shorter input’s tail _567_89__ remain untouched.
  • Finally, if we choose conversion of four-digit numbers, then the four first
    spans can be converted. Again, shorter numbers are completed with zeros,
    and normalized input is 0001000200340567. The result is [1, 2, 34, 567],
    but still the chunk’s tail, i.e. _89__, is unprocessed.

We can see that in order to convert given span combination we need to know:

  1. How to shuffle bytes in the input vector?
  2. Which SIMD procedure can be used then?
  3. How many numbers converted by the SIMD procedure must be stored?

Obtaining this information seems to be quite complicated, especially when
we look at the last example. Fortunately, all parameters can
be precalculated. A span combination can be saved as a bit-pattern,
where ones represent digits. For example, from vector _1_2_34_567_89__
we get the span pattern0b0101011011101100 = 0x56ec.
A span pattern is used to fetch a record from the precalculated array.
The record contains following fields:

  • shuffle_digits — an array of 16 bytes, which is the argument for
    the instruction _mm_shuffle_epi8 (pshufb); the instruction
    moves bytes at certain positions;
  • conversion_routine — an enumeration that selects an SSE conversion
    procedure; for instance, it tells that shuffled input is an array
    of two-digit numbers;
  • element_count — the number of elements from the SSE conversion
    procedure that must be stored in the output collection.

The solution with a precalculated array is suitable only for SSE, as span
patterns have 16 bits. In cases of AVX2 and AVX512, where vectors are wider,
such a table would be simply too large, respectively 232 or
264 entries. Additionally, the AVX2 version of pshufb instruction
works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to
shuffle all inputs.
But AVX2 and AVX512 instructions still might be used in some parts of
algorithms, especially in input validation.
To build the whole precalculated table all 65536 span patterns have to be
processed. Processing a pattern consists following steps:

  1. Determine digit spans. For pattern _ddd__d__ddd_dd_ we have four spans
    [(1,3), (6,6), (9,11), (14,15)]. But for pattern ____dddd___ddddd there
    is only one span [(4,7)]. The span at the vector’s end might be continued in
    the next chunk, and we don’t know how to convert it. However, we may assume
    that a span which starts at the vector’s beginning is a new number and thus
    it is included into the span’s list.
  2. Find the best conversion scheme for the spans. We systematically check
    which SSE conversion is possible, i.e. one-/two-/four-/eight-digit and how
    many spans each conversion can process — the procedure which processes the
    most items is selected. The output from this step are values for the following
    record fields:

    • element_size — 1, 2, 4, 8 or 0 if SSE conversion is not possible;
    • element_count — the numbers of converted spans;
    • total_skip — how many input bytes are processed; for pattern
      ____dddd________ we know that the whole input is processed
      (total_skip=16); for pattern ____d____ddddd_d we process only the
      first span at index 5, but we know that four non-digit characters after
      the span can be also skipped, thus total_skip is 9.
  3. Construct the shuffle pattern. Having the element size and spans positions
    the vector shuffle_digits is built. The characters from a span are mapped
    on the appropriate subarrays in the vector. Also zero completion is done in
    this step, as the instruction pshufb either copy given byte from the input
    vector or put zero if the index has got the most significant bit set.

Let’s consider span pattern _ddd__d__ddd_dd_. It can be converted using
four-digit conversion, so the layout of normalized vector should be:
[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ]
There are four numbers a, b, c and d in the vector; note
that the bytes at 0th index hold the most significant digits.
The first number in pattern has three digits at indices 1, 2 and 3. It means
that a1 = 1, a2 = 2 and a3 = 3; value of a0 should be zeroed,
so a0 = 0x80.
The second number has just one digit at index 6. Thus b3 = 6 and, since
the rest digits have to be zero, b0 = b1 = b2 = 0x80.
The third number has three digits at indices 9, 10, 11. Thus c1 = 9,
c2 = 10, c3 = 11, and c0 = 0x80.
The last, forth number, has got two digits at indices 14 and 15. Thus
d2 = 14, d3 = 15 and the rest have to be zero, so d0 = d1 = 0x80.
The final content of the shuffle_digits vector is:
[ 80 | 1 | 2 | 3 | 80 | 80 | 80 | 6 | 80 | 9 | 10 | 11 | 80 | 80 | 14 | 15 ]
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.

  1. Load 16 characters into the input vector.

const__m128iinput=_mm_loadu_si128((const__m128i*)data);

  1. Check for invalid characters (will be discussed later).
  1. Build the span pattern:

// t0 = input[i] in [‘0’..’9] ? 0xff : 0x00
const__m128it0=decimal_digits_mask(input);constuint16_tspan_mask=_mm_movemask_epi8(t0);

  1. Fetch the block info with the precalculate parameters.

constBlockInfo&bi=blocks[span_mask];

  1. From the block info get the shuffle pattern for _mm_shuffle_epi8.

const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)b.shuffle_digits);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);

  1. The shuffled vector is then issued to the proper conversion routine. The block
    info structure keeps the conversion kind (conversion_routine) and also the number
    of items to get from the SSE result (element_count). It is not always
    possible to use any SSE conversion procedure, thus a scalar code is needed
    to handle odd cases.

if(b.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE2Digits){convert_2digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE4Digits){convert_4digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE8Digits){convert_8digits(shuffled,b.element_count,output);}else{// case unsupported, a scalar code is called here
}

  1. Advance the input pointer. The block info structure keeps how many bytes
    were actually covered.

data+=b.total_skip;
In case of unsigned inputs we need to classify input characters into three
categories:

  • digits, i.e. chars ‘0’ .. ‘9’;
  • user-defined set of separators;
  • invalid numbers.

In practise we need two masks: one with positions of digits, another with
positions of separators. If the or’ed mask has some zero elements, it means
that there are invalid characters. Below is a schema:
const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_sep=matcher.get_mask(input);const__m128ibytemask_valid=_mm_or_si128(bytemask_digit,bytemask_sep);constuint16_tvalid_mask=_mm_movemask_epi8(bytemask_valid);if(valid_mask!=0xffff){throwstd::logic_error(“wrong input”);}
The most generic SSE implementation works in time proportional to the size
of a separators set. Each character from the set is broadcasted into
a vector and then compared with the input vector. The resulting byte mask is
ored with the previous result.
__m128iget_mask(const__m128i&input,const__m128i&initial)const{__m128iresult=initial;for(size_ti=0;i<n+1;i++){const__m128imask=_mm_cmpeq_epi8(letters[i],input);result=_mm_or_si128(result,mask);}returnresult;}
If the separators set is known at compilation time, then a compiler may
unroll the loop.
The SNTI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used
in same cases. The instruction gets two arguments. One is an input vector.
Another one is interpreted either as a set of individual characters (up to 16) or
characters ranges (up to 8). The result is a mask for the input characters matching
the set (or ranges). In the example below the set variant was used.
__m128iget_mask(const__m128i&input,const__m128i&initial){constuint8_tmode=_SIDD_UBYTE_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_UNIT_MASK;return_mm_or_si128(initial,_mm_cmpestrm(set,set_size,input,16,mode));}

  1. It is unlikely, and sometimes impossible, to process all digits from the
    input vector. In a single iteration one to 16 bytes are converted. The table
    below shows details — an important conclusion is that almost 95% of
    patterns process at least half of an input vector.
bytes processed patterns % cumulative %
0 17 0.03% 0.03%
2 32 0.05% 0.07%
3 48 0.07% 0.15%
4 176 0.27% 0.42%
5 360 0.55% 0.97%
6 542 0.83% 1.79%
7 693 1.06% 2.85%
8 1084 1.65% 4.50%
9 1620 2.47% 6.98%
10 2274 3.47% 10.45%
11 3057 4.66% 15.11%
12 4526 6.91% 22.02%
13 6193 9.45% 31.47%
14 8952 13.66% 45.13%
15 12890 19.67% 64.79%
16 23072 35.21% 100.00%
  1. However, the input validation is always done for the whole vector,
    all 16 bytes. This problem might be overcome, see Processing larger
    inputs.
  2. SSE units are underutilized. The longest one-digit number sequence
    is “1;2;3;4;5;6;7;8;”. It contains eight numbers, while the SSE procedure
    is able to process 16 numbers. Likewise, the longest two-digit number
    sequence is “12;34;56;78;98;;”. It has five numbers, but the SSE procedure
    is able to convert eight numbers. Similarly, the longest four-digit numbers
    sequence is “1234;5678;9123;;” — it has three four-digit numbers, the SSE
    procedure is able to convert four numbers.
SSE procedure usage in parser converted numbers average
utilization
input size converted nums # % avg max
N/A 264 0.40%
1-digit 16 5799 8.85% 3.52 8 22%
2-digit 8 21241 32.41% 3.99 7 49%
4-digit 4 30924 47.18% 3.48 4 87%
8-digit 2 7308 11.15% 1.97 2 98%

Similarly to the unsigned conversion, the signed conversion also requires a
span pattern. But in this case also the sign characters ‘+’ and ‘-‘ are
included in the pattern. For this sample string:
input = “_+123_-45_78_-9_”
the span pattern is:
span_mask = _dddd_ddd_dd_dd_
The conversion unfortunately requires two shuffles:

  1. Shuffle spans onto proper vector elements — this is exactly the same
    as in unsigned conversion.
  2. Populate the first span characters, i.e. possibly sign character,
    on the proper vector elements.

For input:
input =
[ ‘_’ | ‘-‘ | ‘1’ | ‘2’ | ‘3’ | ‘_’ | ‘+’ | ‘4’ | ‘5’ | ‘_’ | ‘-‘ | ‘6’ | ‘_’ | ‘7’ | ‘8’ | ‘_’ ]
The conversion processes four four-digit numbers. After shuffling the
digit spans we have:
shuffled =
[ ‘-‘ ‘1’ ‘2’ ‘3’ | 0x00 ‘+’ ‘4’ ‘5’ | 0x00 0x00 ‘-‘ ‘6’ | 0x00 0x00 ‘7’ ‘8’ ]
and the vector of sign characters is:
shuffled_signs =
[ ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘+’ ‘+’ ‘+’ ‘+’ | ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘7’ ‘7’ ‘7’ ‘7’ ]
The shuffled_signs vector does not necessarily contain only ‘+’ or ‘-‘,
it might contain also digits, but this is not an error.
Then the conversion works as follows:

  • From the shuffled vector the sign characters are removed and
    ASCII characters are converted into values:
    [ 0 1 2 3 | 0 0 4 5 | 0 0 0 6 | 0 0 7 8 ]
  • Thus we end up with a vector of unsigned numbers, and such an input
    can be converted by the already defined SSE procedures:
    [ 123 | 45 | 6 | 78 ]
  • Finally, negate appropriate elements; we use a well known equation (~x + 1),
    which is expressed in the SSE code as (x xor (-1) – (-1)). The value -1
    is obtained from comparing shuffled_signs with the ‘-‘:
    negated_mask = (shuffled_signs == packed_byte(‘-‘))
    [ ff ff ff ff | 00 00 00 00 | ff ff ff ff | 00 00 00 00 ]
    or as vector of int32_t:
    [ -1 | 0 | -1 | 0 ]

Masking of the sign characters in shuffled vector is done at no cost, as
the instruction _mm_subs_epu8 (psubusb) is used to convert from ASCII to
numeric values:
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);
The instruction performs subtract with saturation, i.e. calculates max(0,
a-b). Since the ASCII codes of ‘-‘ and ‘+’ are respectively 43 and 45 and the
ASCII code of ‘0’ is 48, then the result of such expression is zero for both
sign characters.
The characters ‘+’ and ‘-‘ can be present only at the beginning of a digit
span, i.e. we want to reject inputs like “++12” or “1234-,”.
To properly validate the input we need four masks for:

  1. separators,
  2. digits,
  3. the character ‘-‘,
  4. the character ‘+’.

The first phase is similar to the unsigned conversion, i.e. all masks are or’ed
together, then zero elements point at invalid characters.
The second phase checks if sign characters are placed properly. For each span
combination we need to precalculate the mask of valid sign positions. Then,
after or’ing the masks for the ‘+’ and ‘-‘ we simply verify if they are placed
on valid positions.
In practise the precalculated data has a bit-mask for invalid sign positions,
thanks to that validation require just a simple and. The snippet below shows
the schema of the second step.
const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);// …
constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=blocks[span_mask];if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}
AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b)
which does a lookup in a 128-byte table using as the indices the seven lowest bits of
each byte of a ZMM register. Invocation of the intrinsics function is weird:
const__m512ilookup_lo=…// elements 0..63
const__m512ilookup_hi=…// and 64..127 of the lookup
const__m512itransformed=_mm512_permutex2var_epi8(lookup_lo,input,lookup_hi);
If the set of valid character fits in the standard ASCII character set, i.e.
just seven lowest bits are set, then the validation step might quickly reject
extended ASCII (7th bit set) and then translate the input using single
invocation of _mm512_permutex2var_epi8. When full 8-bit input
is allowed, then the instruction has to be called twice with two halves
of a 256-byte lookup.
An invocation (or two) of this instruction allows to classify at once
all 64 input bytes into required categories:

  • separators — it doesn’t matter how many separator chars are used;
  • digits;
  • and sign characters.

Category is encoded as a bit at certain position, in sample implementation
following schema is used:

  • separator = 0x01,
  • digit = 0x80,
  • sign = 0x80 | 0x40.

If the transformed vector contains zero byte, it means there are invalid characters.
The span pattern is obtained from the most significant bits by the instruction
_mm512_movepi8_mask (vpmovb2m); it works exactly the same way as
pmovmaskb from SSE. The sign pattern is obtained by the instruction
_mm512_test_epi8_mask (vptestmb), by testing bit 7th (mask 0x40).
Below is a snippet from the implementation:
const__m512iclasses=_mm512_permutex2var_epi8(class_lo,input,class_hi);if(_mm512_test_epi8_mask(classes,classes)!=uint64_t(-1)){throwstd::logic_error(“invalid character”);}uint64_tspan_mask64=_mm512_movepi8_mask(classes);uint64_tsign_mask64=_mm512_test_epi8_mask(classes,_mm512_set1_epi8(int8_t(0x40)));
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.

  1. Load 16 characters into the input vector.

const__m128iinput=_mm_loadu_si128((const__m128i*)data);

  1. Check for invalid characters (as described above).
  2. Build the span pattern and fetch the conversion parameters.

const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_span=_mm_or_si128(bytemask_digit,bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=block_info[span_mask];

  1. Validate if the sign characters are properly placed.

constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}

  1. If sign_mask is zero, choose faster unsigned conversion path.
  2. Otherwise prepare data for signed conversion.

const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)bi.shuffle_digits);const__m128ishuffle_signs=_mm_loadu_si128((const__m128i*)bi.shuffle_signs);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);const__m128ishuffled_signs=_mm_shuffle_epi8(input,shuffle_signs);const__m128inegate_mask=_mm_cmpeq_epi8(shuffled_signs,ascii_minus);

  1. And finally invoke proper conversion routine.

if(bi.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE2Digits){convert_2digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE4Digits){convert_4digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE8Digits){convert_8digits_signed(shuffled,negate_mask,bi.element_count,output);}else{// scalar conversion
}
As it was stated in Caveats section, a single step of algorithm validates the
whole input vector but rarely converts all bytes from the input. Although
detecting digits and sign characters might be considered cheap, we saw that
validating against an arbitrary set of separators might be really time
consuming.
The idea is to separate the validation from the conversion. In case of the
unsigned conversion only the span_mask is needed; in case of the
signed conversion we need also sign_mask.
In each iteration we process four 16-byte chunks. The result from this step are
64-bit masks. Then, we load 16 bytes into the input vector and use the lower
16-bit part of already calculated mask to get block info for the loaded code.
For unsigned conversion the main loop looks like this:
while(data<end){const__m128iinput=_mm_loadu_si128((const__m128i*)data);// validate input and calculate span_mask
constuint16_tspan_mask=get_span_mask(input);// obtain block info
constBlockInfo&bi=block_info[span_mask];// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;}
And after applying the proposed change:
while(data<end){const__m128iinput0=_mm_loadu_si128((const__m128i*)(data+0*16));const__m128iinput1=_mm_loadu_si128((const__m128i*)(data+1*16));const__m128iinput2=_mm_loadu_si128((const__m128i*)(data+2*16));const__m128iinput3=_mm_loadu_si128((const__m128i*)(data+3*16));// validate input and calculate 4 x span_mask
constuint16_tspan_mask0=get_span_mask(input0);constuint16_tspan_mask1=get_span_mask(input1);constuint16_tspan_mask2=get_span_mask(input2);constuint16_tspan_mask3=get_span_mask(input3);constuint64_tspan_mask64=(uint64_t(span_mask0)<<(0*16))|(uint64_t(span_mask1)<<(1*16))|(uint64_t(span_mask2)<<(2*16))|(uint64_t(span_mask3)<<(3*16));char*ptr=data+3*16;while(data<ptr){constuint16_tspan_mask=span_mask64&0xffff;// obtain block info
constBlockInfo&bi=block_info[span_mask];// read input
const__m128iinput=load16-bytefromdata// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;// shift the mas accordingly
span_mask64>>=bi.total_processed;}}
The problem of this solution is multiple reading the input bytes. Moreover, the
looping schema is more complicated; especially in practice, where also scalar
fallback have to be considered (it is not shown here).
The idea of separation the validation step from the actual conversion might
also be applied for scalar parser. The validation, i.e. building a span_mask
is done by SIMD code. Then, each byte of the span_mask is processed
by a precompiled code for each of 256 cases.
There are two major advantages:
1. Scalar conversions are called with compile-time constants, thanks to
that a compiler can perform as many optimizations as possible.
2. Digits spans that cross byte boundary can be processed seamlessly. When a
span ends at the byte’s end, a conversion routine converts all characters from
the input’s chunk and marks that the conversion is not completed.
When the next pattern is processed, the mark is checked: if the first
bit of the current pattern is zero, it means that the previously converted
number is ready and can be saved. If the first bit is one, it means that the
first span is continuation and thus conversion must be carried on from
the saved point.
Let’s convert following input 24-byte input (the space separates 8-byte chunks):
_12_3_45 6_7890_1 ___99___
The first chunk of input is _12_3_45, i.e. the span_pattern = 0b01101011
= 0x6b. The digit spans are [(1,2), (4,4), (6,7)], following code handles them:
*output++=convert<2>(data+1);*output++=convert<1>(data+4);val=convert<2>(data+6);has_val=true;
The template convert is parametrized by the number of digits to convert and
gets a pointer to the first character. The important fact is such code doesn’t need
to validate pointers, sizes etc. All is known in the compile-time.
The next input’s chunk is 6_7890_1, the digit spans are [(0,0), (2,5), (7,7)].
A variant of template convert that accepts two parameters simple starts
conversion from the saved state (val).
if(has_val){*output++=continue<1>(data+0,val);}else{*output++=convert<1>(data+0);}*output++=convert<4>(data+2);val=convert<1>(data+7);has_val=true;
The last, third chunk, is ___99___, and there is just one digit span [(3,4)]:
if(has_val){*output++=val;has_val=false;}*output++=convert<2>(data+3);has_val=false;
When the input vector contains an array of three-digit numbers, like
“123;456;789;123;”, we still use four-digit conversion routine. But we might
utilize the fact that the most significant digit is always zero.
We use _mm_maddubs_epi16, but with weights 1, 10, 100, 0: two lower digits
are converted and the most significant digit is properly scaled. Then simple
horizontal add _mm_hadd_epi16 adds the intermediate results.
// t0 = [ 0 | 5 | 7 | 9 | 0 | 1 | 2 | 3 | 0 | 9 | 2 | 5 | 0 | 3 | 2 | 7 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t1 = [ 500 | 79 | 100 | 23 | 900 | 25 | 300 | 27 ]
const__m128imul_all=_mm_setr_epi8(0,100,10,1,0,100,10,1,0,100,10,1,0,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 579 | 123 | 925 | 327 | 579 | 123 | 925 | 327 ]
const__m128it2=_mm_hadd_epi16(t1,t1);
Where two four-digits are going to be converted, a faster approach can be used.
Instead of storing each digit on separate bytes, digits are stored on 16-bit
values. Then a single _mm_madd_epi16 is used to multiply all digits by
weights 1, 10, 100, 1000. Then the result’s halves are added with _mm_hadd_epi32
(phaddd).
// input = “12345678”
// t0 = [ 0 | 1 | 0 | 2 | 0 | 3 | 0 | 4 | 0 | 5 | 0 | 6 | 0 | 7 | 0 | 8 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ]
// * [ 1000 | 100 | 10 | 1 | 1000 | 100 | 10 | 1 ]
// t1 = [ 1200 | 34 | 5600 | 78 ]
const__m128imul_all=_mm_setr_epi16(1000,100,10,1,1000,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 1234 | 5678 | 1234 | 5678 ]
const__m128it2=_mm_hadd_epi32(t1,t1);
There are two reference scalar implementations:

  • a fully specialized, handcrafted parser;
  • parser based on two standard functions: strspn and strtol.

The handcrafted source:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){enumState{Separator,Plus,Minus,Digit};Statestate=Separator;Stateprev=Separator;boolnegative=false;int32_tnumber=0;for(size_ti=0;i<size;i++){constcharc=data[i];if(c==’+’){state=Plus;}elseif(c==’-‘){state=Minus;}elseif(c>=’0’&&c<=’9’){state=Digit;}elseif(contains(separators,c)){state=Separator;}else{throwstd::runtime_error(“Wrong character (scalar)”);}switch(state){casePlus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘+’ follows a non-separator character)”);}number=0;negative=false;break;caseMinus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ follows a non-separator character)”);}number=0;negative=true;break;caseDigit:if(prev==Separator){number=c-‘0’;negative=false;}else{number=10*number+c-‘0’;}break;caseSeparator:if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}break;}// switch
prev=state;}// for
if(state==Separator){if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}}}
And implementation based on standard functions:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){char*ptr=const_cast<char*>(data);char*end=ptr+size;char*endptr;while(true){ptr+=strspn(ptr,separators);if(ptr==end){break;}errno=0;constlongval=std::strtol(ptr,&endptr,10);// the following check comes from “man 3 strtol”
if((errno==ERANGE&&(val==LONG_MAX||val==LONG_MIN))||(errno!=0&&val==0)){throwstd::logic_error(“invalid input”);}if(endptr==ptr){throwstd::logic_error(“no digits”);}ptr=endptr;*output++=val;}}
This part presents evaluation of algorithms in terms of running time
(performance tests) and also some statistical measurements that might
help in understanding run-time properties.
Following parameters of the input data are being changed:

  • Input size in bytes.
  • Distribution of number of digits:
    • fixed size from 1 to 8 digits [only in overall performance tests];
    • uniform distribution over range 1 to k, where k = 1..8;
    • Gaussian distribution with mu = 1..8 (sigma = 1.0).
  • Distribution of count of separators between numbers:
    • exactly one separator character,
    • uniform distribution of 1 to 6 separators.
  • Uniform distribution of the set: no-sign-char, ‘+’ and ‘-‘.

Caveats shows statistics of 1) average processed bytes and 2) SSE
utilization. This section presents real numbers gathered during
conversion of 1000000 bytes containing 1-8 digit signed numbers.

loops 90’790  
… different span patterns 1’315  
total numbers converted 162’228 100.00%
conversion handled by scalar fallbacks 13’601 8.38%
conversions done by SSE procedures 148’627 91.62%
… unsigned SSE parser 6’222 N/A
… signed SSE parser 142’405

As we see, most conversions were done by the SSE code; moreover, most
conversions went through the signed parser, which is slower than unsigned
counterpart.
Another important number, which may give a hint about possible cache misses
to the block_info structure, is the count of different span patterns.
It is 1315 (~2% of the whole table size, 65536), but only 120 of them were
processed in 75% iterations. The next section
shows this parameter in details.
Below are detailed statistics for SSE routines:

  • calls — how many times the procedure was called;
  • converted — how many numbers were parsed;
  • conv/call — average utilization.

Conversions 1-/2-/4-/8-digits are described in section SSE conversion
capabilities, the 3-digit procedure in
Appendix A — conversion of three-digit numbers.

procedure unsigned path signed path
calls converted conv/call calls converted conv/call
1-digit conversion 325 410 1.26 N/A N/A N/A
2-digit conversion 21 48 2.29 1949 3521 1.81
3-digit conversion 21 43 2.05 1409 2246 1.59
4-digit conversion 239 581 2.43 10569 26255 2.48
8-digit conversion 2887 5140 1.78 59769 110383 1.85

The table below shows how many bytes from the input vector were processed
in single iteration. It is pretty close to average numbers showed in Caveats.

bytes processed patterns % cumulative %
0 0 0.00% 0.00%
1 0 0.00% 0.00%
2 197 0.26% 0.26%
3 583 0.76% 1.01%
4 773 1.00% 2.01%
5 1137 1.47% 3.48%
6 1841 2.39% 5.87%
7 2544 3.30% 9.17%
8 5305 6.87% 16.04%
9 9135 11.83% 27.87%
10 7163 9.28% 37.15%
11 8834 11.44% 48.60%
12 10102 13.09% 61.68%
13 9342 12.10% 73.79%
14 8221 10.65% 84.44%
15 6743 8.74% 93.17%
16 5269 6.83% 100.00%

The lookup table is quite large: in the sample implementation single record
occupies 38 bytes, it gives almost 2,5 MiB. This section presents analysis of
span_pattern usage during conversion of various inputs. Following parameters
are collated:

  1. The number of different span patterns that appear during conversion.
    This is explained in details below.
  2. The number of CPU clocks per input byte (on Skylake); the results were
    directly get from Performance comparison.
  3. Both cache miss and branch miss ratios (the same Skylake); these
    results were obtained by running separate bin/benchmark-hwevents tool.
    The ratios should give a hint about real-workload penalties.

During a conversion a histogram is build. The histogram contains span_pattern
and the number of its occurrences; the histogram is sorted by the frequency.
Having such a sorted histogram it is easy to estimate how many distinct
patterns appear in most iterations.
Let’s analyse sample, made up, histogram:

# span pattern frequency cumulative
1 0xaaaa 1 1
2 0xbbbb 1 2
3 0x4444 1 3
4 0xdddd 1 4
5 0xeeee 2 6
6 0xffff 2 8
7 0x1111 3 11
8 0x3333 5 16
9 0x5555 10 26
10 0x7777 17 43

There are 10 different patterns that were used in 43 loops. We can read
from the table that in 1/4 of iterations (11 of 43) were used even seven
patterns. Most of processing uses just three patterns. It means that only a
small portion of the large lookup is actually touched.
The tables in following sections have got data processed in this way. Each column
contains number of distinct patterns that take part in given percentage of the
iterations (25%, 50%, 75%, 95% and all, 100%).
The results for various input sizes are available in the repository, below
are shown only 4kB and 64kB inputs.

parameters distinct span masks count cycles per byte branches cache references
  < 25% < 50% < 75% < 95% 100% min avg taken mispredicted ratio count missed ratio
Fixed length
1 digit, single separator character 32 44 50 52 53 3.095 3.165 26085 633 2.43% 1531 415 27.11%
2 digits, single separator character 12 16 18 19 20 3.305 3.400 25270 485 1.92% 1333 114 8.55%
3 digits, single separator character 5 7 8 8 9 2.533 2.584 21809 247 1.13% 954 63 6.60%
4 digits, single separator character 4 5 6 6 7 3.193 3.231 24145 336 1.39% 926 49 5.29%
5 digits, single separator character 2 3 4 4 5 2.648 2.680 20788 241 1.16% 1067 24 2.25%
6 digits, single separator character 2 4 4 4 5 2.312 2.347 18289 233 1.27% 915 19 2.08%
7 digits, single separator character 2 2 2 2 3 4.121 4.167 25682 313 1.22% 780 6 0.77%
8 digits, single separator character 1 3 3 3 4 5.495 5.592 44188 837 1.89% 836 25 2.99%
Uniform distribution
1 .. 1 digit, single separator character 42 54 60 62 63 3.162 3.288 25992 619 2.38% 1618 334 20.64%
1 .. 2 digits, single separator character 71 102 118 126 128 3.201 3.268 26561 465 1.75% 2118 655 30.93%
1 .. 3 digits, single separator character 72 112 134 146 148 2.891 2.941 24393 514 2.11% 2522 663 26.29%
1 .. 4 digits, single separator character 87 135 159 170 172 3.398 3.463 25831 906 3.51% 2480 667 26.90%
1 .. 5 digits, single separator character 92 136 163 177 179 3.495 3.553 25865 806 3.12% 2509 548 21.84%
1 .. 6 digits, single separator character 90 144 168 178 180 3.494 3.557 24943 639 2.56% 2591 627 24.20%
1 .. 7 digits, single separator character 86 125 143 151 153 3.259 3.324 23730 579 2.44% 2137 476 22.27%
1 .. 8 digits, single separator character 95 135 157 166 167 3.828 3.894 28964 859 2.97% 2074 318 15.33%
Gaussian distribution
max at 1 digit, single separator character 76 133 161 174 176 3.218 3.264 25990 485 1.87% 2658 799 30.06%
max at 2 digit, single separator character 77 129 157 170 173 3.122 3.189 24951 686 2.75% 2803 455 16.23%
max at 3 digit, single separator character 76 103 115 120 122 3.192 3.246 24812 799 3.22% 3038 256 8.43%
max at 4 digit, single separator character 43 57 63 65 66 3.134 3.215 23754 414 1.74% 1650 106 6.42%
max at 5 digit, single separator character 32 38 41 43 44 2.822 2.875 21380 346 1.62% 1320 93 7.05%
max at 6 digit, single separator character 25 30 33 34 35 2.887 2.935 21765 496 2.28% 1223 79 6.46%
max at 7 digit, single separator character 20 22 24 24 25 3.921 3.983 29234 794 2.72% 1096 40 3.65%
max at 8 digit, single separator character 11 12 13 14 15 4.825 4.887 37274 1015 2.72% 944 44 4.66%
Fixed length
1 digit, 1 .. 6 separator characters 65 131 197 223 227 3.331 3.404 18979 677 3.57% 3111 1581 50.82%
2 digits, 1 .. 6 separator characters 67 135 175 195 198 3.473 3.529 20588 799 3.88% 2858 1287 45.03%
3 digits, 1 .. 6 separator characters 68 119 147 157 158 2.747 2.806 18629 481 2.58% 2728 1150 42.16%
4 digits, 1 .. 6 separator characters 69 101 118 124 125 2.641 2.695 18672 488 2.61% 1919 613 31.94%
5 digits, 1 .. 6 separator characters 53 71 78 81 82 2.748 2.824 19009 530 2.79% 1638 372 22.71%
6 digits, 1 .. 6 separator characters 43 50 53 55 56 2.958 3.002 19989 450 2.25% 1304 208 15.95%
7 digits, 1 .. 6 separator characters 12 14 17 18 19 3.316 3.341 20881 300 1.44% 1058 52 4.91%
8 digits, 1 .. 6 separator characters 33 46 54 57 58 4.302 4.346 34811 728 2.09% 1377 206 14.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters 66 132 198 232 237 3.197 3.268 19319 685 3.55% 2972 1273 42.83%
1 .. 2 digits, 1 .. 6 separator characters 66 132 198 238 243 3.025 3.081 19752 842 4.26% 3186 1247 39.14%
1 .. 3 digits, 1 .. 6 separator characters 67 135 203 233 238 3.070 3.150 19453 743 3.82% 3357 1216 36.22%
1 .. 4 digits, 1 .. 6 separator characters 70 140 210 250 256 3.390 3.477 19640 897 4.57% 3586 1226 34.19%
1 .. 5 digits, 1 .. 6 separator characters 71 142 213 242 247 3.338 3.401 19580 823 4.20% 3170 940 29.65%
1 .. 6 digits, 1 .. 6 separator characters 71 143 207 233 237 3.288 3.375 19457 785 4.03% 2916 807 27.67%
1 .. 7 digits, 1 .. 6 separator characters 74 148 208 228 230 3.304 3.375 19268 761 3.95% 2985 904 30.28%
1 .. 8 digits, 1 .. 6 separator characters 79 159 209 227 230 3.692 3.766 23045 881 3.82% 2946 727 24.68%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters 66 133 200 244 249 3.154 3.216 19347 900 4.65% 3515 1060 30.16%
max at 2 digit, 1 .. 6 separator characters 67 135 203 241 246 3.155 3.229 19299 786 4.07% 3340 892 26.71%
max at 3 digit, 1 .. 6 separator characters 69 139 205 229 233 3.202 3.278 19382 804 4.15% 3188 661 20.73%
max at 4 digit, 1 .. 6 separator characters 70 141 180 198 201 2.914 2.978 18669 611 3.27% 2775 692 24.94%
max at 5 digit, 1 .. 6 separator characters 74 131 156 164 166 3.147 3.217 19276 586 3.04% 2291 552 24.09%
max at 6 digit, 1 .. 6 separator characters 70 93 101 105 106 3.196 3.257 20806 493 2.37% 1719 237 13.79%
max at 7 digit, 1 .. 6 separator characters 64 80 87 91 92 3.597 3.646 24974 622 2.49% 1599 251 15.70%
max at 8 digit, 1 .. 6 separator characters 56 74 82 87 88 3.994 4.055 29874 848 2.84% 1587 231 14.56%
parameters distinct span masks count cycles per byte branches cache references
  < 25% < 50% < 75% < 95% 100% min avg taken mispredicted ratio count missed ratio
Fixed length
1 digit, single separator character 61 74 81 84 85 3.619 3.655 393175 6529 1.66% 13469 974 7.23%
2 digits, single separator character 13 17 20 21 22 3.727 3.754 374164 3451 0.92% 11212 286 2.55%
3 digits, single separator character 4 6 8 8 9 2.839 2.852 328065 829 0.25% 7539 101 1.34%
4 digits, single separator character 4 5 6 6 7 3.495 3.507 361605 2429 0.67% 7148 89 1.25%
5 digits, single separator character 1 2 3 3 4 2.921 2.935 313419 849 0.27% 7335 119 1.62%
6 digits, single separator character 1 2 3 3 4 2.545 2.555 272684 741 0.27% 8704 58 0.67%
7 digits, single separator character 1 2 2 2 3 4.407 4.417 386099 2206 0.57% 5224 27 0.52%
8 digits, single separator character 2 3 4 4 5 7.094 7.135 666950 13453 2.02% 5192 75 1.44%
Uniform distribution
1 .. 1 digit, single separator character 59 72 80 82 83 3.627 3.693 393231 6582 1.67% 14582 973 6.67%
1 .. 2 digits, single separator character 236 280 299 308 310 3.908 3.935 397134 5214 1.31% 19011 2176 11.45%
1 .. 3 digits, single separator character 355 420 450 466 469 3.565 3.590 361819 4669 1.29% 21347 1858 8.70%
1 .. 4 digits, single separator character 416 493 523 537 540 5.329 5.387 391117 12854 3.29% 24169 1621 6.71%
1 .. 5 digits, single separator character 444 510 543 556 558 5.270 5.328 390345 11393 2.92% 19020 1568 8.24%
1 .. 6 digits, single separator character 423 492 525 535 537 4.589 4.633 372174 7872 2.12% 19917 1282 6.44%
1 .. 7 digits, single separator character 392 462 481 491 493 4.174 4.218 355907 6263 1.76% 15979 949 5.94%
1 .. 8 digits, single separator character 519 598 620 630 631 5.217 5.273 423108 11289 2.67% 16635 695 4.18%
Gaussian distribution
max at 1 digit, single separator character 481 587 630 645 648 4.150 4.189 395482 6382 1.61% 25467 1620 6.36%
max at 2 digit, single separator character 454 534 570 586 589 4.104 4.143 371930 6882 1.85% 25775 739 2.87%
max at 3 digit, single separator character 300 330 344 350 352 4.990 5.050 375961 11571 3.08% 15877 351 2.21%
max at 4 digit, single separator character 132 147 153 155 156 3.760 3.814 356129 4522 1.27% 10634 220 2.07%
max at 5 digit, single separator character 84 90 93 95 96 3.031 3.045 318232 1548 0.49% 8123 177 2.18%
max at 6 digit, single separator character 54 59 62 64 65 3.479 3.499 328097 6041 1.84% 8491 366 4.31%
max at 7 digit, single separator character 36 39 41 41 42 5.520 5.560 444157 11932 2.69% 5814 124 2.13%
max at 8 digit, single separator character 22 23 24 25 26 6.897 6.946 562307 15610 2.78% 5449 120 2.20%
Fixed length
1 digit, 1 .. 6 separator characters 1052 1626 1870 1950 1958 4.371 4.408 280108 7580 2.71% 34860 8289 23.78%
2 digits, 1 .. 6 separator characters 693 909 995 1026 1029 5.044 5.108 308039 12071 3.92% 26297 4481 17.04%
3 digits, 1 .. 6 separator characters 354 438 480 494 496 3.334 3.358 273155 3865 1.41% 17568 2647 15.07%
4 digits, 1 .. 6 separator characters 177 224 246 252 254 3.311 3.335 277306 4944 1.78% 10967 1361 12.41%
5 digits, 1 .. 6 separator characters 100 126 134 137 138 3.648 3.677 284172 7099 2.50% 7727 742 9.60%
6 digits, 1 .. 6 separator characters 69 77 80 82 83 3.675 3.703 297939 5210 1.75% 6470 402 6.21%
7 digits, 1 .. 6 separator characters 46 49 51 53 54 3.662 3.673 315611 2311 0.73% 6077 308 5.07%
8 digits, 1 .. 6 separator characters 30 45 54 58 59 5.735 5.782 532039 10908 2.05% 6292 438 6.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters 1053 1643 1896 1973 1981 4.364 4.396 279251 7676 2.75% 34478 6144 17.82%
1 .. 2 digits, 1 .. 6 separator characters 1066 1792 2104 2212 2221 5.090 5.154 288339 12133 4.21% 39995 5781 14.45%
1 .. 3 digits, 1 .. 6 separator characters 1084 1802 2120 2226 2236 4.974 5.031 286026 10029 3.51% 44877 5187 11.56%
1 .. 4 digits, 1 .. 6 separator characters 1120 1808 2084 2173 2182 5.332 5.395 287345 11700 4.07% 42918 3871 9.02%
1 .. 5 digits, 1 .. 6 separator characters 1141 1713 1959 2031 2039 5.078 5.153 287785 10589 3.68% 40253 2950 7.33%
1 .. 6 digits, 1 .. 6 separator characters 1142 1617 1803 1856 1860 4.888 4.960 286913 10124 3.53% 35625 2229 6.26%
1 .. 7 digits, 1 .. 6 separator characters 1091 1503 1646 1678 1681 4.970 5.033 287420 11059 3.85% 30932 3218 10.40%
1 .. 8 digits, 1 .. 6 separator characters 1109 1476 1598 1619 1622 5.650 5.711 339596 14223 4.19% 28870 2243 7.77%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters 1070 1975 2405 2551 2565 5.389 5.456 285670 12734 4.46% 46840 6140 13.11%
max at 2 digit, 1 .. 6 separator characters 1094 1862 2174 2283 2293 5.125 5.177 288202 10818 3.75% 43779 3228 7.37%
max at 3 digit, 1 .. 6 separator characters 1021 1440 1610 1666 1672 4.927 4.986 284545 10542 3.70% 34647 1690 4.88%
max at 4 digit, 1 .. 6 separator characters 715 905 985 1010 1013 4.072 4.129 280545 8310 2.96% 20920 931 4.45%
max at 5 digit, 1 .. 6 separator characters 445 534 567 575 577 4.127 4.167 286377 8569 2.99% 14319 651 4.55%
max at 6 digit, 1 .. 6 separator characters 296 336 346 351 352 4.025 4.054 312734 6394 2.04% 11762 402 3.42%
max at 7 digit, 1 .. 6 separator characters 185 207 214 219 220 4.481 4.520 376654 8042 2.14% 7924 452 5.70%
max at 8 digit, 1 .. 6 separator characters 110 133 141 146 148 5.532 5.576 448317 12621 2.82% 7779 821 10.55%

Following procedures are tested:

  • scalar — the hand-coded validating scalar parser;
  • scalar (std) — also the hand-coded parser,
    but use only standard functions strtol and strspn;
  • scalar hybrid — the implementation of Scalar hybrid;
  • SSE — the implementation of the base SSE algorithm;
  • SSE (block) — the implementation of variant described in Processing
    larger inputs
  • SSE (simplified) — the SSE variant with relaxed delimiters definition.

There are two tests approaches:

  • Overall, where rather huge input data is parsed (a few megabytes) and
    the wall clock is used to measure the time of the whole procedure. Procedures
    are repeated several times (10, 100, 1000 or more), the minimum running
    time is noted.
    To run these tests type make report-overall.rst; the tests suite uses executable
    bin/benchmark.
    The results from various computers are available in the repository.
  • Microbenchmarks, where small inputs are parsed and exact number of CPU
    cycles is obtained. Procedures are also repeated many times (1000) and then
    minumum and average cycles to process a single input byte are calculated.
    To run these tests type make microbenchmarks.rst; the tests suite uses
    executable bin/benchmark-cpuclocks.
    The all results from various computers are available in the repository,
    below are only the measurements from the Skylake computer.

Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
The table below compares speedup to the reference procedure for various
input sizes and data distributions. Detailed measurements for each size and
different distribution parameters are presented in the subsequent sections.

  speedup over scalar procedure
  scalar (std) scalar (hybrid) SSE SSE (block) SSE (simplified)
size [B] distribution samples min avg max min avg max min avg max min avg max min avg max
1024 Gaussian distribution 16 0.45 0.87 1.29 1.65 2.36 3.26 1.53 3.21 5.44 1.66 3.16 4.97 1.58 3.37 5.78
1024 Fixed length 16 0.42 0.87 1.29 1.63 2.41 3.53 1.28 3.32 5.23 1.47 3.30 5.05 1.30 3.54 5.79
1024 Uniform distribution 16 0.40 0.86 1.32 1.64 2.41 3.36 2.04 3.70 5.54 2.22 3.53 4.97 2.13 3.88 5.79
4096 Gaussian distribution 16 0.39 0.69 1.24 1.69 3.01 4.67 1.91 4.56 7.21 2.13 4.88 7.84 2.01 4.77 7.68
4096 Fixed length 16 0.36 0.69 1.39 1.94 2.87 4.32 1.60 4.44 6.86 1.71 4.66 8.34 1.62 4.70 7.83
4096 Uniform distribution 16 0.43 0.78 1.38 2.07 3.24 4.55 3.19 5.33 7.40 3.36 5.61 7.87 3.27 5.61 7.79
65536 Gaussian distribution 16 0.45 0.77 1.30 1.40 1.74 2.13 1.73 4.46 5.88 1.76 4.57 6.00 1.77 4.60 6.25
65536 Fixed length 16 0.40 0.75 1.36 1.55 1.87 2.37 1.46 4.68 6.72 1.48 5.00 7.31 1.50 4.95 7.53
65536 Uniform distribution 16 0.55 0.86 1.36 1.38 1.75 2.34 3.32 4.98 6.39 3.29 5.01 6.52 3.43 5.20 6.89
102400 Gaussian distribution 16 0.46 0.77 1.26 1.38 1.71 2.03 1.78 4.28 5.66 1.81 4.44 5.81 1.82 4.42 6.00
102400 Fixed length 16 0.40 0.75 1.36 1.53 1.84 2.31 1.29 4.59 6.49 1.46 4.91 6.99 1.31 4.85 7.16
102400 Uniform distribution 16 0.54 0.86 1.38 1.35 1.71 2.32 3.22 4.78 6.38 3.21 4.83 6.43 3.31 4.97 6.85

Gaussian distribution (max at 8 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 8.782 9.103 1.00
scalar (std) 22.874 23.586 0.38
scalar (hybrid) 5.005 5.148 1.75
SSE 4.898 4.974 1.79
SSE (block) 4.603 4.651 1.91
SSE (simplified) 4.687 4.730 1.87

Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.729 13.130 1.00
scalar (std) 17.501 18.150 0.73
scalar (hybrid) 4.067 4.273 3.13
SSE 4.002 4.069 3.18
SSE (block) 3.903 3.945 3.26
SSE (simplified) 3.914 3.970 3.25

Gaussian distribution (max at 7 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 8.836 9.191 1.00
scalar (std) 23.365 24.139 0.38
scalar (hybrid) 5.104 5.250 1.73
SSE 4.129 4.195 2.14
SSE (block) 3.476 3.518 2.54
SSE (simplified) 3.975 4.039 2.22

Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.646 14.115 1.00
scalar (std) 18.029 18.596 0.76
scalar (hybrid) 4.042 4.274 3.38
SSE 3.591 3.644 3.80
SSE (block) 3.294 3.334 4.14
SSE (simplified) 3.431 3.481 3.98

Gaussian distribution (max at 6 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 9.434 9.852 1.00
scalar (std) 24.766 25.547 0.38
scalar (hybrid) 5.571 5.728 1.69
SSE 2.833 2.882 3.33
SSE (block) 2.977 3.024 3.17
SSE (simplified) 2.738 2.786 3.45

Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.200 14.750 1.00
scalar (std) 17.992 18.563 0.79
scalar (hybrid) 4.091 4.330 3.47
SSE 3.207 3.266 4.43
SSE (block) 2.885 2.932 4.92
SSE (simplified) 3.058 3.113 4.64

Gaussian distribution (max at 5 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 10.384 10.864 1.00
scalar (std) 25.763 26.495 0.40
scalar (hybrid) 6.102 6.273 1.70
SSE 2.786 2.827 3.73
SSE (block) 2.710 2.744 3.83
SSE (simplified) 2.699 2.752 3.85

Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.168 15.622 1.00
scalar (std) 18.088 18.872 0.84
scalar (hybrid) 4.244 4.442 3.57
SSE 3.023 3.096 5.02
SSE (block) 2.864 2.934 5.30
SSE (simplified) 2.908 2.984 5.22

Gaussian distribution (max at 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.091 18.877 1.00
scalar (std) 34.484 35.258 0.52
scalar (hybrid) 7.037 7.275 2.57
SSE 3.160 3.213 5.72
SSE (block) 3.282 3.352 5.51
SSE (simplified) 3.120 3.194 5.80

Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.390 22.112 1.00
scalar (std) 18.566 19.088 1.15
scalar (hybrid) 5.160 5.286 4.15
SSE 3.067 3.126 6.97
SSE (block) 2.904 2.963 7.37
SSE (simplified) 2.961 3.022 7.22

Gaussian distribution (max at 4 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.966 12.550 1.00
scalar (std) 26.982 27.705 0.44
scalar (hybrid) 5.896 6.082 2.03
SSE 3.234 3.284 3.70
SSE (block) 3.026 3.074 3.95
SSE (simplified) 3.117 3.183 3.84

Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.371 16.967 1.00
scalar (std) 17.832 18.445 0.92
scalar (hybrid) 4.173 4.395 3.92
SSE 2.909 3.001 5.63
SSE (block) 2.760 2.813 5.93
SSE (simplified) 2.848 2.908 5.75

Gaussian distribution (max at 3 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.280 13.948 1.00
scalar (std) 29.238 30.157 0.45
scalar (hybrid) 6.225 6.450 2.13
SSE 3.225 3.309 4.12
SSE (block) 2.963 3.012 4.48
SSE (simplified) 3.238 3.286 4.10

Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.193 18.831 1.00
scalar (std) 18.642 19.287 0.98
scalar (hybrid) 4.119 4.246 4.42
SSE 3.126 3.190 5.82
SSE (block) 3.016 3.090 6.03
SSE (simplified) 3.000 3.048 6.06

Gaussian distribution (max at 2 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.942 16.733 1.00
scalar (std) 32.551 33.334 0.49
scalar (hybrid) 6.372 6.605 2.50
SSE 3.130 3.190 5.09
SSE (block) 3.041 3.121 5.24
SSE (simplified) 3.020 3.082 5.28

Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.088 20.800 1.00
scalar (std) 18.382 19.013 1.09
scalar (hybrid) 4.574 4.835 4.39
SSE 3.184 3.264 6.31
SSE (block) 3.018 3.069 6.66
SSE (simplified) 2.988 3.042 6.72

Uniform distribution (1 .. 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.677 19.206 1.00
scalar (std) 34.479 35.207 0.54
scalar (hybrid) 7.953 8.184 2.35
SSE 3.014 3.099 6.20
SSE (block) 3.117 3.172 5.99
SSE (simplified) 2.977 3.062 6.27

Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.981 22.621 1.00
scalar (std) 16.860 17.391 1.30
scalar (hybrid) 5.285 5.481 4.16
SSE 3.029 3.085 7.26
SSE (block) 2.942 3.000 7.47
SSE (simplified) 3.099 3.179 7.09

Uniform distribution (1 .. 2 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.695 18.444 1.00
scalar (std) 34.917 35.633 0.51
scalar (hybrid) 7.245 7.401 2.44
SSE 3.324 3.387 5.32
SSE (block) 3.323 3.398 5.33
SSE (simplified) 3.236 3.321 5.47

Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.164 21.860 1.00
scalar (std) 18.567 19.126 1.14
scalar (hybrid) 4.937 5.156 4.29
SSE 2.940 2.995 7.20
SSE (block) 2.924 2.975 7.24
SSE (simplified) 2.785 2.826 7.60

Uniform distribution (1 .. 3 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.385 17.254 1.00
scalar (std) 33.359 34.075 0.49
scalar (hybrid) 6.877 7.145 2.38
SSE 2.892 2.947 5.67
SSE (block) 3.088 3.158 5.31
SSE (simplified) 2.752 2.844 5.95

Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.530 21.221 1.00
scalar (std) 18.995 19.580 1.08
scalar (hybrid) 4.647 4.811 4.42
SSE 3.037 3.089 6.76
SSE (block) 2.941 2.992 6.98
SSE (simplified) 2.914 2.972 7.05

Uniform distribution (1 .. 4 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.926 15.613 1.00
scalar (std) 31.275 32.210 0.48
scalar (hybrid) 6.445 6.677 2.32
SSE 3.393 3.458 4.40
SSE (block) 3.126 3.184 4.77
SSE (simplified) 3.293 3.344 4.53

Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.047 19.920 1.00
scalar (std) 19.500 20.130 0.98
scalar (hybrid) 4.539 4.723 4.20
SSE 3.325 3.383 5.73
SSE (block) 3.132 3.193 6.08
SSE (simplified) 3.182 3.235 5.99

Uniform distribution (1 .. 5 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.748 14.405 1.00
scalar (std) 30.105 30.941 0.46
scalar (hybrid) 6.144 6.413 2.24
SSE 3.475 3.533 3.96
SSE (block) 3.171 3.238 4.34
SSE (simplified) 3.293 3.356 4.17

Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.297 19.010 1.00
scalar (std) 19.209 19.845 0.95
scalar (hybrid) 4.477 4.652 4.09
SSE 3.293 3.352 5.56
SSE (block) 3.135 3.188 5.84
SSE (simplified) 3.182 3.230 5.75

Uniform distribution (1 .. 6 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.912 13.548 1.00
scalar (std) 29.511 30.337 0.44
scalar (hybrid) 5.958 6.253 2.17
SSE 3.430 3.487 3.76
SSE (block) 3.201 3.265 4.03
SSE (simplified) 3.283 3.337 3.93

Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.288 17.956 1.00
scalar (std) 19.250 19.841 0.90
scalar (hybrid) 4.356 4.481 3.97
SSE 3.205 3.276 5.39
SSE (block) 2.990 3.046 5.78
SSE (simplified) 3.116 3.172 5.55

Uniform distribution (1 .. 7 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.974 12.552 1.00
scalar (std) 27.884 28.562 0.43
scalar (hybrid) 6.214 6.385 1.93
SSE 3.253 3.309 3.68
SSE (block) 3.057 3.100 3.92
SSE (simplified) 3.119 3.167 3.84

Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.583 17.366 1.00
scalar (std) 18.947 19.528 0.88
scalar (hybrid) 4.265 4.401 3.89
SSE 3.248 3.311 5.11
SSE (block) 2.989 3.048 5.55
SSE (simplified) 3.165 3.215 5.24

Uniform distribution (1 .. 8 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.637 12.165 1.00
scalar (std) 27.107 27.851 0.43
scalar (hybrid) 6.079 6.231 1.91
SSE 3.821 3.870 3.05
SSE (block) 3.492 3.550 3.33
SSE (simplified) 3.688 3.745 3.16

Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.739 16.589 1.00
scalar (std) 18.875 19.563 0.83
scalar (hybrid) 4.335 4.463 3.63
SSE 3.611 3.667 4.36
SSE (block) 3.337 3.400 4.72
SSE (simplified) 3.490 3.538 4.51

Gaussian distribution (max at 8 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.852 11.971 1.00
scalar (std) 26.589 26.792 0.45
scalar (hybrid) 8.336 8.411 1.42
SSE 6.911 6.956 1.71
SSE (block) 6.921 6.990 1.71
SSE (simplified) 6.757 6.802 1.75

Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.008 17.110 1.00
scalar (std) 21.492 21.653 0.79
scalar (hybrid) 8.722 8.800 1.95
SSE 5.650 5.692 3.01
SSE (block) 5.679 5.733 2.99
SSE (simplified) 5.426 5.470 3.13

Gaussian distribution (max at 7 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.818 12.929 1.00
scalar (std) 27.335 27.609 0.47
scalar (hybrid) 9.246 9.315 1.39
SSE 5.513 5.542 2.33
SSE (block) 5.259 5.296 2.44
SSE (simplified) 5.364 5.405 2.39

Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.978 18.119 1.00
scalar (std) 21.924 22.109 0.82
scalar (hybrid) 9.205 9.286 1.95
SSE 4.537 4.591 3.96
SSE (block) 4.351 4.388 4.13
SSE (simplified) 4.352 4.407 4.13

Gaussian distribution (max at 6 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.772 13.862 1.00
scalar (std) 28.295 28.596 0.49
scalar (hybrid) 10.005 10.079 1.38
SSE 3.456 3.479 3.98
SSE (block) 3.358 3.385 4.10
SSE (simplified) 3.379 3.402 4.08

Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.374 19.538 1.00
scalar (std) 22.288 22.489 0.87
scalar (hybrid) 9.660 9.730 2.01
SSE 3.949 3.983 4.91
SSE (block) 3.726 3.751 5.20
SSE (simplified) 3.870 3.918 5.01

Gaussian distribution (max at 5 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.989 15.092 1.00
scalar (std) 29.851 30.089 0.50
scalar (hybrid) 10.750 10.842 1.39
SSE 2.933 2.945 5.11
SSE (block) 2.807 2.822 5.34
SSE (simplified) 2.866 2.878 5.23

Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.729 20.888 1.00
scalar (std) 22.536 22.718 0.92
scalar (hybrid) 10.225 10.311 2.03
SSE 4.032 4.086 5.14
SSE (block) 3.762 3.805 5.51
SSE (simplified) 3.985 4.043 5.20

Gaussian distribution (max at 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.489 24.804 1.00
scalar (std) 38.285 38.540 0.64
scalar (hybrid) 16.088 16.198 1.52
SSE 4.159 4.200 5.89
SSE (block) 4.107 4.142 5.96
SSE (simplified) 4.213 4.250 5.81

Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 28.265 28.549 1.00
scalar (std) 22.648 22.767 1.25
scalar (hybrid) 13.524 13.622 2.09
SSE 5.201 5.271 5.43
SSE (block) 5.452 5.513 5.18
SSE (simplified) 5.057 5.128 5.59

Gaussian distribution (max at 4 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.687 16.790 1.00
scalar (std) 31.153 31.398 0.54
scalar (hybrid) 11.835 11.913 1.41
SSE 3.590 3.614 4.65
SSE (block) 3.466 3.489 4.81
SSE (simplified) 3.496 3.525 4.77

Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.284 22.529 1.00
scalar (std) 22.476 22.653 0.99
scalar (hybrid) 10.916 11.005 2.04
SSE 4.032 4.081 5.53
SSE (block) 3.859 3.898 5.77
SSE (simplified) 3.953 3.991 5.64

Gaussian distribution (max at 3 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.121 19.291 1.00
scalar (std) 32.927 33.175 0.58
scalar (hybrid) 13.629 13.718 1.40
SSE 4.848 4.901 3.94
SSE (block) 4.780 4.825 4.00
SSE (simplified) 4.760 4.812 4.02

Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.281 24.611 1.00
scalar (std) 22.530 22.703 1.08
scalar (hybrid) 11.920 12.185 2.04
SSE 4.756 4.823 5.11
SSE (block) 4.781 4.835 5.08
SSE (simplified) 4.607 4.664 5.27

Gaussian distribution (max at 2 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.188 22.377 1.00
scalar (std) 36.158 36.461 0.61
scalar (hybrid) 15.232 15.326 1.46
SSE 4.058 4.108 5.47
SSE (block) 3.952 3.990 5.61
SSE (simplified) 4.061 4.110 5.46

Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 26.633 26.932 1.00
scalar (std) 22.723 22.869 1.17
scalar (hybrid) 13.162 13.267 2.02
SSE 5.097 5.174 5.23
SSE (block) 5.160 5.224 5.16
SSE (simplified) 4.941 5.008 5.39

Uniform distribution (1 .. 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.963 23.203 1.00
scalar (std) 37.465 37.718 0.61
scalar (hybrid) 13.951 14.058 1.65
SSE 3.389 3.447 6.78
SSE (block) 3.593 3.643 6.39
SSE (simplified) 3.324 3.394 6.91

Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 27.116 27.518 1.00
scalar (std) 20.621 20.742 1.31
scalar (hybrid) 11.908 12.006 2.28
SSE 4.128 4.167 6.57
SSE (block) 4.179 4.214 6.49
SSE (simplified) 4.105 4.136 6.61

Uniform distribution (1 .. 2 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 23.675 24.277 1.00
scalar (std) 37.914 38.203 0.62
scalar (hybrid) 15.327 15.432 1.54
SSE 3.896 3.933 6.08
SSE (block) 3.870 3.901 6.12
SSE (simplified) 3.913 3.952 6.05

Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 27.811 28.100 1.00
scalar (std) 22.433 22.537 1.24
scalar (hybrid) 13.003 13.113 2.14
SSE 4.947 5.018 5.62
SSE (block) 5.201 5.265 5.35
SSE (simplified) 4.807 4.872 5.79

Uniform distribution (1 .. 3 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.512 22.742 1.00
scalar (std) 36.660 36.946 0.61
scalar (hybrid) 15.263 15.362 1.47
SSE 3.474 3.507 6.48
SSE (block) 3.456 3.480 6.51
SSE (simplified) 3.528 3.561 6.38

Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 26.841 27.155 1.00
scalar (std) 22.874 23.042 1.17
scalar (hybrid) 13.184 13.284 2.04
SSE 4.815 4.876 5.57
SSE (block) 4.952 5.022 5.42
SSE (simplified) 4.699 4.753 5.71

Uniform distribution (1 .. 4 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.123 21.329 1.00
scalar (std) 35.205 35.486 0.60
scalar (hybrid) 14.756 14.856 1.43
SSE 5.217 5.283 4.05
SSE (block) 5.053 5.106 4.18
SSE (simplified) 5.163 5.227 4.09

Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 25.877 26.129 1.00
scalar (std) 23.132 23.285 1.12
scalar (hybrid) 12.984 13.084 1.99
SSE 5.211 5.291 4.97
SSE (block) 5.283 5.356 4.90
SSE (simplified) 5.048 5.116 5.13

Uniform distribution (1 .. 5 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.780 19.978 1.00
scalar (std) 34.247 34.511 0.58
scalar (hybrid) 14.274 14.371 1.39
SSE 5.099 5.160 3.88
SSE (block) 4.920 4.975 4.02
SSE (simplified) 5.053 5.099 3.91

Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.749 25.022 1.00
scalar (std) 23.269 23.468 1.06
scalar (hybrid) 12.627 12.718 1.96
SSE 5.046 5.111 4.90
SSE (block) 4.999 5.126 4.95
SSE (simplified) 4.903 4.954 5.05

Uniform distribution (1 .. 6 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.656 18.857 1.00
scalar (std) 33.362 33.565 0.56
scalar (hybrid) 13.643 13.742 1.37
SSE 4.433 4.479 4.21
SSE (block) 4.282 4.319 4.36
SSE (simplified) 4.356 4.399 4.28

Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 23.975 24.304 1.00
scalar (std) 23.495 23.653 1.02
scalar (hybrid) 12.269 12.358 1.95
SSE 4.883 4.947 4.91
SSE (block) 4.715 4.770 5.08
SSE (simplified) 4.778 4.825 5.02

Uniform distribution (1 .. 7 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.632 17.799 1.00
scalar (std) 32.227 32.509 0.55
scalar (hybrid) 13.079 13.177 1.35
SSE 4.114 4.158 4.29
SSE (block) 3.976 4.028 4.43
SSE (simplified) 4.032 4.069 4.37

Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.878 23.116 1.00
scalar (std) 23.259 23.512 0.98
scalar (hybrid) 12.010 12.113 1.90
SSE 4.926 4.986 4.64
SSE (block) 4.670 4.735 4.90
SSE (simplified) 4.836 4.892 4.73

Uniform distribution (1 .. 8 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.753 16.923 1.00
scalar (std) 31.309 31.604 0.54
scalar (hybrid) 12.522 12.615 1.34
SSE 5.181 5.232 3.23
SSE (block) 5.263 5.313 3.18
SSE (simplified) 5.089 5.133 3.29

Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.070 22.400 1.00
scalar (std) 23.084 23.272 0.96
scalar (hybrid) 11.689 11.777 1.89
SSE 5.666 5.729 3.90
SSE (block) 5.833 5.886 3.78
SSE (simplified) 5.493 5.553 4.02
  • The vectorized algorithm can be faster from 2 to 5 times than the scalar base.
  • Although SSE speed-up depends on the count of digits in numbers, negative
    impact is visible when the numbers are separated by more than one character.
    This is caused by greater number of distinct span patterns: for “Gaussian distribution — max at
    1 digit, single separator character” the number of distinct patterns is
    approximately 1000; for the same distribution, but variable number of
    separator, this value raises almost 28 times, to 28000.
    It means touching more memory locations within the lookup table and, as a
    result, more cache misses. Also amount of branch mispredictions is greater
    due to processing greater variety of patterns.
  • Scalar hybrid performs surprisingly well, it can be even 2 times faster than
    the baseline. However, its compile time is terribly long.

Nathan Kurz and Daniel Lemire gave incredibly valuable feedback and
comments on text and experiments. Thank you!
External links:
Internal links:
Source code is available on github.