Parsing series of integers with SIMD
While conversion from a string into an integer value is feasible with SIMD
instructions, this application is unpractical. For typical cases, when a single
value is parsed, scalar procedures — like the standard atoi or
strtol — are faster than any fancy SSE code.
However, SIMD procedures can be really fast and convert in parallel several
numbers. There is only one “but”: the input data has to be regular and valid,
i.e. the input string must contain only ASCII digits. Recently, I updated
article about SSE parsing with the benchmark results. The
speed-ups are really impressive, for example the SSSE3 parser is 7 to 9 times
faster than a naive, scalar code.
The obvious question is how these powerful SIMD procedures can be used to
convert real data? By real I mean possibly broken inputs that contain series
of numbers of different length separated with characters from a predefined set.
In this text I try to answer that question; major contributions of this
article are:

Methods to efficiently parse and validate such strings using SSE
instructions. There is a special variant that handles only unsigned
numbers and also a fully featured variant for signed numbers.
Boosting the SSE procedure with AVX2 or AVX512 instructions when
possible.
A way to combine some of SIMD techniques with scalar conversions.

The article starts with unsigned conversion, because it is easier than a signed
one. The signed conversion shares the core idea, it just adds some extra steps.
The text is accompanied with BSD-licensed software, that includes fully
functional implementations alongside the programs which validate and benchmark
the procedures.
The input is defined as follows:

It is a string of arbitrary length; but it should be large, say a few
kilobytes or megabytes.
The input contains integer numbers. A number is an optional sign character
‘+’ or ‘-‘ followed by a non-empty sequence of digits ‘0’..’9′.
Numbers are separated with characters from a user-defiend set;
for instance, it might be the comma and blank characters (space, new
line, etc.).
All characters other than digits, separators and signs are considered
invalid.
Parser must be able to check validity of the input, i.e. lack of invalid
characters and the proper input format (numbers are separated with at
least one character, sign characters are present only at the beginning
of a number).

For the input string “123; -52, +432424 -999; 1234568, +879”, and three
separators ‘,’, ‘;’ and ‘ ‘, an algorithm is expected to extract six integers:
[123, -52, 432424, -999, 1234568, 879]. The order of converted numbers must be
exactly the same as in the input string.
A simplified parser is also tested. This parser simply considers all
non-digit and non-sign characters as delimiters.
The conversion algorithm works in a loop, in each iteration exactly one input
vector is loaded from memory and processed.

The vector is validated.
Depending on the layout of the digits, the input is normalized to the
format required by the SIMD procedures. At this step we choose: how many
numbers from the input are converted and which SIMD procedure is suitable
for conversion.
After conversion, the input pointer is advanced and the loop continues.

Since SIMD procedures impose specific layout of digits within a vector, an
arbitrary input has to be properly aligned. In order to do this, we need to
identify spans of digits in the input and move each span onto certain
subarrays of a vector. Let’s assume at this point that the input contains either
digits or separators (denoted with _).
For instance, the vector with one-digit numbers 1___2_3__4__5_6_ must be
transformed into 123456__________. Then an SSE procedure will convert in
parallel all the numbers into the array [1, 2, 3, 4, 5, 6]. Likewise, the vector
with two-digit numbers _12__34___56_78_ must be transformed into
12345678________ and then the result will be [12, 34, 56, 78].
Let’s consider a more complicated case, when a string has numbers with
different count of digits. There are a few ways to convert the input
_1_2_34_567_89__:

If we choose conversion of one-digit numbers, then just the two first spans
can be converted — because we need to keep the order of numbers from the input.
After normalization the input into 12______________ just two values [1,
2] will be produced. The input’s tail _34_567_89__ remain untouched.
If we choose conversion of two-digit numbers, then the three first spans
can be converted. Shorter numbers are completed with zeros, then normalized
input is 010234__________. The result is [1, 2, 34]; this time
a bit shorter input’s tail _567_89__ remain untouched.
Finally, if we choose conversion of four-digit numbers, then the four first
spans can be converted. Again, shorter numbers are completed with zeros,
and normalized input is 0001000200340567. The result is [1, 2, 34, 567],
but still the chunk’s tail, i.e. _89__, is unprocessed.

We can see that in order to convert given span combination we need to know:

How to shuffle bytes in the input vector?
Which SIMD procedure can be used then?
How many numbers converted by the SIMD procedure must be stored?

Obtaining this information seems to be quite complicated, especially when
we look at the last example. Fortunately, all parameters can
be precalculated. A span combination can be saved as a bit-pattern,
where ones represent digits. For example, from vector _1_2_34_567_89__
we get the span pattern0b0101011011101100 = 0x56ec.
A span pattern is used to fetch a record from the precalculated array.
The record contains following fields:

shuffle_digits — an array of 16 bytes, which is the argument for
the instruction _mm_shuffle_epi8 (pshufb); the instruction
moves bytes at certain positions;
conversion_routine — an enumeration that selects an SSE conversion
procedure; for instance, it tells that shuffled input is an array
of two-digit numbers;
element_count — the number of elements from the SSE conversion
procedure that must be stored in the output collection.

The solution with a precalculated array is suitable only for SSE, as span
patterns have 16 bits. In cases of AVX2 and AVX512, where vectors are wider,
such a table would be simply too large, respectively 232 or
264 entries. Additionally, the AVX2 version of pshufb instruction
works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to
shuffle all inputs.
But AVX2 and AVX512 instructions still might be used in some parts of
algorithms, especially in input validation.
To build the whole precalculated table all 65536 span patterns have to be
processed. Processing a pattern consists following steps:

Determine digit spans. For pattern _ddd__d__ddd_dd_ we have four spans
[(1,3), (6,6), (9,11), (14,15)]. But for pattern ____dddd___ddddd there
is only one span [(4,7)]. The span at the vector’s end might be continued in
the next chunk, and we don’t know how to convert it. However, we may assume
that a span which starts at the vector’s beginning is a new number and thus
it is included into the span’s list.
Find the best conversion scheme for the spans. We systematically check
which SSE conversion is possible, i.e. one-/two-/four-/eight-digit and how
many spans each conversion can process — the procedure which processes the
most items is selected. The output from this step are values for the following
record fields:
- element_size — 1, 2, 4, 8 or 0 if SSE conversion is not possible;
- element_count — the numbers of converted spans;
- total_skip — how many input bytes are processed; for pattern
  ____dddd________ we know that the whole input is processed
  (total_skip=16); for pattern ____d____ddddd_d we process only the
  first span at index 5, but we know that four non-digit characters after
  the span can be also skipped, thus total_skip is 9.
Construct the shuffle pattern. Having the element size and spans positions
the vector shuffle_digits is built. The characters from a span are mapped
on the appropriate subarrays in the vector. Also zero completion is done in
this step, as the instruction pshufb either copy given byte from the input
vector or put zero if the index has got the most significant bit set.

Let’s consider span pattern _ddd__d__ddd_dd_. It can be converted using
four-digit conversion, so the layout of normalized vector should be:
[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ]
There are four numbers a, b, c and d in the vector; note
that the bytes at 0th index hold the most significant digits.
The first number in pattern has three digits at indices 1, 2 and 3. It means
that a1 = 1, a2 = 2 and a3 = 3; value of a0 should be zeroed,
so a0 = 0x80.
The second number has just one digit at index 6. Thus b3 = 6 and, since
the rest digits have to be zero, b0 = b1 = b2 = 0x80.
The third number has three digits at indices 9, 10, 11. Thus c1 = 9,
c2 = 10, c3 = 11, and c0 = 0x80.
The last, forth number, has got two digits at indices 14 and 15. Thus
d2 = 14, d3 = 15 and the rest have to be zero, so d0 = d1 = 0x80.
The final content of the shuffle_digits vector is:
[ 80 | 1 | 2 | 3 | 80 | 80 | 80 | 6 | 80 | 9 | 10 | 11 | 80 | 80 | 14 | 15 ]
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.

Load 16 characters into the input vector.

const__m128iinput=_mm_loadu_si128((const__m128i*)data);

Check for invalid characters (will be discussed later).

Build the span pattern:

// t0 = input[i] in [‘0’..’9] ? 0xff : 0x00
const__m128it0=decimal_digits_mask(input);constuint16_tspan_mask=_mm_movemask_epi8(t0);

Fetch the block info with the precalculate parameters.

constBlockInfo&bi=blocks[span_mask];

From the block info get the shuffle pattern for _mm_shuffle_epi8.

const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)b.shuffle_digits);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);

The shuffled vector is then issued to the proper conversion routine. The block
info structure keeps the conversion kind (conversion_routine) and also the number
of items to get from the SSE result (element_count). It is not always
possible to use any SSE conversion procedure, thus a scalar code is needed
to handle odd cases.

if(b.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE2Digits){convert_2digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE4Digits){convert_4digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE8Digits){convert_8digits(shuffled,b.element_count,output);}else{// case unsupported, a scalar code is called here
}

Advance the input pointer. The block info structure keeps how many bytes
were actually covered.

data+=b.total_skip;
In case of unsigned inputs we need to classify input characters into three
categories:

digits, i.e. chars ‘0’ .. ‘9’;
user-defined set of separators;
invalid numbers.

In practise we need two masks: one with positions of digits, another with
positions of separators. If the or’ed mask has some zero elements, it means
that there are invalid characters. Below is a schema:
const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_sep=matcher.get_mask(input);const__m128ibytemask_valid=_mm_or_si128(bytemask_digit,bytemask_sep);constuint16_tvalid_mask=_mm_movemask_epi8(bytemask_valid);if(valid_mask!=0xffff){throwstd::logic_error(“wrong input”);}
The most generic SSE implementation works in time proportional to the size
of a separators set. Each character from the set is broadcasted into
a vector and then compared with the input vector. The resulting byte mask is
ored with the previous result.
__m128iget_mask(const__m128i&input,const__m128i&initial)const{__m128iresult=initial;for(size_ti=0;i<n+1;i++){const__m128imask=_mm_cmpeq_epi8(letters[i],input);result=_mm_or_si128(result,mask);}returnresult;}
If the separators set is known at compilation time, then a compiler may
unroll the loop.
The SNTI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used
in same cases. The instruction gets two arguments. One is an input vector.
Another one is interpreted either as a set of individual characters (up to 16) or
characters ranges (up to 8). The result is a mask for the input characters matching
the set (or ranges). In the example below the set variant was used.
__m128iget_mask(const__m128i&input,const__m128i&initial){constuint8_tmode=_SIDD_UBYTE_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_UNIT_MASK;return_mm_or_si128(initial,_mm_cmpestrm(set,set_size,input,16,mode));}

It is unlikely, and sometimes impossible, to process all digits from the
input vector. In a single iteration one to 16 bytes are converted. The table
below shows details — an important conclusion is that almost 95% of
patterns process at least half of an input vector.

bytes processed	patterns	%	cumulative %
0	17	0.03%	0.03%
2	32	0.05%	0.07%
3	48	0.07%	0.15%
4	176	0.27%	0.42%
5	360	0.55%	0.97%
6	542	0.83%	1.79%
7	693	1.06%	2.85%
8	1084	1.65%	4.50%
9	1620	2.47%	6.98%
10	2274	3.47%	10.45%
11	3057	4.66%	15.11%
12	4526	6.91%	22.02%
13	6193	9.45%	31.47%
14	8952	13.66%	45.13%
15	12890	19.67%	64.79%
16	23072	35.21%	100.00%

However, the input validation is always done for the whole vector,
all 16 bytes. This problem might be overcome, see Processing larger
inputs.
SSE units are underutilized. The longest one-digit number sequence
is “1;2;3;4;5;6;7;8;”. It contains eight numbers, while the SSE procedure
is able to process 16 numbers. Likewise, the longest two-digit number
sequence is “12;34;56;78;98;;”. It has five numbers, but the SSE procedure
is able to convert eight numbers. Similarly, the longest four-digit numbers
sequence is “1234;5678;9123;;” — it has three four-digit numbers, the SSE
procedure is able to convert four numbers.

SSE procedure	usage in parser	converted numbers	average utilization
input size	converted nums	#	%	avg	max
N/A	—	264	0.40%	—	—	—
1-digit	16	5799	8.85%	3.52	8	22%
2-digit	8	21241	32.41%	3.99	7	49%
4-digit	4	30924	47.18%	3.48	4	87%
8-digit	2	7308	11.15%	1.97	2	98%

Similarly to the unsigned conversion, the signed conversion also requires a
span pattern. But in this case also the sign characters ‘+’ and ‘-‘ are
included in the pattern. For this sample string:
input = “_+123_-45_78_-9_”
the span pattern is:
span_mask = _dddd_ddd_dd_dd_
The conversion unfortunately requires two shuffles:

Shuffle spans onto proper vector elements — this is exactly the same
as in unsigned conversion.
Populate the first span characters, i.e. possibly sign character,
on the proper vector elements.

For input:
input =
[ ‘_’ | ‘-‘ | ‘1’ | ‘2’ | ‘3’ | ‘_’ | ‘+’ | ‘4’ | ‘5’ | ‘_’ | ‘-‘ | ‘6’ | ‘_’ | ‘7’ | ‘8’ | ‘_’ ]
The conversion processes four four-digit numbers. After shuffling the
digit spans we have:
shuffled =
[ ‘-‘ ‘1’ ‘2’ ‘3’ | 0x00 ‘+’ ‘4’ ‘5’ | 0x00 0x00 ‘-‘ ‘6’ | 0x00 0x00 ‘7’ ‘8’ ]
and the vector of sign characters is:
shuffled_signs =
[ ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘+’ ‘+’ ‘+’ ‘+’ | ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘7’ ‘7’ ‘7’ ‘7’ ]
The shuffled_signs vector does not necessarily contain only ‘+’ or ‘-‘,
it might contain also digits, but this is not an error.
Then the conversion works as follows:

From the shuffled vector the sign characters are removed and
ASCII characters are converted into values:
[ 0 1 2 3 | 0 0 4 5 | 0 0 0 6 | 0 0 7 8 ]
Thus we end up with a vector of unsigned numbers, and such an input
can be converted by the already defined SSE procedures:
[ 123 | 45 | 6 | 78 ]
Finally, negate appropriate elements; we use a well known equation (~x + 1),
which is expressed in the SSE code as (x xor (-1) – (-1)). The value -1
is obtained from comparing shuffled_signs with the ‘-‘:
negated_mask = (shuffled_signs == packed_byte(‘-‘))
[ ff ff ff ff | 00 00 00 00 | ff ff ff ff | 00 00 00 00 ]
or as vector of int32_t:
[ -1 | 0 | -1 | 0 ]

Masking of the sign characters in shuffled vector is done at no cost, as
the instruction _mm_subs_epu8 (psubusb) is used to convert from ASCII to
numeric values:
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);
The instruction performs subtract with saturation, i.e. calculates max(0,
a-b). Since the ASCII codes of ‘-‘ and ‘+’ are respectively 43 and 45 and the
ASCII code of ‘0’ is 48, then the result of such expression is zero for both
sign characters.
The characters ‘+’ and ‘-‘ can be present only at the beginning of a digit
span, i.e. we want to reject inputs like “++12” or “1234-,”.
To properly validate the input we need four masks for:

separators,
digits,
the character ‘-‘,
the character ‘+’.

The first phase is similar to the unsigned conversion, i.e. all masks are or’ed
together, then zero elements point at invalid characters.
The second phase checks if sign characters are placed properly. For each span
combination we need to precalculate the mask of valid sign positions. Then,
after or’ing the masks for the ‘+’ and ‘-‘ we simply verify if they are placed
on valid positions.
In practise the precalculated data has a bit-mask for invalid sign positions,
thanks to that validation require just a simple and. The snippet below shows
the schema of the second step.
const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);// …
constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=blocks[span_mask];if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}
AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b)
which does a lookup in a 128-byte table using as the indices the seven lowest bits of
each byte of a ZMM register. Invocation of the intrinsics function is weird:
const__m512ilookup_lo=…// elements 0..63
const__m512ilookup_hi=…// and 64..127 of the lookup
const__m512itransformed=_mm512_permutex2var_epi8(lookup_lo,input,lookup_hi);
If the set of valid character fits in the standard ASCII character set, i.e.
just seven lowest bits are set, then the validation step might quickly reject
extended ASCII (7th bit set) and then translate the input using single
invocation of _mm512_permutex2var_epi8. When full 8-bit input
is allowed, then the instruction has to be called twice with two halves
of a 256-byte lookup.
An invocation (or two) of this instruction allows to classify at once
all 64 input bytes into required categories:

separators — it doesn’t matter how many separator chars are used;
digits;
and sign characters.

Category is encoded as a bit at certain position, in sample implementation
following schema is used:

separator = 0x01,
digit = 0x80,
sign = 0x80 | 0x40.

If the transformed vector contains zero byte, it means there are invalid characters.
The span pattern is obtained from the most significant bits by the instruction
_mm512_movepi8_mask (vpmovb2m); it works exactly the same way as
pmovmaskb from SSE. The sign pattern is obtained by the instruction
_mm512_test_epi8_mask (vptestmb), by testing bit 7th (mask 0x40).
Below is a snippet from the implementation:
const__m512iclasses=_mm512_permutex2var_epi8(class_lo,input,class_hi);if(_mm512_test_epi8_mask(classes,classes)!=uint64_t(-1)){throwstd::logic_error(“invalid character”);}uint64_tspan_mask64=_mm512_movepi8_mask(classes);uint64_tsign_mask64=_mm512_test_epi8_mask(classes,_mm512_set1_epi8(int8_t(0x40)));
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.

Load 16 characters into the input vector.

const__m128iinput=_mm_loadu_si128((const__m128i*)data);

Check for invalid characters (as described above).
Build the span pattern and fetch the conversion parameters.

const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_span=_mm_or_si128(bytemask_digit,bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=block_info[span_mask];

Validate if the sign characters are properly placed.

constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}

If sign_mask is zero, choose faster unsigned conversion path.
Otherwise prepare data for signed conversion.

const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)bi.shuffle_digits);const__m128ishuffle_signs=_mm_loadu_si128((const__m128i*)bi.shuffle_signs);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);const__m128ishuffled_signs=_mm_shuffle_epi8(input,shuffle_signs);const__m128inegate_mask=_mm_cmpeq_epi8(shuffled_signs,ascii_minus);

And finally invoke proper conversion routine.

if(bi.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE2Digits){convert_2digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE4Digits){convert_4digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE8Digits){convert_8digits_signed(shuffled,negate_mask,bi.element_count,output);}else{// scalar conversion
}
As it was stated in Caveats section, a single step of algorithm validates the
whole input vector but rarely converts all bytes from the input. Although
detecting digits and sign characters might be considered cheap, we saw that
validating against an arbitrary set of separators might be really time
consuming.
The idea is to separate the validation from the conversion. In case of the
unsigned conversion only the span_mask is needed; in case of the
signed conversion we need also sign_mask.
In each iteration we process four 16-byte chunks. The result from this step are
64-bit masks. Then, we load 16 bytes into the input vector and use the lower
16-bit part of already calculated mask to get block info for the loaded code.
For unsigned conversion the main loop looks like this:
while(data<end){const__m128iinput=_mm_loadu_si128((const__m128i*)data);// validate input and calculate span_mask
constuint16_tspan_mask=get_span_mask(input);// obtain block info
constBlockInfo&bi=block_info[span_mask];// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;}
And after applying the proposed change:
while(data<end){const__m128iinput0=_mm_loadu_si128((const__m128i*)(data+0*16));const__m128iinput1=_mm_loadu_si128((const__m128i*)(data+1*16));const__m128iinput2=_mm_loadu_si128((const__m128i*)(data+2*16));const__m128iinput3=_mm_loadu_si128((const__m128i*)(data+3*16));// validate input and calculate 4 x span_mask
constuint16_tspan_mask0=get_span_mask(input0);constuint16_tspan_mask1=get_span_mask(input1);constuint16_tspan_mask2=get_span_mask(input2);constuint16_tspan_mask3=get_span_mask(input3);constuint64_tspan_mask64=(uint64_t(span_mask0)<<(0*16))|(uint64_t(span_mask1)<<(1*16))|(uint64_t(span_mask2)<<(2*16))|(uint64_t(span_mask3)<<(3*16));char*ptr=data+3*16;while(data<ptr){constuint16_tspan_mask=span_mask64&0xffff;// obtain block info
constBlockInfo&bi=block_info[span_mask];// read input
const__m128iinput=load16-bytefromdata// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;// shift the mas accordingly
span_mask64>>=bi.total_processed;}}
The problem of this solution is multiple reading the input bytes. Moreover, the
looping schema is more complicated; especially in practice, where also scalar
fallback have to be considered (it is not shown here).
The idea of separation the validation step from the actual conversion might
also be applied for scalar parser. The validation, i.e. building a span_mask
is done by SIMD code. Then, each byte of the span_mask is processed
by a precompiled code for each of 256 cases.
There are two major advantages:
1. Scalar conversions are called with compile-time constants, thanks to
that a compiler can perform as many optimizations as possible.
2. Digits spans that cross byte boundary can be processed seamlessly. When a
span ends at the byte’s end, a conversion routine converts all characters from
the input’s chunk and marks that the conversion is not completed.
When the next pattern is processed, the mark is checked: if the first
bit of the current pattern is zero, it means that the previously converted
number is ready and can be saved. If the first bit is one, it means that the
first span is continuation and thus conversion must be carried on from
the saved point.
Let’s convert following input 24-byte input (the space separates 8-byte chunks):
_12_3_45 6_7890_1 ___99___
The first chunk of input is _12_3_45, i.e. the span_pattern = 0b01101011
= 0x6b. The digit spans are [(1,2), (4,4), (6,7)], following code handles them:
*output++=convert<2>(data+1);*output++=convert<1>(data+4);val=convert<2>(data+6);has_val=true;
The template convert is parametrized by the number of digits to convert and
gets a pointer to the first character. The important fact is such code doesn’t need
to validate pointers, sizes etc. All is known in the compile-time.
The next input’s chunk is 6_7890_1, the digit spans are [(0,0), (2,5), (7,7)].
A variant of template convert that accepts two parameters simple starts
conversion from the saved state (val).
if(has_val){*output++=continue<1>(data+0,val);}else{*output++=convert<1>(data+0);}*output++=convert<4>(data+2);val=convert<1>(data+7);has_val=true;
The last, third chunk, is ___99___, and there is just one digit span [(3,4)]:
if(has_val){*output++=val;has_val=false;}*output++=convert<2>(data+3);has_val=false;
When the input vector contains an array of three-digit numbers, like
“123;456;789;123;”, we still use four-digit conversion routine. But we might
utilize the fact that the most significant digit is always zero.
We use _mm_maddubs_epi16, but with weights 1, 10, 100, 0: two lower digits
are converted and the most significant digit is properly scaled. Then simple
horizontal add _mm_hadd_epi16 adds the intermediate results.
// t0 = [ 0 | 5 | 7 | 9 | 0 | 1 | 2 | 3 | 0 | 9 | 2 | 5 | 0 | 3 | 2 | 7 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t1 = [ 500 | 79 | 100 | 23 | 900 | 25 | 300 | 27 ]
const__m128imul_all=_mm_setr_epi8(0,100,10,1,0,100,10,1,0,100,10,1,0,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 579 | 123 | 925 | 327 | 579 | 123 | 925 | 327 ]
const__m128it2=_mm_hadd_epi16(t1,t1);
Where two four-digits are going to be converted, a faster approach can be used.
Instead of storing each digit on separate bytes, digits are stored on 16-bit
values. Then a single _mm_madd_epi16 is used to multiply all digits by
weights 1, 10, 100, 1000. Then the result’s halves are added with _mm_hadd_epi32
(phaddd).
// input = “12345678”
// t0 = [ 0 | 1 | 0 | 2 | 0 | 3 | 0 | 4 | 0 | 5 | 0 | 6 | 0 | 7 | 0 | 8 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ]
// * [ 1000 | 100 | 10 | 1 | 1000 | 100 | 10 | 1 ]
// t1 = [ 1200 | 34 | 5600 | 78 ]
const__m128imul_all=_mm_setr_epi16(1000,100,10,1,1000,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 1234 | 5678 | 1234 | 5678 ]
const__m128it2=_mm_hadd_epi32(t1,t1);
There are two reference scalar implementations:

a fully specialized, handcrafted parser;
parser based on two standard functions: strspn and strtol.

The handcrafted source:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){enumState{Separator,Plus,Minus,Digit};Statestate=Separator;Stateprev=Separator;boolnegative=false;int32_tnumber=0;for(size_ti=0;i<size;i++){constcharc=data[i];if(c==’+’){state=Plus;}elseif(c==’-‘){state=Minus;}elseif(c>=’0’&&c<=’9’){state=Digit;}elseif(contains(separators,c)){state=Separator;}else{throwstd::runtime_error(“Wrong character (scalar)”);}switch(state){casePlus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘+’ follows a non-separator character)”);}number=0;negative=false;break;caseMinus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ follows a non-separator character)”);}number=0;negative=true;break;caseDigit:if(prev==Separator){number=c-‘0’;negative=false;}else{number=10*number+c-‘0’;}break;caseSeparator:if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}break;}// switch
prev=state;}// for
if(state==Separator){if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}}}
And implementation based on standard functions:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){char*ptr=const_cast<char*>(data);char*end=ptr+size;char*endptr;while(true){ptr+=strspn(ptr,separators);if(ptr==end){break;}errno=0;constlongval=std::strtol(ptr,&endptr,10);// the following check comes from “man 3 strtol”
if((errno==ERANGE&&(val==LONG_MAX||val==LONG_MIN))||(errno!=0&&val==0)){throwstd::logic_error(“invalid input”);}if(endptr==ptr){throwstd::logic_error(“no digits”);}ptr=endptr;*output++=val;}}
This part presents evaluation of algorithms in terms of running time
(performance tests) and also some statistical measurements that might
help in understanding run-time properties.
Following parameters of the input data are being changed:

Input size in bytes.
Distribution of number of digits:
- fixed size from 1 to 8 digits [only in overall performance tests];
- uniform distribution over range 1 to k, where k = 1..8;
- Gaussian distribution with mu = 1..8 (sigma = 1.0).
Distribution of count of separators between numbers:
- exactly one separator character,
- uniform distribution of 1 to 6 separators.
Uniform distribution of the set: no-sign-char, ‘+’ and ‘-‘.

Caveats shows statistics of 1) average processed bytes and 2) SSE
utilization. This section presents real numbers gathered during
conversion of 1000000 bytes containing 1-8 digit signed numbers.

loops	90’790
… different span patterns	1’315
total numbers converted	162’228	100.00%
conversion handled by scalar fallbacks	13’601	8.38%
conversions done by SSE procedures	148’627	91.62%
… unsigned SSE parser	6’222	N/A
… signed SSE parser	142’405

As we see, most conversions were done by the SSE code; moreover, most
conversions went through the signed parser, which is slower than unsigned
counterpart.
Another important number, which may give a hint about possible cache misses
to the block_info structure, is the count of different span patterns.
It is 1315 (~2% of the whole table size, 65536), but only 120 of them were
processed in 75% iterations. The next section
shows this parameter in details.
Below are detailed statistics for SSE routines:

calls — how many times the procedure was called;
converted — how many numbers were parsed;
conv/call — average utilization.

Conversions 1-/2-/4-/8-digits are described in section SSE conversion
capabilities, the 3-digit procedure in
Appendix A — conversion of three-digit numbers.

procedure	unsigned path	signed path
calls	converted	conv/call	calls	converted	conv/call
1-digit conversion	325	410	1.26	N/A	N/A	N/A
2-digit conversion	21	48	2.29	1949	3521	1.81
3-digit conversion	21	43	2.05	1409	2246	1.59
4-digit conversion	239	581	2.43	10569	26255	2.48
8-digit conversion	2887	5140	1.78	59769	110383	1.85

The table below shows how many bytes from the input vector were processed
in single iteration. It is pretty close to average numbers showed in Caveats.

bytes processed	patterns	%	cumulative %
0	0	0.00%	0.00%
1	0	0.00%	0.00%
2	197	0.26%	0.26%
3	583	0.76%	1.01%
4	773	1.00%	2.01%
5	1137	1.47%	3.48%
6	1841	2.39%	5.87%
7	2544	3.30%	9.17%
8	5305	6.87%	16.04%
9	9135	11.83%	27.87%
10	7163	9.28%	37.15%
11	8834	11.44%	48.60%
12	10102	13.09%	61.68%
13	9342	12.10%	73.79%
14	8221	10.65%	84.44%
15	6743	8.74%	93.17%
16	5269	6.83%	100.00%

The lookup table is quite large: in the sample implementation single record
occupies 38 bytes, it gives almost 2,5 MiB. This section presents analysis of
span_pattern usage during conversion of various inputs. Following parameters
are collated:

The number of different span patterns that appear during conversion.
This is explained in details below.
The number of CPU clocks per input byte (on Skylake); the results were
directly get from Performance comparison.
Both cache miss and branch miss ratios (the same Skylake); these
results were obtained by running separate bin/benchmark-hwevents tool.
The ratios should give a hint about real-workload penalties.

During a conversion a histogram is build. The histogram contains span_pattern
and the number of its occurrences; the histogram is sorted by the frequency.
Having such a sorted histogram it is easy to estimate how many distinct
patterns appear in most iterations.
Let’s analyse sample, made up, histogram:

#	span pattern	frequency	cumulative
1	0xaaaa	1	1
2	0xbbbb	1	2
3	0x4444	1	3
4	0xdddd	1	4
5	0xeeee	2	6
6	0xffff	2	8
7	0x1111	3	11
8	0x3333	5	16
9	0x5555	10	26
10	0x7777	17	43

There are 10 different patterns that were used in 43 loops. We can read
from the table that in 1/4 of iterations (11 of 43) were used even seven
patterns. Most of processing uses just three patterns. It means that only a
small portion of the large lookup is actually touched.
The tables in following sections have got data processed in this way. Each column
contains number of distinct patterns that take part in given percentage of the
iterations (25%, 50%, 75%, 95% and all, 100%).
The results for various input sizes are available in the repository, below
are shown only 4kB and 64kB inputs.

parameters	distinct span masks count	cycles per byte	branches	cache references
	< 25%	< 50%	< 75%	< 95%	100%	min	avg	taken	mispredicted	ratio	count	missed	ratio
Fixed length
1 digit, single separator character	32	44	50	52	53	3.095	3.165	26085	633	2.43%	1531	415	27.11%
2 digits, single separator character	12	16	18	19	20	3.305	3.400	25270	485	1.92%	1333	114	8.55%
3 digits, single separator character	5	7	8	8	9	2.533	2.584	21809	247	1.13%	954	63	6.60%
4 digits, single separator character	4	5	6	6	7	3.193	3.231	24145	336	1.39%	926	49	5.29%
5 digits, single separator character	2	3	4	4	5	2.648	2.680	20788	241	1.16%	1067	24	2.25%
6 digits, single separator character	2	4	4	4	5	2.312	2.347	18289	233	1.27%	915	19	2.08%
7 digits, single separator character	2	2	2	2	3	4.121	4.167	25682	313	1.22%	780	6	0.77%
8 digits, single separator character	1	3	3	3	4	5.495	5.592	44188	837	1.89%	836	25	2.99%
Uniform distribution
1 .. 1 digit, single separator character	42	54	60	62	63	3.162	3.288	25992	619	2.38%	1618	334	20.64%
1 .. 2 digits, single separator character	71	102	118	126	128	3.201	3.268	26561	465	1.75%	2118	655	30.93%
1 .. 3 digits, single separator character	72	112	134	146	148	2.891	2.941	24393	514	2.11%	2522	663	26.29%
1 .. 4 digits, single separator character	87	135	159	170	172	3.398	3.463	25831	906	3.51%	2480	667	26.90%
1 .. 5 digits, single separator character	92	136	163	177	179	3.495	3.553	25865	806	3.12%	2509	548	21.84%
1 .. 6 digits, single separator character	90	144	168	178	180	3.494	3.557	24943	639	2.56%	2591	627	24.20%
1 .. 7 digits, single separator character	86	125	143	151	153	3.259	3.324	23730	579	2.44%	2137	476	22.27%
1 .. 8 digits, single separator character	95	135	157	166	167	3.828	3.894	28964	859	2.97%	2074	318	15.33%
Gaussian distribution
max at 1 digit, single separator character	76	133	161	174	176	3.218	3.264	25990	485	1.87%	2658	799	30.06%
max at 2 digit, single separator character	77	129	157	170	173	3.122	3.189	24951	686	2.75%	2803	455	16.23%
max at 3 digit, single separator character	76	103	115	120	122	3.192	3.246	24812	799	3.22%	3038	256	8.43%
max at 4 digit, single separator character	43	57	63	65	66	3.134	3.215	23754	414	1.74%	1650	106	6.42%
max at 5 digit, single separator character	32	38	41	43	44	2.822	2.875	21380	346	1.62%	1320	93	7.05%
max at 6 digit, single separator character	25	30	33	34	35	2.887	2.935	21765	496	2.28%	1223	79	6.46%
max at 7 digit, single separator character	20	22	24	24	25	3.921	3.983	29234	794	2.72%	1096	40	3.65%
max at 8 digit, single separator character	11	12	13	14	15	4.825	4.887	37274	1015	2.72%	944	44	4.66%
Fixed length
1 digit, 1 .. 6 separator characters	65	131	197	223	227	3.331	3.404	18979	677	3.57%	3111	1581	50.82%
2 digits, 1 .. 6 separator characters	67	135	175	195	198	3.473	3.529	20588	799	3.88%	2858	1287	45.03%
3 digits, 1 .. 6 separator characters	68	119	147	157	158	2.747	2.806	18629	481	2.58%	2728	1150	42.16%
4 digits, 1 .. 6 separator characters	69	101	118	124	125	2.641	2.695	18672	488	2.61%	1919	613	31.94%
5 digits, 1 .. 6 separator characters	53	71	78	81	82	2.748	2.824	19009	530	2.79%	1638	372	22.71%
6 digits, 1 .. 6 separator characters	43	50	53	55	56	2.958	3.002	19989	450	2.25%	1304	208	15.95%
7 digits, 1 .. 6 separator characters	12	14	17	18	19	3.316	3.341	20881	300	1.44%	1058	52	4.91%
8 digits, 1 .. 6 separator characters	33	46	54	57	58	4.302	4.346	34811	728	2.09%	1377	206	14.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters	66	132	198	232	237	3.197	3.268	19319	685	3.55%	2972	1273	42.83%
1 .. 2 digits, 1 .. 6 separator characters	66	132	198	238	243	3.025	3.081	19752	842	4.26%	3186	1247	39.14%
1 .. 3 digits, 1 .. 6 separator characters	67	135	203	233	238	3.070	3.150	19453	743	3.82%	3357	1216	36.22%
1 .. 4 digits, 1 .. 6 separator characters	70	140	210	250	256	3.390	3.477	19640	897	4.57%	3586	1226	34.19%
1 .. 5 digits, 1 .. 6 separator characters	71	142	213	242	247	3.338	3.401	19580	823	4.20%	3170	940	29.65%
1 .. 6 digits, 1 .. 6 separator characters	71	143	207	233	237	3.288	3.375	19457	785	4.03%	2916	807	27.67%
1 .. 7 digits, 1 .. 6 separator characters	74	148	208	228	230	3.304	3.375	19268	761	3.95%	2985	904	30.28%
1 .. 8 digits, 1 .. 6 separator characters	79	159	209	227	230	3.692	3.766	23045	881	3.82%	2946	727	24.68%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters	66	133	200	244	249	3.154	3.216	19347	900	4.65%	3515	1060	30.16%
max at 2 digit, 1 .. 6 separator characters	67	135	203	241	246	3.155	3.229	19299	786	4.07%	3340	892	26.71%
max at 3 digit, 1 .. 6 separator characters	69	139	205	229	233	3.202	3.278	19382	804	4.15%	3188	661	20.73%
max at 4 digit, 1 .. 6 separator characters	70	141	180	198	201	2.914	2.978	18669	611	3.27%	2775	692	24.94%
max at 5 digit, 1 .. 6 separator characters	74	131	156	164	166	3.147	3.217	19276	586	3.04%	2291	552	24.09%
max at 6 digit, 1 .. 6 separator characters	70	93	101	105	106	3.196	3.257	20806	493	2.37%	1719	237	13.79%
max at 7 digit, 1 .. 6 separator characters	64	80	87	91	92	3.597	3.646	24974	622	2.49%	1599	251	15.70%
max at 8 digit, 1 .. 6 separator characters	56	74	82	87	88	3.994	4.055	29874	848	2.84%	1587	231	14.56%

parameters	distinct span masks count	cycles per byte	branches	cache references
	< 25%	< 50%	< 75%	< 95%	100%	min	avg	taken	mispredicted	ratio	count	missed	ratio
Fixed length
1 digit, single separator character	61	74	81	84	85	3.619	3.655	393175	6529	1.66%	13469	974	7.23%
2 digits, single separator character	13	17	20	21	22	3.727	3.754	374164	3451	0.92%	11212	286	2.55%
3 digits, single separator character	4	6	8	8	9	2.839	2.852	328065	829	0.25%	7539	101	1.34%
4 digits, single separator character	4	5	6	6	7	3.495	3.507	361605	2429	0.67%	7148	89	1.25%
5 digits, single separator character	1	2	3	3	4	2.921	2.935	313419	849	0.27%	7335	119	1.62%
6 digits, single separator character	1	2	3	3	4	2.545	2.555	272684	741	0.27%	8704	58	0.67%
7 digits, single separator character	1	2	2	2	3	4.407	4.417	386099	2206	0.57%	5224	27	0.52%
8 digits, single separator character	2	3	4	4	5	7.094	7.135	666950	13453	2.02%	5192	75	1.44%
Uniform distribution
1 .. 1 digit, single separator character	59	72	80	82	83	3.627	3.693	393231	6582	1.67%	14582	973	6.67%
1 .. 2 digits, single separator character	236	280	299	308	310	3.908	3.935	397134	5214	1.31%	19011	2176	11.45%
1 .. 3 digits, single separator character	355	420	450	466	469	3.565	3.590	361819	4669	1.29%	21347	1858	8.70%
1 .. 4 digits, single separator character	416	493	523	537	540	5.329	5.387	391117	12854	3.29%	24169	1621	6.71%
1 .. 5 digits, single separator character	444	510	543	556	558	5.270	5.328	390345	11393	2.92%	19020	1568	8.24%
1 .. 6 digits, single separator character	423	492	525	535	537	4.589	4.633	372174	7872	2.12%	19917	1282	6.44%
1 .. 7 digits, single separator character	392	462	481	491	493	4.174	4.218	355907	6263	1.76%	15979	949	5.94%
1 .. 8 digits, single separator character	519	598	620	630	631	5.217	5.273	423108	11289	2.67%	16635	695	4.18%
Gaussian distribution
max at 1 digit, single separator character	481	587	630	645	648	4.150	4.189	395482	6382	1.61%	25467	1620	6.36%
max at 2 digit, single separator character	454	534	570	586	589	4.104	4.143	371930	6882	1.85%	25775	739	2.87%
max at 3 digit, single separator character	300	330	344	350	352	4.990	5.050	375961	11571	3.08%	15877	351	2.21%
max at 4 digit, single separator character	132	147	153	155	156	3.760	3.814	356129	4522	1.27%	10634	220	2.07%
max at 5 digit, single separator character	84	90	93	95	96	3.031	3.045	318232	1548	0.49%	8123	177	2.18%
max at 6 digit, single separator character	54	59	62	64	65	3.479	3.499	328097	6041	1.84%	8491	366	4.31%
max at 7 digit, single separator character	36	39	41	41	42	5.520	5.560	444157	11932	2.69%	5814	124	2.13%
max at 8 digit, single separator character	22	23	24	25	26	6.897	6.946	562307	15610	2.78%	5449	120	2.20%
Fixed length
1 digit, 1 .. 6 separator characters	1052	1626	1870	1950	1958	4.371	4.408	280108	7580	2.71%	34860	8289	23.78%
2 digits, 1 .. 6 separator characters	693	909	995	1026	1029	5.044	5.108	308039	12071	3.92%	26297	4481	17.04%
3 digits, 1 .. 6 separator characters	354	438	480	494	496	3.334	3.358	273155	3865	1.41%	17568	2647	15.07%
4 digits, 1 .. 6 separator characters	177	224	246	252	254	3.311	3.335	277306	4944	1.78%	10967	1361	12.41%
5 digits, 1 .. 6 separator characters	100	126	134	137	138	3.648	3.677	284172	7099	2.50%	7727	742	9.60%
6 digits, 1 .. 6 separator characters	69	77	80	82	83	3.675	3.703	297939	5210	1.75%	6470	402	6.21%
7 digits, 1 .. 6 separator characters	46	49	51	53	54	3.662	3.673	315611	2311	0.73%	6077	308	5.07%
8 digits, 1 .. 6 separator characters	30	45	54	58	59	5.735	5.782	532039	10908	2.05%	6292	438	6.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters	1053	1643	1896	1973	1981	4.364	4.396	279251	7676	2.75%	34478	6144	17.82%
1 .. 2 digits, 1 .. 6 separator characters	1066	1792	2104	2212	2221	5.090	5.154	288339	12133	4.21%	39995	5781	14.45%
1 .. 3 digits, 1 .. 6 separator characters	1084	1802	2120	2226	2236	4.974	5.031	286026	10029	3.51%	44877	5187	11.56%
1 .. 4 digits, 1 .. 6 separator characters	1120	1808	2084	2173	2182	5.332	5.395	287345	11700	4.07%	42918	3871	9.02%
1 .. 5 digits, 1 .. 6 separator characters	1141	1713	1959	2031	2039	5.078	5.153	287785	10589	3.68%	40253	2950	7.33%
1 .. 6 digits, 1 .. 6 separator characters	1142	1617	1803	1856	1860	4.888	4.960	286913	10124	3.53%	35625	2229	6.26%
1 .. 7 digits, 1 .. 6 separator characters	1091	1503	1646	1678	1681	4.970	5.033	287420	11059	3.85%	30932	3218	10.40%
1 .. 8 digits, 1 .. 6 separator characters	1109	1476	1598	1619	1622	5.650	5.711	339596	14223	4.19%	28870	2243	7.77%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters	1070	1975	2405	2551	2565	5.389	5.456	285670	12734	4.46%	46840	6140	13.11%
max at 2 digit, 1 .. 6 separator characters	1094	1862	2174	2283	2293	5.125	5.177	288202	10818	3.75%	43779	3228	7.37%
max at 3 digit, 1 .. 6 separator characters	1021	1440	1610	1666	1672	4.927	4.986	284545	10542	3.70%	34647	1690	4.88%
max at 4 digit, 1 .. 6 separator characters	715	905	985	1010	1013	4.072	4.129	280545	8310	2.96%	20920	931	4.45%
max at 5 digit, 1 .. 6 separator characters	445	534	567	575	577	4.127	4.167	286377	8569	2.99%	14319	651	4.55%
max at 6 digit, 1 .. 6 separator characters	296	336	346	351	352	4.025	4.054	312734	6394	2.04%	11762	402	3.42%
max at 7 digit, 1 .. 6 separator characters	185	207	214	219	220	4.481	4.520	376654	8042	2.14%	7924	452	5.70%
max at 8 digit, 1 .. 6 separator characters	110	133	141	146	148	5.532	5.576	448317	12621	2.82%	7779	821	10.55%

Following procedures are tested:

scalar — the hand-coded validating scalar parser;
scalar (std) — also the hand-coded parser,
but use only standard functions strtol and strspn;
scalar hybrid — the implementation of Scalar hybrid;
SSE — the implementation of the base SSE algorithm;
SSE (block) — the implementation of variant described in Processing
larger inputs
SSE (simplified) — the SSE variant with relaxed delimiters definition.

There are two tests approaches:

Overall, where rather huge input data is parsed (a few megabytes) and
the wall clock is used to measure the time of the whole procedure. Procedures
are repeated several times (10, 100, 1000 or more), the minimum running
time is noted.
To run these tests type make report-overall.rst; the tests suite uses executable
bin/benchmark.
The results from various computers are available in the repository.
Microbenchmarks, where small inputs are parsed and exact number of CPU
cycles is obtained. Procedures are also repeated many times (1000) and then
minumum and average cycles to process a single input byte are calculated.
To run these tests type make microbenchmarks.rst; the tests suite uses
executable bin/benchmark-cpuclocks.
The all results from various computers are available in the repository,
below are only the measurements from the Skylake computer.

Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
The table below compares speedup to the reference procedure for various
input sizes and data distributions. Detailed measurements for each size and
different distribution parameters are presented in the subsequent sections.

	speedup over scalar procedure
	scalar (std)	scalar (hybrid)	SSE	SSE (block)	SSE (simplified)
size [B]	distribution	samples	min	avg	max	min	avg	max	min	avg	max	min	avg	max	min	avg	max
1024	Gaussian distribution	16	0.45	0.87	1.29	1.65	2.36	3.26	1.53	3.21	5.44	1.66	3.16	4.97	1.58	3.37	5.78
1024	Fixed length	16	0.42	0.87	1.29	1.63	2.41	3.53	1.28	3.32	5.23	1.47	3.30	5.05	1.30	3.54	5.79
1024	Uniform distribution	16	0.40	0.86	1.32	1.64	2.41	3.36	2.04	3.70	5.54	2.22	3.53	4.97	2.13	3.88	5.79
4096	Gaussian distribution	16	0.39	0.69	1.24	1.69	3.01	4.67	1.91	4.56	7.21	2.13	4.88	7.84	2.01	4.77	7.68
4096	Fixed length	16	0.36	0.69	1.39	1.94	2.87	4.32	1.60	4.44	6.86	1.71	4.66	8.34	1.62	4.70	7.83
4096	Uniform distribution	16	0.43	0.78	1.38	2.07	3.24	4.55	3.19	5.33	7.40	3.36	5.61	7.87	3.27	5.61	7.79
65536	Gaussian distribution	16	0.45	0.77	1.30	1.40	1.74	2.13	1.73	4.46	5.88	1.76	4.57	6.00	1.77	4.60	6.25
65536	Fixed length	16	0.40	0.75	1.36	1.55	1.87	2.37	1.46	4.68	6.72	1.48	5.00	7.31	1.50	4.95	7.53
65536	Uniform distribution	16	0.55	0.86	1.36	1.38	1.75	2.34	3.32	4.98	6.39	3.29	5.01	6.52	3.43	5.20	6.89
102400	Gaussian distribution	16	0.46	0.77	1.26	1.38	1.71	2.03	1.78	4.28	5.66	1.81	4.44	5.81	1.82	4.42	6.00
102400	Fixed length	16	0.40	0.75	1.36	1.53	1.84	2.31	1.29	4.59	6.49	1.46	4.91	6.99	1.31	4.85	7.16
102400	Uniform distribution	16	0.54	0.86	1.38	1.35	1.71	2.32	3.22	4.78	6.38	3.21	4.83	6.43	3.31	4.97	6.85

Gaussian distribution (max at 8 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	8.782	9.103	1.00
scalar (std)	22.874	23.586	0.38
scalar (hybrid)	5.005	5.148	1.75
SSE	4.898	4.974	1.79
SSE (block)	4.603	4.651	1.91
SSE (simplified)	4.687	4.730	1.87

Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	12.729	13.130	1.00
scalar (std)	17.501	18.150	0.73
scalar (hybrid)	4.067	4.273	3.13
SSE	4.002	4.069	3.18
SSE (block)	3.903	3.945	3.26
SSE (simplified)	3.914	3.970	3.25

Gaussian distribution (max at 7 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	8.836	9.191	1.00
scalar (std)	23.365	24.139	0.38
scalar (hybrid)	5.104	5.250	1.73
SSE	4.129	4.195	2.14
SSE (block)	3.476	3.518	2.54
SSE (simplified)	3.975	4.039	2.22

Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	13.646	14.115	1.00
scalar (std)	18.029	18.596	0.76
scalar (hybrid)	4.042	4.274	3.38
SSE	3.591	3.644	3.80
SSE (block)	3.294	3.334	4.14
SSE (simplified)	3.431	3.481	3.98

Gaussian distribution (max at 6 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	9.434	9.852	1.00
scalar (std)	24.766	25.547	0.38
scalar (hybrid)	5.571	5.728	1.69
SSE	2.833	2.882	3.33
SSE (block)	2.977	3.024	3.17
SSE (simplified)	2.738	2.786	3.45

Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	14.200	14.750	1.00
scalar (std)	17.992	18.563	0.79
scalar (hybrid)	4.091	4.330	3.47
SSE	3.207	3.266	4.43
SSE (block)	2.885	2.932	4.92
SSE (simplified)	3.058	3.113	4.64

Gaussian distribution (max at 5 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	10.384	10.864	1.00
scalar (std)	25.763	26.495	0.40
scalar (hybrid)	6.102	6.273	1.70
SSE	2.786	2.827	3.73
SSE (block)	2.710	2.744	3.83
SSE (simplified)	2.699	2.752	3.85

Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	15.168	15.622	1.00
scalar (std)	18.088	18.872	0.84
scalar (hybrid)	4.244	4.442	3.57
SSE	3.023	3.096	5.02
SSE (block)	2.864	2.934	5.30
SSE (simplified)	2.908	2.984	5.22

Gaussian distribution (max at 1 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	18.091	18.877	1.00
scalar (std)	34.484	35.258	0.52
scalar (hybrid)	7.037	7.275	2.57
SSE	3.160	3.213	5.72
SSE (block)	3.282	3.352	5.51
SSE (simplified)	3.120	3.194	5.80

Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	21.390	22.112	1.00
scalar (std)	18.566	19.088	1.15
scalar (hybrid)	5.160	5.286	4.15
SSE	3.067	3.126	6.97
SSE (block)	2.904	2.963	7.37
SSE (simplified)	2.961	3.022	7.22

Gaussian distribution (max at 4 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	11.966	12.550	1.00
scalar (std)	26.982	27.705	0.44
scalar (hybrid)	5.896	6.082	2.03
SSE	3.234	3.284	3.70
SSE (block)	3.026	3.074	3.95
SSE (simplified)	3.117	3.183	3.84

Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	16.371	16.967	1.00
scalar (std)	17.832	18.445	0.92
scalar (hybrid)	4.173	4.395	3.92
SSE	2.909	3.001	5.63
SSE (block)	2.760	2.813	5.93
SSE (simplified)	2.848	2.908	5.75

Gaussian distribution (max at 3 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	13.280	13.948	1.00
scalar (std)	29.238	30.157	0.45
scalar (hybrid)	6.225	6.450	2.13
SSE	3.225	3.309	4.12
SSE (block)	2.963	3.012	4.48
SSE (simplified)	3.238	3.286	4.10

Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	18.193	18.831	1.00
scalar (std)	18.642	19.287	0.98
scalar (hybrid)	4.119	4.246	4.42
SSE	3.126	3.190	5.82
SSE (block)	3.016	3.090	6.03
SSE (simplified)	3.000	3.048	6.06

Gaussian distribution (max at 2 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	15.942	16.733	1.00
scalar (std)	32.551	33.334	0.49
scalar (hybrid)	6.372	6.605	2.50
SSE	3.130	3.190	5.09
SSE (block)	3.041	3.121	5.24
SSE (simplified)	3.020	3.082	5.28

Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	20.088	20.800	1.00
scalar (std)	18.382	19.013	1.09
scalar (hybrid)	4.574	4.835	4.39
SSE	3.184	3.264	6.31
SSE (block)	3.018	3.069	6.66
SSE (simplified)	2.988	3.042	6.72

Uniform distribution (1 .. 1 digit) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	18.677	19.206	1.00
scalar (std)	34.479	35.207	0.54
scalar (hybrid)	7.953	8.184	2.35
SSE	3.014	3.099	6.20
SSE (block)	3.117	3.172	5.99
SSE (simplified)	2.977	3.062	6.27

Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	21.981	22.621	1.00
scalar (std)	16.860	17.391	1.30
scalar (hybrid)	5.285	5.481	4.16
SSE	3.029	3.085	7.26
SSE (block)	2.942	3.000	7.47
SSE (simplified)	3.099	3.179	7.09

Uniform distribution (1 .. 2 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	17.695	18.444	1.00
scalar (std)	34.917	35.633	0.51
scalar (hybrid)	7.245	7.401	2.44
SSE	3.324	3.387	5.32
SSE (block)	3.323	3.398	5.33
SSE (simplified)	3.236	3.321	5.47

Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	21.164	21.860	1.00
scalar (std)	18.567	19.126	1.14
scalar (hybrid)	4.937	5.156	4.29
SSE	2.940	2.995	7.20
SSE (block)	2.924	2.975	7.24
SSE (simplified)	2.785	2.826	7.60

Uniform distribution (1 .. 3 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	16.385	17.254	1.00
scalar (std)	33.359	34.075	0.49
scalar (hybrid)	6.877	7.145	2.38
SSE	2.892	2.947	5.67
SSE (block)	3.088	3.158	5.31
SSE (simplified)	2.752	2.844	5.95

Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	20.530	21.221	1.00
scalar (std)	18.995	19.580	1.08
scalar (hybrid)	4.647	4.811	4.42
SSE	3.037	3.089	6.76
SSE (block)	2.941	2.992	6.98
SSE (simplified)	2.914	2.972	7.05

Uniform distribution (1 .. 4 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	14.926	15.613	1.00
scalar (std)	31.275	32.210	0.48
scalar (hybrid)	6.445	6.677	2.32
SSE	3.393	3.458	4.40
SSE (block)	3.126	3.184	4.77
SSE (simplified)	3.293	3.344	4.53

Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	19.047	19.920	1.00
scalar (std)	19.500	20.130	0.98
scalar (hybrid)	4.539	4.723	4.20
SSE	3.325	3.383	5.73
SSE (block)	3.132	3.193	6.08
SSE (simplified)	3.182	3.235	5.99

Uniform distribution (1 .. 5 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	13.748	14.405	1.00
scalar (std)	30.105	30.941	0.46
scalar (hybrid)	6.144	6.413	2.24
SSE	3.475	3.533	3.96
SSE (block)	3.171	3.238	4.34
SSE (simplified)	3.293	3.356	4.17

Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	18.297	19.010	1.00
scalar (std)	19.209	19.845	0.95
scalar (hybrid)	4.477	4.652	4.09
SSE	3.293	3.352	5.56
SSE (block)	3.135	3.188	5.84
SSE (simplified)	3.182	3.230	5.75

Uniform distribution (1 .. 6 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	12.912	13.548	1.00
scalar (std)	29.511	30.337	0.44
scalar (hybrid)	5.958	6.253	2.17
SSE	3.430	3.487	3.76
SSE (block)	3.201	3.265	4.03
SSE (simplified)	3.283	3.337	3.93

Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	17.288	17.956	1.00
scalar (std)	19.250	19.841	0.90
scalar (hybrid)	4.356	4.481	3.97
SSE	3.205	3.276	5.39
SSE (block)	2.990	3.046	5.78
SSE (simplified)	3.116	3.172	5.55

Uniform distribution (1 .. 7 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	11.974	12.552	1.00
scalar (std)	27.884	28.562	0.43
scalar (hybrid)	6.214	6.385	1.93
SSE	3.253	3.309	3.68
SSE (block)	3.057	3.100	3.92
SSE (simplified)	3.119	3.167	3.84

Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	16.583	17.366	1.00
scalar (std)	18.947	19.528	0.88
scalar (hybrid)	4.265	4.401	3.89
SSE	3.248	3.311	5.11
SSE (block)	2.989	3.048	5.55
SSE (simplified)	3.165	3.215	5.24

Uniform distribution (1 .. 8 digits) — single separator character

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	11.637	12.165	1.00
scalar (std)	27.107	27.851	0.43
scalar (hybrid)	6.079	6.231	1.91
SSE	3.821	3.870	3.05
SSE (block)	3.492	3.550	3.33
SSE (simplified)	3.688	3.745	3.16

Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters

	cycles per one input byte	speed-up
procedure	min	avg	(min)
scalar	15.739	16.589	1.00
scalar (std)	18.875	19.563	0.83
scalar (hybrid)	4.335	4.463	3.63
SSE	3.611	3.667	4.36
SSE (block)	3.337	3.400	4.72
SSE (simplified)	3.490	3.538	4.51