Parsing series of integers with SIMD
While conversion from a string into an integer value is feasible with SIMD
instructions, this application is unpractical. For typical cases, when a single
value is parsed, scalar procedures — like the standard atoi or
strtol — are faster than any fancy SSE code.
However, SIMD procedures can be really fast and convert in parallel several
numbers. There is only one “but”: the input data has to be regular and valid,
i.e. the input string must contain only ASCII digits. Recently, I updated
article about SSE parsing with the benchmark results. The
speed-ups are really impressive, for example the SSSE3 parser is 7 to 9 times
faster than a naive, scalar code.
The obvious question is how these powerful SIMD procedures can be used to
convert real data? By real I mean possibly broken inputs that contain series
of numbers of different length separated with characters from a predefined set.
In this text I try to answer that question; major contributions of this
article are:
- Methods to efficiently parse and validate such strings using SSE
instructions. There is a special variant that handles only unsigned
numbers and also a fully featured variant for signed numbers. - Boosting the SSE procedure with AVX2 or AVX512 instructions when
possible. - A way to combine some of SIMD techniques with scalar conversions.
The article starts with unsigned conversion, because it is easier than a signed
one. The signed conversion shares the core idea, it just adds some extra steps.
The text is accompanied with BSD-licensed software, that includes fully
functional implementations alongside the programs which validate and benchmark
the procedures.
The input is defined as follows:
- It is a string of arbitrary length; but it should be large, say a few
kilobytes or megabytes. - The input contains integer numbers. A number is an optional sign character
‘+’ or ‘-‘ followed by a non-empty sequence of digits ‘0’..’9′. - Numbers are separated with characters from a user-defiend set;
for instance, it might be the comma and blank characters (space, new
line, etc.). - All characters other than digits, separators and signs are considered
invalid. - Parser must be able to check validity of the input, i.e. lack of invalid
characters and the proper input format (numbers are separated with at
least one character, sign characters are present only at the beginning
of a number).
For the input string “123; -52, +432424 -999; 1234568, +879”, and three
separators ‘,’, ‘;’ and ‘ ‘, an algorithm is expected to extract six integers:
[123, -52, 432424, -999, 1234568, 879]. The order of converted numbers must be
exactly the same as in the input string.
A simplified parser is also tested. This parser simply considers all
non-digit and non-sign characters as delimiters.
The conversion algorithm works in a loop, in each iteration exactly one input
vector is loaded from memory and processed.
- The vector is validated.
- Depending on the layout of the digits, the input is normalized to the
format required by the SIMD procedures. At this step we choose: how many
numbers from the input are converted and which SIMD procedure is suitable
for conversion. - After conversion, the input pointer is advanced and the loop continues.
Since SIMD procedures impose specific layout of digits within a vector, an
arbitrary input has to be properly aligned. In order to do this, we need to
identify spans of digits in the input and move each span onto certain
subarrays of a vector. Let’s assume at this point that the input contains either
digits or separators (denoted with _).
For instance, the vector with one-digit numbers 1___2_3__4__5_6_ must be
transformed into 123456__________. Then an SSE procedure will convert in
parallel all the numbers into the array [1, 2, 3, 4, 5, 6]. Likewise, the vector
with two-digit numbers _12__34___56_78_ must be transformed into
12345678________ and then the result will be [12, 34, 56, 78].
Let’s consider a more complicated case, when a string has numbers with
different count of digits. There are a few ways to convert the input
_1_2_34_567_89__:
- If we choose conversion of one-digit numbers, then just the two first spans
can be converted — because we need to keep the order of numbers from the input.
After normalization the input into 12______________ just two values [1,
2] will be produced. The input’s tail _34_567_89__ remain untouched. - If we choose conversion of two-digit numbers, then the three first spans
can be converted. Shorter numbers are completed with zeros, then normalized
input is 010234__________. The result is [1, 2, 34]; this time
a bit shorter input’s tail _567_89__ remain untouched. - Finally, if we choose conversion of four-digit numbers, then the four first
spans can be converted. Again, shorter numbers are completed with zeros,
and normalized input is 0001000200340567. The result is [1, 2, 34, 567],
but still the chunk’s tail, i.e. _89__, is unprocessed.
We can see that in order to convert given span combination we need to know:
- How to shuffle bytes in the input vector?
- Which SIMD procedure can be used then?
- How many numbers converted by the SIMD procedure must be stored?
Obtaining this information seems to be quite complicated, especially when
we look at the last example. Fortunately, all parameters can
be precalculated. A span combination can be saved as a bit-pattern,
where ones represent digits. For example, from vector _1_2_34_567_89__
we get the span pattern0b0101011011101100 = 0x56ec.
A span pattern is used to fetch a record from the precalculated array.
The record contains following fields:
- shuffle_digits — an array of 16 bytes, which is the argument for
the instruction _mm_shuffle_epi8 (pshufb); the instruction
moves bytes at certain positions; - conversion_routine — an enumeration that selects an SSE conversion
procedure; for instance, it tells that shuffled input is an array
of two-digit numbers; - element_count — the number of elements from the SSE conversion
procedure that must be stored in the output collection.
The solution with a precalculated array is suitable only for SSE, as span
patterns have 16 bits. In cases of AVX2 and AVX512, where vectors are wider,
such a table would be simply too large, respectively 232 or
264 entries. Additionally, the AVX2 version of pshufb instruction
works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to
shuffle all inputs.
But AVX2 and AVX512 instructions still might be used in some parts of
algorithms, especially in input validation.
To build the whole precalculated table all 65536 span patterns have to be
processed. Processing a pattern consists following steps:
- Determine digit spans. For pattern _ddd__d__ddd_dd_ we have four spans
[(1,3), (6,6), (9,11), (14,15)]. But for pattern ____dddd___ddddd there
is only one span [(4,7)]. The span at the vector’s end might be continued in
the next chunk, and we don’t know how to convert it. However, we may assume
that a span which starts at the vector’s beginning is a new number and thus
it is included into the span’s list. - Find the best conversion scheme for the spans. We systematically check
which SSE conversion is possible, i.e. one-/two-/four-/eight-digit and how
many spans each conversion can process — the procedure which processes the
most items is selected. The output from this step are values for the following
record fields:- element_size — 1, 2, 4, 8 or 0 if SSE conversion is not possible;
- element_count — the numbers of converted spans;
- total_skip — how many input bytes are processed; for pattern
____dddd________ we know that the whole input is processed
(total_skip=16); for pattern ____d____ddddd_d we process only the
first span at index 5, but we know that four non-digit characters after
the span can be also skipped, thus total_skip is 9.
- Construct the shuffle pattern. Having the element size and spans positions
the vector shuffle_digits is built. The characters from a span are mapped
on the appropriate subarrays in the vector. Also zero completion is done in
this step, as the instruction pshufb either copy given byte from the input
vector or put zero if the index has got the most significant bit set.
Let’s consider span pattern _ddd__d__ddd_dd_. It can be converted using
four-digit conversion, so the layout of normalized vector should be:
[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ]
There are four numbers a, b, c and d in the vector; note
that the bytes at 0th index hold the most significant digits.
The first number in pattern has three digits at indices 1, 2 and 3. It means
that a1 = 1, a2 = 2 and a3 = 3; value of a0 should be zeroed,
so a0 = 0x80.
The second number has just one digit at index 6. Thus b3 = 6 and, since
the rest digits have to be zero, b0 = b1 = b2 = 0x80.
The third number has three digits at indices 9, 10, 11. Thus c1 = 9,
c2 = 10, c3 = 11, and c0 = 0x80.
The last, forth number, has got two digits at indices 14 and 15. Thus
d2 = 14, d3 = 15 and the rest have to be zero, so d0 = d1 = 0x80.
The final content of the shuffle_digits vector is:
[ 80 | 1 | 2 | 3 | 80 | 80 | 80 | 6 | 80 | 9 | 10 | 11 | 80 | 80 | 14 | 15 ]
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.
- Load 16 characters into the input vector.
const__m128iinput=_mm_loadu_si128((const__m128i*)data);
- Check for invalid characters (will be discussed later).
- Build the span pattern:
// t0 = input[i] in [‘0’..’9] ? 0xff : 0x00
const__m128it0=decimal_digits_mask(input);constuint16_tspan_mask=_mm_movemask_epi8(t0);
- Fetch the block info with the precalculate parameters.
constBlockInfo&bi=blocks[span_mask];
- From the block info get the shuffle pattern for _mm_shuffle_epi8.
const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)b.shuffle_digits);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);
- The shuffled vector is then issued to the proper conversion routine. The block
info structure keeps the conversion kind (conversion_routine) and also the number
of items to get from the SSE result (element_count). It is not always
possible to use any SSE conversion procedure, thus a scalar code is needed
to handle odd cases.
if(b.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE2Digits){convert_2digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE4Digits){convert_4digits(shuffled,b.element_count,output);}elseif(b.conversion_routine==Conversion::SSE8Digits){convert_8digits(shuffled,b.element_count,output);}else{// case unsupported, a scalar code is called here
}
- Advance the input pointer. The block info structure keeps how many bytes
were actually covered.
data+=b.total_skip;
In case of unsigned inputs we need to classify input characters into three
categories:
- digits, i.e. chars ‘0’ .. ‘9’;
- user-defined set of separators;
- invalid numbers.
In practise we need two masks: one with positions of digits, another with
positions of separators. If the or’ed mask has some zero elements, it means
that there are invalid characters. Below is a schema:
const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_sep=matcher.get_mask(input);const__m128ibytemask_valid=_mm_or_si128(bytemask_digit,bytemask_sep);constuint16_tvalid_mask=_mm_movemask_epi8(bytemask_valid);if(valid_mask!=0xffff){throwstd::logic_error(“wrong input”);}
The most generic SSE implementation works in time proportional to the size
of a separators set. Each character from the set is broadcasted into
a vector and then compared with the input vector. The resulting byte mask is
ored with the previous result.
__m128iget_mask(const__m128i&input,const__m128i&initial)const{__m128iresult=initial;for(size_ti=0;i<n+1;i++){const__m128imask=_mm_cmpeq_epi8(letters[i],input);result=_mm_or_si128(result,mask);}returnresult;}
If the separators set is known at compilation time, then a compiler may
unroll the loop.
The SNTI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used
in same cases. The instruction gets two arguments. One is an input vector.
Another one is interpreted either as a set of individual characters (up to 16) or
characters ranges (up to 8). The result is a mask for the input characters matching
the set (or ranges). In the example below the set variant was used.
__m128iget_mask(const__m128i&input,const__m128i&initial){constuint8_tmode=_SIDD_UBYTE_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_UNIT_MASK;return_mm_or_si128(initial,_mm_cmpestrm(set,set_size,input,16,mode));}
- It is unlikely, and sometimes impossible, to process all digits from the
input vector. In a single iteration one to 16 bytes are converted. The table
below shows details — an important conclusion is that almost 95% of
patterns process at least half of an input vector.
bytes processed | patterns | % | cumulative % |
---|---|---|---|
0 | 17 | 0.03% | 0.03% |
2 | 32 | 0.05% | 0.07% |
3 | 48 | 0.07% | 0.15% |
4 | 176 | 0.27% | 0.42% |
5 | 360 | 0.55% | 0.97% |
6 | 542 | 0.83% | 1.79% |
7 | 693 | 1.06% | 2.85% |
8 | 1084 | 1.65% | 4.50% |
9 | 1620 | 2.47% | 6.98% |
10 | 2274 | 3.47% | 10.45% |
11 | 3057 | 4.66% | 15.11% |
12 | 4526 | 6.91% | 22.02% |
13 | 6193 | 9.45% | 31.47% |
14 | 8952 | 13.66% | 45.13% |
15 | 12890 | 19.67% | 64.79% |
16 | 23072 | 35.21% | 100.00% |
- However, the input validation is always done for the whole vector,
all 16 bytes. This problem might be overcome, see Processing larger
inputs. - SSE units are underutilized. The longest one-digit number sequence
is “1;2;3;4;5;6;7;8;”. It contains eight numbers, while the SSE procedure
is able to process 16 numbers. Likewise, the longest two-digit number
sequence is “12;34;56;78;98;;”. It has five numbers, but the SSE procedure
is able to convert eight numbers. Similarly, the longest four-digit numbers
sequence is “1234;5678;9123;;” — it has three four-digit numbers, the SSE
procedure is able to convert four numbers.
SSE procedure | usage in parser | converted numbers | average utilization |
|||
---|---|---|---|---|---|---|
input size | converted nums | # | % | avg | max | |
N/A | — | 264 | 0.40% | — | — | — |
1-digit | 16 | 5799 | 8.85% | 3.52 | 8 | 22% |
2-digit | 8 | 21241 | 32.41% | 3.99 | 7 | 49% |
4-digit | 4 | 30924 | 47.18% | 3.48 | 4 | 87% |
8-digit | 2 | 7308 | 11.15% | 1.97 | 2 | 98% |
Similarly to the unsigned conversion, the signed conversion also requires a
span pattern. But in this case also the sign characters ‘+’ and ‘-‘ are
included in the pattern. For this sample string:
input = “_+123_-45_78_-9_”
the span pattern is:
span_mask = _dddd_ddd_dd_dd_
The conversion unfortunately requires two shuffles:
- Shuffle spans onto proper vector elements — this is exactly the same
as in unsigned conversion. - Populate the first span characters, i.e. possibly sign character,
on the proper vector elements.
For input:
input =
[ ‘_’ | ‘-‘ | ‘1’ | ‘2’ | ‘3’ | ‘_’ | ‘+’ | ‘4’ | ‘5’ | ‘_’ | ‘-‘ | ‘6’ | ‘_’ | ‘7’ | ‘8’ | ‘_’ ]
The conversion processes four four-digit numbers. After shuffling the
digit spans we have:
shuffled =
[ ‘-‘ ‘1’ ‘2’ ‘3’ | 0x00 ‘+’ ‘4’ ‘5’ | 0x00 0x00 ‘-‘ ‘6’ | 0x00 0x00 ‘7’ ‘8’ ]
and the vector of sign characters is:
shuffled_signs =
[ ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘+’ ‘+’ ‘+’ ‘+’ | ‘-‘ ‘-‘ ‘-‘ ‘-‘ | ‘7’ ‘7’ ‘7’ ‘7’ ]
The shuffled_signs vector does not necessarily contain only ‘+’ or ‘-‘,
it might contain also digits, but this is not an error.
Then the conversion works as follows:
- From the shuffled vector the sign characters are removed and
ASCII characters are converted into values:
[ 0 1 2 3 | 0 0 4 5 | 0 0 0 6 | 0 0 7 8 ] - Thus we end up with a vector of unsigned numbers, and such an input
can be converted by the already defined SSE procedures:
[ 123 | 45 | 6 | 78 ] - Finally, negate appropriate elements; we use a well known equation (~x + 1),
which is expressed in the SSE code as (x xor (-1) – (-1)). The value -1
is obtained from comparing shuffled_signs with the ‘-‘:
negated_mask = (shuffled_signs == packed_byte(‘-‘))
[ ff ff ff ff | 00 00 00 00 | ff ff ff ff | 00 00 00 00 ]
or as vector of int32_t:
[ -1 | 0 | -1 | 0 ]
Masking of the sign characters in shuffled vector is done at no cost, as
the instruction _mm_subs_epu8 (psubusb) is used to convert from ASCII to
numeric values:
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);
The instruction performs subtract with saturation, i.e. calculates max(0,
a-b). Since the ASCII codes of ‘-‘ and ‘+’ are respectively 43 and 45 and the
ASCII code of ‘0’ is 48, then the result of such expression is zero for both
sign characters.
The characters ‘+’ and ‘-‘ can be present only at the beginning of a digit
span, i.e. we want to reject inputs like “++12” or “1234-,”.
To properly validate the input we need four masks for:
- separators,
- digits,
- the character ‘-‘,
- the character ‘+’.
The first phase is similar to the unsigned conversion, i.e. all masks are or’ed
together, then zero elements point at invalid characters.
The second phase checks if sign characters are placed properly. For each span
combination we need to precalculate the mask of valid sign positions. Then,
after or’ing the masks for the ‘+’ and ‘-‘ we simply verify if they are placed
on valid positions.
In practise the precalculated data has a bit-mask for invalid sign positions,
thanks to that validation require just a simple and. The snippet below shows
the schema of the second step.
const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);// …
constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=blocks[span_mask];if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}
AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b)
which does a lookup in a 128-byte table using as the indices the seven lowest bits of
each byte of a ZMM register. Invocation of the intrinsics function is weird:
const__m512ilookup_lo=…// elements 0..63
const__m512ilookup_hi=…// and 64..127 of the lookup
const__m512itransformed=_mm512_permutex2var_epi8(lookup_lo,input,lookup_hi);
If the set of valid character fits in the standard ASCII character set, i.e.
just seven lowest bits are set, then the validation step might quickly reject
extended ASCII (7th bit set) and then translate the input using single
invocation of _mm512_permutex2var_epi8. When full 8-bit input
is allowed, then the instruction has to be called twice with two halves
of a 256-byte lookup.
An invocation (or two) of this instruction allows to classify at once
all 64 input bytes into required categories:
- separators — it doesn’t matter how many separator chars are used;
- digits;
- and sign characters.
Category is encoded as a bit at certain position, in sample implementation
following schema is used:
- separator = 0x01,
- digit = 0x80,
- sign = 0x80 | 0x40.
If the transformed vector contains zero byte, it means there are invalid characters.
The span pattern is obtained from the most significant bits by the instruction
_mm512_movepi8_mask (vpmovb2m); it works exactly the same way as
pmovmaskb from SSE. The sign pattern is obtained by the instruction
_mm512_test_epi8_mask (vptestmb), by testing bit 7th (mask 0x40).
Below is a snippet from the implementation:
const__m512iclasses=_mm512_permutex2var_epi8(class_lo,input,class_hi);if(_mm512_test_epi8_mask(classes,classes)!=uint64_t(-1)){throwstd::logic_error(“invalid character”);}uint64_tspan_mask64=_mm512_movepi8_mask(classes);uint64_tsign_mask64=_mm512_test_epi8_mask(classes,_mm512_set1_epi8(int8_t(0x40)));
Below are major steps of the algorithm’s loop, with snippets from the actual
implementation.
- Load 16 characters into the input vector.
const__m128iinput=_mm_loadu_si128((const__m128i*)data);
- Check for invalid characters (as described above).
- Build the span pattern and fetch the conversion parameters.
const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128iascii_plus=_mm_set1_epi8(‘+’);// …
const__m128ibytemask_plus=_mm_cmpeq_epi8(input,ascii_plus);const__m128ibytemask_minus=_mm_cmpeq_epi8(input,ascii_minus);const__m128ibytemask_sign=_mm_or_si128(bytemask_plus,bytemask_minus);const__m128ibytemask_digit=decimal_digits_mask(input);const__m128ibytemask_span=_mm_or_si128(bytemask_digit,bytemask_sign);constuint16_tspan_mask=_mm_movemask_epi8(bytemask_span);constBlockInfo&bi=block_info[span_mask];
- Validate if the sign characters are properly placed.
constuint16_tsign_mask=_mm_movemask_epi8(bytemask_sign);if(sign_mask&bi.invalid_sign_mask){throwstd::runtime_error(“‘+’ or ‘-‘ at invalid position”);}
- If sign_mask is zero, choose faster unsigned conversion path.
- Otherwise prepare data for signed conversion.
const__m128iascii_minus=_mm_set1_epi8(‘-‘);const__m128ishuffle_digits=_mm_loadu_si128((const__m128i*)bi.shuffle_digits);const__m128ishuffle_signs=_mm_loadu_si128((const__m128i*)bi.shuffle_signs);const__m128ishuffled=_mm_shuffle_epi8(input,shuffle_digits);const__m128ishuffled_signs=_mm_shuffle_epi8(input,shuffle_signs);const__m128inegate_mask=_mm_cmpeq_epi8(shuffled_signs,ascii_minus);
- And finally invoke proper conversion routine.
if(bi.conversion_routine==Conversion::SSE1Digit){convert_1digit(shuffled,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE2Digits){convert_2digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE4Digits){convert_4digits_signed(shuffled,negate_mask,bi.element_count,output);}elseif(bi.conversion_routine==Conversion::SSE8Digits){convert_8digits_signed(shuffled,negate_mask,bi.element_count,output);}else{// scalar conversion
}
As it was stated in Caveats section, a single step of algorithm validates the
whole input vector but rarely converts all bytes from the input. Although
detecting digits and sign characters might be considered cheap, we saw that
validating against an arbitrary set of separators might be really time
consuming.
The idea is to separate the validation from the conversion. In case of the
unsigned conversion only the span_mask is needed; in case of the
signed conversion we need also sign_mask.
In each iteration we process four 16-byte chunks. The result from this step are
64-bit masks. Then, we load 16 bytes into the input vector and use the lower
16-bit part of already calculated mask to get block info for the loaded code.
For unsigned conversion the main loop looks like this:
while(data<end){const__m128iinput=_mm_loadu_si128((const__m128i*)data);// validate input and calculate span_mask
constuint16_tspan_mask=get_span_mask(input);// obtain block info
constBlockInfo&bi=block_info[span_mask];// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;}
And after applying the proposed change:
while(data<end){const__m128iinput0=_mm_loadu_si128((const__m128i*)(data+0*16));const__m128iinput1=_mm_loadu_si128((const__m128i*)(data+1*16));const__m128iinput2=_mm_loadu_si128((const__m128i*)(data+2*16));const__m128iinput3=_mm_loadu_si128((const__m128i*)(data+3*16));// validate input and calculate 4 x span_mask
constuint16_tspan_mask0=get_span_mask(input0);constuint16_tspan_mask1=get_span_mask(input1);constuint16_tspan_mask2=get_span_mask(input2);constuint16_tspan_mask3=get_span_mask(input3);constuint64_tspan_mask64=(uint64_t(span_mask0)<<(0*16))|(uint64_t(span_mask1)<<(1*16))|(uint64_t(span_mask2)<<(2*16))|(uint64_t(span_mask3)<<(3*16));char*ptr=data+3*16;while(data<ptr){constuint16_tspan_mask=span_mask64&0xffff;// obtain block info
constBlockInfo&bi=block_info[span_mask];// read input
const__m128iinput=load16-bytefromdata// convert
convert(input,bi);// advance pointer
data+=bi.total_processed;// shift the mas accordingly
span_mask64>>=bi.total_processed;}}
The problem of this solution is multiple reading the input bytes. Moreover, the
looping schema is more complicated; especially in practice, where also scalar
fallback have to be considered (it is not shown here).
The idea of separation the validation step from the actual conversion might
also be applied for scalar parser. The validation, i.e. building a span_mask
is done by SIMD code. Then, each byte of the span_mask is processed
by a precompiled code for each of 256 cases.
There are two major advantages:
1. Scalar conversions are called with compile-time constants, thanks to
that a compiler can perform as many optimizations as possible.
2. Digits spans that cross byte boundary can be processed seamlessly. When a
span ends at the byte’s end, a conversion routine converts all characters from
the input’s chunk and marks that the conversion is not completed.
When the next pattern is processed, the mark is checked: if the first
bit of the current pattern is zero, it means that the previously converted
number is ready and can be saved. If the first bit is one, it means that the
first span is continuation and thus conversion must be carried on from
the saved point.
Let’s convert following input 24-byte input (the space separates 8-byte chunks):
_12_3_45 6_7890_1 ___99___
The first chunk of input is _12_3_45, i.e. the span_pattern = 0b01101011
= 0x6b. The digit spans are [(1,2), (4,4), (6,7)], following code handles them:
*output++=convert<2>(data+1);*output++=convert<1>(data+4);val=convert<2>(data+6);has_val=true;
The template convert is parametrized by the number of digits to convert and
gets a pointer to the first character. The important fact is such code doesn’t need
to validate pointers, sizes etc. All is known in the compile-time.
The next input’s chunk is 6_7890_1, the digit spans are [(0,0), (2,5), (7,7)].
A variant of template convert that accepts two parameters simple starts
conversion from the saved state (val).
if(has_val){*output++=continue<1>(data+0,val);}else{*output++=convert<1>(data+0);}*output++=convert<4>(data+2);val=convert<1>(data+7);has_val=true;
The last, third chunk, is ___99___, and there is just one digit span [(3,4)]:
if(has_val){*output++=val;has_val=false;}*output++=convert<2>(data+3);has_val=false;
When the input vector contains an array of three-digit numbers, like
“123;456;789;123;”, we still use four-digit conversion routine. But we might
utilize the fact that the most significant digit is always zero.
We use _mm_maddubs_epi16, but with weights 1, 10, 100, 0: two lower digits
are converted and the most significant digit is properly scaled. Then simple
horizontal add _mm_hadd_epi16 adds the intermediate results.
// t0 = [ 0 | 5 | 7 | 9 | 0 | 1 | 2 | 3 | 0 | 9 | 2 | 5 | 0 | 3 | 2 | 7 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t1 = [ 500 | 79 | 100 | 23 | 900 | 25 | 300 | 27 ]
const__m128imul_all=_mm_setr_epi8(0,100,10,1,0,100,10,1,0,100,10,1,0,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 579 | 123 | 925 | 327 | 579 | 123 | 925 | 327 ]
const__m128it2=_mm_hadd_epi16(t1,t1);
Where two four-digits are going to be converted, a faster approach can be used.
Instead of storing each digit on separate bytes, digits are stored on 16-bit
values. Then a single _mm_madd_epi16 is used to multiply all digits by
weights 1, 10, 100, 1000. Then the result’s halves are added with _mm_hadd_epi32
(phaddd).
// input = “12345678”
// t0 = [ 0 | 1 | 0 | 2 | 0 | 3 | 0 | 4 | 0 | 5 | 0 | 6 | 0 | 7 | 0 | 8 ]
const__m128iascii0=_mm_set1_epi8(‘0’);const__m128it0=_mm_subs_epu8(input,ascii0);// t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ]
// * [ 1000 | 100 | 10 | 1 | 1000 | 100 | 10 | 1 ]
// t1 = [ 1200 | 34 | 5600 | 78 ]
const__m128imul_all=_mm_setr_epi16(1000,100,10,1,1000,100,10,1);const__m128it1=_mm_maddubs_epi16(t0,mul_all);// t2 = [ 1234 | 5678 | 1234 | 5678 ]
const__m128it2=_mm_hadd_epi32(t1,t1);
There are two reference scalar implementations:
- a fully specialized, handcrafted parser;
- parser based on two standard functions: strspn and strtol.
The handcrafted source:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){enumState{Separator,Plus,Minus,Digit};Statestate=Separator;Stateprev=Separator;boolnegative=false;int32_tnumber=0;for(size_ti=0;i<size;i++){constcharc=data[i];if(c==’+’){state=Plus;}elseif(c==’-‘){state=Minus;}elseif(c>=’0’&&c<=’9’){state=Digit;}elseif(contains(separators,c)){state=Separator;}else{throwstd::runtime_error(“Wrong character (scalar)”);}switch(state){casePlus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘+’ follows a non-separator character)”);}number=0;negative=false;break;caseMinus:if(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ follows a non-separator character)”);}number=0;negative=true;break;caseDigit:if(prev==Separator){number=c-‘0’;negative=false;}else{number=10*number+c-‘0’;}break;caseSeparator:if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}break;}// switch
prev=state;}// for
if(state==Separator){if(prev==Digit){if(negative){*output=-number;}else{*output=number;}}elseif(prev!=Separator){throwstd::runtime_error(“Invalid syntax (‘-‘ or ‘+’ not followed by any digit)”);}}}
And implementation based on standard functions:
template<typenameINSERTER>voidparse_signed(constchar*data,size_tsize,constchar*separators,INSERTERoutput){char*ptr=const_cast<char*>(data);char*end=ptr+size;char*endptr;while(true){ptr+=strspn(ptr,separators);if(ptr==end){break;}errno=0;constlongval=std::strtol(ptr,&endptr,10);// the following check comes from “man 3 strtol”
if((errno==ERANGE&&(val==LONG_MAX||val==LONG_MIN))||(errno!=0&&val==0)){throwstd::logic_error(“invalid input”);}if(endptr==ptr){throwstd::logic_error(“no digits”);}ptr=endptr;*output++=val;}}
This part presents evaluation of algorithms in terms of running time
(performance tests) and also some statistical measurements that might
help in understanding run-time properties.
Following parameters of the input data are being changed:
- Input size in bytes.
- Distribution of number of digits:
- fixed size from 1 to 8 digits [only in overall performance tests];
- uniform distribution over range 1 to k, where k = 1..8;
- Gaussian distribution with mu = 1..8 (sigma = 1.0).
- Distribution of count of separators between numbers:
- exactly one separator character,
- uniform distribution of 1 to 6 separators.
- Uniform distribution of the set: no-sign-char, ‘+’ and ‘-‘.
Caveats shows statistics of 1) average processed bytes and 2) SSE
utilization. This section presents real numbers gathered during
conversion of 1000000 bytes containing 1-8 digit signed numbers.
loops | 90’790 | |
… different span patterns | 1’315 | |
total numbers converted | 162’228 | 100.00% |
conversion handled by scalar fallbacks | 13’601 | 8.38% |
conversions done by SSE procedures | 148’627 | 91.62% |
… unsigned SSE parser | 6’222 | N/A |
… signed SSE parser | 142’405 |
As we see, most conversions were done by the SSE code; moreover, most
conversions went through the signed parser, which is slower than unsigned
counterpart.
Another important number, which may give a hint about possible cache misses
to the block_info structure, is the count of different span patterns.
It is 1315 (~2% of the whole table size, 65536), but only 120 of them were
processed in 75% iterations. The next section
shows this parameter in details.
Below are detailed statistics for SSE routines:
- calls — how many times the procedure was called;
- converted — how many numbers were parsed;
- conv/call — average utilization.
Conversions 1-/2-/4-/8-digits are described in section SSE conversion
capabilities, the 3-digit procedure in
Appendix A — conversion of three-digit numbers.
procedure | unsigned path | signed path | ||||
---|---|---|---|---|---|---|
calls | converted | conv/call | calls | converted | conv/call | |
1-digit conversion | 325 | 410 | 1.26 | N/A | N/A | N/A |
2-digit conversion | 21 | 48 | 2.29 | 1949 | 3521 | 1.81 |
3-digit conversion | 21 | 43 | 2.05 | 1409 | 2246 | 1.59 |
4-digit conversion | 239 | 581 | 2.43 | 10569 | 26255 | 2.48 |
8-digit conversion | 2887 | 5140 | 1.78 | 59769 | 110383 | 1.85 |
The table below shows how many bytes from the input vector were processed
in single iteration. It is pretty close to average numbers showed in Caveats.
bytes processed | patterns | % | cumulative % |
---|---|---|---|
0 | 0 | 0.00% | 0.00% |
1 | 0 | 0.00% | 0.00% |
2 | 197 | 0.26% | 0.26% |
3 | 583 | 0.76% | 1.01% |
4 | 773 | 1.00% | 2.01% |
5 | 1137 | 1.47% | 3.48% |
6 | 1841 | 2.39% | 5.87% |
7 | 2544 | 3.30% | 9.17% |
8 | 5305 | 6.87% | 16.04% |
9 | 9135 | 11.83% | 27.87% |
10 | 7163 | 9.28% | 37.15% |
11 | 8834 | 11.44% | 48.60% |
12 | 10102 | 13.09% | 61.68% |
13 | 9342 | 12.10% | 73.79% |
14 | 8221 | 10.65% | 84.44% |
15 | 6743 | 8.74% | 93.17% |
16 | 5269 | 6.83% | 100.00% |
The lookup table is quite large: in the sample implementation single record
occupies 38 bytes, it gives almost 2,5 MiB. This section presents analysis of
span_pattern usage during conversion of various inputs. Following parameters
are collated:
- The number of different span patterns that appear during conversion.
This is explained in details below. - The number of CPU clocks per input byte (on Skylake); the results were
directly get from Performance comparison. - Both cache miss and branch miss ratios (the same Skylake); these
results were obtained by running separate bin/benchmark-hwevents tool.
The ratios should give a hint about real-workload penalties.
During a conversion a histogram is build. The histogram contains span_pattern
and the number of its occurrences; the histogram is sorted by the frequency.
Having such a sorted histogram it is easy to estimate how many distinct
patterns appear in most iterations.
Let’s analyse sample, made up, histogram:
# | span pattern | frequency | cumulative |
---|---|---|---|
1 | 0xaaaa | 1 | 1 |
2 | 0xbbbb | 1 | 2 |
3 | 0x4444 | 1 | 3 |
4 | 0xdddd | 1 | 4 |
5 | 0xeeee | 2 | 6 |
6 | 0xffff | 2 | 8 |
7 | 0x1111 | 3 | 11 |
8 | 0x3333 | 5 | 16 |
9 | 0x5555 | 10 | 26 |
10 | 0x7777 | 17 | 43 |
There are 10 different patterns that were used in 43 loops. We can read
from the table that in 1/4 of iterations (11 of 43) were used even seven
patterns. Most of processing uses just three patterns. It means that only a
small portion of the large lookup is actually touched.
The tables in following sections have got data processed in this way. Each column
contains number of distinct patterns that take part in given percentage of the
iterations (25%, 50%, 75%, 95% and all, 100%).
The results for various input sizes are available in the repository, below
are shown only 4kB and 64kB inputs.
parameters | distinct span masks count | cycles per byte | branches | cache references | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
< 25% | < 50% | < 75% | < 95% | 100% | min | avg | taken | mispredicted | ratio | count | missed | ratio | |
Fixed length | |||||||||||||
1 digit, single separator character | 32 | 44 | 50 | 52 | 53 | 3.095 | 3.165 | 26085 | 633 | 2.43% | 1531 | 415 | 27.11% |
2 digits, single separator character | 12 | 16 | 18 | 19 | 20 | 3.305 | 3.400 | 25270 | 485 | 1.92% | 1333 | 114 | 8.55% |
3 digits, single separator character | 5 | 7 | 8 | 8 | 9 | 2.533 | 2.584 | 21809 | 247 | 1.13% | 954 | 63 | 6.60% |
4 digits, single separator character | 4 | 5 | 6 | 6 | 7 | 3.193 | 3.231 | 24145 | 336 | 1.39% | 926 | 49 | 5.29% |
5 digits, single separator character | 2 | 3 | 4 | 4 | 5 | 2.648 | 2.680 | 20788 | 241 | 1.16% | 1067 | 24 | 2.25% |
6 digits, single separator character | 2 | 4 | 4 | 4 | 5 | 2.312 | 2.347 | 18289 | 233 | 1.27% | 915 | 19 | 2.08% |
7 digits, single separator character | 2 | 2 | 2 | 2 | 3 | 4.121 | 4.167 | 25682 | 313 | 1.22% | 780 | 6 | 0.77% |
8 digits, single separator character | 1 | 3 | 3 | 3 | 4 | 5.495 | 5.592 | 44188 | 837 | 1.89% | 836 | 25 | 2.99% |
Uniform distribution | |||||||||||||
1 .. 1 digit, single separator character | 42 | 54 | 60 | 62 | 63 | 3.162 | 3.288 | 25992 | 619 | 2.38% | 1618 | 334 | 20.64% |
1 .. 2 digits, single separator character | 71 | 102 | 118 | 126 | 128 | 3.201 | 3.268 | 26561 | 465 | 1.75% | 2118 | 655 | 30.93% |
1 .. 3 digits, single separator character | 72 | 112 | 134 | 146 | 148 | 2.891 | 2.941 | 24393 | 514 | 2.11% | 2522 | 663 | 26.29% |
1 .. 4 digits, single separator character | 87 | 135 | 159 | 170 | 172 | 3.398 | 3.463 | 25831 | 906 | 3.51% | 2480 | 667 | 26.90% |
1 .. 5 digits, single separator character | 92 | 136 | 163 | 177 | 179 | 3.495 | 3.553 | 25865 | 806 | 3.12% | 2509 | 548 | 21.84% |
1 .. 6 digits, single separator character | 90 | 144 | 168 | 178 | 180 | 3.494 | 3.557 | 24943 | 639 | 2.56% | 2591 | 627 | 24.20% |
1 .. 7 digits, single separator character | 86 | 125 | 143 | 151 | 153 | 3.259 | 3.324 | 23730 | 579 | 2.44% | 2137 | 476 | 22.27% |
1 .. 8 digits, single separator character | 95 | 135 | 157 | 166 | 167 | 3.828 | 3.894 | 28964 | 859 | 2.97% | 2074 | 318 | 15.33% |
Gaussian distribution | |||||||||||||
max at 1 digit, single separator character | 76 | 133 | 161 | 174 | 176 | 3.218 | 3.264 | 25990 | 485 | 1.87% | 2658 | 799 | 30.06% |
max at 2 digit, single separator character | 77 | 129 | 157 | 170 | 173 | 3.122 | 3.189 | 24951 | 686 | 2.75% | 2803 | 455 | 16.23% |
max at 3 digit, single separator character | 76 | 103 | 115 | 120 | 122 | 3.192 | 3.246 | 24812 | 799 | 3.22% | 3038 | 256 | 8.43% |
max at 4 digit, single separator character | 43 | 57 | 63 | 65 | 66 | 3.134 | 3.215 | 23754 | 414 | 1.74% | 1650 | 106 | 6.42% |
max at 5 digit, single separator character | 32 | 38 | 41 | 43 | 44 | 2.822 | 2.875 | 21380 | 346 | 1.62% | 1320 | 93 | 7.05% |
max at 6 digit, single separator character | 25 | 30 | 33 | 34 | 35 | 2.887 | 2.935 | 21765 | 496 | 2.28% | 1223 | 79 | 6.46% |
max at 7 digit, single separator character | 20 | 22 | 24 | 24 | 25 | 3.921 | 3.983 | 29234 | 794 | 2.72% | 1096 | 40 | 3.65% |
max at 8 digit, single separator character | 11 | 12 | 13 | 14 | 15 | 4.825 | 4.887 | 37274 | 1015 | 2.72% | 944 | 44 | 4.66% |
Fixed length | |||||||||||||
1 digit, 1 .. 6 separator characters | 65 | 131 | 197 | 223 | 227 | 3.331 | 3.404 | 18979 | 677 | 3.57% | 3111 | 1581 | 50.82% |
2 digits, 1 .. 6 separator characters | 67 | 135 | 175 | 195 | 198 | 3.473 | 3.529 | 20588 | 799 | 3.88% | 2858 | 1287 | 45.03% |
3 digits, 1 .. 6 separator characters | 68 | 119 | 147 | 157 | 158 | 2.747 | 2.806 | 18629 | 481 | 2.58% | 2728 | 1150 | 42.16% |
4 digits, 1 .. 6 separator characters | 69 | 101 | 118 | 124 | 125 | 2.641 | 2.695 | 18672 | 488 | 2.61% | 1919 | 613 | 31.94% |
5 digits, 1 .. 6 separator characters | 53 | 71 | 78 | 81 | 82 | 2.748 | 2.824 | 19009 | 530 | 2.79% | 1638 | 372 | 22.71% |
6 digits, 1 .. 6 separator characters | 43 | 50 | 53 | 55 | 56 | 2.958 | 3.002 | 19989 | 450 | 2.25% | 1304 | 208 | 15.95% |
7 digits, 1 .. 6 separator characters | 12 | 14 | 17 | 18 | 19 | 3.316 | 3.341 | 20881 | 300 | 1.44% | 1058 | 52 | 4.91% |
8 digits, 1 .. 6 separator characters | 33 | 46 | 54 | 57 | 58 | 4.302 | 4.346 | 34811 | 728 | 2.09% | 1377 | 206 | 14.96% |
Uniform distribution | |||||||||||||
1 .. 1 digit, 1 .. 6 separator characters | 66 | 132 | 198 | 232 | 237 | 3.197 | 3.268 | 19319 | 685 | 3.55% | 2972 | 1273 | 42.83% |
1 .. 2 digits, 1 .. 6 separator characters | 66 | 132 | 198 | 238 | 243 | 3.025 | 3.081 | 19752 | 842 | 4.26% | 3186 | 1247 | 39.14% |
1 .. 3 digits, 1 .. 6 separator characters | 67 | 135 | 203 | 233 | 238 | 3.070 | 3.150 | 19453 | 743 | 3.82% | 3357 | 1216 | 36.22% |
1 .. 4 digits, 1 .. 6 separator characters | 70 | 140 | 210 | 250 | 256 | 3.390 | 3.477 | 19640 | 897 | 4.57% | 3586 | 1226 | 34.19% |
1 .. 5 digits, 1 .. 6 separator characters | 71 | 142 | 213 | 242 | 247 | 3.338 | 3.401 | 19580 | 823 | 4.20% | 3170 | 940 | 29.65% |
1 .. 6 digits, 1 .. 6 separator characters | 71 | 143 | 207 | 233 | 237 | 3.288 | 3.375 | 19457 | 785 | 4.03% | 2916 | 807 | 27.67% |
1 .. 7 digits, 1 .. 6 separator characters | 74 | 148 | 208 | 228 | 230 | 3.304 | 3.375 | 19268 | 761 | 3.95% | 2985 | 904 | 30.28% |
1 .. 8 digits, 1 .. 6 separator characters | 79 | 159 | 209 | 227 | 230 | 3.692 | 3.766 | 23045 | 881 | 3.82% | 2946 | 727 | 24.68% |
Gaussian distribution | |||||||||||||
max at 1 digit, 1 .. 6 separator characters | 66 | 133 | 200 | 244 | 249 | 3.154 | 3.216 | 19347 | 900 | 4.65% | 3515 | 1060 | 30.16% |
max at 2 digit, 1 .. 6 separator characters | 67 | 135 | 203 | 241 | 246 | 3.155 | 3.229 | 19299 | 786 | 4.07% | 3340 | 892 | 26.71% |
max at 3 digit, 1 .. 6 separator characters | 69 | 139 | 205 | 229 | 233 | 3.202 | 3.278 | 19382 | 804 | 4.15% | 3188 | 661 | 20.73% |
max at 4 digit, 1 .. 6 separator characters | 70 | 141 | 180 | 198 | 201 | 2.914 | 2.978 | 18669 | 611 | 3.27% | 2775 | 692 | 24.94% |
max at 5 digit, 1 .. 6 separator characters | 74 | 131 | 156 | 164 | 166 | 3.147 | 3.217 | 19276 | 586 | 3.04% | 2291 | 552 | 24.09% |
max at 6 digit, 1 .. 6 separator characters | 70 | 93 | 101 | 105 | 106 | 3.196 | 3.257 | 20806 | 493 | 2.37% | 1719 | 237 | 13.79% |
max at 7 digit, 1 .. 6 separator characters | 64 | 80 | 87 | 91 | 92 | 3.597 | 3.646 | 24974 | 622 | 2.49% | 1599 | 251 | 15.70% |
max at 8 digit, 1 .. 6 separator characters | 56 | 74 | 82 | 87 | 88 | 3.994 | 4.055 | 29874 | 848 | 2.84% | 1587 | 231 | 14.56% |
parameters | distinct span masks count | cycles per byte | branches | cache references | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
< 25% | < 50% | < 75% | < 95% | 100% | min | avg | taken | mispredicted | ratio | count | missed | ratio | |
Fixed length | |||||||||||||
1 digit, single separator character | 61 | 74 | 81 | 84 | 85 | 3.619 | 3.655 | 393175 | 6529 | 1.66% | 13469 | 974 | 7.23% |
2 digits, single separator character | 13 | 17 | 20 | 21 | 22 | 3.727 | 3.754 | 374164 | 3451 | 0.92% | 11212 | 286 | 2.55% |
3 digits, single separator character | 4 | 6 | 8 | 8 | 9 | 2.839 | 2.852 | 328065 | 829 | 0.25% | 7539 | 101 | 1.34% |
4 digits, single separator character | 4 | 5 | 6 | 6 | 7 | 3.495 | 3.507 | 361605 | 2429 | 0.67% | 7148 | 89 | 1.25% |
5 digits, single separator character | 1 | 2 | 3 | 3 | 4 | 2.921 | 2.935 | 313419 | 849 | 0.27% | 7335 | 119 | 1.62% |
6 digits, single separator character | 1 | 2 | 3 | 3 | 4 | 2.545 | 2.555 | 272684 | 741 | 0.27% | 8704 | 58 | 0.67% |
7 digits, single separator character | 1 | 2 | 2 | 2 | 3 | 4.407 | 4.417 | 386099 | 2206 | 0.57% | 5224 | 27 | 0.52% |
8 digits, single separator character | 2 | 3 | 4 | 4 | 5 | 7.094 | 7.135 | 666950 | 13453 | 2.02% | 5192 | 75 | 1.44% |
Uniform distribution | |||||||||||||
1 .. 1 digit, single separator character | 59 | 72 | 80 | 82 | 83 | 3.627 | 3.693 | 393231 | 6582 | 1.67% | 14582 | 973 | 6.67% |
1 .. 2 digits, single separator character | 236 | 280 | 299 | 308 | 310 | 3.908 | 3.935 | 397134 | 5214 | 1.31% | 19011 | 2176 | 11.45% |
1 .. 3 digits, single separator character | 355 | 420 | 450 | 466 | 469 | 3.565 | 3.590 | 361819 | 4669 | 1.29% | 21347 | 1858 | 8.70% |
1 .. 4 digits, single separator character | 416 | 493 | 523 | 537 | 540 | 5.329 | 5.387 | 391117 | 12854 | 3.29% | 24169 | 1621 | 6.71% |
1 .. 5 digits, single separator character | 444 | 510 | 543 | 556 | 558 | 5.270 | 5.328 | 390345 | 11393 | 2.92% | 19020 | 1568 | 8.24% |
1 .. 6 digits, single separator character | 423 | 492 | 525 | 535 | 537 | 4.589 | 4.633 | 372174 | 7872 | 2.12% | 19917 | 1282 | 6.44% |
1 .. 7 digits, single separator character | 392 | 462 | 481 | 491 | 493 | 4.174 | 4.218 | 355907 | 6263 | 1.76% | 15979 | 949 | 5.94% |
1 .. 8 digits, single separator character | 519 | 598 | 620 | 630 | 631 | 5.217 | 5.273 | 423108 | 11289 | 2.67% | 16635 | 695 | 4.18% |
Gaussian distribution | |||||||||||||
max at 1 digit, single separator character | 481 | 587 | 630 | 645 | 648 | 4.150 | 4.189 | 395482 | 6382 | 1.61% | 25467 | 1620 | 6.36% |
max at 2 digit, single separator character | 454 | 534 | 570 | 586 | 589 | 4.104 | 4.143 | 371930 | 6882 | 1.85% | 25775 | 739 | 2.87% |
max at 3 digit, single separator character | 300 | 330 | 344 | 350 | 352 | 4.990 | 5.050 | 375961 | 11571 | 3.08% | 15877 | 351 | 2.21% |
max at 4 digit, single separator character | 132 | 147 | 153 | 155 | 156 | 3.760 | 3.814 | 356129 | 4522 | 1.27% | 10634 | 220 | 2.07% |
max at 5 digit, single separator character | 84 | 90 | 93 | 95 | 96 | 3.031 | 3.045 | 318232 | 1548 | 0.49% | 8123 | 177 | 2.18% |
max at 6 digit, single separator character | 54 | 59 | 62 | 64 | 65 | 3.479 | 3.499 | 328097 | 6041 | 1.84% | 8491 | 366 | 4.31% |
max at 7 digit, single separator character | 36 | 39 | 41 | 41 | 42 | 5.520 | 5.560 | 444157 | 11932 | 2.69% | 5814 | 124 | 2.13% |
max at 8 digit, single separator character | 22 | 23 | 24 | 25 | 26 | 6.897 | 6.946 | 562307 | 15610 | 2.78% | 5449 | 120 | 2.20% |
Fixed length | |||||||||||||
1 digit, 1 .. 6 separator characters | 1052 | 1626 | 1870 | 1950 | 1958 | 4.371 | 4.408 | 280108 | 7580 | 2.71% | 34860 | 8289 | 23.78% |
2 digits, 1 .. 6 separator characters | 693 | 909 | 995 | 1026 | 1029 | 5.044 | 5.108 | 308039 | 12071 | 3.92% | 26297 | 4481 | 17.04% |
3 digits, 1 .. 6 separator characters | 354 | 438 | 480 | 494 | 496 | 3.334 | 3.358 | 273155 | 3865 | 1.41% | 17568 | 2647 | 15.07% |
4 digits, 1 .. 6 separator characters | 177 | 224 | 246 | 252 | 254 | 3.311 | 3.335 | 277306 | 4944 | 1.78% | 10967 | 1361 | 12.41% |
5 digits, 1 .. 6 separator characters | 100 | 126 | 134 | 137 | 138 | 3.648 | 3.677 | 284172 | 7099 | 2.50% | 7727 | 742 | 9.60% |
6 digits, 1 .. 6 separator characters | 69 | 77 | 80 | 82 | 83 | 3.675 | 3.703 | 297939 | 5210 | 1.75% | 6470 | 402 | 6.21% |
7 digits, 1 .. 6 separator characters | 46 | 49 | 51 | 53 | 54 | 3.662 | 3.673 | 315611 | 2311 | 0.73% | 6077 | 308 | 5.07% |
8 digits, 1 .. 6 separator characters | 30 | 45 | 54 | 58 | 59 | 5.735 | 5.782 | 532039 | 10908 | 2.05% | 6292 | 438 | 6.96% |
Uniform distribution | |||||||||||||
1 .. 1 digit, 1 .. 6 separator characters | 1053 | 1643 | 1896 | 1973 | 1981 | 4.364 | 4.396 | 279251 | 7676 | 2.75% | 34478 | 6144 | 17.82% |
1 .. 2 digits, 1 .. 6 separator characters | 1066 | 1792 | 2104 | 2212 | 2221 | 5.090 | 5.154 | 288339 | 12133 | 4.21% | 39995 | 5781 | 14.45% |
1 .. 3 digits, 1 .. 6 separator characters | 1084 | 1802 | 2120 | 2226 | 2236 | 4.974 | 5.031 | 286026 | 10029 | 3.51% | 44877 | 5187 | 11.56% |
1 .. 4 digits, 1 .. 6 separator characters | 1120 | 1808 | 2084 | 2173 | 2182 | 5.332 | 5.395 | 287345 | 11700 | 4.07% | 42918 | 3871 | 9.02% |
1 .. 5 digits, 1 .. 6 separator characters | 1141 | 1713 | 1959 | 2031 | 2039 | 5.078 | 5.153 | 287785 | 10589 | 3.68% | 40253 | 2950 | 7.33% |
1 .. 6 digits, 1 .. 6 separator characters | 1142 | 1617 | 1803 | 1856 | 1860 | 4.888 | 4.960 | 286913 | 10124 | 3.53% | 35625 | 2229 | 6.26% |
1 .. 7 digits, 1 .. 6 separator characters | 1091 | 1503 | 1646 | 1678 | 1681 | 4.970 | 5.033 | 287420 | 11059 | 3.85% | 30932 | 3218 | 10.40% |
1 .. 8 digits, 1 .. 6 separator characters | 1109 | 1476 | 1598 | 1619 | 1622 | 5.650 | 5.711 | 339596 | 14223 | 4.19% | 28870 | 2243 | 7.77% |
Gaussian distribution | |||||||||||||
max at 1 digit, 1 .. 6 separator characters | 1070 | 1975 | 2405 | 2551 | 2565 | 5.389 | 5.456 | 285670 | 12734 | 4.46% | 46840 | 6140 | 13.11% |
max at 2 digit, 1 .. 6 separator characters | 1094 | 1862 | 2174 | 2283 | 2293 | 5.125 | 5.177 | 288202 | 10818 | 3.75% | 43779 | 3228 | 7.37% |
max at 3 digit, 1 .. 6 separator characters | 1021 | 1440 | 1610 | 1666 | 1672 | 4.927 | 4.986 | 284545 | 10542 | 3.70% | 34647 | 1690 | 4.88% |
max at 4 digit, 1 .. 6 separator characters | 715 | 905 | 985 | 1010 | 1013 | 4.072 | 4.129 | 280545 | 8310 | 2.96% | 20920 | 931 | 4.45% |
max at 5 digit, 1 .. 6 separator characters | 445 | 534 | 567 | 575 | 577 | 4.127 | 4.167 | 286377 | 8569 | 2.99% | 14319 | 651 | 4.55% |
max at 6 digit, 1 .. 6 separator characters | 296 | 336 | 346 | 351 | 352 | 4.025 | 4.054 | 312734 | 6394 | 2.04% | 11762 | 402 | 3.42% |
max at 7 digit, 1 .. 6 separator characters | 185 | 207 | 214 | 219 | 220 | 4.481 | 4.520 | 376654 | 8042 | 2.14% | 7924 | 452 | 5.70% |
max at 8 digit, 1 .. 6 separator characters | 110 | 133 | 141 | 146 | 148 | 5.532 | 5.576 | 448317 | 12621 | 2.82% | 7779 | 821 | 10.55% |
Following procedures are tested:
- scalar — the hand-coded validating scalar parser;
- scalar (std) — also the hand-coded parser,
but use only standard functions strtol and strspn; - scalar hybrid — the implementation of Scalar hybrid;
- SSE — the implementation of the base SSE algorithm;
- SSE (block) — the implementation of variant described in Processing
larger inputs - SSE (simplified) — the SSE variant with relaxed delimiters definition.
There are two tests approaches:
- Overall, where rather huge input data is parsed (a few megabytes) and
the wall clock is used to measure the time of the whole procedure. Procedures
are repeated several times (10, 100, 1000 or more), the minimum running
time is noted.
To run these tests type make report-overall.rst; the tests suite uses executable
bin/benchmark.
The results from various computers are available in the repository. - Microbenchmarks, where small inputs are parsed and exact number of CPU
cycles is obtained. Procedures are also repeated many times (1000) and then
minumum and average cycles to process a single input byte are calculated.
To run these tests type make microbenchmarks.rst; the tests suite uses
executable bin/benchmark-cpuclocks.
The all results from various computers are available in the repository,
below are only the measurements from the Skylake computer.
Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
The table below compares speedup to the reference procedure for various
input sizes and data distributions. Detailed measurements for each size and
different distribution parameters are presented in the subsequent sections.
speedup over scalar procedure | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
scalar (std) | scalar (hybrid) | SSE | SSE (block) | SSE (simplified) | |||||||||||||
size [B] | distribution | samples | min | avg | max | min | avg | max | min | avg | max | min | avg | max | min | avg | max |
1024 | Gaussian distribution | 16 | 0.45 | 0.87 | 1.29 | 1.65 | 2.36 | 3.26 | 1.53 | 3.21 | 5.44 | 1.66 | 3.16 | 4.97 | 1.58 | 3.37 | 5.78 |
1024 | Fixed length | 16 | 0.42 | 0.87 | 1.29 | 1.63 | 2.41 | 3.53 | 1.28 | 3.32 | 5.23 | 1.47 | 3.30 | 5.05 | 1.30 | 3.54 | 5.79 |
1024 | Uniform distribution | 16 | 0.40 | 0.86 | 1.32 | 1.64 | 2.41 | 3.36 | 2.04 | 3.70 | 5.54 | 2.22 | 3.53 | 4.97 | 2.13 | 3.88 | 5.79 |
4096 | Gaussian distribution | 16 | 0.39 | 0.69 | 1.24 | 1.69 | 3.01 | 4.67 | 1.91 | 4.56 | 7.21 | 2.13 | 4.88 | 7.84 | 2.01 | 4.77 | 7.68 |
4096 | Fixed length | 16 | 0.36 | 0.69 | 1.39 | 1.94 | 2.87 | 4.32 | 1.60 | 4.44 | 6.86 | 1.71 | 4.66 | 8.34 | 1.62 | 4.70 | 7.83 |
4096 | Uniform distribution | 16 | 0.43 | 0.78 | 1.38 | 2.07 | 3.24 | 4.55 | 3.19 | 5.33 | 7.40 | 3.36 | 5.61 | 7.87 | 3.27 | 5.61 | 7.79 |
65536 | Gaussian distribution | 16 | 0.45 | 0.77 | 1.30 | 1.40 | 1.74 | 2.13 | 1.73 | 4.46 | 5.88 | 1.76 | 4.57 | 6.00 | 1.77 | 4.60 | 6.25 |
65536 | Fixed length | 16 | 0.40 | 0.75 | 1.36 | 1.55 | 1.87 | 2.37 | 1.46 | 4.68 | 6.72 | 1.48 | 5.00 | 7.31 | 1.50 | 4.95 | 7.53 |
65536 | Uniform distribution | 16 | 0.55 | 0.86 | 1.36 | 1.38 | 1.75 | 2.34 | 3.32 | 4.98 | 6.39 | 3.29 | 5.01 | 6.52 | 3.43 | 5.20 | 6.89 |
102400 | Gaussian distribution | 16 | 0.46 | 0.77 | 1.26 | 1.38 | 1.71 | 2.03 | 1.78 | 4.28 | 5.66 | 1.81 | 4.44 | 5.81 | 1.82 | 4.42 | 6.00 |
102400 | Fixed length | 16 | 0.40 | 0.75 | 1.36 | 1.53 | 1.84 | 2.31 | 1.29 | 4.59 | 6.49 | 1.46 | 4.91 | 6.99 | 1.31 | 4.85 | 7.16 |
102400 | Uniform distribution | 16 | 0.54 | 0.86 | 1.38 | 1.35 | 1.71 | 2.32 | 3.22 | 4.78 | 6.38 | 3.21 | 4.83 | 6.43 | 3.31 | 4.97 | 6.85 |
Gaussian distribution (max at 8 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 8.782 | 9.103 | 1.00 |
scalar (std) | 22.874 | 23.586 | 0.38 |
scalar (hybrid) | 5.005 | 5.148 | 1.75 |
SSE | 4.898 | 4.974 | 1.79 |
SSE (block) | 4.603 | 4.651 | 1.91 |
SSE (simplified) | 4.687 | 4.730 | 1.87 |
Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.729 | 13.130 | 1.00 |
scalar (std) | 17.501 | 18.150 | 0.73 |
scalar (hybrid) | 4.067 | 4.273 | 3.13 |
SSE | 4.002 | 4.069 | 3.18 |
SSE (block) | 3.903 | 3.945 | 3.26 |
SSE (simplified) | 3.914 | 3.970 | 3.25 |
Gaussian distribution (max at 7 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 8.836 | 9.191 | 1.00 |
scalar (std) | 23.365 | 24.139 | 0.38 |
scalar (hybrid) | 5.104 | 5.250 | 1.73 |
SSE | 4.129 | 4.195 | 2.14 |
SSE (block) | 3.476 | 3.518 | 2.54 |
SSE (simplified) | 3.975 | 4.039 | 2.22 |
Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.646 | 14.115 | 1.00 |
scalar (std) | 18.029 | 18.596 | 0.76 |
scalar (hybrid) | 4.042 | 4.274 | 3.38 |
SSE | 3.591 | 3.644 | 3.80 |
SSE (block) | 3.294 | 3.334 | 4.14 |
SSE (simplified) | 3.431 | 3.481 | 3.98 |
Gaussian distribution (max at 6 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 9.434 | 9.852 | 1.00 |
scalar (std) | 24.766 | 25.547 | 0.38 |
scalar (hybrid) | 5.571 | 5.728 | 1.69 |
SSE | 2.833 | 2.882 | 3.33 |
SSE (block) | 2.977 | 3.024 | 3.17 |
SSE (simplified) | 2.738 | 2.786 | 3.45 |
Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.200 | 14.750 | 1.00 |
scalar (std) | 17.992 | 18.563 | 0.79 |
scalar (hybrid) | 4.091 | 4.330 | 3.47 |
SSE | 3.207 | 3.266 | 4.43 |
SSE (block) | 2.885 | 2.932 | 4.92 |
SSE (simplified) | 3.058 | 3.113 | 4.64 |
Gaussian distribution (max at 5 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 10.384 | 10.864 | 1.00 |
scalar (std) | 25.763 | 26.495 | 0.40 |
scalar (hybrid) | 6.102 | 6.273 | 1.70 |
SSE | 2.786 | 2.827 | 3.73 |
SSE (block) | 2.710 | 2.744 | 3.83 |
SSE (simplified) | 2.699 | 2.752 | 3.85 |
Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.168 | 15.622 | 1.00 |
scalar (std) | 18.088 | 18.872 | 0.84 |
scalar (hybrid) | 4.244 | 4.442 | 3.57 |
SSE | 3.023 | 3.096 | 5.02 |
SSE (block) | 2.864 | 2.934 | 5.30 |
SSE (simplified) | 2.908 | 2.984 | 5.22 |
Gaussian distribution (max at 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.091 | 18.877 | 1.00 |
scalar (std) | 34.484 | 35.258 | 0.52 |
scalar (hybrid) | 7.037 | 7.275 | 2.57 |
SSE | 3.160 | 3.213 | 5.72 |
SSE (block) | 3.282 | 3.352 | 5.51 |
SSE (simplified) | 3.120 | 3.194 | 5.80 |
Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.390 | 22.112 | 1.00 |
scalar (std) | 18.566 | 19.088 | 1.15 |
scalar (hybrid) | 5.160 | 5.286 | 4.15 |
SSE | 3.067 | 3.126 | 6.97 |
SSE (block) | 2.904 | 2.963 | 7.37 |
SSE (simplified) | 2.961 | 3.022 | 7.22 |
Gaussian distribution (max at 4 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.966 | 12.550 | 1.00 |
scalar (std) | 26.982 | 27.705 | 0.44 |
scalar (hybrid) | 5.896 | 6.082 | 2.03 |
SSE | 3.234 | 3.284 | 3.70 |
SSE (block) | 3.026 | 3.074 | 3.95 |
SSE (simplified) | 3.117 | 3.183 | 3.84 |
Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.371 | 16.967 | 1.00 |
scalar (std) | 17.832 | 18.445 | 0.92 |
scalar (hybrid) | 4.173 | 4.395 | 3.92 |
SSE | 2.909 | 3.001 | 5.63 |
SSE (block) | 2.760 | 2.813 | 5.93 |
SSE (simplified) | 2.848 | 2.908 | 5.75 |
Gaussian distribution (max at 3 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.280 | 13.948 | 1.00 |
scalar (std) | 29.238 | 30.157 | 0.45 |
scalar (hybrid) | 6.225 | 6.450 | 2.13 |
SSE | 3.225 | 3.309 | 4.12 |
SSE (block) | 2.963 | 3.012 | 4.48 |
SSE (simplified) | 3.238 | 3.286 | 4.10 |
Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.193 | 18.831 | 1.00 |
scalar (std) | 18.642 | 19.287 | 0.98 |
scalar (hybrid) | 4.119 | 4.246 | 4.42 |
SSE | 3.126 | 3.190 | 5.82 |
SSE (block) | 3.016 | 3.090 | 6.03 |
SSE (simplified) | 3.000 | 3.048 | 6.06 |
Gaussian distribution (max at 2 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.942 | 16.733 | 1.00 |
scalar (std) | 32.551 | 33.334 | 0.49 |
scalar (hybrid) | 6.372 | 6.605 | 2.50 |
SSE | 3.130 | 3.190 | 5.09 |
SSE (block) | 3.041 | 3.121 | 5.24 |
SSE (simplified) | 3.020 | 3.082 | 5.28 |
Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.088 | 20.800 | 1.00 |
scalar (std) | 18.382 | 19.013 | 1.09 |
scalar (hybrid) | 4.574 | 4.835 | 4.39 |
SSE | 3.184 | 3.264 | 6.31 |
SSE (block) | 3.018 | 3.069 | 6.66 |
SSE (simplified) | 2.988 | 3.042 | 6.72 |
Uniform distribution (1 .. 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.677 | 19.206 | 1.00 |
scalar (std) | 34.479 | 35.207 | 0.54 |
scalar (hybrid) | 7.953 | 8.184 | 2.35 |
SSE | 3.014 | 3.099 | 6.20 |
SSE (block) | 3.117 | 3.172 | 5.99 |
SSE (simplified) | 2.977 | 3.062 | 6.27 |
Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.981 | 22.621 | 1.00 |
scalar (std) | 16.860 | 17.391 | 1.30 |
scalar (hybrid) | 5.285 | 5.481 | 4.16 |
SSE | 3.029 | 3.085 | 7.26 |
SSE (block) | 2.942 | 3.000 | 7.47 |
SSE (simplified) | 3.099 | 3.179 | 7.09 |
Uniform distribution (1 .. 2 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.695 | 18.444 | 1.00 |
scalar (std) | 34.917 | 35.633 | 0.51 |
scalar (hybrid) | 7.245 | 7.401 | 2.44 |
SSE | 3.324 | 3.387 | 5.32 |
SSE (block) | 3.323 | 3.398 | 5.33 |
SSE (simplified) | 3.236 | 3.321 | 5.47 |
Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.164 | 21.860 | 1.00 |
scalar (std) | 18.567 | 19.126 | 1.14 |
scalar (hybrid) | 4.937 | 5.156 | 4.29 |
SSE | 2.940 | 2.995 | 7.20 |
SSE (block) | 2.924 | 2.975 | 7.24 |
SSE (simplified) | 2.785 | 2.826 | 7.60 |
Uniform distribution (1 .. 3 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.385 | 17.254 | 1.00 |
scalar (std) | 33.359 | 34.075 | 0.49 |
scalar (hybrid) | 6.877 | 7.145 | 2.38 |
SSE | 2.892 | 2.947 | 5.67 |
SSE (block) | 3.088 | 3.158 | 5.31 |
SSE (simplified) | 2.752 | 2.844 | 5.95 |
Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.530 | 21.221 | 1.00 |
scalar (std) | 18.995 | 19.580 | 1.08 |
scalar (hybrid) | 4.647 | 4.811 | 4.42 |
SSE | 3.037 | 3.089 | 6.76 |
SSE (block) | 2.941 | 2.992 | 6.98 |
SSE (simplified) | 2.914 | 2.972 | 7.05 |
Uniform distribution (1 .. 4 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.926 | 15.613 | 1.00 |
scalar (std) | 31.275 | 32.210 | 0.48 |
scalar (hybrid) | 6.445 | 6.677 | 2.32 |
SSE | 3.393 | 3.458 | 4.40 |
SSE (block) | 3.126 | 3.184 | 4.77 |
SSE (simplified) | 3.293 | 3.344 | 4.53 |
Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.047 | 19.920 | 1.00 |
scalar (std) | 19.500 | 20.130 | 0.98 |
scalar (hybrid) | 4.539 | 4.723 | 4.20 |
SSE | 3.325 | 3.383 | 5.73 |
SSE (block) | 3.132 | 3.193 | 6.08 |
SSE (simplified) | 3.182 | 3.235 | 5.99 |
Uniform distribution (1 .. 5 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.748 | 14.405 | 1.00 |
scalar (std) | 30.105 | 30.941 | 0.46 |
scalar (hybrid) | 6.144 | 6.413 | 2.24 |
SSE | 3.475 | 3.533 | 3.96 |
SSE (block) | 3.171 | 3.238 | 4.34 |
SSE (simplified) | 3.293 | 3.356 | 4.17 |
Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.297 | 19.010 | 1.00 |
scalar (std) | 19.209 | 19.845 | 0.95 |
scalar (hybrid) | 4.477 | 4.652 | 4.09 |
SSE | 3.293 | 3.352 | 5.56 |
SSE (block) | 3.135 | 3.188 | 5.84 |
SSE (simplified) | 3.182 | 3.230 | 5.75 |
Uniform distribution (1 .. 6 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.912 | 13.548 | 1.00 |
scalar (std) | 29.511 | 30.337 | 0.44 |
scalar (hybrid) | 5.958 | 6.253 | 2.17 |
SSE | 3.430 | 3.487 | 3.76 |
SSE (block) | 3.201 | 3.265 | 4.03 |
SSE (simplified) | 3.283 | 3.337 | 3.93 |
Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.288 | 17.956 | 1.00 |
scalar (std) | 19.250 | 19.841 | 0.90 |
scalar (hybrid) | 4.356 | 4.481 | 3.97 |
SSE | 3.205 | 3.276 | 5.39 |
SSE (block) | 2.990 | 3.046 | 5.78 |
SSE (simplified) | 3.116 | 3.172 | 5.55 |
Uniform distribution (1 .. 7 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.974 | 12.552 | 1.00 |
scalar (std) | 27.884 | 28.562 | 0.43 |
scalar (hybrid) | 6.214 | 6.385 | 1.93 |
SSE | 3.253 | 3.309 | 3.68 |
SSE (block) | 3.057 | 3.100 | 3.92 |
SSE (simplified) | 3.119 | 3.167 | 3.84 |
Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.583 | 17.366 | 1.00 |
scalar (std) | 18.947 | 19.528 | 0.88 |
scalar (hybrid) | 4.265 | 4.401 | 3.89 |
SSE | 3.248 | 3.311 | 5.11 |
SSE (block) | 2.989 | 3.048 | 5.55 |
SSE (simplified) | 3.165 | 3.215 | 5.24 |
Uniform distribution (1 .. 8 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.637 | 12.165 | 1.00 |
scalar (std) | 27.107 | 27.851 | 0.43 |
scalar (hybrid) | 6.079 | 6.231 | 1.91 |
SSE | 3.821 | 3.870 | 3.05 |
SSE (block) | 3.492 | 3.550 | 3.33 |
SSE (simplified) | 3.688 | 3.745 | 3.16 |
Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.739 | 16.589 | 1.00 |
scalar (std) | 18.875 | 19.563 | 0.83 |
scalar (hybrid) | 4.335 | 4.463 | 3.63 |
SSE | 3.611 | 3.667 | 4.36 |
SSE (block) | 3.337 | 3.400 | 4.72 |
SSE (simplified) | 3.490 | 3.538 | 4.51 |
Gaussian distribution (max at 8 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.852 | 11.971 | 1.00 |
scalar (std) | 26.589 | 26.792 | 0.45 |
scalar (hybrid) | 8.336 | 8.411 | 1.42 |
SSE | 6.911 | 6.956 | 1.71 |
SSE (block) | 6.921 | 6.990 | 1.71 |
SSE (simplified) | 6.757 | 6.802 | 1.75 |
Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.008 | 17.110 | 1.00 |
scalar (std) | 21.492 | 21.653 | 0.79 |
scalar (hybrid) | 8.722 | 8.800 | 1.95 |
SSE | 5.650 | 5.692 | 3.01 |
SSE (block) | 5.679 | 5.733 | 2.99 |
SSE (simplified) | 5.426 | 5.470 | 3.13 |
Gaussian distribution (max at 7 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.818 | 12.929 | 1.00 |
scalar (std) | 27.335 | 27.609 | 0.47 |
scalar (hybrid) | 9.246 | 9.315 | 1.39 |
SSE | 5.513 | 5.542 | 2.33 |
SSE (block) | 5.259 | 5.296 | 2.44 |
SSE (simplified) | 5.364 | 5.405 | 2.39 |
Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.978 | 18.119 | 1.00 |
scalar (std) | 21.924 | 22.109 | 0.82 |
scalar (hybrid) | 9.205 | 9.286 | 1.95 |
SSE | 4.537 | 4.591 | 3.96 |
SSE (block) | 4.351 | 4.388 | 4.13 |
SSE (simplified) | 4.352 | 4.407 | 4.13 |
Gaussian distribution (max at 6 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.772 | 13.862 | 1.00 |
scalar (std) | 28.295 | 28.596 | 0.49 |
scalar (hybrid) | 10.005 | 10.079 | 1.38 |
SSE | 3.456 | 3.479 | 3.98 |
SSE (block) | 3.358 | 3.385 | 4.10 |
SSE (simplified) | 3.379 | 3.402 | 4.08 |
Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.374 | 19.538 | 1.00 |
scalar (std) | 22.288 | 22.489 | 0.87 |
scalar (hybrid) | 9.660 | 9.730 | 2.01 |
SSE | 3.949 | 3.983 | 4.91 |
SSE (block) | 3.726 | 3.751 | 5.20 |
SSE (simplified) | 3.870 | 3.918 | 5.01 |
Gaussian distribution (max at 5 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.989 | 15.092 | 1.00 |
scalar (std) | 29.851 | 30.089 | 0.50 |
scalar (hybrid) | 10.750 | 10.842 | 1.39 |
SSE | 2.933 | 2.945 | 5.11 |
SSE (block) | 2.807 | 2.822 | 5.34 |
SSE (simplified) | 2.866 | 2.878 | 5.23 |
Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.729 | 20.888 | 1.00 |
scalar (std) | 22.536 | 22.718 | 0.92 |
scalar (hybrid) | 10.225 | 10.311 | 2.03 |
SSE | 4.032 | 4.086 | 5.14 |
SSE (block) | 3.762 | 3.805 | 5.51 |
SSE (simplified) | 3.985 | 4.043 | 5.20 |
Gaussian distribution (max at 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.489 | 24.804 | 1.00 |
scalar (std) | 38.285 | 38.540 | 0.64 |
scalar (hybrid) | 16.088 | 16.198 | 1.52 |
SSE | 4.159 | 4.200 | 5.89 |
SSE (block) | 4.107 | 4.142 | 5.96 |
SSE (simplified) | 4.213 | 4.250 | 5.81 |
Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 28.265 | 28.549 | 1.00 |
scalar (std) | 22.648 | 22.767 | 1.25 |
scalar (hybrid) | 13.524 | 13.622 | 2.09 |
SSE | 5.201 | 5.271 | 5.43 |
SSE (block) | 5.452 | 5.513 | 5.18 |
SSE (simplified) | 5.057 | 5.128 | 5.59 |
Gaussian distribution (max at 4 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.687 | 16.790 | 1.00 |
scalar (std) | 31.153 | 31.398 | 0.54 |
scalar (hybrid) | 11.835 | 11.913 | 1.41 |
SSE | 3.590 | 3.614 | 4.65 |
SSE (block) | 3.466 | 3.489 | 4.81 |
SSE (simplified) | 3.496 | 3.525 | 4.77 |
Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.284 | 22.529 | 1.00 |
scalar (std) | 22.476 | 22.653 | 0.99 |
scalar (hybrid) | 10.916 | 11.005 | 2.04 |
SSE | 4.032 | 4.081 | 5.53 |
SSE (block) | 3.859 | 3.898 | 5.77 |
SSE (simplified) | 3.953 | 3.991 | 5.64 |
Gaussian distribution (max at 3 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.121 | 19.291 | 1.00 |
scalar (std) | 32.927 | 33.175 | 0.58 |
scalar (hybrid) | 13.629 | 13.718 | 1.40 |
SSE | 4.848 | 4.901 | 3.94 |
SSE (block) | 4.780 | 4.825 | 4.00 |
SSE (simplified) | 4.760 | 4.812 | 4.02 |
Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.281 | 24.611 | 1.00 |
scalar (std) | 22.530 | 22.703 | 1.08 |
scalar (hybrid) | 11.920 | 12.185 | 2.04 |
SSE | 4.756 | 4.823 | 5.11 |
SSE (block) | 4.781 | 4.835 | 5.08 |
SSE (simplified) | 4.607 | 4.664 | 5.27 |
Gaussian distribution (max at 2 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.188 | 22.377 | 1.00 |
scalar (std) | 36.158 | 36.461 | 0.61 |
scalar (hybrid) | 15.232 | 15.326 | 1.46 |
SSE | 4.058 | 4.108 | 5.47 |
SSE (block) | 3.952 | 3.990 | 5.61 |
SSE (simplified) | 4.061 | 4.110 | 5.46 |
Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 26.633 | 26.932 | 1.00 |
scalar (std) | 22.723 | 22.869 | 1.17 |
scalar (hybrid) | 13.162 | 13.267 | 2.02 |
SSE | 5.097 | 5.174 | 5.23 |
SSE (block) | 5.160 | 5.224 | 5.16 |
SSE (simplified) | 4.941 | 5.008 | 5.39 |
Uniform distribution (1 .. 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.963 | 23.203 | 1.00 |
scalar (std) | 37.465 | 37.718 | 0.61 |
scalar (hybrid) | 13.951 | 14.058 | 1.65 |
SSE | 3.389 | 3.447 | 6.78 |
SSE (block) | 3.593 | 3.643 | 6.39 |
SSE (simplified) | 3.324 | 3.394 | 6.91 |
Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 27.116 | 27.518 | 1.00 |
scalar (std) | 20.621 | 20.742 | 1.31 |
scalar (hybrid) | 11.908 | 12.006 | 2.28 |
SSE | 4.128 | 4.167 | 6.57 |
SSE (block) | 4.179 | 4.214 | 6.49 |
SSE (simplified) | 4.105 | 4.136 | 6.61 |
Uniform distribution (1 .. 2 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 23.675 | 24.277 | 1.00 |
scalar (std) | 37.914 | 38.203 | 0.62 |
scalar (hybrid) | 15.327 | 15.432 | 1.54 |
SSE | 3.896 | 3.933 | 6.08 |
SSE (block) | 3.870 | 3.901 | 6.12 |
SSE (simplified) | 3.913 | 3.952 | 6.05 |
Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 27.811 | 28.100 | 1.00 |
scalar (std) | 22.433 | 22.537 | 1.24 |
scalar (hybrid) | 13.003 | 13.113 | 2.14 |
SSE | 4.947 | 5.018 | 5.62 |
SSE (block) | 5.201 | 5.265 | 5.35 |
SSE (simplified) | 4.807 | 4.872 | 5.79 |
Uniform distribution (1 .. 3 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.512 | 22.742 | 1.00 |
scalar (std) | 36.660 | 36.946 | 0.61 |
scalar (hybrid) | 15.263 | 15.362 | 1.47 |
SSE | 3.474 | 3.507 | 6.48 |
SSE (block) | 3.456 | 3.480 | 6.51 |
SSE (simplified) | 3.528 | 3.561 | 6.38 |
Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 26.841 | 27.155 | 1.00 |
scalar (std) | 22.874 | 23.042 | 1.17 |
scalar (hybrid) | 13.184 | 13.284 | 2.04 |
SSE | 4.815 | 4.876 | 5.57 |
SSE (block) | 4.952 | 5.022 | 5.42 |
SSE (simplified) | 4.699 | 4.753 | 5.71 |
Uniform distribution (1 .. 4 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.123 | 21.329 | 1.00 |
scalar (std) | 35.205 | 35.486 | 0.60 |
scalar (hybrid) | 14.756 | 14.856 | 1.43 |
SSE | 5.217 | 5.283 | 4.05 |
SSE (block) | 5.053 | 5.106 | 4.18 |
SSE (simplified) | 5.163 | 5.227 | 4.09 |
Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 25.877 | 26.129 | 1.00 |
scalar (std) | 23.132 | 23.285 | 1.12 |
scalar (hybrid) | 12.984 | 13.084 | 1.99 |
SSE | 5.211 | 5.291 | 4.97 |
SSE (block) | 5.283 | 5.356 | 4.90 |
SSE (simplified) | 5.048 | 5.116 | 5.13 |
Uniform distribution (1 .. 5 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.780 | 19.978 | 1.00 |
scalar (std) | 34.247 | 34.511 | 0.58 |
scalar (hybrid) | 14.274 | 14.371 | 1.39 |
SSE | 5.099 | 5.160 | 3.88 |
SSE (block) | 4.920 | 4.975 | 4.02 |
SSE (simplified) | 5.053 | 5.099 | 3.91 |
Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.749 | 25.022 | 1.00 |
scalar (std) | 23.269 | 23.468 | 1.06 |
scalar (hybrid) | 12.627 | 12.718 | 1.96 |
SSE | 5.046 | 5.111 | 4.90 |
SSE (block) | 4.999 | 5.126 | 4.95 |
SSE (simplified) | 4.903 | 4.954 | 5.05 |
Uniform distribution (1 .. 6 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.656 | 18.857 | 1.00 |
scalar (std) | 33.362 | 33.565 | 0.56 |
scalar (hybrid) | 13.643 | 13.742 | 1.37 |
SSE | 4.433 | 4.479 | 4.21 |
SSE (block) | 4.282 | 4.319 | 4.36 |
SSE (simplified) | 4.356 | 4.399 | 4.28 |
Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 23.975 | 24.304 | 1.00 |
scalar (std) | 23.495 | 23.653 | 1.02 |
scalar (hybrid) | 12.269 | 12.358 | 1.95 |
SSE | 4.883 | 4.947 | 4.91 |
SSE (block) | 4.715 | 4.770 | 5.08 |
SSE (simplified) | 4.778 | 4.825 | 5.02 |
Uniform distribution (1 .. 7 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.632 | 17.799 | 1.00 |
scalar (std) | 32.227 | 32.509 | 0.55 |
scalar (hybrid) | 13.079 | 13.177 | 1.35 |
SSE | 4.114 | 4.158 | 4.29 |
SSE (block) | 3.976 | 4.028 | 4.43 |
SSE (simplified) | 4.032 | 4.069 | 4.37 |
Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.878 | 23.116 | 1.00 |
scalar (std) | 23.259 | 23.512 | 0.98 |
scalar (hybrid) | 12.010 | 12.113 | 1.90 |
SSE | 4.926 | 4.986 | 4.64 |
SSE (block) | 4.670 | 4.735 | 4.90 |
SSE (simplified) | 4.836 | 4.892 | 4.73 |
Uniform distribution (1 .. 8 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.753 | 16.923 | 1.00 |
scalar (std) | 31.309 | 31.604 | 0.54 |
scalar (hybrid) | 12.522 | 12.615 | 1.34 |
SSE | 5.181 | 5.232 | 3.23 |
SSE (block) | 5.263 | 5.313 | 3.18 |
SSE (simplified) | 5.089 | 5.133 | 3.29 |
Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.070 | 22.400 | 1.00 |
scalar (std) | 23.084 | 23.272 | 0.96 |
scalar (hybrid) | 11.689 | 11.777 | 1.89 |
SSE | 5.666 | 5.729 | 3.90 |
SSE (block) | 5.833 | 5.886 | 3.78 |
SSE (simplified) | 5.493 | 5.553 | 4.02 |
- The vectorized algorithm can be faster from 2 to 5 times than the scalar base.
- Although SSE speed-up depends on the count of digits in numbers, negative
impact is visible when the numbers are separated by more than one character.
This is caused by greater number of distinct span patterns: for “Gaussian distribution — max at
1 digit, single separator character” the number of distinct patterns is
approximately 1000; for the same distribution, but variable number of
separator, this value raises almost 28 times, to 28000.
It means touching more memory locations within the lookup table and, as a
result, more cache misses. Also amount of branch mispredictions is greater
due to processing greater variety of patterns. - Scalar hybrid performs surprisingly well, it can be even 2 times faster than
the baseline. However, its compile time is terribly long.
Nathan Kurz and Daniel Lemire gave incredibly valuable feedback and
comments on text and experiments. Thank you!
External links:
Internal links:
Source code is available on github.