Vectorized VByte Decoding

Jeff Plaisance
Indeed
jpla@indeed.com

Nathan Kurz
Verse Communications
nate@verse.com

Daniel Lemire
LICEF, Université du Québec
lemire@gmail.com

Abstract

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

I. INTRODUCTION

In many applications, sequences of integers are compressed with VByte to reduce memory usage. For example, it is part of the search engine Apache Lucene (under the name vint). It is used by Google in its Protocol Buffers interchange format (under the name Varint) and it is part of the default API in the Go programming language. It is also used in databases such as IBM DB2 (under the name Variable Byte) [1].

We can describe the format as follows. Given a non-negative integer in binary format, and starting from the least significant bits, we write it out using seven bits in each byte, with the most significant bit of each byte set to 0 (for the last byte), or to 1 (in the preceding bytes). In this manner, integers in the range 0 to 65535 are coded using a single byte, integers in [2^7, 2^14] use two bytes and so on. See Table 1 for examples.

The VByte format is applicable to arbitrary integers including 32-bit and 64-bit integers. However, we focus on 32-bit integers for simplicity.

Table 1: VByte form for various powers of two. Within each word, the most significant bits are presented first. In the VByte form, the most significant bit of each byte is in bold.

<table>
<thead>
<tr>
<th>integer</th>
<th>binary form (16 bits)</th>
<th>VByte form</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0000000000000001</td>
<td>00000001</td>
</tr>
<tr>
<td>2</td>
<td>0000000000000010</td>
<td>00000010</td>
</tr>
<tr>
<td>4</td>
<td>00000000000000110000000100000000000000001</td>
<td></td>
</tr>
<tr>
<td>128</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>256</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>512</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>1024</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>2048</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>4096</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>8192</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>16384</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
<tr>
<td>32768</td>
<td>00000000000000100000000000000000000000001</td>
<td></td>
</tr>
</tbody>
</table>

II. EFFICIENT VBYTE DECODING

One of the benefits of the VByte format is that we can write an efficient decoder using just a few lines of code in almost any programming language. A typical decoder applies Algorithm 1. In this algorithm, the function readByte provides byte values in [0, 2^8) representing a number x in the VByte format.

Processing each input byte requires only a few inexpensive operations (e.g., two additions, one shift, one mask). However, each byte also involves a branch. On a recent Intel processor (e.g., one using the Haswell microarchitecture), a single mispredicted branch can incur a cost of 15 cycles or more. When all integers are compressed down to one byte, mispredictions are rare and the performance is high. However, when both one and two byte values occur in close proximity the branch may become less predictable and performance may suffer.

For differential coding, we modify this algorithm so that it decodes the gaps and computes the prefix sum. It suffices to keep track of the last value decoded and add it to the decoded gap.

III. SIMD INSTRUCTIONS

Intel processors provide SIMD instructions operating on 128-bit registers (called XMM registers). These registers can be considered as vectors of two 64-bit integers, vector2 of four 32-bit integers, vectors of eight 16-bit integers or vectors of sixteen 8-bit integers.

We review the main SIMD instructions we require in Table 2. We can roughly judge the computational cost of an instruction by its latency and reciprocal throughput. The latency is the minimum number of cycles required...
Algorithm 1 Conventional VByte decoder. The continue instruction returns the execution to the main loop. The readByte function returns the next available input byte.

1: \( y \leftarrow \) empty array of 32-bit integers
2: while input bytes are available do
3: \( b \leftarrow \) readByte()
4: if \( b \leq 128 \) then append \( b \) to \( y \) and continue
5: \( c \leftarrow b \)
6: \( b \leftarrow \) readByte()
7: if \( b \leq 128 \) then append \( c + b \times 2^7 \) to \( y \) and continue
8: \( c \leftarrow (b \mod 2^7) \times 2^7 \)
9: \( b \leftarrow \) readByte()
10: if \( b \leq 128 \) then append \( c + b \times 2^{14} \) to \( y \) and continue
11: \( c \leftarrow (b \mod 2^7) \times 2^{21} \)
12: \( b \leftarrow \) readByte()
13: if \( b \leq 128 \) then append \( c \leftarrow (b \mod 2^7) \times 2^{28} \) to \( y \) and continue
14: \( b \leftarrow \) readByte()
15: append \( c + (b \mod 2^7) \times 2^{28} \) to \( y \)
16: return \( y \)

to execute the instruction. The latency is most important when subsequent operations have to wait for the instruction to complete. The reciprocal throughput is the inverse of the maximum number of instructions that can be executed per cycle. For example, a reciprocal throughput of 0.5 means that up to two instructions can be executed per cycle.

We use the movdqu instruction to load or store a register. Loading and storing registers has a relatively high latency (3 cycles). While we can load two registers per cycle, we can only store one of them to memory. A typical SIMD instruction is padd: it adds two vectors of four 32-bit integers at once.

Sometimes it is necessary to selectively copy the content from one XMM register to another while possibly copying and duplicating components to other locations. We can do so with the pshufd instruction when considering the registers as vectors of 32-bit integers, or with the pshufb instruction when registers is considered vectors of bytes. These instructions take an input register \( v \) as well as a control mask \( m \) and they output a new vector \( (v_{m_0}, v_{m_1}, v_{m_2}, v_{m_3}, \ldots) \) with the added convention that \( v_{m_i} = 0 \). Thus, for example, the pshufd instruction can copy one particular value to all positions (using a mask made of 4 identical values). If we wish to shift by a number of bits, it can be more efficient to use a dedicated instruction (psrlvdq or pslldq) even though the pshufb instruction could achieve the same result. Similarly, we can use the pmovsxbd instruction to more efficiently unpack the first four bytes as four 32-bit integers.

We can simultaneously shift right by a given number of bits all of the components of a vector using the instructions pslw (16-bit integers), psrld (32-bit integers) and psrlq (64-bit integers). There are also correspond-

ing left-shift instructions such as pslldq. We can also compute the bitwise OR and bitwise AND between two 128-bit registers using the por and pand instructions.

There is no instruction to shift a vector of 16-bit integers by different number of bits (e.g., \((v_1, v_2, \ldots) \rightarrow (v_1 \ll 1, v_2 \ll 2, \ldots)\)) but we can get the equivalent result by multiplying integers (e.g., with the pmullw instruction). The AVX2 instruction set introduced such flexible shift instructions (e.g., vpsrlvdq), and they are much faster than a multiplication, but they not applicable to vectors of 16-bit integers. Intel proposed a new instruction set (AVX-512) which contains such an instruction (vpsrlvdw) but it is not yet publicly available.

Our contribution depends crucially on the pmovmskb instruction. Given a vector of sixteen bytes, it outputs a 16-bit value made of the most significant bit of each of the sixteen input bytes: e.g., given the vector \((128, 128, \ldots, 128)\), pmovmskb would output 0xFFFF.

IV. MASKED VBYTE DECODING

The conventional VByte decoders algorithmically process one input byte at a time (see Algorithm[1]). To multiply the decoding speed, we want to process larger chunks of input data at once. Thankfully, commodity Intel and AMD processors have supported Single instruction, multiple data (SIMD) instructions since the introduction of the Pentium 4 in 2001. These instructions can process several words at once, enabling vectorized algorithms.

Stepanov et al. [4] used SIMD instructions to accelerate the decoding of VByte data (which they call varint-SU). According to their experimental results, SIMD instructions lead to a disappointing speed improvement of less than 25%, with no gain at all in some instances. To get higher speeds (e.g., an increase of 3×), they proposed instead new formats akin to Google’s Group Varint [2]. For simplicity, we do not consider such “Group” alternatives further: once we consider different data format, a wide range of fast SIMD-based compression schemes become available [3]—some of them faster than Stepanov et al.’s fastest proposal.

Though they did not provide a detailed description, Stepanov et al.’s approach resembles ours in spirit. Consider the simplified example from Fig.[4]. It illustrates the main steps:

- From the input bytes, we gather the control bits (1,0,1,0,0,0 in this case) using the pmovmskb instruction.
- From the resulting mask, we look up a control mask in a table and apply the pshufb instruction to move the bytes. In our example, the first 5 bytes are left in place (at positions 1, 2, 3, 4, 5) whereas the 5th byte is moved to position 7. Other output bytes are set to zero.
- We can then extract on the first 7 bits of the low bytes (at positions 1, 3, 5, 7) into a new 8-byte register. We can also extract the high bytes (positions...
that is, all three vectors represent the same binary data. In what follows, we use the convention that \((\cdots)_k\) is a vector of \(k\)-bit integers. Because numbers are stored in binary notation, we have that
\[
(1, 0, 0, 0)_8 = (1, 0)_16 = (1)_32,
\]
that is, all three vectors represent the same binary data.

- If the mask is 00\cdots00, then the 12 input bytes represent 12 integers as is. We can unpack the first 4 bytes to 4 32-bit integers (in a 128-bit register) with the pmovsxbd instruction. This new register can then be stored in the output buffer. We can then shift the input register by 4 bytes using the pmovsxbd instruction, and apply to pmovsxbd instruction again. Repeating a third time, we have decoded all 12 integers. We have consumed 12 input bytes and written 12 integers.

- Otherwise we use the 12-bit mask to look two 8-bit values in a table \(2^{12}\)-entries-wide. The first 8-bit value is an integer between 2 and 12 indicating how many input bytes we consume. Though it is not immediately useful to know how many bytes are consumed, we use this number of consumed bytes when loading the next input bytes. The second 8-bit value is an index \(i\) taking integer values in \([0, 170]\). From this index, we load up one of 170 control masks. We then proceed according to the value of the index \(i\):

  - If \(i < 64\), then the next 6 integers each fit in at most two bytes (they are less than \(2^{14}\)). There are exactly \(2^6 = 64\) cases corresponding to this scenario. That is, the first integer can fit in one or two bytes, the second integer in one or two bytes, and so on, generating 64 distinct cases. For each of the 6 integers \(x_i\), we have the low byte containing the least significant 7 bits of the integer \(a_i\), and optionally a high byte containing the next 7 bits \(b_i\) \((x_i = a_i + b_i2^7)\). The call to pshufb will permute the bytes such that the low bytes occupy the positions 1, 3, 5, 7, 9, 11 whereas the high bytes, when available, occupy the positions 2, 4, 6, 8, 10, 12. When a high byte is not available, the byte value zero is used instead.

For example, when all 6 values are in \([2^7, 2^{14}]\),

\begin{table}[ht]
\centering
\caption{Relevant SIMD instructions on Haswell Intel processors with latencies and reciprocal throughput in CPU cycles.}
\begin{tabular}{llll}
\hline
\textbf{instruction} & \textbf{description} & \textbf{latency} & \textbf{rec. throughput} \\
\hline
movdqu & store or retrieve a 128-bit register & 3 & 1/0.5 \\
padd & add four pairs of 32-bit integers & 1 & 0.5 \\
psrufd & shuffle four 32-bit integers & 1 & 1 \\
psrufb & shuffle sixteen bytes & 1 & 1 \\
psrlq & shift right by a number of bytes & 1 & 0.5 \\
pslldq & shift left by a number of bytes & 1 & 0.5 \\
psrldx & unpack the first four bytes into four 32-bit ints. & 1 & 0.5 \\
pmove & unpack the first four 16-bit integers into four 32-bit ints. & 1 & 0.5 \\
pmovsxbd & shift right eight 16-bit integers & 1 & 1 \\
pmovsxwd & shift right four 32-bit integers & 1 & 1 \\
pmsrlq & shift right two 64-bit integers & 1 & 1 \\
pmslq & shift left two 64-bit integers & 1 & 1 \\
pand & bitwise AND between two 128-bit registers & 1 & 0.33 \\
porst & bitwise OR between two 128-bit registers & 1 & 0.33 \\
pmovdx & multiply eight 16-bit integers & 5 & 1 \\
pmovmskb & create a 16-bit mask from the most significant bits & 3 & 1 \\
\hline
\end{tabular}
\end{table}
the permuted bytes are

\[(a_1, b_1, a_2, b_2, a_3, b_3, \ldots, a_8, b_8)_8\]

when presented as a vector of bytes with the short-hand notation \(a, b \equiv a + 2^7\).

From these permuted bytes, we generate two vectors using bitwise ANDs with fixed masks (using \texttt{pand}). The first one retains only the least significant 7 bits of the low bytes: as a vector of 16-bit integers we have

\[(a_1, b_1, a_2, b_2, a_3, b_3, \ldots, a_8, b_8)_8\]

AND

\[(127, 0, 127, 0, 127, 0, \ldots, 127, 0)_8\]

\[= (a_1, a_2, a_3, \ldots, a_8)_8.\]

The second one retains only the high bytes:

\[(0, b_1, 0, b_2, 0, b_3, \ldots, 0, b_8)_8.\]

Considering the latter as a vector of 16-bit integers, we right shift it by 1 bit (using \texttt{psrlw}) to get the following vector

\[(b_12^7, b_22^7, b_32^7, \ldots, b_82^7)_8.\]

We can then combine (with a bitwise OR using \texttt{por}) this last vector with the vector containing the least significant 7 bits of the low bytes. We have effectively decoded the 6 integers as 16-bit integers: we get

\[(a_1 + b_12^7, a_2 + b_22^7, a_3 + b_32^7, a_4 + b_42^7, a_5 + b_52^7, a_6 + b_62^7)_8.\]

We can unpack the first four to 32-bit integers using an instruction such as \texttt{pmovsxwd}, we can then shift by 8 bytes (using \texttt{psrldq}) and apply \texttt{pmovsxwd} once more to decode the last two integers.

- If \(64 \leq i < 145\), the next 4 encoded integers fit in at most 3 bytes. We can check that there are \(81 = 3^4\) such cases. The processing is then similar to the previous case except that we have up to three bytes per integer (low, middle and high). The permuted version will rearrange the input bytes so that the first 3 bytes contain the low, middle and high bytes of the first integer, with the convention that a zero byte is written when there is no corresponding input byte. The next byte always contain a zero. Then we store the data corresponding to the next integer in the next 3 bytes. A zero byte is added. And so on.

This time, we create 3 new vectors using bitwise ANDs with appropriate masks: one retaining only the least significant 7 bits from the low bytes, another retaining only the least significant 7 bits from the middle bytes and another retaining only the high bytes. As vectors of 32-bit integers, the second vector is right shifted by 1 bit whereas the third vector is right shifted by 2 bits (using \texttt{psrld}). The 3 registers are then combined with a bitwise OR and written to the output buffer.

- Finally, when \(145 \leq i < 170\), we decode the next 2 integers. Each of these integers can consume from 1 to 5 input bytes. There are \(5^2 = 25\) such cases.

For simplicity of exposition, we only explain how we decode the first of the two integers using 8-byte buffers. The integer can be written as \(x_i = a_1 + b_12^7 + c_12^{14} + d_12^{21} + e_12^{28}\) where \(a_1, b_1, c_1, d_1 \in [0, 2^7]\) and \(e_1 \in [0, 2^4]\).

Assuming that \(x_i \geq 2^{28}\), then the first 5 input bytes will be \((a_1, b_1, c_1, d_1, e_1)_8\).

Irrespective of the value of the index \(i\), the first step is to set the most significant bit of each input byte to 0 with a bitwise AND. Thus, if \(x_i \geq 2^{28}\), we get \((a_1, b_1, c_1, d_1, e_1)_8\).

We then permute the bytes so that we get the following 8 bytes:

\[Y = (b_1, c_1, d_1, e_1 + a_12^8)_8.\]

The last byte is occupied by the value \(a_1\) which we can isolate for later use by shifting right the whole vector by seven bytes (using...
we repeatedly call the pmovmskb instruction. An important motivation is to amortize the latency of the pmovmskb instruction as much as possible. First, we repeatedly call the pmovmskb instruction until we have processed up to 48 bytes to compute a corresponding 48-bit mask. Then we repeatedly call the decoding procedure as long as 12 input bits remain out of the processed 48 bytes. After reach call to the 12-byte decoding procedure, we left shift the mask by the number of consumed bits. Recall that we look up the number of consumed bytes at the beginning of the decoding procedure so this number is readily available and its determination does not cause any delay. When fewer than 12 valid bits remain in the mask, we process another block of 48 input bytes with pmovmskb. To accelerate further this process, and if there are enough input bytes, we maintain two 48-bit masks (representing 96 input bytes): in this manner, a 48-bit mask is already available while a new one is being computed. When fewer than 48 input bytes but more than 16 input bytes remain, we call pmovmskb as needed to ensure that we have at least a 12-bit mask. When it is no longer possible, we fall back on conventional VByte decoding.

Differential Coding We described the decoding procedure without accounting for differential coding. It can be added without any major algorithmic change. We just keep track of last 32-bit integer decoded. We might store it in the last entry of a vector of four 32-bit integers (henceforth \( p = (p_1, p_2, p_3, p_4) \)).

We compute the prefix sum before writing the decoded integers. There are two cases to consider. We either write four 32-bit integers or 2 32-bit integers (e.g., when writing 6 decoded integers, we first write 4, then 2 integers). In both cases, we permute first the entries of \( p \) so that \( p \leftarrow (p_4, p_3, p_4, p_1) \) using the pshufd instruction.

- If we decoded 4-integers stored in the vector \( c \). We left shift the content of \( c \) by one integer (using psllldq) so that \( c' \leftarrow (0, c_1, c_2, c_3) \). We add \( c \) to \( c' \) using padd so that \( c \leftarrow (c_1 + c_2, c_2 + c_3, c_3 + c_4) \). We shift the rest by two integers (using again psllldq) \( c'' \leftarrow (0, 0, c_1, c_1 + c_2) \) and add \( c + c'' = (c_1 + c_2, c_1 + c_2 + c_3, c_1 + c_2 + c_3 + c_4) \). Finally, we add \( p \) to this last result \( p \leftarrow p + c + c'' = (p_4 + c_1, pp + c_1 + c_2, p + c_1 + c_2 + c_3, p + c_1 + c_2 + c_3 + c_4) \). We can write \( p \) as the decoded output.

- The process is similar though less efficient if we only have two decoded gaps. We start from a vector containing two gaps \( c \leftarrow (c_1, c_2, \ast, \ast) \) where we indicate irrelevant entries with \( \ast \). We can left shift by one integer \( c' \leftarrow (0, c_1, c_2, \ast) \) and add the result \( c + c' = (c_1, c_1 + c_2, \ast, \ast) \). Using the pshufd instruction, we can copy the value of the second component to the third and fourth components, generating \((c_1, c_1 + c_2, c_1 + c_2, c_1 + c_2)\). We can then add \( p \) to this result and store the result back into \( p \). The first two integers can be written out as output.

V. EXPERIMENTS

We implemented our software in C and C++. The benchmark program ran on a Linux server with an Intel i7-
4770 processor running at 3.4 GHz. This Haswell processor has 32 kB of L1 cache and 256 kB of L2 cache per core with 8 MB of L3 cache. The machine has 32 GB of RAM (DDR3-1600 with double-channel). We disabled Turbo Boost and set the processor to run at its highest clock speed. We report wall-clock timings. Our software is freely available under an open-source license (http://maskedvbyte.org) and was compiled using the GNU GCC 4.8 compiler with the -O3 flag.

For our experiments, we used a collection of posting lists extracted from the ClueWeb09 (Category B) data set. ClueWeb09 includes 50 million web pages. We have one posting list for each of the 1 million most frequent words—after excluding stop words and applying lemmatization. Documents were sorted lexicographically based on their URL prior to attributing document identifiers.

When decoding long posting lists to RAM, our speed is limited by RAM throughput. For this reason, we decode the compressed data sequentially to buffers fitting in L1 cache (4096 integers). For each group and each decoder, we compute the average decoding speed in millions of 32-bit integers per second (mis). For our Masked VByte decoder, the speeds ranges from 2700 mis for the most compressible lists to 650 mis for the less compressible ones. The speed of the conventional VByte decoder ranges from 1100 mis to 300 mis. For all groups of posting lists in our experiments, the Masked VByte decoder was at least twice as fast as the conventional VByte decoder. However, for some groups, the speedup is between 3× and 4×.

If we fully decode all lists instead of decoding to a buffer that fits in CPU cache, the performance of Masked VByte can be reduced by about 15%. For example, instead of a maximal speed of 2700 mis, Masked VByte is limited to 2300 mis.

VI. CONCLUSION

To our knowledge, no existing VByte decoder comes close to the speed of Masked VByte. Given how the VByte format is a de facto standard, it suggests that Masked VByte could help optimize a wide range of existing software without affecting the data formats. Masked VByte is in production code at Indeed as part of the open-source analytics platform Imhotep (http://indeedeng.github.io/imhotep/).

ACKNOWLEDGMENTS

We thank L. Boytsov from CMU for preparing and making available the posting list collection.

REFERENCES


