Implementing a world fastest Java int-to-int hash map*

by Mikhail Vorontsov

* Fastest among int-to-int map implementations I have tested in my previous article in the tests I have implemented for that article.

I would like to thank Sebastiano Vigna and Roman Leventov for sharing their hash map wisdom with me. Some implementation ideas were inspired by “Code Optimization: Effective Memory Usage” by Kris Kaspersky .

This article will give you a step by step overview of various implementation tricks used in the modern hash map implementations. At the end of this article you will have a probably fastest Java int-to-int hash map implementation available at the moment of writing of this article.

Open indexing

Most of modern hash maps are based on the idea of open indexing. What does it mean? Your map is based on the array of keys (values will always be placed at the matching array index, so forget about them for now). You have to find your key in the array of keys for each map operation. How does it implemented?

First of all, you need the initial lookup position in the array. It may be calculated by any function which maps a key into an integer in the range [0, array.length - 1]. A key is usually mapped into an integer by means of hashCode method. A simplest function here could be Math.abs(key.hashCode() % array.length) (keep in mind that % result could be negative).

As you understand, a mapping of large set of keys into a small set of integer values means that you may end up with some collisions (they are called hash collisions) – same results of the initial function for the different keys. Collisions are resolved by trying to apply another function to the original array index. The simplest of such functions is (prevIdx + 1) % array.length. There is one requirement for such functions – if applied in a loop, they should cover the whole set or array cells, so that you can use the whole array capacity. Another example of such function is incrementing the index by one prime number if the array length is another prime number.

Free and removed cells

In theory, that’s enough to implement your own hash map. In practice, you need to distinguish free and removed cells from occupied cells (you can avoid using removed cells if you’ll do extra work in remove method – see how it is implemented in the latest FastUtil). Removed cells are also known as “tombstones”.

Your keys array is initially filled with free “cells”. You set a cell into “removed” state if you need to remove an existing key.

Let’s take a look at an example:

Open indexing example

This int key map uses the initial and next functions defined above:

1 2	initial = Math.abs( key % array.length ); nextIdx = ( prevIdx + 1 ) % array.length;

initial = Math.abs( key % array.length );
nextIdx = ( prevIdx + 1 ) % array.length;

This map originally contained keys 1, 2, 3 and 4, but key=3 was subsequently removed from a map, so it was replaced with a removed (“R”) placeholder.

Let’s see what should we do to find the following keys:

Key	Description
2	Start function points at a cell with index=2 at once. We have key=2 at a cell with index=2, so no further lookup is required.
3	Start function points at a cell with index=3. This cell is “removed”, so we have to apply “nextIdx” function in a loop until we either find a key or a free cell. We check cell index=4 next – bad luck, key is not equal. Then we check cell index=5: it is a free cell, so we stop the lookup – key is not found.

Next, let’s see what will happen if we want to add key=10: initial = key % array.length = 10 % 9 = 1. Cell at index=1 is already occupied with another key, so we can not use it. So is cell at index=2. The cell at index=3 is “removed”, so we can reuse it and put key=10 into it.

Removed cells cleanup

In many cases your hash map may degrade to O(n²) complexity if you would keep the removed cells in the map. Fastest maps are implementing the removed cells cleanup one way or another. As a result, all other map methods will need to distinguish just 2 cell states: free or used. Besides that, remove method is usually called infrequently compared to get and less frequently than put, which means that some extra complexity during key removal will be paid off by fasted execution of other methods. This article will use FastUtil cleanup logic.

Key scrambling

The initial index function I have mentioned above ( initial = Math.abs( key % array.length ); ) will put consecutive keys in the consecutive array cells. This is highly undesirable if your next cell function is simply picking up the next array cell, because it will cause the long lookup chains to be created in a pretty common case.

In order to avoid it, we need to “scramble” the key, shuffling its bits. I will rely on FastUtil scrambling code:

private static final int INT_PHI = 0x9E3779B9;
 
public static int phiMix( final int x ) {
    final int h = x * INT_PHI;
    return h ^ (h >> 16);
}

private static final int INT_PHI = 0x9E3779B9;

public static int phiMix( final int x ) {
    final int h = x * INT_PHI;
    return h ^ (h >> 16);
}

As a result, consecutive keys will not be placed in the consecutive array cells, thus keeping the average hash chain length under control. As for “random” keys case, you are likely to end up with a pretty good distribution of keys over the keys array as well.

Now you are definitely ready to implement your own hash map. We will be implementing an int-int map in the next several sections of this article.

Version 1: Base `int-int` map

Let’s start from a simplest possible implementation (which will leave us enough space for optimization). This implementation will look similar to Trove 3.0 TIntIntHashMap (though I was not copying its source code while writing this example).

It will use 3 arrays: one int[] for keys, one int[] for values and one boolean[] for cell usage flags (true==used). We will allocate the arrays of requested size (it will be size / fillFactor + 1), despite the fact that all production quality implementations are rounding this number up. Initial and nextIdx functions are the simplicity itself:

1 2	initial = Math.abs( Tools.phiMix(key) % array.length); nextIdx = ( prevIdx + 1 ) % array.length;

initial = Math.abs( Tools.phiMix(key) % array.length);
nextIdx = ( prevIdx + 1 ) % array.length;

You can find the source code of this and all other maps at the end of this article.

I will compare each implementation results to the previous article results. Let’s compare it to the fastest implementation – Koloboke (note that I will give only get test results for the intermediate versions of a map to save space – you can find the full test results at the end of an article). All tests for this article are running on the random key set. All maps have the fill factor set to 0.75.

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.primitive.KolobokeMutableMapTest	1867	2471	3129	7546	11191
tests.maptests.article_examples.IntIntMap1Test	2768	3671	6105	12313	16073

That’s already great! The very first unoptimized attempt to implement a hashmap is less than 2 times slower than the Koloboke implementation.

Version 2: avoiding expensive `%` operation – arrays capacity is power of 2 now

Surprisingly many people think that old wisdom about extremely slow integer division/modulo operations is no longer valid – new CPUs are so smart and fast!!! Unfortunately this is still incorrect – both integer division operations are pretty slow and should be avoided in the performance critical code.

Previous version of a hashmap used % operation for both start and next index calculations, so it was guaranteed to be executed at least once per hash map method call. We can avoid it if our arrays size would be equal to a power of 2 (first power of 2 higher than the expected capacity). In this case our simple index functions could be transformed into:

1 2	initial = Tools.phiMix( key ) & (array.length-1); nextIdx = ( prevIdx + 1 ) & (array.length-1);

initial = Tools.phiMix( key ) & (array.length-1);
nextIdx = ( prevIdx + 1 ) & (array.length-1);

array.length-1 should be, of course, cached into a separate mask class field. Why array.length-1 could be used as a mask? It is a known fact that if K = 2^N, then X % K == X & (K - 1). Using & operation gives us an extra benefit of always non-negative results (highest bit is always cleared by such masks), which further speeds up the calculation.

Keep in mind that all high performance hash maps are relying on this optimization.

Let’s compare this implementation results with the previous maps:

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.primitive.KolobokeMutableMapTest	1867	2471	3129	7546	11191
tests.maptests.article_examples.IntIntMap1Test	2768	3671	6105	12313	16073
tests.maptests.article_examples.IntIntMap2Test	2254	2767	4869	10543	16724

This optimization alone allowed us to make half a way from the slowest to the fastest implementation! Nevertheless, there is a long way ahead of us.

Version 3: getting rid of `m_used` array

The previous version of a map used 3 separate arrays to store the map data. It means that for a random key a map has to access 3 different areas of memory, each time likely causing a CPU cache miss. High performance code should minimize the possible number of CPU cache misses it causes on the normal operation path. The simplest optimization we can do is to get rid of m_used array and encode the cell usage flag in a smarter way.

The problem is that we are implementing an int-to-int map, so we may expect any int to be used as a key (how pathetic would be a general purpose hashmap stating that some key N is reserved and could not be used…). It means that we need some extra storage for usage flags, isn’t it? Yes, it is. But the point is that we can use O(1) instead of O(n) bytes for this storage!

The idea is to choose one special key value for free cells. There are 2 known strategies after that (I will use the first one):

Store a value matching to the free cell in a separate field. You also need some indication if a free key is actually used (a boolean and an int or just Boolean is fine for this purpose). Each map method should check if a key argument is equal to a free key prior to the normal logic and act accordingly.
Pick a random free key. If you are requested to insert it in the map, select a new random free key, which is not present in the map so far (and replace all old free keys with a new value). Now you need to store the free key instead of its corresponding values, but you would not get any other benefits. Koloboke is the only implementation which uses this strategy.

As a side note I want to mention that dealing with free keys on maps with Object keys is much easier:

1	private static final Object FREE_KEY = new Object();

private static final Object FREE_KEY = new Object();

This key is not accessible to other classes, so it could never be passed into your map implementation as a key. Those smart guys who remember about reflection are reminded that the hash map keys must properly implement equals and hashCode methods, which is not the case for this private object

As I have mentioned before, our implementation will use a hardcoded free key (0, which is a cheapest to compare constant) and store the corresponding value in the separate fields:

private static final int FREE_KEY = 0;
 
/** Keys */
private int[] m_keys;
/** Values */
private int[] m_values;
 
/** Do we have 'free' key in the map? */
private boolean m_hasFreeKey;
/** Value of 'free' key */
private int m_freeValue;

private static final int FREE_KEY = 0;

/** Keys */
private int[] m_keys;
/** Values */
private int[] m_values;

/** Do we have 'free' key in the map? */
private boolean m_hasFreeKey;
/** Value of 'free' key */
private int m_freeValue;

Let’s compare this implementation results with the previous maps:

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.primitive.KolobokeMutableMapTest	1867	2471	3129	7546	11191
tests.maptests.article_examples.IntIntMap1Test	2768	3671	6105	12313	16073
tests.maptests.article_examples.IntIntMap2Test	2254	2767	4869	10543	16724
tests.maptests.article_examples.IntIntMap3Test	2050	2269	3548	9074	13750

As you can see, the impact of this change grows with the map size (the bigger is your map, the less help you get from the CPU cache). Nevertheless, we are still pretty far from the Koloboke implementation. But actually this map would already be the third in my previous article, after Koloboke and FastUtil.

Versions 4 and 4a – replacing arrays of keys and values with a single array

This step follows the direction of the previous step – now we want to use a single array to store keys and values. This will allow us to be able to access/modify values at a very low cost, because they will be located next to the keys.

There are 2 possible implementations in case of int-to-int map:

Use long[] – a key and a value will share the single long cell. Usefulness of this method is limited to certain types of keys and values.
Use a single int[] – keys and values will be interleaved (the only meaningful interleaving strategy is storing a value right after a key). This strategy has a small disadvantage of limiting the maximal map capacity to 1 billion cells (the maximal array size in Java is equal to Integer.MAX_VALUE). I believe this is not a problem for the most of use cases.

The difference between 2 those scenarios is the need to use bit arithmetic / type conversions to extract a key and a value from a long cell. My tests have shown that these operations have a noticeable negative impact on the map performance. Nevertheless, I have included both long[] (IntIntMap4) and int[] (IntIntMap4a) versions to this article.

Micro optimizations

Both versions would be pretty fast, but you need more to become the fastest. The important thing you should understand about a hashmap that its basic operations have O(1) complexity provided that you were not too greedy with the fill factor. It actually means that you must count the instructions on the hash hit path (the very first cell you check if either free or contains the requested key). Optimizing a hash collision loop is also important, but (I repeat this) you must be very careful on the hash hit path, because most of your operations will end up with a hash hit.

Having this in mind, you may want to inline some of helper methods, especially the ones which could save you an instruction or two if inlined. Take a look, for example, at the get method from the previous map version:

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;
 
    final int idx = getReadIndex( key );
    return idx != -1 ? m_values[ idx ] : NO_VALUE;
}
 
private int getReadIndex( final int key )
{
    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return idx;
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return -1;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return -1;
        if ( m_keys[ idx ] == key )
            return idx;
    }
    return -1;
}

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    final int idx = getReadIndex( key );
    return idx != -1 ? m_values[ idx ] : NO_VALUE;
}

private int getReadIndex( final int key )
{
    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return idx;
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return -1;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return -1;
        if ( m_keys[ idx ] == key )
            return idx;
    }
    return -1;
}

The first check (key == FREE_KEY) is unlikely to happen, so it will be predicted correctly by CPU in most of cases. Essentially, it can be excluded from consideration here.

Next you get the cell index to use from a helper getReadIndex method. It is perfectly object-oriented, but it causes an extra check ( return idx != -1 ? m_values[ idx ] : NO_VALUE ) to be made. This check, unlike previous ones, is unpredictable unless your map is essentially read-only and most of your get calls are made for the existing keys. Unpredictable checks should be avoided on the critical path (if possible).

It is easy to achieve in this case – just inline the body of getReadIndex into get:

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;
 
    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return m_values[ idx ];
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return NO_VALUE;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return NO_VALUE;
        if ( m_keys[ idx ] == key )
            return m_values[ idx ];
    }
    return NO_VALUE;
}

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return m_values[ idx ];
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return NO_VALUE;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return NO_VALUE;
        if ( m_keys[ idx ] == key )
            return m_values[ idx ];
    }
    return NO_VALUE;
}

As you can see, the only transformation I did was the replacement of getReadIndex return values with get return values. You can do a few similar transformations here and there. It also worth to manually inline getStartIndex and getNextIndex calls – you can’t tell for sure if they will be inlined by JIT. As a result, you may end up with the following get method (taken from IntIntMap4a – an int[] version):

public int get( final int key )
{
    int ptr = ( Tools.phiMix( key ) & m_mask) << 1;
 
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;
 
    int k = m_data[ ptr ];
 
    if ( k == FREE_KEY )
        return NO_VALUE;  //end of chain already
    if ( k == key ) //we check FREE prior to this call
        return m_data[ ptr + 1 ];
 
    while ( true )
    {
        ptr = (ptr + 2) & m_mask2; //that's next index
        k = m_data[ ptr ];
        if ( k == FREE_KEY )
            return NO_VALUE;
        if ( k == key )
            return m_data[ ptr + 1 ];
    }
}

public int get( final int key )
{
    int ptr = ( Tools.phiMix( key ) & m_mask) << 1;

    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    int k = m_data[ ptr ];

    if ( k == FREE_KEY )
        return NO_VALUE;  //end of chain already
    if ( k == key ) //we check FREE prior to this call
        return m_data[ ptr + 1 ];

    while ( true )
    {
        ptr = (ptr + 2) & m_mask2; //that's next index
        k = m_data[ ptr ];
        if ( k == FREE_KEY )
            return NO_VALUE;
        if ( k == key )
            return m_data[ ptr + 1 ];
    }
}

As you can see, we have just the following operations on the hash hit path:

A check for free key - highly predictable => extremely cheap
4 bit operations and a multiplication to calculate the start index
Extraction of a key - the only truly expensive operation, can't be avoided
Checks for free key / our key - your scenario may cause these checks to be predicted correctly, but they are not predictable in general case. Anyway, the arguments to compare will reside in L1 cache at worst, so both checks are cheap.
If key is found - extraction of a value. Our data layout ensures that a value will be fetched into L1 cache with a key, so reading a value is very cheap.

The hash hit path costs us:

2 highly predictable comparisons with constants
2 non predictable comparisons with constant / a key which is most likely already loaded into a register
4 bit operations (all arguments are either constants or values located no further than in L1 cache)
1 multiplication (unfortunately unavoidable if you need to shuffle the bits)
1 memory read which is likely to cause CPU cache miss (loading a key to check)
1 memory read which will be served out of L1 cache (after a previous mem read operation)

11 operations only to get a value by a key (and only one of them is likely to be expensive) - no wonder that a test is able to fetch 10 million keys a second in the worst case (when a map is too large and CPU cache can't help us and we also likely to check a few keys due to a large fill factor).

Test results

Now we will present the test results for all 5 map implementations and a Koloboke map as the fastest int-to-int map from my previous article.

"get" test results

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.primitive.KolobokeMutableMapTest	1867	2471	3129	7546	11191
tests.maptests.article_examples.IntIntMap1Test	2768	3671	6105	12313	16073
tests.maptests.article_examples.IntIntMap2Test	2254	2767	4869	10543	16724
tests.maptests.article_examples.IntIntMap3Test	2050	2269	3548	9074	13750
tests.maptests.article_examples.IntIntMap4Test	1902	2296	3229	7749	10661
tests.maptests.article_examples.IntIntMap4aTest	1738	22209	2927	6582	9969

int-int 'get' results

"put" test results

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.primitive.KolobokeMutableMapTest	3262	4549	5600	12182	16140
tests.maptests.article_examples.IntIntMap1Test	9048	10555	11322	23004	28974
tests.maptests.article_examples.IntIntMap2Test	4305	4816	9435	19805	26770
tests.maptests.article_examples.IntIntMap3Test	3865	4063	7455	15274	21014
tests.maptests.article_examples.IntIntMap4Test	3562	4676	5866	12999	16304
tests.maptests.article_examples.IntIntMap4aTest	3411	4401	5374	11289	15095

int-int 'put' results

"remove" test results

Map size:	10.000	100.000	1.000.000	10.000.000	100.000.000
tests.maptests.article_examples.IntIntMap1Test	8301	9142	9313	16507	24134
tests.maptests.article_examples.IntIntMap2Test	3915	3890	6227	13450	20456
tests.maptests.article_examples.IntIntMap3Test	3339	3270	4901	10425	16120
tests.maptests.article_examples.IntIntMap4Test	3098	3052	3988	8670	12019
tests.maptests.article_examples.IntIntMap4aTest	2870	2876	3823	8127	11503
tests.maptests.primitive.KolobokeMutableMapTest	2836	3042	3923	8228	12007

int-int 'remove' results

As you can see, the results are pretty obvious - IntIntMap4a is faster than a Koloboke int-int map, which was the fastest map in my previous article.

Summary

If you want to optimize your hash map for speed, you have to do as much as you can of the following list:

Use underlying array(s) with capacity equal to a power of 2 - it will allow you to use cheap & instead of expensive % for array index calculations.
Do not store the state in the separate array - use dedicated fields for free/removed keys and values.
Interleave keys and values in the one array - it will allow you to load a value into memory for free.
Implement a strategy to get rid of 'removed' cells - you can sacrifice some of remove performance in favor of more frequent get/put.
Scramble the keys while calculating the initial cell index - this is required to deal with the case of consecutive keys.

Source code

Classes written for this article are added to my previous hash map testing project on GitHub: https://github.com/mikvor/hashmapTest.

Please note you should run this project via tests.MapTestRunner class:

mvn clean install
java -cp target/benchmarks.jar tests.MapTestRunner

The post Implementing a world fastest Java int-to-int hash map* appeared first on Java Performance Tuning Guide.