Building an HNSW Index From Scratch In C++ (2/2)


Introduction

In the previous blog post, we implemented a basic HNSW index from scratch. In this post, we'll explore several optimizations to improve both build time and query time. A quick note: in the previous post, the brute-force approach for the C++ version was quite naive, so for the sake of a fair comparison, we'll be optimizing the baseline as well:

vector<int> bruteForceKNN(
    const vector<vector<float>> &data,
    const vector<float> &query,
    size_t k
) {
    vector<pair<float, int>> dists(data.size());
    for (size_t i = 0; i < data.size(); ++i) {
        float d = 0;
        for (size_t j = 0; j < query.size(); ++j)
            d += (data[i][j] - query[j]) * (data[i][j] - query[j]);
        dists[i] = {d, static_cast<int>(i)};
    }
    partial_sort(dists.begin(), dists.begin() + static_cast<ptrdiff_t>(k), dists.end());
    vector<int> result(k);
    for (size_t i = 0; i < k; ++i)
        result[i] = dists[i].second;
    return result;
}
Note

The benchmarks ran for quite some time, and it was only afterwards that I realized you could optimize this further by using std::nth_element, which reduces the complexity from O(N log N) to O(N + k log k) 🫠
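For reference, here's what that std::nth_element variant could look like. This is a sketch (the name bruteForceKNNFast is mine, not from the repo): partition the k smallest distances to the front in O(N) average time, then sort only those k.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using namespace std;

vector<int> bruteForceKNNFast(
    const vector<vector<float>> &data,
    const vector<float> &query,
    size_t k
) {
    vector<pair<float, int>> dists(data.size());
    for (size_t i = 0; i < data.size(); ++i) {
        float d = 0;
        for (size_t j = 0; j < query.size(); ++j)
            d += (data[i][j] - query[j]) * (data[i][j] - query[j]);
        dists[i] = {d, static_cast<int>(i)};
    }
    auto kth = dists.begin() + static_cast<ptrdiff_t>(k);
    // After this call, the k smallest pairs sit in [begin, kth),
    // in no particular order
    nth_element(dists.begin(), kth, dists.end());
    // Order just those k winners (skip if callers don't need sorted output)
    sort(dists.begin(), kth);
    vector<int> result(k);
    for (size_t i = 0; i < k; ++i)
        result[i] = dists[i].second;
    return result;
}
```

Note that the O(N · dim) distance loop still dominates at large N; the sort was only part of the cost.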

Alright, Show Me The Optimizations

There are two major things we aim to improve: query time and build time.

Query Time Improvements

In order to improve query times, there are a couple of areas we can focus on:

  • Distance calculation
  • Improving the searchLayer method

Optimizing Distance Calculations

To keep things simple, we'll focus on L2 distance only. The basic implementation looks like this:

inline double l2_scalar(const float *a, const float *b, std::size_t dim) {
    double sum = 0.0;
    for (std::size_t i = 0; i < dim; ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return sum;
}

(Ignore the static_casts ... I had a bunch of compiler flags enabled to enforce best practices 😅. Just read them as converting the float to a double.)


We can optimize this using SIMD. SIMD (single instruction, multiple data) is a form of parallel programming where the same operation is applied to multiple data points simultaneously.


Usually, this is implemented with vector registers (e.g., 128-bit, 256-bit, 512-bit), where each β€œlane” of the vector holds one piece of data. AVX, SSE, NEON, etc., all work this way.


In our case, all we are doing is subtracting two values and accumulating the square of their difference. So parallelizing this is fairly straightforward: from a vector of N elements, we can process (say) 4, 8, or 16 elements at once, depending on the SIMD width. When I started learning this, I picked up the 256-bit AVX instruction set (a VERY VERY small subset of it), so I'll be going with that. Here's a basic SIMD implementation of the L2 distance function:

// We need this as the instructions that we are about to use are compiler intrinsics
#include <immintrin.h>

inline double l2_avx(const float *a, const float *b, std::size_t dim) {

    // instead of double sum = 0;
    __m256 sum = _mm256_setzero_ps();

    std::size_t i = 0;
    for (; i + 8 <= dim; i += 8) {
        // a[i] but 8 elements loaded at once
        __m256 va = _mm256_loadu_ps(a + i);

        // b[i] but 8 elements loaded at once
        __m256 vb = _mm256_loadu_ps(b + i);

        // (a[i] - b[i]) but 8 elements subtracted at once
        __m256 diff = _mm256_sub_ps(va, vb);

        // ((a[i] - b[i]) ^ 2) but 8 elements squared at once
        __m256 sq = _mm256_mul_ps(diff, diff);

        // sum += (a[i] - b[i]) ^ 2
        sum = _mm256_add_ps(sum, sq);
    }

    // Horizontal sum: spill the 8 lanes of `sum` into a temporary array ...
    float tmp[8];
    _mm256_storeu_ps(tmp, sum);

    // ... and add them up
    double total = static_cast<double>(tmp[0]) + static_cast<double>(tmp[1]) +
                   static_cast<double>(tmp[2]) + static_cast<double>(tmp[3]) +
                   static_cast<double>(tmp[4]) + static_cast<double>(tmp[5]) +
                   static_cast<double>(tmp[6]) + static_cast<double>(tmp[7]);

    // Do the same operation for the remaining (dim % 8) elements, scalar style
    for (; i < dim; ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        total += d * d;
    }

    return total;
}
Note

Wherever you see __m256, mentally register it as a float arr[8]. It's nothing but a 256-bit vector of [8 x float]:

// In avxintrin.h

/// This intrinsic corresponds to the <c> VXORPS </c> instruction.
///
/// \returns A 256-bit vector of [8 x float] with all elements set to zero.
static __inline __m256 __DEFAULT_FN_ATTRS_CONSTEXPR _mm256_setzero_ps(void) {
    return __extension__ (__m256){ 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
}
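If the intrinsics still feel opaque, here is the same loop structure written in plain scalar C++ (an illustrative sketch, not code from the repo): the eight lanes of the __m256 register become an explicit array of eight partial sums, and the horizontal sum at the end collapses them.

```cpp
#include <array>
#include <cstddef>

// Plain-C++ mirror of l2_avx: `sum` is the float arr[8] that __m256 really is.
// Each lane accumulates every 8th element of the input.
inline double l2_lanes(const float *a, const float *b, std::size_t dim) {
    std::array<float, 8> sum{};   // _mm256_setzero_ps: all 8 lanes start at 0
    std::size_t i = 0;
    for (; i + 8 <= dim; i += 8)
        for (std::size_t lane = 0; lane < 8; ++lane) {  // one _mm256_* op does
            float d = a[i + lane] - b[i + lane];        // all 8 lanes at once
            sum[lane] += d * d;
        }

    // Horizontal sum: collapse the 8 lane accumulators into one scalar
    double total = 0.0;
    for (float lane : sum)
        total += static_cast<double>(lane);

    // Scalar tail for the remaining (dim % 8) elements
    for (; i < dim; ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        total += d * d;
    }
    return total;
}
```

The hardware version simply executes the inner 8-lane loop as a single instruction.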

I also wanted to port this to Arm because I'll be running my benchmarks on a Mac (at least for now). Needless to say, a bit of prompting was involved because I'm not gonna learn Arm-specific things now, c'mon 😭. Anyhow, this is how the NEON (this is what Arm uses) code for it looks:

#include <arm_neon.h>

inline double l2_neon(const float *a, const float *b, std::size_t dim) {

    float32x4_t sum = vdupq_n_f32(0.0f);
    std::size_t i = 0;
    for (; i + 4 <= dim; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);    // load 4 floats from b
        float32x4_t diff = vsubq_f32(va, vb); // 4 subtractions at once
        sum = vmlaq_f32(sum, diff, diff);     // sum += diff * diff (multiply-accumulate)
    }

    // Horizontal sum of the 4 lanes
    float tmp[4];
    vst1q_f32(tmp, sum);
    double total = static_cast<double>(tmp[0]) + static_cast<double>(tmp[1]) +
                   static_cast<double>(tmp[2]) + static_cast<double>(tmp[3]);

    // Scalar tail for the remaining (dim % 4) elements
    for (; i < dim; ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        total += d * d;
    }
    return total;
}
Note

NEON registers are 128-bit only (as indicated by float32x4_t, i += 4, etc.)

Optimizing The searchLayer Method

We won't use SIMD in this case; instead, we'll optimize on a per-query basis. After profiling my benchmarking code with dtrace and a basic flamegraph, I found that this:

vector<bool> vis;

was in the hot path. For small datasets this adds negligible overhead, but for larger datasets, resetting it on every query is slow. To optimize that, we store a persistent list of numbers instead of bools to indicate whether a node was visited in the current search call, together with a tag that is incremented on each search invocation: a node counts as visited only if its entry matches the current tag. So, something like this:

// These are defined on the HnswCPU class (so, not "global" global)
vector<uint32_t> visited;
uint32_t visitTag;

vector<int> HnswCPU::searchLayer(const float *query, int entry, int ef, int level) {
    if (visited.size() < nodes.size())
        visited.resize(nodes.size());

    visitTag++;
    ...
}

The cool thing about this is that we can use it in this call as well and reap the rewards:

vector<int> HnswCPU::selectNeighborsWithHeuristic(
    const float *query,
    const vector<int> &candidates,
    int max_neighbors,
    int layer,
    bool extendCandidates,
    bool keepPrunedConnections
) {

    if (visited.size() < nodes.size())
        visited.resize(nodes.size());

    visitTag++;
    ...
}

Doing this, of course, requires a few minor changes, such as:

{
    ...
    // Doing this instead of if (visited[nei])
    if (visited[static_cast<size_t>(nei)] == visitTag)
        continue;

    // And this instead of visited[nei] = true
    visited[static_cast<size_t>(nei)] = visitTag;
    ...
}
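Putting it together, here's a minimal self-contained sketch of the visit-tag trick (the struct and method names are my own, not the repo's). One wrinkle the snippets above skip: if the uint32_t tag ever wraps around to zero, stale entries could be misread as visited, so a full clear on wraparound keeps it correct.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Epoch-based visited set: "resetting" is O(1) (bump the tag), instead of
// the O(N) clear that a vector<bool> needs on every query.
struct VisitedSet {
    std::vector<uint32_t> visited;
    uint32_t visitTag = 0;

    void beginSearch(std::size_t numNodes) {
        if (visited.size() < numNodes)
            visited.resize(numNodes, 0);
        ++visitTag;                 // invalidates all previous marks at once
        if (visitTag == 0) {        // wraparound: clear once and restart
            std::fill(visited.begin(), visited.end(), 0u);
            visitTag = 1;
        }
    }

    // Returns true if `id` was already visited in this search,
    // and marks it visited either way
    bool checkAndMark(std::size_t id) {
        if (visited[id] == visitTag)
            return true;
        visited[id] = visitTag;
        return false;
    }
};
```

hnswlib's VisitedListPool is roughly the same idea, plus a pool so that concurrent queries each get their own list.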
Note

This approach is inspired by hnswlib. They have a slightly different approach using a VisitedListPool, but for the sake of simplicity, I just went with a plain vector<uint32_t> instead.

Build Time Improvements

This for me was a struggle. First, I dabbled with multithreading - I thought "huh, nodes ... a bunch of them, parallelly add them hehe". Python is a bit more forgiving here 😭.


The errors that I received in C++ were in a league of their own. I spun up Claude Sonnet 4.6 Opus Thinking and even that gave up. ChatGPT was wrecked when I prompt-engineered the ASan (AddressSanitizer) errors 😭😭😭.


Ohhh, there's going to be a whole blog about these AI mess-ups soon (I'll link it here once it's live). The debugging session I went through for a few days threw my weekly blog publishing schedule off by a lot.


It was at that point I realized it was better to take a look at hnswlib. Apparently, they don't rely on multithreading shenanigans. They do have mutexes and lock_guards in place, but that's only to be agnostic to the caller's environment (they expose a Python binding, so usually it's the Python environment). So, after doing my due diligence, I deleted the branch that delved into multithreading and just focused on profiling my code and optimizing from there.


Two major bottlenecks I observed were the embedding storage and the neighbor lists. Initially, I stored both in the graph's node itself:

struct Node {
    int id;
    int level;
    vector<vector<int>> neighbors;
    vector<float> embeddings;
};

This was intuitive but had a significant performance impact. This is roughly what the memory layout of neighbors looked like:

nodes
 ├─ Node
 │   ├─ vector<vector<int>>
 │   │   ├─ vector<int> (level0)
 │   │   ├─ vector<int> (level1)
 │   │   └─ vector<int> (level2)
 │
 ├─ Node
 │   ├─ vector<vector<int>>
 │
 └─ Node

This layout has many problems:

  • 3+ allocations per node
  • pointer chasing
  • poor cache locality
  • unpredictable memory access patterns

A similar issue arose for the embeddings as well. So, I moved both the embedding data and the neighbor lists to the Hnsw class itself, each stored in a single flat array with helper accessor methods, to improve cache locality:

// Lightweight wrapper over a couple of values now
struct Node {
    int level;
    int neighbor_offset;

    Node(int lvl = 0, int offset = 0) : level(lvl), neighbor_offset(offset) {}
};

class HnswCPU : public HnswIndex {
  private:

    ...

    vector<Node> nodes;
    vector<int> neighbors;
    vector<float> embeddings;

    const float *getEmbedding(int id) const {
        return &embeddings[static_cast<size_t>(id) * dim];
    }

    int *getNeighborPtr(int id, int level);
    const int *getNeighborPtr(int id, int level) const;
    int getNeighborCount(int level) const;

    ...
};
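Since getNeighborPtr is only declared above, here's a minimal sketch of one possible flat layout behind it (the struct, field names, and per-level block scheme are my illustration, not necessarily what Proxima does). Each node reserves (level + 1) fixed-size blocks of 1 + M ints: slot 0 of a block holds the neighbor count, slots 1..M hold the neighbor ids.

```cpp
#include <cstddef>
#include <vector>

using namespace std;

struct FlatNeighbors {
    int M = 0;            // max neighbors per level (assumed uniform here)
    vector<int> offsets;  // start of each node's region in `data`
    vector<int> data;     // one contiguous allocation for all nodes

    // Reserve (level + 1) blocks of (1 + M) ints for a new node
    int addNode(int level) {
        int id = static_cast<int>(offsets.size());
        offsets.push_back(static_cast<int>(data.size()));
        data.resize(data.size() +
                    static_cast<size_t>(level + 1) * static_cast<size_t>(1 + M), 0);
        return id;
    }

    // Start of the block for (id, level): plain offset arithmetic,
    // no pointer chasing
    int *block(int id, int level) {
        return data.data() + offsets[static_cast<size_t>(id)] + level * (1 + M);
    }

    void addNeighbor(int id, int level, int nei) {
        int *b = block(id, level);
        if (b[0] < M) {
            b[1 + b[0]] = nei;
            ++b[0];
        }
    }

    int getNeighborCount(int id, int level) { return block(id, level)[0]; }
    const int *getNeighborPtr(int id, int level) { return block(id, level) + 1; }
};
```

Real implementations typically give level 0 a bigger cap (hnswlib uses 2·M there), but the offset arithmetic is the same idea.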

Earlier, for a dataset of 100k vectors, the index used to take around 15 minutes to build. With these changes, it now builds in around 30-50 s, a speedup of roughly 20-30x. Of course, there are still a lot of micro-optimizations that could speed it up even more, but for the time being, I'm leaving things here.

Benchmarks

We will be benchmarking build time (s), query time (µs), and their respective percentage deltas with the following configs:

  • Scalar C++ version v/s Brute Force C++
  • SIMD C++ version v/s Brute Force C++
  • Python hnswlib v/s Numpy Brute Force

We will also be looking at Python's hnswlib vs proxima (yeah, I named it 🙃):

  • With different N, same K and Dim
  • With different K, same N and Dim
  • With different Dim, same K and N

I thought of including the build-time speedups of my first version v/s the optimized one as well, but the former never completed the benchmark script, and the 500k-dataset runs took like 30-45 minutes each. So, yeah, sorry about that 🫠

System Information

Property          Value
----------------  --------------
Operating System  macOS 14.6
Architecture      arm64
CPU               arm
CPU Cores         8 (logical: 8)
Memory            24.0 GB
Disk              460.4 GB
Python Version    3.14.0
The benchmarks are run on a Mac, so for the SIMD L2 distance we use NEON instead of AVX. Also, the tables are hugeeee, so please focus on the Speedup column alone 😅

Results

For the sake of brevity, I'm not including all the combinations of N, Dim, and K. The trend you see for the following configuration (Dim = 128, K = 10) is similar for the other combinations of (Dim, K).

If you want the full benchmark data, please check out Proxima's GitHub repo. It contains all the datapoints, comparison scripts, plotting scripts, etc. I also added a decent setup guide in the project README, so hopefully you should be able to replicate a similar set of results 😉

Scalar C++ version v/s Brute Force C++

(Dim = 128, K = 10)

N        Build (s)  Query (µs)  Brute (µs)  Speedup  Recall
-------  ---------  ----------  ----------  -------  ------
1,000    0.12       126.81      50.59       0.40x    1.00
5,000    1.15       218.55      250.59      1.15x    1.00
10,000   2.63       229.14      485.89      2.12x    1.00
50,000   15.59      251.33      2447.62     9.74x    1.00
100,000  32.13      255.48      4926.00     19.28x   1.00
500,000  172.53     264.67      24883.99    94.02x   1.00

SIMD C++ version v/s Brute Force C++

(Dim = 128, K = 10)

N        Build (s)  Query (µs)  Brute (µs)  Speedup  Recall
-------  ---------  ----------  ----------  -------  ------
1,000    0.12       131.93      50.90       0.39x    1.00
5,000    1.23       222.72      248.78      1.12x    1.00
10,000   2.67       237.73      497.96      2.09x    1.00
50,000   15.67      253.25      2471.79     9.76x    1.00
100,000  32.03      260.14      4959.32     19.06x   1.00
500,000  168.96     274.48      24865.20    90.59x   1.00

Python hnswlib v/s Numpy Brute Force

(Dim = 128, K = 10)

N        Build (s)  Query (µs)  Brute (µs)  Speedup  Recall
-------  ---------  ----------  ----------  -------  ------
1,000    0.02       18.09       86.36       4.77x    1.00
5,000    0.26       39.00       543.70      13.94x   1.00
10,000   0.67       48.25       1121.06     23.24x   1.00
50,000   5.37       86.75       6408.13     73.87x   0.96
100,000  15.29      146.87      13617.20    92.72x   0.92
500,000  131.74     237.36      80773.80    340.30x  0.65

Python hnswlib v/s Proxima (C++)

Build Time (Dim = 128, K = 10)

N        C++ Scalar Build (s)  Python Build (s)  Build Δ %
-------  --------------------  ----------------  ---------
1,000    0.1215                0.0241            418.41
5,000    1.1477                0.2553            381.69
10,000   2.6279                0.6652            301.32
50,000   15.5916               5.3677            191.95
100,000  32.1252               15.2923           109.43
500,000  172.5330              131.7432          28.25

Query Time (Dim = 128, K = 10)

N        C++ Scalar Query (µs)  C++ SIMD Query (µs)  Python Query (µs)  Query Δ %
-------  ---------------------  -------------------  -----------------  ---------
1,000    126.81                 131.93               18.09              629.35
5,000    218.55                 222.72               39.00              471.07
10,000   229.14                 237.73               48.25              392.74
50,000   251.33                 253.25               86.75              191.94
100,000  255.48                 260.14               146.87             77.12
500,000  264.67                 274.48               237.36             15.64
Note

Once we hit larger dataset sizes, our implementation's Δ % decreases and it gets closer to hnswlib. Again, there are improvements to be made, but the speedup is real 🥹

Plots For The Other Cases

Once again, all these plots are available on my GitHub repo as proper PNGs.

Build Time Comparison

Build Time Comparison (in s) (click to enlarge)

Query Time Comparison

Query Time Comparison (in µs) (click to enlarge)

Conclusion

For small datasets, a well-implemented brute-force search can already be surprisingly competitive. But as datasets grow, approximate nearest neighbor methods like HNSW become far more efficient. In my experiments, moving from a naive implementation to a more optimized one improved build times by roughly 20-30x. Most of the gains came from fixing fundamentals:

  • better memory layout
  • fewer allocations
  • improved cache locality

These changes alone reduced build time for a 100k dataset from ~15 minutes to under a minute.


One thing that surprised me was the results from SIMD acceleration. The gains were much smaller than I had expected, so clearly the query pipeline was still largely dominated by graph traversal and memory access, not by distance arithmetic.


Compared to hnswlib, my version is still slower on small datasets, but the gap shrinks as the dataset grows, which suggests the core structure is sound. In my defence, this project wasn't about beating existing libraries; it was about understanding how these ANN systems work under the hood.


Also ... C++ gives you a TON of control. As Bjarne Stroustrup (creator of C++) put it:

"C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows your whole leg off."

After finishing this project, I'm pretty sure my brain got blown off too. Totally worth it though 😭😎
