C++ Interface
The C++ interface is the fast path: it exposes the full feature set (search, updates, filtered search, OOD, in-memory workloads) and matches the numbers reported in the papers. For best raw performance, see also the SPDK backend.
This guide walks through:
- Build configuration — pick search-only or search+update at compile time.
- Prepare datasets — download, format conversion, ground truth.
- Build the index — Vamana or PiPNN, plus the optional in-memory entry point.
- Search — basic search and search modes.
- Update (insert / delete) — concurrent insert/search/delete workloads.
- In-memory workloads — load the SSD index entirely into DRAM.
- Filtered search — attribute-constrained ANNS via speculative filtering.
- OOD search — NGFix refinement for cross-modal workloads.
Build Configuration
Two flags in CMakeLists.txt control the build profile:
| Flag | Effect |
|---|---|
| -DREAD_ONLY_TESTS | Disables update paths; higher search throughput. |
| -DNO_MAPPING | Disables the tag↔ID mapping table; required together with -DREAD_ONLY_TESTS for search-only. |
- Search-only (best search performance): enable both flags.
- Search+Update: disable both flags.
Re-run bash ./build.sh after toggling.
Prepare Datasets
1. Download. SIFT, DEEP1B, and SPACEV are available from their original sources. If the originals are unavailable, Big ANN Benchmarks mirrors them.
SPACEV1B may ship as several sub-files. Concatenate them and save the numpy array as bin:
import struct
import numpy as np

# bin format:
# | 4 bytes num_vecs | 4 bytes dim | flattened vectors |
def bin_write(vectors, filename):
    with open(filename, 'wb') as f:
        num_vecs, vector_dim = vectors.shape
        f.write(struct.pack('<i', num_vecs))
        f.write(struct.pack('<i', vector_dim))
        f.write(vectors.tobytes())

def bin_read(filename):
    # Assumes float32 data; adjust the dtype and element size for int8/uint8 files.
    with open(filename, 'rb') as f:
        num_vecs = struct.unpack('<i', f.read(4))[0]
        vector_dim = struct.unpack('<i', f.read(4))[0]
        data = f.read(num_vecs * vector_dim * 4)  # 4 bytes per float
        vectors = np.frombuffer(data, dtype=np.float32).reshape((num_vecs, vector_dim))
    return vectors
The dataset should include a ground truth file for the full set. Some datasets also include ground truth for subsets (first $k$ vectors) — e.g., SIFT100M's GT lives in idx_100M.ivecs inside the SIFT1B archive.
2. Convert format (if needed):
# convert .vecs to .bin
build/tests/utils/vecs_to_bin uint8 bigann_base.bvecs bigann.bin # for int8/uint8 vecs (SIFT)
build/tests/utils/vecs_to_bin float base.fvecs deep.bin # for float vecs (DEEP)
build/tests/utils/vecs_to_bin int32 idx_1000M.ivecs idx_1000M.ibin # for int32/uint32 vecs (groundtruth)
# Generate 100M subsets (e.g., for SIFT and DEEP).
build/tests/utils/change_pts uint8 bigann.bin 100000000 # bigann.bin -> bigann.bin100000000
mv bigann.bin100000000 bigann_100M.bin
build/tests/utils/change_pts float deep.bin 100000000
mv deep.bin100000000 deep_100M.bin
# Compute ground truth for the 100M subset (SIFT100M example).
# compute_groundtruth <type> <metric> <data> <query> <topk> <output> null null
build/tests/utils/compute_groundtruth uint8 l2 bigann_100M.bin query.bin 1000 100M_gt.bin null null
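For intuition, a ground-truth file holds, for each query, the IDs of its exact nearest base vectors. The brute-force sketch below shows that idea for the L2 metric on a toy dataset, together with the usual recall@k definition (overlap between the returned and exact top-k, divided by k). It is an illustration only, not how compute_groundtruth is implemented, and all function names are hypothetical.

```python
import heapq

def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_topk(base, query, k):
    """IDs of the k base vectors nearest to query, nearest first (ties by ID)."""
    scored = ((l2_sq(vec, query), idx) for idx, vec in enumerate(base))
    return [idx for _, idx in heapq.nsmallest(k, scored)]

def recall_at_k(found, truth, k):
    """Fraction of the exact top-k that appears in the returned top-k."""
    return len(set(found[:k]) & set(truth[:k])) / k

base = [(0, 0), (1, 0), (0, 3), (5, 5)]
gt = brute_force_topk(base, (0, 1), k=2)
print(gt)                          # exact top-2 for the query
print(recall_at_k([0, 2], gt, 2))  # an approximate answer that misses one ID
```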
Build the Index
PipeANN supports two on-disk graph builders with the same file format:
- Vamana (recommended) — DiskANN-style builder. Alpha-RNG pruning, one-by-one vector insertion.
- PiPNN (experimental) — partitions the dataset into overlapping sub-problems and leverages dense matrix multiplication kernels.
L1 * L2 should be comparable to Vamana's L.
Same command for both:
# build_disk_index <type> <data> <prefix> <R> <L_or_L1> <PQ_bytes> <M_GB> <threads> <metric> <nbr_type> [L2]
# Vamana: omit L2, or pass 0.
build/tests/build_disk_index uint8 data.bin index 96 128 32 256 112 l2 pq
# PiPNN: pass L1 in L_or_L1, and L2 as the last argument.
build/tests/build_disk_index uint8 data.bin index 96 9 32 256 112 l2 pq 10
Parameters:
| Parameter | Meaning |
|---|---|
| R | Maximum out-neighbors. |
| L_or_L1 | Vamana: build-time candidate pool L. PiPNN: L1. |
| L2 | 0 or omitted → Vamana. L2 > 0 → PiPNN. Typically L1 * L2 ≈ L. |
| PQ_bytes | Bytes per PQ vector. 32 is a good default; raise if accuracy is low. |
| M_GB | Max memory (GB). PiPNN currently ignores this budget. |
| nbr_type | pq (supports update), rabitq (1-bit, search-only), rabitq{3-5} (3–5-bit, search-only). |
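The PQ_bytes and L1 * L2 guidance above reduces to simple arithmetic. As a rough sketch (the helper names are hypothetical, and the estimate counts only the PQ codes themselves, ignoring codebooks and any other in-memory state):

```python
def pq_code_bytes(num_vecs, pq_bytes):
    """Bytes needed to hold one PQ code per vector."""
    return num_vecs * pq_bytes

def pipnn_pool_product(l1, l2):
    """L1 * L2, the quantity compared against Vamana's L."""
    return l1 * l2

print(pq_code_bytes(100_000_000, 32) / 1e9)    # GB of PQ codes for a 100M subset
print(pq_code_bytes(1_000_000_000, 32) / 1e9)  # GB of PQ codes at billion scale
print(pipnn_pool_product(9, 10))               # compare with L=128 in the Vamana example
```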
Recommended Vamana parameters:
| Dataset | Type | R | L | PQ_bytes | Memory | Threads |
|---|---|---|---|---|---|---|
| 100M subsets | uint8/float/int8 | 96 | 128 | 32 | 256GB | 112 |
| SIFT1B | uint8 | 128 | 200 | 32 | 500GB | 112 |
| SPACEV1B | int8 | 128 | 200 | 32 | 500GB | 112 |
Expect ~5h for 100M datasets and ~1d for billion-scale.
In-Memory Entry-Point Index (optional)
An in-memory index optimizes the entry point. Skip it by setting mem_L=0 at search time.
build/tests/utils/gen_random_slice uint8 data.bin index_SAMPLE_RATE_0.01 0.01
build/tests/build_memory_index uint8 index_SAMPLE_RATE_0.01_data.bin index_SAMPLE_RATE_0.01_ids.bin index_mem.index 32 64 1.2 $(nproc) l2
The output lives in two files: index_mem.index and index_mem.index.tags.
This index boosts performance for datasets of roughly 100 dimensions (SIFT, DEEP, and SPACEV) but may degrade performance for higher-dimensional datasets (e.g., Wiki).
Note
PipeANN uses the same SSD layout for the in-memory and on-SSD indexes. This layout is not compatible with the in-memory index format of DiskANN or of older PipeANN versions.
Search
# search_disk_index <type> <prefix> <threads> <beam_width> <query> <gt> <topk> <metric> <nbr_type> <mode> <mem_L> <Ls...>
build/tests/search_disk_index uint8 index_prefix 1 32 query.bin gt.bin 10 l2 pq 2 10 10 20 30 40
Search modes (mode):
| Mode | Algorithm |
|---|---|
| 0 | DiskANN best-first search. |
| 1 | Starling page-reordered search. Requires a reordered index produced by the original Starling code; align the partition file via build/tests/pad_partition. |
| 2 | PipeANN pipelined search (recommended). |
| 3 | CoroSearch: coroutine-based inter-query parallel search. |
Example output:
Search parameters: #threads: 1, beamwidth: 32
... some outputs during index loading ...
L I/O Width QPS AvgLat(us) P99 Lat Mean IOs Recall@10
=============================================================================
10 32 1871.92 512.01 939.00 23.24 67.40
20 32 1678.32 560.96 926.00 32.22 84.76
30 32 1551.03 601.63 945.00 41.19 91.13
40 32 1420.42 654.29 1007.00 50.11 94.28
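When sweeping several L values, a small script can pick the cheapest setting that meets a recall target. The sketch below parses the table printed above and returns the row with the smallest L whose Recall@10 reaches the target. It assumes the seven-column layout shown in the example output; the function names and dictionary keys are hypothetical.

```python
SAMPLE = """\
     L   I/O Width         QPS  AvgLat(us)     P99 Lat    Mean IOs   Recall@10
=============================================================================
    10          32     1871.92      512.01      939.00       23.24       67.40
    20          32     1678.32      560.96      926.00       32.22       84.76
    30          32     1551.03      601.63      945.00       41.19       91.13
    40          32     1420.42      654.29     1007.00       50.11       94.28
"""

COLUMNS = ("L", "io_width", "qps", "avg_lat_us", "p99_lat", "mean_ios", "recall")

def parse_results(text):
    """Turn the numeric rows of the search output into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # header, separator, or unrelated log line
        try:
            values = [float(x) for x in fields]
        except ValueError:
            continue  # right field count but not numeric
        rows.append(dict(zip(COLUMNS, values)))
    return rows

def smallest_l_for_recall(rows, target):
    """Row with the smallest L whose recall is at least target, or None."""
    good = [r for r in rows if r["recall"] >= target]
    return min(good, key=lambda r: r["L"]) if good else None

best = smallest_l_for_recall(parse_results(SAMPLE), 90.0)
print(best["L"], best["qps"])
```

On the sample output, a 90% target selects L=30, trading roughly 17% of the QPS measured at L=10 for the higher recall.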
Search a DiskANN Index
If you already have a DiskANN on-disk index, you can search it directly with PipeANN. Just build an in-memory entry-point index from a 1% sample first:
export INDEX_PREFIX=/mnt/nvme2/indices/bigann/100m # on-disk index filename is 100m_disk.index
export DATA_PATH=/mnt/nvme/data/bigann/100M.bbin
# Build in-memory entry point index (~10min for 1B vectors)
build/tests/utils/gen_random_slice uint8 ${DATA_PATH} ${INDEX_PREFIX}_SAMPLE_RATE_0.01 0.01
build/tests/build_memory_index uint8 ${INDEX_PREFIX}_SAMPLE_RATE_0.01_data.bin ${INDEX_PREFIX}_SAMPLE_RATE_0.01_ids.bin ${INDEX_PREFIX}_mem.index 32 64 1.2 $(nproc) l2
# Search with PipeANN
build/tests/search_disk_index uint8 ${INDEX_PREFIX} 1 32 query.bin gt.bin 10 l2 pq 2 10 10 20 30 40
Update (Insert / Delete)
Update support requires -DREAD_ONLY_TESTS and -DNO_MAPPING to be disabled in CMakeLists.txt.
1. Generate ground truths for updates.
Computing exact ground truth at every insertion step is costly. PipeANN uses a shortcut: select top-10 vectors per interval from the top-1000 (or larger) of the full dataset.
# gt_update <gt_file> <index_pts> <total_pts> <batch_pts> <topk> <output_dir> <insert_only>
# Insert 100M vectors (batch=1M) into 100M index; truth.bin contains top-1000 of the 200M dataset.
build/tests/utils/gt_update truth.bin 100000000 200000000 1000000 10 /path/to/gt 1
# Insert 100M and delete the original 100M.
build/tests/utils/gt_update truth.bin 100000000 200000000 1000000 10 /path/to/gt 0
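The shortcut can be sketched as a filter over the precomputed ranking: keep the first k IDs that are currently in the index. The sketch assumes that vector IDs equal positions in the data file, that vectors are inserted in file order, and that the live set at each step is a contiguous ID range. These are assumptions made for illustration, not a description of gt_update's internals, and the function names are hypothetical.

```python
def live_range(index_pts, batch_pts, step, insert_only):
    """Half-open ID range [lo, hi) assumed to be in the index after `step` batches."""
    hi = index_pts + step * batch_pts
    lo = 0 if insert_only else step * batch_pts  # sliding window drops the oldest IDs
    return lo, hi

def interval_ground_truth(full_ranking, lo, hi, k):
    """First k IDs of the full-dataset ranking that fall inside [lo, hi)."""
    live = [vid for vid in full_ranking if lo <= vid < hi]
    return live[:k]

# Toy ranking over a 12-vector dataset, nearest first.
ranking = [7, 2, 9, 4, 11, 0]

lo, hi = live_range(index_pts=4, batch_pts=4, step=1, insert_only=True)
insert_gt = interval_ground_truth(ranking, lo, hi, k=3)

lo, hi = live_range(index_pts=4, batch_pts=4, step=1, insert_only=False)
window_gt = interval_ground_truth(ranking, lo, hi, k=3)

print(insert_gt, window_gt)
```

In the sliding-window case the toy ranking yields fewer than k live IDs, which is why the full-dataset ground truth needs to be deep (top-1000 or larger) relative to the top-10 being extracted.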
2. Search-insert workload (test_insert_search). Inserts vectors while concurrently searching.
# test_insert_search <type> <data> <L_disk> <step_size> <steps> <ins_thds> <srch_thds> <mode> ...
build/tests/test_insert_search uint8 data_200M.bin 128 1000000 100 10 32 2 index_prefix query.bin /path/to/gt 0 10 4 32 10 20 30 40 50
3. Search-insert-delete workload (overall_performance). Sliding window — inserts new and deletes old.
# overall_performance <type> <data> <L_disk> <index> <query> <gt> <recall> <beam> <steps> <Ls...>
build/tests/overall_performance uint8 data_200M.bin 128 index_prefix query.bin /path/to/gt 10 4 100 20 30
Notes:
- The index is not crash-consistent during updates; call save() for consistent snapshots.
- PipeSearch is used for both search and insert. Defaults: W=8 for insert, W=32 for search.
- The in-memory entry-point index is immutable during updates but still useful for entry-point optimization.
In-Memory Workloads
PipeANN can load the entire SSD index into DRAM as an in-memory baseline (e.g., for comparison against Vamana).
Search-only (search_disk_index_mem). Same CLI as search_disk_index, but loads the index into RAM first.
build/tests/search_disk_index_mem uint8 index_prefix 1 32 query.bin gt.bin 10 l2 pq 2 10 10 20 30 40
Search-insert-delete (overall_perf_mem). Same CLI as overall_performance, in-memory.
build/tests/overall_perf_mem uint8 data_200M.bin 128 index_prefix query.bin /path/to/gt 10 4 100 20 30
Filtered Search
PipeANN supports filtered ANNS with arbitrary attribute constraints via speculative filtering. The approach is both memory-efficient (only lightweight probabilistic filters live in RAM, not full attributes) and high-performance.
How it works. Speculative filtering explores a superset of valid vectors using in-memory probabilistic structures (Bloom filters, quantized values). Once a candidate set is found, exact attribute verification runs against full attributes stored alongside vectors on SSD. A cost model routes each query to the best strategy (speculative pre-filter / speculative in-filter / post-filter).
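The two-stage idea can be illustrated with a toy Bloom filter per vector: the in-memory filters admit a superset of the valid vectors (false positives are possible, false negatives are not), and exact verification against the full attributes removes the extras. This is a minimal sketch for a label-AND query. It omits the graph traversal and the cost model, and none of the class or variable names correspond to PipeANN's actual data structures.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Full attributes: stand-in for what is stored next to each vector on SSD.
attributes = {
    0: {"cat", "outdoor"},
    1: {"cat"},
    2: {"dog", "outdoor"},
    3: {"cat", "outdoor", "night"},
}

# Lightweight filters: stand-in for what is kept in RAM.
filters = {}
for vid, labels in attributes.items():
    bloom = BloomFilter()
    for label in labels:
        bloom.add(label)
    filters[vid] = bloom

query_labels = {"cat", "outdoor"}

# Stage 1 (speculative): superset of valid vectors, using only the filters.
candidates = sorted(
    vid for vid, bloom in filters.items()
    if all(bloom.might_contain(label) for label in query_labels)
)

# Stage 2 (exact): verify against the full attributes.
verified = [vid for vid in candidates if query_labels <= attributes[vid]]
print(candidates, verified)
```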
Supported attributes. Label filtering (OR/AND) and range filtering [l, r), plus their Boolean combinations (AND/OR/NOT). Custom attribute types can be added by implementing AttrIndex and Selector.
Example: YFCC10M LabelAnd
From the NeurIPS'23 BigANN benchmark. Dataset: 10M 192-dim uint8 vectors, each with 1–1517 labels. Query: find vectors containing all query labels.
1. Build the filtered index:
# build_disk_index_filtered <type> <data> <prefix> <R> <R_dense> <L> <PQ_bytes> <M_GB> <threads>
# <metric> <nbr_type> <label_type_1> <label_file_1> ...
build/tests/build_disk_index_filtered uint8 base.10M.u8bin yfcc10M 48 1500 72 64 500 112 l2 pq label_spmat base.metadata.10M.spmat
Output:
1. The graph index on SSD — each record stores [vector | neighbors | attributes | 2-hop neighbors] (attributes are used for exact verification).
2. Separate attribute index files on SSD (e.g., yfcc10M.label.0 is the inverted label index), used for pre-filter scans. Only PQ-compressed vectors and lightweight probabilistic filters live in memory.
2. Configure the query. Create a JSON config that specifies base attribute indexes and the query selector:
{
"base": [
{ "key": 0, "type": "label", "file": "yfcc10M.label.0" }
],
"query": {
"key": 0, "base_key": 0, "type": "label_and",
"file": "query.metadata.public.100K.spmat"
}
}
3. Search with filter:
# search_disk_index_filtered <type> <prefix> <threads> <beamwidth> <query> <gt> <topk>
# <metric> <nbr_type> <config.json> <mem_L> <Ls...>
build/tests/search_disk_index_filtered uint8 yfcc10M 32 32 query.public.100K.u8bin GT.public.ibin 10 l2 pq config.json 0 10 15 20 30 40 50
Abbreviated result on YFCC10M (LabelAnd):
L BW QPS Avg(us) #Pre #In EstIO #Post AvgIO Recall@10
====================================================
10 32 11355.8 2746.0 41738.0 58262.0 174.6 0.0 70.7 74.3
20 32 9605.9 3273.3 49604.0 50396.0 251.4 0.0 93.5 89.8
30 32 8093.9 3897.8 54081.0 45919.0 313.3 0.0 118.5 93.9
50 32 6144.2 5151.0 59951.0 40049.0 405.0 0.0 171.0 97.3
JSON Config Reference
The config has two top-level keys: base (attribute indexes) and query (selector tree).
| Key | Location | Type | Description |
|---|---|---|---|
| base | Root | array | Base attribute index definitions built at index time. |
| query | Root | object | Query selector tree: either a leaf or a Boolean node. |
base array item:
| Field | Type | Description |
|---|---|---|
| key | uint32 | Unique identifier; referenced by base_key in the query selector. |
| type | string | "label" (inverted label index) or "range" (numeric range index). |
| file | string | Path to the attribute index file (output of build_disk_index_filtered). |
query leaf selector (label / label_and / range):
| Field | Type | Description |
|---|---|---|
| type | string | "label" (OR semantics), "label_and" (AND semantics), or "range" ([l, r)). |
| key | uint32 | Query attribute key. |
| base_key | uint32 | References the key of the corresponding base entry. |
| file | string | Path to the query attribute file (.spmat). |
query Boolean selector (and / or / not):
| Field | Type | Description |
|---|---|---|
| type | string | "and", "or", "not". |
| children | array | Child selectors (leaf or Boolean). |
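A config can be sanity-checked before launching a long search run. The sketch below walks the selector tree and reports leaves whose base_key has no matching base entry, Boolean nodes without children, and unknown types. It checks only what the tables above state; PipeANN's own parser may accept or reject configs differently, and `validate_config` is a hypothetical helper.

```python
import json

LEAF_TYPES = {"label", "label_and", "range"}
BOOL_TYPES = {"and", "or", "not"}

def validate_config(config):
    """Return a list of problems; an empty list means the structure is consistent."""
    problems = []
    base_keys = {entry["key"] for entry in config.get("base", [])}

    def visit(node, path):
        node_type = node.get("type")
        if node_type in LEAF_TYPES:
            if node.get("base_key") not in base_keys:
                problems.append(
                    f"{path}: base_key {node.get('base_key')!r} has no base entry")
        elif node_type in BOOL_TYPES:
            children = node.get("children", [])
            if not children:
                problems.append(f"{path}: Boolean node has no children")
            for i, child in enumerate(children):
                visit(child, f"{path}.children[{i}]")
        else:
            problems.append(f"{path}: unknown type {node_type!r}")

    visit(config.get("query", {}), "query")
    return problems

good = json.loads("""
{
  "base": [{ "key": 0, "type": "label", "file": "yfcc10M.label.0" }],
  "query": { "key": 0, "base_key": 0, "type": "label_and",
             "file": "query.metadata.public.100K.spmat" }
}
""")
bad = json.loads(json.dumps(good))
bad["query"]["base_key"] = 5  # points at a base entry that does not exist

print(validate_config(good))
print(validate_config(bad))
```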
Complex example — LabelOr OR Range:
{
"base": [
{ "key": 0, "type": "label", "file": "100M.label.0" },
{ "key": 1, "type": "range", "file": "100M.label.1" }
],
"query": {
"type": "or",
"children": [
{ "key": 0, "base_key": 0, "type": "label", "file": "metadata_query.spmat" },
{ "key": 1, "base_key": 1, "type": "range", "file": "metadata_width_query.spmat" }
]
}
}
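The selector semantics can be made concrete with a small evaluator that decides whether one vector's attributes satisfy a selector tree. The sketch keeps query values inline instead of reading .spmat files, so the `labels`, `range`, and `value` fields are hypothetical stand-ins. It also assumes that a "not" node has exactly one child.

```python
def matches(selector, attrs):
    """Return True if a vector's attributes satisfy the selector tree."""
    kind = selector["type"]
    if kind == "label":       # OR semantics: any query label present
        return bool(attrs["labels"] & selector["labels"])
    if kind == "label_and":   # AND semantics: all query labels present
        return selector["labels"] <= attrs["labels"]
    if kind == "range":       # half-open interval [l, r)
        low, high = selector["range"]
        return low <= attrs["value"] < high
    results = [matches(child, attrs) for child in selector["children"]]
    if kind == "and":
        return all(results)
    if kind == "or":
        return any(results)
    if kind == "not":
        return not results[0]
    raise ValueError(f"unknown selector type: {kind!r}")

# LabelOr OR Range, mirroring the shape of the config above.
query = {
    "type": "or",
    "children": [
        {"type": "label", "labels": {3, 7}},
        {"type": "range", "range": (100, 200)},
    ],
}

vec_a = {"labels": {7}, "value": 50}   # matches through the label branch
vec_b = {"labels": {1}, "value": 200}  # 200 is outside [100, 200)
vec_c = {"labels": {1}, "value": 100}  # 100 is inside [100, 200)

print(matches(query, vec_a), matches(query, vec_b), matches(query, vec_c))
```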
Updates are supported, but attribute index updates are currently sub-optimal (in-memory only).
Out-of-Distribution (OOD) Search
For OOD workloads — queries and base vectors come from different distributions (e.g., text queries against image embeddings) — PipeANN supports NGFix refinement. A fraction of each node's out-edges (R_ood) is replaced with "refine" edges selected from real training-query traversals. Total out-degree stays R = R_base + R_ood, so disk layout, memory footprint, and the search algorithm are unchanged — only graph topology differs.
When to use. Queries are noticeably OOD (text-to-image, cross-modal retrieval, multi-modal embeddings such as LAION). For in-distribution workloads, the extra build time is not worth it.
1. Prepare training queries. A .bin file in the same dtype and dimension as the base vectors. NGFix recommends a training set comparable in size to the base set. Held-out historical queries work best; public OOD benchmarks (Text-to-Image, LAION) ship with dedicated query.train.* files.
2. Build with OOD refinement. Example on Text-to-Image 10M (200-dim float IP), using the 50M learn-query split as training data:
# build_disk_index <type> <data> <prefix> <R> <L> <PQ_bytes> <M> <T> <metric> <nbr_type>
# [L2] [train_query_path] [R_ood] [L_ood]
#
# R is the total out-degree (R_base + R_ood). R_ood is the number of refine edges per node.
# L_ood is the beam width used when computing AKNN for each training query (default 1500).
build/tests/build_disk_index float \
/mnt/nvme/data/text2image/10M.bin \
/mnt/nvme/indices/text2image/10M \
96 128 64 256 112 mips pq 0 \
/mnt/nvme/data/text2image/query.learn.50M.fbin 48 1500
Recommended parameters:
| Field | Recommendation |
|---|---|
| R_ood | R / 2 (e.g., R=96 → R_ood=48). Must be < R. |
| L_ood | 1500 (default). Larger → more accurate refine AKNN but slower build. |
| train_query_path | Comparable size, same dtype/dim as base. |
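The degree budget follows directly from R = R_base + R_ood. A minimal helper, assuming the R / 2 recommendation as the default and enforcing only the stated R_ood < R constraint (the function name is hypothetical):

```python
def split_out_degree(r_total, r_ood=None):
    """Return (R_base, R_ood) for a total out-degree R."""
    if r_ood is None:
        r_ood = r_total // 2  # recommended default: half of R
    if not 0 <= r_ood < r_total:
        raise ValueError(f"R_ood={r_ood} must satisfy 0 <= R_ood < R={r_total}")
    return r_total - r_ood, r_ood

print(split_out_degree(96))       # the recommended split for R=96
print(split_out_degree(128, 32))  # a custom split that keeps more base edges
```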
3. Search. OOD metadata is embedded in the graph, so search uses the same command as a regular index:
build/tests/search_disk_index float /mnt/nvme/indices/text2image/10M \
1 32 /mnt/nvme/data/text2image/query.public.100K.fbin \
/mnt/nvme/data/text2image/t2i_new_groundtruth.public.100K.bin \
10 mips pq 2 10 10 20 30 40
Tip
NGFix is compatible with filtered search — combine R_ood with range_dense / attribute indexes as needed.