Exact nearest neighbor search (k-nearest-neighbor or KNN) is prohibitively expensive at higher dimensions, because approaches to segment the search space that work in 2D or 3D like quadtree or k-d tree devolve to linear scans at higher dimensions. This is one aspect of what is called “the curse of dimensionality.”
With larger datasets, it is almost always more useful to get an approximate answer in logarithmic time, than the exact answer in linear time. This is abbreviated as ANN (approximate nearest neighbor) search.
There are two broad categories of ANN index:
Partition-based indexes, like LSH or IVF or SCANN
Graph indexes, like HNSW or DiskANN
Graph-based indexes tend to be simpler to implement and faster, but more importantly they can be constructed and updated incrementally. This makes them a much better fit for a general-purpose index than partitioning approaches that only work on static datasets that are completely specified up front. That is why all the major commercial vector indexes use graph approaches.
JVector is a graph index in the DiskANN family tree.
|