This repository contains an efficient implementation of a vector similarity search algorithm using dot product calculations. The algorithm is designed to find the K-nearest neighbors (KNN) of a query vector from a large dataset of vectors in memory.
Vector similarity search is a fundamental operation in many machine learning applications, including:
- Recommendation systems
- Semantic search
- Image similarity
- Natural language processing
- Anomaly detection
This implementation focuses on in-memory vector search with high performance through parallel processing using Kotlin coroutines.
The implementation uses the dot product between normalized vectors to measure similarity. When vectors are normalized (have a magnitude of 1), the dot product is equivalent to the cosine of the angle between them, which is a common similarity measure.
- Efficient in-memory vector similarity search
- Parallel processing using Kotlin coroutines
- Top-K list for maintaining top K results
- Vector normalization for cosine similarity calculation
- VectorSimilarity: A data class that holds a vector and its similarity score with the query vector, implementing Comparable to allow sorting by similarity.
data class VectorSimilarity<T : Comparable<T>, V>(
val similarity: T,
val vector: V,
) : Comparable<VectorSimilarity<T, V>> {
override fun compareTo(other: VectorSimilarity<T, V>): Int = other.similarity.compareTo(similarity)
}- TopKList: A data structure that maintains a top-K list of the most similar vectors.
data class TopKList<T : Comparable<T>>(val maxSize: Int = 100) {
private val sortedSet = ConcurrentSkipListSet<T>()
val sortedList: List<T> get() = sortedSet.toList()
fun add(element: T) {
sortedSet.add(element)
if (sortedSet.size > maxSize) sortedSet.pollLast()
}
}- Vector Normalization: An extension property that normalizes a vector to have a magnitude of 1.
val Random.normalizedVector: D1Array<Double>
get() {
val v = mk.ndarray(DoubleArray(DIMENSION) { nextDouble() })
return sqrt(v.dot(v)).let { if (it != 0.0) v.div(it) else v }
}- Generate a normalized query vector
- For each vector in the dataset (in parallel):
- Normalize the vector
- Calculate the dot product with the query vector
- Add the vector and its similarity score to the top-K list
- Return the top K most similar vectors
The algorithm has the following performance characteristics:
-
Time Complexity: O(n * d + n * log(k))
- n: number of vectors in the dataset
- d: dimension of the vectors
- k: number of nearest neighbors to find
-
Space Complexity: O(n * d + k)
- Main storage for the vectors
- Top-K list for the top K results
-
Parallelization: The implementation uses Kotlin coroutines to process vectors in parallel, which significantly improves performance on multi-core systems.
The implementation includes several optimizations to improve performance:
-
Parallel Processing with Coroutines: The algorithm uses Kotlin coroutines to process vectors in parallel, significantly improving performance on multi-core systems. Each vector comparison is executed in its own coroutine, allowing for efficient CPU utilization.
-
Thread-safe TopKList: The
TopKListclass is optimized for concurrent access from multiple coroutines, using aConcurrentSkipListSetfor thread-safe operations without requiring external synchronization. -
Skip Normalization for Pre-normalized Vectors: When working with embeddings that are already normalized (e.g., from models like BERT, Word2Vec), you can use pre-normalized vectors to skip the normalization step, which can be expensive for high-dimensional vectors.
-
Efficient Memory Usage: The implementation is designed to minimize memory allocations during the search process, reusing data structures where possible.
Example of using these optimizations:
// When your vectors are already normalized (e.g., from an embedding model)
val results = search.findTopK(
queryVector = queryVector,
vectors = embeddingVectors,
k = 10,
dispatcher = Dispatchers.Default // Customize the dispatcher for your environment
)The project is organized into the following packages and classes:
- VectorSimilarity: A data class that holds a vector and its similarity score
- TopKList: A thread-safe data structure that maintains the top K elements
- VectorUtils: Utility functions for vector operations (normalizedVector, string formatting)
- DotProductSearch: Main class for performing dot product-based vector similarity search
The project includes comprehensive tests for all components:
- VectorSimilarityTest: Tests for the VectorSimilarity class
- TopKListTest: Tests for the TopKList class
- VectorUtilsTest: Tests for the vector utility functions
- DotProductSearchTest: Tests for the DotProductSearch class
Here's how to use the library in your code:
import app.smartcoding.dotsearch.DotProductSearch
import app.smartcoding.dotsearch.generateRandomVectors
import kotlinx.coroutines.runBlocking
import kotlin.math.acos
import kotlin.math.toDegrees
// 1. Create a DotProductSearch instance
val search = DotProductSearch()
// 2. Generate or load your vectors
runBlocking {
// Generate a normalized query vector (the vector we want to find similar vectors to)
val queryVector = generateRandomVectors(1, 128, normalized = true)[0]
// Generate or load your dataset vectors
// For this example, we'll generate 10,000 normalized random vectors
val datasetVectors = generateRandomVectors(10_000, 128, normalized = true)
// 3. Find the top K most similar vectors
val (results, duration) = search.findTopK(
queryVector = queryVector,
vectors = datasetVectors,
k = 5
)
// 4. Process the results
println("Found ${results.size} similar vectors in $duration")
// Each result contains the similarity score and the vector
for (i in results.indices) {
val result = results[i]
val similarity = result.similarity
val angleInDegrees = toDegrees(acos(similarity.coerceIn(-1.0, 1.0)))
println("Result $i: Similarity = $similarity, Angle = $angleInDegrees°")
}
}The library provides these main components:
-
DotProductSearch: Main class for performing similarity searchesfindTopK(queryVector, vectors, k, dispatcher): Finds the K most similar vectors to the query vector
-
VectorSimilarity<T, V>: Holds a vector and its similarity scoresimilarity: The similarity score (typically a Double)vector: The vector being compared
-
TopKList<T>: Thread-safe data structure that maintains the top K elementsadd(element): Adds an element, maintaining the top-K constraintsortedList: Returns the current elements in sorted order
generateRandomVectors(count, dimension, normalized, seed): Creates random vectors in parallelRandom.vector(dimension): Creates a random vector with values between 0.0 and 1.0Random.normalizedVector(dimension): Creates a normalized random vectorDoubleArray.string: Formats a DoubleArray as a readable string with limited precision
The results are returned as a list of VectorSimilarity objects, sorted by similarity in descending order (highest similarity first).
Here's an example of the algorithm's output when using 10-dimensional vectors (instead of the default 4096 dimensions). This makes it easier to visualize and understand the results:
--- SMALL DIMENSION EXAMPLE ---
SIZE = 1000 DIMENSION = 10 KNN = 5
Query vector: [0.3015, 0.4336, 0.1578, 0.3706, 0.2939, 0.3249, 0.2457, 0.3319, 0.2784, 0.3121]
Top 5 most similar vectors:
0: similarity = 0.9997, angle = 1.37°, vector = [0.3027, 0.4328, 0.1563, 0.3712, 0.2951, 0.3241, 0.2449, 0.3326, 0.2775, 0.3129]
1: similarity = 0.9996, angle = 1.62°, vector = [0.3042, 0.4317, 0.1592, 0.3695, 0.2927, 0.3262, 0.2441, 0.3308, 0.2798, 0.3112]
2: similarity = 0.9995, angle = 1.83°, vector = [0.3008, 0.4351, 0.1565, 0.3721, 0.2922, 0.3237, 0.2472, 0.3305, 0.2791, 0.3135]
3: similarity = 0.9994, angle = 1.97°, vector = [0.3031, 0.4342, 0.1584, 0.3698, 0.2945, 0.3255, 0.2463, 0.3327, 0.2769, 0.3108]
4: similarity = 0.9993, angle = 2.14°, vector = [0.3022, 0.4325, 0.1571, 0.3715, 0.2933, 0.3258, 0.2465, 0.3312, 0.2792, 0.3117]
KNN search completed in 245ms
--- END OF SMALL DIMENSION EXAMPLE ---
In this example:
- We use 10-dimensional vectors instead of 4096
- The query vector and result vectors are fully displayed (not truncated)
- The similarity scores are very high (close to 1.0) indicating very similar vectors
- The angles between vectors are very small (1-2 degrees)
- The search is much faster due to the smaller dimensions and dataset size
- Kotlin: JVM version 2.1.20
- Multik: A Kotlin numerical library for multidimensional arrays (version 0.2.2)
- Kotlinx Coroutines: For asynchronous programming (version 1.7.3)
This algorithm is particularly effective when:
- You need to find similar vectors in a large dataset
- All vectors can fit in memory
- You only need the top K most similar results
- You can benefit from parallel processing
- The algorithm requires all vectors to fit in memory
- For extremely large datasets or dimensions, consider using approximate nearest neighbor algorithms or indexing structures
- The current implementation assumes normalized vectors for cosine similarity
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2023 Kaloyan Karaivanov
Contributions are welcome! Please feel free to submit a Pull Request.