Skip to content

karloti/dotProductSearch

Repository files navigation

Vector Similarity Search with Dot Product

This repository contains an efficient implementation of a vector similarity search algorithm using dot product calculations. The algorithm is designed to find the K-nearest neighbors (KNN) of a query vector from a large dataset of vectors in memory.

Overview

Vector similarity search is a fundamental operation in many machine learning applications, including:

  • Recommendation systems
  • Semantic search
  • Image similarity
  • Natural language processing
  • Anomaly detection

This implementation focuses on in-memory vector search with high performance through parallel processing using Kotlin coroutines.

Algorithm Overview

The implementation uses the dot product between normalized vectors to measure similarity. When vectors are normalized (have a magnitude of 1), the dot product is equivalent to the cosine of the angle between them, which is a common similarity measure.

Key Features

  • Efficient in-memory vector similarity search
  • Parallel processing using Kotlin coroutines
  • Top-K list for maintaining top K results
  • Vector normalization for cosine similarity calculation

Implementation Details

Core Components

  1. VectorSimilarity: A data class that holds a vector and its similarity score with the query vector, implementing Comparable to allow sorting by similarity.
data class VectorSimilarity<T : Comparable<T>, V>(
    val similarity: T,
    val vector: V,
) : Comparable<VectorSimilarity<T, V>> {
    override fun compareTo(other: VectorSimilarity<T, V>): Int = other.similarity.compareTo(similarity)
}
  1. TopKList: A data structure that maintains a top-K list of the most similar vectors.
data class TopKList<T : Comparable<T>>(val maxSize: Int = 100) {
    private val sortedSet = ConcurrentSkipListSet<T>()
    val sortedList: List<T> get() = sortedSet.toList()
    fun add(element: T) {
        sortedSet.add(element)
        if (sortedSet.size > maxSize) sortedSet.pollLast()
    }
}
  1. Vector Normalization: An extension property that normalizes a vector to have a magnitude of 1.
val Random.normalizedVector: D1Array<Double>
    get() {
        val v = mk.ndarray(DoubleArray(DIMENSION) { nextDouble() })
        return sqrt(v.dot(v)).let { if (it != 0.0) v.div(it) else v }
    }

Algorithm Steps

  1. Generate a normalized query vector
  2. For each vector in the dataset (in parallel):
    • Normalize the vector
    • Calculate the dot product with the query vector
    • Add the vector and its similarity score to the top-K list
  3. Return the top K most similar vectors

Performance Characteristics

The algorithm has the following performance characteristics:

  • Time Complexity: O(n * d + n * log(k))

    • n: number of vectors in the dataset
    • d: dimension of the vectors
    • k: number of nearest neighbors to find
  • Space Complexity: O(n * d + k)

    • Main storage for the vectors
    • Top-K list for the top K results
  • Parallelization: The implementation uses Kotlin coroutines to process vectors in parallel, which significantly improves performance on multi-core systems.

Performance Optimizations

The implementation includes several optimizations to improve performance:

  1. Parallel Processing with Coroutines: The algorithm uses Kotlin coroutines to process vectors in parallel, significantly improving performance on multi-core systems. Each vector comparison is executed in its own coroutine, allowing for efficient CPU utilization.

  2. Thread-safe TopKList: The TopKList class is optimized for concurrent access from multiple coroutines, using a ConcurrentSkipListSet for thread-safe operations without requiring external synchronization.

  3. Skip Normalization for Pre-normalized Vectors: When working with embeddings that are already normalized (e.g., from models like BERT, Word2Vec), you can use pre-normalized vectors to skip the normalization step, which can be expensive for high-dimensional vectors.

  4. Efficient Memory Usage: The implementation is designed to minimize memory allocations during the search process, reusing data structures where possible.

Example of using these optimizations:

// When your vectors are already normalized (e.g., from an embedding model)
val results = search.findTopK(
    queryVector = queryVector,
    vectors = embeddingVectors,
    k = 10,
    dispatcher = Dispatchers.Default // Customize the dispatcher for your environment
)

Project Structure

The project is organized into the following packages and classes:

Package: app.smartcoding.dotsearch

  • VectorSimilarity: A data class that holds a vector and its similarity score
  • TopKList: A thread-safe data structure that maintains the top K elements
  • VectorUtils: Utility functions for vector operations (normalizedVector, string formatting)
  • DotProductSearch: Main class for performing dot product-based vector similarity search

Tests

The project includes comprehensive tests for all components:

  • VectorSimilarityTest: Tests for the VectorSimilarity class
  • TopKListTest: Tests for the TopKList class
  • VectorUtilsTest: Tests for the vector utility functions
  • DotProductSearchTest: Tests for the DotProductSearch class

Usage Example

Here's how to use the library in your code:

import app.smartcoding.dotsearch.DotProductSearch
import app.smartcoding.dotsearch.generateRandomVectors
import kotlinx.coroutines.runBlocking
import kotlin.math.acos
import kotlin.math.toDegrees

// 1. Create a DotProductSearch instance
val search = DotProductSearch()

// 2. Generate or load your vectors
runBlocking {
    // Generate a normalized query vector (the vector we want to find similar vectors to)
    val queryVector = generateRandomVectors(1, 128, normalized = true)[0]

    // Generate or load your dataset vectors
    // For this example, we'll generate 10,000 normalized random vectors
    val datasetVectors = generateRandomVectors(10_000, 128, normalized = true)

    // 3. Find the top K most similar vectors
    val (results, duration) = search.findTopK(
        queryVector = queryVector,
        vectors = datasetVectors,
        k = 5
    )

    // 4. Process the results
    println("Found ${results.size} similar vectors in $duration")

    // Each result contains the similarity score and the vector
    for (i in results.indices) {
        val result = results[i]
        val similarity = result.similarity
        val angleInDegrees = toDegrees(acos(similarity.coerceIn(-1.0, 1.0)))
        println("Result $i: Similarity = $similarity, Angle = $angleInDegrees°")
    }
}

Key Components

The library provides these main components:

Classes

  • DotProductSearch: Main class for performing similarity searches

    • findTopK(queryVector, vectors, k, dispatcher): Finds the K most similar vectors to the query vector
  • VectorSimilarity<T, V>: Holds a vector and its similarity score

    • similarity: The similarity score (typically a Double)
    • vector: The vector being compared
  • TopKList<T>: Thread-safe data structure that maintains the top K elements

    • add(element): Adds an element, maintaining the top-K constraint
    • sortedList: Returns the current elements in sorted order

Utility Functions

  • generateRandomVectors(count, dimension, normalized, seed): Creates random vectors in parallel
  • Random.vector(dimension): Creates a random vector with values between 0.0 and 1.0
  • Random.normalizedVector(dimension): Creates a normalized random vector
  • DoubleArray.string: Formats a DoubleArray as a readable string with limited precision

The results are returned as a list of VectorSimilarity objects, sorted by similarity in descending order (highest similarity first).

Example Output with Small Dimensions

Here's an example of the algorithm's output when using 10-dimensional vectors (instead of the default 4096 dimensions). This makes it easier to visualize and understand the results:

--- SMALL DIMENSION EXAMPLE ---
SIZE = 1000 DIMENSION = 10 KNN = 5

Query vector: [0.3015, 0.4336, 0.1578, 0.3706, 0.2939, 0.3249, 0.2457, 0.3319, 0.2784, 0.3121]

Top 5 most similar vectors:
0: similarity = 0.9997, angle = 1.37°, vector = [0.3027, 0.4328, 0.1563, 0.3712, 0.2951, 0.3241, 0.2449, 0.3326, 0.2775, 0.3129]
1: similarity = 0.9996, angle = 1.62°, vector = [0.3042, 0.4317, 0.1592, 0.3695, 0.2927, 0.3262, 0.2441, 0.3308, 0.2798, 0.3112]
2: similarity = 0.9995, angle = 1.83°, vector = [0.3008, 0.4351, 0.1565, 0.3721, 0.2922, 0.3237, 0.2472, 0.3305, 0.2791, 0.3135]
3: similarity = 0.9994, angle = 1.97°, vector = [0.3031, 0.4342, 0.1584, 0.3698, 0.2945, 0.3255, 0.2463, 0.3327, 0.2769, 0.3108]
4: similarity = 0.9993, angle = 2.14°, vector = [0.3022, 0.4325, 0.1571, 0.3715, 0.2933, 0.3258, 0.2465, 0.3312, 0.2792, 0.3117]

KNN search completed in 245ms
--- END OF SMALL DIMENSION EXAMPLE ---

In this example:

  • We use 10-dimensional vectors instead of 4096
  • The query vector and result vectors are fully displayed (not truncated)
  • The similarity scores are very high (close to 1.0) indicating very similar vectors
  • The angles between vectors are very small (1-2 degrees)
  • The search is much faster due to the smaller dimensions and dataset size

Dependencies

  • Kotlin: JVM version 2.1.20
  • Multik: A Kotlin numerical library for multidimensional arrays (version 0.2.2)
  • Kotlinx Coroutines: For asynchronous programming (version 1.7.3)

When to Use This Algorithm

This algorithm is particularly effective when:

  1. You need to find similar vectors in a large dataset
  2. All vectors can fit in memory
  3. You only need the top K most similar results
  4. You can benefit from parallel processing

Limitations

  • The algorithm requires all vectors to fit in memory
  • For extremely large datasets or dimensions, consider using approximate nearest neighbor algorithms or indexing structures
  • The current implementation assumes normalized vectors for cosine similarity

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2023 Kaloyan Karaivanov

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

About

In-memory vector similarity search (KNN) via dot product, accelerated by Kotlin coroutines. Features normalization & thread-safe Top-K. Built with Kotlin, Multik, Coroutines.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages