Vector Similarity Search with Dot Product

This repository contains an efficient implementation of a vector similarity search algorithm using dot product calculations. The algorithm is designed to find the K-nearest neighbors (KNN) of a query vector from a large dataset of vectors in memory.

Overview

Vector similarity search is a fundamental operation in many machine learning applications, including:

Recommendation systems
Semantic search
Image similarity
Natural language processing
Anomaly detection

This implementation focuses on in-memory vector search with high performance through parallel processing using Kotlin coroutines.

Algorithm Overview

The implementation uses the dot product between normalized vectors to measure similarity. When vectors are normalized (have a magnitude of 1), the dot product is equivalent to the cosine of the angle between them, which is a common similarity measure.

Key Features

Efficient in-memory vector similarity search
Parallel processing using Kotlin coroutines
Top-K list for maintaining top K results
Vector normalization for cosine similarity calculation

Implementation Details

Core Components

VectorSimilarity: A data class that holds a vector and its similarity score with the query vector, implementing Comparable to allow sorting by similarity.

data class VectorSimilarity<T : Comparable<T>, V>(
    val similarity: T,
    val vector: V,
) : Comparable<VectorSimilarity<T, V>> {
    override fun compareTo(other: VectorSimilarity<T, V>): Int = other.similarity.compareTo(similarity)
}

TopKList: A data structure that maintains a top-K list of the most similar vectors.

data class TopKList<T : Comparable<T>>(val maxSize: Int = 100) {
    private val sortedSet = ConcurrentSkipListSet<T>()
    val sortedList: List<T> get() = sortedSet.toList()
    fun add(element: T) {
        sortedSet.add(element)
        if (sortedSet.size > maxSize) sortedSet.pollLast()
    }
}

Vector Normalization: An extension property that normalizes a vector to have a magnitude of 1.

val Random.normalizedVector: D1Array<Double>
    get() {
        val v = mk.ndarray(DoubleArray(DIMENSION) { nextDouble() })
        return sqrt(v.dot(v)).let { if (it != 0.0) v.div(it) else v }
    }

Algorithm Steps

Generate a normalized query vector
For each vector in the dataset (in parallel):
- Normalize the vector
- Calculate the dot product with the query vector
- Add the vector and its similarity score to the top-K list
Return the top K most similar vectors

Performance Characteristics

The algorithm has the following performance characteristics:

Time Complexity: O(n * d + n * log(k))
- n: number of vectors in the dataset
- d: dimension of the vectors
- k: number of nearest neighbors to find
Space Complexity: O(n * d + k)
- Main storage for the vectors
- Top-K list for the top K results
Parallelization: The implementation uses Kotlin coroutines to process vectors in parallel, which significantly improves performance on multi-core systems.

Performance Optimizations

The implementation includes several optimizations to improve performance:

Parallel Processing with Coroutines: The algorithm uses Kotlin coroutines to process vectors in parallel, significantly improving performance on multi-core systems. Each vector comparison is executed in its own coroutine, allowing for efficient CPU utilization.
Thread-safe TopKList: The TopKList class is optimized for concurrent access from multiple coroutines, using a ConcurrentSkipListSet for thread-safe operations without requiring external synchronization.
Skip Normalization for Pre-normalized Vectors: When working with embeddings that are already normalized (e.g., from models like BERT, Word2Vec), you can use pre-normalized vectors to skip the normalization step, which can be expensive for high-dimensional vectors.
Efficient Memory Usage: The implementation is designed to minimize memory allocations during the search process, reusing data structures where possible.

Example of using these optimizations:

// When your vectors are already normalized (e.g., from an embedding model)
val results = search.findTopK(
    queryVector = queryVector,
    vectors = embeddingVectors,
    k = 10,
    dispatcher = Dispatchers.Default // Customize the dispatcher for your environment
)

Project Structure

The project is organized into the following packages and classes:

Package: `app.smartcoding.dotsearch`

VectorSimilarity: A data class that holds a vector and its similarity score
TopKList: A thread-safe data structure that maintains the top K elements
VectorUtils: Utility functions for vector operations (normalizedVector, string formatting)
DotProductSearch: Main class for performing dot product-based vector similarity search

Tests

The project includes comprehensive tests for all components:

VectorSimilarityTest: Tests for the VectorSimilarity class
TopKListTest: Tests for the TopKList class
VectorUtilsTest: Tests for the vector utility functions
DotProductSearchTest: Tests for the DotProductSearch class

Usage Example

Here's how to use the library in your code:

import app.smartcoding.dotsearch.DotProductSearch
import app.smartcoding.dotsearch.generateRandomVectors
import kotlinx.coroutines.runBlocking
import kotlin.math.acos
import kotlin.math.toDegrees

// 1. Create a DotProductSearch instance
val search = DotProductSearch()

// 2. Generate or load your vectors
runBlocking {
    // Generate a normalized query vector (the vector we want to find similar vectors to)
    val queryVector = generateRandomVectors(1, 128, normalized = true)[0]

    // Generate or load your dataset vectors
    // For this example, we'll generate 10,000 normalized random vectors
    val datasetVectors = generateRandomVectors(10_000, 128, normalized = true)

    // 3. Find the top K most similar vectors
    val (results, duration) = search.findTopK(
        queryVector = queryVector,
        vectors = datasetVectors,
        k = 5
    )

    // 4. Process the results
    println("Found ${results.size} similar vectors in $duration")

    // Each result contains the similarity score and the vector
    for (i in results.indices) {
        val result = results[i]
        val similarity = result.similarity
        val angleInDegrees = toDegrees(acos(similarity.coerceIn(-1.0, 1.0)))
        println("Result $i: Similarity = $similarity, Angle = $angleInDegrees°")
    }
}

Key Components

The library provides these main components:

Classes

DotProductSearch: Main class for performing similarity searches
- findTopK(queryVector, vectors, k, dispatcher): Finds the K most similar vectors to the query vector
VectorSimilarity<T, V>: Holds a vector and its similarity score
- similarity: The similarity score (typically a Double)
- vector: The vector being compared
TopKList<T>: Thread-safe data structure that maintains the top K elements
- add(element): Adds an element, maintaining the top-K constraint
- sortedList: Returns the current elements in sorted order

Utility Functions

generateRandomVectors(count, dimension, normalized, seed): Creates random vectors in parallel
Random.vector(dimension): Creates a random vector with values between 0.0 and 1.0
Random.normalizedVector(dimension): Creates a normalized random vector
DoubleArray.string: Formats a DoubleArray as a readable string with limited precision

The results are returned as a list of VectorSimilarity objects, sorted by similarity in descending order (highest similarity first).

Example Output with Small Dimensions

Here's an example of the algorithm's output when using 10-dimensional vectors (instead of the default 4096 dimensions). This makes it easier to visualize and understand the results:

--- SMALL DIMENSION EXAMPLE ---
SIZE = 1000 DIMENSION = 10 KNN = 5

Query vector: [0.3015, 0.4336, 0.1578, 0.3706, 0.2939, 0.3249, 0.2457, 0.3319, 0.2784, 0.3121]

Top 5 most similar vectors:
0: similarity = 0.9997, angle = 1.37°, vector = [0.3027, 0.4328, 0.1563, 0.3712, 0.2951, 0.3241, 0.2449, 0.3326, 0.2775, 0.3129]
1: similarity = 0.9996, angle = 1.62°, vector = [0.3042, 0.4317, 0.1592, 0.3695, 0.2927, 0.3262, 0.2441, 0.3308, 0.2798, 0.3112]
2: similarity = 0.9995, angle = 1.83°, vector = [0.3008, 0.4351, 0.1565, 0.3721, 0.2922, 0.3237, 0.2472, 0.3305, 0.2791, 0.3135]
3: similarity = 0.9994, angle = 1.97°, vector = [0.3031, 0.4342, 0.1584, 0.3698, 0.2945, 0.3255, 0.2463, 0.3327, 0.2769, 0.3108]
4: similarity = 0.9993, angle = 2.14°, vector = [0.3022, 0.4325, 0.1571, 0.3715, 0.2933, 0.3258, 0.2465, 0.3312, 0.2792, 0.3117]

KNN search completed in 245ms
--- END OF SMALL DIMENSION EXAMPLE ---

In this example:

We use 10-dimensional vectors instead of 4096
The query vector and result vectors are fully displayed (not truncated)
The similarity scores are very high (close to 1.0) indicating very similar vectors
The angles between vectors are very small (1-2 degrees)
The search is much faster due to the smaller dimensions and dataset size

Dependencies

Kotlin: JVM version 2.1.20
Multik: A Kotlin numerical library for multidimensional arrays (version 0.2.2)
Kotlinx Coroutines: For asynchronous programming (version 1.7.3)

When to Use This Algorithm

This algorithm is particularly effective when:

You need to find similar vectors in a large dataset
All vectors can fit in memory
You only need the top K most similar results
You can benefit from parallel processing

Limitations

The algorithm requires all vectors to fit in memory
For extremely large datasets or dimensions, consider using approximate nearest neighbor algorithms or indexing structures
The current implementation assumes normalized vectors for cosine similarity

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.idea		.idea
gradle/wrapper		gradle/wrapper
src		src
.gitignore		.gitignore
ALGORITHM.md		ALGORITHM.md
LICENSE		LICENSE
README.md		README.md
build.gradle.kts		build.gradle.kts
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle.kts		settings.gradle.kts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vector Similarity Search with Dot Product

Overview

Algorithm Overview

Key Features

Implementation Details

Core Components

Algorithm Steps

Performance Characteristics

Performance Optimizations

Project Structure

Package: `app.smartcoding.dotsearch`

Tests

Usage Example

Key Components

Classes

Utility Functions

Example Output with Small Dimensions

Dependencies

When to Use This Algorithm

Limitations

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Vector Similarity Search with Dot Product

Overview

Algorithm Overview

Key Features

Implementation Details

Core Components

Algorithm Steps

Performance Characteristics

Performance Optimizations

Project Structure

Package: app.smartcoding.dotsearch

Tests

Usage Example

Key Components

Classes

Utility Functions

Example Output with Small Dimensions

Dependencies

When to Use This Algorithm

Limitations

License

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Package: `app.smartcoding.dotsearch`

Packages