VectorCollection.AddOrUpdateFrom #6

@HavenDV

dani
I have a question: when I store a stream in the storage, how does it decide whether to insert it or whether the index already exists? I have seen that DataSource.FromStream does not take a path. Doesn't that make it difficult to find an element later?
HavenDV — 05/02/2024 9:58 PM
DataSource.FromStream simply retrieves Documents from this Stream. Although there is metadata here, it is not currently used in any way, and nothing checks whether the same data is already present in the VectorCollection.
dani — 05/02/2024 10:01 PM
I don't think I expressed myself well. For example, in the code I am testing, I insert files from a repository. Can I decide whether or not to insert a file depending on whether its vectors already exist in the database?

foreach (var f in files)
{
    if (!ignoreExt.Contains(Path.GetExtension(f.FilePath).ToLower()))
    {
        index = await vectordb.AddDocumentsFromAsync<GitlabLoader>(
            embeddings,
            dimensions: dimensions,
            dataSource: DataSource.FromStream(f.ContentToStream),
            collectionName: collection);
    }
}

HavenDV — 05/02/2024 10:06 PM
IVectorDatabase.IsCollectionExistsAsync is probably the best choice at the moment, if you can store files in different collections.
IVectorCollection.IsEmptyAsync can also be used, but it is not yet implemented/tested for all databases.
dani — 05/02/2024 10:11 PM
My idea was to use one collection to store an entire repository. Would it be viable to have a method that checks whether a path already exists, and to give DataSource.FromStream an option to accept a path so that the content is indexed by it in some way?
HavenDV — 05/02/2024 10:16 PM
I understand your problem and I'll think about it. For now, the only solution is to recreate the entire collection for all files if any of the files have changed.
The problem is that one DataSource can turn into several Documents, and as a result into several vectors in the database. So we need to add metadata to the database, such as FilePath, and then check for the presence of vectors with this metadata.
Using the Id of a specific vector won't work, because the Id needs to be unique and several vectors would share the same path.
dani — 05/02/2024 10:19 PM
You could retrieve the count of vectors that have the same path, right?
HavenDV — 05/02/2024 10:23 PM
But what if the file changes partially?
dani — 05/02/2024 10:25 PM
Maybe you could store a hash of the content, and if it doesn't match, delete and reinsert?
HavenDV — 05/02/2024 10:31 PM
Yeah, that sounds like a good suggestion.

We could also add AddOrUpdateFrom, which does this automatically.
And add the ability to pass metadata to the DataSource so that you can control this.
But this seems to go a little beyond the scope of the current release, and it sounds like a good plan for the near future.
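
For reference, the flow discussed above (tag every vector with its file path and a content hash, skip unchanged files, and delete and reinsert changed ones) could be sketched roughly as follows. This is only an illustration under stated assumptions: `StoredChunk`, `InMemoryCollection`, `UpsertResult`, and `CountByPath` are hypothetical names, not part of the library, and this `AddOrUpdateFrom` is a stand-in for the proposed method rather than its actual design. Fixed-size text chunks replace real Documents, and no embeddings are computed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for one stored vector and its metadata.
// Id is unique per chunk; FilePath and ContentHash are shared by all chunks of a file.
public sealed record StoredChunk(string Id, string FilePath, string ContentHash, string Text);

public enum UpsertResult { Added, Updated, Unchanged }

// Hypothetical in-memory collection, used only to illustrate the flow.
public sealed class InMemoryCollection
{
    private readonly List<StoredChunk> _chunks = new();

    // Number of stored chunks whose metadata carries the given path.
    public int CountByPath(string filePath) =>
        _chunks.Count(c => c.FilePath == filePath);

    public UpsertResult AddOrUpdateFrom(string filePath, string content, int chunkSize = 200)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        // Hash the whole file so that even a partial change is detected.
        string hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(content)));

        var existing = _chunks.Where(c => c.FilePath == filePath).ToList();
        if (existing.Count > 0 && existing.All(c => c.ContentHash == hash))
            return UpsertResult.Unchanged; // same path, same content: nothing to do

        // Content is new or changed: drop every stale chunk for this path.
        _chunks.RemoveAll(c => c.FilePath == filePath);

        // One source becomes several chunks (a real implementation would embed each one).
        for (int offset = 0; offset < content.Length; offset += chunkSize)
        {
            string text = content.Substring(offset, Math.Min(chunkSize, content.Length - offset));
            _chunks.Add(new StoredChunk(Guid.NewGuid().ToString(), filePath, hash, text));
        }

        return existing.Count > 0 ? UpsertResult.Updated : UpsertResult.Added;
    }
}
```

In this sketch, calling `AddOrUpdateFrom` twice with the same path and content returns `Unchanged` the second time, while a changed file replaces all of its chunks. Hashing the whole file keeps the check simple at the cost of re-embedding every chunk of a file when any part of it changes.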
