Conversation
closes #3586

@stephantul can I ask you to review the metadata

ran it using:

```python
import mteb

model = mteb.get_model("stephantulkens/NIFE-mxbai-embed-large-v1")

# dummy small tasks
task1 = mteb.get_task("LccSentimentClassification")
task2 = mteb.get_task("TwitterHjerneRetrieval")

results = mteb.evaluate(model, [task1, task2], encode_kwargs={"device": "cpu"})  # to prevent MPS error
```

which uses the `encode` function for the classification task. Is that intended?
```python
n_parameters=76802304,  # TODO: what do we do for routers? Both models I assume? - this is for the query router / student model
n_active_parameters_override=None,  # TODO: not sure how to count this for routers - WDYT?
n_embedding_parameters=76802304,  # this is for the query router / student model
```
@ayush1298 and @Samoed tagging both of you here as well, as this is related to our work on embedding dimensions.

Short intro: This is a router model which uses a static embedding model for the queries and the original model for the corpus.

I am unsure how we want to record this, but I am leaning towards:

- active_parameters: the parameters active when embedding. For router models that use different models for the queries and the corpus, we use the parameters of the query router, as it best resembles inference (does it though, e.g. for a system with many new documents a day?)
- n_parameters: the full set of parameters for all models
- n_embedding_parameters: also the full set

A sketch of what that would mean is below. Let me know what you think.
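A rough sketch in Python of how the counts would combine under this convention; the student count comes from the diff above, while the teacher count is a placeholder for illustration, not a measured value:

```python
# Sketch of the proposed convention for router models (illustrative only).
student_params = 76_802_304   # query router / student model, from the diff above
teacher_params = 335_000_000  # PLACEHOLDER for the corpus/teacher model

# Active parameters: the query side only, as that best resembles inference.
n_active_parameters_override = student_params

# Total and embedding parameters: the full set across both models.
n_parameters = student_params + teacher_params
n_embedding_parameters = student_params + teacher_params
```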
The idea seems great. I just have one doubt.

Here, as I understand it:

- stephantulkens/NIFE-mxbai-embed-large-v1 is a teacher model, and it can also act as the bigger model in the router
- stephantulkens/NIFE-gte-modernbert-base is a student model and acts as the smaller model in the router

Now, my doubt is: do we have any routing kind of task where we can say that we have tested this combination? I think we will be reporting individual model results only. In that case, wouldn't it be better to keep it as a standalone dense embedding model?
@ayush1298 hey! Both models are student models. The names after NIFE are the teachers, but the teacher names are also in the repositories and can be extracted automatically:

- stephantulkens/NIFE-mxbai-embed-large-v1: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
- stephantulkens/NIFE-gte-modernbert-base: https://huggingface.co/Alibaba-NLP/gte-modernbert-base
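For what it's worth, something like the sketch below could extract the teacher name automatically; the config filename and key are assumptions on my part, so check the repos for where pynife actually records the teacher:

```python
# Hypothetical sketch: read the teacher checkpoint name from a student repo.
# The filename and key below are assumptions; pynife may store this differently.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("stephantulkens/NIFE-mxbai-embed-large-v1", "config.json")
with open(path) as f:
    config = json.load(f)

teacher = config.get("teacher_model")  # assumed key
print(teacher)  # expected: mixedbread-ai/mxbai-embed-large-v1
```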
```python
training_datasets=set(),
superseded_by=None,
modalities=["text"],
model_type=["dense"],  # TODO: is router a model type?
```
seems like we need a new model type here
From a conversation with @stephantul I see that this is currently not a router; to make it one we would have to use https://github.com/stephantul/pynife/blob/main/pynife/nife.py. I am unsure if we would rather want that as the default sentence transformer model (to avoid having too much model-specific code on our side).

I'm fine with re-uploading it as a router, I'll just add a comment that this is meant for MTEB compatibility. Should only take a little bit of time on my end.
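A minimal sketch of what the router re-upload could look like with the sentence-transformers Router module (available in sentence-transformers >= 5.0); the module layout here is an assumption, not the actual pynife setup:

```python
# Minimal sketch: route queries to the small static student model and
# documents to the full teacher model. Layout is assumed, not pynife's.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Router

query_model = SentenceTransformer("stephantulkens/NIFE-mxbai-embed-large-v1")
document_model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

router = Router.for_query_document(
    query_modules=list(query_model.children()),
    document_modules=list(document_model.children()),
)
model = SentenceTransformer(modules=[router])

# Queries run the static model; documents run the full transformer.
q_emb = model.encode_query(["what is a router model?"])
d_emb = model.encode_document(["A router dispatches inputs to sub-models."])
```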
```python
reference="https://huggingface.co/stephantulkens/NIFE-gte-modernbert-base",
similarity_fn_name=ScoringFunction.COSINE,
use_instructions=False,  # assumed
training_datasets=set(),
```
The GitHub repo (https://github.com/stephantul/pynife/tree/) mentions this collection of datasets: https://huggingface.co/collections/stephantulkens/nife-data
Yep, this is the correct dataset, specifically: https://huggingface.co/collections/stephantulkens/gte-modernbert-embedpress
```python
citation="""@software{Tulkens2025pyNIFE,
    author = {St\'{e}phan Tulkens},
    title = {pyNIFE: nearly inference free embeddings in python},
    year = {2025},
    publisher = {Zenodo},
    doi = {10.5281/zenodo.17512919},
    url = {https://github.com/stephantulkens/pynife},
    license = {MIT},
}""",
```
The url here is wrong; it should be https://github.com/stephantul/pynife
@stephantul Seems like you need to update the citation in your readme.
Thanks for flagging! I updated it just now, sorry for the confusion