Releases: unum-cloud/UForm
v2.1.0: Multimodal Matryoshka, Multimodal DPO, and ONNX
Today we are releasing a new batch of multimodal models, trained with Nebius and already available on Hugging Face 🤗
- Matryoshka-style multimodal embeddings with 64, 256, and 768 dimensions 🖼️ (see the slicing sketch after this list)
- Improved multimodal chat in 1.2B parameters, tuned with Direct Preference Optimization 💬
- ONNX backend, making the PyTorch dependency optional for lightning-fast deployments ⚡
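Matryoshka-style training packs the most useful information into the leading coordinates of the vector, so a smaller embedding is obtained by truncating the full one and re-normalizing. Below is a minimal sketch of that slicing step in NumPy; the 768-dimensional `full` vector is a stand-in for whatever the UForm encoders return, not a prescribed API:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` coordinates of a Matryoshka embedding
    and L2-normalize, so cosine similarities remain comparable."""
    sliced = embedding[..., :dim]
    norm = np.linalg.norm(sliced, axis=-1, keepdims=True)
    return sliced / np.maximum(norm, 1e-12)

full = np.random.randn(768).astype(np.float32)  # placeholder for a real embedding
e64 = truncate_matryoshka(full, 64)    # 12x smaller footprint for coarse search
e256 = truncate_matryoshka(full, 256)  # middle ground between speed and recall
```

A common pattern is to retrieve candidates with the 64-dimensional slice and re-rank the short list with the full 768 dimensions.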
v1.1.1: Polishing the Repo
Many thanks to @lmmx, @blackforestboi, and @kapulkin for their patches to the project!
UForm v1: Multimodal Chat in 1.5 Billion Parameters
The UForm family of tiny multimodal transformer models just got bigger! In addition to the existing CLIP-like embedding models, we now have a generative model useful for image captioning, visual question answering, and multimodal chats. All of that in just 1.5 billion parameters, small enough to fit even on mobile devices.
Repository: https://github.com/unum-cloud/uform
Generative model: https://huggingface.co/unum-cloud/uform-gen
Chat model: https://huggingface.co/unum-cloud/uform-gen-chat
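For a quick start, the snippet below follows the pattern from the model card at the time of writing; treat the `uform.gen_model` entry points and the `[cap]` / `[vqa]` prompt prefixes as assumptions and defer to the linked model cards if they have changed:

```python
import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor  # assumed entry points

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# "[cap]" steers the model towards captioning, "[vqa]" towards question answering.
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("example.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,     # greedy decoding, as in the benchmarks below
        max_new_tokens=128,
    )

prompt_len = inputs["input_ids"].shape[1]
print(processor.batch_decode(output[:, prompt_len:])[0])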
Evaluation Metrics
Being the smallest model of its kind, unum-cloud/uform-gen is hard to compare against others. The next-smallest alternatives, LLaVA and InstructBLIP, are roughly 5x larger at 7 billion parameters. LLaVA performs noticeably better on VQAv2: 78.5 vs 66.5. On captioning, however, CLIPScore and RefCLIPScore are relatively close across all models.
| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
|---|---|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7B | Long | 0.878 | 0.529 |
| llava-hf/llava-1.5-7b-hf | 7B | Short | 0.886 | 0.531 |
| Salesforce/instructblip-vicuna-7b | 7B | Long | 0.902 | 0.534 |
| Salesforce/instructblip-vicuna-7b | 7B | Short | 0.848 | 0.523 |
| unum-cloud/uform-gen | 1.5B | Long | 0.847 | 0.523 |
| unum-cloud/uform-gen | 1.5B | Short | 0.842 | 0.522 |
| unum-cloud/uform-gen-chat | 1.5B | Long | 0.860 | 0.525 |
| unum-cloud/uform-gen-chat | 1.5B | Short | 0.858 | 0.525 |
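For context on the metric: CLIPScore (Hessel et al., 2021) is reference-free, defined as `w * max(cos(image_emb, caption_emb), 0)` with `w = 2.5`, while RefCLIPScore additionally takes a harmonic mean with the similarity to human reference captions. A minimal sketch of the core computation, assuming CLIP embeddings have already been extracted:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIPScore: rescaled, clipped cosine similarity
    between CLIP embeddings of an image and a candidate caption."""
    cos = np.dot(image_emb, caption_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)
    )
    return w * max(float(cos), 0.0)
```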
Throughput
On an RTX 3090, using vanilla PyTorch for inference with bfloat16 arithmetic and greedy decoding, one should expect the following throughput numbers.
| Model | Size | Speed | Speedup |
|---|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7B | ~ 40 tokens/second | |
| Salesforce/instructblip-vicuna-7b | 7B | ~ 40 tokens/second | |
| unum-cloud/uform-gen | 1.5B | ~ 140 tokens/second | x 3.5 |
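To reproduce the measurement on other hardware, throughput is simply the number of newly generated tokens divided by wall-clock seconds under greedy decoding. A hedged timing sketch; `model` and `inputs` are assumed to come from a snippet like the one above:

```python
import time
import torch

def tokens_per_second(model, inputs, max_new_tokens: int = 128) -> float:
    """Greedy-decoding throughput: new tokens / wall-clock seconds."""
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        start = time.perf_counter()
        output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # let queued GPU work finish before stopping the clock
        elapsed = time.perf_counter() - start
    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed
```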

