Feat (vLLM): initial export support #1444
base: dev
Conversation
torch>=2.4
tqdm
transformers[sentencepiece]<5.0
vllm
I feel like vLLM should be an optional dependency.
Maybe we can do it in a similar way to what we did for lighteval/lm_eval
I'm leaving it in for now so that tests run and I can see what else I'm breaking in the process, but I'll remove it before this PR is merged.
I'm fine with doing it the same way as for lighteval/lm_eval.
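Something along these lines should work (the vllm_installed / require_vllm names below are placeholders; I'd match whatever guard pattern we used for lighteval/lm_eval):

```python
# Optional-dependency guard; `vllm_installed` / `require_vllm` are placeholder names.
try:
    import vllm  # noqa: F401
    vllm_installed = True
except ImportError:
    vllm_installed = False


def require_vllm():
    # Call this at the entry point of the vLLM export flow, so users without
    # vLLM installed only fail when they actually request the export.
    if not vllm_installed:
        raise ImportError("vLLM is required for this export flow: pip install vllm")
```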
tensor_shape_list = list(tensor_shape)
x = padding_to_multiple(x, group_dim, group_size)

tensor_shape = x.shape
Is there a way to simplify this logic? A function like torch.unflatten might be useful.
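Roughly what I have in mind (toy shapes, assuming the dimension has already been padded to a multiple of group_size):

```python
import torch

# Split `group_dim` into (num_groups, group_size) and back, without
# rebuilding the shape list by hand.
x = torch.randn(4, 12)
group_dim, group_size = 1, 4

x_grouped = x.unflatten(group_dim, (x.shape[group_dim] // group_size, group_size))
print(x_grouped.shape)  # torch.Size([4, 3, 4])

x_restored = x_grouped.flatten(group_dim, group_dim + 1)
print(x_restored.shape)  # torch.Size([4, 12])
```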
return ["brevitas_config.json"]

def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
The __init__ uses Python 3.10+ typing, e.g. ignored_layers: list[str] | None = None, while this method uses the legacy typing Optional["QuantizeMethodBase"]. Can we use Python 3.10+ typing throughout, for consistency with vLLM?
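Something like this (just illustrating the style; the string annotation keeps the forward reference to QuantizeMethodBase working):

```python
def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> "QuantizeMethodBase | None":
    ...
```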
@dataclass
class QuantConfigBrevitas(QuantizationConfig):

    def __init__(self, ignored_layers: list[str] | None = None, config: str | None = None):
Is config: str | None = None the correct typing here?
self.config = config

@classmethod
def from_config(cls, config: dict[str, Any]) -> "QuantConfigTcast":
Suggested change:
- def from_config(cls, config: dict[str, Any]) -> "QuantConfigTcast":
+ def from_config(cls, config: dict[str, Any]) -> "QuantConfigBrevitas":
prefix: str) -> Optional["QuantizeMethodBase"]:
    if isinstance(layer, RowParallelLinear) or isinstance(
            layer, MergedColumnParallelLinear) or isinstance(layer, QKVParallelLinear):
        if self.ignored_layers and is_layer_skipped(
Is the check on self.ignored_layers needed? Maybe

Suggested change:
- if self.ignored_layers and is_layer_skipped(
+ if is_layer_skipped(

would suffice, from what I see in other classes, e.g. Fp8Config.
'IntInferencetHandler': IntInferencetHandler,}


class QuantLinear(LinearMethodBase):
Consider adapting this class to better match the structure of examples such as vllm.model_executor.layers.quantization.fp_quant
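Roughly the layout I mean (argument names are from memory and may differ between vLLM versions, so treat this as a sketch rather than the exact interface):

```python
class QuantLinear(LinearMethodBase):
    """Sketch of a LinearMethodBase laid out like vLLM's built-in quant methods."""

    def __init__(self, quant_config: "QuantConfigBrevitas"):
        self.quant_config = quant_config

    def create_weights(self, layer, input_size_per_partition, output_partition_sizes,
                       input_size, output_size, params_dtype, **extra_weight_attrs):
        # Register the weight and quantization-metadata parameters on `layer`.
        ...

    def process_weights_after_loading(self, layer):
        # One-off re-packing / conversion after the checkpoint has been loaded.
        ...

    def apply(self, layer, x, bias=None):
        # Fake-quantized matmul for now; real-quant kernels could slot in here later.
        ...
```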
raise Exception(f"{impl_type} not recognized.")


def solve_float_to_int_enum_from_impl(impl_type):
This function introduces an extra maintainability burden: every time someone adds a new FloatToIntImplType, two updates are needed. This pattern matching could instead be done programmatically by iterating over the possible values of FloatToIntImplType, resolving them to classes, and then selecting the enum value corresponding to the given class (which amounts to generating the dictionary FLOAT_TO_INT_IMPL_TO_ENUM programmatically).
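Something like this (solve_float_to_int_impl_from_enum below is a stand-in for whatever enum-to-class resolver Brevitas already has):

```python
# Build the mapping programmatically so new FloatToIntImplType values only
# need to be registered once. `solve_float_to_int_impl_from_enum` is a
# stand-in for the existing enum-to-class resolver.
FLOAT_TO_INT_IMPL_TO_ENUM = {
    solve_float_to_int_impl_from_enum(enum_value): enum_value
    for enum_value in FloatToIntImplType}


def solve_float_to_int_enum_from_impl(impl_type):
    try:
        return FLOAT_TO_INT_IMPL_TO_ENUM[impl_type]
    except KeyError:
        raise RuntimeError(f"{impl_type} not recognized.")
```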
raise RuntimeError(f"{impl_type} not recognized.")


def solve_restrict_value_enum_from_impl(impl):
Same as above
'SolveDtypeDeviceFromTrackedParameterList',
'SolveRestrictScaleSign']


# FLOAT_TO_INT_ENUM_TO_IMPL = {FloatToIntImplType.ROUND: RoundSte,
Remove
return o


def state_dict(self, destination=None, prefix='', keep_vars=False):
This change to how state_dict works is specifically required for vLLM, but I don't see a reason why it should always be applied. Is it possible to do this override only in the vLLM export flow?
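For example, the override could be swapped in just for the duration of the export instead of changing the default behaviour (export_state_dict_fn below stands in for the vLLM-specific variant):

```python
import types
from contextlib import contextmanager


@contextmanager
def vllm_export_state_dict(module, export_state_dict_fn):
    # Temporarily bind the vLLM-specific state_dict onto the module instance so
    # the default checkpointing behaviour stays untouched outside the export flow.
    module.state_dict = types.MethodType(export_state_dict_fn, module)
    try:
        yield module
    finally:
        # Drop the instance-level override, falling back to the class method
        # (assumes there was no pre-existing instance-level state_dict).
        del module.state_dict
```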
out = self.layer(*kwargs.values())
return out


def state_dict(self, destination=None, prefix='', keep_vars=False):
Same as below
x = x.reshape(shape)
return x


# def compute_scale(self, x, group_dim):
Remove
Reason for this PR
Initial support for vLLM export.
To do:
Changes Made in this PR
We are reusing the inference quantizers for vLLM as well.
This is still fake-quantization style, but should be faster than plain torch execution, even in eager mode.
The same template could easily be extended to support real quantization, torch.compile, and so on.
Testing Summary
TBD