Feat (vLLM): initial export support #1444
base: dev
Conversation
torch>=2.4
tqdm
transformers[sentencepiece]<5.0
vllm
I feel like vLLM should be an optional dependency.
Maybe we can do it in a similar way to what we did for lighteval/lm_eval
I'm leaving it in for now so that tests run and I can see what else I'm breaking in the process, but I'll remove it before this PR is merged.
I'm fine with doing it the same way as for lighteval/lm_eval.
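Something along these lines should work (the vllm_installed / require_vllm names below are placeholders; I'd match whatever guard pattern we used for lighteval/lm_eval):

```python
# Optional-dependency guard; `vllm_installed` / `require_vllm` are placeholder names.
try:
    import vllm  # noqa: F401
    vllm_installed = True
except ImportError:
    vllm_installed = False


def require_vllm():
    # Call this at the entry point of the vLLM export flow, so users without
    # vLLM installed only fail when they actually request the export.
    if not vllm_installed:
        raise ImportError("vLLM is required for this export flow: pip install vllm")
```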
tensor_shape_list = list(tensor_shape)
x = padding_to_multiple(x, group_dim, group_size)

tensor_shape = x.shape
Is there a way to simplify this logic? A function like torch.unflatten might be useful.
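Roughly what I have in mind (toy shapes, assuming the dimension has already been padded to a multiple of group_size):

```python
import torch

# Split `group_dim` into (num_groups, group_size) and back, without
# rebuilding the shape list by hand.
x = torch.randn(4, 12)
group_dim, group_size = 1, 4

x_grouped = x.unflatten(group_dim, (x.shape[group_dim] // group_size, group_size))
print(x_grouped.shape)  # torch.Size([4, 3, 4])

x_restored = x_grouped.flatten(group_dim, group_dim + 1)
print(x_restored.shape)  # torch.Size([4, 12])
```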
return ["brevitas_config.json"]

def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
The __init__ uses Python 3.10+ typing, e.g. ignored_layers: list[str] | None = None, while this method uses the legacy typing Optional["QuantizeMethodBase"]. Can we use Python 3.10+ typing throughout, for consistency with vLLM?
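Something like this (just illustrating the style; the string annotation keeps the forward reference to QuantizeMethodBase working):

```python
def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> "QuantizeMethodBase | None":
    ...
```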
@dataclass
class QuantConfigBrevitas(QuantizationConfig):

    def __init__(self, ignored_layers: list[str] | None = None, config: str | None = None):
Is config: str | None = None the correct typing here?
self.config = config

@classmethod
def from_config(cls, config: dict[str, Any]) -> "QuantConfigTcast":
Suggested change:
- def from_config(cls, config: dict[str, Any]) -> "QuantConfigTcast":
+ def from_config(cls, config: dict[str, Any]) -> "QuantConfigBrevitas":
prefix: str) -> Optional["QuantizeMethodBase"]:
    if isinstance(layer, RowParallelLinear) or isinstance(
            layer, MergedColumnParallelLinear) or isinstance(layer, QKVParallelLinear):
        if self.ignored_layers and is_layer_skipped(
Is the check on self.ignored_layers needed? Maybe

Suggested change:
- if self.ignored_layers and is_layer_skipped(
+ if is_layer_skipped(

would suffice, from what I see in other classes, e.g. Fp8Config.
'IntInferencetHandler': IntInferencetHandler,}


class QuantLinear(LinearMethodBase):
Consider adapting this class to better match the structure of examples such as vllm.model_executor.layers.quantization.fp_quant
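Roughly the layout I mean (argument names are from memory and may differ between vLLM versions, so treat this as a sketch rather than the exact interface):

```python
class QuantLinear(LinearMethodBase):
    """Sketch of a LinearMethodBase laid out like vLLM's built-in quant methods."""

    def __init__(self, quant_config: "QuantConfigBrevitas"):
        self.quant_config = quant_config

    def create_weights(self, layer, input_size_per_partition, output_partition_sizes,
                       input_size, output_size, params_dtype, **extra_weight_attrs):
        # Register the weight and quantization-metadata parameters on `layer`.
        ...

    def process_weights_after_loading(self, layer):
        # One-off re-packing / conversion after the checkpoint has been loaded.
        ...

    def apply(self, layer, x, bias=None):
        # Fake-quantized matmul for now; real-quant kernels could slot in here later.
        ...
```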
raise Exception(f"{impl_type} not recognized.")


def solve_float_to_int_enum_from_impl(impl_type):
This function introduces an extra maintainability burden: every time someone adds a new FloatToIntImplType, two updates are needed. This pattern matching could instead be done programmatically by iterating over the possible values of FloatToIntImplType, resolving them to classes, and then selecting the enum value corresponding to the given class (which amounts to generating the dictionary FLOAT_TO_INT_IMPL_TO_ENUM programmatically).
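Something like this (solve_float_to_int_impl_from_enum below is a stand-in for whatever enum-to-class resolver Brevitas already has):

```python
# Build the mapping programmatically so new FloatToIntImplType values only
# need to be registered once. `solve_float_to_int_impl_from_enum` is a
# stand-in for the existing enum-to-class resolver.
FLOAT_TO_INT_IMPL_TO_ENUM = {
    solve_float_to_int_impl_from_enum(enum_value): enum_value
    for enum_value in FloatToIntImplType}


def solve_float_to_int_enum_from_impl(impl_type):
    try:
        return FLOAT_TO_INT_IMPL_TO_ENUM[impl_type]
    except KeyError:
        raise RuntimeError(f"{impl_type} not recognized.")
```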
raise RuntimeError(f"{impl_type} not recognized.")


def solve_restrict_value_enum_from_impl(impl):
Same as above
'SolveDtypeDeviceFromTrackedParameterList',
'SolveRestrictScaleSign']


# FLOAT_TO_INT_ENUM_TO_IMPL = {FloatToIntImplType.ROUND: RoundSte,
Remove
return o


def state_dict(self, destination=None, prefix='', keep_vars=False):
This change to how state_dict works is specifically required for vLLM, but I don't see a reason why it should always be applied. Is it possible to do this override only in the vLLM export flow?
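For example, the override could be swapped in just for the duration of the export instead of changing the default behaviour (export_state_dict_fn below stands in for the vLLM-specific variant):

```python
import types
from contextlib import contextmanager


@contextmanager
def vllm_export_state_dict(module, export_state_dict_fn):
    # Temporarily bind the vLLM-specific state_dict onto the module instance so
    # the default checkpointing behaviour stays untouched outside the export flow.
    module.state_dict = types.MethodType(export_state_dict_fn, module)
    try:
        yield module
    finally:
        # Drop the instance-level override, falling back to the class method
        # (assumes there was no pre-existing instance-level state_dict).
        del module.state_dict
```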
out = self.layer(*kwargs.values())
return out


def state_dict(self, destination=None, prefix='', keep_vars=False):
Same as below
x = x.reshape(shape)
return x


# def compute_scale(self, x, group_dim):
Remove
Reason for this PR
Initial support for vLLM export.
To do:
Changes Made in this PR
We are reusing the inference quantizers for vLLM as well.
This is still fake-quantization style, but should be faster than plain torch execution, even in eager mode.
The same template could easily be extended to support real quantization, torch.compile, and so on.
Testing Summary
TBD