Bases: QuantizationConfig
Config for DeepSpeed FP quantizer. It supports fp6 and fp8.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| weight_bits | int | The target quantization bits, 6 or 8. | 8 | 
| group_size | int | The group size for quantization. | 512 | 
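As a hedged illustration of how these two fields might be parsed from a checkpoint's quantization config dict, consider the sketch below. The class name, the dict keys (`"bits"`, `"group_size"`), and the fallbacks are assumptions for illustration, not necessarily the exact keys vLLM reads.

```python
# Illustrative sketch only: a minimal config holder with a
# from_config-style classmethod. Key names and defaults are assumptions.
from dataclasses import dataclass
from typing import Any


@dataclass
class FPQuantConfigSketch:
    weight_bits: int = 8   # target quantization bits, 6 or 8
    group_size: int = 512  # number of weight elements sharing one scale

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "FPQuantConfigSketch":
        weight_bits = int(config.get("bits", 8))
        if weight_bits not in (6, 8):
            raise ValueError("weight_bits must be 6 or 8")
        return cls(
            weight_bits=weight_bits,
            group_size=int(config.get("group_size", 512)),
        )


cfg = FPQuantConfigSketch.from_config({"bits": 6, "group_size": 128})
```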
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
  
from_config(config: dict[str, Any]) -> DeepSpeedFPConfig (classmethod)
get_linear_method() -> DeepSpeedFPLinearMethod (staticmethod)
get_name() -> QuantizationMethods (classmethod)
 
 get_quant_method(
    layer: Module, prefix: str
) -> Optional[DeepSpeedFPLinearMethod]
 
  Bases: LinearMethodBase
Linear method for DeepSpeedFP quantizer.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| quant_config | DeepSpeedFPConfig | the DeepSpeedFP quantization config. | required | 
  
 __init__(quant_config: DeepSpeedFPConfig)
 
    
 create_weights(
    layer: Module,
    input_size_per_partition: int,
    output_partition_sizes: list[int],
    input_size: int,
    output_size: int,
    params_dtype: dtype,
    weight_loader=None,
    **extra_weight_attrs,
)
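As an illustration of the shape bookkeeping such a method must do, the sketch below derives the full weight shape and the number of per-group scales from the parameters above. The helper name and the ceil-division layout are assumptions for illustration, not vLLM's actual implementation.

```python
# Hypothetical sketch of create_weights-style shape bookkeeping.
# The full weight matrix is (sum(output_partition_sizes),
# input_size_per_partition); group-wise quantization keeps one scale
# per group_size elements.
def weight_shapes(output_partition_sizes: list[int],
                  input_size_per_partition: int,
                  group_size: int) -> tuple[tuple[int, int], int]:
    out_rows = sum(output_partition_sizes)
    n_elems = out_rows * input_size_per_partition
    n_groups = (n_elems + group_size - 1) // group_size  # ceil division
    return (out_rows, input_size_per_partition), n_groups


shape, n_scales = weight_shapes([1024, 1024], 4096, group_size=512)
```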
  
  Bases: Parameter
DeepSpeedFP quantized parameter class that implements fp8/fp6 quantization via DeepSpeed. Weights are stored in quantized form on GPUs and can be dequantized on the fly when needed by the model.
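The store-quantized/dequantize-on-demand idea can be sketched in plain Python. This toy version uses symmetric int8 rounding with one scale per group as a stand-in for the real FP6/FP8 GPU kernels; the function names and the int8 format are illustrative assumptions.

```python
# Toy group-wise quantization sketch: weights live in a compact quantized
# form plus one scale per group, and are dequantized only when needed.
def quantize_grouped(weights: list[float], group_size: int):
    qvals: list[int] = []
    scales: list[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One absmax scale per group; guard against an all-zero group.
        scale = max(abs(w) for w in group) / 127 or 1.0
        scales.append(scale)
        qvals.extend(round(w / scale) for w in group)
    return qvals, scales


def dequantize_grouped(qvals: list[int], scales: list[float],
                       group_size: int) -> list[float]:
    # Each element is rescaled by its group's scale.
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]


w = [0.5, -1.0, 0.25, 2.0]
q, s = quantize_grouped(w, group_size=2)
w_hat = dequantize_grouped(q, s, group_size=2)
```

The round trip is lossy but close: each reconstructed weight differs from the original by at most half a quantization step of its group.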
  
 __new__(
    orig_shape: Size,
    params_dtype: dtype,
    quant_config: DeepSpeedFPConfig,
)
  
 ds_dequantize(fp_out=None) -> Tensor
Return a tensor containing the dequantized weights of this parameter.
  
 ds_quantize_(tensor: Tensor)
   
 ds_selective_dequantize(indices, fp_out=None) -> Tensor
Return a tensor where only the weights at indices are dequantized (to save HBM -> SRAM bandwidth).
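The bandwidth-saving idea behind selective dequantization can be sketched in plain Python: only the rows named in `indices` are reconstructed, and untouched rows are never read. The row-major, one-scale-per-row storage layout here is an illustrative assumption, not the actual kernel's layout.

```python
# Sketch of selective dequantization: reconstruct only the requested rows,
# skipping the rest (the real kernel saves HBM -> SRAM bandwidth this way).
def selective_dequantize(qrows: list[list[int]],
                         scales: list[float],
                         indices: list[int]) -> dict[int, list[float]]:
    # qrows: quantized rows; scales: one scale per row (assumed layout).
    return {i: [q * scales[i] for q in qrows[i]] for i in indices}


qrows = [[10, -20], [30, 40], [-50, 60]]
scales = [0.1, 0.01, 0.001]
out = selective_dequantize(qrows, scales, indices=[0, 2])
```

Row 1 is never touched, so its quantized storage is never read or expanded.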