Conversation
Greptile Summary

This PR integrates the `GroupedTensor` class introduced in #2388 into the PyTorch bindings.
Implementation Notes

The implementation allocates all weight data in a single contiguous buffer, then creates individual parameter views that share the underlying storage. This improves memory locality and enables future optimizations such as grouped GEMMs (#2502).

Confidence Score: 4/5
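A minimal sketch of that views-over-one-buffer pattern (illustrative only; the names and shapes here are made up, not the PR's actual code):

```python
import torch

# Allocate one contiguous buffer, then expose each GEMM's weight as a
# view that aliases a disjoint slice of the same storage.
num_gemms, in_features, out_features = 4, 128, 256
buf = torch.empty(num_gemms * out_features * in_features)

weights = []
offset = 0
for _ in range(num_gemms):
    numel = out_features * in_features
    # narrow() and view() return views, so no copies are made.
    w = buf.narrow(0, offset, numel).view(out_features, in_features)
    weights.append(torch.nn.Parameter(w))
    offset += numel

# All parameters live inside buf's storage.
assert all(
    w.untyped_storage().data_ptr() == buf.untyped_storage().data_ptr()
    for w in weights
)
```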
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GroupedLinear
    participant GroupedTensor
    participant Quantizer
    participant Storage

    Note over User,Storage: Initialization Phase
    User->>GroupedLinear: __init__(num_gemms, in_features, out_features)
    GroupedLinear->>GroupedLinear: register_parameter(weight0...weightN)
    GroupedLinear->>GroupedLinear: reset_parameters()
    GroupedLinear->>GroupedLinear: make_grouped_weights()

    Note over GroupedLinear,Storage: Weight Consolidation
    GroupedLinear->>Quantizer: _get_weight_quantizers()
    Quantizer-->>GroupedLinear: [quantizer0...quantizerN]
    GroupedLinear->>GroupedTensor: make_grouped_tensor(num_tensors, shapes, quantizers)

    Note over GroupedTensor,Storage: Allocate Contiguous Storage
    GroupedTensor->>GroupedTensor: analyze shape patterns
    GroupedTensor->>GroupedTensor: calculate logical_shape, offsets
    GroupedTensor->>Storage: allocate contiguous buffers (data, scale_inv, etc)
    GroupedTensor->>GroupedTensor: split_into_quantized_tensors()
    GroupedTensor-->>GroupedLinear: grouped_weights with quantized_tensors

    Note over GroupedLinear: Copy & Re-register Weights
    loop for each weight i
        GroupedLinear->>GroupedTensor: quantized_tensors[i].copy_(weights[i])
        GroupedLinear->>GroupedLinear: register_parameter(weightI, quantized_tensors[i])
    end

    Note over User,Storage: Forward Pass
    User->>GroupedLinear: forward(inp, m_splits)
    GroupedLinear->>GroupedLinear: _get_weight_tensors()
    GroupedLinear->>GroupedLinear: prepare quantizers
    GroupedLinear->>GroupedLinear: _GroupedLinear.apply()
    Note over GroupedLinear: All weights share contiguous storage
    GroupedLinear->>GroupedLinear: general_grouped_gemm(weights, inputs)
    GroupedLinear-->>User: output tensor
```
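For context, a hedged usage sketch of the module flow the diagram describes (the constructor and forward arguments follow the calls shown above; exact defaults may differ):

```python
import torch
import transformer_engine.pytorch as te

num_gemms, in_features, out_features = 4, 128, 256
layer = te.GroupedLinear(num_gemms, in_features, out_features)

# m_splits partitions the input rows, one chunk per GEMM, and must
# sum to the total number of rows.
inp = torch.randn(64, in_features, device="cuda")
m_splits = [16, 16, 16, 16]
out = layer(inp, m_splits)  # each chunk is multiplied by its own weight
print(out.shape)            # torch.Size([64, 256])
```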
style: Check that the copy operation works correctly for all quantization recipes (FP8, MXFP8, NVFP4, block scaling). The TODO comment on line 771 acknowledges this needs verification.
style: Check that all quantizers in the group are compatible. The comment acknowledges uncertainty about whether multiple quantizers are needed, but the implementation assumes they are "effectively the same"; mixed quantization schemes could cause issues.
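One way to make both checks concrete is a recipe-parametrized round-trip test along these lines (a sketch: `build_grouped_weights` and the recipe strings are placeholders for whatever the test suite actually exposes):

```python
import pytest
import torch

RECIPES = ["fp8_delayed_scaling", "mxfp8", "nvfp4", "fp8_block_scaling"]

@pytest.mark.parametrize("recipe", RECIPES)
def test_grouped_weight_copy_roundtrip(recipe):
    refs = [torch.randn(256, 128, device="cuda") for _ in range(4)]
    # Hypothetical helper: builds grouped weights under the given
    # recipe and runs copy_() from refs into them.
    grouped = build_grouped_weights(refs, recipe)
    for ref, w in zip(refs, grouped):
        # Loose tolerances, since the round trip goes through
        # quantization; tighten per recipe in a real test.
        torch.testing.assert_close(w.dequantize(), ref, rtol=0.1, atol=0.1)
```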
* changes for pytorch extension; but everything seems to be broken, probably unrelated to my changes
* fix the issues
* comment out nvte API since Oleg's PR is not merged
* test for all cases
* tensor attributes should be set later

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
```python
def make_quantizer(quantization: str, num_tensors: int, shape: List[Tuple[int, int]]) -> Quantizer:
```
We don't need `num_tensors` as an argument here anymore, I think, because we assume all tensors in the group use the same kind of quantizer.
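Something like this, assuming the single-quantizer-kind assumption holds (a hedged sketch, not a concrete final signature):

```python
from typing import List, Tuple

def make_quantizer(quantization: str, shape: List[Tuple[int, int]]) -> Quantizer:
    # len(shape) already encodes how many tensors are in the group,
    # so a separate num_tensors argument is redundant.
    ...
```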
```python
    Quantize the GroupedTensor inplace.
    """
    quantized_tensors = self.split_into_quantized_tensors()
```
Should "quantized_tensors" here be "self.quantized_tensors"?
Let's build a switch on the recipe here, like `split_quantize` does.
| m.def("dequantize", &transformer_engine::pytorch::dequantize, "Dequantize", py::arg("input"), | ||
| py::arg("otype")); | ||
|
|
||
| m.def("group_quantize", transformer_engine::pytorch::group_quantize, py::arg("tensor"), |
For future grouped quantization of weights, let's add a slot for an optional noop tensor for CUDA graphs.
```cpp
}

if (columnwise_usage) {
  columnwise_data = at::empty({total_elements}, uint8_opts);
```
A lot of this code is duplicated for each and every quantizer. Maybe we should have common code in the base Quantizer implementation.
Or maybe call the utility function from type_converters.cpp.
Integrate NVFP4 Graph Safe Grouped Quantize: ksivaman#14
```cpp
  return ret;
}

std::optional<at::Tensor> build_grouped_tensor_offsets(const size_t num_tensors,
```
Leave a TODO for the future: use a fused kernel for this function to avoid multiple torch calls.
* nvfp4 grouped quantize
* fix for paged stashing
* pass all edge cases
* clean up
* fix for other recipes

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Description
#2388 introduced the `GroupedTensor` class in the core library. This PR partly integrates this functionality into the PyTorch bindings.

Type of change

Changes

- Python wrappers for the `GroupedTensor` class.
- Integrate `GroupedTensor` into `GroupedLinear` such that the parameters are contiguous.
- Expose a `grouped_quantize` API to Python, similar to `split_quantize`, which returns a quantized `GroupedTensor` that can be directly consumed by the GEMMs ([common] Add support for cuBLASLt GEMM for GroupedTensor #2502).

Checklist:
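A hedged sketch of how the new API might be consumed end to end (names and signatures here are assumptions based on the `split_quantize` analogy, not the merged interface):

```python
import torch

# Hypothetical flow; runs only once the PR's API is in place.
tensors = [torch.randn(256, 128, device="cuda") for _ in range(4)]
shapes = [tuple(t.shape) for t in tensors]

# make_quantizer is the test helper discussed above; grouped_quantize
# is the new binding, with an assumed (tensors, quantizer) signature.
quantizer = make_quantizer("fp8", shapes)
grouped = grouped_quantize(tensors, quantizer)

# The quantized members share contiguous storage, so the result can be
# fed directly to the grouped GEMM from #2502.
```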