[Feature] Enhance dataset preprocessing memory management and fix hash failure #1621
Conversation
…ntrol Signed-off-by: Xin He <xin3.he@intel.com>
Pull request overview
This PR improves AutoRound’s calibration dataset preprocessing by reducing peak RAM via subprocess-based preprocessing and adding a persistent on-disk cache keyed by tokenizer/dataset parameters, with configuration exposed through environment variables.
Changes:
- Added subprocess preprocessing mode for calibration dataset generation with in-process fallback.
- Implemented a disk cache for preprocessed calibration datasets using a SHA-256–derived key and a completion marker.
- Integrated new environment variables into `envs.py` and documented them in `docs/environments.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/environments.md | Documents new dataset preprocessing/caching environment variables. |
| auto_round/envs.py | Adds unified env accessors for disabling subprocess mode and selecting cache directory. |
| auto_round/calib_dataset.py | Introduces subprocess-based preprocessing and persistent disk cache for calibration datasets. |
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
auto_round/calib_dataset.py:708
- The `_get_dataset_impl` docstring lists parameters like `split` and `apply_chat_template` that are not present in the function signature. This makes the internal API misleading and harder to maintain; please update the docstring to match the actual parameters and behavior.
```python
def _get_dataset_impl(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k", seed=42, nsamples=512):
    """Internal implementation: generate a dataset for calibration.

    Args:
        tokenizer (Tokenizer): The tokenizer to use for tokenization.
        seqlen (int): The exact sequence length. samples < seqlen will be dropped,
            samples longer than seqlen will be truncated
        dataset_name (str, optional): The name of the dataset or datasets separated by commas.
            Defaults to "NeelNanda/pile-10k".
        split (str, optional): The data split to use. Defaults to None.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.
        nsamples (int, optional): The total number of samples to include. Defaults to 512.
        apply_chat_template: Whether to apply chat template in tokenization.
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/intel/auto-round/sessions/0c14c972-2687-4283-aee6-3017898d7e0e
Co-authored-by: xin3he <83260933+xin3he@users.noreply.github.com>
This PR should not target this release, as it introduces a new feature, right?

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).

@wenhuach21 Yes, it targets 1.13.0, and it's ready for review.

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description
Reduced peak RAM from 9 GB to 2.5 GB for Qwen/Qwen3-0.6B.
Details:
Subprocess Preprocessing:
Operations like `datasets.map` create large temporary objects that `gc.collect()` cannot fully reclaim. Running preprocessing in a forked subprocess ensures all of that memory is returned to the OS when the subprocess exits, preventing memory leaks during quantization.
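The fork-and-fallback pattern described above can be sketched as follows; the helper names (`_preprocess`, `get_calib_samples`) and the placeholder work are assumptions, not the PR's actual API:

```python
import multiprocessing as mp


def _preprocess(dataset_name, seqlen):
    # Hypothetical stand-in for the heavy tokenization / datasets.map work.
    return [f"{dataset_name}[{i}]"[:seqlen] for i in range(3)]


def _worker(dataset_name, seqlen, queue):
    queue.put(_preprocess(dataset_name, seqlen))


def get_calib_samples(dataset_name, seqlen, use_subprocess=True):
    ctx = None
    if use_subprocess:
        try:
            ctx = mp.get_context("fork")  # "fork" is only available on POSIX
        except ValueError:
            ctx = None  # e.g. Windows: fall back to in-process mode
    if ctx is not None:
        queue = ctx.Queue()
        proc = ctx.Process(target=_worker, args=(dataset_name, seqlen, queue))
        proc.start()
        try:
            # Drain the queue before join(); joining first can deadlock on
            # large payloads still buffered in the queue's pipe.
            samples = queue.get(timeout=600)
        except Exception:
            samples = None
        proc.join()
        if samples is not None:
            return samples  # all child memory is returned to the OS on exit
    return _preprocess(dataset_name, seqlen)  # in-process fallback
```

Because the child exits after handing back its result, every allocation made during preprocessing — including allocator-retained pages that `gc.collect()` never releases — goes back to the OS.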
Added `_make_map_fingerprint` to fix a hashing warning from `datasets.map`.
Documentation:
Added `docs/environments_CN.md` as a Chinese translation of `docs/environments.md`, covering all environment variables including `AR_DISABLE_DATASET_SUBPROCESS`.
Type of Change
Related Issues
Checklist Before Submitting