
Add distillation dataset cleaner #6

Open
SemyonEpanov wants to merge 4 commits into main from add-distillation-dataset-cleaner


@SemyonEpanov
Collaborator

Cleaning distillation

SemyonEpanov force-pushed the add-distillation-dataset-cleaner branch 3 times, most recently from 729aab9 to 9b8c4b9 (February 15, 2026 17:17)
- Qwen2.5-32B-Instruct for cleaning (batch_size=24, max_len=20480)
- Detailed examples in prompts for better cleaning quality
- 4 separate runners: gptoss_b/c, qwen_b/c
- Prompts externalized to prompts.py
- Safe checkpoint system with unique names (timestamp + PID)
- Output suffix: _cleaned_32b.parquet
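
For reviewers, the checkpoint naming and output suffix items above could look roughly like the sketch below. This is not taken from the PR diff: the function names, the JSON checkpoint format, and the write-then-rename step are all assumptions used for illustration. Only the timestamp + PID naming idea and the `_cleaned_32b.parquet` suffix come from the description.

```python
import json
import os
import time
from pathlib import Path


def checkpoint_path(directory: Path, prefix: str = "checkpoint") -> Path:
    """Build a checkpoint file name from the current timestamp and process ID.

    Hypothetical helper: the PID keeps runners started in the same second
    from writing to the same file.
    """
    stamp = time.strftime("%Y%m%d_%H%M%S")
    return directory / f"{prefix}_{stamp}_{os.getpid()}.json"


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temporary file first, then rename it into place.

    Assumption: "safe" is read here as "never leave a half-written file".
    """
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # replaces the target in one step on the same filesystem


def cleaned_output_path(input_path: Path, suffix: str = "_cleaned_32b.parquet") -> Path:
    """Map an input such as data/train.parquet to data/train_cleaned_32b.parquet."""
    return input_path.with_name(input_path.stem + suffix)
```

With this layout, each of the four runners would call `checkpoint_path` once at startup and reuse that path for every save, so concurrent runs do not overwrite each other's progress.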