CLI to stream JSONL or Parquet datasets, count tokenizer tokens with a base model tokenizer (default: Qwen/Qwen3-1.7B-Base), and write a distribution-focused Markdown report with optional JSON output.
```bash
pip install -r requirements.txt
pip install -e .
```
For optional PDF export support, install the extra:

```bash
pip install -e ".[pdf]"
```
Parquet:

```bash
python -m token_counter.cli --input data/your_file_dataset.parquet --format parquet
```

JSONL (limit to first 500 rows):

```bash
python -m token_counter.cli --input data/your_file_dataset.jsonl --format jsonl --max-docs 500
```

Hugging Face sharded Parquet (all matching files):

```bash
python -m token_counter.cli --input "hf://datasets/g4me/corpus-carolina-v2@main/data/corpus/part-*.parquet" --format parquet
```
The console script alias is also available after installation: `token-counter --input ...`
You can stream files directly from the Hub by passing a remote path or glob pattern to --input.
Public dataset shards (glob):

```bash
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
```

Private dataset shards (PowerShell):

```powershell
$env:HF_TOKEN="hf_xxx"
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
```
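To make the anatomy of an `hf://` input path concrete, here is a purely illustrative parser for the pieces the examples above use (in practice, resolution of `hf://` URIs is handled for you by the Hugging Face filesystem layer, not by code like this):

```python
import re

# Illustrative pattern for hf://datasets/<org>/<dataset>@<revision>/<path>
HF_PATTERN = re.compile(
    r"^hf://datasets/(?P<org>[^/]+)/(?P<name>[^/@]+)@(?P<revision>[^/]+)/(?P<path>.+)$"
)

def parse_hf_uri(uri: str) -> dict:
    """Split an hf:// dataset URI into org, name, revision, and file path/glob."""
    match = HF_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not an hf:// dataset URI: {uri}")
    return match.groupdict()

parts = parse_hf_uri(
    "hf://datasets/g4me/corpus-carolina-v2@main/data/corpus/part-*.parquet"
)
# parts["path"] keeps the glob, which is expanded against the repo tree
```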
If the text field is not named `text`, pass `--field <column_name>`.
- `--input` (required): dataset path, URL, or glob pattern.
- `--format` (`jsonl`|`parquet`, default `parquet`): dataset format.
- `--model` (default `Qwen/Qwen3-1.7B-Base`): tokenizer to load via `transformers.AutoTokenizer`.
- `--field` (default `text`): record field containing the text to tokenize.
- `--add-special-tokens` (default `False`): include special tokens in the length.
- `--max-docs` (int, optional): stop after N documents.
- `--trust-remote-code` (flag): allow custom tokenizer code from the model repo.
- `--report` (default `reports/token_count_report.md`): Markdown report path. Use an empty string to skip.
- `--report-json` (optional): structured JSON report path. Disabled by default.
- `--report-pdf` (flag): generate a PDF next to the Markdown report using the same base filename.
The CLI builds one canonical report payload and can render it to Markdown and JSON.
Markdown sections:
- Run context with tokenizer settings, timestamps, report paths, and package versions.
- Distribution snapshot with documents processed, mean, median, IQR, P95, and P99.
- Distribution histogram rendered as a single PNG chart embedded in Markdown with fixed token buckets.
- Data quality with skipped/null/empty/coerced row counts.
- Performance with wall-clock time and throughput metrics.
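The distribution snapshot above can be sketched with stdlib-only arithmetic. This uses nearest-rank percentiles; the package may compute percentiles differently, so treat the numbers as illustrative:

```python
import math
import statistics

def percentile(values: list[int], p: float) -> int:
    """Nearest-rank percentile over an already-sorted list."""
    k = max(0, math.ceil(p / 100 * len(values)) - 1)
    return values[k]

def snapshot(lengths: list[int]) -> dict:
    """Summary statistics matching the report's distribution snapshot fields."""
    data = sorted(lengths)
    return {
        "documents": len(data),
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "iqr": percentile(data, 75) - percentile(data, 25),
        "p95": percentile(data, 95),
        "p99": percentile(data, 99),
    }

stats = snapshot(list(range(1, 101)))  # toy token counts 1..100
```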
JSON report:
- Includes `schema_version`, `status`, `run_metadata`, `summary_stats`, `distribution_stats`, `data_quality_stats`, and `performance_stats`.
- Keeps richer distribution details such as percentiles, IQR, histogram buckets, min/max, and standard deviation for programmatic comparison.
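For orientation, a payload with the documented top-level keys can be assembled like this sketch. The nested field contents here are illustrative placeholders, not the package's exact schema:

```python
import json
from datetime import datetime, timezone

def build_report_payload(lengths: list[int], model_name: str, elapsed_s: float) -> dict:
    """Assemble a dict mirroring the documented top-level report keys."""
    n = len(lengths)
    return {
        "schema_version": 1,
        "status": "ok" if n else "empty",
        "run_metadata": {
            "model": model_name,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
        "summary_stats": {"documents": n, "total_tokens": sum(lengths)},
        "distribution_stats": {
            "min": min(lengths, default=0),
            "max": max(lengths, default=0),
        },
        "data_quality_stats": {"skipped_rows": 0},
        "performance_stats": {
            "wall_clock_s": elapsed_s,
            "docs_per_s": n / elapsed_s if elapsed_s else 0.0,
        },
    }

report_json = json.dumps(build_report_payload([120, 340, 90], "Qwen/Qwen3-1.7B-Base", 1.5))
```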
You can convert a generated Markdown report to PDF with the cross-platform Python exporter. It renders the Markdown to HTML and writes the PDF directly from Python, preserving tables and the distribution plot image.
During counting, you can ask the CLI to emit the PDF automatically from the Markdown report path:
```bash
python -m token_counter.cli --input data/your_file_dataset.parquet --format parquet --report reports/token_count_report.md --report-pdf
```

That generates `reports/token_count_report.pdf` automatically.
To convert an existing report:

```bash
python -m token_counter.pdf_export --input reports/gutenberg_distribution_report.md
```

To choose the output file explicitly:

```bash
python -m token_counter.pdf_export --input reports/gutenberg_distribution_report.md --output reports/gutenberg_distribution_report.pdf
```

The package also exposes:
- Module: `python -m token_counter.pdf_export ...`
- Console script: `token-counter-report-pdf --input ...`
- Wrapper script: `python scripts/export_report_pdf.py --input ...`
The counting CLI itself can be invoked as:

- Module: `python -m token_counter.cli ...`
- Console script: `token-counter ...`
- Wrapper script: `python scripts/count_tokens.py ...`
- Uses streaming `datasets.load_dataset(..., streaming=True)` to avoid loading full datasets into memory.
- Parquet and JSONL files can be local files, remote URLs, or remote glob patterns.
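For local files, the constant-memory behavior can be sketched with a lazy JSONL reader over a glob pattern. This is illustrative only; the CLI delegates to `datasets` streaming, which also handles remote URLs and Parquet:

```python
import glob
import json
import os
import tempfile
from typing import Iterator

def iter_jsonl(pattern: str) -> Iterator[dict]:
    """Lazily yield records from every file matching a local glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Demo: two small shards written to a temp directory
tmp = tempfile.mkdtemp()
for name, rows in [("part-0.jsonl", [{"text": "a b"}]), ("part-1.jsonl", [{"text": "c"}])]:
    with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

records = list(iter_jsonl(os.path.join(tmp, "part-*.jsonl")))
```

Because `iter_jsonl` is a generator, only one line is held in memory at a time regardless of dataset size.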