Token Counter

CLI that streams JSONL or Parquet datasets, counts tokens with a base-model tokenizer (default: Qwen/Qwen3-1.7B-Base), and writes a distribution-focused Markdown report, with optional JSON output.

Installation

pip install -r requirements.txt
pip install -e .

PDF export is optional. To enable it, install the pdf extra:

pip install -e ".[pdf]"

Quick start

Parquet:
python -m token_counter.cli --input data/your_file_dataset.parquet --format parquet
JSONL (limit to first 500 rows):
python -m token_counter.cli --input data/your_file_dataset.jsonl --format jsonl --max-docs 500
Hugging Face sharded Parquet (all matching files):
python -m token_counter.cli --input "hf://datasets/g4me/corpus-carolina-v2@main/data/corpus/part-*.parquet" --format parquet

The console script alias is also available after installation: token-counter --input ...

Hugging Face datasets

You can stream files directly from the Hub by passing a remote path or glob pattern to --input.

Public dataset shards (glob):
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet
Private dataset shards (PowerShell):
$env:HF_TOKEN="hf_xxx"
python -m token_counter.cli --input "hf://datasets/<org>/<dataset>@main/<folder>/part-*.parquet" --format parquet

If the text field is not named text, pass --field <column_name>.

Flags

--input (required): dataset path, URL, or glob pattern.
--format (jsonl | parquet, default parquet): dataset format.
--model (default Qwen/Qwen3-1.7B-Base): tokenizer to load via transformers.AutoTokenizer.
--field (default text): record field containing the text to tokenize.
--add-special-tokens (default False): include special tokens in the length.
--max-docs (int, optional): stop after N documents.
--trust-remote-code (flag): allow custom tokenizer code from the model repo.
--report (default reports/token_count_report.md): Markdown report path. Pass an empty string to skip the Markdown report.
--report-json (optional): structured JSON report path. Disabled by default.
--report-pdf (flag): generate a PDF next to the Markdown report using the same base filename.
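
The flags above map naturally onto argparse. The following is a minimal sketch, assuming a hypothetical build_parser helper. It mirrors the documented names and defaults but is not the package's actual parser, and the real CLI may differ, for example in how --add-special-tokens is parsed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="token-counter")
    parser.add_argument("--input", required=True, help="dataset path, URL, or glob pattern")
    parser.add_argument("--format", choices=["jsonl", "parquet"], default="parquet")
    parser.add_argument("--model", default="Qwen/Qwen3-1.7B-Base")
    parser.add_argument("--field", default="text")
    # Assumption: modeled as an on/off switch; the real CLI may expect a value instead.
    parser.add_argument("--add-special-tokens", action="store_true")
    parser.add_argument("--max-docs", type=int, default=None)
    parser.add_argument("--trust-remote-code", action="store_true")
    parser.add_argument("--report", default="reports/token_count_report.md")
    parser.add_argument("--report-json", default=None)
    parser.add_argument("--report-pdf", action="store_true")
    return parser


# Parse the JSONL quick-start invocation without touching sys.argv.
args = build_parser().parse_args(
    ["--input", "data/your_file_dataset.jsonl", "--format", "jsonl", "--max-docs", "500"]
)
print(args.format, args.max_docs, args.field)
```

Passing an explicit argument list to parse_args keeps the sketch runnable in a notebook or test without a real command line.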

Report

The CLI builds one canonical report payload and can render it to Markdown and JSON.

Markdown sections:

Run context with tokenizer settings, timestamps, report paths, and package versions.
Distribution snapshot with documents processed, mean, median, IQR, P95, and P99.
Distribution histogram rendered as a single PNG chart embedded in Markdown with fixed token buckets.
Data quality with skipped/null/empty/coerced row counts.
Performance with wall-clock time and throughput metrics.
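
The distribution snapshot and the fixed-bucket histogram can both be reproduced from a list of per-document token counts. The sketch below uses only the standard library. The function names, the bucket edges, and the quantile method are assumptions for illustration, so the tool's own percentile method and bucket boundaries may differ.

```python
import statistics
from bisect import bisect_right


def distribution_snapshot(lengths: list[int]) -> dict[str, float]:
    """Summarize token counts: documents, mean, median, IQR, P95, P99."""
    # n=100 yields 99 cut points; index p-1 holds the p-th percentile.
    # statistics.quantiles needs at least two data points on older Pythons.
    pct = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "documents": len(lengths),
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "iqr": pct[74] - pct[24],
        "p95": pct[94],
        "p99": pct[98],
    }


def bucket_counts(lengths: list[int], edges: list[int]) -> list[int]:
    """Count documents per bucket [edges[i], edges[i+1]); the last bucket is open-ended."""
    counts = [0] * len(edges)
    for n in lengths:
        idx = bisect_right(edges, n) - 1
        if idx >= 0:  # values below the first edge are ignored
            counts[idx] += 1
    return counts


lengths = [10, 20, 30, 40, 50]
snapshot = distribution_snapshot(lengths)
histogram = bucket_counts(lengths, edges=[0, 16, 32, 64])  # hypothetical bucket edges
print(snapshot)
print(histogram)
```

Fixed edges keep histograms comparable across runs, which is presumably why the report uses fixed token buckets rather than data-dependent bins.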

JSON report:

Includes schema_version, status, run_metadata, summary_stats, distribution_stats, data_quality_stats, and performance_stats.
Keeps richer distribution details such as percentiles, IQR, histogram buckets, min/max, and standard deviation for programmatic comparison.
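
Because the JSON report is meant for programmatic comparison, two runs can be diffed with a few lines of Python. This is a sketch only: the top-level section name comes from the list above, but the inner keys (mean, median, p99), the schema_version value, and the sample payloads are hypothetical, so check a real report for the actual field names.

```python
import json


def compare_reports(baseline: dict, candidate: dict, section: str, keys: list[str]) -> dict[str, float]:
    """Return candidate minus baseline for each key present in both reports."""
    base = baseline.get(section, {})
    cand = candidate.get(section, {})
    return {k: cand[k] - base[k] for k in keys if k in base and k in cand}


# Hypothetical payloads standing in for two reports loaded with json.load().
run_a = json.loads('{"schema_version": 1, "summary_stats": {"mean": 512.0, "median": 400.0}}')
run_b = json.loads('{"schema_version": 1, "summary_stats": {"mean": 540.0, "median": 410.0}}')

# "p99" is absent from both payloads, so it is silently skipped.
deltas = compare_reports(run_a, run_b, "summary_stats", ["mean", "median", "p99"])
print(deltas)
```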

PDF export

You can convert a generated Markdown report to PDF with the cross-platform Python exporter. It renders the Markdown to HTML and writes the PDF directly from Python, preserving tables and the distribution plot image.

During counting, you can ask the CLI to emit the PDF automatically from the Markdown report path:

python -m token_counter.cli --input data/your_file_dataset.parquet --format parquet --report reports/token_count_report.md --report-pdf

That generates reports/token_count_report.pdf automatically.
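
The PDF path follows from the Markdown path by swapping the extension. A sketch of that rule with pathlib is shown here. pdf_path_for is a hypothetical helper name, not a function from the package.

```python
from pathlib import Path


def pdf_path_for(markdown_report: str) -> Path:
    """Same directory and base filename as the Markdown report, with a .pdf suffix."""
    return Path(markdown_report).with_suffix(".pdf")


print(pdf_path_for("reports/token_count_report.md").as_posix())
```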

To convert an existing Markdown report, run the exporter directly:

python -m token_counter.pdf_export --input reports/gutenberg_distribution_report.md

To choose the output file explicitly:

python -m token_counter.pdf_export --input reports/gutenberg_distribution_report.md --output reports/gutenberg_distribution_report.pdf

The exporter is available through these entrypoints:

Module: python -m token_counter.pdf_export ...
Console script: token-counter-report-pdf --input ...
Wrapper script: python scripts/export_report_pdf.py --input ...

Entrypoints

Module: python -m token_counter.cli ...
Console script: token-counter ...
Wrapper script: python scripts/count_tokens.py ...

Notes

Datasets are read with datasets.load_dataset(..., streaming=True), which avoids loading full datasets into memory.
Parquet and JSONL files can be local files, remote URLs, or remote glob patterns.
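
The streaming approach can be sketched as a loop over an iterable of records that never materializes the dataset. The count_tokens helper and its data-quality counters below are illustrative assumptions, not the package's code. stream_lengths shows one way to wire it to datasets.load_dataset(..., streaming=True) and transformers.AutoTokenizer. It needs network access and is defined but not called here.

```python
from typing import Callable, Iterable, Optional


def count_tokens(
    records: Iterable[dict],
    encode: Callable[[str], list],
    field: str = "text",
    max_docs: Optional[int] = None,
) -> tuple[list[int], dict[str, int]]:
    """Return per-document token counts plus simple data-quality counters."""
    lengths: list[int] = []
    quality = {"skipped": 0, "null": 0, "empty": 0, "coerced": 0}
    for seen, record in enumerate(records):
        if max_docs is not None and seen >= max_docs:
            break  # stop after N records without reading the rest of the stream
        if field not in record:
            quality["skipped"] += 1
            continue
        value = record[field]
        if value is None:
            quality["null"] += 1
            continue
        if not isinstance(value, str):
            value = str(value)  # assumption: non-string values are coerced to text
            quality["coerced"] += 1
        if not value:
            quality["empty"] += 1
            continue
        lengths.append(len(encode(value)))
    return lengths, quality


def stream_lengths(path: str, model: str = "Qwen/Qwen3-1.7B-Base"):
    """Wire count_tokens to a streamed Parquet dataset and a Hub tokenizer (not run here)."""
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model)
    rows = load_dataset("parquet", data_files=path, split="train", streaming=True)
    return count_tokens(rows, lambda text: tokenizer.encode(text, add_special_tokens=False))


# Offline demo: whitespace splitting stands in for a real tokenizer.
sample = [{"text": "a b c"}, {"text": None}, {"text": ""}, {"text": 42}, {"other": "x"}]
lengths, quality = count_tokens(sample, str.split)
print(lengths, quality)
```

Taking the encoder as a plain callable keeps the counting loop testable without downloading a model.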
