Releases: CodeWithKyrian/tokenizers-php
v2.1.0
What's Changed
- feat: add FixedLengthPreTokenizer for fixed-length tokenization by @CodeWithKyrian in #3
- Implement configuration access and reconstruction for Tokenizer by @CodeWithKyrian in #4
- Refactor Tokenizer properties and visibility by @CodeWithKyrian in #5
- Make Encoding class Countable by @CodeWithKyrian in #6
Full Changelog: 2.0.0...2.1.0
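Making Encoding Countable (#6) means count() works directly on an encoding and reports its token count. A minimal self-contained sketch of those semantics, using a hypothetical stand-in class rather than the library's actual Encoding:

```php
<?php

// Hypothetical stand-in illustrating the Countable semantics added in #6.
// The real Encoding class in tokenizers-php carries more state than this.
final class Encoding implements Countable
{
    public function __construct(
        public readonly array $ids,
        public readonly array $tokens,
    ) {}

    // count($encoding) reports the number of tokens in the sequence.
    public function count(): int
    {
        return count($this->ids);
    }
}

$encoding = new Encoding([101, 7592, 102], ['[CLS]', 'hello', '[SEP]']);
echo count($encoding); // 3
```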
v2.0.0
What's Changed
- Require PHP 8.2 and make core tokenization types readonly by @CodeWithKyrian in #2
- Use Hugging Face PHP for Hub loading and simplify examples by @CodeWithKyrian in #1
Full Changelog: 1.0.0...2.0.0
v1.0.0
I'm excited to announce the first release of Tokenizers PHP - a fast, pure-PHP tokenizer library fully compatible with Hugging Face tokenizers.
Highlights
- Pure PHP - No FFI, no external binaries, no compiled extensions. Works everywhere PHP runs.
- Zero Hard Dependencies - Core tokenization requires no dependencies. Optional PSR-18 HTTP client needed only for Hub downloads.
- Hugging Face Hub Compatible - Load any tokenizer directly from the Hub with automatic caching.
- Comprehensive Model Support - Validated against BERT, GPT-2, Llama, Gemma, Qwen, RoBERTa, ALBERT, MPNet, and more.
- Modern PHP - Built for PHP 8.1+ with strict types, readonly properties, and clean interfaces.
Supported Tokenization Algorithms
| Algorithm | Description |
|---|---|
| BPE | Byte-Pair Encoding (GPT-2, Llama, Qwen) |
| WordPiece | Greedy longest-match-first (BERT, ALBERT) |
| Unigram | Probabilistic subword selection (SentencePiece) |
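To make the BPE row concrete, here is a toy illustration of a single merge step: find the most frequent adjacent symbol pair and fuse it everywhere. This is a teaching sketch only, not the library's implementation (which applies a learned, ordered merge table):

```php
<?php

// Toy BPE merge step: count adjacent symbol pairs, fuse the most frequent.
function bpeMergeStep(array $symbols): array
{
    $pairs = [];
    for ($i = 0; $i < count($symbols) - 1; $i++) {
        $key = $symbols[$i] . "\0" . $symbols[$i + 1];
        $pairs[$key] = ($pairs[$key] ?? 0) + 1;
    }
    if ($pairs === []) {
        return $symbols;
    }
    arsort($pairs); // stable in PHP 8+, so ties keep first-seen order
    [$a, $b] = explode("\0", array_key_first($pairs));

    $merged = [];
    for ($i = 0; $i < count($symbols); $i++) {
        if ($i + 1 < count($symbols) && $symbols[$i] === $a && $symbols[$i + 1] === $b) {
            $merged[] = $a . $b;
            $i++; // skip the fused partner
        } else {
            $merged[] = $symbols[$i];
        }
    }
    return $merged;
}

// "lowlow" as characters: "l"+"o" appears twice and wins the tie.
$out = bpeMergeStep(['l', 'o', 'w', 'l', 'o', 'w']);
// ['lo', 'w', 'lo', 'w']
```

Real BPE training repeats this step until a target vocabulary size is reached, recording each merge; encoding then replays those merges in order.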
Components
The library provides a complete tokenization pipeline:
- 12 Normalizers - Unicode normalization, lowercasing, accent stripping, BERT-style cleaning, and more
- 10 Pre-tokenizers - Whitespace, byte-level, metaspace, punctuation, and custom regex splitting
- 4 Models - BPE, WordPiece, Unigram, and Fallback
- 6 Post-processors - BERT, RoBERTa, template-based, and byte-level processors
- 10 Decoders - Byte-level, WordPiece, metaspace, BPE, CTC, and more
Installation
composer require codewithkyrian/tokenizers
For Hub downloads, install any PSR-18 HTTP client if one isn't already present:
composer require guzzlehttp/guzzle
Quick Start
use Codewithkyrian\Tokenizers\Tokenizer;
// Load from Hugging Face Hub
$tokenizer = Tokenizer::fromHub('bert-base-uncased');
// Encode text
$encoding = $tokenizer->encode('Hello, how are you?');
// $encoding->ids: [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
// $encoding->tokens: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
// Decode back to text
$text = $tokenizer->decode($encoding->ids);
// "hello, how are you?"Key Features
Multiple Loading Methods
Tokenizer::fromHub('meta-llama/Llama-3.1-8B-Instruct'); // From Hub
Tokenizer::fromFile('/path/to/tokenizer.json'); // From file
Tokenizer::fromConfig($configArray); // From array
Tokenizer::load($source); // Auto-detect
Configuration Access
$tokenizer->modelMaxLength; // 512
$tokenizer->getConfig('model_max_length'); // 512
$tokenizer->getConfig(); // All config
Sentence Pairs
$encoding = $tokenizer->encode(
text: 'What is the capital of France?',
textPair: 'Paris is the capital of France.'
);
// typeIds distinguish between sequences: [0, 0, ..., 1, 1, ...]
Custom Tokenizers via Builder
$tokenizer = Tokenizer::builder()
->withModel(new WordPieceModel($vocab, '[UNK]'))
->withNormalizer(new LowercaseNormalizer())
->withPreTokenizer(new WhitespacePreTokenizer())
->withPostProcessor(new BertPostProcessor('[CLS]', '[SEP]'))
->withDecoder(new WordPieceDecoder())
->withConfig('model_max_length', 512)
->build();
Requirements
- PHP 8.1 or higher
- Optional: PSR-18 HTTP client for Hub downloads
Examples
The examples/ directory includes real-world usage patterns:
- semantic_search_embeddings.php — Preparing text for vector embeddings
- context_window_fit_analysis.php — Analyzing token counts across models
- text_classification_preprocessing.php — BERT-style preprocessing for ML tasks
- document_chunking_pipeline.php — Token-aware document splitting
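The core of token-aware document splitting, as in the last example, is a sliding window over token ids so each chunk fits a model's limit while overlapping chunks share context. The function below is an illustrative sketch (its name and parameters are mine, not the example's actual code); in practice the ids would come from $tokenizer->encode() and each chunk would go back through $tokenizer->decode():

```php
<?php

// Illustrative sliding-window chunking over token ids: each chunk holds at
// most $maxTokens ids, and consecutive chunks share $overlap ids of context.
function chunkTokenIds(array $ids, int $maxTokens, int $overlap = 0): array
{
    $step = max(1, $maxTokens - $overlap);
    $chunks = [];
    for ($start = 0; $start < count($ids); $start += $step) {
        $chunks[] = array_slice($ids, $start, $maxTokens);
        if ($start + $maxTokens >= count($ids)) {
            break; // last window already reaches the end
        }
    }
    return $chunks;
}

// 10 token ids, windows of 4 with an overlap of 1, i.e. a step of 3.
$chunks = chunkTokenIds(range(1, 10), maxTokens: 4, overlap: 1);
// [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
```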
Links
Full Changelog: https://github.com/codewithkyrian/tokenizers-php/commits/v1.0.0