Releases: CodeWithKyrian/tokenizers-php

v2.1.0

06 Feb 10:31

What's Changed

Full Changelog: 2.0.0...2.1.0

v2.0.0

04 Feb 11:15

What's Changed

Full Changelog: 1.0.0...2.0.0

v1.0.0

10 Dec 19:27
4c6bf1a

I'm excited to announce the first release of Tokenizers PHP — a fast, pure-PHP tokenizer library fully compatible with Hugging Face tokenizers.

Highlights

  • Pure PHP - No FFI, no external binaries, no compiled extensions. Works everywhere PHP runs.
  • Zero Hard Dependencies - Core tokenization requires no dependencies. Optional PSR-18 HTTP client needed only for Hub downloads.
  • Hugging Face Hub Compatible - Load any tokenizer directly from the Hub with automatic caching.
  • Comprehensive Model Support - Validated against BERT, GPT-2, Llama, Gemma, Qwen, RoBERTa, ALBERT, MPNet, and more.
  • Modern PHP - Built for PHP 8.1+ with strict types, readonly properties, and clean interfaces.

Supported Tokenization Algorithms

Algorithm   Description
---------   -----------
BPE         Byte-Pair Encoding (GPT-2, Llama, Qwen)
WordPiece   Greedy longest-match-first (BERT, ALBERT)
Unigram     Probabilistic subword selection (SentencePiece)
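To make "greedy longest-match-first" concrete, here is a minimal standalone sketch of the WordPiece matching strategy with a toy vocabulary. This is an illustration of the algorithm, not the library's actual implementation; the function name and `##` continuation prefix follow the usual BERT convention.

```php
<?php
// Greedy longest-match-first (WordPiece-style) lookup over a toy vocab.
function wordPieceTokenize(string $word, array $vocab, string $unk = '[UNK]'): array
{
    $tokens = [];
    $start = 0;
    $len = strlen($word);

    while ($start < $len) {
        $end = $len;
        $match = null;

        // Try the longest remaining substring first, shrinking until a vocab hit.
        while ($end > $start) {
            $piece = substr($word, $start, $end - $start);
            if ($start > 0) {
                $piece = '##' . $piece; // continuation pieces carry the ## prefix
            }
            if (isset($vocab[$piece])) {
                $match = $piece;
                break;
            }
            $end--;
        }

        if ($match === null) {
            return [$unk]; // no piece matched: the whole word is unknown
        }

        $tokens[] = $match;
        $start = $end;
    }

    return $tokens;
}

$vocab = array_flip(['un', '##aff', '##able', 'hello']);
echo implode(' ', wordPieceTokenize('unaffable', $vocab)); // un ##aff ##able
```

BPE differs in that it starts from characters and repeatedly applies learned merge rules, while Unigram scores alternative segmentations probabilistically and keeps the most likely one.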

Components

The library provides a complete tokenization pipeline:

  • 12 Normalizers - Unicode normalization, lowercasing, accent stripping, BERT-style cleaning, and more
  • 10 Pre-tokenizers - Whitespace, byte-level, metaspace, punctuation, and custom regex splitting
  • 4 Models - BPE, WordPiece, Unigram, and Fallback
  • 6 Post-processors - BERT, RoBERTa, template-based, and byte-level processors
  • 10 Decoders - Byte-level, WordPiece, metaspace, BPE, CTC, and more
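The components above plug into a fixed pipeline order: normalize, pre-tokenize, apply the model, then post-process. As a rough sketch of that data flow (plain closures standing in for the library's classes, with an identity function in place of a real model):

```php
<?php
// Pipeline order sketch: normalize -> pre-tokenize -> model -> post-process.
$normalize   = fn (string $s): string => strtolower($s);
$preTokenize = fn (string $s): array  => preg_split('/\s+/', trim($s));
$model       = fn (array $words): array => $words; // identity stand-in for BPE/WordPiece
$postProcess = fn (array $tokens): array => array_merge(['[CLS]'], $tokens, ['[SEP]']);

$tokens = $postProcess($model($preTokenize($normalize('Hello World'))));
print_r($tokens); // ['[CLS]', 'hello', 'world', '[SEP]']
```

Decoders run the pipeline in reverse, turning model tokens back into readable text.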

Installation

composer require codewithkyrian/tokenizers

For Hub downloads:

composer require guzzlehttp/guzzle   # or any other PSR-18 HTTP client

Quick Start

use Codewithkyrian\Tokenizers\Tokenizer;

// Load from Hugging Face Hub
$tokenizer = Tokenizer::fromHub('bert-base-uncased');

// Encode text
$encoding = $tokenizer->encode('Hello, how are you?');
// $encoding->ids: [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
// $encoding->tokens: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']

// Decode back to text
$text = $tokenizer->decode($encoding->ids);
// "hello, how are you?"

Key Features

Multiple Loading Methods

Tokenizer::fromHub('meta-llama/Llama-3.1-8B-Instruct');  // From the Hub
Tokenizer::fromFile('/path/to/tokenizer.json');          // From a local file
Tokenizer::fromConfig($configArray);                     // From a config array
Tokenizer::load($source);                                // Auto-detect the source

Configuration Access

$tokenizer->modelMaxLength;                  // 512
$tokenizer->getConfig('model_max_length');   // 512
$tokenizer->getConfig();                     // Full config array

Sentence Pairs

$encoding = $tokenizer->encode(
    text: 'What is the capital of France?',
    textPair: 'Paris is the capital of France.'
);
// typeIds distinguish between sequences: [0, 0, ..., 1, 1, ...]

Custom Tokenizers via Builder

$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withNormalizer(new LowercaseNormalizer())
    ->withPreTokenizer(new WhitespacePreTokenizer())
    ->withPostProcessor(new BertPostProcessor('[CLS]', '[SEP]'))
    ->withDecoder(new WordPieceDecoder())
    ->withConfig('model_max_length', 512)
    ->build();

Requirements

  • PHP 8.1 or higher
  • Optional: PSR-18 HTTP client for Hub downloads

Examples

The examples/ directory includes real-world usage patterns:

  • semantic_search_embeddings.php — Preparing text for vector embeddings
  • context_window_fit_analysis.php — Analyzing token counts across models
  • text_classification_preprocessing.php — BERT-style preprocessing for ML tasks
  • document_chunking_pipeline.php — Token-aware document splitting
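The token-aware splitting pattern from the chunking example can be sketched in a few lines. This is a standalone illustration, not the code in `document_chunking_pipeline.php`: the token counter is injected as a callable, so a whitespace word count stands in here, whereas with the library you would pass something like `fn ($t) => count($tokenizer->encode($t)->ids)`.

```php
<?php
// Greedily pack sentences into chunks that stay under a token budget.
function chunkByTokens(string $text, int $maxTokens, callable $countTokens): array
{
    $chunks = [];
    $current = '';

    foreach (preg_split('/(?<=[.!?])\s+/', $text) as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;
        if ($countTokens($candidate) > $maxTokens && $current !== '') {
            $chunks[] = $current;   // flush: adding this sentence would overflow
            $current = $sentence;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

// Word count stands in for a real tokenizer so the sketch runs standalone.
$countWords = fn (string $t): int => count(preg_split('/\s+/', trim($t)));
$text = 'First sentence here. Second one follows. A third closes it.';
print_r(chunkByTokens($text, 6, $countWords));
```

Splitting at sentence boundaries keeps chunks semantically coherent; counting tokens on the joined candidate (rather than summing per-sentence counts) matters with real tokenizers, where merging text can change the token count.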

Full Changelog: https://github.com/codewithkyrian/tokenizers-php/commits/v1.0.0