Releases: CodeWithKyrian/tokenizers-php

v2.1.0

06 Feb 10:31

What's Changed

Full Changelog: 2.0.0...2.1.0

v2.0.0

04 Feb 11:15

What's Changed

Full Changelog: 1.0.0...2.0.0

v1.0.0

10 Dec 19:27
4c6bf1a

I'm excited to announce the first release of Tokenizers PHP — a fast, pure-PHP tokenizer library fully compatible with Hugging Face tokenizers.

Highlights

  • Pure PHP - No FFI, no external binaries, no compiled extensions. Works everywhere PHP runs.
  • Zero Hard Dependencies - Core tokenization requires no dependencies. Optional PSR-18 HTTP client needed only for Hub downloads.
  • Hugging Face Hub Compatible - Load any tokenizer directly from the Hub with automatic caching.
  • Comprehensive Model Support - Validated against BERT, GPT-2, Llama, Gemma, Qwen, RoBERTa, ALBERT, MPNet, and more.
  • Modern PHP - Built for PHP 8.1+ with strict types, readonly properties, and clean interfaces.

Supported Tokenization Algorithms

Algorithm   Description
---------   -----------
BPE         Byte-Pair Encoding (GPT-2, Llama, Qwen)
WordPiece   Greedy longest-match-first (BERT, ALBERT)
Unigram     Probabilistic subword selection (SentencePiece)
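To make "greedy longest-match-first" concrete, here is a minimal standalone sketch of the WordPiece matching strategy with a toy vocabulary. This is an illustration of the algorithm, not the library's actual implementation; the function name and `##` continuation prefix follow the usual BERT convention.

```php
<?php
// Greedy longest-match-first (WordPiece-style) lookup over a toy vocab.
function wordPieceTokenize(string $word, array $vocab, string $unk = '[UNK]'): array
{
    $tokens = [];
    $start = 0;
    $len = strlen($word);

    while ($start < $len) {
        $end = $len;
        $match = null;

        // Try the longest remaining substring first, shrinking until a vocab hit.
        while ($end > $start) {
            $piece = substr($word, $start, $end - $start);
            if ($start > 0) {
                $piece = '##' . $piece; // continuation pieces carry the ## prefix
            }
            if (isset($vocab[$piece])) {
                $match = $piece;
                break;
            }
            $end--;
        }

        if ($match === null) {
            return [$unk]; // no piece matched: the whole word is unknown
        }

        $tokens[] = $match;
        $start = $end;
    }

    return $tokens;
}

$vocab = array_flip(['un', '##aff', '##able', 'hello']);
echo implode(' ', wordPieceTokenize('unaffable', $vocab)); // un ##aff ##able
```

BPE differs in that it starts from characters and repeatedly applies learned merge rules, while Unigram scores alternative segmentations probabilistically and keeps the most likely one.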

Components

The library provides a complete tokenization pipeline:

  • 12 Normalizers - Unicode normalization, lowercasing, accent stripping, BERT-style cleaning, and more
  • 10 Pre-tokenizers - Whitespace, byte-level, metaspace, punctuation, and custom regex splitting
  • 4 Models - BPE, WordPiece, Unigram, and Fallback
  • 6 Post-processors - BERT, RoBERTa, template-based, and byte-level processors
  • 10 Decoders - Byte-level, WordPiece, metaspace, BPE, CTC, and more
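The components above plug into a fixed pipeline order: normalize, pre-tokenize, apply the model, then post-process. As a rough sketch of that data flow (plain closures standing in for the library's classes, with an identity function in place of a real model):

```php
<?php
// Pipeline order sketch: normalize -> pre-tokenize -> model -> post-process.
$normalize   = fn (string $s): string => strtolower($s);
$preTokenize = fn (string $s): array  => preg_split('/\s+/', trim($s));
$model       = fn (array $words): array => $words; // identity stand-in for BPE/WordPiece
$postProcess = fn (array $tokens): array => array_merge(['[CLS]'], $tokens, ['[SEP]']);

$tokens = $postProcess($model($preTokenize($normalize('Hello World'))));
print_r($tokens); // ['[CLS]', 'hello', 'world', '[SEP]']
```

Decoders run the pipeline in reverse, turning model tokens back into readable text.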

Installation

composer require codewithkyrian/tokenizers

For Hub downloads:

composer require guzzlehttp/guzzle   # or any other PSR-18 HTTP client

Quick Start

use Codewithkyrian\Tokenizers\Tokenizer;

// Load from Hugging Face Hub
$tokenizer = Tokenizer::fromHub('bert-base-uncased');

// Encode text
$encoding = $tokenizer->encode('Hello, how are you?');
// $encoding->ids: [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
// $encoding->tokens: ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']

// Decode back to text
$text = $tokenizer->decode($encoding->ids);
// "hello, how are you?"

Key Features

Multiple Loading Methods

Tokenizer::fromHub('meta-llama/Llama-3.1-8B-Instruct');  // From the Hub
Tokenizer::fromFile('/path/to/tokenizer.json');          // From a local file
Tokenizer::fromConfig($configArray);                     // From a config array
Tokenizer::load($source);                                // Auto-detect the source

Configuration Access

$tokenizer->modelMaxLength;                  // 512
$tokenizer->getConfig('model_max_length');   // 512
$tokenizer->getConfig();                     // Full config array

Sentence Pairs

$encoding = $tokenizer->encode(
    text: 'What is the capital of France?',
    textPair: 'Paris is the capital of France.'
);
// typeIds distinguish between sequences: [0, 0, ..., 1, 1, ...]

Custom Tokenizers via Builder

$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withNormalizer(new LowercaseNormalizer())
    ->withPreTokenizer(new WhitespacePreTokenizer())
    ->withPostProcessor(new BertPostProcessor('[CLS]', '[SEP]'))
    ->withDecoder(new WordPieceDecoder())
    ->withConfig('model_max_length', 512)
    ->build();

Requirements

  • PHP 8.1 or higher
  • Optional: PSR-18 HTTP client for Hub downloads

Examples

The examples/ directory includes real-world usage patterns:

  • semantic_search_embeddings.php — Preparing text for vector embeddings
  • context_window_fit_analysis.php — Analyzing token counts across models
  • text_classification_preprocessing.php — BERT-style preprocessing for ML tasks
  • document_chunking_pipeline.php — Token-aware document splitting
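The token-aware splitting pattern from the chunking example can be sketched in a few lines. This is a standalone illustration, not the code in `document_chunking_pipeline.php`: the token counter is injected as a callable, so a whitespace word count stands in here, whereas with the library you would pass something like `fn ($t) => count($tokenizer->encode($t)->ids)`.

```php
<?php
// Greedily pack sentences into chunks that stay under a token budget.
function chunkByTokens(string $text, int $maxTokens, callable $countTokens): array
{
    $chunks = [];
    $current = '';

    foreach (preg_split('/(?<=[.!?])\s+/', $text) as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;
        if ($countTokens($candidate) > $maxTokens && $current !== '') {
            $chunks[] = $current;   // flush: adding this sentence would overflow
            $current = $sentence;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

// Word count stands in for a real tokenizer so the sketch runs standalone.
$countWords = fn (string $t): int => count(preg_split('/\s+/', trim($t)));
$text = 'First sentence here. Second one follows. A third closes it.';
print_r(chunkByTokens($text, 6, $countWords));
```

Splitting at sentence boundaries keeps chunks semantically coherent; counting tokens on the joined candidate (rather than summing per-sentence counts) matters with real tokenizers, where merging text can change the token count.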

Full Changelog: https://github.com/codewithkyrian/tokenizers-php/commits/v1.0.0