farzai / thai-word
Thai word segmentation library for PHP
Requires
- php: ^8.4
- guzzlehttp/psr7: ^2.0
- php-http/discovery: ^1.19
- psr/http-client: ^1.0
- psr/http-factory: ^1.0
Requires (Dev)
- guzzlehttp/guzzle: ^7.0
- laravel/pint: ^1.0
- mockery/mockery: ^1.6
- pestphp/pest: ^3.0
- spatie/ray: ^1.28
- symfony/console: ^7.0
This package is auto-updated.
Last update: 2025-06-12 04:09:16 UTC
A library for Thai word segmentation in PHP.
Features
- Thai word segmentation with high accuracy
- Word suggestions for typos and misspellings
- Dictionary loading from local files and remote URLs
- Performance optimizations with caching and memory management
- Batch processing for large text volumes
- Custom configuration with caching, memory limit, and batch size
- Mixed content support (Thai, English, numbers, punctuation)
Requirements
- PHP 8.4+
- Composer
Installation
You can install the package via composer:
composer require farzai/thai-word
Basic Usage
Using the Facade (Recommended)
use Farzai\ThaiWord\Composer;

// Simple text segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']

// Segment with custom delimiter
$text = Composer::segmentToString('สวัสดีครับผมชื่อสมชาย', ' ');
// Result: 'สวัสดี ครับ ผม ชื่อ สมชาย'

// Batch processing for multiple texts
$results = Composer::segmentBatch(['สวัสดีครับ', 'ขอบคุณค่ะ']);
// Result: [['สวัสดี', 'ครับ'], ['ขอบคุณ', 'ค่ะ']]

// Enable word suggestions via facade
// Use threshold 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
Composer::enableSuggestions(['threshold' => 0.5]);

// Get suggestions for misspelled words
$suggestions = Composer::suggest('สวัสด');
// Result: [
//     ['word' => 'สวัสดี', 'score' => 0.833],
//     ['word' => 'สวัสดิ์', 'score' => 0.714],
//     ['word' => 'สวัสติ', 'score' => 0.667]
// ]

// Segment with automatic suggestions for single unrecognized characters
$result = Composer::segmentWithSuggestions('โอเคอไร');
// Result: [
//     ['word' => 'โอเค'],
//     ['word' => 'อ', 'suggestions' => [
//         ['word' => 'กอ', 'score' => 0.5],
//         ['word' => 'ขอ', 'score' => 0.5],
//         ['word' => 'คอ', 'score' => 0.5]
//     ]],
//     ['word' => 'ไร']
// ]

// Get performance statistics
$stats = Composer::getStats();
Using ThaiSegmenter Directly
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

$segmenter = new ThaiSegmenter();
$words = $segmenter->segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
Word Suggestions for Typos
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

$segmenter = new ThaiSegmenter();

// Enable word suggestions
$segmenter->enableSuggestions([
    'threshold' => 0.5,      // Minimum similarity score (0.0-1.0)
    'max_suggestions' => 5   // Maximum suggestions per word
]);

// Get suggestions for a misspelled word
$suggestions = $segmenter->suggest('สวัสด'); // Missing last character
// Result: [
//     ['word' => 'สวัสดี', 'score' => 0.833],
//     ['word' => 'สวัสดิ์', 'score' => 0.714],
//     ['word' => 'สวัสติ', 'score' => 0.667]
// ]

// Segment text with automatic suggestions for single unrecognized characters
$result = $segmenter->segmentWithSuggestions('ชื่ออไรนะ'); // 'อ' is unrecognized single character
// Result: [
//     ['word' => 'ชื่อ'],
//     ['word' => 'อ', 'suggestions' => [
//         ['word' => 'กอ', 'score' => 0.5],
//         ['word' => 'ขอ', 'score' => 0.5],
//         ['word' => 'คอ', 'score' => 0.5]
//     ]],
//     ['word' => 'ไร'],
//     ['word' => 'นะ']
// ]
How It Works
This library segments Thai text into words and, optionally, suggests corrections for misspelled words. The process runs in the following steps:
Step 1: Text Input & Validation
- You provide Thai text as a string to the ThaiSegmenter
- Example: 'สวัสดีครับผมชื่อสมชาย'
- The library validates UTF-8 encoding and handles empty strings
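The validation step can be pictured with a small standalone sketch. The function name `prepareInput` and its return convention are assumptions made for illustration; they are not part of the library's API.

```php
<?php

/**
 * Hypothetical helper (not the library's code): reject invalid UTF-8
 * and short-circuit on empty input before any segmentation work.
 */
function prepareInput(string $text): ?string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input must be valid UTF-8.');
    }

    // trim() only strips ASCII whitespace bytes, which never occur
    // inside a multi-byte UTF-8 sequence, so Thai text is left intact.
    $trimmed = trim($text);

    // null tells the caller there is nothing to segment.
    return $trimmed === '' ? null : $trimmed;
}
```

A caller would typically return an empty word list when `prepareInput()` yields `null`.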
Step 2: Dictionary Loading (Automatic)
The library automatically loads Thai words using several sources with intelligent fallback:
- LibreOffice Thai Dictionary: Downloads from official LibreOffice repository (primary source)
- Local Dictionary Files: Falls back to local dictionary files if available
- Basic Dictionary: Uses built-in common Thai words as last resort
The dictionary is stored in a HashDictionary with O(1) lookup performance.
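As a rough illustration of the fallback order and the hash-based lookup, the sketch below tries a list of loaders in sequence and stores the first non-empty word list as array keys. The function `loadDictionary` and the three closures are hypothetical stand-ins; the library's own loader classes may work differently.

```php
<?php

/**
 * Try each loader in order and return the first non-empty word list
 * as a hash set keyed by word. Loaders are illustrative callables.
 *
 * @param array<int, callable(): array<int, string>> $loaders
 * @return array<string, true>
 */
function loadDictionary(array $loaders): array
{
    foreach ($loaders as $loader) {
        try {
            $words = $loader();
        } catch (Throwable $e) {
            continue; // source unavailable: fall through to the next one
        }

        if ($words !== []) {
            // Words become array keys, so isset() is O(1) on average.
            return array_fill_keys($words, true);
        }
    }

    return [];
}

$dictionary = loadDictionary([
    fn (): array => throw new RuntimeException('remote source unreachable'),
    fn (): array => [],                        // local file missing or empty
    fn (): array => ['สวัสดี', 'ครับ', 'ผม'],    // built-in fallback list
]);

$found = isset($dictionary['ครับ']);
```

Storing words as keys rather than values is what makes membership checks independent of dictionary size.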
Step 3: Smart Text Processing
The LongestMatchingStrategy algorithm processes text intelligently:
Character Classification:
- Thai characters: Unicode range 0x0E00-0x0E7F for fast detection
- English words: Handled as complete word units
- Numbers: Processed as number sequences (with decimals, commas)
- Punctuation: Handled appropriately with whitespace normalization
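The classification above can be sketched as a single code point check. The function `classifyChar` and its category names are assumptions for illustration, not the library's internals.

```php
<?php

/** Classify one UTF-8 character; category names are illustrative. */
function classifyChar(string $char): string
{
    $code = mb_ord($char, 'UTF-8');

    return match (true) {
        // Thai Unicode block, as given in the list above
        $code >= 0x0E00 && $code <= 0x0E7F => 'thai',
        preg_match('/^[A-Za-z]$/', $char) === 1 => 'english',
        preg_match('/^[0-9]$/', $char) === 1 => 'number',
        preg_match('/^\s$/u', $char) === 1 => 'whitespace',
        default => 'punctuation',
    };
}
```

Comparing the code point against a fixed range avoids a regular expression for the most common case, which is Thai text.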
Step 4: Longest Matching Algorithm
Input: สวัสดีครับผมชื่อสมชาย
↓
Position 0: Check สวัสดี (6 chars) → Found in dictionary ✓
Position 6: Check ครับ (4 chars) → Found in dictionary ✓
Position 10: Check ผม (2 chars) → Found in dictionary ✓
Position 12: Check ชื่อ (4 chars) → Found in dictionary ✓
Position 16: Check สมชาย (5 chars) → Found in dictionary ✓
↓
Output: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
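The trace above corresponds to a greedy longest-match loop. The following is a minimal sketch of that technique over a plain PHP array used as a hash set; `longestMatch` and the `$maxWordLength` parameter are hypothetical names, and the library's LongestMatchingStrategy may differ in detail.

```php
<?php

/**
 * Greedy longest-match segmentation over a hash-set dictionary.
 * Characters that match no dictionary word are emitted one at a time.
 *
 * @param array<string, true> $dictionary
 * @return array<int, string>
 */
function longestMatch(string $text, array $dictionary, int $maxWordLength = 20): array
{
    $words = [];
    $length = mb_strlen($text, 'UTF-8');
    $position = 0;

    while ($position < $length) {
        $matched = null;
        $limit = min($maxWordLength, $length - $position);

        // Try the longest candidate first, then shrink by one character.
        for ($size = $limit; $size >= 1; $size--) {
            $candidate = mb_substr($text, $position, $size, 'UTF-8');
            if (isset($dictionary[$candidate])) {
                $matched = $candidate;
                break;
            }
        }

        // Nothing matched: fall back to a single character.
        $matched ??= mb_substr($text, $position, 1, 'UTF-8');
        $words[] = $matched;
        $position += mb_strlen($matched, 'UTF-8');
    }

    return $words;
}

$dictionary = array_fill_keys(['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย'], true);
$words = longestMatch('สวัสดีครับผมชื่อสมชาย', $dictionary);
```

Positions and lengths are counted in Unicode code points via the `mb_*` functions, so vowel and tone marks each count as one character.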
Step 5: Word Suggestion System (Optional)
When enabled, the library suggests corrections for typos by ranking dictionary words by edit-distance similarity:
Levenshtein Distance Algorithm:
Input: สวัสด (missing last character)
↓
1. Filter dictionary words by length similarity (±3 characters)
2. Calculate Unicode-aware Levenshtein distance for each candidate
3. Convert distance to similarity score (0.0 to 1.0)
4. Filter by threshold (default 0.6) and sort by score
↓
Output: [
['word' => 'สวัสดี', 'score' => 0.833], // 1 character difference
['word' => 'สวัสดิ์', 'score' => 0.714], // 2 character difference
['word' => 'สวัสติ', 'score' => 0.667] // 2 character difference
]
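The four steps can be sketched in plain PHP. PHP's built-in `levenshtein()` works on bytes, so this sketch computes the distance over code points instead. The scoring formula `1 - distance / max(length)` is an assumption inferred from the example scores above (it reproduces 0.833, 0.714, and 0.667); the function names are hypothetical and not the library's API.

```php
<?php

/** Levenshtein distance counted in Unicode code points, not bytes. */
function unicodeLevenshtein(string $a, string $b): int
{
    $charsA = mb_str_split($a, 1, 'UTF-8');
    $charsB = mb_str_split($b, 1, 'UTF-8');
    $previous = range(0, count($charsB));

    foreach ($charsA as $i => $charA) {
        $current = [$i + 1];
        foreach ($charsB as $j => $charB) {
            $cost = $charA === $charB ? 0 : 1;
            $current[] = min(
                $previous[$j + 1] + 1, // deletion
                $current[$j] + 1,      // insertion
                $previous[$j] + $cost  // substitution
            );
        }
        $previous = $current;
    }

    return $previous[count($charsB)];
}

/**
 * @param array<int, string> $words candidate dictionary words
 * @return array<int, array{word: string, score: float}>
 */
function suggestWords(string $input, array $words, float $threshold = 0.6, int $maxLengthDiff = 3): array
{
    $inputLength = mb_strlen($input, 'UTF-8');
    $suggestions = [];

    foreach ($words as $word) {
        $wordLength = mb_strlen($word, 'UTF-8');
        if (abs($wordLength - $inputLength) > $maxLengthDiff) {
            continue; // step 1: cheap length filter
        }

        $distance = unicodeLevenshtein($input, $word);                  // step 2
        $score = 1 - $distance / max($inputLength, $wordLength, 1);     // step 3 (assumed formula)

        if ($score >= $threshold) {                                     // step 4
            $suggestions[] = ['word' => $word, 'score' => round($score, 3)];
        }
    }

    usort($suggestions, fn (array $x, array $y): int => $y['score'] <=> $x['score']);

    return $suggestions;
}

$suggestions = suggestWords('สวัสด', ['สวัสติ', 'ครับ', 'สวัสดี', 'สวัสดิ์']);
```

Under the same assumed formula, a one-character input compared with a two-character word has distance 1 and maximum length 2, giving a score of 0.5.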
Smart Suggestion Integration:
- Single-character only: segmentWithSuggestions() only provides suggestions for single-character segments that are NOT in the dictionary
- Multi-character words: Use the suggest() method directly for multi-character word suggestions
- Threshold requirements: Single-character similarities max out at 0.5, so use threshold ≤ 0.5 for best results
- Configurable similarity thresholds: 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
- Performance-optimized: Caching and length-based filtering for large dictionaries
- Unicode-aware for proper Thai character handling
Step 6: Performance Optimizations
The library includes several optimizations:
- Caching: Recently segmented texts are cached for faster repeat processing
- Batch Processing: Large texts are processed in chunks to manage memory
- Memory Management: Automatic garbage collection and memory optimization
- Adaptive Processing: Different strategies for short, medium, and long texts
- Suggestion Caching: Distance calculations cached for repeated similarity checks
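To make the caching idea concrete, here is a minimal sketch of a bounded result cache with least-recently-used eviction. The class `SegmentationCache`, its `remember()` method, and the eviction policy are assumptions for illustration; the source does not describe how the library's cache is implemented.

```php
<?php

/** Hypothetical bounded cache: evicts the least recently used entry when full. */
final class SegmentationCache
{
    /** @var array<string, array<int, string>> */
    private array $entries = [];

    public function __construct(private int $capacity = 1000) {}

    /**
     * @param callable(string): array<int, string> $segment
     * @return array<int, string>
     */
    public function remember(string $text, callable $segment): array
    {
        $key = md5($text);

        if (isset($this->entries[$key])) {
            $words = $this->entries[$key];
            unset($this->entries[$key]); // re-inserted below as most recent
        } else {
            $words = $segment($text);
            if (count($this->entries) >= $this->capacity) {
                // PHP arrays keep insertion order, so the first key is the oldest.
                unset($this->entries[array_key_first($this->entries)]);
            }
        }

        return $this->entries[$key] = $words;
    }
}
```

Capping the number of entries keeps memory bounded, which matters when many distinct texts pass through the same segmenter.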
Step 7: Mixed Content Handling
$segmenter = new ThaiSegmenter();
$result = $segmenter->segment('ผมใช้ Computer ทำงาน');
// Result: ['ผม', 'ใช้', 'Computer', 'ทำงาน']
- Thai words are processed with dictionary lookup
- English words are kept as complete units
- Numbers and punctuation are handled appropriately
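One way to picture mixed-content handling is a pre-pass that splits the input into runs by script before dictionary lookup. The function `splitIntoRuns` and its regular expression are an illustrative sketch, not the library's implementation.

```php
<?php

/**
 * Split text into runs of Thai script, Latin letters, numbers
 * (with decimal points and thousands separators), and single
 * punctuation marks. Whitespace is dropped.
 *
 * @return array<int, string>
 */
function splitIntoRuns(string $text): array
{
    $pattern = '/[\x{0E00}-\x{0E7F}]+|[A-Za-z]+|[0-9]+(?:[.,][0-9]+)*|[^\s]/u';
    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

$runs = splitIntoRuns('ผมใช้ Computer ทำงาน');
```

In this sketch, only the Thai runs would then go through dictionary-based segmentation, while the other runs are kept as they are.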
Key Components
- ThaiSegmenter: Main orchestrator with performance monitoring and suggestion integration
- HashDictionary: O(1) hash-based word lookup with 70% less memory usage than trie structures
- LongestMatchingStrategy: Optimized algorithm with character classification
- LevenshteinSuggestionStrategy: Unicode-aware word suggestion algorithm with caching
- DictionaryLoaderService: Handles loading from files, URLs, and remote sources
Performance Features
- 3-5x faster processing speed with optimized algorithms
- 50% lower memory usage with hash-based dictionary
- Intelligent suggestions with configurable accuracy thresholds
- Automatic optimization based on text characteristics
- Built-in statistics for performance monitoring
Real Usage Examples
Using the Facade (Simple & Clean)
use Farzai\ThaiWord\Composer;

// Basic segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']

// Get performance statistics
$stats = Composer::getStats();
echo "Processing time: {$stats['avg_processing_time']}ms";

// Add custom words
Composer::getDictionary()->add('คำใหม่');

// Batch processing for multiple texts
$results = Composer::segmentBatch(['ข้อความ1', 'ข้อความ2']);

// Custom configuration
Composer::updateConfig([
    'enable_caching' => true,
    'memory_limit_mb' => 200
]);
Using ThaiSegmenter Directly (Advanced Control)
use Farzai\ThaiWord\Composer;
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

// Create segmenter with custom configuration
$segmenter = new ThaiSegmenter(null, null, [
    'enable_caching' => true,
    'batch_size' => 500
]);

// Or use the facade to create custom instances
$customSegmenter = Composer::create(null, null, ['memory_limit_mb' => 150]);

// Set custom segmenter for facade
Composer::setSegmenter($customSegmenter);
This architecture ensures both accuracy and performance while remaining simple to use.
Advanced Usage
Custom Suggestion Strategies
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
use Farzai\ThaiWord\Suggestions\Strategies\LevenshteinSuggestionStrategy;

// Create custom suggestion strategy
$suggestionStrategy = new LevenshteinSuggestionStrategy;
$suggestionStrategy->setThreshold(0.8)   // Higher accuracy
    ->setMaxWordLengthDiff(2);           // Stricter length filtering

// Initialize segmenter with custom strategy
$segmenter = new ThaiSegmenter(null, null, $suggestionStrategy);

// Or set strategy later
$segmenter->setSuggestionStrategy($suggestionStrategy);
Performance Monitoring with Suggestions
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
use Farzai\ThaiWord\Suggestions\Strategies\LevenshteinSuggestionStrategy;

$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions();

// Process text
$result = $segmenter->segmentWithSuggestions('สวัสดีครบผมชื่อโจน');

// Get detailed statistics
$stats = $segmenter->getStats();
echo "Cache hit ratio: " . ($stats['cache_hit_ratio'] * 100) . "%\n";

// Get suggestion-specific statistics
$suggestionStrategy = $segmenter->getSuggestionStrategy();
if ($suggestionStrategy instanceof LevenshteinSuggestionStrategy) {
    $cacheStats = $suggestionStrategy->getCacheStats();
    echo "Suggestion cache size: " . $cacheStats['cache_size'] . "\n";
    echo "Memory usage: " . $cacheStats['memory_usage_mb'] . "MB\n";
}
Batch Processing with Suggestions
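use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

$texts = [
    'สวัสดีครบ',   // Contains typo
    'ขอบคนครับ',  // Contains typo
    'ผมชื่อโจน'     // Might need suggestions
];

$segmenter = new ThaiSegmenter();

// Single-character similarities max out at 0.5, so a higher threshold would filter out every suggestion
$segmenter->enableSuggestions(['threshold' => 0.5]);

foreach ($texts as $text) {
    $result = $segmenter->segmentWithSuggestions($text);

    foreach ($result as $item) {
        // Guard against a missing or empty suggestion list before reading index 0
        if (!empty($item['suggestions'])) {
            echo "'{$item['word']}' → Suggested: '{$item['suggestions'][0]['word']}'\n";
        }
    }
}

// A line is printed only for items that carry suggestions, that is, single-character
// segments not found in the dictionary. Typos that segment into dictionary words
// (such as 'ครบ') get no suggestions here; call suggest() on them directly.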
Understanding Suggestion Behavior
Important: The segmentWithSuggestions() method only provides suggestions for single-character segments that are NOT found in the dictionary.
$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions(['threshold' => 0.5]);

// ✅ Will get suggestions - 'อ' is single character not in dictionary
$result = $segmenter->segmentWithSuggestions('โอเคอไร');
// 'อ' gets suggestions: ['กอ', 'ขอ', 'คอ', ...]

// ❌ Won't get suggestions - 'ครบ' is multi-character and in dictionary
$result = $segmenter->segmentWithSuggestions('สวัสดีครบ');
// 'ครบ' gets NO suggestions (even though 'ครับ' might be intended)

// ✅ For multi-character suggestions, use suggest() directly
$suggestions = $segmenter->suggest('ครบ');
// Returns: ['ครับ', 'ครอบ', 'คราบ', ...]
Threshold Guidelines:
- Single characters: Use 0.4-0.5 (similarities max out at 0.5)
- Multi-character words: Use 0.6-0.7 (higher precision possible)
Configuration Options
$segmenter = new ThaiSegmenter();

// Enable suggestions with proper threshold for single characters
$segmenter->enableSuggestions([
    'threshold' => 0.5,      // Optimal for single characters
    'max_suggestions' => 3   // Maximum suggestions per word
]);

// Update segmenter configuration
$segmenter->updateConfig([
    'enable_caching' => true,
    'memory_limit_mb' => 150,
    'suggestion_threshold' => 0.5, // Adjusted for single characters
    'max_suggestions' => 5
]);

// Disable suggestions when not needed
$segmenter->disableSuggestions();
Testing
composer test
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Security Vulnerabilities
Please review our security policy on how to report security vulnerabilities.
Credits
Data Sources
- LibreOffice Thai Dictionary - Primary Thai word dictionary source
License
The MIT License (MIT). Please see License File for more information.