pcrov / unicode
Miscellaneous Unicode utility functions
pkg:composer/pcrov/unicode
Requires
- php: >=7.3
Requires (Dev)
- phpunit/phpunit: ^9.4.0
This package is auto-updated.
Last update: 2025-10-08 19:11:54 UTC
README
Miscellaneous Unicode utility functions.
Functions
Namespace pcrov\Unicode.
surrogate_pair_to_code_point(int $high, int $low): int
Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains what this is fairly well.
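The arithmetic behind this conversion can be sketched as below. The name `sketch_surrogate_pair_to_code_point` is hypothetical and this is not the package's own code; in particular, the README does not say how the real function treats out-of-range input, so the range check here is an assumption.

```php
<?php
// Hypothetical stand-in for illustration; not the package's implementation.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // Assumption: anything outside the surrogate ranges is rejected.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Not a valid surrogate pair.");
    }

    // Each surrogate carries 10 bits of payload; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // U+1F600
```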
utf8_find_invalid_byte_sequence(string $string): ?int
Returns the position of the first invalid byte sequence or null if the input is valid.
utf8_get_invalid_byte_sequence(string $string): ?string
Returns the first invalid byte sequence or null if the input is valid.
utf8_get_state_machine(): array
Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.
It is in the form of [byte => [valid next byte => ...,], ...]
Example use:

    function utf8_generate_all_code_points(): string
    {
        $generator = function (array $machine, string $buffer = "") use (&$generator) {
            // Completed a UTF-8 encoded code point.
            if ($buffer !== "" && isset($machine["\x0"])) {
                return $buffer;
            }

            $out = "";
            foreach ($machine as $byte => $next) {
                $out .= $generator($next, $buffer . $byte);
            }
            return $out;
        };

        return $generator(utf8_get_state_machine());
    }
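To illustrate walking such a machine byte by byte, here is a minimal sketch. It does not call the package: `sketch_build_machine()` is a hypothetical stand-in that builds a nested array of the same general shape from Table 3-7, and `sketch_link()` and `sketch_is_complete_utf8()` are likewise made-up names. It assumes, as the example above suggests, that a completed code point leads back to a state that accepts `"\x0"` again. The real machine may differ in detail.

```php
<?php
// Hypothetical helper: every byte in [$from, $to] of $state leads to $next.
function sketch_link(array &$state, int $from, int $to, array $next): void
{
    for ($b = $from; $b <= $to; $b++) {
        $state[chr($b)] = $next;
    }
}

// Hypothetical stand-in for utf8_get_state_machine(), built from Table 3-7.
function sketch_build_machine(): array
{
    $root = [];

    // One continuation byte left: afterwards we are back at the start state.
    $tail1 = [];
    for ($b = 0x80; $b <= 0xBF; $b++) {
        $tail1[chr($b)] = &$root; // by reference, so later additions are seen
    }
    $tail2 = [];
    sketch_link($tail2, 0x80, 0xBF, $tail1); // two continuation bytes left
    $tail3 = [];
    sketch_link($tail3, 0x80, 0xBF, $tail2); // three continuation bytes left

    // ASCII bytes complete a code point immediately.
    for ($b = 0x00; $b <= 0x7F; $b++) {
        $root[chr($b)] = &$root;
    }
    sketch_link($root, 0xC2, 0xDF, $tail1);
    sketch_link($root, 0xE1, 0xEC, $tail2);
    sketch_link($root, 0xEE, 0xEF, $tail2);
    sketch_link($root, 0xF1, 0xF3, $tail3);

    // First bytes whose second byte has a restricted range.
    $e0 = [];
    sketch_link($e0, 0xA0, 0xBF, $tail1);
    $root["\xE0"] = $e0;
    $ed = [];
    sketch_link($ed, 0x80, 0x9F, $tail1);
    $root["\xED"] = $ed;
    $f0 = [];
    sketch_link($f0, 0x90, 0xBF, $tail2);
    $root["\xF0"] = $f0;
    $f4 = [];
    sketch_link($f4, 0x80, 0x8F, $tail2);
    $root["\xF4"] = $f4;

    return $root;
}

// Walks the machine one byte at a time; true only if every byte was accepted
// and the walk ended on a code point boundary.
function sketch_is_complete_utf8(array $machine, string $bytes): bool
{
    $state = $machine;
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = $bytes[$i];
        if (!isset($state[$byte])) {
            return false; // this byte is not valid in the current state
        }
        $state = $state[$byte];
    }
    return isset($state["\x00"]); // back at the start state?
}

$machine = sketch_build_machine();
var_dump(sketch_is_complete_utf8($machine, "h\xC3\xA9")); // bool(true)
var_dump(sketch_is_complete_utf8($machine, "\xC0\x80"));  // bool(false)
```

Because the walker only keeps the current state between bytes, the same loop could consume a stream chunk by chunk, which is presumably what makes the structure suitable for endless input.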
utf8_validate(string $string): bool
Returns true if the string is valid UTF-8, false otherwise.
Data
The test/data directory holds two files containing all 1,112,064 possible UTF-8 encoded characters:
one as plain text, the other as JSON. These are not included in
packaged stable releases but can be generated with the example utf8_generate_all_code_points()
function above, which returns the plain text string.
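The figure of 1,112,064 follows from the size of the Unicode code space minus the surrogate range, which has no UTF-8 encoding. A quick check, with variable names chosen only for illustration:

```php
<?php
// Code points run from U+0000 to U+10FFFF inclusive.
$codePoints = 0x10FFFF + 1;
// Surrogates U+D800..U+DFFF are not scalar values and cannot be encoded in UTF-8.
$surrogates = 0xDFFF - 0xD800 + 1;
$scalarValues = $codePoints - $surrogates;

echo $scalarValues, "\n"; // 1112064
```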
Excerpts from the Unicode 10.0.0 standard:
Recreated here for ease of reference. Nobody likes PDFs.
Table 3-6. UTF-8 Bit Distribution
| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
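As a worked illustration of Table 3-6, the sketch below distributes the bits of a scalar value across one to four bytes. `sketch_encode_utf8` is a hypothetical helper written for this example and is not part of the package.

```php
<?php
// Hypothetical helper illustrating Table 3-6; not provided by the package.
function sketch_encode_utf8(int $cp): string
{
    // Only Unicode scalar values (no surrogates) have a UTF-8 encoding.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(sketch_encode_utf8(0x20AC)), "\n";  // e282ac
echo bin2hex(sketch_encode_utf8(0x1F600)), "\n"; // f09f9880
```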
Table 3-7. Well-Formed UTF-8 Byte Sequences
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
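Table 3-7 translates fairly directly into a table-driven check. The sketch below returns the byte offset at which the first ill-formed sequence starts, or null for well-formed input. `sketch_find_ill_formed_offset` is a hypothetical name; this is one possible reading of the table, not the package's implementation, and the package may report truncated or partially valid sequences differently.

```php
<?php
// Hypothetical helper driven by Table 3-7; not the package's own code.
function sketch_find_ill_formed_offset(string $s): ?int
{
    // [first byte min, first byte max, second byte min, second byte max, length]
    $rows = [
        [0x00, 0x7F, 0x00, 0x00, 1],
        [0xC2, 0xDF, 0x80, 0xBF, 2],
        [0xE0, 0xE0, 0xA0, 0xBF, 3],
        [0xE1, 0xEC, 0x80, 0xBF, 3],
        [0xED, 0xED, 0x80, 0x9F, 3],
        [0xEE, 0xEF, 0x80, 0xBF, 3],
        [0xF0, 0xF0, 0x90, 0xBF, 4],
        [0xF1, 0xF3, 0x80, 0xBF, 4],
        [0xF4, 0xF4, 0x80, 0x8F, 4],
    ];

    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $first = ord($s[$i]);
        $matched = false;
        foreach ($rows as [$lo, $hi, $lo2, $hi2, $len]) {
            if ($first < $lo || $first > $hi) {
                continue; // this row does not apply to the first byte
            }
            if ($i + $len > $n) {
                return $i; // sequence is cut off by the end of the string
            }
            for ($k = 1; $k < $len; $k++) {
                $b = ord($s[$i + $k]);
                // Only the second byte has a row-specific range.
                [$min, $max] = $k === 1 ? [$lo2, $hi2] : [0x80, 0xBF];
                if ($b < $min || $b > $max) {
                    return $i;
                }
            }
            $i += $len;
            $matched = true;
            break;
        }
        if (!$matched) {
            return $i; // byte can never start a sequence (80..C1, F5..FF)
        }
    }
    return null;
}

var_dump(sketch_find_ill_formed_offset("ab\xED\xA0\x80")); // int(2)
var_dump(sketch_find_ill_formed_offset("h\xC3\xA9llo"));   // NULL
```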