benmorel/gsm-charset-converter

Converts GSM 03.38 strings to and from UTF-8

0.3.1 2024-04-16 20:05 UTC

Last update: 2024-12-12 20:59:41 UTC


A PHP library to convert GSM 03.38, the charset used for SMS messaging, to and from UTF-8.

This library is well tested. The character maps used have been cross-checked against multiple sources, and when in doubt, a test has been performed on a real SMS gateway.

The library offers optional transliteration: unsupported characters can be replaced with a close variant. For example, the ë character can be replaced with e.

Known limitations:

  • Only the default alphabet and extension table are supported at the moment; this is the alphabet that must be supported by every device and network element according to the standard. Other alphabets exist but this project does not currently aim to support them.
  • Transliteration may not be available for every character that could have a close equivalent in the GSM charset; this library supports transliteration of Latin-1 characters and of characters from several other languages. If you feel that another UTF-8 character could be transliterated, please open an issue!

Installation

This library is installable via Composer:

composer require benmorel/gsm-charset-converter

Requirements

This library requires PHP 7.4 or later and the mbstring extension.

Project status & release process

This library is under development.

The current releases are numbered 0.x.y. When a non-breaking change is introduced (adding new methods, optimizing existing code, etc.), y is incremented.

When a breaking change is introduced, a new 0.x version cycle is always started.

It is therefore safe to lock your project to a given release cycle, such as 0.3.*.

If you need to upgrade to a newer release cycle, check the release history for the list of changes introduced by each subsequent 0.x.0 version.

Usage

Converting GSM 03.38 strings to UTF-8

The convertGsmToUtf8() method takes one parameter, the GSM 03.38 string to convert:

use BenMorel\GsmCharsetConverter\Converter;

$converter = new Converter();
$utf8 = $converter->convertGsmToUtf8('...');

The input string must be a valid GSM 03.38 string, or an InvalidArgumentException is thrown. The input string is expected to be unpacked: 7-bit chars in 8-bit bytes with a leading zero bit, just like ASCII chars.
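
To make the "unpacked" format more concrete, here is a minimal, hand-rolled sketch of how such a string could be decoded. It is not the library's implementation: the function name decodeGsmSketch() and the two sample tables are hypothetical, and the tables list only a handful of GSM 03.38 code points. It shows that each byte must have its high bit clear, and that the escape byte 0x1B selects a character from the extension table.

```php
<?php

// Hypothetical, partial lookup tables: only a few GSM 03.38 code points are
// listed; real tables cover all 128 positions plus the whole extension table.
const BASIC_SAMPLE = [
    0x00 => '@',
    0x02 => '$',
    0x05 => 'é',
    0x11 => '_',
    0x48 => 'H',
    0x69 => 'i',
];
const EXTENSION_SAMPLE = [
    0x3C => '[',
    0x3E => ']',
    0x65 => '€',
];

function decodeGsmSketch(string $gsm): string
{
    $utf8 = '';
    $length = strlen($gsm);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($gsm[$i]);

        // Unpacked input stores one 7-bit char per byte: the high bit must be clear.
        if ($byte > 0x7F) {
            throw new InvalidArgumentException('Not an unpacked GSM 03.38 string.');
        }

        if ($byte === 0x1B) {
            // Escape byte: the next byte indexes the extension table.
            $next = ($i + 1 < $length) ? ord($gsm[++$i]) : -1;
            $char = EXTENSION_SAMPLE[$next] ?? null;
        } else {
            $char = BASIC_SAMPLE[$byte] ?? null;
        }

        if ($char === null) {
            throw new InvalidArgumentException('Byte not covered by this partial sketch.');
        }

        $utf8 .= $char;
    }

    return $utf8;
}

echo decodeGsmSketch("\x48\x69\x00\x1B\x65"), "\n"; // Hi@€
```

Note that 0x00 decodes to @ rather than to a NUL byte, which is one reason a GSM 03.38 string cannot simply be treated as ASCII.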

Converting UTF-8 strings to GSM 03.38

The convertUtf8ToGsm() method accepts 3 parameters:

  • a valid UTF-8 input string;
  • whether or not to attempt to transliterate incompatible chars;
  • an optional string to replace unknown characters with.

If the input string is not valid UTF-8, an InvalidArgumentException is thrown.

The output is an unpacked GSM 03.38 string.

Without transliteration

$gsm = $converter->convertUtf8ToGsm('Helló', false, '?'); // Hell?

If the third parameter is not provided, and the string contains characters incompatible with GSM 03.38, an InvalidArgumentException is thrown.

With transliteration

$gsm = $converter->convertUtf8ToGsm('Helló', true, '?'); // Hello

If the third parameter is not provided, and the string contains characters incompatible with GSM 03.38 and not transliterable, an InvalidArgumentException is thrown.
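
The way the second and third parameters interact can be sketched per character as follows. This is only an illustration of the rules described above, not the library's code: resolveCharsSketch() and TRANSLIT_SAMPLE are hypothetical names, the transliteration map holds just two entries, and for brevity the sketch treats only ASCII letters, digits and spaces as GSM-compatible.

```php
<?php

// Hypothetical two-entry map; the library's transliteration tables are far
// larger and are not reproduced here.
const TRANSLIT_SAMPLE = ['ó' => 'o', 'ë' => 'e'];

function resolveCharsSketch(string $utf8, bool $translit, ?string $replacement = null): string
{
    $result = '';

    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        if (preg_match('/^[A-Za-z0-9 ]$/', $char) === 1) {
            // Simplification: only these chars count as compatible in this sketch.
            $result .= $char;
        } elseif ($translit && array_key_exists($char, TRANSLIT_SAMPLE)) {
            // Transliteration is tried first, when enabled.
            $result .= TRANSLIT_SAMPLE[$char];
        } elseif ($replacement !== null) {
            // Otherwise fall back to the replacement string, when provided.
            $result .= $replacement;
        } else {
            throw new InvalidArgumentException("Cannot represent character: $char");
        }
    }

    return $result;
}

echo resolveCharsSketch('Helló', false, '?'), "\n"; // Hell?
echo resolveCharsSketch('Helló', true, '?'), "\n";  // Hello
```

With transliteration enabled, the replacement string is only used for characters that have no entry in the map, such as an emoji.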

Cleaning up UTF-8 strings to ensure that a message is sent in the GSM charset

Nowadays, most online SMS gateways accept UTF-8 as input; however, some of them do not provide a way to force a message to be sent in the GSM charset.

As a result, you may incur extra charges when your SMS is sent in Unicode (UCS-2) format and split into multiple parts, just because the message contains an unforeseen accented character or emoji.

The library provides a method, cleanUpUtf8String(), that prevents such surprises by returning a UTF-8 string that contains only characters that can be safely converted to the GSM charset.

This method accepts the same parameters as convertUtf8ToGsm():

$utf8 = $converter->cleanUpUtf8String('Helló', false, '?'); // Hell?
$utf8 = $converter->cleanUpUtf8String('Helló', true, '?'); // Hello
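
To see why a single stray character can raise the cost of a message, the arithmetic can be sketched as below. The function estimateSmsPartsSketch() is hypothetical and not part of this library. It assumes the usual limits of 160 septets or 70 UCS-2 code units for a single SMS, and 153 septets or 67 code units per part once a message is concatenated (a 6-byte user data header is assumed in each part); your gateway may count differently.

```php
<?php

/**
 * Estimates how many SMS parts a message needs.
 *
 * $units is the number of septets (GSM charset, where extension table
 * characters count twice) or the number of 16-bit code units (UCS-2).
 */
function estimateSmsPartsSketch(int $units, bool $isGsm): int
{
    // A single 140-byte SMS holds 160 septets or 70 16-bit units.
    $single = $isGsm ? 160 : 70;

    // Assumption: each part of a concatenated message loses 6 bytes to a
    // header, leaving 153 septets or 67 16-bit units.
    $perPart = $isGsm ? 153 : 67;

    if ($units <= $single) {
        return 1;
    }

    // Ceiling division.
    return intdiv($units + $perPart - 1, $perPart);
}

echo estimateSmsPartsSketch(100, true), "\n";  // 1
echo estimateSmsPartsSketch(100, false), "\n"; // 2
```

Under these assumptions, a 100-character message fits in one part in the GSM charset but needs two parts as soon as one character forces UCS-2.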

Packing 7-bit strings into 8-bit binary strings

To fit 160 7-bit characters into a 140-byte SMS, the characters have to be packed into a binary, 8-bit string. The Packer class provides functionality to pack and unpack strings in this format:

use BenMorel\GsmCharsetConverter\Packer;

$packer = new Packer();
$packed = $packer->pack('ABC'); // the binary string "\x41\xE1\x10" (41E110 in hex)
$string = $packer->unpack("\x41\xE1\x10"); // ABC

Note that pack() throws an InvalidArgumentException if the input string contains 8-bit chars (i.e. chars with the leading bit set).
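
For readers curious about what packing involves, the bit manipulation can be sketched in plain PHP. This is an independent illustration, not the Packer source: packSeptetsSketch() is a hypothetical name, and the sketch assumes that leftover bits in the last octet are padded with zeros.

```php
<?php

function packSeptetsSketch(string $unpacked): string
{
    $packed = '';
    $buffer = 0; // pending bits, least significant bit first
    $bits = 0;   // number of pending bits in $buffer

    $length = strlen($unpacked);

    for ($i = 0; $i < $length; $i++) {
        $septet = ord($unpacked[$i]);

        if ($septet > 0x7F) {
            throw new InvalidArgumentException('Input contains an 8-bit char.');
        }

        // Place the 7 new bits above the bits that are still pending.
        $buffer |= $septet << $bits;
        $bits += 7;

        // Emit every complete octet, lowest bits first.
        while ($bits >= 8) {
            $packed .= chr($buffer & 0xFF);
            $buffer >>= 8;
            $bits -= 8;
        }
    }

    // Flush the remaining bits; the unused high bits stay at zero (assumption).
    if ($bits > 0) {
        $packed .= chr($buffer & 0xFF);
    }

    return $packed;
}

echo strtoupper(bin2hex(packSeptetsSketch('ABC'))), "\n"; // 41E110
```

Every 8 input characters yield 7 output bytes, which is how 160 characters fit into 140 bytes.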