xatham / text-extraction
Easy text extraction for many different file types
pkg:composer/xatham/text-extraction
Requires
- php: >=7.4
- ext-fileinfo: *
- ext-imagick: *
- league/flysystem: ^2.0
- phpoffice/phpspreadsheet: ^1.15
- phpoffice/phpword: ^0.17.0 | ^0.18.2
- shuchkin/simplexlsx: ^0.8.19
- smalot/pdfparser: ^0.17.1
- symfony/finder: ^5.2
- thiagoalessio/tesseract_ocr: ^2.9

Requires (Dev)
- friendsofphp/php-cs-fixer: ^2.17
- phpmd/phpmd: ^2.9
- phpspec/prophecy-phpunit: ^2.0
- phpstan/phpstan: ^0.12.62
- phpunit/phpunit: ^9.5

This package is auto-updated.
Last update: 2025-10-26 04:15:46 UTC
README
text-extraction
About
This PHP library lets you extract plain text from various document types.
The following MIME types are currently supported for extraction:
text/plain
text/csv
application/vnd.ms-excel
application/vnd.oasis.opendocument.text
application/pdf
application/msword
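
Because extraction is limited to the MIME types listed above, it can help to check a file before handing it to the extractor. The following is a minimal sketch that relies only on PHP's fileinfo extension (already a requirement of this package). The constant SUPPORTED_MIME_TYPES and the function isSupportedForExtraction() are hypothetical names introduced for illustration; they are not part of the library.

```php
<?php

// MIME types copied from the list above.
// Hypothetical constant, not provided by the library.
const SUPPORTED_MIME_TYPES = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/vnd.oasis.opendocument.text',
    'application/pdf',
    'application/msword',
];

/**
 * Hypothetical helper: reports whether the file at $path has a
 * detected MIME type that appears in SUPPORTED_MIME_TYPES.
 */
function isSupportedForExtraction(string $path): bool
{
    // Missing paths and directories are never supported.
    if (!is_file($path)) {
        return false;
    }

    // Detect the MIME type from the file contents via ext-fileinfo.
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mimeType = $finfo->file($path);

    // finfo::file() returns false when detection fails.
    return $mimeType !== false
        && in_array($mimeType, SUPPORTED_MIME_TYPES, true);
}
```

The detected type depends on the libmagic database installed on the system, so a given file (a CSV file, for example) may be reported differently across environments.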
Install
composer require xatham/text-extraction
Usage
/**
 * Extracting only pdf files, without ocr capturing
 */
$textExtractor = (new TextExtractionBuilder())->buildTextExtractor(
    [
        'withOcr' => false,
        'validMimeTypes' => ['application/pdf'],
    ],
);

$target = dirname(__DIR__) . '/examples/sample.pdf';
$plainTextDocument = $textExtractor->extractByFilePath($target);

if ($plainTextDocument === null) {
    exit('Could not extract any data');
}

$texts = $plainTextDocument->getTextItems();
foreach ($texts as $text) {
    var_dump($text);
}
License
text-extraction is licensed under MIT.