1tomany / pdf-ai
A simple PHP library that makes extracting data from PDFs for large language models easy
Requires
- php: >=8.2
- psr/container: ^2.0
- symfony/process: ^7.2
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.87.2
- phpstan/phpstan: ^2.1.23
- phpunit/phpunit: ^12.3.10
PDFAI is a simple PHP library that makes extracting data from PDFs for large language models easy. It uses the Symfony Process component to run the Poppler command-line tools (Poppler is a PDF rendering library derived from xpdf).
Install PDFAI
Install the library using Composer:
composer require 1tomany/pdf-ai
Install Poppler
Before beginning, ensure the pdfinfo, pdftoppm, and pdftotext binaries are installed and located in a directory listed in the $PATH environment variable.
macOS
brew install poppler
Debian and Ubuntu
apt-get install poppler-utils
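After installing Poppler, it can help to confirm that PHP sees the three binaries. The following is a minimal sketch, not part of this library: findBinary() is a hypothetical helper that walks a PATH-style string with plain PHP functions, and it assumes a Unix-like system where the binaries have no file extension.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PDFAI): returns the full path of the
 * first executable file named $name found in a PATH-style string, or null
 * when no listed directory contains it.
 */
function findBinary(string $name, ?string $path = null): ?string
{
    // Fall back to the current process's PATH when none is given
    $path ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ('' === $dir) {
            continue; // skip empty PATH entries
        }

        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Report where each Poppler tool used by the library was found
foreach (['pdfinfo', 'pdftoppm', 'pdftotext'] as $binary) {
    printf("%-10s %s\n", $binary, findBinary($binary) ?? 'NOT FOUND in $PATH');
}
```

If any line reports NOT FOUND, the web server or CLI user may have a different $PATH than your login shell.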
Usage
This library has three main features:
- Read PDF metadata such as the number of pages
- Rasterize one or more pages to JPEG or PNG images
- Extract text from one or more pages
Extracted data is stored in memory and can be written to the filesystem or converted to a data: URI. Because extracted data is stored in memory, the extraction methods return a \Generator that yields one result for each page that is extracted or rasterized.
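To illustrate the per-page pattern described above, here is a self-contained sketch that does not use the library at all. yieldPages() and toDataUri() are hypothetical stand-ins written for this example; the library's own result objects and their toDataUri() method may behave differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a per-page extractor: yields one page's bytes
 * at a time, keyed by its 1-based page number.
 *
 * @param list<string> $pages raw bytes for each page
 *
 * @return \Generator<int, string>
 */
function yieldPages(array $pages): \Generator
{
    foreach ($pages as $index => $bytes) {
        yield $index + 1 => $bytes;
    }
}

/**
 * Encodes raw bytes as a base64 data: URI (the RFC 2397 format).
 */
function toDataUri(string $bytes, string $mimeType): string
{
    return sprintf('data:%s;base64,%s', $mimeType, base64_encode($bytes));
}

foreach (yieldPages(['first page', 'second page']) as $pageNumber => $bytes) {
    // Option 1: write the page to the filesystem
    $target = sprintf('%s/pdfai-page-%d.txt', sys_get_temp_dir(), $pageNumber);
    file_put_contents($target, $bytes);

    // Option 2: convert the page to a data: URI
    printf("Page %d: %s\n", $pageNumber, toDataUri($bytes, 'text/plain'));
}
```

Each loop iteration handles a single page, so the consumer can write or encode a page and discard it before the next one is produced.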
There are two ways to interact with the library:
- Direct: Instantiate the OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient class and call its methods directly. This approach is easier to use, but it makes testing harder because you can't easily swap the PopplerExtractorClient with the MockExtractorClient in your unit tests.
- Indirect: Create a container of OneToMany\PDFAI\Contract\Client\ExtractorClientInterface objects, and use the OneToMany\PDFAI\Factory\ExtractorClientFactory class to instantiate them. If you wish to use this approach, I recommend the Symfony bundle 1tomany/pdf-ai-bundle, which takes advantage of Symfony's container and autowiring features. While this approach requires more upfront work, it makes testing much easier because you can swap the PopplerExtractorClient with the MockExtractorClient in your tests.
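The testing trade-off between the two approaches comes down to depending on an interface rather than a concrete class. The sketch below shows the idea with local stand-ins only: PageCountClient, FixedPageCountClient, and ChunkPlanner are hypothetical names invented for this example and are not part of the library. In a real project, the constructor would type-hint ExtractorClientInterface instead.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the library's client interface
interface PageCountClient
{
    public function countPages(string $filePath): int;
}

// Hypothetical stand-in for a mock client: returns a fixed page count
// without reading a file or running Poppler
final class FixedPageCountClient implements PageCountClient
{
    public function __construct(private int $pages)
    {
    }

    public function countPages(string $filePath): int
    {
        return $this->pages;
    }
}

// Application code depends only on the interface, so a test can pass in
// the fixed client while production code passes in a real one
final class ChunkPlanner
{
    public function __construct(
        private PageCountClient $client,
        private int $pagesPerChunk = 10,
    ) {
        if ($pagesPerChunk < 1) {
            throw new \InvalidArgumentException('pagesPerChunk must be at least 1.');
        }
    }

    /**
     * @return list<array{int, int}> inclusive [first, last] page ranges
     */
    public function plan(string $filePath): array
    {
        $pages = $this->client->countPages($filePath);
        $ranges = [];

        for ($first = 1; $first <= $pages; $first += $this->pagesPerChunk) {
            $ranges[] = [$first, min($first + $this->pagesPerChunk - 1, $pages)];
        }

        return $ranges;
    }
}

$planner = new ChunkPlanner(new FixedPageCountClient(25));
print_r($planner->plan('/path/to/file.pdf'));
```

Because ChunkPlanner never names a concrete client, the test needs no PDF file and no Poppler installation.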
Direct Usage
<?php

require_once __DIR__ . '/vendor/autoload.php';

use OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient;
use OneToMany\PDFAI\Contract\Enum\OutputType;
use OneToMany\PDFAI\Request\ExtractDataRequest;
use OneToMany\PDFAI\Request\ExtractTextRequest;
use OneToMany\PDFAI\Request\ReadMetadataRequest;

$filePath = '/path/to/file.pdf';

$client = new PopplerExtractorClient();

$metadata = $client->readMetadata(new ReadMetadataRequest($filePath));
printf("The PDF '%s' has %d page(s).\n", $filePath, $metadata->getPages());

// Rasterize all pages as 150 DPI JPEGs
$request = new ExtractDataRequest($filePath, 1, null, OutputType::Jpg, 150);

foreach ($client->extractData($request) as $image) {
    // $image->getData() or $image->toDataUri()
    printf("MD5: %s\n", md5($image->getData()));
}

// Extract text from pages 3 and 4
$request = new ExtractTextRequest($filePath, 3, 4);

foreach ($client->extractData($request) as $text) {
    // $text->getData()
    printf("Length: %d\n", strlen($text->getData()));
}
Run Test Suite
Run the test suite with PHPUnit:
./vendor/bin/phpunit
Run Static Analysis
Run static analysis with PHPStan:
./vendor/bin/phpstan
Credits
License
The MIT License