README

PDFAI is a simple PHP library that makes it easy to extract data from PDFs for use with large language models. It has a single dependency, the Symfony Process Component, which it uses to call the Poppler command line tools (Poppler is derived from the Xpdf codebase).

Install PDFAI

Install the library using Composer:

composer require 1tomany/pdf-ai

Install Poppler

Before beginning, ensure the pdfinfo, pdftoppm, and pdftotext binaries are installed and located in a directory listed in the $PATH environment variable.

macOS

brew install poppler

Debian and Ubuntu

apt-get install poppler-utils
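
After installing Poppler, a quick check along these lines can confirm that the three binaries are reachable. This is a minimal sketch and not part of PDFAI: the findBinary() helper is a hypothetical name introduced here, and it simply scans the directories listed in the PATH environment variable.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PDFAI): returns the full path of the
 * first executable file named $name found in the PATH directories, or
 * null when no such file exists. $path defaults to the PATH variable.
 */
function findBinary(string $name, ?string $path = null): ?string
{
    // getenv() returns false when PATH is unset; the cast turns that into ''
    $path ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $directory) {
        if ('' === $directory) {
            continue; // Skip empty PATH entries
        }

        $candidate = rtrim($directory, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Report where each required Poppler tool was found, if anywhere
foreach (['pdfinfo', 'pdftoppm', 'pdftotext'] as $binary) {
    printf("%-10s %s\n", $binary, findBinary($binary) ?? 'NOT FOUND');
}
```

Any line that reports NOT FOUND suggests the Poppler package is missing or its install directory is not on the PATH.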

Usage

This library has three main features:

Read PDF metadata such as the number of pages
Rasterize one or more pages to JPEG or PNG images
Extract text from one or more pages

Extracted data is stored in memory and can be written to the filesystem or converted to a data: URI. Because extracted data is stored in memory, the extraction methods return a \Generator object that yields one result for each page that is extracted or rasterized, so pages can be handled one at a time.
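
The idea of yielding one page at a time, then either writing it to disk or encoding it as a data: URI, can be sketched in plain PHP. This is an illustration under stated assumptions, not PDFAI's implementation: yieldPages() and toDataUri() are hypothetical helpers, and short strings stand in for real page data.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: encodes raw bytes as a base64 data: URI.
 */
function toDataUri(string $bytes, string $mimeType): string
{
    return sprintf('data:%s;base64,%s', $mimeType, base64_encode($bytes));
}

/**
 * Hypothetical stand-in for an extractor. It yields pages one by one,
 * keyed by their 1-based page number, so the consumer only handles a
 * single page per loop iteration.
 *
 * @param list<string> $pages
 *
 * @return \Generator<int, string>
 */
function yieldPages(array $pages): \Generator
{
    foreach ($pages as $index => $bytes) {
        yield $index + 1 => $bytes;
    }
}

foreach (yieldPages(['first page', 'second page']) as $number => $bytes) {
    // Option 1: write the page to the filesystem
    $file = sprintf('%s/pdfai-page-%d.txt', sys_get_temp_dir(), $number);
    file_put_contents($file, $bytes);

    // Option 2: convert the page to a data: URI
    echo toDataUri($bytes, 'text/plain'), "\n";
}
```

With real page images the MIME type would be image/jpeg or image/png rather than text/plain.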

Using the library is easy, and you have two ways to interact with it:

Direct: Instantiate the OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient class and call its methods directly. This approach is easier to use, but it makes testing harder because you can't easily swap the PopplerExtractorClient with the MockExtractorClient in your unit tests.
Indirect: Create a container of OneToMany\PDFAI\Contract\Client\ExtractorClientInterface objects, and use the OneToMany\PDFAI\Factory\ExtractorClientFactory class to instantiate them. If you wish to use this approach, I recommend the Symfony bundle 1tomany/pdf-ai-bundle, which takes advantage of Symfony's container and autowiring features. While this approach requires more upfront work, it makes testing much easier because you can swap the PopplerExtractorClient with the MockExtractorClient in your tests.
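
The testing benefit of the indirect approach comes from depending on an interface rather than a concrete class. The sketch below shows that pattern in isolation. Every name in it (PageCounter, FixedPageCounter, DocumentSummarizer) is hypothetical and chosen for illustration; none of them are PDFAI classes, and the library's own interface and factory may be shaped differently.

```php
<?php

declare(strict_types=1);

// Hypothetical contract, standing in for an extractor client interface
interface PageCounter
{
    public function countPages(string $filePath): int;
}

// Hypothetical test double: returns a fixed page count, never touches a PDF
final class FixedPageCounter implements PageCounter
{
    public function __construct(private int $pages)
    {
    }

    public function countPages(string $filePath): int
    {
        return $this->pages;
    }
}

// Application code depends only on the interface, so a production
// implementation and a test double are interchangeable
final class DocumentSummarizer
{
    public function __construct(private PageCounter $counter)
    {
    }

    public function describe(string $filePath): string
    {
        return sprintf("'%s' has %d page(s).", $filePath, $this->counter->countPages($filePath));
    }
}

// In a unit test, inject the double instead of a client that shells out
$summarizer = new DocumentSummarizer(new FixedPageCounter(4));
echo $summarizer->describe('/path/to/file.pdf'), "\n";
```

In production the same DocumentSummarizer would receive an implementation that actually reads the file, with no change to its own code.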

Direct Usage

<?php

require_once __DIR__ . '/vendor/autoload.php';

use OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient;
use OneToMany\PDFAI\Contract\Enum\OutputType;
use OneToMany\PDFAI\Request\ExtractDataRequest;
use OneToMany\PDFAI\Request\ExtractTextRequest;
use OneToMany\PDFAI\Request\ReadMetadataRequest;

$filePath = '/path/to/file.pdf';

$client = new PopplerExtractorClient();

$metadata = $client->readMetadata(new ReadMetadataRequest($filePath));
printf("The PDF '%s' has %d page(s).\n", $filePath, $metadata->getPages());

// Rasterize all pages as 150 DPI JPEGs
$request = new ExtractDataRequest($filePath, 1, null, OutputType::Jpg, 150);

foreach ($client->extractData($request) as $image) {
    // $image->getData() or $image->toDataUri()
    printf("MD5: %s\n", md5($image->getData()));
}

// Extract text from pages 3 and 4
$request = new ExtractTextRequest($filePath, 3, 4);

foreach ($client->extractData($request) as $text) {
    // $text->getData()
    printf("Length: %d\n", strlen($text->getData()));
}

Run Test Suite

Run the test suite with PHPUnit:

./vendor/bin/phpunit

Run Static Analysis

Run static analysis with PHPStan:

./vendor/bin/phpstan

Credits

Vic Cherubini, 1:N Labs, LLC

License

The MIT License
