tacman / php-readability
Automatic article extraction from HTML, fork of j0k3r/php-readability
Requires
- php: ^8.2
- ext-mbstring: *
- masterminds/html5: ^2.7
- psr/log: ^3.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- monolog/monolog: ^2.1|^3.0
- phpstan/phpstan: ^1.0||^2.0
- phpstan/phpstan-phpunit: ^1.0||^2.0.1
- phpunit/phpunit: ^9
- rector/rector: ^2.0
- symfony/phpunit-bridge: ^6.4|^7.0
Suggests
- ext-tidy: Used to clean up given HTML and to avoid problems with bad HTML structure.
README
composer config repositories.readability '{"type": "path", "url": "~/g/tacman/php-readability"}'
composer req tacman/php-readability:"*@dev"
composer config repositories.graby '{"type": "path", "url": "~/g/tacman/graby"}'
composer req tacman/graby:"*@dev"
Readability
This is an extract of the Readability class from this full-text-rss fork. It can be defined as a better version of the original php-readability.
Differences
The default php-readability library is quite old and needs improvement. I found a great fork of full-text-rss by @Dither which improves the Readability class.
- I've extracted the class from its fork to be able to use it out of the box
- I've added some simple tests
- I've changed the coding style by running php-cs-fixer, and added a namespace
But the code is still really hard to understand / read ...
Requirements
By default, this lib uses the Tidy extension if it is available. Tidy is only used to clean up the given HTML and avoid problems caused by bad HTML structure. Composer will suggest it.
If you run into problems parsing content without Tidy installed, install it and try again.
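To see whether Tidy is actually available to the PHP runtime that runs your code, a quick check along these lines may help. This is only a sketch: `describeTidySupport()` is a hypothetical helper and not part of this library; it relies solely on PHP's built-in `extension_loaded()`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of php-readability: report whether the
 * Tidy extension is loaded in the current PHP runtime.
 */
function describeTidySupport(): string
{
    return extension_loaded('tidy')
        ? 'ext-tidy is loaded'
        : 'ext-tidy is missing; consider installing it if parsing fails';
}

echo describeTidySupport(), PHP_EOL;
```

Keep in mind that the CLI and the web server may load different sets of extensions, so run the check in the environment where the parsing happens.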
Usage
use Readability\Readability;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';

// you can use whatever you want to retrieve the html content (Guzzle, Buzz, cURL ...)
$html = file_get_contents($url);

$readability = new Readability($html, $url);
// or without Tidy
// $readability = new Readability($html, $url, 'libxml', false);
$result = $readability->init();

if ($result) {
    // display the title of the page
    echo $readability->getTitle()->textContent;
    // display the *readability* content
    echo $readability->getContent()->textContent;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
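The example above passes the result of `file_get_contents()` straight to the constructor, but that function returns `false` when the request fails. A minimal sketch of a more defensive fetch step is shown below. The `fetchHtml()` helper, its timeout default, and the user agent string are assumptions made for illustration; none of them come from this library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of php-readability: fetch a document and
 * return null instead of false (or an empty string) when nothing usable
 * comes back.
 */
function fetchHtml(string $location, int $timeoutSeconds = 10): ?string
{
    // The "http" context options are used for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeoutSeconds,
            'user_agent' => 'readability-example/1.0', // assumed value
            'follow_location' => 1,
        ],
    ]);

    // Suppress the PHP warning and rely on the return value instead.
    $html = @file_get_contents($location, false, $context);

    return ($html === false || $html === '') ? null : $html;
}
```

The caller can then test the result for `null` and only build `new Readability($html, $url)` when some HTML was actually retrieved.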
If you want to debug it, or check what's going on, you can inject a logger (which must follow Psr\Log\LoggerInterface, Monolog for example):
use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.medialens.org/index.php/alerts/alert-archive/alerts-2013/729-thatcher.html';
$html = file_get_contents($url);

$logger = new Logger('readability');
$logger->pushHandler(new StreamHandler('path/to/your.log', Logger::DEBUG));

$readability = new Readability($html, $url);
$readability->setLogger($logger);

// run the extraction as in the previous example, now with the logger attached
$result = $readability->init();