bitandblack / document-crawler
Extract different parts of an HTML or XML document.
pkg:composer/bitandblack/document-crawler
Requires
- php: ^8.2
- bitandblack/composer-helper: ^2.0
- bitandblack/pathinfo: ^1.0
- fig/http-message-util: ^1.0
- php-http/discovery: ^1.0
- places2be/locales: ^3.3
- psr/http-client: ^1.0
- symfony/css-selector: ^7.0 || ^8.0
- symfony/dom-crawler: ^7.0 || ^8.0
Requires (Dev)
- bitandblack/helpers: ^2.0
- nyholm/psr7: ^1.8
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^11.0
- react/http: ^1.0
- rector/rector: ^2.0
- symfony/http-client: ^7.0 || ^8.0
- symfony/var-dumper: ^7.0 || ^8.0
- symplify/easy-coding-standard: ^13.0
README
Bit&Black Document Crawler
Installation
This library is made for use with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.
Usage
Using Crawlers to extract parts of a document
The Bit&Black Document Crawler library provides several crawlers that extract information from a document. The following are currently available:
- IconsCrawler: Crawls and extracts all icons in a document that have been declared with <link rel="icon" ... />.
- ImagesCrawler: Crawls and extracts all images in a document that have been declared with <img ... />.
- LanguageCodeCrawler: Crawls and extracts the language code of a document that has been declared with <html lang="...">.
- MetaTagsCrawler: Crawls and extracts all meta tags in a document that have been declared with <meta ... />.
- TitleCrawler: Crawls and extracts the title of a document that has been declared with <title>...</title>.
All of these crawlers work the same way. Each one needs a DomCrawler object that contains the document:
```php
<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();
```
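To make concrete which markup each of the crawlers listed above targets, here is a minimal sketch that reads the same elements with PHP's built-in DOM extension. It does not use this library and does not reproduce its internals. The sample document, the variable names, and the XPath expressions are assumptions chosen purely for illustration.

```php
<?php

declare(strict_types=1);

// Illustration only: plain ext-dom, not the library's crawler classes.
$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
        <link rel="icon" href="/favicon.ico" />
        <meta name="description" content="Example page" />
    </head>
    <body>
        <img src="/logo.png" alt="Logo" />
    </body>
</html>
HTML;

$dom = new DOMDocument();
// Collect libxml parser messages internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($document);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// <title>...</title>: the element the TitleCrawler is described as reading.
$title = $xpath->evaluate('string(//title)');

// <html lang="...">: the attribute described for the LanguageCodeCrawler.
$languageCode = $xpath->evaluate('string(/html/@lang)');

// <link rel="icon" ... />: the elements described for the IconsCrawler.
$icons = [];
foreach ($xpath->query('//link[@rel="icon"]') as $link) {
    $icons[] = $link->getAttribute('href');
}

// <img ... />: the elements described for the ImagesCrawler.
$images = [];
foreach ($xpath->query('//img') as $image) {
    $images[] = $image->getAttribute('src');
}

// <meta ... />: this sketch only collects name/content pairs.
$metaTags = [];
foreach ($xpath->query('//meta[@name]') as $meta) {
    $metaTags[$meta->getAttribute('name')] = $meta->getAttribute('content');
}

var_dump($title, $languageCode, $icons, $images, $metaTags);
```

The library's crawlers wrap this kind of lookup behind a common interface and pass found resources to a Resource Handler, so in a real project the crawler classes are the more convenient choice.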
You can create a custom Crawler by implementing the CrawlerInterface.
Handling resources
In some cases, the crawled content includes resources that you may want to handle in a specific way. To achieve this, each crawler makes use of a so-called Resource Handler. The following are currently available:
- The FileSystemDownloadHandler: This one loads resources and writes them to the file system. There are different Downloaders available to fetch resources:
  - The HttpDiscoveryDownloader is the default one and uses whatever HTTP client library your project already provides to download resources.
  - The ReactDownloader needs the react/http library and fetches resources asynchronously.
  - You can also create a custom Downloader by implementing the FileSystemDownloaderInterface.
- The PassiveResourceHandler: This handler does nothing and is the default one.
You can create a custom Resource Handler by implementing the ResourceHandlerInterface.
Crawling everything at once
If you don't want to set up each crawler yourself, the HolisticDocumentCrawler does all the work for you:
```php
<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$holisticDocumentCrawler = new HolisticDocumentCrawler($document);

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();
```
The HolisticDocumentCrawler can also be initialised using the createFromUrl method:
```php
<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = HolisticDocumentCrawler::createFromUrl('https://www.bitandblack.com');
```
Help
If you have any questions, feel free to contact us at hello@bitandblack.com.
Further information about Bit&Black can be found at www.bitandblack.com.