bitandblack/document-crawler

Extract different parts of an HTML or XML document.


pkg:composer/bitandblack/document-crawler

0.1.0 2025-12-01 08:43 UTC

This package is auto-updated.

Last update: 2025-12-01 08:44:12 UTC



Bit&Black Document Crawler

Extract different parts of an HTML or XML document.

Installation

This library is meant to be used with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.
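
Composer generates an autoloader that a script has to load before the library's classes can be used. The following is a minimal sketch, assuming the script sits in the project root next to the vendor directory that Composer created; adjust the path if your layout differs.

```php
<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

// Load Composer's autoloader. The path is an assumption: it expects this
// script to live in the project root, next to the `vendor` directory.
require __DIR__ . '/vendor/autoload.php';

// Sanity check: prints bool(true) once the package has been installed.
var_dump(class_exists(HolisticDocumentCrawler::class));
```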

Usage

Using Crawlers to extract parts of a document

The Bit&Black Document Crawler library provides several crawlers that extract information from a document. The following crawlers currently exist:

  • IconsCrawler: Crawls and extracts all icons in a document that are declared with <link rel="icon" ... />.
  • ImagesCrawler: Crawls and extracts all images in a document that are declared with <img ... />.
  • LanguageCodeCrawler: Crawls and extracts the language code of a document that is declared with <html lang="...">.
  • MetaTagsCrawler: Crawls and extracts all meta tags in a document that are declared with <meta ... />.
  • TitleCrawler: Crawls and extracts the title of a document that is declared with <title>...</title>.

All of these crawlers work the same way. Each one needs a Symfony DomCrawler object that contains the document:

<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();
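
To make the element-to-crawler mapping more concrete, the sketch below selects the same elements with the plain Symfony DomCrawler API. It only illustrates which parts of the markup each crawler targets; it is not the library's implementation, and the sample document and variable names are assumptions made for this example.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

require __DIR__ . '/vendor/autoload.php';

// Hypothetical sample document containing one of each element of interest.
$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
        <meta name="description" content="A test page">
        <link rel="icon" href="/favicon.ico">
    </head>
    <body>
        <img src="/logo.png" alt="Logo">
    </body>
</html>
HTML;

$crawler = new Crawler($document);

// <html lang="...">: the attribute the LanguageCodeCrawler is described as reading.
$languageCode = $crawler->filterXPath('//html')->attr('lang');

// <title>...</title>: the element the TitleCrawler is described as reading.
$title = $crawler->filterXPath('//title')->text();

// <link rel="icon" ... />: collect the href of every icon declaration.
$icons = $crawler->filterXPath('//link[@rel="icon"]')->extract(['href']);

// <img ... />: collect the src of every image.
$images = $crawler->filterXPath('//img')->extract(['src']);

// <meta ... />: collect [name, content] pairs of every meta tag.
$metaTags = $crawler->filterXPath('//meta')->extract(['name', 'content']);

var_dump($languageCode, $title, $icons, $images, $metaTags);
```

XPath is used here instead of CSS selectors because filterXPath() works without the additional symfony/css-selector package.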

You can create a custom Crawler by implementing the CrawlerInterface.
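
The methods required by CrawlerInterface are not shown in this README, so the following sketch deliberately does not implement it. It only mirrors the usage pattern of the TitleCrawler example (constructor taking a DomCrawler object, crawlContent(), then a getter). HeadingCrawler and getHeadings() are hypothetical names, and the code assumes PHP 8.1 or newer. Check the interface in the package source before adapting it.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

require __DIR__ . '/vendor/autoload.php';

/**
 * Hypothetical crawler that collects the text of all <h1> elements.
 * It follows the crawlContent()/getter pattern of TitleCrawler but does
 * NOT implement CrawlerInterface, whose methods are not documented here.
 */
final class HeadingCrawler
{
    /** @var list<string> */
    private array $headings = [];

    public function __construct(private readonly Crawler $crawler)
    {
    }

    public function crawlContent(): void
    {
        // each() calls the closure once per matched node and returns the results.
        $this->headings = $this->crawler
            ->filterXPath('//h1')
            ->each(static fn (Crawler $node): string => $node->text());
    }

    /** @return list<string> */
    public function getHeadings(): array
    {
        return $this->headings;
    }
}

$crawler = new Crawler('<!doctype html><html><body><h1>Hello world</h1></body></html>');

$headingCrawler = new HeadingCrawler($crawler);
$headingCrawler->crawlContent();

$headings = $headingCrawler->getHeadings();
var_dump($headings);
```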

Handling resources

In some cases, resources are crawled that you may want to handle in a specific way. To achieve this, each crawler makes use of a so-called Resource Handler. The library provides Resource Handlers of its own.

You can create a custom Resource Handler by implementing the ResourceHandlerInterface.

Crawling everything at once

If you don't want to set anything up yourself, the HolisticDocumentCrawler does all the work for you:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$holisticDocumentCrawler = new HolisticDocumentCrawler($document);

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();

The HolisticDocumentCrawler can also be initialised using the createFromUrl method:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = HolisticDocumentCrawler::createFromUrl('https://www.bitandblack.com');
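
Loading a document from a URL depends on the network, so a script may want to guard the call. The sketch below catches \Throwable because this README does not state which exceptions, if any, createFromUrl throws. It also assumes that getTitle() returns a value that can be echoed, as in the TitleCrawler example above.

```php
<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

require __DIR__ . '/vendor/autoload.php';

try {
    $holisticDocumentCrawler = HolisticDocumentCrawler::createFromUrl('https://www.bitandblack.com');

    // Assumption: the title can be echoed directly.
    echo $holisticDocumentCrawler->getTitle(), PHP_EOL;
} catch (\Throwable $exception) {
    // The concrete exception types are not documented here, so the catch is broad.
    fwrite(STDERR, 'Could not crawl the document: ' . $exception->getMessage() . PHP_EOL);
}
```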

Help

If you have any questions, feel free to contact us at hello@bitandblack.com.

Further information about Bit&Black can be found at www.bitandblack.com.