
Bit&Black Document Crawler

Extract different parts of an HTML or XML document.

Installation

This library is meant to be used with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.

Usage

Using Crawlers to extract parts of a document

The Bit&Black Document Crawler library provides several crawlers that extract information from a document. The following crawlers currently exist:

AnchorsCrawler: Extracts all anchors in a document that are declared with <a href="...">...</a>.
IconsCrawler: Extracts all icons in a document that are declared with <link rel="icon" ... />.
ImagesCrawler: Extracts all images in a document that are declared with <img ... />.
LanguageCodeCrawler: Extracts the language code of a document that is declared with <html lang="...">.
MetaTagsCrawler: Extracts all meta tags in a document that are declared with <meta ... />.
TitleCrawler: Extracts the title of a document that is declared with <title>...</title>.
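
To make concrete which markup each of the crawlers listed above targets, here is a minimal sketch in plain PHP using the built-in DOM extension. It is not the library's implementation and does not use its API. The sample document, the XPath expressions, and all variable names are assumptions chosen for illustration.

```php
<?php

// Illustration only: plain PHP with ext-dom, NOT the library's own code.
$sampleDocument = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
        <link rel="icon" href="/favicon.ico" />
        <meta name="description" content="An example page" />
    </head>
    <body>
        <h1>Hello world</h1>
        <a href="/about">About</a>
        <img src="/logo.png" alt="Logo" />
    </body>
</html>
HTML;

// Collect libxml parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sampleDocument);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// <title>...</title> and <html lang="...">
$title = $xpath->evaluate('string(//title)');
$languageCode = $xpath->evaluate('string(/html/@lang)');

// <a href="...">...</a>
$anchors = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $anchors[] = $anchor->getAttribute('href');
}

// <link rel="icon" ... />
$icons = [];
foreach ($xpath->query('//link[@rel="icon"]') as $icon) {
    $icons[] = $icon->getAttribute('href');
}

// <img ... />
$images = [];
foreach ($xpath->query('//img[@src]') as $image) {
    $images[] = $image->getAttribute('src');
}

// <meta ... />, keyed by the name attribute
$metaTags = [];
foreach ($xpath->query('//meta[@name]') as $meta) {
    $metaTags[$meta->getAttribute('name')] = $meta->getAttribute('content');
}
```

The crawlers in this library presumably cover more variants than this sketch does (for example, other rel values for icons), so treat it only as a picture of the kind of data each crawler returns.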

All of these crawlers work the same way. Each one needs a Symfony DomCrawler object that contains the document:

<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();

You can create a custom Crawler by implementing the CrawlerInterface.

Handling resources

In some cases, a crawler finds resources that you may want to handle in a specific way. For this purpose, each crawler uses a so-called Resource Handler. The following handlers currently exist:

- The FileSystemDownloadHandler: This handler loads resources and writes them to the file system. Different HTTP clients are available to fetch resources:
  - The HttpDiscoveryClient is the default one and uses whichever HTTP library your project already uses to download resources.
  - The ReactClient needs the react/http library and fetches resources asynchronously.
  - You can also create a custom HTTP client by implementing the HttpClientInterface.
- The PassiveResourceHandler: This handler does nothing and is the default one.

You can create a custom Resource Handler by implementing the ResourceHandlerInterface.
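
The idea behind resource handlers can be sketched as a small strategy pattern. The following example is self-contained and does not use the library. Every name in it (SketchResourceHandler, handle(), sketchCrawl(), and so on) is hypothetical, and the real ResourceHandlerInterface may define different methods, so check the interface in the package before implementing it.

```php
<?php

// Hypothetical names throughout; this is NOT the library's ResourceHandlerInterface.
interface SketchResourceHandler
{
    public function handle(string $url): void;
}

// Mirrors the idea of a passive handler: every resource is ignored.
final class SketchPassiveHandler implements SketchResourceHandler
{
    public function handle(string $url): void
    {
        // Intentionally empty.
    }
}

// A custom handler: remembers every resource URL it receives.
final class SketchCollectingHandler implements SketchResourceHandler
{
    /** @var string[] */
    private array $urls = [];

    public function handle(string $url): void
    {
        $this->urls[] = $url;
    }

    /** @return string[] */
    public function getUrls(): array
    {
        return $this->urls;
    }
}

// Stand-in for a crawler: it passes each resource it finds to the handler.
function sketchCrawl(array $foundUrls, SketchResourceHandler $handler): void
{
    foreach ($foundUrls as $url) {
        $handler->handle($url);
    }
}

$passiveHandler = new SketchPassiveHandler();
sketchCrawl(['/logo.png', '/favicon.ico'], $passiveHandler);

$collectingHandler = new SketchCollectingHandler();
sketchCrawl(['/logo.png', '/favicon.ico'], $collectingHandler);
```

Because the crawler stand-in only depends on the interface, swapping the passive handler for one that collects, downloads, or logs resources requires no change to the crawling code itself.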

Crawling everything at once

If you don't want to set up each crawler yourself, the HolisticDocumentCrawler does all the work for you:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$holisticDocumentCrawler = new HolisticDocumentCrawler($document);

// Get all anchors:
$anchors = $holisticDocumentCrawler->getAnchors();

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();

The HolisticDocumentCrawler can also be initialised using the createFromUrl method:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = HolisticDocumentCrawler::createFromUrl('https://www.bitandblack.com');

Help

If you have any questions, feel free to contact us at hello@bitandblack.com.

Further information about Bit&Black can be found at www.bitandblack.com.
