sukohi/search-bot

Laravel package to crawl websites.

1.0.5 2017-02-15 09:30 UTC

Last update: 2024-12-21 22:13:18 UTC


Laravel package to crawl websites. (Laravel 5+)

Requirements

Installation

Run the following command:

composer require sukohi/search-bot:1.*

Register the service providers in config/app.php:

'providers' => [
    ...Others...,
    Sukohi\SearchBot\SearchBotServiceProvider::class,
    Sukohi\LaravelAbsoluteUrl\LaravelAbsoluteUrlServiceProvider::class, 
]

Also add the aliases:

'aliases' => [
    ...Others...,
    'LaravelAbsoluteUrl' => Sukohi\LaravelAbsoluteUrl\Facades\LaravelAbsoluteUrl::class,
    'SearchBot' => Sukohi\SearchBot\Facades\SearchBot::class,
]

Then run the following commands:

php artisan vendor:publish
php artisan migrate

You now have config/search_bot.php, where you can set domain restrictions.

Config

return [

    'main' => '*',
    'yahoo' => ['yahoo.com', 'www.yahoo.com'],
    'reddit' => ['www.reddit.com']

];
  • If you don't need any restriction, set the value to '*'.
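
To illustrate how a restriction like the ones above might be applied, here is a minimal sketch. It is not the package's actual code: the function name isHostAllowed is hypothetical, and it assumes that '*' allows every host and that an array lists the exact hostnames permitted.

```php
<?php

/**
 * Hypothetical helper (not part of sukohi/search-bot): decides whether a
 * URL's host passes a restriction shaped like an entry in
 * config/search_bot.php.
 *
 * @param string       $url         URL to check
 * @param string|array $restriction '*' or a list of allowed hostnames
 * @return bool
 */
function isHostAllowed($url, $restriction)
{
    // Assumption: '*' means "no restriction".
    if ($restriction === '*') {
        return true;
    }

    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() yields null (or false for malformed URLs) when no host is found.
    if (!is_string($host)) {
        return false;
    }

    // Hostnames are case-insensitive, so compare in lower case.
    $allowed = array_map('strtolower', (array) $restriction);

    return in_array(strtolower($host), $allowed, true);
}

$config = [
    'main' => '*',
    'yahoo' => ['yahoo.com', 'www.yahoo.com'],
    'reddit' => ['www.reddit.com'],
];

var_dump(isHostAllowed('http://www.yahoo.com/news', $config['yahoo'])); // bool(true)
var_dump(isHostAllowed('http://example.com/', $config['yahoo']));       // bool(false)
var_dump(isHostAllowed('http://example.com/', $config['main']));        // bool(true)
```

Exact hostname matching is used here, so subdomains would have to be listed one by one, as 'yahoo.com' and 'www.yahoo.com' are in the config above.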

Usage

$starting_url = 'http://yahoo.com';
$options = [
    'type' => 'main', // Optional (default: 'main')
    'url_deletion' => true  // Default: true
];
$result = \SearchBot::request($starting_url, $options);

if($result->exists()) {

    // Symfony\Component\BrowserKit\Response
    // See http://api.symfony.com/2.3/Symfony/Component/BrowserKit/Response.html
    $response = $result->response();

    // Symfony\Component\DomCrawler\Crawler
    // See http://api.symfony.com/2.3/Symfony/Component/DomCrawler/Crawler.html
    $crawler = $result->crawler();

    $result->links(function($url, $text){

        // Every link arrives here with its URL and text.

    });

    $result->queues(function($crawler_queue, $url, $text){

        // Links that do not yet exist in the DB arrive here.
        // $crawler_queue already has its type and url set.
        $crawler_queue->save();

    });

} else {

    $e = $result->exception();
    echo $e->getMessage();
    $type = $result->type();
    $url = $result->url();

}
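
The links() callback above receives each link's URL and text. As a rough standalone illustration of that idea, the sketch below extracts URL/text pairs from an HTML string with PHP's built-in DOM extension. It does not use this package or Symfony's DomCrawler, the function name extractLinks is hypothetical, and relative URLs are returned as-is rather than resolved.

```php
<?php

/**
 * Hypothetical helper (not part of sukohi/search-bot): returns url/text
 * pairs for every <a> element in $html that has an href attribute.
 *
 * @param string $html Non-empty HTML markup
 * @return array
 */
function extractLinks($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (!$anchor->hasAttribute('href')) {
            continue; // Skip named anchors that have no target.
        }

        $links[] = [
            'url' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }

    return $links;
}

$html = '<html><body>'
    . '<a href="/about">About us</a> '
    . '<a name="top">Top</a> '
    . '<a href="http://example.com/">Example</a>'
    . '</body></html>';

foreach (extractLinks($html) as $link) {
    echo $link['url'], ' => ', $link['text'], PHP_EOL;
}
```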

Options

  • type

    A free-form string of your choosing.
    The default is main.

  • url_deletion

    If true, the accessed URL is removed from the DB.
    The default is true.
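
The effect of url_deletion can be pictured with a small in-memory stand-in for the queue table. This is only a sketch under assumptions: the class InMemoryQueue and its methods are hypothetical and are not the package's model or database logic.

```php
<?php

/**
 * Hypothetical in-memory stand-in for the crawler queue table.
 * It only illustrates the url_deletion idea described above.
 */
class InMemoryQueue
{
    /** @var array Rows keyed by "type|url" */
    private $rows = [];

    public function add($type, $url)
    {
        $this->rows[$type . '|' . $url] = ['type' => $type, 'url' => $url];
    }

    // Called after a URL has been accessed; removes the row when
    // $urlDeletion is true and leaves it in place otherwise.
    public function markAccessed($type, $url, $urlDeletion = true)
    {
        if ($urlDeletion) {
            unset($this->rows[$type . '|' . $url]);
        }
    }

    public function has($type, $url)
    {
        return isset($this->rows[$type . '|' . $url]);
    }
}

$queue = new InMemoryQueue();
$queue->add('main', 'http://yahoo.com');
$queue->add('main', 'http://yahoo.com/news');

$queue->markAccessed('main', 'http://yahoo.com');             // url_deletion = true
$queue->markAccessed('main', 'http://yahoo.com/news', false); // url_deletion = false

var_dump($queue->has('main', 'http://yahoo.com'));      // bool(false)
var_dump($queue->has('main', 'http://yahoo.com/news')); // bool(true)
```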

License

This package is licensed under the MIT License.
Copyright 2017 Sukohi Kuhoh