kaishiyoku / hera-rss-crawler
Modern library to handle RSS/Atom feeds
Requires
- php: ^8.1
- ext-dom: *
- ext-json: *
- ext-libxml: *
- ext-simplexml: *
- guzzlehttp/guzzle: ^7.5.0
- illuminate/support: ^9.0|^10.0|^11.0
- laminas/laminas-feed: ^2.20.0
- laminas/laminas-xml: ^1.5.0
- monolog/monolog: ^2.9.1|^3.3.1
- nesbot/carbon: ^2.66.0
- symfony/css-selector: ^5.4.21|^6.2.7
- symfony/dom-crawler: ^5.4.21|^6.2.7
Requires (Dev)
- laravel/pint: ^1.13
- mockery/mockery: ^1.5.1
- phpdocumentor/reflection-docblock: ^5.3.0
- phpstan/phpstan: ^1.10.6
- phpunit/phpunit: ^9.6.4|^10.0.15
- spatie/phpunit-snapshot-assertions: ^4.2.16
- symfony/var-dumper: ^5.4.21|^6.2.7
- symfony/yaml: ^5.4.21|^6.2.7
- dev-master
- 6.2.1
- 6.2.0
- 6.1.0
- 6.0.3
- 6.0.2
- 6.0.1
- 6.0.0
- 5.1.10
- 5.1.9
- 5.1.8
- 5.1.7
- 5.1.6
- 5.1.5
- 5.1.4
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.0
- 4.0.0
- 3.1.1
- 3.1.0
- 3.0.2
- 3.0.1
- 3.0.0
- 2.2.0
- 2.1.0
- 2.0.0
- 1.1.4
- 1.1.3
- 1.1.2
- 1.1.1
- 1.1.0
- 1.0.1
- 1.0.0
- 0.11.1
- 0.11.0
- 0.10.0
- 0.9.0
- 0.8.0
- 0.7.0
- 0.6.0
- 0.5.2
- 0.5.1
- 0.5.0
- 0.4.1
- 0.4.0
- 0.3.6
- 0.3.5
- 0.3.4
- 0.3.3
- 0.3.2
- 0.3.1
- 0.3.0
- 0.2.0
- 0.1.0
This package is auto-updated.
Last update: 2024-12-13 10:38:29 UTC
README
This project tries to make fetching and parsing RSS feeds easier. With Hera RSS you can discover, fetch and parse RSS feeds.
Installation
- install the package:
composer require kaishiyoku/hera-rss-crawler
- create a new crawler instance:
$heraRssCrawler = new HeraRssCrawler();
- discover the feeds of a website, for example:
$feedUrls = $heraRssCrawler->discoverFeedUrls('https://laravel-news.com/');
- pick the feed you would like to use; if multiple feeds were discovered, pick one
- fetch and parse the feed:
$feed = $heraRssCrawler->parseFeed($feedUrls->get(0));
- fetch the articles:
$feedItems = $feed->getFeedItems();
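The steps above can be combined into one small helper. This is a minimal sketch and not part of the package: the function name `fetchFirstFeedItems` is hypothetical, and the crawler is passed in as a plain object so the helper relies only on the methods shown above. It assumes that `discoverFeedUrls` returns an Illuminate collection (so `first()` is available) and that the crawler class lives at `Kaishiyoku\HeraRssCrawler\HeraRssCrawler`.

```php
<?php

/**
 * Discover the feeds of a website, parse the first one and return its items.
 * Hypothetical helper; returns an empty array when nothing usable was found.
 */
function fetchFirstFeedItems(object $crawler, string $siteUrl): iterable
{
    // Step 1: discover feed URLs for the website.
    $feedUrls = $crawler->discoverFeedUrls($siteUrl);

    // Assumption: first() returns null for an empty collection.
    $feedUrl = $feedUrls->first();
    if ($feedUrl === null) {
        return [];
    }

    // Step 2: parseFeed() returns null if the URL is not a consumable feed.
    $feed = $crawler->parseFeed($feedUrl);
    if ($feed === null) {
        return [];
    }

    // Step 3: the articles of the feed.
    return $feed->getFeedItems();
}

// Usage (requires the package and network access):
// $items = fetchFirstFeedItems(
//     new \Kaishiyoku\HeraRssCrawler\HeraRssCrawler(),
//     'https://laravel-news.com/'
// );
```

Checking for `null` at both steps avoids calling `getFeedItems()` on a missing feed, which the shorter step-by-step snippets above do not guard against.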
Breaking Changes
Version 6.x
- dropped support for PHP 8.0
Version 5.x
- dropped support for PHP 7.4
Version 4.x
- dropped support for Laravel 8
Version 3.x
- The FeedItem method jsonSerialize has been renamed to toJson and doesn't return null anymore but throws a JsonException if the serialized JSON is invalid.
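For callers, this change means replacing a `null` check with a `try`/`catch`. The sketch below shows the same contract in plain PHP using `json_encode` with `JSON_THROW_ON_ERROR`. The helpers `encodeOrFail` and `encodeWithFallback` are hypothetical stand-ins; how `toJson` is implemented inside the package is not shown here.

```php
<?php

/**
 * Hypothetical stand-in for a toJson()-style method: it never returns null,
 * it throws a JsonException when the data cannot be encoded.
 */
function encodeOrFail(array $data): string
{
    return json_encode($data, JSON_THROW_ON_ERROR);
}

/**
 * Caller-side handling: turn the exception into a fallback value.
 */
function encodeWithFallback(array $data, string $fallback = '{}'): string
{
    try {
        return encodeOrFail($data);
    } catch (JsonException $exception) {
        // Before 3.x a null check was needed here; now the failure is an exception.
        return $fallback;
    }
}
```

With a real feed item, the same pattern would wrap the call to `$feedItem->toJson()` in the `try` block.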
Available crawler options
setRetryCount(int $retryCount): void
Determines how many times parsing or discovering feeds is retried when an exception occurs, e.g. if the feed was unreachable.
setLogger(LoggerInterface $logger): void
Set your own logger instance, e.g. a simple file logger.
setUrlReplacementMap(array $urlReplacementMap): void
Useful for websites which redirect to another subdomain when visiting the site, e.g. for Reddit.
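The three setters above could be applied together as in the following sketch. The helper `configureCrawler` is hypothetical, and the crawler and logger are typed as plain objects so the snippet does not depend on the package being installed. The format of the URL replacement map (URL to replace as key, replacement as value) is an assumption and should be checked against the package source.

```php
<?php

/**
 * Apply a few options to a crawler instance (hypothetical helper).
 * $crawler is expected to be a HeraRssCrawler, $logger a PSR-3 logger.
 */
function configureCrawler(object $crawler, object $logger): object
{
    // Retry discovering/parsing up to 3 times when an exception occurs.
    $crawler->setRetryCount(3);

    // Route the crawler's log output to your own logger.
    $crawler->setLogger($logger);

    // ASSUMPTION: the map is keyed by the URL to replace, with the
    // replacement as the value; the README does not spell out the format.
    $crawler->setUrlReplacementMap([
        'https://www.reddit.com' => 'https://old.reddit.com',
    ]);

    return $crawler;
}

// Usage (requires the package and monolog/monolog):
// $logger = new \Monolog\Logger('hera');
// $logger->pushHandler(new \Monolog\Handler\StreamHandler(__DIR__ . '/hera.log'));
// $crawler = configureCrawler(new \Kaishiyoku\HeraRssCrawler\HeraRssCrawler(), $logger);
```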
setFeedDiscoverers(Collection $feedDiscoverers): void
Sets the feed discoverers the crawler uses.
You can also write your own; just make sure to implement the FeedDiscoverer interface:
<?php

namespace Kaishiyoku\HeraRssCrawler\FeedDiscoverers;

use GuzzleHttp\Client;
use Illuminate\Support\Arr;
use Illuminate\Support\Collection;
use Illuminate\Support\Str;
use Kaishiyoku\HeraRssCrawler\Models\ResponseContainer;

/**
 * Discover feed URL by parsing a direct RSS feed url.
 */
class FeedDiscovererByContentType implements FeedDiscoverer
{
    public function discover(Client $httpClient, ResponseContainer $responseContainer): Collection
    {
        $contentTypeMixedValue = Arr::get($responseContainer->getResponse()->getHeaders(), 'Content-Type');
        $contentType = is_array($contentTypeMixedValue) ? Arr::first($contentTypeMixedValue) : $contentTypeMixedValue;

        // the given url is no valid RSS feed
        if (!$contentType || !Str::startsWith($contentType, ['application/rss+xml', 'application/atom+xml'])) {
            return new Collection();
        }

        return new Collection([$responseContainer->getRequestUrl()]);
    }
}
The default feed discoverers are as follows:
new Collection([
    new FeedDiscovererByContentType(),
    new FeedDiscovererByHtmlHeadElements(),
    new FeedDiscovererByHtmlAnchorElements(),
    new FeedDiscovererByFeedly(),
])
The ordering matters: the discoverers are called sequentially, and discovery stops as soon as one of them has found at least one feed URL.
Once a discoverer has found a feed, the remaining discoverers are not called.
If you mainly want to discover feeds via HTML anchor elements, FeedDiscovererByHtmlAnchorElements should be the first discoverer in the collection.
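The effect of the ordering can be illustrated without the package. The following sketch models discoverers as plain callables and stops at the first non-empty result. The function `discoverSequentially` and the two closures are hypothetical stand-ins, not the crawler's actual code.

```php
<?php

/**
 * Run discoverers in order and stop at the first one that finds something.
 *
 * @param callable[] $discoverers each takes a URL and returns a list of feed URLs
 * @return string[]
 */
function discoverSequentially(array $discoverers, string $url): array
{
    foreach ($discoverers as $discoverer) {
        $feedUrls = $discoverer($url);

        // The first non-empty result wins; later discoverers are skipped.
        if ($feedUrls !== []) {
            return $feedUrls;
        }
    }

    return [];
}

// Hypothetical stand-ins for two discoverers.
$byHeadElements = fn (string $url): array => [$url . '/feed.xml'];
$byAnchorElements = fn (string $url): array => [$url . '/blog/rss', $url . '/news/rss'];

// Same discoverers, different order, different result.
$headFirst = discoverSequentially([$byHeadElements, $byAnchorElements], 'https://example.com');
$anchorFirst = discoverSequentially([$byAnchorElements, $byHeadElements], 'https://example.com');
```

Here `$headFirst` contains only the URL found by the head-element stand-in, while `$anchorFirst` contains only the two anchor URLs, because the second discoverer is never consulted in either case.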
Available crawler methods
parseFeed(string $url): ?Feed
Fetch and parse the feed at a given feed url. Returns null if no consumable RSS feed is found.
discoverAndParseFeeds(string $url): Collection
Discover feeds from a website url and return all parsed feeds in a collection.
discoverFeedUrls(string $url): Collection
Discover feeds from a website url and return all found feed urls in a collection. There are multiple ways the crawler tries to discover feeds. The order is as follows:
- discover feed urls by content type
if the given url is already a valid feed, return this url
- discover feed urls by HTML head elements
find all feed urls inside an HTML document
- discover feed urls by HTML anchor elements
get all anchor elements of an HTML document and return the urls of those which include rss in their urls
- discover feed urls by Feedly
fetch feed urls using the Feedly API
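To make the second strategy more concrete, here is a rough sketch of discovery by HTML head elements using PHP's DOM extension. It is not the package's implementation: the function name `findFeedLinksInHead` is hypothetical, and the sketch assumes feeds are advertised through `<link rel="alternate">` elements with an RSS or Atom content type.

```php
<?php

/**
 * Collect feed URLs advertised via <link rel="alternate"> in an HTML document.
 *
 * @return string[]
 */
function findFeedLinksInHead(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    $feedUrls = [];

    foreach ($document->getElementsByTagName('link') as $link) {
        $isAlternate = strtolower($link->getAttribute('rel')) === 'alternate';
        $isFeedType = in_array(strtolower($link->getAttribute('type')), $feedTypes, true);

        if ($isAlternate && $isFeedType && $link->getAttribute('href') !== '') {
            $feedUrls[] = $link->getAttribute('href');
        }
    }

    return $feedUrls;
}
```

The returned URLs are taken verbatim from the `href` attributes, so relative URLs would still need to be resolved against the page URL before fetching.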
discoverFavicon(string $url): ?string
Fetch the favicon of the feed's website. Returns null if none is found.
checkIfConsumableFeed(string $url): bool
Check if a given url is a consumable RSS feed.
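One possible use of this check is filtering a list of candidate URLs before parsing them. The helper `keepConsumableFeeds` below is a hypothetical sketch; it takes the crawler as a plain object and relies only on `checkIfConsumableFeed`.

```php
<?php

/**
 * Keep only the URLs that the crawler reports as consumable feeds.
 *
 * @param string[] $urls
 * @return string[]
 */
function keepConsumableFeeds(object $crawler, array $urls): array
{
    // array_values() reindexes the result after filtering.
    return array_values(array_filter(
        $urls,
        fn (string $url): bool => $crawler->checkIfConsumableFeed($url)
    ));
}

// Usage (requires the package and network access):
// $feeds = keepConsumableFeeds(
//     new \Kaishiyoku\HeraRssCrawler\HeraRssCrawler(),
//     ['https://laravel-news.com/feed', 'https://laravel-news.com/']
// );
```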
Contribution
Found any issues or have an idea to improve the crawler? Feel free to open an issue or submit a pull request.
Plans for the future
- add a Laravel facade
Author
Email: dev@andreas-wiedel.de
Website: https://andreas-wiedel.de