The API Guys
Spatie Crawler v9: A Cleaner API for PHP Web Crawling

Quick Bytes · PHP · Laravel

Spatie's spatie/crawler package - one of the most widely used PHP web crawling libraries, and a core dependency of the Oh Dear uptime monitoring service - has shipped v9 with a significant API overhaul. The package has been around for years and had accumulated rough edges; v9 cleans those up and adds several long-requested features. It is a breaking-change release, so upgrading requires attention, but the new API is meaningfully cleaner.

A fluent closure interface

The most visible change in v9 is the replacement of the observer class pattern with simple closure callbacks. Where v8 required implementing a CrawlObserver class, you can now attach callbacks directly on the crawler:

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();

Four callbacks are available: onCrawled(), onFailed(), onFinished(), and onWillCrawl(). Each receives the new CrawlResponse object and a CrawlProgress instance, so you always know where you are in the crawl.
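Taken together, a fuller setup might look like the sketch below. The callback signatures are inferred from the description above (URL string, CrawlResponse, CrawlProgress), and the namespace imports are assumptions - the exact parameter lists are the package's to define:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\CrawlProgress;

Crawler::create('https://example.com')
    ->onWillCrawl(function (string $url, CrawlProgress $progress) {
        echo "Queued {$url} ({$progress->urlsPending} pending)\n";
    })
    ->onFailed(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
        echo "Failed {$url} with status {$response->status()}\n";
    })
    ->onFinished(function (CrawlProgress $progress) {
        echo "Crawled {$progress->urlsProcessed} pages, {$progress->urlsFailed} failed\n";
    })
    ->start();
```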

CrawlResponse replaces ResponseInterface

The raw PSR-7 ResponseInterface that was previously passed to observers has been replaced by a purpose-built CrawlResponse object. It exposes a clean set of methods - status(), body(), dom(), header(), isSuccessful(), isRedirect(), foundOnUrl(), linkText(), and depth() - covering what you actually need when processing a crawled page. The body is cached internally, so calling body() multiple times does not re-read the stream. The raw PSR-7 response is still accessible via toPsrResponse() if needed.
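A sketch of processing a page with these accessors - the method names are those listed above, while the namespace imports and the header() argument style are assumptions:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        if ($response->isRedirect()) {
            return; // skip redirects
        }

        if (! $response->isSuccessful()) {
            echo "Skipping {$url}: HTTP {$response->status()}\n";
            return;
        }

        // body() is cached internally, so repeated calls do not re-read the stream
        $wordCount = str_word_count(strip_tags($response->body()));

        echo "{$url} (depth {$response->depth()}, found on {$response->foundOnUrl()}): ";
        echo "{$wordCount} words, Content-Type {$response->header('Content-Type')}\n";
    })
    ->start();
```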

Progress tracking and finish reasons

Every callback now receives a CrawlProgress object with four properties: urlsProcessed, urlsFailed, urlsFound, and urlsPending. This makes it straightforward to display progress or make decisions mid-crawl without maintaining your own counters.

The start() method now returns a FinishReason enum rather than void, with values for Completed, CrawlLimitReached, TimeLimitReached, and Interrupted. You now know definitively why the crawler stopped.
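Combining the two, a progress readout plus a check on why the crawl ended might look like this sketch (the enum's namespace is an assumption):

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\CrawlProgress;
use Spatie\Crawler\FinishReason;

$reason = Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
        $total = $progress->urlsProcessed + $progress->urlsFailed + $progress->urlsPending;
        echo "[{$progress->urlsProcessed}/{$total}] {$url}\n";
    })
    ->start();

echo match ($reason) {
    FinishReason::Completed => "Crawled everything reachable\n",
    FinishReason::CrawlLimitReached => "Hit the configured URL limit\n",
    FinishReason::TimeLimitReached => "Ran out of time\n",
    FinishReason::Interrupted => "Stopped by a signal\n",
};
```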

Built-in throttling

v9 ships two throttling strategies out of the box. FixedDelayThrottle applies a fixed pause between requests. AdaptiveThrottle adjusts the delay dynamically based on how quickly the target server is responding - slowing down when the server is under load and speeding up when it is not. Both are passed to a throttle() method on the crawler.
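In use, that might look like the following sketch; the constructor arguments shown are assumptions, since only the two class names are documented above:

```php
use Spatie\Crawler\Crawler;

// A fixed pause between every request
Crawler::create('https://example.com')
    ->throttle(new FixedDelayThrottle(500)) // delay in ms - argument is an assumption
    ->start();

// Or let the crawler adapt to how quickly the server responds
Crawler::create('https://example.com')
    ->throttle(new AdaptiveThrottle())
    ->start();
```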

Testing with fake()

Following the same pattern as Laravel's HTTP client, the crawler now supports a fake() method that lets you define expected responses for specific URLs without making real HTTP requests. This makes unit testing crawl logic considerably more straightforward.
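If the pattern mirrors Laravel's Http::fake(), a test might look roughly like this sketch - the URL-to-body map and the fake() signature are assumptions:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::fake([
    'https://example.com' => '<html><head><title>Home</title></head><body><a href="/about">About</a></body></html>',
    'https://example.com/about' => '<html><head><title>About</title></head></html>',
]);

$titles = [];

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$titles) {
        preg_match('/<title>(.*?)<\/title>/', $response->body(), $m);
        $titles[$url] = $m[1] ?? null;
    })
    ->start();

// $titles now holds the faked pages' titles - no real HTTP traffic involved
```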

Other additions worth noting

  • TransferStatistics - each CrawlResponse exposes a transferStats() method with typed timing accessors: DNS lookup time, TLS handshake time, time to first byte, download speed, and total transfer time.
  • stream() - opt-in streaming HTTP responses for reduced memory usage on large pages.
  • retry() - automatic retries on connection errors and 5xx responses.
  • matchWww() - treats www.example.com and example.com as the same origin when using internalOnly().
  • Graceful shutdown - the crawler now handles SIGINT and SIGTERM cleanly.
  • CloudflareRenderer - a new driver for JavaScript rendering, with Browsershot moved to a suggested dependency rather than a required one.
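Several of these compose naturally on one crawler instance. A sketch, with the caveat that the retry() argument is an assumption:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->internalOnly()
    ->matchWww()   // www.example.com and example.com count as the same origin
    ->stream()     // stream bodies to reduce memory on large pages
    ->retry(3)     // retry connection errors and 5xx responses - attempt count is a guess
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();
```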

Breaking changes to be aware of

v9 is a major version with breaking changes. The key ones: URLs are now plain strings throughout (no more UriInterface), CrawlProfile is now an interface rather than an abstract class, CrawlObserverCollection no longer implements ArrayAccess or Iterator, and the URL argument is now required in Crawler::create(). The full list of breaking changes is in the package's UPGRADING.md.
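For example, a custom profile now implements the interface against plain string URLs. The shouldCrawl() method name and setCrawlProfile() are carried over from earlier versions and are assumptions for v9:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class DocsOnlyProfile implements CrawlProfile
{
    // URLs arrive as plain strings in v9, not UriInterface
    public function shouldCrawl(string $url): bool
    {
        return str_contains($url, '/docs/');
    }
}

Crawler::create('https://example.com')
    ->setCrawlProfile(new DocsOnlyProfile())
    ->start();
```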

If you are already using Spatie packages in your Laravel project - as many teams are, given how widely adopted the Spatie ecosystem is - this is a well-considered upgrade that makes the crawler substantially more pleasant to work with. The closure-based API in particular is a significant quality-of-life improvement over maintaining observer classes for straightforward crawl tasks.

Ready to Start Your Project?

Get in touch with our Leeds-based team to discuss your Laravel or API development needs.