With the fundamentals of both static and dynamic content scraping under our belt, it's time to tackle a more comprehensive challenge: multi-page crawling. This section focuses on efficiently navigating and extracting data from websites with multiple interconnected pages.
There are two main approaches to crawling multi-page websites:
- Link-based crawling - Following links between pages
- Sitemap-based crawling - Using the sitemap.xml file
For sitemap crawling, most websites provide a sitemap.xml file that lists all important URLs. This structured XML file includes:
- Page URLs
- Last modified dates
- Change frequency
- Priority values
Using the sitemap can be more efficient than link crawling (a short fetch-and-parse sketch follows this list), since it:
- Provides a complete list of pages upfront
- Includes metadata about page importance and freshness
- Avoids crawling unnecessary pages
- Reduces server load
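Reading a sitemap takes only a few lines of code. Here is a minimal sketch; the getSitemapUrls helper and the /sitemap.xml path are assumptions for illustration, and it relies on Node 18+ for the global fetch and on the cheerio package for XML parsing:
import * as cheerio from 'cheerio';

// Fetch a site's sitemap.xml and return every listed page URL
async function getSitemapUrls(siteUrl: string): Promise<string[]> {
    const response = await fetch(new URL('/sitemap.xml', siteUrl));
    const xml = await response.text();

    // Parse the XML and collect the <loc> entries (the page URLs)
    const $ = cheerio.load(xml, { xmlMode: true });
    return $('url > loc')
        .map((_, el) => $(el).text().trim())
        .get();
}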
But for this chapter, we'll focus on link-based crawling, using Crawlee to build a crawler for a multi-page e-commerce site. Crawlee handles many of the complexities of web crawling for us (a minimal end-to-end sketch follows this list), including:
- Automatic queue management and URL deduplication
- Built-in rate limiting and retry logic
- Configurable request handling and routing
- Data storage and export
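Before digging into each of these, here is what the smallest possible Crawlee crawler looks like end to end. The start URL is a placeholder, and maxRequestsPerCrawl is just a safety cap while experimenting:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Called once for every successfully fetched page
    async requestHandler({ $, request, enqueueLinks, log }) {
        log.info(`${request.loadedUrl}: ${$('title').text()}`);
        // Follow links found on the page (same-hostname links by default)
        await enqueueLinks();
    },
    // Safety cap so an experiment can't run away
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com']);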
The site structure we'll be crawling looks like this:
Homepage
├── Category: Electronics
│   ├── Phones
│   ├── Laptops
│   └── Accessories
├── Category: Clothing
│   ├── Men's
│   └── Women's
└── Featured Products
Product pages have different layouts depending on their category, but we need to extract consistent information from all of them:
// Example data structure we want to build
interface ProductData {
  name: string;
  price: number;
  rating: { score: number, count: number };
  features: string[];
  status: string; // In Stock, Out of Stock, etc.
}
interface ResultData {
  categories: {
    electronics: {
      phones: ProductData[];
      laptops: ProductData[];
      accessories: ProductData[];
    };
    clothing: {
      mens: {
        shirts: ProductData[];
        pants: ProductData[];
      };
      womens: {
        dresses: ProductData[];
        tops: ProductData[];
      };
    };
  };
  featured_products: FeaturedProduct[];
}
Key Crawling Concepts with Crawlee
- Request Queue Management
Crawlee handles the queue automatically, but here's how we configure it:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    // Handles each request
    async requestHandler({ $, request, enqueueLinks }) {
        // Process the page
        const data = extractPageData($);
        // Automatically queue new URLs found on the page
        await enqueueLinks({
            selector: 'a',
            baseUrl: request.loadedUrl,
        });
    },
    // Limit concurrent requests
    maxConcurrency: 10,
});
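In practice you rarely want to enqueue every <a> on the page. enqueueLinks also accepts glob patterns and a label, so you can queue only the URLs you care about and tag them for routing later. A sketch that would sit inside the requestHandler above (the domain, globs, and label names are placeholders):
// Queue category pages and product pages separately, tagging each
// request with a label the request handler can switch on later.
await enqueueLinks({
    globs: ['https://example.com/category/**'],
    label: 'category',
});
await enqueueLinks({
    globs: ['https://example.com/product/**'],
    label: 'product',
});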
- URL Handling
Crawlee provides built-in URL handling and normalization:
await crawler.run([startUrl]);
// Or with more configuration:
await crawler.addRequests([{
    url: startUrl,
    userData: {
        label: 'start',
    },
}]);
- Route Handling
Route different URLs to specific handlers:
const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const { label } = request.userData;
        switch (label) {
            case 'category':
                return handleCategory($);
            case 'product':
                return handleProduct($);
            default:
                return handleHomepage($);
        }
    },
});
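Crawlee also ships a small router utility that can replace the switch statement. Here is the same routing sketched with createCheerioRouter (handler bodies omitted; the labels mirror the ones above):
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Requests labeled 'category' or 'product' get dedicated handlers;
// anything unlabeled (such as the start URL) falls through to the default.
router.addHandler('category', async ({ $, enqueueLinks }) => { /* ... */ });
router.addHandler('product', async ({ $, pushData }) => { /* ... */ });
router.addDefaultHandler(async ({ enqueueLinks }) => { /* ... */ });

const crawler = new CheerioCrawler({ requestHandler: router });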
- Data Collection
Crawlee provides built-in storage for collected data:
const crawler = new CheerioCrawler({
    async requestHandler({ $, pushData }) {
        const productData = extractProduct($);
        await pushData(productData);
    },
});
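Everything passed to pushData ends up in the default dataset (stored under storage/datasets/default when running locally). After the crawl finishes you can read it back, for example:
import { Dataset } from 'crawlee';

// After crawler.run() resolves, read back everything that was pushed
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Collected ${items.length} products`);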
Web Crawling Best Practices
While Crawlee handles many low-level concerns, you should still consider:
- Configuration: set appropriate rate limits, configure retry strategies, and use a meaningful user-agent string (a configuration sketch follows this list)
- Error Handling: use Crawlee's built-in error handling, implement custom error callbacks, and log meaningful diagnostic information
- Data Organization: structure your data consistently, use request labels for routing, and leverage Crawlee's dataset features
- Resource Management: configure maxConcurrency appropriately, use maxRequestsPerCrawl when needed, and monitor memory usage
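As promised above, here is what that configuration can look like on a CheerioCrawler. The numbers are illustrative rather than recommendations, requestHandler can be any handler (such as the router sketched earlier), and a custom user-agent would go through preNavigationHooks, which we leave out here:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: router,      // e.g. the router from the earlier sketch
    maxConcurrency: 5,           // cap on parallel requests
    maxRequestsPerMinute: 120,   // overall rate limit
    maxRequestRetries: 3,        // how often a failed request is retried
    maxRequestsPerCrawl: 500,    // hard stop for runaway crawls
    // Called once a request has exhausted all of its retries
    failedRequestHandler({ request, log }) {
        log.error(`Request to ${request.url} failed too many times`);
    },
});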
 
The Challenge
Your task is to build a Crawlee-based crawler that:
- Starts at the homepage and discovers all product categories
- Visits each category and subcategory page
- Extracts product information from each listing
- Organizes data into a structured format
- Handles products that appear in multiple places (e.g., featured and category)
The site contains approximately 25-30 products across different categories, with varying layouts and information structures. Your crawler should produce a comprehensive dataset that maintains the hierarchical relationship between categories and products.
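There is more than one way to turn Crawlee's flat dataset into the nested ResultData shape shown earlier. One possible post-processing sketch, assuming each pushed item was tagged with the category path it was discovered under (a convention invented here for illustration, not necessarily what the solved example does):
import { Dataset } from 'crawlee';

interface StoredItem {
    categoryPath: string[];   // e.g. ['electronics', 'phones'] -- our own convention
    product: ProductData;     // the ProductData interface from earlier
}

const { items } = await (await Dataset.open()).getData();

const categories: Record<string, Record<string, ProductData[]>> = {};
const seen = new Set<string>();

for (const { categoryPath, product } of items as unknown as StoredItem[]) {
    // Skip products we already collected elsewhere (e.g. in Featured Products)
    if (seen.has(product.name)) continue;
    seen.add(product.name);

    // Group each product under its category and subcategory
    const [category, subcategory] = categoryPath;
    categories[category] ??= {};
    (categories[category][subcategory] ??= []).push(product);
}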
Testing Your Solution
Test for:
- Completeness: Did you find all products?
- Accuracy: Is the extracted data correct?
- Structure: Is the data organized properly?
- Efficiency: How many requests did you make?
The solved example in _solved/chapter6/ provides a reference implementation using Crawlee. Study it to understand how to leverage the library's features for efficient multi-page crawling and data organization.
Happy crawling!