You’ve built the perfect web scraper. It extracts data beautifully—titles, prices, descriptions—everything you need. But after the first page, your scraper stops. No errors, no warnings, just… nothing. You check the site, and there it is: pagination. That sneaky mechanism websites use to split content across multiple pages just broke your scraper. Do web scraping pagination challenges sound familiar?
Pagination is the boss battle of web scraping. If you can’t handle it, your data extraction stops at level one. Whether you’re scraping an e-commerce site for product prices, gathering business leads, or tracking stock market data, sooner or later, you’ll run into pagination. And if you don’t get it right, you’re missing out on most of the data.
But here’s the good news: pagination can be cracked. Whether it’s a simple ?page=2 in the URL, an API with limit and offset, or the dreaded infinite scroll, there’s always a way. And that’s exactly what we’re going to solve in this guide.
What is Web Scraping Pagination?
Pagination is how websites split large amounts of data across multiple pages. Instead of loading thousands of items at once—which would slow down everything—websites break them into smaller chunks, usually 10, 20, or 50 items per page. Think of it like flipping through pages of a book instead of reading an endless scroll of text.
For example:
- Amazon’s search results use numbered pagination like ?page=2.
- Pinterest keeps scrolling forever. New content loads dynamically as you scroll.
- Twitter mixes both worlds. It has an API with paginated results, but the website loads tweets dynamically.
For web scrapers, pagination means you can’t just grab everything in one request. You have to figure out how the site loads the next set of data and adapt.
Also read: Inspect Element Hacks: Techniques for Analyzing Websites
Why is Pagination Important in Web Scraping?
If you’re only scraping the first page of a website, you’re barely scratching the surface. Most of the valuable data lives on the next pages. Imagine:
- Scraping product prices from an e-commerce store but only collecting the first 20 items.
- Monitoring real estate listings but missing everything after page one.
- Analyzing news articles but only grabbing today’s headlines while missing the rest.
Without handling pagination, your dataset is incomplete. And incomplete data is useless data.
Also read: The Right Way of Collecting Data for Machine Learning
Common Challenges in Scraping Paginated Websites
Pagination sounds simple. Just go to the next page, right? If only it were that easy. Websites do not make it easy for scrapers. They throw curveballs like:
- Hidden or dynamic pagination. Some sites don’t show direct links to page 2, 3, 4… Instead, they use AJAX to load data dynamically. You won’t find ?page=2 in the URL. You’ll need to dig into network requests.
- Infinite scroll (Lazy Loading). Platforms like Pinterest never show a “Next” button. Instead, they load more content when you scroll. If you don’t simulate user actions, your scraper will only see the first set of data.
- API-based pagination with hidden parameters. Many websites offer paginated APIs, but they require authentication, tokens, or a special cursor value that changes with each request. Scrapers need to track these responses and extract the next page’s key dynamically.
- Rate limiting and anti-bot measures. Some sites will block your IP if you scrape too aggressively. Others use CAPTCHAs, session tokens, or request headers to detect scrapers.
Let’s take Amazon as an example. You’d think their pagination is as simple as ?page=2, right? Nope. They hide pagination behind complex JavaScript and use anti-bot mechanisms to block automated requests. If you don’t handle it properly, you’ll get blocked within minutes.
Also read: Five Reasons to Never Use Free Proxies for Web Scraping
Understanding Different Types of Pagination
Pagination is like a bouncer at a nightclub. It controls access to data and decides how much you can see at a time. As a web scraper, you need to figure out how to convince the bouncer to let you in page by page. The problem? Websites don’t all use the same system.
Sometimes it’s simple: just change ?page=2 in the URL. Other times, it’s a JavaScript-powered nightmare that hides data behind AJAX requests. And then there’s infinite scrolling, where content loads as you scroll because apparently, clicking “Next” was too much work for users.
Every website is different, and using the wrong technique can lead to failed scrapes, slow performance, or getting blocked. Here’s a cheat sheet to help you pick the best approach:
Pagination Type | Best Scraping Approach | Tools to Use
--- | --- | ---
Traditional (?page=2) | Requests-based scraping | requests, BeautifulSoup
JavaScript/AJAX (XHR) | Simulate requests or use a headless browser | requests, Selenium, Playwright
Infinite Scroll (Lazy Load) | Scroll automation | Selenium, Playwright
API-based (limit=50&offset=100) | Direct API calls | requests, Postman
Encrypted/API Calls | Reverse-engineering headers | requests, DevTools
So, let’s break it down. We’ll go through the four major types of pagination, how to spot them, and—most importantly—how to scrape them.
1. Traditional Pagination (Query Parameters in URL)
This is the simplest and most common form of pagination. The website just adds parameters to the URL, like:
- ?page=2
- ?offset=20
- &start=50&limit=10
When you navigate to another page, the URL changes, and each request fetches a new batch of data. You can scrape this with simple HTTP requests, making it one of the easiest to handle.
Where do you see it?
- Blogs (articles are paginated with ?page=2).
- E-commerce sites (product listings with ?page=3).
- Search results (pagination in Google, Amazon, eBay).
How to identify it?
- Look at the URL when clicking “Next Page”
  - If the URL changes from example.com/products → example.com/products?page=2, congratulations! You’ve got an easy scraper ahead.
- Check DevTools → Network tab
  - Open DevTools (F12 or right-click → Inspect).
  - Go to the Network tab, then click “Next Page”.
  - If you see a new request with a URL containing ?page=, that’s your pagination method.
How to scrape it?
Example of scraping the first 10 pages of a website with traditional pagination:
import requests
from bs4 import BeautifulSoup

def scrape_pages(base_url):
    # Pages 1 through 10
    for page in range(1, 11):
        response = requests.get(f"{base_url}?page={page}")
        soup = BeautifulSoup(response.content, "html.parser")
        yield soup

# Call the scraping function (generator)
for soup in scrape_pages("http://example.com"):
    # Process page content
    pass
✅ Easy to scrape
⚠️ Some sites may obfuscate URLs or add hidden tokens
2. JavaScript-Based Pagination (AJAX Requests)
Some websites don’t reload the page when you click “Next”. Instead, they fetch new data in the background using AJAX. This is a pain for scrapers because the HTML never actually updates with new data unless JavaScript runs.
Where do you see it?
- E-commerce sites that load products dynamically.
- News websites that update articles without a full refresh.
- Dashboard-like web apps (Google Analytics, social media insights).
How to identify it?
- Click “Next” and check the URL
  - If the URL stays the same, the site is loading data via AJAX.
- Check DevTools → Network Tab → XHR (Fetch Requests)
  - Open DevTools (F12), go to Network → XHR.
  - Click “Next Page” and watch for new requests being made.
  - If a request like example.com/api/get_products?page=2 appears, that’s your AJAX call!
How to scrape it?
Use Selenium to simulate user interactions:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://example.com/ajax-pagination")

while True:
    try:
        next_button = driver.find_element(By.LINK_TEXT, "Next")
        next_button.click()
        time.sleep(2)  # Wait for content to load
    except Exception:
        break  # No more pages

driver.quit()
✅ Handles JavaScript-based pagination
⚠️ Slower and more resource-intensive
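A faster alternative when the XHR endpoint is easy to replicate: call it directly with requests and skip the browser. Here’s a minimal sketch (the endpoint URL and parameter names are assumptions; copy the real ones from the Network tab):
import requests

# Hypothetical AJAX endpoint spotted in DevTools -> Network -> XHR
API_URL = "https://example.com/api/get_products"

def scrape_ajax_pages(max_pages=10):
    for page in range(1, max_pages + 1):
        response = requests.get(
            API_URL,
            params={"page": page},
            headers={"User-Agent": "Mozilla/5.0"},
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:  # Assume an empty payload means no more pages
            break
        yield batch

for products in scrape_ajax_pages():
    # Process each batch of products
    pass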
3. Infinite Scroll Pagination (Lazy Loading)
Instead of showing pages, new content appears as you scroll down. Websites do this using event listeners that detect scrolling and trigger AJAX requests to fetch more content.
Where do you see it?
- Social media feeds (Twitter, Instagram, Facebook).
- News websites that continuously load articles.
- E-commerce sites using “Load More” instead of numbered pages.
How to identify it?
- Scroll down and watch the content load.
- Check DevTools → Network → XHR
- Scroll down and see if new requests are made automatically.
- Look for JavaScript event listeners
- In DevTools, go to Elements → Event Listeners → scroll.
How to scrape it?
Example of scraping a site with dynamic pagination, clicking through until no more content loads:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def scrape_infinite_scroll(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Optional
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    while True:
        items = driver.find_elements(By.CSS_SELECTOR, ".item-selector")
        for item in items:
            # Process page contents
            pass
        # Check if a next page exists, then load it
        try:
            next_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, ".next-button"))
            )
            next_button.click()
            time.sleep(2)  # Wait for the new content to load
        except Exception as e:
            print(f"Next page doesn't exist: {e}")
            break
    driver.quit()

# Call the scraping function
scrape_infinite_scroll("https://example.com")
✅ Works for sites without page numbers
⚠️ Must detect when new content stops loading
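If the site has no button at all and loads purely on scroll, a common way to detect the end is to keep scrolling to the bottom and stop once the page height stops growing. A minimal sketch of that pattern:
from selenium import webdriver
import time

def scrape_by_scrolling(url, pause=2):
    driver = webdriver.Chrome()
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom to trigger the next lazy-load request
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # Give the AJAX call time to finish
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # Page stopped growing, so no more content
        last_height = new_height
    html = driver.page_source
    driver.quit()
    return html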
4. API-Based Pagination
Some websites offer APIs for structured data access, using pagination parameters. This is the most efficient way to scrape large datasets.
Types of API Pagination:
- Limit-Offset Pagination
  - Example: api.com/data?limit=50&offset=100
  - You control how many items you get per request.
- Cursor-Based Pagination
  - Example: api.com/data?cursor=xyz123
  - Instead of page numbers, the API returns a cursor for the next batch.
- Next Page URL Pagination
  - Example response:
{
"data": [...],
"next": "api.com/data?page=3"
}
How to identify it?
- Use Postman or DevTools → Network → XHR
- Find requests made to an API.
- Look at the JSON response
- If it contains “next”: “api.com/data?page=3”, you have API pagination.
How to scrape it?
Example of scraping an API with limit/offset pagination until no more data is returned:
import requests

def scrape_api_pagination(api_url, limit=10, offset=0):
    params = {"limit": limit, "offset": offset}
    response = requests.get(api_url, params=params)
    data = response.json()
    print(data)
    # If we got a full page, there is probably more data available
    if len(data) == limit:
        scrape_api_pagination(api_url, limit, offset + limit)

# Call the scraping function
scrape_api_pagination("https://api.example.com/data", 10, 0)
✅ Fast, clean, and structured
⚠️ Some APIs require authentication or tokens
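Cursor-based and next-URL pagination work differently: instead of computing offsets, you simply follow whatever pointer the API returns. A minimal sketch, assuming the response includes a “next” field like the example above (field names vary per API):
import requests

def follow_next_url(start_url):
    url = start_url
    while url:
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()
        yield payload.get("data", [])
        # Follow the pointer the API hands back; stop when it's missing
        url = payload.get("next")

for batch in follow_next_url("https://api.example.com/data"):
    # Process each batch of records
    pass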
Also read: Well Paid Web Scraping Projects
Handling Hybrid Pagination Systems
Most websites stick to one pagination method—either traditional, AJAX-based, infinite scroll, or API-driven. But every once in a while, you’ll come across a hybrid pagination system that combines multiple methods, making scraping a real challenge.
These cases aren’t common, but when they appear, you need a modular approach to break them down.
The most efficient way to handle a combination of pagination methods is through a modular approach, where each type of pagination is treated separately with specialized functions.
Source: Alin Andrei, Software Developer
Imagine a blog homepage that uses traditional pagination (?page=2) for navigating blog categories and has an infinite scroll carousel under each category to load additional posts.
If you treat this as one single pagination system, you’ll end up frustrated. Instead, break it into two separate tasks (a combined code sketch follows the lists below):
Scrape the category pages first
- Identify the main pagination system (?page=2).
- Extract all category links.
- Navigate through the numbered pages using requests + BeautifulSoup.
Handle the carousels separately
- Use Selenium to simulate right-arrow clicks on the infinite scroll carousel.
- Implement a wait mechanism to detect when new posts load.
- Stop when the carousel reaches the last post.
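A rough sketch of that split, assuming hypothetical selectors (.category-link, .carousel-next) and URLs:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def scrape_category_pages(base_url, max_pages=10):
    # Task 1: traditional ?page=N pagination, handled with requests
    category_links = []
    for page in range(1, max_pages + 1):
        html = requests.get(f"{base_url}?page={page}").content
        soup = BeautifulSoup(html, "html.parser")
        category_links += [a.get("href") for a in soup.select("a.category-link")]
    return category_links

def scrape_carousel(category_url):
    # Task 2: infinite-scroll carousel, handled with Selenium
    driver = webdriver.Chrome()
    driver.get(category_url)
    while True:
        try:
            driver.find_element(By.CSS_SELECTOR, ".carousel-next").click()
            time.sleep(2)  # Wait for the next batch of posts to load
        except Exception:
            break  # No more posts to load
    html = driver.page_source
    driver.quit()
    return html

for link in scrape_category_pages("https://example.com/blog"):
    scrape_carousel(link)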
Also read: Web Scraping With Proxies
Avoiding Anti-Scraping Measures and IP Blocking
Let’s be honest—websites don’t like scrapers. They’ll go to great lengths to keep you out.
You’ve probably been there. Your scraper works beautifully for the first few pages… and then boom! You hit a 429 Too Many Requests error. Or worse—the entire site blocks your IP.
This isn’t a coincidence. Websites have anti-scraping measures in place to detect bots and shut them down. If you’re not careful, you’ll burn your IP address within minutes and be locked out for good.
But don’t worry. Let’s go through how websites detect scrapers and, more importantly, how to stay undetected.
Rate Limiting: Handling 429 Too Many Requests
Rate limiting is like a speed camera for web requests. If you send too many requests too fast, the website slams the brakes with a 429 Too Many Requests error. Some sites even permanently ban your IP if you keep pushing.
How to Check if a Site Uses Rate Limiting?
Send multiple requests quickly and watch for a 429 error. Check the response headers. Some sites tell you how many requests you’re allowed:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 10
X-RateLimit-Reset: 60
This means you can make 100 requests per minute before hitting the limit.
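You can read those headers straight off the response and pace your scraper accordingly. A small sketch (not every site sends these headers, so treat them as optional):
import requests
import time

response = requests.get("https://example.com/products", headers={"User-Agent": "Mozilla/5.0"})

remaining = response.headers.get("X-RateLimit-Remaining")
reset = response.headers.get("X-RateLimit-Reset")

# If we're about to exhaust the quota, sleep until the window resets
# (assuming the reset value is given in seconds, as above)
if remaining is not None and int(remaining) <= 1 and reset is not None:
    time.sleep(int(reset))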
How to Avoid Getting Blocked?
Introduce delays between requests:
import requests
import time

for page in range(1, 6):
    url = f"https://example.com/products?page={page}"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    print(response.status_code)
    time.sleep(3)  # Wait 3 seconds between requests
Randomize your delays to mimic human behavior:
import random
time.sleep(random.uniform(2, 5)) # Wait between 2 and 5 seconds
Use exponential backoff when blocked: each time you hit a 429, wait longer before retrying (doubling the delay on every attempt).
import requests
import time

def fetch_page(url):
    retries = 0
    while retries < 5:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code == 429:
            wait_time = 2 ** retries  # Exponential backoff: 1, 2, 4, 8, 16 seconds
            print(f"Rate limit hit! Waiting {wait_time} seconds...")
            time.sleep(wait_time)
            retries += 1
        else:
            return response.text
    return None
✅ This makes your scraper behave more like a human and reduces the chances of getting banned.
Proxy Rotation: Using Rotating Proxies to Avoid Detection
A proxy acts as a middleman between you and the website. Instead of making requests from your real IP, you route them through a different IP address.
Why Do You Need Proxies?
- Websites track IP addresses. Too many requests from the same IP will get you banned.
- Some sites block entire countries from accessing their content.
- Rotating proxies help distribute traffic across multiple IPs, making it harder to detect scraping.
Types of Proxies
- Data Center Proxies – Cheap, fast, but easily blocked.
- Residential Proxies – Expensive but look like real users.
- Rotating Proxies – Rotate IPs automatically to avoid detection.
How to Rotate Proxies in Your Scraper?
import requests
import time
from random import choice

proxies = [
    "http://proxy1:port",
    "http://proxy2:port",
    "http://proxy3:port"
]

def get_data(url):
    proxy_url = choice(proxies)
    proxy = {"http": proxy_url, "https": proxy_url}
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, proxies=proxy)
    if response.status_code == 429:
        time.sleep(5)  # Wait if blocked, then retry through another proxy
        return get_data(url)
    return response.text

for page in range(1, 6):
    html = get_data(f"https://example.com/products?page={page}")
    print(f"Page {page} scraped")
✅ This makes it much harder for sites to block your scraper.
User-Agent Rotation: Avoiding Bot Detection with Randomized Headers
Every time you visit a site, your browser sends a User-Agent string identifying what device and browser you’re using.
A normal request might have:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
If your scraper sends the same User-Agent on every request, it’s a red flag.
Solution? Rotate User-Agents.
import requests
from random import choice

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
]

def get_data(url):
    headers = {"User-Agent": choice(user_agents)}
    response = requests.get(url, headers=headers)
    return response.text

for page in range(1, 6):
    html = get_data(f"https://example.com/products?page={page}")
    print(f"Page {page} scraped")
✅ Makes your requests look like real users, reducing detection risk.
Solving Captchas: Using Automated Captcha Solving Services
Captchas are challenges that test if you’re human by making you:
- Select all traffic lights.
- Type distorted text.
- Click on weird images.
How to Bypass Captchas?
Use a Captcha Solving Service
- 2Captcha, Anti-Captcha, DeathByCaptcha
- These services solve captchas for you and return the response.
import requests
import time

API_KEY = "your_2captcha_api_key"
captcha_url = "https://api.2captcha.com/in.php"

data = {
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": "site-specific-key",
    "pageurl": "https://example.com"
}

response = requests.post(captcha_url, data=data)
captcha_id = response.text.split("|")[-1]

# Wait for the solving service to process the captcha, then fetch the solution
time.sleep(20)
solution_url = f"https://api.2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}"
solution = requests.get(solution_url).text.split("|")[-1]
print(f"Solved Captcha: {solution}")
✅ Automates Captcha solving, allowing your scraper to keep running.
Also read: Anti-Scraping Technology
Conclusion
Scraping a few pages is easy. Scraping thousands while dodging pagination traps, rate limits, and bot detection? That’s the real challenge.
If you’ve made it this far, you now have an arsenal of techniques to tackle any pagination system websites throw at you. You’ve learned how to:
- Identify pagination types—traditional, AJAX-based, infinite scroll, and API-based.
- Extract data efficiently using the right tools for each method.
- Avoid getting blocked with proxy rotation, user-agent spoofing, and request delays.
From here, you can push further: parallelize requests with multiprocessing for massive datasets, and add monitoring so you can debug long-running scrapes in real time.
Now, you’re equipped with everything you need to conquer pagination and scale up your scrapers.
So go ahead—build that scraper, extract that data, and stay ahead of the game.