Any website worth scraping at a large scale will most likely have sophisticated anti-bot measures protecting it. Even though these defenses mainly exist to block malicious traffic, they will also get in the way of your data collection. These five tips for outsmarting anti-scraping techniques will help you circumvent those countermeasures.
Also, when doing website load testing or running an SEO audit on your own site, some of these tips are still worth considering.
Tip #1 – Avoid Honeypot Traps
While you’re likely already familiar with CAPTCHAs, another anti-bot security feature that websites may implement is called honeypots.
There are several types of honeypots. For web scraping, the kind you need to strive to avoid is the kind that returns intentionally false data once triggered, since that could ruin your entire dataset.
These honeypots are typically HTML links that are hidden from view, so regular users never click them. The only visitors that follow them are bots.
To ensure that your bot doesn’t fall for these honeypots there are two things to have it check for and intentionally avoid when dynamically following links:
- If the link's text color is identical to its background color.
- If the link's styling includes the CSS declarations visibility: hidden or display: none.
Once you’re set to steer clear of those, your bot is much less likely to go somewhere it shouldn’t. This only applies when the bot discovers links on its own, though. If you’re giving it direct URLs to scrape, you don’t need to code in checks for these scenarios.
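As a rough sketch, those two checks might look like the following with BeautifulSoup. Note the assumptions: this only inspects inline style attributes (real pages may hide links via external stylesheets or classes), and the white page background is hard-coded for illustration.

```python
from bs4 import BeautifulSoup

def looks_like_honeypot(link, page_bg="#ffffff"):
    """Return True if a link appears styled to be invisible to human visitors."""
    style = (link.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        return True
    # A link whose text color matches the page background is also invisible.
    if f"color:{page_bg}" in style:
        return True
    return False

html = """
<a href="/real-page" style="color: #000000">Products</a>
<a href="/trap-1" style="display: none">Special offer</a>
<a href="/trap-2" style="visibility: hidden">Click here</a>
<a href="/trap-3" style="color: #ffffff">Hidden text</a>
"""
soup = BeautifulSoup(html, "html.parser")
safe_links = [a["href"] for a in soup.find_all("a") if not looks_like_honeypot(a)]
print(safe_links)  # only /real-page survives
```

A production version would also resolve computed styles (for example, via a headless browser) rather than relying on inline attributes alone.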
The next step is to hide the bot’s identifiers.
Tip #2 – Mask the Bot’s Digital Fingerprint
Apart from stumbling onto honeypots, there are many other signs of automation you must avoid when outsmarting anti-scraping techniques. If a site notices several requests arriving in rapid succession from one source, that’s a huge red flag. The site identifies that source by its IP address as well as its digital fingerprint.
Digital fingerprints are the amalgamation of several small details that, when pieced together, yield a unique result. A few of those aspects are:
- The brand and model of the device sending the request.
- What web browser it is using, including the version number, any customized settings, and what addons it’s running.
- The device’s OS.
- What fonts the device has installed.
There are two things you can do to mask this information. First, use a headless browser. Second, rotate both user agents and IP addresses with every request.
Headless browsers don’t have a graphical user interface (GUI). Thanks to this, they remove the settings and addon variables from the digital fingerprint. They can also greatly increase the speed of your scraper while consuming fewer resources as it runs.
Since a bot is handling everything, there’s no need for a GUI. Using a headless browser while web scraping is all gains with no losses.
User agents cover the remaining digital fingerprint traces identifying the browser, OS, and so on. Any advanced library includes functions that let you send cycling or randomized user-agent strings, masking the actual details of the device the bot is running on.
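A minimal sketch of user-agent rotation looks like this. The strings in the pool are illustrative placeholders; in practice you would keep a much larger, up-to-date list.

```python
import random

# A small pool of common desktop user-agent strings (illustrative examples;
# in practice, keep this list large and current).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
```

The resulting dict can be passed straight to your HTTP client on each request, e.g. `requests.get(url, headers=random_headers())`.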
As for the IP addresses, that brings us to our third tip.
Tip #3 – Use Rotating Residential Proxies
As you’re probably already aware, a proxy masks your IP address for you by presenting its own instead, among other perks.
A rotating proxy supplies regularly changing IP addresses. It will either rotate to a new IP address on every request or offer what is called a sticky session, where it sticks to one IP address for a predetermined length of time before cycling to the next one.
Depending on your scraping project, you may want sticky sessions so you can log in and have a persistent ID for a few steps. Otherwise, you’d just want to rotate on each request.
Quality service providers like KocerRoxy have a configuration interface where you can pick your session type. Otherwise, you’d have to design your bot to go through all of those IP addresses manually. That would involve significantly more work on your part.
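If you did have to rotate manually, a bare-bones version is just cycling through a pool of endpoints. The endpoints and credentials below are hypothetical placeholders; with a provider-managed rotating gateway, you would point every request at a single gateway URL instead.

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute your provider's hosts and credentials.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_pool = cycle(PROXIES)

def next_proxy_config():
    """Return a proxies mapping suitable for requests.get(..., proxies=...)."""
    endpoint = next(proxy_pool)
    return {"http": endpoint, "https": endpoint}

cfg = next_proxy_config()
```

Each call advances to the next endpoint, so pairing one call with one request gives per-request rotation.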
Proxy IPs come from three types of IP pools: mobile, residential, and datacenter.
As mobile IPs are much more expensive and have lower availability, you generally only want to use them for very specific projects that explicitly require them. That leaves you needing to decide between residential and datacenter IPs.
Datacenter proxies are unfortunately detectable as part of a proxy service, which is easily attributed to botting. Many sites auto-block entire datacenter subnet ranges as part of their anti-bot countermeasures.
Residential proxies, on the other hand, are indistinguishable from regular web traffic. Thanks to this, they are the superior option for outsmarting anti-scraping techniques.
When you are confident that your target sites aren’t vigilant against datacenter proxies, you can save some money by using them instead.
It’s ill-advised to gamble with the risks of a so-called free proxy service, especially given the availability of thrifty and reliable providers like KocerRoxy.
Tip #4 – Stick to Appropriate Geo-Locations
Those IP addresses provided by your proxy service still have locational info linked to them. Depending on the type of data you are scraping, some geo-locations can impact your results. Some geo-locations may even result in a block, such as when trying to access Facebook from China.
Proxy service providers should include geo-location options in their configuration interfaces. For example, KocerRoxy offers the choice of mixed source or specifically from US, UK, DE, JP, ESP, BR, FR, IT, CA, RU, and AU.
Tip #5 – Replicate Human Behavior
Last but not least, pause to consider how you usually surf the web.
Do you regularly type the full URL of the specific subpage you want to view, rather than clicking a link from another site, the root site, or a search engine?
Of course not, and your bot shouldn’t either. Setting up appropriate referral sources makes it look like your bot naturally navigated to the target site instead of going straight to it.
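Setting a referral source is just another request header. A minimal sketch, with a pool of plausible referrers (the list is an assumption for illustration):

```python
import random

# Plausible referrers make traffic look like it arrived via a search engine,
# rather than appearing out of nowhere with no referring page.
REFERRERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]

def referral_headers():
    """Build a Referer header simulating arrival from a search engine."""
    return {"Referer": random.choice(REFERRERS)}

headers = referral_headers()
```

Merge this dict with your other headers (user agent and so on) before sending each request.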
When you are following links within a site, do you meticulously make every single click at a fixed time interval? If so, I’m a bit impressed.
However, the website won’t be impressed. Instead, it will find it suspicious that requests arrive at a fixed interval, even if they’re not all from the same IP. This can lead to the site sending CAPTCHAs to all incoming traffic, and avoiding CAPTCHAs altogether is normally the ideal way of handling them.
To avoid this situation, set a randomized time delay between your requests, which is as simple as a function call. You should also set rate limits so you don’t cause a sudden, suspicious-looking traffic spike. It’s the decent thing to do as well, since overloading their poor servers doesn’t do anyone any good.
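The randomized delay really is a one-liner; the 2–6 second window below is an arbitrary example, not a recommendation for any particular site.

```python
import random
import time

def polite_sleep(min_s=2.0, max_s=6.0):
    """Sleep a random interval so requests don't arrive on a fixed cadence."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call `polite_sleep()` between requests; because the interval is drawn fresh each time, no two gaps are likely to match.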
Additionally, ensure your bot sends its requests asynchronously. This improves your requests-per-minute while also staggering when those requests go out. Thankfully, this is the default behavior in several libraries and frameworks; just double-check the documentation of your chosen tools to make sure.
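The asynchronous pattern can be sketched with Python's asyncio. Here `fetch()` is a stand-in that simulates a network round trip with `asyncio.sleep`; a real scraper would use an async HTTP client such as aiohttp in its place, and the semaphore doubles as a simple concurrency cap.

```python
import asyncio

async def fetch(url, sem):
    """Stand-in for an HTTP call; the semaphore caps concurrent requests."""
    async with sem:
        await asyncio.sleep(0.01)  # placeholder for the network round trip
        return f"fetched {url}"

async def crawl(urls, max_concurrent=3):
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(crawl(urls))
print(results[0])  # fetched https://example.com/page/0
```

With the cap at 3, at most three requests are in flight at once, so the remaining two start only as earlier ones finish.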
One more way of outsmarting anti-scraping techniques is to appear more human-like. Thus, avoid using too many search operators at once. During natural use, you might use a few. But, it’s suspiciously rare to use several simultaneously. Even if you think it’s just a normal Tuesday evening when you look up:
(meatloaf AROUND(5) recipe) “easy” (best OR classic) intext:bacon -turkey -youtube -vegan -vegetarian -plantbased -oat -diet -lowfat -healthy -music -album -musician -celebrity -actor -vocalist
If you haven’t built your scraper script just yet, there are a lot of free libraries out there for programming your own. If it’s your first time making a web scraper, I highly suggest coding in Python using the BeautifulSoup library.
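To give a feel for BeautifulSoup, here is a minimal parsing sketch. The HTML is inlined so the example is self-contained; a real scraper would fetch the page over HTTP first, and the `h2.title` selector is an assumption about the page's markup.

```python
from bs4 import BeautifulSoup

# Inlined sample page; in a real scraper this string would come from an
# HTTP response body.
html = """
<html><body>
  <h2 class="title">First article</h2>
  <h2 class="title">Second article</h2>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
titles = [h.get_text() for h in soup.find_all("h2", class_="title")]
print(titles)  # ['First article', 'Second article']
```

From here, `find_all` and CSS-class filtering cover most basic extraction tasks.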
Any advanced scraping library should include the functions needed to implement all five tips for outsmarting anti-scraping techniques. That just leaves you needing a dependable proxy service provider. KocerRoxy is the best economical option on the market, suitable for projects at any scale.