Disclaimer: Scraping can harm websites and should only be done after informing the website owner. Under special circumstances scraping a website can also be a felony. The crawler described in this article should only be used for educational purposes or with the website owner’s agreement.
Scraping data from websites can be tricky
Every website works differently, and some have special measures in place to keep you from getting the sweet data you want. Once a website has blocked you, it’s hard to get the scraping process running again, as you often cannot tell how you were identified.
How not to get blocked while scraping a website
There are different ways a website can identify you and put you on a blocklist. Among them, alone or in combination, are:
- TCP/IP fingerprint
- Browser user-agent
You can use a service like ProxyCrawl to handle all of that. The implementation is easy, but you pay for every page you scrape. If you need to crawl at scale, it might still be the best solution for you.
Using Tor as a free proxy rotator
If you don’t want to pay for a proxy rotator or crawling service, you can use the free Tor network as a proxy rotator. The Tor network is used to browse anonymously and contains many IP addresses we can use for our purposes. To install and configure Tor on a Linux machine, follow these instructions:
Install Tor and Privoxy
sudo apt-get install tor tor-geoipdb privoxy
Set password, get hash, and write both down somewhere
tor --hash-password PASSWORDHERE
Modify /etc/tor/torrc with your favorite editor
nano /etc/tor/torrc
-- ControlPort 9051
-- HashedControlPassword GENERATEDHASH
Modify /etc/privoxy/config so that Privoxy forwards its traffic to Tor
nano /etc/privoxy/config
-- forward-socks5t / 127.0.0.1:9050 .
[optional:] increase timeouts in /etc/privoxy/config
-- keep-alive-timeout 600
-- default-server-timeout 600
-- socket-timeout 600
[optional:] set Tor and Privoxy to auto start with the server
update-rc.d privoxy defaults
update-rc.d tor defaults
systemctl reboot
Start Tor and Privoxy
sudo service privoxy start
sudo service tor start
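To confirm that traffic really leaves through Tor, a quick check from Python can help. The following is only a minimal sketch: it assumes that Privoxy listens on its default address 127.0.0.1:8118 and that the Tor Project's check endpoint at https://check.torproject.org/api/ip answers with a small JSON document. The function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Privoxy runs with its default listen address.
PRIVOXY_URL = "http://127.0.0.1:8118"
# Assumption: the Tor Project's check endpoint, which answers with JSON.
CHECK_URL = "https://check.torproject.org/api/ip"


def build_proxy_opener(proxy_url=PRIVOXY_URL):
    """Return a urllib opener that sends HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def parse_check_response(body):
    """Extract (is_tor, ip) from the JSON body of the check endpoint."""
    data = json.loads(body)
    return bool(data.get("IsTor")), data.get("IP")


def check_tor(proxy_url=PRIVOXY_URL, timeout=30):
    """Ask the check endpoint which IP it sees and whether it is a Tor exit."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(CHECK_URL, timeout=timeout) as response:
        return parse_check_response(response.read().decode("utf-8"))
```

Calling `check_tor()` on the server should return a tuple such as `(True, "<exit ip>")` if both services are running. If the first value is `False`, the request did not go through Tor.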
Once Tor is installed and running, you can change the IP address your requests appear to come from whenever needed by asking Tor for a new circuit.
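Requesting a new circuit can be sketched as follows. This is a hedged, dependency-free example that talks to Tor's control port directly with the `AUTHENTICATE` and `SIGNAL NEWNYM` commands. It assumes the ControlPort 9051 from the torrc step and the plain-text password whose hash you stored there. The helper names are hypothetical; the stem library offers the same through `Controller.signal(Signal.NEWNYM)`.

```python
import socket


def quote_control_string(value):
    """Quote a string for the Tor control protocol (escape backslash and double quote)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_newnym_commands(password):
    """Return the control-port lines that authenticate and request a new circuit."""
    return [
        f"AUTHENTICATE {quote_control_string(password)}\r\n".encode("utf-8"),
        b"SIGNAL NEWNYM\r\n",
    ]


def renew_tor_identity(password, host="127.0.0.1", port=9051, timeout=10):
    """Send AUTHENTICATE and SIGNAL NEWNYM; raise if Tor does not answer with 250."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        reader = conn.makefile("rb")
        for command in build_newnym_commands(password):
            conn.sendall(command)
            reply = reader.readline().decode("utf-8", "replace").strip()
            if not reply.startswith("250"):
                raise RuntimeError(f"Tor control port replied: {reply}")
        # Close the control session politely.
        conn.sendall(b"QUIT\r\n")
```

Tor rate-limits NEWNYM requests, so it is worth waiting several seconds between two calls of `renew_tor_identity("PASSWORDHERE")`. A new circuit does not guarantee a different exit IP every time.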
Mimic natural user behavior with Chrome and Chromedriver
Next you need to install a browser and a driver for that browser. Firefox and Chrome are valid options, but we will go with Chrome for now.
Install curl and unzip
sudo apt install curl unzip
Select the latest Chrome version and install it
curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
Get the current Chrome version
google-chrome --version
Go to https://sites.google.com/chromium.org/driver/ and choose the ChromeDriver release that matches your Chrome version (e.g. https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip)
Download, unzip and move the driver so it can be used
wget https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
You should update your browser and driver from time to time to stay as inconspicuous as possible.
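Because browser and driver have to be updated together, a small check that their major versions still agree can save some debugging. The sketch below is an illustration only: it assumes that `google-chrome --version` and `chromedriver --version` print a four-part version number such as 89.0.4389.23, and all function names are hypothetical.

```python
import re
import subprocess


def major_version(version_output):
    """Return the major version number found in a `--version` output string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """Compare the major versions of Chrome and ChromeDriver."""
    return major_version(chrome_output) == major_version(driver_output)


def installed_versions_match(chrome_cmd="google-chrome", driver_cmd="chromedriver"):
    """Run both binaries with --version and compare their major versions."""
    outputs = [
        subprocess.run([cmd, "--version"], capture_output=True, text=True, check=True).stdout
        for cmd in (chrome_cmd, driver_cmd)
    ]
    return versions_match(*outputs)
```

Running `installed_versions_match()` after every update tells you whether a new ChromeDriver download is due.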
Now visit the landing page you want to crawl and inspect the header information your browser sends to the website. Your goal is to mimic those headers with every request you make. To do so in Chrome, open the developer tools with Right Click -> Inspect, go to Network, choose the document (it should be the first entry) and look at the Request Headers.
This can be overwhelming, but don’t worry too much about it. You don’t need to specify every header, as some, like user-agent and cookies, are handled well by ChromeDriver and Selenium.
In this case it should be enough to note the following headers:
- upgrade-insecure-requests: 1
- accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
- accept-encoding: gzip, deflate, br
- accept-language: de-DE,de;q=0.9,en;q=0.8,fr;q=0.7,es;q=0.6
Make sure to use the same Chrome version to inspect as you installed on your Linux server.
This part can be tricky, as you can’t really know which headers influence the success rate of your crawler.
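The headers noted above could be applied to every outgoing request roughly like this. It is a sketch, not a definitive implementation: the name `apply_headers` is made up, and the only thing it relies on is that the request object exposes a `headers` mapping, as selenium-wire's intercepted requests do.

```python
# Headers noted from the browser's Network tab (values taken from the list above).
EXTRA_HEADERS = {
    "upgrade-insecure-requests": "1",
    "accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.9"
    ),
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "de-DE,de;q=0.9,en;q=0.8,fr;q=0.7,es;q=0.6",
}


def apply_headers(request, headers=EXTRA_HEADERS):
    """Overwrite the given headers on a request object that has a `headers` mapping."""
    for name, value in headers.items():
        # Delete first: selenium-wire's header container would otherwise
        # add a second header of the same name instead of replacing it.
        if name in request.headers:
            del request.headers[name]
        request.headers[name] = value


# With selenium-wire, the function can be registered as an interceptor:
#   driver.request_interceptor = apply_headers
```

Keeping the headers in one dictionary makes it easy to adjust them after comparing again with what your real browser sends.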
Also, to mimic user behavior as well as possible, think about how a user navigates through a website. Sometimes it’s a good idea to visit the front page of a website before accessing the landing page you want to crawl, as the request headers might change as a result. The ‘referer’ header in particular is affected by that.
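Such a navigation path could be sketched as below. The function name, the URLs and the pause range are assumptions for illustration; `driver` stands for any Selenium-style driver with `get` and `execute_script`. Navigating from inside the page is meant to make the browser treat the front page as the origin of the request. Whether the referer header actually arrives as expected should be verified in the Network tab.

```python
import random
import time


def visit_via_front_page(driver, front_page, landing_page, pause=time.sleep):
    """Open the front page, wait like a reader would, then move on to the landing page."""
    driver.get(front_page)
    # A random pause; 2 to 6 seconds is an arbitrary, illustrative range.
    pause(random.uniform(2.0, 6.0))
    # Navigate from inside the page instead of calling driver.get() again,
    # so the browser can treat the front page as the source of the navigation.
    driver.execute_script("window.location.href = arguments[0];", landing_page)
```

Note that `execute_script` returns as soon as the navigation has been triggered, so you may need an explicit wait before reading the page.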
Connect the dots with SeleniumWire
To use Chrome with the Tor network as a proxy, we use a Python library called SeleniumWire. To use SeleniumWire we need to install some more dependencies.
sudo apt install python3-pip python3-dev python3-openssl python3-bs4 default-libmysqlclient-dev build-essential openssl
pip3 install -U stem Rust requests[socks] cryptography selenium selenium-wire undetected-chromedriver
Now we can write our Python script and put everything together. This is what your controller could look like.
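The original controller code is not reproduced here, so the following is only a minimal sketch of how the pieces might fit together. It assumes Privoxy on its default address 127.0.0.1:8118 forwarding to Tor, a headless Chrome, and ChromeDriver on the PATH. The function names are hypothetical. The `seleniumwire_options` proxy dictionary and `webdriver.Chrome(...)` are selenium-wire's documented interface.

```python
# Assumption: Privoxy forwards to Tor and listens on its default address.
PRIVOXY_URL = "http://127.0.0.1:8118"


def build_seleniumwire_options(proxy_url=PRIVOXY_URL):
    """Build the selenium-wire options that route browser traffic through the proxy."""
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


def create_driver(proxy_url=PRIVOXY_URL, headless=True):
    """Start Chrome through selenium-wire with the proxy configured."""
    # Imported here so the option helper above works without selenium-wire installed.
    from seleniumwire import webdriver

    chrome_options = webdriver.ChromeOptions()
    if headless:
        chrome_options.add_argument("--headless")
    # Often needed when Chrome runs as root on a server; drop it if not required.
    chrome_options.add_argument("--no-sandbox")
    return webdriver.Chrome(
        options=chrome_options,
        seleniumwire_options=build_seleniumwire_options(proxy_url),
    )


def fetch_page_source(url, proxy_url=PRIVOXY_URL):
    """Load a URL through the proxy and return the rendered HTML."""
    driver = create_driver(proxy_url)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

A call such as `html = fetch_page_source("https://example.com/")` then returns the HTML for further processing. Starting a fresh browser per page is slow but keeps cookies and sessions from carrying over between requests.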
Parse the needed data with BeautifulSoup
To use BeautifulSoup as a parser you need to install it via:
pip3 install -U BeautifulSoup4
Now you can fetch everything that’s on the site very easily.
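As a small illustration, the sketch below pulls the page title and all links out of an HTML string. The function name and the choice of extracted fields are assumptions for this example. It uses Python's built-in `html.parser` backend, so no additional parser has to be installed.

```python
from bs4 import BeautifulSoup


def extract_title_and_links(html):
    """Return the page title and a list of (link text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True skips anchor tags that have no href attribute.
    links = [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
    return title, links
```

The HTML string would typically be the `page_source` of the browser session that loaded the page.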
To learn about BeautifulSoup visit:
How to scrape data from IMDB
How to scrape data from Amazon
How to scrape data from Google
How to scrape expired domains