How to crawl websites for free using Tor and SeleniumWire!


Disclaimer: Scraping can harm websites and should only be done after informing the website owner. Under special circumstances scraping a website can even be a felony. The crawler described in this article should only be used for educational purposes or with the website owner’s agreement.

Scraping data from websites can be tricky

Every website works differently, and some have special measures in place to keep you from getting the sweet data you want. Once a website blocks you, it’s hard to get the scraping process running again, because you often can’t tell how you were identified.

How not to get blocked while scraping a website

There are different ways a website can identify you and put you on a blocklist, usually some combination of:

  • TCP/IP fingerprint
  • Browser user-agent
  • Cookies
  • Geolocation

You can use a service like ProxyCrawl to handle all of that. The implementation is easy, but you pay per scraped page. If you need to crawl at scale, it might still be the best solution for you.

Using Tor as a free proxy rotator

If you don’t want to pay for a proxy rotator or crawling service, you can use the free Tor network as a proxy rotator. The Tor network is designed for anonymous browsing and offers many exit IPs we can use for our purposes. To install and configure Tor on a Linux machine, follow these instructions:

Install Tor and Privoxy

sudo apt-get install tor tor-geoipdb privoxy

Set password, get hash, and write both down somewhere

tor --hash-password PASSWORDHERE

modify /etc/tor/torrc with your favorite editor

nano /etc/tor/torrc
-- ControlPort 9051
-- HashedControlPassword GENERATEDHASH

modify /etc/privoxy/config

nano /etc/privoxy/config
-- forward-socks5t / 127.0.0.1:9050 .

[optional:] increase timeouts in /etc/privoxy/config

-- keep-alive-timeout 600
-- default-server-timeout 600
-- socket-timeout 600

[optional:] set Tor and Privoxy to auto start with the server

sudo update-rc.d privoxy defaults
sudo update-rc.d tor defaults
systemctl reboot

Start Tor and Privoxy

sudo service privoxy start
sudo service tor start

Once Tor and Privoxy are installed, you can change your TCP/IP fingerprint whenever needed.

Mimic natural user behavior with Chrome and Chromedriver

Next you need to install a browser and a driver for that browser. Firefox and Chrome are valid options, but we will go with Chrome for now.

Install curl and unzip

sudo apt install curl unzip

Select the latest Chrome version and install it

curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

Get the current chrome version

google-chrome --version

Go to https://sites.google.com/chromium.org/driver/ and choose the ChromeDriver release that matches your Chrome version (e.g. https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip)

Download, unzip and move the driver so it can be used

wget https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver

You should update your browser and driver from time to time to stay as inconspicuous as possible.

Now visit the landing page you want to crawl and inspect the headers your browser sends to the website. Your goal is to mimic them with every request you make. In Chrome, open DevTools with Right Click -> Inspect, go to the Network tab, choose the document (it should be the first entry), and look at the Request Headers.

Inspect in Chrome Browser

This can be overwhelming, but don’t worry too much about it. You don’t need to specify all the information, as some of it, like the user agent and cookies, is already handled well by ChromeDriver and Selenium.

In this case it should be enough to note the following headers:

  • upgrade-insecure-requests: 1
  • accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
  • accept-encoding: gzip, deflate, br
  • accept-language: de-DE,de;q=0.9,en;q=0.8,fr;q=0.7,es;q=0.6

Make sure to use the same Chrome version to inspect as you installed on your Linux server.

This part can be tricky, as you can’t really know which header fields influence the success rate of your crawler.

Also, to mimic user behavior as closely as possible, think about how a user navigates through a website. Sometimes it’s a good idea to visit the front page of a website before accessing the landing page you want to crawl, as the request headers may change then. Especially the ‘Referer’ header will be influenced by that.
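Selenium Wire lets you rewrite outgoing requests through a request interceptor, which is one way to attach the headers noted above to every request. A sketch, where the interceptor name and the exact header set are our own choices:

```python
# Headers noted in DevTools; adjust to what your own browser sends.
EXTRA_HEADERS = {
    "upgrade-insecure-requests": "1",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "de-DE,de;q=0.9,en;q=0.8,fr;q=0.7,es;q=0.6",
}

def add_headers(request):
    """Overwrite these headers on every outgoing request."""
    for name, value in EXTRA_HEADERS.items():
        if name in request.headers:
            del request.headers[name]  # avoid duplicate header fields
        request.headers[name] = value

# Usage with a Selenium Wire driver:
# driver.request_interceptor = add_headers
```

Because the interceptor only touches `request.headers`, you can tweak the header set without changing any other part of the crawler.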

Connect the dots with SeleniumWire

To use Chrome with the Tor network as a proxy, we use a Python library called SeleniumWire, which needs a few more dependencies.

sudo apt install python3-pip python3-dev python3-openssl python3-bs4 default-libmysqlclient-dev build-essential openssl
pip3 install -U stem requests[socks] cryptography selenium selenium-wire undetected-chromedriver

(If pip has to build cryptography from source, it also requires a Rust toolchain, which is not a pip package.)

Now we can write our Python script and put everything together. This is what your controller could look like:
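The original article’s script is not reproduced here, so the controller below is one possible sketch: it routes Chrome through Privoxy (which forwards into Tor) and requests a new Tor identity between runs. Port 8118 is Privoxy’s default listen port; the function names, the example URL, and PASSWORDHERE are placeholders:

```python
PROXY = "http://127.0.0.1:8118"  # Privoxy's default listen address

def proxy_options(proxy: str = PROXY) -> dict:
    """Selenium Wire options routing all traffic through Privoxy -> Tor."""
    return {"proxy": {"http": proxy, "https": proxy,
                      "no_proxy": "localhost,127.0.0.1"}}

def make_driver():
    # Imported here so the module loads even without selenium-wire installed.
    from seleniumwire import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options,
                            seleniumwire_options=proxy_options())

def new_identity(password: str) -> None:
    # Ask Tor for a fresh circuit via the ControlPort configured earlier.
    from stem import Signal
    from stem.control import Controller
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)

if __name__ == "__main__":
    driver = make_driver()
    try:
        driver.get("https://example.com/")  # placeholder target
        print(driver.title)
    finally:
        driver.quit()
    new_identity("PASSWORDHERE")  # fresh exit IP for the next run
```

Keeping the proxy settings in `proxy_options()` makes it easy to switch between Tor and a paid rotator later without touching the rest of the script.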

Parse the needed data with BeautifulSoup

To use BeautifulSoup as a parser you need to install it via:

pip3 install -U beautifulsoup4

Now you can fetch everything that’s on the site very easily.
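For instance, you can feed the page source Selenium delivers into BeautifulSoup and pull out whatever you need. The HTML string below is a stand-in for `driver.page_source`:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source from the crawler above.
html = """
<html><body>
  <h1>Example product</h1>
  <span class="price">19.99</span>
  <a href="/next">More</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.h1.get_text(strip=True)                     # text of the first <h1>
price = soup.find("span", class_="price").get_text()     # first matching <span>
links = [a["href"] for a in soup.find_all("a")]          # all link targets
print(title, price, links)  # → Example product 19.99 ['/next']
```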

To learn more about BeautifulSoup, see the official documentation.

Examples

How to scrape data from IMDB

How to scrape data from Amazon

How to scrape data from Google

How to scrape expired domains

