How to crawl websites for free with SeleniumWire and Tor

Crawl Websites using Selenium and Tor
Photo by hitesh choudhary from Pexels

Disclaimer

Scraping can harm websites and should only be done with the permission of the website owner. Under certain circumstances, scraping a website can also be a criminal offense. The crawler described in this article should only be used for educational purposes or with the permission of the website owner.

Why is this necessary?

Every website works differently, and some have taken special measures to prevent you from getting the data you want. Once you are blocked from a website, it is difficult to get the scraping process started again because you may not know how you are identified. This is why it is important to avoid being blocked in the first place.

How not to get blocked when scraping a website

There are several ways a website can identify you and put you on a block list, often using a combination of:

  • TCP/IP fingerprint
  • Browser user agent
  • Cookies
  • Geolocation

You can use a service like ProxyCrawl to handle all of this. It’s easy to implement, but it costs money per page you want to crawl. If you need to crawl on a large scale, this might be the best solution for you.

Requirements

Whether you use a proxy rotation service or not, you will need a (cheap) web server with root access. I recommend the smallest Netcup VPS (1) or something similar.

To set up and secure your server, you can follow this guide by DigitalOcean. You don’t have to follow all the steps, but the basic initialization is mandatory.

Use Tor as a free proxy rotator

If you don’t want to pay for a proxy rotator or crawling service, you can use the free Tor network as a proxy rotator. Tor is designed for anonymous browsing and provides many exit IPs that we can use for our purposes. To install and configure Tor on a Linux server, follow these instructions:

Install Tor and Privoxy

sudo apt-get install tor tor-geoipdb privoxy

Set password, get hash, and write both down somewhere

tor --hash-password PASSWORDHERE

Modify /etc/tor/torrc with your favorite editor

nano /etc/tor/torrc

ControlPort 9051
HashedControlPassword GENERATEDHASH

Modify /etc/privoxy/config

nano /etc/privoxy/config

forward-socks5t / 127.0.0.1:9050 .

[optional:] increase the timeouts in /etc/privoxy/config

keep-alive-timeout 600
default-server-timeout 600
socket-timeout 600

[optional:] set up Tor and Privoxy to start automatically with the server

sudo update-rc.d privoxy defaults
sudo update-rc.d tor defaults
sudo systemctl reboot

Start Tor and Privoxy

sudo service privoxy start
sudo service tor start

After you have successfully installed Tor, your TCP/IP fingerprint can now be changed at any time.
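To verify that the rotation actually works, you can compare the exit IP before and after requesting a new identity. Here is a minimal sketch, assuming the ControlPort (9051) and the password you configured above, and that the Python dependencies from the SeleniumWire section below (stem and requests[socks]) are installed; api.ipify.org is just one example of an IP echo service:

import requests
from stem import Signal
from stem.control import Controller

# route requests through the local Tor SOCKS proxy
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

print('Exit IP before:', requests.get('https://api.ipify.org', proxies=proxies).text)

# ask Tor for a new circuit (and therefore, usually, a new exit IP)
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='PASSWORDHERE')  # the password you hashed above
    controller.signal(Signal.NEWNYM)

print('Exit IP after:', requests.get('https://api.ipify.org', proxies=proxies).text)

Note that Tor rate-limits NEWNYM signals, so two identity changes in quick succession may return the same exit IP.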

Mimic natural user behavior with Chrome and Chromedriver

Next, you need to install a browser and a driver for that browser. Firefox and Chrome are valid options, but we’ll opt for Chrome for now.

Install curl and unzip

sudo apt install curl unzip

Choose the latest Chrome version and install it

curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

Determine the latest Chrome version

google-chrome --version

Go to https://sites.google.com/chromium.org/driver/ and choose the ChromeDriver release that matches your Chrome version (e.g. https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip)

Download the driver, unzip and move it so that it can be used

wget https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver

You should update your browser and driver from time to time to be as unobtrusive as possible.

Now visit the landing page you want to crawl and examine the header information your browser sends to the site. Your goal is to mimic this for every request. To do this in Chrome, open DevTools (right-click -> Inspect), go to the Network tab, select the document request (usually the first entry), and look at the Request Headers.

Inspect request headers in Chrome

This can be overwhelming, but don’t worry too much about it. You don’t need to include all the information, as some of it, like cookies, is handled by ChromeDriver and Selenium themselves.
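Cookies are a good example: ChromeDriver manages them for you, and Selenium exposes them directly if you ever need to inspect or seed one. A minimal sketch, assuming driver is a running Selenium instance like the one we build below (the cookie name and value are made up):

# read the cookies ChromeDriver is currently managing
for cookie in driver.get_cookies():
    print(cookie['name'], cookie['value'])

# seed a cookie by hand (only valid for the domain currently loaded)
driver.add_cookie({'name': 'session_hint', 'value': 'example'})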

In this case, it should be sufficient to note the following headers:

  • upgrade-insecure-requests: 1
  • accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
  • accept-encoding: gzip, deflate, br
  • accept-language: de-DE,de;q=0.9,en;q=0.8,fr;q=0.7,es;q=0.6

Make sure the Chrome version you use for this inspection is the same one you installed on your Linux server.

This part can be tricky, as you can’t really know what header information will affect your crawler’s success rate.
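One pragmatic approach is to make these values easy to vary and test, for example by rotating through a small pool of realistic user agents and measuring which ones get blocked less often. A hypothetical sketch (the pool contents are placeholders; the interceptor that applies the chosen agent is introduced in the SeleniumWire section below):

import random

# hypothetical pool of realistic user agents to experiment with
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.19 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.19 Safari/537.36',
]

def pick_user_agent():
    # pick one at random per session and track your success rate
    return random.choice(USER_AGENTS)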

To mimic user behavior as much as possible, you also need to think about how a user navigates through a website. Sometimes it’s a good idea to visit the home page of a website before going to the landing page you want to crawl, as the request headers may change. In particular, the ‘Referer’ header will be affected.
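A direct driver.get() behaves like typing the URL into the address bar and sends no Referer header at all, so to produce a realistic one you have to click through the site. A minimal sketch, assuming driver comes from the controller we build below; the URL and link text are placeholders:

from selenium.webdriver.common.by import By

# land on the home page first, like an organic visitor
driver.get('https://www.example.com')

# clicking a real link sends a Referer header, unlike a direct get()
driver.find_element(By.LINK_TEXT, 'Products').click()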

Connect the dots with SeleniumWire

To drive Chrome through the Tor network as a proxy, we use a Python library called SeleniumWire, which requires a few more dependencies.

sudo apt install python3-pip python3-dev python3-openssl python3-bs4 default-libmysqlclient-dev build-essential openssl
pip3 install -U stem requests[socks] cryptography selenium selenium-wire

(If cryptography has to be built from source, it also needs the Rust toolchain, e.g. via sudo apt install rustc; on most systems a prebuilt wheel is used and this step is unnecessary.)

Now we can write our Python script and put everything together. This is what your controller might look like.

from seleniumwire import webdriver

from stem import Signal
from stem.control import Controller

class NewController:

    ACCEPT = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
    ACCEPT_ENCODING = 'gzip, deflate, br'
    ACCEPT_LANGUAGE = 'de-DE,de;q=0.9,en;q=0.8,fr;q=0.7'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.19 Safari/537.36'

    def createTorIdentity(self):
        # ask Tor for a new circuit (and therefore a new exit IP)
        with Controller.from_port(port = 9051) as controller:  # ControlPort from torrc
            controller.authenticate(password = "TOR_PASS")
            controller.signal(Signal.NEWNYM)

    def interceptor(self, request):
        # replace the browser's default headers with the ones we noted earlier
        del request.headers['User-Agent']
        del request.headers['Accept']
        del request.headers['Accept-Encoding']
        del request.headers['Accept-Language']
        del request.headers['Upgrade-Insecure-Requests']
        request.headers['User-Agent'] = self.USER_AGENT
        request.headers['Accept'] = self.ACCEPT
        request.headers['Accept-Encoding'] = self.ACCEPT_ENCODING
        request.headers['Accept-Language'] = self.ACCEPT_LANGUAGE
        request.headers['Upgrade-Insecure-Requests'] = '1'

    def createDriver(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        # chrome_options.add_argument('--disable-dev-shm-usage')  # uncomment on hosts with a small /dev/shm
        chrome_options.add_argument('--disable-blink-features=AutomationControlled')
        chrome_options.add_argument('--window-size=1366,768')

        options = {
            'connection_keep_alive': True,
            'proxy': {
                'http': 'socks5h://127.0.0.1:9050',
                'https': 'socks5h://127.0.0.1:9050',
                'connection_timeout': 10
            }
        }

        driver = webdriver.Chrome(  # traffic is routed through Tor via seleniumwire_options
            executable_path = "/usr/bin/chromedriver",
            options = chrome_options,
            seleniumwire_options = options
        )

        driver.request_interceptor = self.interceptor

        return driver

Replace your Tor control port (if it differs from 9051) and password, all header information (Accept, User-Agent, …), and if necessary the executable path of your ChromeDriver.

Request the needed data and parse it with BeautifulSoup

To use BeautifulSoup as a parser, you need to install it via:

pip3 install -U beautifulsoup4

Now you can easily retrieve everything that is on the website. To learn more about BeautifulSoup, visit: https://www.crummy.com/software/BeautifulSoup/.

A simple script to parse data from a URL might look like this.

import controller

from selenium.common.exceptions import TimeoutException

from bs4 import BeautifulSoup

# instantiate the controller class
control = controller.NewController()

# request a fresh Tor identity
control.createTorIdentity()

# Create a new instance of the Chrome driver
driver = control.createDriver()

# URL to crawl (this could also come from a database)
link = 'some url'
# example:
# link = 'https://www.amazon.com'

driver.get(link)

try:
    # wait until the request for the landing page has completed
    request = driver.wait_for_request(link, timeout=60)
except TimeoutException:
    request = None

# access requests via the `requests` attribute
if request is not None and request.response and request.response.status_code == 200:

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # headline
    headline = soup.find('h1')
    print(headline)

driver.quit()

With this, you should be able to write your own lightweight crawler to parse and extract data, and reuse it as a framework for more specialized crawlers.

1) This is an affiliate link. We sometimes receive a revenue share when you buy something through these links.