Best Sites For Web Scraping



In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.


Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.


At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot

It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application and you haven't found an easy way to call the underlying APIs directly. In this case, Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have the following installed on your local machine:

  • Google Chrome
  • ChromeDriver (the driver binary matching your Chrome version)
  • selenium package

To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
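For example, on a Unix-like shell (the activation command differs on Windows):

```
virtualenv venv
source venv/bin/activate
pip install selenium
```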

Quickstart

Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
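A minimal sketch, assuming the ChromeDriver executable sits next to your script and using the Selenium 3-style API this post is written against (Selenium 4 replaces executable_path with a Service object):

```python
from selenium import webdriver

# Adjust this to wherever you put the ChromeDriver binary
DRIVER_PATH = './chromedriver'

driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.google.com')
```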

This will launch Chrome in headful mode (like your regular Chrome, but controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

To run Chrome in headless mode (without any graphical user interface), which is what you would typically do on a server, pass the headless option. See the following example:
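A sketch of the same setup with headless Chrome; the window-size argument is optional but often useful for screenshots:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(executable_path='./chromedriver', options=options)
driver.get('https://www.example.com')
print(driver.page_source)  # full HTML of the rendered page
driver.quit()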

driver.page_source returns the full HTML code of the page.

Here are two other interesting WebDriver properties:

  • driver.title gets the page's title
  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

Locating Elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).


There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name
  • Class name
  • IDs
  • XPath
  • CSS selectors

We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut is to highlight the element you want with your mouse and then press Ctrl + Shift + C (Cmd + Shift + C on macOS) instead of having to right-click and choose Inspect each time.


find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:
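Below is a sketch of what that HTML and the different locator calls could look like. The class name someclass and the ID greatID are made-up placeholders, and the find_element_by_* helpers match the Selenium 3-style API used throughout this post (Selenium 4 replaces them with find_element(By.ID, ...)):

```python
# Hypothetical page snippet:
#
# <html>
#     <body>
#         <h1 class="someclass" id="greatID">Super title</h1>
#     </body>
# </html>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_id('greatID')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_css_selector('h1.someclass')
```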

All these methods also have find_elements (note the plural) to return a list of elements.

For example, to get all anchors on a page, use the following:
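One way, using the tag name locator (again in the Selenium 3-style API):

```python
# Returns a (possibly empty) list of WebElement objects, one per <a> tag on the page
all_links = driver.find_elements_by_tag_name('a')
```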

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those HTML elements. Here are the most useful:

  • Accessing the text of the element with the property element.text
  • Clicking on the element with element.click()
  • Accessing an attribute with element.get_attribute('class')
  • Sending text to an input with: element.send_keys('mypassword')

There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.

This can be useful to avoid honeypots (for example, hidden inputs that a bot should not fill).

Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:
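A hypothetical example of such a field (the custId name is made up), together with a simple way to avoid filling it:

```python
# The honeypot field could look like this in the page source:
#
#   <input type="hidden" id="custId" name="custId" value="">
#
# Checking is_displayed() before typing lets the bot skip fields
# that a real user could never see:
for field in driver.find_elements_by_tag_name('input'):
    if field.is_displayed():
        field.send_keys('some value')
```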

This input value is supposed to be blank. If a bot visits a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.

That's a classic honeypot.

Full example

Here is a full example using Selenium API methods we just covered.

We are going to log into Hacker News:


In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.

In order to authenticate we need to:

  • Go to the login page using driver.get()
  • Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
  • Follow the same process with the password input
  • Click on the login button using element.click()

Should be easy, right? Let's see the code:
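Here is a sketch of what that could look like. USERNAME and PASSWORD are placeholders, the XPath expressions are a best guess at the Hacker News login form and may need adjusting if its markup changes, and the find_element_by_* helpers match the Selenium 3-style API used in this post:

```python
from selenium import webdriver

USERNAME = 'your_username'   # placeholder credentials
PASSWORD = 'your_password'

driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://news.ycombinator.com/login')

# The login form is the first one on the page
driver.find_element_by_xpath("//input[@type='text']").send_keys(USERNAME)
driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
driver.find_element_by_xpath("//input[@value='login']").click()
```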

Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

  • Check for an error message (like “Wrong password”)
  • Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!

We can't just check if the element is None, because all of the find_element_by_* methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
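In code, that check could look like this:

```python
from selenium.common.exceptions import NoSuchElementException

try:
    logout_button = driver.find_element_by_id('logout')
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
```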

Taking a screenshot

We could easily take a screenshot using:
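For example (the file name is just a placeholder):

```python
driver.save_screenshot('hn_screenshot.png')
```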


Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.

In our Hacker News case it's simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.

If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:

  • Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
  • Use a WebDriverWait object.

If you use time.sleep(), you will probably use an arbitrary value. The problem is that you're either waiting too long or not long enough. Also, the website might load slowly on your local Wi-Fi connection but be ten times faster on your cloud server. With the WebDriverWait method, you wait exactly as long as necessary for your element/data to be loaded.
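Here is a sketch of the WebDriverWait approach, using the presence_of_element_located expected condition:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 5 seconds for the element to appear in the DOM
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.ID, 'mySuperId'))
)
```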

This will wait five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:

  • element_to_be_clickable
  • text_to_be_present_in_element
  • presence_of_element_located

You can find more information about this in the Selenium documentation.

Executing Javascript

Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
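For example, to scroll down by a fixed amount (1,000 pixels here) before taking the screenshot:

```python
javascript = 'window.scrollBy(0, 1000)'
driver.execute_script(javascript)
```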

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python, don't hesitate to take a look at our general Python web scraping guide.

Selenium is often necessary to extract data from websites that use lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it might be a good idea to automate them with Selenium. Just don't forget this xkcd:


Introduction


In this article, we will look at the top five proxy list websites and perform a benchmark.

If you are in a hurry and wish to go straight to the results, click here.

The idea is not only to talk about the different features they offer, but also to test the reliability with a real-world test. We will look at and compare the response times, errors, and success rates on popular websites like Google and Amazon.

There is a proxy type to match any specific needs you might have, and you can always start with a free proxy server. This is especially true if you want to use it as a proxy scraper.

A free proxy server is a proxy you can connect to without needing special credentials and there are plenty to choose from online. The most important thing you need to consider is the source of the proxy. Since proxies take your information and re-route it through a different IP address, they still have access to any internet requests you make.

While there are a lot of reputable free proxies available for web scraping, there are just as many proxies that are hosted by hackers or government agencies. You are sending your requests to a third party, and they have a chance to see all of the unencrypted data that comes from your computer or phone.

Whether you want to gather information through web scraping without websites tracking your bots or you need to bypass rate limits, there's a way for you to get privacy.

Proxies help keep your online activity secure by routing all of your requests through a different IP address. Websites aren't able to track you when they don't have the original IP address your request came from.

Even when you find a trustworthy free proxy, there are still some issues with using them. They could return responses incredibly slowly if there are many users on the proxy at the same time. Some of them are unreliable and might disappear without warning and never come back. Proxies can also inject ads into the data returned to your computer.

In the context of web scraping, most users start out with a free proxy. Usually you aren't sending any sensitive information with your requests so many people feel comfortable using them for this purpose. However, you might not want a website to know that you are scraping it for its data.

You could be doing market research to learn more about your competition through web scraping. You could also scrape the web to build a prospect list.

Many users don't want a website to know about that kind of activity. One big reason users turn to free proxies for web scraping is that they don't plan to do it often. Let's say you sell software to restaurant owners. You might want to scrape a list of restaurants to gather their phone numbers. This is a one-time task, so you might want to use free proxies for it.

You can get the information you need from a site and then disconnect from the proxy without any issues.

While free proxies are great for web scraping, they are still insecure. A malicious proxy could alter the HTML of the page you requested and give you false information. You also have the risk that the proxy you are currently using can disconnect at any time without warning. Also, the proxy IP address you're using could get blocked by websites if there are a lot of people using it for malicious reasons.

Free proxies have their uses and there are thousands of lists available with free proxy IP addresses and their statuses. Some lists have higher quality proxies than others and you also have the option to use specific proxy services. You'll learn about several of these lists and services to help you get started in your search for the best option for your proxy scraper.

1. ScrapingBee review


I know, I know… It sounds a bit pushy to immediately talk about our own service, but this article isn't an ad. We put a lot of time and effort into benchmarking these services, and I think it is fair to compare these free proxy lists to the ScrapingBee API.

If you're going to use a proxy for web scraping, consider ScrapingBee. While some of the best features are in the paid version, you can get 1,000 free credits when you sign up. This service stands out because even free users have access to support, and the IP addresses you have access to are more secure and reliable.

The features ScrapingBee includes in the free credits are unmatched by any other free proxy you'll find in the lists below. You'll have access to tools like JavaScript rendering and headless Chrome to make it easier to use your proxy scraper.

One of the coolest features is that they have rotating proxies so that you can get around rate-limiting websites. This helps you hide your proxy scraper bots and lowers the chance you'll get blocked by a website.

You can also find code snippets in Python, NodeJS, PHP, Go, and several other languages for your web scrapers. ScrapingBee even has its own API, which makes web scraping even easier. You don't have to worry about security leaks or the proxy running slowly, because access to the proxy servers is limited.

You can customize things like your geolocation, the headers that get forwarded, and the cookies that are sent in the requests, and ScrapingBee automatically blocks ads and images to speed up your requests.
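For example, with the Python requests library, a call to the ScrapingBee API could look roughly like this. The api_key, url and render_js parameters are taken from the public documentation, but double-check the docs for the current parameter list:

```python
import requests

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': 'YOUR_API_KEY',      # the key from your ScrapingBee dashboard
        'url': 'https://example.com',   # the page you want to scrape
        'render_js': 'false',           # set to true to render the page with headless Chrome
    },
)
print(response.status_code)
print(response.text)  # the HTML of the target page
```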


Another cool thing is that if your requests return a status code other than 200, you don't get charged for that credit. You only have to pay for successful requests.

Even though ScrapingBee's free plan is great, if you plan on scraping websites a lot you will need to upgrade to a paid plan. And of course, if you have any problem, you can get in touch with the team to find out what happened.

With the free proxies on the lists below, you won't have any support. You'll be responsible for making sure your information is secure, and you'll have to deal with IP addresses getting blocked and requests returning painfully slowly as more users connect to the same proxy.

Results (full benchmark & methodology)

Website         Errors   Blocked   Success   Average time
Instagram       45       0         955       3.3
Google          80       0         920       8.30
Amazon          22       0         978       3.34
Top 300 Alexa   5        0         995       3.34

2. ProxyScrape Review


If you're looking for a list of completely free proxies, Proxyscrape is one of the leading free proxy lists available. One really cool feature is that you can download the list of proxies to a .txt file. This can be useful if you want to run a lot of proxy scrapers at the same time on different IP addresses.

You can even filter the free proxy lists by country, level of anonymity, and whether they use an SSL connection. This lets you find the kind of proxy you want to use more quickly than with many other lists where you have to scroll down a page, looking through table columns.

ProxyScrape even has different kinds of proxies available. You still have access to HTTP proxies, and you can find lists of Socks4 and Socks5 proxies. There aren't as many filters available for Socks4 and Socks5 lists, but you can select the country you want to use.

The ProxyScrape API currently works with Python and there are only four types of API requests you can make. An important thing to remember is that none of the proxies on any of the lists you get from this website are guaranteed to be secure. Free proxies can be hosted by anyone or any entity, so you will be using these proxies at your own risk.

They do have a premium service available where they host datacenter proxies. These are typically more secure than the free ones. They do more monitoring on these proxies to make sure that you have consistent uptime and that the IP addresses don't get added to blocklists.

Another nice tool they have is an online proxy checker. This lets you enter the IP addresses of some of the free proxies you've found and test them to see if they are still working. When you're trying to do web scraping you want to make sure that your proxy doesn't disconnect in the middle of the process and this is one way you can keep an eye on the connection.

Results (full benchmark & methodology)

Website         Errors   Blocked   Success   Average time
Instagram       392      592       16        25.55
Google          958      447       42        16.12
Amazon          445      16        539       20.37
Top 300 Alexa   551      1         448       13.60


3. free-proxy.cz review


Free-proxy.cz is one of the original free proxy list sites. There hasn't been much maintenance on the website, so it still has the user interface of an early-2000s site, but if you're just looking for free proxies, it has a large list. One thing you'll find here that's different from other proxy list sites is a list of free web proxies.

Web proxies are usually run on server-side scripts like PHProxy, Glype, or CGIProxy. The list is also pre-filtered for duplicates so there aren't any repeating IP addresses. Also, the list of other proxy servers in their database is unique.

On the homepage there is a table with all of the free proxies they have found. You can filter the proxies by country, protocol, and anonymity level. You can sort the filtered table by the proxy speed, uptime, response time, and the last time the status was checked. The table shows paginated results, so taking advantage of the sort function will save you some time.

There's also a “proxies by category” tool below the table that lets you look at the free proxies by country and region. This makes it easier to go through the table of results and find exactly what you need. This is the best way to navigate this list of free proxies because there are thousands available.

Another useful tool on this site is the “Your IP Address Info” button at the top of the page. It will tell you everything about the IP address you are using to connect to the website. It'll show you the location, proxy variables, and other useful information on your current connection. It even goes as far as showing your location on Google Maps. This is a good way to test a proxy server.


Since this site doesn't offer any premium or paid services, there is no guarantee that the free proxies you find here are always online or that they have any security measures to protect your proxy scraping activities.

Results (full benchmark & methodology)

Website         Errors   Blocked   Success   Average time
Instagram       654      332       14        3.74
Google          969      90        3         13.74
Amazon          675      3         322       16.40
Top 300 Alexa   742      0         258       12.73

4. GatherProxy review


GatherProxy (proxygather.com) is another great option for finding free proxy lists. It's a bit more organized than many of the lists you'll find online. You can find proxies based on country or port number. There are also anonymous proxies and web proxies. Plus, they have a separate section for socks lists.

The site also offers several free tools like a free proxy scraper. You can download the tool, but it hasn't been updated in a few years. It's a good starting point if you are trying to build a proxy scraper or do web scraping in general. There is also an embed plugin for GatherProxy that lets you add a free proxy list to your own website if that would be useful for you.

If you want to check your IP address or browser information, they also have a tool to show you that information. It's not as detailed as the IP address information you see on free-proxy.cz, but it still gives you enough information to find what you need.

Another tool you can find on this site is the proxy checker. It lets you find, filter, and check the status of millions of proxies. You can export all of the proxies you find using this tool into a number of different formats, like CSV. There are some great videos on GatherProxy that show you how to use these tools.

The main difference between this site and a lot of the others is that you have to enter an email address before you can browse through their lists of free proxies. It's still a completely free service, but you have to sign up and get login credentials. Once you do that, you'll be able to see the tables of free proxies and sort them by a number of parameters.

You also have the option to download the free proxy lists after you sort and filter them based on your search criteria. One nice feature is that they auto-update the proxy lists constantly so you don't have to worry about getting a list of stale IP addresses.

Results (full benchmark & methodology)

(At the time of writing, this service was down.)

5. freeproxylists.net review


Freeproxylists is simple to use. The homepage brings up a table of all of the free proxies that have been found. Like many of the other sites in this post, you can sort the table by country, port number, uptime, and other parameters. The results are paginated, so you'll have to click through multiple pages to see everything available.

It has a straightforward filtering function at the top of the page so you can limit the number of results shown in the table. If using a proxy from a specific country is a concern, you can go to the “By Country” page. It'll show you a list of all of the countries the free proxies represent and the number of proxies available for each country.

One downside is that you won't be able to download the proxy list from this website. This is probably one of the more basic free proxy lists you'll find online for your web scrapers. However, this service does have a good reputation compared to the thousands of other lists available, and the proxies you find here at least work.

(Even for sites with a decent reputation for their free proxy lists, always remember that there is a risk involved in using proxies hosted by entities you don't know.)

This list seems to be updated frequently, but they don't share how often it's updated. You'll find free proxies here, but it would be best to use a different tool to check if the proxy you want to use is still available.

There is an email address available on the site if you have questions, although you shouldn't expect a fast response time. Unlike some of the other free proxy sites, there aren't any paid or premium versions of the proxy lists or any additional tools, like proxy scrapers.

Results (full benchmark & methodology)

Website         Errors   Blocked   Success   Average time
Instagram       386      585       29        0.70
Google          984      640       16        8.90
Amazon          376      13        611       21.02
Top 300 Alexa   483      0         517       10.90

Benchmark

Now that we have looked at the different free proxies available on the market, it is time to test them against different websites. The benchmark is simple.

We made a script that collects free proxies from each list (it has to be dynamic and fetch the latest proxies, since the lists change every few hours on these websites). Then we have a set of URLs for some popular websites like Instagram, Google and Amazon, plus 300 URLs from the top 1,000 Alexa ranking. We then request each URL through the proxies and record the response time, HTTP code, and any blocking behavior on the website.

For example, Google will send a 429 HTTP code if they block an IP, Amazon will return a 200 HTTP code with a Captcha in the body, and Instagram will redirect you to the login page.
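As a rough illustration, the per-request classification could be sketched like this; it is a simplified version of the idea (the check_block helper and the string checks are illustrative only), and the real script is linked below:

```python
import requests

def check_block(url, proxy):
    """Send one request through a proxy and classify the outcome."""
    proxies = {'http': proxy, 'https': proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=15)
    except requests.RequestException:
        return 'error'                    # timeout, refused connection, etc.
    if response.status_code == 429:
        return 'blocked'                  # e.g. Google rate-limiting the IP
    if 'captcha' in response.text.lower():
        return 'blocked'                  # e.g. Amazon returning a captcha with a 200
    if 'login' in response.url:
        return 'blocked'                  # e.g. Instagram redirecting to the login page
    return 'success'
```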

You can find the script here: https://github.com/ScrapingBee/freeproxylist-blogpost

We ran the script using each proxy list with the different websites, 1,000 requests each time and found the following results:

Instagram

Proxy List      Errors   Blocked   Success   Average time
Proxyscrape     392      592       16        24.55
Freeproxycz     654      332       14        3.74
Freeproxylist   386      585       29        0.70
ScrapingBee     45       0         955       3.3

Google

Proxy List      Errors   Blocked   Success   Average time
Proxyscrape     958      447       42        16.12
Freeproxycz     969      90        3         13.74
Freeproxylist   984      640       16        8.90
ScrapingBee*    80       0         920       8.30

*Using ScrapingBee Google API

Amazon

Proxy List      Errors   Blocked   Success   Average time
Proxyscrape     445      16        539       20.37
Freeproxycz     675      3         322       16.40
Freeproxylist   376      13        611       21.02
ScrapingBee     22       0         978       3.34

Top 300 Alexa Rank

Proxy List      Errors   Blocked   Success   Average time
Proxyscrape     551      1         448       13.60
Freeproxycz     742      0         258       12.73
Freeproxylist   483      0         517       10.90
ScrapingBee     5        0         995       3.34

Analysis

The biggest issue with all of these proxies was the error rate: timeouts, network errors, HTTPS problems… you name it.


Then, especially for Google and Instagram, most of the requests made with the “working” proxies (meaning proxies that don't produce timeouts or network errors) were blocked. This can be explained by the fact that Google is heavily scraped with tools like Scrapebox and the Screaming Frog SEO Spider.

These are SEO tools used to get keyword suggestions, scrape Google, and generate SEO reports. They have a built-in mechanism to gather these free proxy lists, and lots of SEO people use them. So, these proxies are over-used on Google and often get blocked.

Overall, besides ScrapingBee of course, Freeproxylists.net seems to have the best proxies, but as you can see it's not that great either.

Conclusion


When you are trying to use web scraping to get information about competitors, find email addresses, or get other data from a website, using a proxy will help you protect your identity and avoid adding your true IP address to any blocklists. Proxy scrapers help you keep your bots secure and crawling pages for as long as you need.

While there are numerous lists of free proxies online, not all of them contain the same quality of proxies. Be aware of the risks that come with using free proxies. There's a chance you could connect to one hosted by a hacker or government agency or just someone trying to insert their ads into every response that is returned from any website. That's why it's good to use free proxy services from websites you trust.

Having a list of free proxies gives you the advantage of not dealing with blacklists because if an IP address gets blocked, you can move on to another proxy without much hassle. If you need to use the same IP address multiple times for your web scraping, it will be worth the investment to pay for a service that has support and manages its own proxies so you don't have to worry about them going down at the worst time.