Best practices for scraping website data for beginners

In the previous article, I explained how to scrape a website using selenium VBA and mentioned that selenium is not always the best way to scrape data. So let’s talk about the different scraping methods and how to choose the best one for a web page.

Best practices for scraping websites

Before going into scraping websites, let’s understand how websites work!

How websites work

Websites are essentially collections of HTML pages. They live on the Internet, which is a network of computers all over the world, so any website we see is stored on a computer somewhere. That computer runs a web server, which serves HTML files whenever a request is made to it.

For example, when you go to codingislove.com in your browser, the browser makes a GET request to codingislove.com, which is mapped to a web server’s IP address. The web server responds with an HTML file, which your browser then displays.

Basic GET requests

So all you have to do is make a GET request to the web page you want to scrape, using any programming language > the server responds with an HTML page > parse the HTML you received and you have the required data!
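As a rough illustration of this request-then-parse flow, here is a minimal Python sketch that uses only the standard library. The names `fetch_html`, `extract_links` and `LinkTextParser` are made up for this example, and collecting links is just one assumption about what data you might want from a page.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkTextParser(HTMLParser):
    """Collects the href and text of every <a href="..."> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) tuples
        self._href = None    # href of the <a> tag we are currently inside
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url):
    """Make a GET request and return the response body as text."""
    # Some servers reject requests that have no User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Parse the HTML and return the (href, text) pairs found in it."""
    parser = LinkTextParser()
    parser.feed(html)
    return parser.links
```

Calling `extract_links(fetch_html("https://codingislove.com"))` would then return whatever links the server sends in its HTML response.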

Client side rendering and Server side rendering

In server-side rendering, the data is inserted into the HTML on the server, and the HTML with data is sent in the response. For example, this blog uses server-side rendering: a blog post’s title and content are pulled from a database and rendered on the server, and the HTML with data is sent when a GET request is made.

For server-side rendered websites, we can just make a basic GET request and parse the HTML to get the data.

If you are using Excel VBA to make requests, read How to send HTTP requests in VBA.

For a scraping example of a server-side rendered web page, see Parse HTML in Excel VBA – Learn by parsing hacker news home page.

In client-side rendering, the server sends only the HTML layout along with Javascript files > the data is pulled from a different source or an API using Javascript and rendered in your browser.

An example of a client-side rendered website is the one in the previous article, where the website pulls data from its JSON API.

If we make a GET request to a client-side rendered website, we’ll get only the HTML layout without the data. So all you have to do is find the API or other source from which the data is being pulled and make a request to that source directly.

Finding the source: Open the website in a browser > open Chrome developer tools or Firefox developer tools > Network tab > it shows all HTTP requests made by that web page > use the XHR filter to see only the requests made by Javascript and check the response of each request to find the one with the data.

Now make a GET or POST request with the appropriate parameters to that source URL and you have the data. Check the example in the previous article to understand this clearly.
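Once the source URL has been found in the network tab, requesting it directly could look something like the Python sketch below. It assumes the source returns JSON and, for POST requests, accepts a JSON body. The helper names and the `example.com` URL in the usage note are placeholders, not a real API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url, params=None):
    """Append query parameters to the source URL found in the network tab."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


def build_request(url, post_data=None):
    """Build a GET request, or a POST request when post_data is given."""
    headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
    body = None
    if post_data is not None:
        # Assumption: the API expects a JSON body. Some APIs expect
        # form-encoded data instead, so check the request in the network tab.
        body = json.dumps(post_data).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=body, headers=headers)


def fetch_json(url, post_data=None):
    """Send the request and decode the JSON response."""
    with urlopen(build_request(url, post_data), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

For instance, `fetch_json(build_api_url("https://example.com/api/items", {"page": 2}))` would send a GET request with a `page` parameter. Copy the real URL, parameters and method from the request you found in the developer tools.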

Combined rendering

Many websites use both client-side and server-side rendering together for the best user experience. Take any e-commerce website as an example: when we open a product category, the first page of products is rendered server side.

When we click on ‘See more products’ > more products are pulled from its API and rendered in the browser.

I recently saw a question in a forum where a user wanted to scrape products from an e-commerce website. He was making a GET request to the page and could scrape only a few products, because the other products are loaded only when the ‘See more’ button is clicked, which a simple GET request to the page cannot trigger.

The solution is simple: find the source API from which the website is pulling data and make the appropriate GET or POST request to that API.
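To collect everything behind a ‘See more’ button, the API can usually be called once per page until it stops returning items. The sketch below is one way to do that in Python. The `page` query parameter and the `products` key are assumptions made for illustration, since every site names these differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def scrape_all_products(fetch_page, max_pages=50):
    """Collect items page by page until a page comes back empty."""
    products = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page means there is nothing more to load
        products.extend(items)
    return products


def make_api_fetcher(api_url, page_param="page", items_key="products"):
    """Return a fetch_page function for a hypothetical paginated JSON API."""

    def fetch_page(page):
        url = api_url + "?" + urlencode({page_param: page})
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=10) as response:
            payload = json.loads(response.read().decode("utf-8"))
        # Assumption: the items sit in a list under items_key.
        return payload.get(items_key, [])

    return fetch_page
```

Passing the page fetcher in as a function keeps the loop independent of any one site: `scrape_all_products(make_api_fetcher(api_url))` would work for an API shaped like the assumptions above, and a different fetcher can be swapped in for APIs that use offsets or cursors. The `max_pages` limit is there so a misbehaving API cannot keep the loop running forever.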

Best method for scraping website

First, check if the website has an API. If it does, use the API.

Check if the data is rendered server side. If it is, make GET requests directly to that URL.

Use selenium where Javascript has to be executed. For example, a site which pulls data from an API and then makes further changes to the data using Javascript.
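One quick way to check the second point is to look for a value you can see in the browser inside the raw HTML returned by a plain GET request. The Python sketch below does that and then mirrors the three steps above. Both function names are invented for this example, and the text check is a rough heuristic rather than a reliable test.

```python
import html as html_lib


def is_in_raw_html(raw_html, sample_text):
    """Return True if text seen in the browser also appears in the raw HTML."""

    def normalise(text):
        # Decode entities such as &amp; and collapse runs of whitespace.
        return " ".join(html_lib.unescape(text).split()).lower()

    return normalise(sample_text) in normalise(raw_html)


def choose_method(has_api, data_in_raw_html):
    """Pick a scraping method in the order described above."""
    if has_api:
        return "api"
    if data_in_raw_html:
        return "get"
    return "selenium"
```

If a product name or post title from the rendered page is missing from the raw HTML, the data is most likely being loaded by Javascript, so look for the source request in the network tab before reaching for selenium.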

Wrapping up

Many sites do not allow scraping and ban scrapers’ IP addresses, although scrapers use proxies to get around this. Make sure not to violate a site’s rules. Scraping data legally can help automate your work and make you more productive.

If you have any questions or feedback, comment below.


Author: Ranjith kumar

A CA student by education, self taught coder by passion, loves to explore new technologies and believes in learn by doing.
