Best practices for scraping website data for beginners

In the previous article, I explained how to scrape a website using Selenium VBA and mentioned that Selenium is not always the best method for scraping data. So let’s talk about the different scraping methods and how to choose the best one for a web page.

Before getting into scraping, let’s understand how websites work!

How websites work

Websites are essentially collections of HTML pages. They live on the Internet, which is a network of computers spread all over the world, so any website you see is stored on a computer somewhere. That computer runs a web server, which serves HTML files whenever a request is made to it.

For example, when you go to codingislove.com in your browser, the browser makes a GET request to codingislove.com, a domain name that is mapped to a web server’s IP address. The web server responds with an HTML file, which your browser then displays.

Basic GET requests

So all you have to do is make a GET request, in any programming language, to the web page that you want to scrape. The server responds with an HTML page, you parse the HTML you received, and you have the data you need!
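The request-then-parse flow can be sketched in a few lines. This is only an illustration in Python using the standard library. The function names, the User-Agent value and the sample page are assumptions made for the example and do not come from any particular site.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send a GET request and return the response body as text."""
    # Some servers reject requests without a User-Agent; this value is a placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Parse an HTML string and return the page title without surrounding spaces."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Offline demonstration on a hard-coded page, so no network access is needed.
sample_html = "<html><head><title>Sample page</title></head><body></body></html>"
print(extract_title(sample_html))
```

Against a real site, the two steps would be chained as `extract_title(fetch_html(url))`, with `url` being the page you want to scrape.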

Client-side rendering and server-side rendering

In server-side rendering, the data is rendered into the HTML on the server, and the HTML with the data already in it is sent in the response. For example, this blog uses server-side rendering: a blog post’s title and content are pulled from a database and rendered on the server, and the finished HTML is sent when a GET request is made.

For server-side rendered websites, we can just make a basic GET request and parse the HTML to get the data.

If you are using Excel VBA to make requests, read How to send HTTP requests in VBA.

For a scraping example of a server-side rendered web page, see Parse HTML in Excel VBA – Learn by parsing hacker news home page.

In client-side rendering, the server sends only the HTML layout along with JavaScript files. The data is then pulled from a different source, such as an API, using JavaScript and rendered in your browser.

An example of a client-side rendered website is the one in the previous article, which pulls its data from a JSON API.

If we make a GET request to a client-side rendered website, we get only the HTML layout without the data. So all you have to do is find the API or other source the data is pulled from and make a request to that source directly.
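One way to confirm that a page is client-side rendered is to check whether a value you can see in the browser is present in the raw HTML returned by a plain GET request. The sketch below is a rough heuristic in Python, not a definitive test. The function name and both sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def appears_in_markup(html: str, expected_text: str) -> bool:
    """Return True if expected_text occurs in the page text outside scripts and styles."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so line breaks in the markup do not matter.
    text = " ".join(" ".join(parser.chunks).split())
    return expected_text.lower() in text.lower()


# Hypothetical responses: one page with the data in the HTML, one with only a layout.
server_rendered = '<ul><li class="product">Blue Kettle</li></ul>'
client_rendered = '<div id="app"></div><script src="/bundle.js"></script>'

print(appears_in_markup(server_rendered, "Blue Kettle"))
print(appears_in_markup(client_rendered, "Blue Kettle"))
```

If the value is missing from the raw HTML, the data is most likely loaded separately by JavaScript, and the next step is to look for its source.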

Finding the source: open the website in a browser > open Chrome developer tools or Firefox developer tools > Network tab. It shows all HTTP requests made by that web page. Use the XHR filter to see only the requests made by the page’s JavaScript, then check the response of each request to find the one with the data.

Now make a GET or POST request with the appropriate parameters to that source URL and you have the data. Check the example in the previous article to understand this clearly.
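As a rough sketch of this step in Python: once the source URL is known, the request can be rebuilt with the same query parameters and the JSON response parsed directly. The endpoint, the parameter names and the shape of the payload below are all hypothetical. Copy the real ones from the request you found in the Network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url: str, params: dict) -> str:
    """Append URL-encoded query parameters to the API endpoint."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url: str, data=None, timeout: float = 10.0):
    """Send a GET request when data is None, otherwise a POST with a JSON body."""
    headers = {"Accept": "application/json", "User-Agent": "example-scraper/0.1"}
    body = None
    if data is not None:
        body = json.dumps(data).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # urllib switches to POST automatically when a request body is supplied.
    request = Request(url, data=body, headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def product_names(payload: dict) -> list:
    """Pull the product names out of a payload shaped like the sample below."""
    return [item["name"] for item in payload.get("products", [])]


# Offline demonstration with a made-up endpoint and a made-up response body.
url = build_api_url("https://example.com/api/products", {"category": "kettles", "page": 2})
sample_payload = json.loads('{"products": [{"name": "Blue Kettle"}, {"name": "Red Kettle"}]}')
print(url)
print(product_names(sample_payload))
```

With a real endpoint, `product_names(fetch_json(url))` would replace the hard-coded payload.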

Combined rendering

Many websites use client-side rendering and server-side rendering side by side for the best user experience. Take any e-commerce website as an example: when we open a product category, the first page of products is rendered server side.

When we click on ‘See more products’, more products are pulled from the site’s API and rendered in the browser.

Recently I saw a question in a forum where a user wanted to scrape products from an e-commerce website. He was making a GET request to the page and could scrape only a few products, because the rest are loaded when the ‘See more’ button is clicked, which a simple GET request to the page cannot trigger.

The solution is simple: find the source API the website pulls its data from and make the appropriate GET or POST request to that API.
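In practice, a ‘See more’ button usually corresponds to requesting the next page from the API. Here is a minimal Python sketch of that loop, assuming a hypothetical API that returns a `products` list and a `has_more` flag for each page. Real APIs differ, so adjust the field names and stop condition to what the actual responses contain.

```python
from typing import Callable, Dict, List


def collect_all_products(fetch_page: Callable[[int], Dict], max_pages: int = 50) -> List[Dict]:
    """Request page after page until the API reports that nothing is left.

    fetch_page is any function that takes a page number and returns the
    decoded JSON payload for that page. max_pages is a safety limit.
    """
    products: List[Dict] = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get("products", [])
        if not items:
            break
        products.extend(items)
        if not payload.get("has_more", False):
            break
    return products


# Stand-in for the real API so the example runs without network access.
FAKE_PAGES = {
    1: {"products": [{"name": "Blue Kettle"}, {"name": "Red Kettle"}], "has_more": True},
    2: {"products": [{"name": "Steel Kettle"}], "has_more": False},
}


def fake_fetch_page(page: int) -> Dict:
    return FAKE_PAGES.get(page, {"products": [], "has_more": False})


all_products = collect_all_products(fake_fetch_page)
print([product["name"] for product in all_products])
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP code, so the same loop works whether the API expects GET or POST.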

Best method for scraping a website

First, check if the website has an API. If it does, use the API.

Next, check if the data is rendered server side. If it is, make GET requests directly to that URL.

Use Selenium only where JavaScript has to be executed, for example on a site that pulls data from an API and then makes further changes to the data using JavaScript.

Wrapping up

Many sites do not allow scraping and ban scrapers’ IP addresses, although some scrapers use proxies to get around the bans. Make sure you do not violate a site’s rules. Scraping data legally can help automate your work and make you more productive.
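One place where a site may state its rules for automated clients is a robots.txt file. The Python sketch below assumes the site publishes one and uses the standard library’s `urllib.robotparser` to check a URL against it. The rules, user agent name and URLs shown are invented for the example, and a site’s terms of service may add restrictions that robots.txt does not cover.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the rules in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content; a real one is served at /robots.txt on the site.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_robots, "example-scraper", "https://example.com/blog/"))
print(is_allowed(sample_robots, "example-scraper", "https://example.com/private/data"))
```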

If you have any questions or feedback, comment below.

Author

Ranjith kumar

A CA by education and a self-taught coder by passion, he loves to explore new technologies and believes in learning by doing.

11 Comments

Laurence

8 years ago

I have read some excellent stuff here. Certainly worth bookmarking for revisiting. I’m surprised how much effort you put into creating this kind of wonderfully informative website.

7 years ago

Excellent article.

Do you have any tutorials on how to send the Percentage sign into a webpage input text box?
I tried Unicode but could not get it to work.

U+0025 % 37 Percent sign

Reply to j

I found one HINT to a solution, but I tried the HTML escape code and the Unicode, but neither worked within the XHR .post string to a webpage input text box.

https://stackoverflow.com/questions/13093126/insert-unicode-character-into-javascript

Any suggestions greatly appreciated.

Figured it out, using + ” “

Hi Ranjith,

The forum was overloaded.

Is there a replacement in XHR VBA for URL location and Title?
'Debug.Print "URL : " & http.LocationURL
'Debug.Print "TITLE URL : " & http.Title

Saving Excel file to desktop. Your code is brilliant!

I adapted it to create a new temp file for the desktop, any suggestions on improvement of the below?

Sub MsgboxDesktopAddress()
    MsgBox CreateObject("WScript.Shell").SpecialFolders("Desktop")
    newTmp = CreateObject("WScript.Shell").SpecialFolders("Desktop")
    CreateObject("Scripting.FileSystemObject").CreateFolder (newTmp & "/temp")
End Sub

MODIFIED:

Dim newTmp As String
newTmp = CreateObject("WScript.Shell").SpecialFolders("Desktop")
Debug.Print newTmp
If Not CreateObject("Scripting.FileSystemObject").FolderExists(newTmp & "\tmp") Then
    CreateObject("Scripting.FileSystemObject").CreateFolder (newTmp & "\tmp")
End If