In the previous article, I explained how to scrape a website using Selenium VBA and mentioned that Selenium is not always the best way to scrape data. So let's talk about the different scraping methods and how to choose the best one for a web page.
Before getting into scraping websites, let's understand how websites work!
How websites work
Websites are just a bunch of HTML pages. They live on the Internet, and the Internet is a network of computers all over the world. So any website that we see on the Internet is stored on a computer somewhere. A web server is installed on that computer, and it serves HTML files whenever a request is made to it.
For example: when you go to codingislove.com in your browser, the browser makes a GET request to codingislove.com, which is mapped to a web server's IP address. The web server responds with an HTML file, which is then shown in your browser.
Basic GET requests
So all you have to do is make a GET request, using any programming language, to the web page that you want to scrape > the server responds with an HTML page > parse the HTML received and you have the data you need!
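The request > response > parse flow above can be sketched in a few lines. This is only a minimal illustration in Python using the standard library; the helper names (fetch_html, extract_title) and the User-Agent string are made up for this example, and extracting the page title stands in for whatever data you actually need.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Parse an HTML string and return the text of its <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

Calling extract_title(fetch_html("https://codingislove.com")) would perform the real request; the parsing step can also be tried on any HTML string without touching the network.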
Client side rendering and Server side rendering
In server-side rendering, the data is rendered into the HTML on the server, and the HTML with the data already in it is sent in the response. For example, this blog uses server-side rendering. A blog post's title and content are pulled from a database and rendered on the server, and the HTML with the data is sent when a GET request is made.
For server-side rendered websites, we can just make a basic GET request and parse the HTML to get the data.
If you are using Excel VBA to make requests, then read How to send HTTP requests in VBA
Scraping example of server-side rendered web page – Parse HTML in Excel VBA – Learn by parsing hacker news home page
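For readers not using VBA, here is a rough Python sketch of the parsing step for a server-side rendered page: since the data is already in the HTML, pulling out every link with its text is enough to get a list of items. The names LinkParser and extract_links are invented for this example, and a real page would usually need extra filtering (for instance by CSS class) that is not shown here.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return all (text, href) pairs found in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the HTML returned by a plain GET request gives the link list directly, with no browser involved.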
In client-side rendering, the server sends only the HTML layout along with JavaScript files > the data is pulled from a different source or an API using JavaScript and rendered in your browser.
An example of a client-side rendered website is in the previous article, where the website pulls data from its JSON API
If we make a GET request to a client-side rendered website, we'll get only the HTML layout without the data. So all you have to do is find the API or other source from which the data is being pulled and make a request to that source directly.
Finding the source: open the website in a browser > open Chrome developer tools or Firefox developer tools > Network tab > it shows all HTTP requests made by that web page > use the XHR filter to see only the XHR (AJAX) requests made by the page's JavaScript, and check the response of each request to find the one with the data.
Now make a GET or POST request with the appropriate parameters to that source URL and you have the data. Check the example in the previous article to understand this clearly.
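As a rough sketch of requesting such a data source directly, the Python snippet below builds the URL with query parameters, sends a GET request and decodes the JSON response. Everything specific here is an assumption for illustration: the function names, the "items" key and the example.com endpoint in the usage note are placeholders, and the real URL, parameters and response shape are whatever you see in the Network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url, params):
    """Append URL-encoded query parameters to the endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_items(body, key="items"):
    """Decode a JSON response body and return the list stored under `key`.

    The key name is hypothetical; use the one the real response contains.
    """
    payload = json.loads(body)
    return payload.get(key, [])


def fetch_json_items(base_url, params, key="items", timeout=10):
    """GET the data source directly and return the decoded items."""
    request = Request(
        build_api_url(base_url, params),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_items(response.read().decode("utf-8"), key)
```

For example, fetch_json_items("https://example.com/api/products", {"category": "shoes", "page": 1}) would request a made-up endpoint; swap in the URL and parameters copied from the developer tools.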
Combined rendering
Most websites use client-side rendering and server-side rendering side by side for the best user experience. Take any e-commerce website as an example: when we open a product category, the first page of products is rendered server side.
When we click on 'See more products' > more products are pulled from its API and rendered in the browser.
Recently I saw a question in a forum where the user wanted to scrape a page from an e-commerce website. He was making a GET request to that page and was able to scrape only a few products, because the other products are loaded only when the 'See more' button is clicked, which a simple GET request to the page cannot do.
The solution is simple: find the source API from which the website is pulling data and make the appropriate GET or POST request to that API.
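Once the API behind a 'See more' button is known, getting all products usually comes down to requesting page after page until nothing more comes back. The Python sketch below shows only that loop; fetch_page is a stand-in for whatever function performs the real request, and the assumption that the API takes a page number and returns an empty list at the end is mine, since real APIs may use offsets or cursors instead.

```python
def fetch_all_products(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page is any callable that takes a page number and returns a list
    of products. max_pages is a safety limit so the loop always terminates.
    """
    products = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        products.extend(batch)
    return products


# In-memory stand-in for a paginated API, used only to demonstrate the loop.
FAKE_PAGES = {1: ["shoe A", "shoe B"], 2: ["shoe C"]}


def fake_fetch_page(page):
    """Return the fake products for a page, or an empty list past the end."""
    return FAKE_PAGES.get(page, [])
```

Here fetch_all_products(fake_fetch_page) walks both fake pages and stops at the empty third one; in practice fetch_page would wrap the GET or POST request to the real API.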
Best method for scraping website
First, check if the website has an API. If it has, then use the API.
Check if the data is rendered server side. If yes, then make GET requests directly to that URL.
Use Selenium where JavaScript has to be executed. For example, a site which pulls data from an API and makes further changes to the data using JavaScript.
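The checklist above can be written down as a small decision helper. This is only a heuristic sketch in Python, not a definitive rule: the function name, its parameters and the returned labels are invented here, and the "is the data in the raw HTML?" test simply looks for a value you can see on the page inside the response of a plain GET request.

```python
def choose_scraping_method(raw_html, sample_value,
                           has_public_api=False, data_source_found=False):
    """Suggest a scraping method following the checklist in the article.

    raw_html          -- body returned by a plain GET request to the page
    sample_value      -- a piece of data visible on the page in a browser
    has_public_api    -- True if the site offers an API
    data_source_found -- True if the Network tab revealed the data request
    """
    if has_public_api:
        return "api"
    if sample_value in raw_html:
        # The data is already in the HTML, so it was rendered server side.
        return "get-and-parse"
    if data_source_found:
        # Client-side rendered, but the underlying request can be replayed.
        return "request-data-source"
    # JavaScript has to run to produce the data.
    return "selenium"
```

For instance, if a product name shown in the browser is missing from the raw HTML and no data request can be found, the helper falls through to Selenium.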
Wrapping up
Many sites do not allow scraping and ban a scraper's IP address, although scrapers often use proxies to get around such bans. Make sure not to violate the site's rules. Scraping data legally can help automate your work and make you more productive.
If you have any questions or feedback, comment below.
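One place where sites publish rules for automated clients is their robots.txt file. As a small sketch, Python's standard urllib.robotparser can check a URL against those rules; the is_allowed helper and the user agent name are made up for this example, and passing robots.txt does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules, written inline so the check runs without a network request.
EXAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
```

With these example rules, a URL under /private/ is reported as disallowed while other paths are allowed. For a live site, the text would come from a GET request to its /robots.txt.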
I have read some excellent stuff here. Certainly worth bookmarking for revisiting.
I'm surprised how much effort you put in to create this kind of wonderful, informative website.
Excellent article.
Do you have any tutorials on how to send the Percentage sign into a webpage input text box?
I tried Unicode but could not get it to work.
U+0025 % 37 Percent sign
I found one HINT to a solution, but I tried the HTML escape code and the Unicode, but neither worked within the XHR .post string to a webpage input text box.
https://stackoverflow.com/questions/13093126/insert-unicode-character-into-javascript
Any suggestions greatly appreciated.
Figured it out, using + ” “
Hi Ranjith,
The forum was overloaded.
Is there a replacement in XHR VBA for URL location and Title?
'Debug.Print "URL : " & http.LocationURL
'Debug.Print "TITLE URL : " & http.Title
Hi Ranjith,
Saving Excel file to desktop. Your code is brilliant!
I adapted it to create a new temp file for the desktop, any suggestions on improvement of the below?
Sub MsgboxDesktopAddress()
MsgBox CreateObject("WScript.Shell").SpecialFolders("Desktop")
newTmp = CreateObject("WScript.Shell").SpecialFolders("Desktop")
CreateObject("Scripting.FileSystemObject").CreateFolder (newTmp & "/temp")
End Sub
MODIFIED:
Dim newTmp As String
newTmp = CreateObject("WScript.Shell").SpecialFolders("Desktop")
Debug.Print newTmp
If Not CreateObject("Scripting.FileSystemObject").FolderExists(newTmp & "\tmp") Then
CreateObject("Scripting.FileSystemObject").CreateFolder (newTmp & "\tmp")
End If
Is anyone using an http server, PAAS for testing scripts?
An online version of node?
Hi J, What scripts do you want to test?
Sorry to reply so tardy, am using code pen within a React course.