In the previous article, I explained how to scrape a website using Selenium VBA, and also mentioned that Selenium is not always the best way to scrape data. So let’s talk about different scraping methods and how to choose the best one for a web page.
Before getting into scraping websites, let’s understand how websites work!
How websites work
Websites are just a bunch of HTML pages. Websites are on the Internet, and the Internet is a network of computers all over the world. So any website that we see on the Internet lives on a computer somewhere. A web server is installed on that computer, and it serves HTML files when a request is made to it.
For example: when you go to codingislove.com in your browser, the browser makes a GET request to codingislove.com, which is mapped to a web server’s IP address. The web server responds with an HTML file, which your browser displays.
Basic GET requests
So all you have to do is make a GET request, in any programming language, to the web page that you want to scrape > the server responds with an HTML page > parse the HTML received, and you have the data you need!
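Here is a minimal sketch of that flow in Python, using only the standard library. The `fetch_html` and `extract_links` helpers, the `LinkParser` class, and the User-Agent string are illustrative names chosen for this example; a real page needs parsing logic matched to its own HTML.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Make a plain GET request and return the response body as text."""
    # A User-Agent header is set because some servers reject requests without one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (scraping-example)"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html: str) -> list:
    """Parse an HTML string and return its links as (text, href) tuples."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

With these in place, `extract_links(fetch_html(url))` would return the links of a server-side rendered page, assuming the page is reachable and allows such requests.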
Client side rendering and Server side rendering
In server-side rendering, the data is rendered into the HTML on the server, and the HTML with data is sent in the response. For example, this blog uses server-side rendering: a blog post’s title and content are pulled from a database and rendered on the server, and the HTML with data is sent when a GET request is made.
For server-side rendered websites, we can just make a basic GET request and parse the HTML to get the data.
If you are using Excel VBA to make requests, then read How to send HTTP requests in VBA.
Scraping example of a server-side rendered web page: Parse HTML in Excel VBA – Learn by parsing hacker news home page
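In client-side rendering, the server sends an HTML layout along with JavaScript, and the browser then fetches the data and renders it into the page. An example of a client-side rendered website is the one in the previous article, where the website pulls its data from a JSON API.
If we make a GET request to a client-side rendered website, we’ll get only the HTML layout without data. So all you have to do is find the API, or whatever other source the data is being pulled from, and make a request to that source directly.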
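One rough way to tell the two cases apart is to pick a value you can see in the browser and check whether it already appears in the raw HTML response. The sketch below, in Python, is only a heuristic: the helper names and the two sample pages are made up for illustration, and the regex-based tag stripping is not a real HTML parser.

```python
import re


def visible_text(html: str) -> str:
    """Roughly strip scripts, styles and tags to approximate the text in the raw HTML."""
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    without_tags = re.sub(r"(?s)<[^>]+>", " ", without_code)
    return re.sub(r"\s+", " ", without_tags).strip()


def is_in_raw_html(html: str, expected: str) -> bool:
    """True if a value seen in the browser is already present in the raw response."""
    return expected.lower() in visible_text(html).lower()


# Hypothetical responses: the first has the data in the HTML, the second only a layout.
server_page = "<html><body><h1>Blue Widget</h1><p>$19.99</p></body></html>"
client_page = '<html><body><div id="app"></div><script>fetch("/api/products")</script></body></html>'

print(is_in_raw_html(server_page, "Blue Widget"))  # True
print(is_in_raw_html(client_page, "Blue Widget"))  # False
```

A `False` result suggests the data arrives later through a separate request. Note that this check ignores anything inside `<script>` tags, so a page that embeds its data as inline JSON would also report `False`.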
Finding the source: open the website in a browser > open Chrome developer tools or Firefox developer tools > Network tab > it shows all HTTP requests made by that web page > use the XHR filter to see only the requests made by the page’s JavaScript (XHR/fetch), and check the response of each request to find the one with the data.
Now make a GET or POST request with the appropriate parameters to that source URL, and you have the data. Check the example in the previous article to understand this clearly.
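As a sketch of what such a request could look like in Python: the endpoint URL, the `category` and `page` parameters, and the `products`/`name` fields below are hypothetical stand-ins for whatever the network tab actually shows for the site you are scraping.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url: str, params: dict) -> str:
    """Append query parameters, as copied from the request seen in the network tab."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, payload=None):
    """GET the URL, or POST `payload` as JSON when one is given, and decode the JSON reply."""
    headers = {"Accept": "application/json", "User-Agent": "Mozilla/5.0 (scraping-example)"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # urllib sends a POST when `data` is provided and a GET otherwise.
    request = Request(url, data=data, headers=headers)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def product_names(api_response: dict) -> list:
    """Pull one field out of the decoded JSON; the field names are assumptions."""
    return [item["name"] for item in api_response.get("products", [])]


# Hypothetical endpoint and parameters, for illustration only.
url = build_api_url("https://example.com/api/products", {"category": "shoes", "page": 2})
print(url)  # https://example.com/api/products?category=shoes&page=2
```

Because the response is already structured JSON, there is no HTML to parse: `product_names(fetch_json(url))` would return the names directly, assuming the API responds in that shape.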
Many websites use client-side rendering and server-side rendering side by side for the best user experience. Take any e-commerce website as an example: when we open a product category, the first page of products is rendered server side.
When we click ‘See more products’, more products are pulled from its API and rendered in the browser.
Recently I saw a question in a forum where the user wanted to scrape products from an e-commerce website. He was making a GET request to the page and was able to scrape only a few products, because the other products are loaded only when the ‘See more’ button is clicked, which a simple GET request to the page cannot do.
The solution is simple: find the source API from which the website is pulling data, and make an appropriate GET or POST request to that API.
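The ‘See more’ button usually just asks that API for the next page, so a scraper can request the pages itself in a loop. Below is a minimal Python sketch of that loop. It assumes the API takes a page number and returns an empty list when there are no more products; real APIs may use offsets or cursors instead. `fake_fetch_page` is a stand-in for the actual HTTP call.

```python
from typing import Callable, Iterator


def iter_all_products(fetch_page: Callable[[int], list], max_pages: int = 100) -> Iterator[dict]:
    """Yield products page by page until the API returns an empty page."""
    # max_pages is a safety limit so a misbehaving API cannot cause an endless loop.
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        yield from items


def fake_fetch_page(page: int) -> list:
    """Stand-in for a real request to the site's API; returns canned data."""
    pages = {1: [{"name": "A"}, {"name": "B"}], 2: [{"name": "C"}]}
    return pages.get(page, [])


names = [product["name"] for product in iter_all_products(fake_fetch_page)]
print(names)  # ['A', 'B', 'C']
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP details, so the same loop works whether the API needs a GET or a POST request.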
Best method for scraping a website
First, check if the website has an API. If it does, use the API.
Otherwise, check if the data is rendered server side. If it is, make GET requests directly to that URL.
Many sites do not allow scraping and ban scrapers’ IP addresses, although some scrapers use proxies to get around such bans. Make sure not to violate a site’s rules. Scraping data legally can help automate your work and make you more productive.
If you have any questions or feedback, comment below.