Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.
I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.
Why Hacker News? Because everyone knows Hacker News!
Final output looks like image below.
Getting started
- Microsoft HTML object library is used in parsing HTML.
- Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)
A few basics first and then dive into code!
Defining and setting HTML
HTML object can be defined using :
Dim html As New HTMLDocument
HTML can be set to this object using this syntax :
html.body.innerHTML = htmlstring
Useful methods and properties
There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:
getElementsByTagNamegetElementsByClassNamegetElementByIdgetAttributeinnerTextinnerHTML
Steps
- First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA
- Set HTTP response to our HTML object.
- Get all the latest topics using
getElementsByClassmethod - Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.
- Enter all parsed values to sheet 1
Complete Code
Have a look at the code first.
Public Sub parsehtml()
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://news.ycombinator.com/", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("athing")
i = 2
For Each topic In topics
Set titleElem = topic.getElementsByTagName("td")(2)
Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
i = i + 1
Next
End Sub
When I say topics, I mean posts.
Code Explanation
- First we define all the required objects like
HTMLDocument,MSXML2.XMLHTTP - Topics object to store topics which will be used to loop. I’ve defined few other objects like
titleElemanddetailsElemto make code readable. - get homepage and set its HTML to our HTML object in line 7
- Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named
athingso we can usegetElementsByClassNamemethod to get those elements. - How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)
- In line 8, We get all the topics using
getElementsByClassNamemethod by specifying the class nameathing. Next we loop through all the topics using aForloop - All the topics are in a
tableelement and each topic is atrelement (row) and topic’s details are in the nexttrelement. Each row has sub-parts :tdelements which have the content like topic name, domain name, upvotes, username etc.
- Topic title and domain are in the third
tdelement so we get it usinggetElementsByTagName("td")(2)(Index starts at 0). Topic name and link is in anotheraelement so we get it usinggetElementsByTagName("a")(0)and enter its values in sheet 1 - Topic details like upvotes and username are in the next element to topic element so we get it using
NextSiblingmethod. Get upvotes and username which are inspanandaelements and enter in sheet 1. - Integer
iis used to store row number which starts at 2 and increments with every topic.
Wrapping up
HTML Elements can also be defined as HTMLBaseElement for auto completion.
There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.
Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement Dim sXPathArray() As String Dim sNodeName As String Dim sNodeNameIndex As String Dim sRestOfXPath As String Dim lNodeIndex As Long Dim lCount As Long ' Split the xpath statement sXPathArray = Split(sXPath, "/") sNodeNameIndex = sXPathArray(1) If Not InStr(sNodeNameIndex, "[") > 0 Then sNodeName = sNodeNameIndex lNodeIndex = 1 Else sXPathArray = Split(sNodeNameIndex, "[") sNodeName = sXPathArray(0) lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1)) End If sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1)) Set getXPathElement = Nothing For lCount = 0 To objElement.ChildNodes().Length - 1 If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then If lNodeIndex = 1 Then If sRestOfXPath = "" Then Set getXPathElement = objElement.ChildNodes().item(lCount) Else Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount)) End If End If lNodeIndex = lNodeIndex - 1 End If Next lCount End Function
If you have any questions or feedback, comment below.
- Ultimate Guide: Build A Mobile E-commerce App With React Native And Medusa.js - February 15, 2025
- Flutter lookup failed in @fields error (solved) - July 14, 2023
- Free open source alternative to Notion along with AI - July 13, 2023

