Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.
I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.
Why Hacker News? Because everyone knows Hacker News!
Final output looks like image below.
Getting started
- Microsoft HTML object library is used in parsing HTML.
- Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)
A few basics first and then dive into code!
Defining and setting HTML
HTML object can be defined using :
Dim html As New HTMLDocument
HTML can be set to this object using this syntax :
html.body.innerHTML = htmlstring
Useful methods and properties
There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:
getElementsByTagName
getElementsByClassName
getElementById
getAttribute
innerText
innerHTML
Steps
- First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA
- Set HTTP response to our HTML object.
- Get all the latest topics using
getElementsByClass
method - Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.
- Enter all parsed values to sheet 1
Complete Code
Have a look at the code first.
Public Sub parsehtml() Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement Dim i As Integer Set http = CreateObject("MSXML2.XMLHTTP") http.Open "GET", "https://news.ycombinator.com/", False http.send html.body.innerHTML = http.responseText Set topics = html.getElementsByClassName("athing") i = 2 For Each topic In topics Set titleElem = topic.getElementsByTagName("td")(2) Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1) Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText i = i + 1 Next End Sub
When I say topics, I mean posts.
Code Explanation
- First we define all the required objects like
HTMLDocument
,MSXML2.XMLHTTP
- Topics object to store topics which will be used to loop. I’ve defined few other objects like
titleElem
anddetailsElem
to make code readable. - get homepage and set its HTML to our HTML object in line 7
- Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named
athing
so we can usegetElementsByClassName
method to get those elements. - How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)
- In line 8, We get all the topics using
getElementsByClassName
method by specifying the class nameathing
. Next we loop through all the topics using aFor
loop - All the topics are in a
table
element and each topic is atr
element (row) and topic’s details are in the nexttr
element. Each row has sub-parts :td
elements which have the content like topic name, domain name, upvotes, username etc.
- Topic title and domain are in the third
td
element so we get it usinggetElementsByTagName("td")(2)
(Index starts at 0). Topic name and link is in anothera
element so we get it usinggetElementsByTagName("a")(0)
and enter its values in sheet 1 - Topic details like upvotes and username are in the next element to topic element so we get it using
NextSibling
method. Get upvotes and username which are inspan
anda
elements and enter in sheet 1. - Integer
i
is used to store row number which starts at 2 and increments with every topic.
Wrapping up
HTML Elements can also be defined as HTMLBaseElement
for auto completion.
There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.
Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement Dim sXPathArray() As String Dim sNodeName As String Dim sNodeNameIndex As String Dim sRestOfXPath As String Dim lNodeIndex As Long Dim lCount As Long ' Split the xpath statement sXPathArray = Split(sXPath, "/") sNodeNameIndex = sXPathArray(1) If Not InStr(sNodeNameIndex, "[") > 0 Then sNodeName = sNodeNameIndex lNodeIndex = 1 Else sXPathArray = Split(sNodeNameIndex, "[") sNodeName = sXPathArray(0) lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1)) End If sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1)) Set getXPathElement = Nothing For lCount = 0 To objElement.ChildNodes().Length - 1 If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then If lNodeIndex = 1 Then If sRestOfXPath = "" Then Set getXPathElement = objElement.ChildNodes().item(lCount) Else Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount)) End If End If lNodeIndex = lNodeIndex - 1 End If Next lCount End Function
If you have any questions or feedback, comment below.