Site icon Coding is Love

Parse HTML in Excel VBA – Learn by parsing hacker news home page (Updated 2022)

Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.

I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.

Why Hacker News? Because everyone knows Hacker News!

Final output looks like image below.

Getting started

  1. Microsoft HTML object library is used in parsing HTML.
  2. Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)

A few basics first and then dive into code!

Defining and setting HTML

HTML object can be defined using :

Dim html As New HTMLDocument

HTML can be set to this object using this syntax :

html.body.innerHTML = htmlstring

Useful methods and properties

There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:

  1. getElementsByTagName
  2. getElementsByClassName
  3. getElementById
  4. getAttribute
  5. innerText
  6. innerHTML

Steps

  1. First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA
  2. Set HTTP response to our HTML object.
  3. Get all the latest topics using getElementsByClass method
  4. Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.
  5. Enter all parsed values to sheet 1

Complete Code

Have a look at the code first.

Public Sub parsehtml()
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://news.ycombinator.com/", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("athing")
i = 2
For Each topic In topics
Set titleElem = topic.getElementsByTagName("td")(2)
Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
i = i + 1
Next
End Sub

When I say topics, I mean posts.

Code Explanation

  1. First we define all the required objects like HTMLDocument, MSXML2.XMLHTTP
  2. Topics object to store topics which will be used to loop. I’ve defined few other objects like titleElem and detailsElem to make code readable.
  3. get homepage and set its HTML to our HTML object in line 7
  4. Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named athing so we can use getElementsByClassName method to get those elements.
  5. How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)
  6. In line 8, We get all the topics using getElementsByClassName method by specifying the class name athing. Next we loop through all the topics using a For loop
  7. All the topics are in a table element and each topic is a tr element (row) and topic’s details are in the next tr element. Each row has sub-parts : td elements which have the content like topic name, domain name, upvotes, username etc.

  1. Topic title and domain are in the third td element so we get it using getElementsByTagName("td")(2) (Index starts at 0). Topic name and link is in another a element so we get it using getElementsByTagName("a")(0) and enter its values in sheet 1
  2. Topic details like upvotes and username are in the next element to topic element so we get it using NextSibling method. Get upvotes and username which are in span and a elements and enter in sheet 1.
  3. Integer i is used to store row number which starts at 2 and increments with every topic.

Wrapping up

HTML Elements can also be defined as HTMLBaseElement for auto completion.

There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.

Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement
Dim sXPathArray() As String

Dim sNodeName As String
Dim sNodeNameIndex As String
Dim sRestOfXPath As String
Dim lNodeIndex As Long
Dim lCount As Long

' Split the xpath statement
sXPathArray = Split(sXPath, "/")
sNodeNameIndex = sXPathArray(1)
If Not InStr(sNodeNameIndex, "[") > 0 Then
sNodeName = sNodeNameIndex
lNodeIndex = 1
Else
sXPathArray = Split(sNodeNameIndex, "[")
sNodeName = sXPathArray(0)
lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1))
End If
sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1))

Set getXPathElement = Nothing
For lCount = 0 To objElement.ChildNodes().Length - 1
If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then
If lNodeIndex = 1 Then
If sRestOfXPath = "" Then
Set getXPathElement = objElement.ChildNodes().item(lCount)
Else
Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount))
End If
End If
lNodeIndex = lNodeIndex - 1
End If
Next lCount
End Function

If you have any questions or feedback, comment below.

Exit mobile version