Parse HTML in Excel VBA – Learn by parsing hacker news home page

Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.

I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.

Why Hacker News? Because everyone knows Hacker News!

Final output looks like image below.

parse html in vba

Getting started

  1. Microsoft HTML object library is used in parsing HTML.

  2. Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)

A few basics first and then dive into code!

Defining and setting HTML

HTML object can be defined using :

Dim html As New HTMLDocument

HTML can be set to this object using this syntax :

html.body.innerHTML = htmlstring

Useful methods and properties

There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:

  1. getElementsByTagName

  2. getElementsByClassName

  3. getElementById

  4. getAttribute

  5. innerText

  6. innerHTML

Steps

  1. First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA

  2. Set HTTP response to our HTML object.

  3. Get all the latest topics using getElementsByClass method

  4. Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.

  5. Enter all parsed values to sheet 1

Complete Code

Have a look at the code first.

Public Sub parsehtml()
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://news.ycombinator.com/", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("athing")
i = 2
For Each topic In topics
Set titleElem = topic.getElementsByTagName("td")(2)
Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
i = i + 1
Next
End Sub

When I say topics, I mean posts.

Code Explanation

  1. First we define all the required objects like HTMLDocument, MSXML2.XMLHTTP

  2. Topics object to store topics which will be used to loop. I’ve defined few other objects like titleElem and detailsElem to make code readable.

  3. get homepage and set its HTML to our HTML object in line 7

  4. Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named athing so we can use getElementsByClassName method to get those elements.

  5. How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)

  6. In line 8, We get all the topics using getElementsByClassName method by specifying the class name athing. Next we loop through all the topics using a For loop

  7. All the topics are in a table element and each topic is a tr element (row) and topic’s details are in the next tr element. Each row has sub-parts : td elements which have the content like topic name, domain name, upvotes, username etc.

parse html in excel vba

  1. Topic title and domain are in the third td element so we get it using getElementsByTagName("td")(2) (Index starts at 0). Topic name and link is in another a element so we get it using getElementsByTagName("a")(0) and enter its values in sheet 1

  2. Topic details like upvotes and username are in the next element to topic element so we get it using NextSibling method. Get upvotes and username which are in span and a elements and enter in sheet 1.

  3. Integer i is used to store row number which starts at 2 and increments with every topic.

Wrapping up

HTML Elements can also be defined as HTMLBaseElement for auto completion.

There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.

Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement 
    Dim sXPathArray() As String 
     
    Dim sNodeName As String 
    Dim sNodeNameIndex As String 
    Dim sRestOfXPath As String 
    Dim lNodeIndex As Long 
    Dim lCount As Long 
     
     ' Split the xpath statement
    sXPathArray = Split(sXPath, "/") 
    sNodeNameIndex = sXPathArray(1) 
    If Not InStr(sNodeNameIndex, "[") > 0 Then 
        sNodeName = sNodeNameIndex 
        lNodeIndex = 1 
    Else 
        sXPathArray = Split(sNodeNameIndex, "[") 
        sNodeName = sXPathArray(0) 
        lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1)) 
    End If 
    sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1)) 
     
    Set getXPathElement = Nothing 
    For lCount = 0 To objElement.ChildNodes().Length - 1 
        If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then 
            If lNodeIndex = 1 Then 
                If sRestOfXPath = "" Then 
                    Set getXPathElement = objElement.ChildNodes().item(lCount) 
                Else 
                    Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount)) 
                End If 
            End If 
            lNodeIndex = lNodeIndex - 1 
        End If 
    Next lCount 
End Function

If you have any questions or feedback, comment below.

Author: Ranjith kumar

A CA student by education, self taught coder by passion, loves to explore new technologies and believes in learn by doing.

5 Thoughts

  1. Hi ranjith! Hope u r doing well. I got stuck writing a code to pull data from web using vba. I would like to scrape every docs within the newly produced links. It produces 20 links but when i run the code it gives me the docs of the first link and continue until 20 times, that means it repeats the action for a single link not for the 20 links. Hope to have ur kind reply soon. https://bin.codingislove.com/yehinaluki.vbs

    1. I suppose the error is occurring between the [if–end if] block. Ur sophisticated touch on my code will definitely brings the output i long for. By the way, the code is working but getting stuck somewhere that is because it provides a single result 20 times rather than producing 20 different results. Thanks in advance. Hope my question is clear to you.

Leave a Reply

Your email address will not be published. Required fields are marked *