Parse HTML in Excel VBA – Learn by parsing hacker news home page

Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.

I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.

Why Hacker News? Because everyone knows Hacker News!

Final output looks like image below.

parse html in vba

Getting started

  1. Microsoft HTML object library is used in parsing HTML.
  2. Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)

A few basics first and then dive into code!

Defining and setting HTML

HTML object can be defined using :

Dim html As New HTMLDocument

HTML can be set to this object using this syntax :

html.body.innerHTML = htmlstring

Useful methods and properties

There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:

  1. getElementsByTagName
  2. getElementsByClassName
  3. getElementById
  4. getAttribute
  5. innerText
  6. innerHTML

Steps

  1. First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA
  2. Set HTTP response to our HTML object.
  3. Get all the latest topics using getElementsByClass method
  4. Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.
  5. Enter all parsed values to sheet 1

Complete Code

Have a look at the code first.

Public Sub parsehtml()
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "https://news.ycombinator.com/", False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("athing")
i = 2
For Each topic In topics
Set titleElem = topic.getElementsByTagName("td")(2)
Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
i = i + 1
Next
End Sub

When I say topics, I mean posts.

Code Explanation

  1. First we define all the required objects like HTMLDocument, MSXML2.XMLHTTP
  2. Topics object to store topics which will be used to loop. I’ve defined few other objects like titleElem and detailsElem to make code readable.
  3. get homepage and set its HTML to our HTML object in line 7
  4. Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named athing so we can use getElementsByClassName method to get those elements.
  5. How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)
  6. In line 8, We get all the topics using getElementsByClassName method by specifying the class name athing. Next we loop through all the topics using a For loop
  7. All the topics are in a table element and each topic is a tr element (row) and topic’s details are in the next tr element. Each row has sub-parts : td elements which have the content like topic name, domain name, upvotes, username etc.

parse html in excel vba

  1. Topic title and domain are in the third td element so we get it using getElementsByTagName("td")(2) (Index starts at 0). Topic name and link is in another a element so we get it using getElementsByTagName("a")(0) and enter its values in sheet 1
  2. Topic details like upvotes and username are in the next element to topic element so we get it using NextSibling method. Get upvotes and username which are in span and a elements and enter in sheet 1.
  3. Integer i is used to store row number which starts at 2 and increments with every topic.

Wrapping up

HTML Elements can also be defined as HTMLBaseElement for auto completion.

There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.

Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement
Dim sXPathArray() As String

Dim sNodeName As String
Dim sNodeNameIndex As String
Dim sRestOfXPath As String
Dim lNodeIndex As Long
Dim lCount As Long

' Split the xpath statement
sXPathArray = Split(sXPath, "/")
sNodeNameIndex = sXPathArray(1)
If Not InStr(sNodeNameIndex, "[") > 0 Then
sNodeName = sNodeNameIndex
lNodeIndex = 1
Else
sXPathArray = Split(sNodeNameIndex, "[")
sNodeName = sXPathArray(0)
lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1))
End If
sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1))

Set getXPathElement = Nothing
For lCount = 0 To objElement.ChildNodes().Length - 1
If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then
If lNodeIndex = 1 Then
If sRestOfXPath = "" Then
Set getXPathElement = objElement.ChildNodes().item(lCount)
Else
Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount))
End If
End If
lNodeIndex = lNodeIndex - 1
End If
Next lCount
End Function

If you have any questions or feedback, comment below.

5 1 vote
Article Rating
Subscribe
Notify of
guest
44 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
shahin2137
3 years ago

Hi ranjith! Hope u r doing well. I got stuck writing a code to pull data from web using vba. I would like to scrape every docs within the newly produced links. It produces 20 links but when i run the code it gives me the docs of the first link and continue until 20 times, that means it repeats the action for a single link not for the 20 links. Hope to have ur kind reply soon. https://bin.codingislove.com/yehinaluki.vbs

shahin2137
3 years ago
Reply to  shahin2137

I suppose the error is occurring between the [if–end if] block. Ur sophisticated touch on my code will definitely brings the output i long for. By the way, the code is working but getting stuck somewhere that is because it provides a single result 20 times rather than producing 20 different results. Thanks in advance. Hope my question is clear to you.

shahin2137
3 years ago
Reply to  Ranjith kumar

Thanks for your concern. I’ve already found the solution i was looking for. Thanks once again.

Yasser
Yasser
3 years ago

Hello
Thanks a lot for this useful tutorial
Can you give example of how to use the UDF getXPathElement?

smash
smash
1 year ago
Reply to  Ranjith kumar

how is “html” defined? can you please paste a complete code for that? What is the output of the UDF? I expected a string…

thanks

Yasser
Yasser
3 years ago

Thanks a lot for swift reply
I tested it and it is OK .. but I can’t get the XPath from firefox for Hackernews
Can you snapshot how to get this XPath?

Yasser
Yasser
3 years ago

By the way. I tried to sign up the forum but I encountered error…
How can I sign up?

Yasser
Yasser
3 years ago

Thank you very much for great help and support
I have the latest version of firefox 55.0.3 and I have installed Firebug addon .. but couldn’t know how to copy the xPath .. Forgive me as I am a beginner at scraping issues

j
j
2 years ago

Hello, following the tutorial above — Excellent!

I have two challenges:

I could not get the loop to function in my case, and
How could I click the link to go to the detail page in XHR?

Any suggestions greatly appreciated.

J
J
2 years ago
Reply to  Ranjith kumar

Excellent. I tried to use both from an approved reply some had on Stack Exchange. Shall use HTTP requests. Thank you.

j
j
2 years ago

Excellent tutorial. How would one CLICK the individual

titleElem.getElementsByTagName(“a”)(0).href

to iterate through to each DETAIL page from the link

using http Request?

j
j
2 years ago

Got it, thank you!

j
j
2 years ago

The detail links in Hacker News are not dynamically created.

The same href in the LIST: https://news.ycombinator.com/
is the same URL when opened in a browser.

I am trying to goto a DETAIL href link that is dynamically created.
Where the href in the LISR is not the same as the URL when opened in a browser.

There is a RecordID that is in the Query String Parameters.

How would one capture a dynamically created RecordID to include in the GET string?

j
j
2 years ago

Hi Ranjith,

Any suggestions on the correct selector to return only the href?

Public Sub parsehtml()
‘Microsoft HTML Object Library

Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject(“MSXML2.XMLHTTP”)
http.Open “GET”, “https://www.google.com/search?&q=transamerica+san+francisco+ca”, False
http.send
html.body.innerHTML = http.responsetext
Debug.Print html.getElementsByClassName(“_ldf”)(0).innerHTML ‘ website
End Sub

j
j
2 years ago
Reply to  Ranjith kumar

Shall try. Thank you.

j
j
2 years ago

Would anyone know what the code for the “source” refers to and how it impacts XHR searches?

Can a generic “source” be created?

https://www.google.com/search?source=hp&ei=i5ggWtf4B8qP0wKl55aADw&q=time

Jon
Jon
2 years ago

I have been working on trying to open a new IE tab from a link.click on a webpage that executes javascript. I can accomplish it with IE.navigate

With IE.navigate, CLng(2048) it works great, notes below
https://msdn.microsoft.com/en-us/library/dd565688(VS.85).aspx

I have searched for two days and tried everything found, but nothing addresses how to
click on a javascript link and have it open in a new IE tab in an existing open IE browser.

Any suggestions greatly appreciated. I posted it on Stack Overflow:
https://stackoverflow.com/questions/48686537/how-to-click-javascript-link-using-clng2048-in-a-new-ie-tab

Thank you.

Scott
Scott
2 years ago

Hello. Can you help me? please.

I want to extract the bitcoin price from this site https://coinmarketcap.com/
The name and price is all i need, nothing else.
I tried some codes like yours but nothing works for me.

YasserKhalil
YasserKhalil
2 years ago
Reply to  Scott

Hello .. Try this code
Sub Test()
Dim html As Object
Dim tbl As Object
Dim tRow As Object
Dim tCel As Object
Dim x As Long
Dim y As Long

With CreateObject("MSxml2.XMLHTTP")
.Open "GET", "https://coinmarketcap.com", False
.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = .responseText
End With

For Each tbl In html.getElementsByTagName("table")
For Each tRow In tbl.getElementsByTagName("tr")
For Each tCel In tRow.getElementsByTagName("td")
y = y + 1
If y = 2 Or y = 4 Then
Cells(x + 1, IIf(y = 2, 1, 2)) = tCel.innerText
End If
Next tCel
y = 0
x = x + 1
Next tRow
Next tbl

End Sub

Scott
Scott
2 years ago
Reply to  YasserKhalil

yes!! it works, thank you very much

Skyscraper
Skyscraper
2 years ago

Hi Ranjith and others,
I’ve tried to run your complete code but it throws me an error at this line:
Set topics = html.getElementsByClassName(“athing”)
The error is:
Run-time error ‘438’:
Object doesn’t support this property or method
References to “Microsoft XML, v6.0” and “Microsoft HTML Object Library” are added.
Any ideas?

Skyscraper
Skyscraper
2 years ago
Reply to  Skyscraper

Strange enough but unticking MSXML reference solves the problem.

OmniJet
OmniJet
2 years ago

Hi Ranjith,
I am a very beginner in VBA and found your above code just fitting fine for the scraping I want to do.
Your sample is running unmodified, but when I just change line 5 to the website I am targetting to
http.Open “GET”, “http://ariva.de/daimler-aktie/”, False
the Run Time Error -2147024891 (800700005) Access dienied comes up.
The Debugger then highlights line 6.
Do you have any ideas what might be wrong with this URL?
Thank you in advance – rgds Friedrich

Dick
Dick
2 years ago

Hey Ranjith, Yasser and others,
How would I parse XML in Excel VBA on a website that requires me to login/authentication?
I have a subscription to Ancestry and want to scrape into excel a few records here:
https://search.ancestry.com.au/cgi-bin/sse.dll?db=FS1ScotlandBirthsandBaptisms&gss=angs-d&new=1&rank=1&msbdy=1900&msbdy_x=1&msbdp=10&MSAV=1&uidh=533&gl=&gst=&hc=10
Tried a few things but have had no success.
Please help me.

Reanoe
Reanoe
1 year ago
Reply to  Dick

I have same issue. Need to get some data from page that requires login. Is that possible with HTTP request or we have do to it with IE automation?

Eckhard from Germany
Eckhard from Germany
2 years ago

I need you help!!!

Hello,
thanks for the code!
I have a strange problem. I get weather data. The actual website updates every minute.
The script only brought me the latest data when I restart Excel. Otherwise, the Marco brings me only the old data again. No idea where he got it from. From the Cash de Explores? How can I prevent this?

Thanks!

[email protected]

Tanny
Tanny
2 years ago

I have been trying to follow your website and extract data from an HTML file and I was able to do so on my system. BUT when I try to use the same code/module to extract data from the html file from my Friend’s system it shows me an “error 91: object variable or with block not set” at line 2 and sometime even at line 1. Basically it is not reading the html document on a different system. The HTMLButton1 and HTMLButton2 when tried in debug.print show nothing

Dim file As String
Dim IE As New SHDocVw.InternetExplorerMedium
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLButton1
Dim HTMLButton2

file = “C:\temp\HTML\page1.html”

IE.Visible = False
IE.navigate file ‘line 1
Do While IE.Busy = True
Loop

Set HTMLDoc = IE.Document
Set HTMLButton1 = HTMLDoc.elementFromPoint(338, 462)
Set HTMLButton2 = HTMLDoc.elementFromPoint(314, 490)

array_objects(index).date = HTMLButton1.innerText ‘line2
array_objects(index).protocol = HTMLButton2.innerText

Vladimir
Vladimir
1 year ago

Hi,
Set topics = html.getElementsByClassName(“athing”)
Run-time error ‘438’:
Object doesn’t support this property or method

Unticking MSXML reference does not solve the problem.
I’m stuck. 😦

Dhilshas
Dhilshas
1 year ago

Hi..Would like to know how to extract url from this part of html code

https://www.extremetech.com › Computing

As im a beginner dont know how to get the url using getelements in VBA. This is the code that i tried.
Set a = html.getElementsByClassName(“Tbwupd”)
i = 2
For Each b In a
Set titleElem = b.getElementsByTagName(“Cite”)
Sheets(1).Cells(i, 1).Value = b.getElementsByTagName(“iUH30”)
i = i + 1
Next

Nephi Lewis
Nephi Lewis
1 year ago

Your code has been very helpful…but I’m trying to read a list of “classes”, each with some unique text. I’m using your structure with some changes. All of the classes are found in a html structure like this:

<

table border=”0″ cellpadding=”2″ cellspacing=”1″>

ChooseName

AFC201C HGS CML COLL M6700 PROD (000100125031 / 0662 / 0670)

BP790C CSeries Collateral Labor (000100125900 / 0002 / 0220)

C6AXIY19 AXIS CJ1 Tech Spt (000100134129 / 0012 / 0026)

C6BOEY18 ISS2100 Program Support (000100131490 / 0066 / 0316)

C6FSIY18 FSI CL604 DTT Dev PM (000100130099 / 0032 / 0052)

I’m trying to read this like this, but get an object not defined error on the loop.

Public Sub getNWAsFromhtml()

Dim http As Object
Dim html As New HTMLDocument
Dim NWAsOnPage As Object
Dim NWA_ElemW As Object
Dim NWA_ElemG As Object

Dim i As Integer
Dim j As Integer

Set http = CreateObject(“MSXML2.XMLHTTP”)

‘http.Open “GET”, “https://news.ycombinator.com/”, False
http.Open “GET”, “http://win.rockwellcollins.com/TimeEntry/TEWorkList.aspx”, False
http.send
html.body.innerHTML = http.responseText

‘Set topics = html.getElementsByClassName(“athing”)
Set NWAsOnPage = html.getElementById(“ctl00_ctl00_MainContent_cph_ctrlWorklist”)

i = 2
j = i + 1

For Each nwa In NWAsOnPage

Set NWA_ElemW = nwa.getElementByClassName("item")
Set NWA_ElemG = nwa.getElementByClassName("bgGreyItem")
Sheets(1).Cells(i, 2).Value = NWA_ElemW.getElementsByTagName("td")(1).innerText
Sheets(1).Cells(j, 2).Value = NWA_ElemG.getElementsByTagName("td")(1).innerText

i = i + 2
j = j + 2

Next

End Sub

Any suggestions will be greatly appreciated!

Yvonne
Yvonne
1 year ago

Very handy, thanks!
I am using it to summarize hundreds of model result files which are in HTML format.

Doug Belling
Doug Belling
1 year ago

Getting a Run-time Error 91: Object block variable not set.
Is this in connection to a missing Reference Library.
No one else in this thread seemed to have the issue.

Doug Belling
Doug Belling
1 year ago
Reply to  Doug Belling

Sorry – Line 7 html.body.innerHTML = http.responseText

Alfred Vachris
8 months ago

Nice Work!

Michel Levesque
4 months ago

Your code is helpful but I’m stuck. I want to extract several informations (ie tasks) from http://www.capsule.com). I’ve put all information inside the following code:
http.Open “GET”, “https://cth.capsulecrm.com/”, False
http.send
html.body.innerHTML = http.responseText
‘Set topics = html.getElementsByClassName(“athing”)
”Set topics = html.getElementsByClassName(“application-outlet”)
”Set topics = html.getElementsByClassName(“panel”) ‘Where all the tasks are located
”Set topics = html.getElementsByClassName(“panel__header”) ‘not useful, could be panel__header too (typo)
”Set topics = html.getElementsByClassName(“grouped-tasks”)
”Set topics = html.getElementsByClassName(“task-group is-overdue”)
”Set topics = html.getElementsByClassName(“task-group is-today”)
Set topics = html.getElementsByClassName(“task-group is-next-7-days”)

''-= The following is part of every "task-group is-..." above =-
''Set topics = html.getElementsByClassName("task-group-items")
''Set topics = html.getElementsByClassName("general-task-item")
''Set topics = html.getElementsByClassName("general-task-item-input") 'Check button
''Set topics = html.getElementsByClassName("general-task-item-container")
''Set topics = html.getElementsByClassName("general-task-item-title")
''Set topics = html.getElementsByClassName("general-task-item-category") 'Email, Follow-up - span class
''Set topics = html.getElementsByClassName("hyperlink-button general-task-item-title-text") 'ML - suivi projets - button class
''
''Set topics = html.getElementsByClassName("general-task-item-subtitle") 'date for Who
''Set topics = html.getElementsByClassName("task-date-time ember-view") 'date
''Set topics = html.getElementsByClassName("ember-view") 'Who
i = 2
For Each topic In topics
'Set titleElem = topic.getElementsByTagName("td")(2)
'Sheets(3).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
'Sheets(3).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set titleElem = topic.getElementsByTagName("general-task-item")(2)
Sheets(3).Cells(i, 1).Value = titleElem.getElementsByTagName("general-task-item-category")(0).innerText
Sheets(3).Cells(i, 2).Value = titleElem.getElementsByTagName("general-task-item-sub-title")(0).innerText
'Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
'Sheets(3).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
'Sheets(3).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
Set detailsElem = topic.NextSibling.getElementsByTagName("general-task-item")(1)
Sheets(3).Cells(i, 3).Value = detailsElem.getElementsByTagName("general-task-item-category")(0).innerText
Sheets(3).Cells(i, 4).Value = detailsElem.getElementsByTagName("general-task-item-sub-title")(0).innerText