Why parse HTML in Excel VBA? There may be different cases where we need to parse HTML in Excel. Few cases are generating multiple HTML files based on excel data, editing multiple HTML files, scraping some data etc.
I’m using Hacker News homepage for this example where we parse all the posts from homepage. Of course, Hacker News has its own API which can be used to pull latest topics but this example is just to learn how to parse HTML.
Why Hacker News? Because everyone knows Hacker News!
Final output looks like image below.
Getting started
- Microsoft HTML object library is used in parsing HTML.
- Open script editor in excel (alt + f11) and add a reference to Microsoft HTML object library (Tools > references > select)
A few basics first and then dive into code!
Defining and setting HTML
HTML object can be defined using :
Dim html As New HTMLDocument
HTML can be set to this object using this syntax :
html.body.innerHTML = htmlstring
Useful methods and properties
There are many methods and properties of HTML object and elements. You can have a look at all the methods using autocomplete but most useful methods are properties are as follows:
getElementsByTagName
getElementsByClassName
getElementById
getAttribute
innerText
innerHTML
Steps
- First, we pull Hacker News homepage by making a basic HTTP GET request. Read more about HTTP requests here – Http requests in Excel VBA
- Set HTTP response to our HTML object.
- Get all the latest topics using
getElementsByClass
method - Loop through each topic, parse each topic’s title, link, upvotes and username using different methods.
- Enter all parsed values to sheet 1
Complete Code
Have a look at the code first.
Public Sub parsehtml() Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement Dim i As Integer Set http = CreateObject("MSXML2.XMLHTTP") http.Open "GET", "https://news.ycombinator.com/", False http.send html.body.innerHTML = http.responseText Set topics = html.getElementsByClassName("athing") i = 2 For Each topic In topics Set titleElem = topic.getElementsByTagName("td")(2) Sheets(1).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText Sheets(1).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1) Sheets(1).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText Sheets(1).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText i = i + 1 Next End Sub
When I say topics, I mean posts.
Code Explanation
- First we define all the required objects like
HTMLDocument
,MSXML2.XMLHTTP
- Topics object to store topics which will be used to loop. I’ve defined few other objects like
titleElem
anddetailsElem
to make code readable. - get homepage and set its HTML to our HTML object in line 7
- Now we have to get all the topic elements. We have to identify those elements to get them. In this htmlpage, all topic elements have a class named
athing
so we can usegetElementsByClassName
method to get those elements. - How to identify elements? By viewing page source or inspecting element in chrome (right click on element > inspect)
- In line 8, We get all the topics using
getElementsByClassName
method by specifying the class nameathing
. Next we loop through all the topics using aFor
loop - All the topics are in a
table
element and each topic is atr
element (row) and topic’s details are in the nexttr
element. Each row has sub-parts :td
elements which have the content like topic name, domain name, upvotes, username etc.
- Topic title and domain are in the third
td
element so we get it usinggetElementsByTagName("td")(2)
(Index starts at 0). Topic name and link is in anothera
element so we get it usinggetElementsByTagName("a")(0)
and enter its values in sheet 1 - Topic details like upvotes and username are in the next element to topic element so we get it using
NextSibling
method. Get upvotes and username which are inspan
anda
elements and enter in sheet 1. - Integer
i
is used to store row number which starts at 2 and increments with every topic.
Wrapping up
HTML Elements can also be defined as HTMLBaseElement
for auto completion.
There are many ways of identifying an element in HTML. Using ID, Class name, Tag name etc. XPath can also be using to identify element but VBA doesn’t have built-in support for XPath. Here’s a custom function to identify elements using XPath.
Public Function getXPathElement(sXPath As String, objElement As Object) As HTMLBaseElement Dim sXPathArray() As String Dim sNodeName As String Dim sNodeNameIndex As String Dim sRestOfXPath As String Dim lNodeIndex As Long Dim lCount As Long ' Split the xpath statement sXPathArray = Split(sXPath, "/") sNodeNameIndex = sXPathArray(1) If Not InStr(sNodeNameIndex, "[") > 0 Then sNodeName = sNodeNameIndex lNodeIndex = 1 Else sXPathArray = Split(sNodeNameIndex, "[") sNodeName = sXPathArray(0) lNodeIndex = CLng(Left(sXPathArray(1), Len(sXPathArray(1)) - 1)) End If sRestOfXPath = Right(sXPath, Len(sXPath) - (Len(sNodeNameIndex) + 1)) Set getXPathElement = Nothing For lCount = 0 To objElement.ChildNodes().Length - 1 If UCase(objElement.ChildNodes().item(lCount).nodeName) = UCase(sNodeName) Then If lNodeIndex = 1 Then If sRestOfXPath = "" Then Set getXPathElement = objElement.ChildNodes().item(lCount) Else Set getXPathElement = getXPathElement(sRestOfXPath, objElement.ChildNodes().item(lCount)) End If End If lNodeIndex = lNodeIndex - 1 End If Next lCount End Function
If you have any questions or feedback, comment below.
- Flutter lookup failed in @fields error (solved) - July 14, 2023
- Free open source alternative to Notion along with AI - July 13, 2023
- Threads API for developers for programmatic access - July 12, 2023
Hi ranjith! Hope u r doing well. I got stuck writing a code to pull data from web using vba. I would like to scrape every docs within the newly produced links. It produces 20 links but when i run the code it gives me the docs of the first link and continue until 20 times, that means it repeats the action for a single link not for the 20 links. Hope to have ur kind reply soon. https://bin.codingislove.com/yehinaluki.vbs
I suppose the error is occurring between the [if–end if] block. Ur sophisticated touch on my code will definitely brings the output i long for. By the way, the code is working but getting stuck somewhere that is because it provides a single result 20 times rather than producing 20 different results. Thanks in advance. Hope my question is clear to you.
Please re-post it on forum. Simple queries can be answered in comments. Code related queries can be handled better in forum.
Thanks for your concern. I’ve already found the solution i was looking for. Thanks once again.
Hello
Thanks a lot for this useful tutorial
Can you give example of how to use the UDF getXPathElement?
Hi Yasser. Here’s an example of getXpathElement UDF
The above code returns the first link in Hackernews.
Use Firefox to get full XPath like above. Chrome might give short Xpaths with Id or class name which won’t work with this UDF.
Also, Read List of all Xpath selectors
how is “html” defined? can you please paste a complete code for that? What is the output of the UDF? I expected a string…
thanks
Thanks a lot for swift reply
I tested it and it is OK .. but I can’t get the XPath from firefox for Hackernews
Can you snapshot how to get this XPath?
Install Firebug for now. Firefox 55 will have built-in feature to copy xpath.
Copy Xpath in Firebug – right click on an element in firebug panel and copy XPath.
CodingisLove forum is currently down. It will be up and running in few days.
By the way. I tried to sign up the forum but I encountered error…
How can I sign up?
Thank you very much for great help and support
I have the latest version of firefox 55.0.3 and I have installed Firebug addon .. but couldn’t know how to copy the xPath .. Forgive me as I am a beginner at scraping issues
Hello, following the tutorial above — Excellent!
I have two challenges:
I could not get the loop to function in my case, and
How could I click the link to go to the detail page in XHR?
Any suggestions greatly appreciated.
Hi, You cannot interact with a page using XHR. Understand the difference between HTTP requests and browser automation with IE. Use one of them. Don’t use both. Did you see any browser interaction in the above example? No right! because they are just HTTP requests. Use this method or automate completely using IE or selenium VBA. Don’t mix and match.
BTW forum is UP now!
Excellent. I tried to use both from an approved reply some had on Stack Exchange. Shall use HTTP requests. Thank you.
Excellent tutorial. How would one CLICK the individual
titleElem.getElementsByTagName(“a”)(0).href
to iterate through to each DETAIL page from the link
using http Request?
You cannot click a link but instead, you should make another GET request to the detailed link –
detailUrl = titleElem.getElementsByTagName("a")(0).href
Got it, thank you!
The detail links in Hacker News are not dynamically created.
The same href in the LIST: https://news.ycombinator.com/
is the same URL when opened in a browser.
I am trying to goto a DETAIL href link that is dynamically created.
Where the href in the LISR is not the same as the URL when opened in a browser.
There is a RecordID that is in the Query String Parameters.
How would one capture a dynamically created RecordID to include in the GET string?
Hi Ranjith,
Any suggestions on the correct selector to return only the href?
Public Sub parsehtml()
‘Microsoft HTML Object Library
Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
Dim i As Integer
Set http = CreateObject(“MSXML2.XMLHTTP”)
http.Open “GET”, “https://www.google.com/search?&q=transamerica+san+francisco+ca”, False
http.send
html.body.innerHTML = http.responsetext
Debug.Print html.getElementsByClassName(“_ldf”)(0).innerHTML ‘ website
End Sub
You can use .href property as explained in the example. Line 12 in the complete code above.
Shall try. Thank you.
Would anyone know what the code for the “source” refers to and how it impacts XHR searches?
Can a generic “source” be created?
https://www.google.com/search?source=hp&ei=i5ggWtf4B8qP0wKl55aADw&q=time
Hi J, Don’t try to scrape google using VBA. They will ban your IP address soon.
I have been working on trying to open a new IE tab from a link.click on a webpage that executes javascript. I can accomplish it with IE.navigate
With IE.navigate, CLng(2048) it works great, notes below
https://msdn.microsoft.com/en-us/library/dd565688(VS.85).aspx
I have searched for two days and tried everything found, but nothing addresses how to
click on a javascript link and have it open in a new IE tab in an existing open IE browser.
Any suggestions greatly appreciated. I posted it on Stack Overflow:
https://stackoverflow.com/questions/48686537/how-to-click-javascript-link-using-clng2048-in-a-new-ie-tab
Thank you.
Hello. Can you help me? please.
I want to extract the bitcoin price from this site https://coinmarketcap.com/
The name and price is all i need, nothing else.
I tried some codes like yours but nothing works for me.
Hello .. Try this code
Sub Test()
Dim html As Object
Dim tbl As Object
Dim tRow As Object
Dim tCel As Object
Dim x As Long
Dim y As Long
With CreateObject("MSxml2.XMLHTTP")
.Open "GET", "https://coinmarketcap.com", False
.send
Set html = CreateObject("htmlfile")
html.body.innerHTML = .responseText
End With
For Each tbl In html.getElementsByTagName("table")
For Each tRow In tbl.getElementsByTagName("tr")
For Each tCel In tRow.getElementsByTagName("td")
y = y + 1
If y = 2 Or y = 4 Then
Cells(x + 1, IIf(y = 2, 1, 2)) = tCel.innerText
End If
Next tCel
y = 0
x = x + 1
Next tRow
Next tbl
End Sub
yes!! it works, thank you very much
Thanks Yaseer.
i slightly changed your code and it worked perfectly for me to pull 4 column values.. below is the updated code of yours..
Sub BitcoinPrice()
Dim html As Object
Dim tbl As Object
Dim tRow As Object
Dim tCel As Object
Dim x As Long
Dim y As Long
Set ws = Sheet4
With CreateObject(“MSxml2.XMLHTTP”)
.Open “GET”, “https://coinmarketcap.com”, False
.Send
Set html = CreateObject(“htmlfile”)
html.body.innerHTML = .ResponseText
End With
For Each tbl In html.getElementsByTagName(“table”)
For Each tRow In tbl.getElementsByTagName(“tr”)
For Each tCel In tRow.getElementsByTagName(“td”)
y = y + 1
Debug.Print y
If y = 2 Or y = 3 Or y = 4 Or y = 5 Then
ws.Cells(x + 1, IIf(y = 2, 1, IIf(y = 3, 2, IIf(y = 4, 3, IIf(y = 5, 4, 5))))) = tCel.innerText
Debug.Print tCel.innerText
End If
Next tCel
y = 0
x = x + 1
Next tRow
Next tbl
End Sub
Regards,
Arun
Hi Ranjith and others,
I’ve tried to run your complete code but it throws me an error at this line:
Set topics = html.getElementsByClassName(“athing”)
The error is:
Run-time error ‘438’:
Object doesn’t support this property or method
References to “Microsoft XML, v6.0” and “Microsoft HTML Object Library” are added.
Any ideas?
Strange enough but unticking MSXML reference solves the problem.
Hi Ranjith,
I am a very beginner in VBA and found your above code just fitting fine for the scraping I want to do.
Your sample is running unmodified, but when I just change line 5 to the website I am targetting to
http.Open “GET”, “http://ariva.de/daimler-aktie/”, False
the Run Time Error -2147024891 (800700005) Access dienied comes up.
The Debugger then highlights line 6.
Do you have any ideas what might be wrong with this URL?
Thank you in advance – rgds Friedrich
Hey Ranjith, Yasser and others,
How would I parse XML in Excel VBA on a website that requires me to login/authentication?
I have a subscription to Ancestry and want to scrape into excel a few records here:
https://search.ancestry.com.au/cgi-bin/sse.dll?db=FS1ScotlandBirthsandBaptisms&gss=angs-d&new=1&rank=1&msbdy=1900&msbdy_x=1&msbdp=10&MSAV=1&uidh=533&gl=&gst=&hc=10
Tried a few things but have had no success.
Please help me.
I have same issue. Need to get some data from page that requires login. Is that possible with HTTP request or we have do to it with IE automation?
I need you help!!!
Hello,
thanks for the code!
I have a strange problem. I get weather data. The actual website updates every minute.
The script only brought me the latest data when I restart Excel. Otherwise, the Marco brings me only the old data again. No idea where he got it from. From the Cash de Explores? How can I prevent this?
Thanks!
[email protected]
I have been trying to follow your website and extract data from an HTML file and I was able to do so on my system. BUT when I try to use the same code/module to extract data from the html file from my Friend’s system it shows me an “error 91: object variable or with block not set” at line 2 and sometime even at line 1. Basically it is not reading the html document on a different system. The HTMLButton1 and HTMLButton2 when tried in debug.print show nothing
Dim file As String
Dim IE As New SHDocVw.InternetExplorerMedium
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLButton1
Dim HTMLButton2
file = “C:\temp\HTML\page1.html”
IE.Visible = False
IE.navigate file ‘line 1
Do While IE.Busy = True
Loop
Set HTMLDoc = IE.Document
Set HTMLButton1 = HTMLDoc.elementFromPoint(338, 462)
Set HTMLButton2 = HTMLDoc.elementFromPoint(314, 490)
array_objects(index).date = HTMLButton1.innerText ‘line2
array_objects(index).protocol = HTMLButton2.innerText
Hi,
Set topics = html.getElementsByClassName(“athing”)
Run-time error ‘438’:
Object doesn’t support this property or method
Unticking MSXML reference does not solve the problem.
I’m stuck. 😦
Hi..Would like to know how to extract url from this part of html code
https://www.extremetech.com › Computing
As im a beginner dont know how to get the url using getelements in VBA. This is the code that i tried.
Set a = html.getElementsByClassName(“Tbwupd”)
i = 2
For Each b In a
Set titleElem = b.getElementsByTagName(“Cite”)
Sheets(1).Cells(i, 1).Value = b.getElementsByTagName(“iUH30”)
i = i + 1
Next
Your code has been very helpful…but I’m trying to read a list of “classes”, each with some unique text. I’m using your structure with some changes. All of the classes are found in a html structure like this:
<
table border=”0″ cellpadding=”2″ cellspacing=”1″>
ChooseName
AFC201C HGS CML COLL M6700 PROD (000100125031 / 0662 / 0670)
BP790C CSeries Collateral Labor (000100125900 / 0002 / 0220)
C6AXIY19 AXIS CJ1 Tech Spt (000100134129 / 0012 / 0026)
C6BOEY18 ISS2100 Program Support (000100131490 / 0066 / 0316)
C6FSIY18 FSI CL604 DTT Dev PM (000100130099 / 0032 / 0052)
…
I’m trying to read this like this, but get an object not defined error on the loop.
Public Sub getNWAsFromhtml()
Dim http As Object
Dim html As New HTMLDocument
Dim NWAsOnPage As Object
Dim NWA_ElemW As Object
Dim NWA_ElemG As Object
Dim i As Integer
Dim j As Integer
Set http = CreateObject(“MSXML2.XMLHTTP”)
‘http.Open “GET”, “https://news.ycombinator.com/”, False
http.Open “GET”, “http://win.rockwellcollins.com/TimeEntry/TEWorkList.aspx”, False
http.send
html.body.innerHTML = http.responseText
‘Set topics = html.getElementsByClassName(“athing”)
Set NWAsOnPage = html.getElementById(“ctl00_ctl00_MainContent_cph_ctrlWorklist”)
i = 2
j = i + 1
For Each nwa In NWAsOnPage
Set NWA_ElemW = nwa.getElementByClassName("item")
Set NWA_ElemG = nwa.getElementByClassName("bgGreyItem")
Sheets(1).Cells(i, 2).Value = NWA_ElemW.getElementsByTagName("td")(1).innerText
Sheets(1).Cells(j, 2).Value = NWA_ElemG.getElementsByTagName("td")(1).innerText
i = i + 2
j = j + 2
Next
End Sub
Any suggestions will be greatly appreciated!
Very handy, thanks!
I am using it to summarize hundreds of model result files which are in HTML format.
Glad it helped 🙂
Getting a Run-time Error 91: Object block variable not set.
Is this in connection to a missing Reference Library.
No one else in this thread seemed to have the issue.
Sorry – Line 7 html.body.innerHTML = http.responseText
Nice Work!
Your code is helpful but I’m stuck. I want to extract several informations (ie tasks) from http://www.capsule.com). I’ve put all information inside the following code:
http.Open “GET”, “https://cth.capsulecrm.com/”, False
http.send
html.body.innerHTML = http.responseText
‘Set topics = html.getElementsByClassName(“athing”)
”Set topics = html.getElementsByClassName(“application-outlet”)
”Set topics = html.getElementsByClassName(“panel”) ‘Where all the tasks are located
”Set topics = html.getElementsByClassName(“panel__header”) ‘not useful, could be panel__header too (typo)
”Set topics = html.getElementsByClassName(“grouped-tasks”)
”Set topics = html.getElementsByClassName(“task-group is-overdue”)
”Set topics = html.getElementsByClassName(“task-group is-today”)
Set topics = html.getElementsByClassName(“task-group is-next-7-days”)
''-= The following is part of every "task-group is-..." above =-
''Set topics = html.getElementsByClassName("task-group-items")
''Set topics = html.getElementsByClassName("general-task-item")
''Set topics = html.getElementsByClassName("general-task-item-input") 'Check button
''Set topics = html.getElementsByClassName("general-task-item-container")
''Set topics = html.getElementsByClassName("general-task-item-title")
''Set topics = html.getElementsByClassName("general-task-item-category") 'Email, Follow-up - span class
''Set topics = html.getElementsByClassName("hyperlink-button general-task-item-title-text") 'ML - suivi projets - button class
''
''Set topics = html.getElementsByClassName("general-task-item-subtitle") 'date for Who
''Set topics = html.getElementsByClassName("task-date-time ember-view") 'date
''Set topics = html.getElementsByClassName("ember-view") 'Who
i = 2
For Each topic In topics
'Set titleElem = topic.getElementsByTagName("td")(2)
'Sheets(3).Cells(i, 1).Value = titleElem.getElementsByTagName("a")(0).innerText
'Sheets(3).Cells(i, 2).Value = titleElem.getElementsByTagName("a")(0).href
Set titleElem = topic.getElementsByTagName("general-task-item")(2)
Sheets(3).Cells(i, 1).Value = titleElem.getElementsByTagName("general-task-item-category")(0).innerText
Sheets(3).Cells(i, 2).Value = titleElem.getElementsByTagName("general-task-item-sub-title")(0).innerText
'Set detailsElem = topic.NextSibling.getElementsByTagName("td")(1)
'Sheets(3).Cells(i, 3).Value = detailsElem.getElementsByTagName("span")(0).innerText
'Sheets(3).Cells(i, 4).Value = detailsElem.getElementsByTagName("a")(0).innerText
Set detailsElem = topic.NextSibling.getElementsByTagName("general-task-item")(1)
Sheets(3).Cells(i, 3).Value = detailsElem.getElementsByTagName("general-task-item-category")(0).innerText
Sheets(3).Cells(i, 4).Value = detailsElem.getElementsByTagName("general-task-item-sub-title")(0).innerText
hi, I am new ,knows some about vba intrested in excel vba web scrapping and creating userforms with vba needs help on how to open Whatsapp window in excel userform with selenium chrome driver and copy all whatsapp chats in my excel sheet