I’ve seen quite a lot of requests lately from people wanting to know how to extract text from web pages.
Macro Scheduler’s optional WebRecorder add-on simplifies the automation of web pages and includes functions for extracting tables, text or HTML from web page elements. WebRecorder’s Tag Extraction wizard makes it easy to create the code.
Sometimes you can choose a specific HTML element and identify it uniquely via it’s ID or NAME attribute. But other times you might want all the text from the whole page, or you may need to extract the entire page and then parse out the bits you’re interested in using RegEx or some other string manipulation functions.
To extract an entire page I specify the BODY element. If you want to extract data from web pages it does help if you know a little about HTML. And if you do you’ll know that each page has just one BODY element which contains the code making up the visible portion of the page.
Here’s code produced using WebRecorder when navigating to mjtnet.com and using the Tag Extraction wizard to extract the BODY text:
IE_Create>0,IE[0]
IE_Navigate>%IE[0]%,http://www.mjtnet.com/,r
IE_Wait>%IE[0]%,r
Wait>delay
//Modify buffer size if required (you may get a crash if buffer size too small for data) ...
Let>BODY0_SIZE=9999
IE_ExtractTag>%IE[0]%,,BODY,0,0,BODY0,r
MidStr>r_6,1,r,BODY0
MessageModal>BODY0
The macro simply displays just the text in a message box but could be set to pull out the full HTML. You could then parse it with RegEx to get the information you are interested in.
You will need WebRecorder installed for the above to work.
If you don’t have WebRecorder you can do the same with a bit more work using VBScript. Some library functions for doing this can be found here and here.
So here’s the equivalent in VBScript:
VBSTART
Dim IE
'Creates IE instance
Sub CreateIE
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible=1
End Sub
'Navigate to an IE instance
Sub Navigate(URL)
IE.Navigate URL
do while IE.Busy
loop
End Sub
'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
dim t
set t = IE.document.getElementsbyTagname(Tagname)
if all=1 then
ExtractTag = t.Item(Num).outerHTML
else
ExtractTag = t.Item(Num).innerText
end if
End Function
VBEND
VBRun>CreateIE
VBRun>Navigate,www.mjtnet.com
VBEval>ExtractTag("BODY",0,0),BodyText
MessageModal>BodyText
But what if you already have a macro which already opens IE, or works against an already open instance of IE? The above macros need to create the IE instance before they can access them and extract data from them. You may have a macro that already starts IE some other way – maybe just by using a RunProgram or ExecuteFile call, or indirectly via some other application. Many times people tackle the extraction of data from such an IE window by sending keystrokes to do a Select-All, Edit/Copy and then use GetClipboard; or even File/Save As to save the HTML to a file. This of course adds time and can be unreliable. So how else can we do it?
Well, this tip shows us a function we can use to attach to an existing IE instance. So let’s use that and then use our ExtractTag function to pull out the BODY HTML:
VBSTART
Dim IE
' Attaches to an already running IE instance with given URL
Sub GetIE(URL)
Dim objInstances, objIE
Set objInstances = CreateObject("Shell.Application").windows
If objInstances.Count > 0 Then '/// make sure we have instances open.
For Each objIE In objInstances
If InStr(objIE.LocationURL,URL) > 0 then
Set IE = objIE
End if
Next
End if
End Sub
'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
dim t
set t = IE.document.getElementsbyTagname(Tagname)
if all=1 then
ExtractTag = t.Item(Num).outerHTML
else
ExtractTag = t.Item(Num).innerText
end if
End Function
VBEND
VBRun>GetIE,www.mjtnet.com
VBEval>ExtractTag("BODY",0,1),BodyHTML
MessageModal>BodyHTML
This snippet assumes a copy of IE is already open and pointing to www.mjtnet.com. The GetIE call creates a link to that IE window and then we use the ExtractTag function to pull out the HTML of the BODY element.
These examples use the BODY element, which will contain everything displayed on the page. As I mentioned before you can be more specific and specify some other element, and with WebRecorder, or a modified version of the ExtractTag VBScript function use other attributes to identify the element (the existing VBScript ExtractTag function shown above just uses the numeric index). WebRecorder tries to make it simple by giving you a point and click wizard, making some assumptions for you, so that you need not fully understand the HTML of the page. But it still helps you understand HTML. Looking at the source of the page you should be able to identify the element you need to extract from. And whether you extract directly from that or extract the BODY and then use RegEx being prepared to delve into the HTML source is going to get you further.
UPDATE: 19th January 2012
As of version 13.0.06 Macro Scheduler now includes a function called IEGetTags. For a given tag type and IE tab this will retrieve an array of tag contents. It can extract just the text, or html of the tags. This example extracts the inner HTML of all DIV elements in the open IE document currently at www.mjtnet.com:
IEGetTags>mjtnet.com,DIV,H,divArr
You can then cycle through each one with a Repeat Until
If>divArr_count>0
Let>k=0
Repeat>k
Let>k=k+1
Let>this_div_html=divArr_%k%
..
.. do something with it
.. e.g. use RegEx or substring searching to determine
.. if this is the DIV you want and extract from it
..
Until>k=divArr_count
Endif
To further identify the tag you are interested in, or find the data you want, you can use RegEx, EasyPatterns, or string functions.
Macro Scheduler 13.0.06 and above also has a function called IETagEvent which will let you simulate a Click on a given tag, focus it, or modify its value. So once you have identified a tag using IEGetTags and your Repeat/Until loop you can click on it, focus it or modify its value (e.g. for form fields).