I’ve seen quite a lot of requests lately from people wanting to know how to extract text from web pages.
Macro Scheduler’s optional WebRecorder add-on simplifies the automation of web pages and includes functions for extracting tables, text or HTML from web page elements. WebRecorder’s Tag Extraction wizard makes it easy to create the code.
Sometimes you can choose a specific HTML element and identify it uniquely via its ID or NAME attribute. But other times you might want all the text from the whole page, or you may need to extract the entire page and then parse out the bits you’re interested in using RegEx or some other string manipulation functions.
To extract an entire page I specify the BODY element. If you want to extract data from web pages it does help if you know a little about HTML. And if you do you’ll know that each page has just one BODY element which contains the code making up the visible portion of the page.
Here’s code produced using WebRecorder when navigating to mjtnet.com and using the Tag Extraction wizard to extract the BODY text:
IE_Create>0,IE
IE_Navigate>%IE%,http://www.mjtnet.com/,r
IE_Wait>%IE%,r
Wait>delay
//Modify buffer size if required (you may get a crash if buffer size too small for data) ...
Let>BODY0_SIZE=9999
IE_ExtractTag>%IE%,,BODY,0,0,BODY0,r
MidStr>r_6,1,r,BODY0
MessageModal>BODY0
The macro simply displays the text in a message box, but it could be set to pull out the full HTML instead. You could then parse it with RegEx to get the information you are interested in.
You will need WebRecorder installed for the above to work.
If you don’t have WebRecorder, here’s the equivalent in VBScript:
VBSTART
Dim IE

'Creates IE instance
Sub CreateIE
  Set IE = CreateObject("InternetExplorer.Application")
  IE.Visible=1
End Sub

'Navigate to an IE instance
Sub Navigate(URL)
  IE.Navigate URL
  do while IE.Busy
  loop
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>CreateIE
VBRun>Navigate,www.mjtnet.com
VBEval>ExtractTag("BODY",0,0),BodyText
MessageModal>BodyText
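Since the extracted text usually needs parsing, here is a minimal sketch of doing that with VBScript's built-in RegExp object. FirstMatch is a hypothetical helper name, not part of Macro Scheduler or WebRecorder, and the four-digit pattern is only an illustration. It assumes the function is added inside the same VBSTART/VBEND block as the ExtractTag function above.

```vbscript
'Hypothetical helper: returns the first match of Pattern in Text,
'or an empty string if there is no match
Function FirstMatch(Text, Pattern)
  Dim re, matches
  Set re = New RegExp
  re.Pattern = Pattern
  re.IgnoreCase = True
  re.Global = False
  Set matches = re.Execute(Text)
  If matches.Count > 0 Then
    FirstMatch = matches(0).Value
  Else
    FirstMatch = ""
  End If
End Function
```

Called from the macro, it might look like this (the result variable name is arbitrary):

```vbscript
VBEval>FirstMatch(ExtractTag("BODY",0,0),"\d{4}"),firstNumber
MessageModal>firstNumber
```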
But what if you have a macro which already opens IE, or works against an already open instance of IE? The above macros need to create the IE instance before they can access it and extract data from it. You may have a macro that starts IE some other way, maybe just by using a RunProgram or ExecuteFile call, or indirectly via some other application. Many times people tackle the extraction of data from such an IE window by sending keystrokes to do a Select-All and Edit/Copy and then using GetClipboard, or even File/Save As to save the HTML to a file. This of course adds time and can be unreliable. So how else can we do it?
Well, this tip shows us a function we can use to attach to an existing IE instance. So let’s use that and then use our ExtractTag function to pull out the BODY HTML:
VBSTART
Dim IE

' Attaches to an already running IE instance with given URL
Sub GetIE(URL)
  Dim objInstances, objIE
  Set objInstances = CreateObject("Shell.Application").windows
  If objInstances.Count > 0 Then '/// make sure we have instances open.
    For Each objIE In objInstances
      If InStr(objIE.LocationURL,URL) > 0 then
        Set IE = objIE
      End if
    Next
  End if
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>GetIE,www.mjtnet.com
VBEval>ExtractTag("BODY",0,1),BodyHTML
MessageModal>BodyHTML
This snippet assumes a copy of IE is already open and pointing to www.mjtnet.com. The GetIE call creates a link to that IE window and then we use the ExtractTag function to pull out the HTML of the BODY element.
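If no matching IE window is open, the IE variable is never set and the ExtractTag call will fail. One way to guard against that is sketched below. AttachIE is a hypothetical variation on GetIE, written as a function so the macro can test whether a window was found. The return values 1 and 0 are my own convention, and I'm assuming VBEval hands the numeric result back to the macro as shown.

```vbscript
VBSTART
Dim IE

'Hypothetical variant of GetIE: returns 1 if a window whose URL
'contains the given text was found and attached, otherwise 0
Function AttachIE(URL)
  Dim objIE
  AttachIE = 0
  For Each objIE In CreateObject("Shell.Application").Windows
    If InStr(1, objIE.LocationURL, URL, vbTextCompare) > 0 Then
      Set IE = objIE
      AttachIE = 1
      Exit For
    End If
  Next
End Function
VBEND

VBEval>AttachIE("www.mjtnet.com"),attached
If>attached=1
  VBEval>IE.LocationURL,foundUrl
  MessageModal>Attached to %foundUrl%
Else
  MessageModal>No matching IE window found
Endif
```

Unlike GetIE, this version stops at the first matching window rather than keeping the last one it finds.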
These examples use the BODY element, which contains everything displayed on the page. As I mentioned before, you can be more specific and target some other element. With WebRecorder, or with a modified version of the ExtractTag VBScript function, you can use other attributes to identify the element (the ExtractTag function shown above just uses the numeric index). WebRecorder tries to make it simple by giving you a point-and-click wizard and making some assumptions for you, so that you need not fully understand the HTML of the page. But it still helps to understand HTML. Looking at the source of the page, you should be able to identify the element you need to extract from. And whether you extract directly from that element or extract the BODY and then use RegEx, being prepared to delve into the HTML source is going to get you further.
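As a sketch of what such a modified function might look like, here is a version that finds the element by its ID attribute instead of by tag name and index. ExtractById is a hypothetical name and "content" is a made-up ID, so substitute one from the source of your own page. It assumes the function sits in the same VBSTART/VBEND block as the IE variable and that IE has already been created or attached.

```vbscript
'Hypothetical variant of ExtractTag that identifies the element by ID
'set all to 1 to extract all HTML, 0 for only inside text without HTML
'returns an empty string if no element has that ID
Function ExtractById(ElemId, all)
  Dim el
  Set el = IE.document.getElementById(ElemId)
  If el Is Nothing Then
    ExtractById = ""
  ElseIf all = 1 Then
    ExtractById = el.outerHTML
  Else
    ExtractById = el.innerText
  End If
End Function
```

Called from the macro:

```vbscript
VBEval>ExtractById("content",0),ElemText
MessageModal>ElemText
```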
UPDATE: 19th January 2012
As of version 13.0.06 Macro Scheduler now includes a function called IEGetTags. For a given tag type and IE tab this will retrieve an array of tag contents. It can extract either just the text or the HTML of the tags. This example extracts the inner HTML of all DIV elements in the open IE document currently at www.mjtnet.com:
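The example line itself is not shown here. The call below is a sketch from memory, so the parameter order (URL substring, tag type, grab type, result array) and the H flag for HTML are assumptions to check against the IEGetTags entry in the Macro Scheduler help. The array name divArr is taken from the loop that follows.

```text
//Assumed parameter order: URL substring, tag type, H (HTML), result array
IEGetTags>mjtnet.com,DIV,H,divArr
MessageModal>divArr_count
```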
You can then cycle through each one with a Repeat/Until loop:
If>divArr_count>0
  Let>k=0
  Repeat>k
    Let>k=k+1
    Let>this_div_html=divArr_%k%
    //do something with it
    //e.g. use RegEx or substring searching to determine
    //if this is the DIV you want and extract from it
  Until>k=divArr_count
Endif
To further identify the tag you are interested in, or find the data you want, you can use RegEx, EasyPatterns, or string functions.
Macro Scheduler 13.0.06 and above also has a function called IETagEvent, which lets you simulate a click on a given tag, focus it, or modify its value. So once you have identified a tag using IEGetTags and your Repeat/Until loop, you can click on it, focus it or modify its value (e.g. for form fields).
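If you are on an older version, or are already working with the VBScript IE object from the earlier examples, something similar can be done through the DOM directly. To be clear, this is not IETagEvent but a rough VBScript alternative. ClickTag and SetTagValue are hypothetical names, and the code assumes the IE variable in the same VBSTART/VBEND block has already been created or attached.

```vbscript
'Hypothetical helpers that act on a tag by name and index
'CLng converts the index, since parameters passed from the macro
'may arrive as strings

'Simulates a click on the given element
Sub ClickTag(TagName, Num)
  IE.document.getElementsByTagName(TagName).Item(CLng(Num)).click
End Sub

'Sets the value of the given element (e.g. a form field)
Sub SetTagValue(TagName, Num, NewValue)
  IE.document.getElementsByTagName(TagName).Item(CLng(Num)).value = NewValue
End Sub
```

For example, to fill the first INPUT element and click the first link on the page:

```vbscript
VBRun>SetTagValue,INPUT,0,hello
VBRun>ClickTag,A,0
```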