May 20, 2011

Scraping Data From Web Pages

Filed under: Automation,Scripting,Web/Tech — Marcus Tettmar @ 1:02 pm

I’ve seen quite a lot of requests lately from people wanting to know how to extract text from web pages.

Macro Scheduler’s optional WebRecorder add-on simplifies the automation of web pages and includes functions for extracting tables, text or HTML from web page elements. WebRecorder’s Tag Extraction wizard makes it easy to create the code.

Sometimes you can choose a specific HTML element and identify it uniquely via it’s ID or NAME attribute. But other times you might want all the text from the whole page, or you may need to extract the entire page and then parse out the bits you’re interested in using RegEx or some other string manipulation functions.

To extract an entire page I specify the BODY element. If you want to extract data from web pages it does help if you know a little about HTML. And if you do you’ll know that each page has just one BODY element which contains the code making up the visible portion of the page.

Here’s code produced using WebRecorder when navigating to mjtnet.com and using the Tag Extraction wizard to extract the BODY text:

IE_Create>0,IE[0]

IE_Navigate>%IE[0]%,http://www.mjtnet.com/,r
IE_Wait>%IE[0]%,r
Wait>delay

//Modify buffer size if required (you may get a crash if buffer size too small for data) ...
Let>BODY0_SIZE=9999
IE_ExtractTag>%IE[0]%,,BODY,0,0,BODY0,r
MidStr>r_6,1,r,BODY0

MessageModal>BODY0

The macro simply displays just the text in a message box but could be set to pull out the full HTML. You could then parse it with RegEx to get the information you are interested in.

You will need WebRecorder installed for the above to work.

If you don’t have WebRecorder you can do the same with a bit more work using VBScript. Some library functions for doing this can be found here and here.

So here’s the equivalent in VBScript:

VBSTART
Dim IE

'Creates IE instance
Sub CreateIE
  Set IE = CreateObject("InternetExplorer.Application")
  IE.Visible=1
End Sub

'Navigate to an IE instance
Sub Navigate(URL)
  IE.Navigate URL
  do while IE.Busy
  loop
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>CreateIE
VBRun>Navigate,www.mjtnet.com

VBEval>ExtractTag("BODY",0,0),BodyText
MessageModal>BodyText

But what if you already have a macro which already opens IE, or works against an already open instance of IE? The above macros need to create the IE instance before they can access them and extract data from them. You may have a macro that already starts IE some other way – maybe just by using a RunProgram or ExecuteFile call, or indirectly via some other application. Many times people tackle the extraction of data from such an IE window by sending keystrokes to do a Select-All, Edit/Copy and then use GetClipboard; or even File/Save As to save the HTML to a file. This of course adds time and can be unreliable. So how else can we do it?

Well, this tip shows us a function we can use to attach to an existing IE instance. So let’s use that and then use our ExtractTag function to pull out the BODY HTML:

VBSTART
Dim IE

' Attaches to an already running IE instance with given URL
Sub GetIE(URL)
  Dim objInstances, objIE
  Set objInstances = CreateObject("Shell.Application").windows
  If objInstances.Count > 0 Then '/// make sure we have instances open.
    For Each objIE In objInstances
      If InStr(objIE.LocationURL,URL) > 0 then
        Set IE = objIE
      End if
    Next
  End if
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>GetIE,www.mjtnet.com

VBEval>ExtractTag("BODY",0,1),BodyHTML
MessageModal>BodyHTML

This snippet assumes a copy of IE is already open and pointing to www.mjtnet.com. The GetIE call creates a link to that IE window and then we use the ExtractTag function to pull out the HTML of the BODY element.

These examples use the BODY element, which will contain everything displayed on the page. As I mentioned before you can be more specific and specify some other element, and with WebRecorder, or a modified version of the ExtractTag VBScript function use other attributes to identify the element (the existing VBScript ExtractTag function shown above just uses the numeric index). WebRecorder tries to make it simple by giving you a point and click wizard, making some assumptions for you, so that you need not fully understand the HTML of the page. But it still helps you understand HTML. Looking at the source of the page you should be able to identify the element you need to extract from. And whether you extract directly from that or extract the BODY and then use RegEx being prepared to delve into the HTML source is going to get you further.

UPDATE: 19th January 2012

As of version 13.0.06 Macro Scheduler now includes a function called IEGetTags. For a given tag type and IE tab this will retrieve an array of tag contents. It can extract just the text, or html of the tags. This example extracts the inner HTML of all DIV elements in the open IE document currently at www.mjtnet.com:

IEGetTags>mjtnet.com,DIV,H,divArr

You can then cycle through each one with a Repeat Until

If>divArr_count>0
  Let>k=0
  Repeat>k
    Let>k=k+1
    Let>this_div_html=divArr_%k%
    .. 
    .. do something with it
    .. e.g. use RegEx or substring searching to determine 
    .. if this is the DIV you want and extract from it
    .. 
  Until>k=divArr_count
Endif

To further identify the tag you are interested in, or find the data you want, you can use RegEx, EasyPatterns, or string functions.

Macro Scheduler 13.0.06 and above also has a function called IETagEvent which will let you simulate a Click on a given tag, focus it, or modify its value. So once you have identified a tag using IEGetTags and your Repeat/Until loop you can click on it, focus it or modify its value (e.g. for form fields).