March 23, 2018

Filed under: Automation — Marcus Tettmar @ 9:55 pm

What is Screen Scraping?

Screen Scraping is a term used to describe the process of a computer program or macro extracting data from the display output of another application. Rather than parsing data from the database or data files belonging to an application, Screen Scraping pulls the data from the screen itself, extracting data that was intended to be displayed to the end-user as opposed to data designed for output to another application or database.  

Screen Scraping is necessary when there is a need to access the information displayed by the application but there is no method provided to access it behind the scenes. The database or data files may not be accessible, or may be undocumented or proprietary and therefore cannot be parsed easily; the costs associated with interacting with the database may be too high; or the license agreement or warranty prohibits it. In the case of legacy systems that are no longer supported there may be no knowledge of the data structures, or the technology used is no longer compatible with current technology. In these cases we are resorted to extracting the data from the screen – from the windows of the application.

The term Screen Scraping probably originates from the era of computer terminals when you could connect the terminal output of a computer to an input port on another and therefore record the screen data.  

Screen Scraping Methods

There are a number of ways we can retrieve information from the screen using Macro Scheduler, depending on the type of application the data is in.

Screen Scraping Web Applications

Applications like Macro Scheduler’s WebRecorder can access the data and objects inside an Interner Explorer window and can therefore be used to extract the data.  Technically speaking I would not call this screen scraping since WebRecorder is using an API interface provided by Internet Explorer, but the process of extracting information from web sites is commonly refered to as Screen Scraping.  With WebRecorder we can use the ExtractTag wizard to create code that extracts the text from a particular element in the page.   While WebRecorder is the easiest way to do it, it is also possible to automate IE and extract data from web pages by using VBScript. The following forum posts may help:

Automate Internet Explorer with OLE/ActiveX
Automate web forms with IE
HTTP GET and POST using VBscript

Screen Scraping Microsoft Office Applications

Microsoft Office Applications, like Internet Explorer, have a COM interface that allows scripts to manipulate them and access the data held within them.  Again, not really scraping data from the screen itself, as you are getting it directly from a programming interface. There are a number of examples in the forums and blog archives and also some sample scripts that come with Macro Scheduler which demonstrate how to automate Office applications and retrieve data from them.  

Working with Excel

Screen Scraping Regular Windows Applications

Most other applications don’t offer a scripting interface like MS Office or Internet Explorer.  This is where we really need to work directly with the screen.   There are a number of ways we can do this kind of Screen Scraping with Macro Scheduler.

Screen Scraping via Optical Character Recognition

Macro Scheduler 14.4 includes some really neat functions which make it really easy to OCR a portion of the screen:

  • OCRScreen
  • OCRWindow
  • OCRArea

The first of these is the simplest. It simply scans the entire screen and returns all the text it can recognise. Of course this is also the slowest as it has to perform OCR against the entire screen. OCRWindow takes a window title and scans only the area of the screen where that window appears. This is nice and simple and a good compromise if the window isn’t too large. Finally, OCRArea can be given a rectangular screen region (X1,Y1,X2,Y2). You could use FindObject to find the coordinates of a specific UI object and pass those coordinates to OCRArea if you want to narrow things right down.

The Text Capture Functions

Macro Scheduler includes some Text Capture functions which can be used to extract text from a given window, rectangular screen area or screen point.  These functions use low level system hooks which monitor applications calling the various “TextOut” functions that Windows uses to output text to the screen.  By doing so they are able to capture this text.  The Text Capture functions return the text to  a variable which you can then use as needed.  

However, a few applications don’t use the Windows built-in functions to create and output text.  Don’t worry –Most do, but a few use their own techniques.  When you realise that text on the screen is just a sequence of small dots, if the application programmer decided to build his own routine to assemble text from dots rather than calling the Windows functions which already do that for you, you’re not going to be able to capture it.

The text capture functions and their limitations are explained here.  There is an example application, here, created with Macro Scheduler, which you can use to determine whether or not the text you want to capture can be captured using the text capture functions.

http://www.mjtnet.com/blog/2007/12/12/capturing-screen-text/
http://www.mjtnet.com/blog/2008/01/03/screen-scrape-text-capture-example/

Using the Clipboard for Screen Scraping

If the text you want to capture is selectable then you can use the clipboard to retrieve it.  A Macro Scheduler macro can send the keystrokes necessary to highlight and copy the text to the clipboard and then use the GetClipboard function to retrieve that text to a variable.  This is far less elegant than using the Text Capture functions but might be necessary if the application concerned is not utlising any of the Windows text out functions to create the text.

SetFocus>Notepad*
//Select ALL
Press CTRL
Send>a
Release CTRL
//Copy to clipboard
Press CTRL
Send>c
Release CTRL

//Get and display the data
WaitClipboard
GetClipboard>theData
MessageModal>theData

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment