May 20, 2011

Scraping Data From Web Pages

Filed under: Automation, Scripting, Web/Tech — Marcus Tettmar @ 1:02 pm

I’ve seen quite a lot of requests lately from people wanting to know how to extract text from web pages.

Macro Scheduler’s optional WebRecorder add-on simplifies the automation of web pages and includes functions for extracting tables, text or HTML from web page elements. WebRecorder’s Tag Extraction wizard makes it easy to create the code.

Sometimes you can choose a specific HTML element and identify it uniquely via its ID or NAME attribute. But other times you might want all the text from the whole page, or you may need to extract the entire page and then parse out the bits you’re interested in using RegEx or some other string manipulation functions.

To extract an entire page I specify the BODY element. If you want to extract data from web pages it does help if you know a little about HTML. And if you do you’ll know that each page has just one BODY element which contains the code making up the visible portion of the page.

Here’s code produced using WebRecorder when navigating to mjtnet.com and using the Tag Extraction wizard to extract the BODY text:

IE_Create>0,IE[0]

IE_Navigate>%IE[0]%,http://www.mjtnet.com/,r
IE_Wait>%IE[0]%,r
Wait>delay

//Modify buffer size if required (you may get a crash if buffer size too small for data) ...
Let>BODY0_SIZE=9999
IE_ExtractTag>%IE[0]%,,BODY,0,0,BODY0,r
MidStr>r_6,1,r,BODY0

MessageModal>BODY0

The macro simply displays the text in a message box, but it could be set to pull out the full HTML instead. You could then parse that with RegEx to get the information you are interested in.

You will need WebRecorder installed for the above to work.

If you don’t have WebRecorder you can do the same with a bit more work using VBScript. Some library functions for doing this can be found here and here.

So here’s the equivalent in VBScript:

VBSTART
Dim IE

'Creates IE instance
Sub CreateIE
  Set IE = CreateObject("InternetExplorer.Application")
  IE.Visible=1
End Sub

'Navigates the IE instance to a URL and waits until it is no longer busy
Sub Navigate(URL)
  IE.Navigate URL
  do while IE.Busy
  loop
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>CreateIE
VBRun>Navigate,www.mjtnet.com

VBEval>ExtractTag("BODY",0,0),BodyText
MessageModal>BodyText
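
Once the page HTML is in a VBScript string it can be parsed with VBScript’s built-in RegExp object. The following is only a minimal sketch: FindLinks and DemoLinks are hypothetical helper names, the sample HTML is made up so the macro can be tried without a browser window, and the pattern assumes href values are wrapped in double quotes, which will not hold for every page.

```vbscript
VBSTART
' Hypothetical helper: returns the href value of every <a> tag
' found in html, one per line. Assumes double-quoted href values.
Function FindLinks(html)
  Dim re, m, result
  Set re = New RegExp
  re.Pattern = "<a\s[^>]*href\s*=\s*""([^""]*)"""
  re.IgnoreCase = True
  re.Global = True
  result = ""
  For Each m In re.Execute(html)
    ' SubMatches(0) is the text captured by the first bracketed group
    result = result & m.SubMatches(0) & vbCrLf
  Next
  FindLinks = result
End Function

' Hypothetical demo: runs FindLinks against a small hard-coded sample
Function DemoLinks()
  Dim sample
  sample = "<p><a href=""http://www.mjtnet.com/"">Home</a> "
  sample = sample & "<A class=""x"" HREF=""/blog/"">Blog</A></p>"
  DemoLinks = FindLinks(sample)
End Function
VBEND

VBEval>DemoLinks(),links
MessageModal>links
```

To run it against a live page you would pass the result of ExtractTag("BODY",0,1) to FindLinks from inside another VBScript function, which avoids having to pass HTML containing quotes and commas through VBEval.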

But what if you have a macro which already opens IE, or works against an already open instance of IE? The above macros need to create the IE instance before they can access it and extract data from it. You may have a macro that starts IE some other way, maybe just by using a RunProgram or ExecuteFile call, or indirectly via some other application. Many times people tackle the extraction of data from such an IE window by sending keystrokes to do a Select-All and Edit/Copy and then using GetClipboard, or even by using File/Save As to save the HTML to a file. This of course adds time and can be unreliable. So how else can we do it?

Well, this tip shows us a function we can use to attach to an existing IE instance. So let’s use that and then use our ExtractTag function to pull out the BODY HTML:

VBSTART
Dim IE

' Attaches to an already running IE instance with given URL
Sub GetIE(URL)
  Dim objInstances, objIE
  Set objInstances = CreateObject("Shell.Application").windows
  If objInstances.Count > 0 Then '/// make sure we have instances open.
    For Each objIE In objInstances
      If InStr(objIE.LocationURL,URL) > 0 then
        Set IE = objIE
      End if
    Next
  End if
End Sub

'This function extracts text from a specific tag by name and index
'e.g. TABLE,0 (1st Table element) or P,1 (2nd Paragraph element)
'set all to 1 to extract all HTML, 0 for only inside text without HTML
Function ExtractTag(TagName,Num,all)
  dim t
  set t = IE.document.getElementsbyTagname(Tagname)
  if all=1 then
    ExtractTag = t.Item(Num).outerHTML
  else
    ExtractTag = t.Item(Num).innerText
  end if
End Function
VBEND

VBRun>GetIE,www.mjtnet.com

VBEval>ExtractTag("BODY",0,1),BodyHTML
MessageModal>BodyHTML

This snippet assumes a copy of IE is already open and pointing to www.mjtnet.com. The GetIE call creates a link to that IE window and then we use the ExtractTag function to pull out the HTML of the BODY element.
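
If no matching window is open, GetIE leaves the IE variable unset and the later ExtractTag call will fail. One way to guard against that is sketched below. FindIE is a hypothetical variant of GetIE, not a built-in function: it returns 1 when it has attached to a window and 0 otherwise, stops at the first match, and compares the URL without regard to case.

```vbscript
VBSTART
Dim IE

' Hypothetical variant of GetIE: returns 1 if a window whose URL
' contains the given text was found and attached, otherwise 0
Function FindIE(URL)
  Dim objIE
  FindIE = 0
  For Each objIE In CreateObject("Shell.Application").Windows
    If InStr(1, objIE.LocationURL, URL, vbTextCompare) > 0 Then
      Set IE = objIE
      FindIE = 1
      Exit For
    End If
  Next
End Function
VBEND

VBEval>FindIE("www.mjtnet.com"),found
If>found=0
  MessageModal>No IE window showing www.mjtnet.com was found
Endif
```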

These examples use the BODY element, which will contain everything displayed on the page. As I mentioned before you can be more specific and specify some other element, and with WebRecorder, or a modified version of the ExtractTag VBScript function, you can use other attributes to identify the element (the existing VBScript ExtractTag function shown above just uses the numeric index). WebRecorder tries to make it simple by giving you a point and click wizard, making some assumptions for you, so that you need not fully understand the HTML of the page. But it still helps if you understand HTML. Looking at the source of the page you should be able to identify the element you need to extract from. And whether you extract directly from that element or extract the BODY and then use RegEx, being prepared to delve into the HTML source is going to get you further.
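
As a rough illustration of what a modified version of ExtractTag might look like, here is a sketch that finds the element by its ID attribute using the DOM’s getElementById method. ExtractById is a hypothetical name, and the ID "content" is only an assumption for the example; check the page source for the real ID of the element you want.

```vbscript
VBSTART
Dim IE

' Attaches to an already running IE instance with given URL
Sub GetIE(URL)
  Dim objInstances, objIE
  Set objInstances = CreateObject("Shell.Application").windows
  For Each objIE In objInstances
    If InStr(objIE.LocationURL,URL) > 0 then
      Set IE = objIE
    End if
  Next
End Sub

' Hypothetical variant of ExtractTag: locates the element by its ID
' set all to 1 to extract all HTML, 0 for only inside text without HTML
' returns an empty string if no element has that ID
Function ExtractById(ElemId,all)
  Dim el
  Set el = IE.document.getElementById(ElemId)
  If el Is Nothing Then
    ExtractById = ""
  ElseIf all=1 Then
    ExtractById = el.outerHTML
  Else
    ExtractById = el.innerText
  End If
End Function
VBEND

VBRun>GetIE,www.mjtnet.com

// "content" is an assumed ID used purely for illustration
VBEval>ExtractById("content",0),ElemText
MessageModal>ElemText
```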

UPDATE: 19th January 2012

As of version 13.0.06 Macro Scheduler now includes a function called IEGetTags. For a given tag type and IE tab this will retrieve an array of tag contents. It can extract either just the text or the HTML of the tags. This example extracts the inner HTML of all DIV elements in the open IE document currently at www.mjtnet.com:

IEGetTags>mjtnet.com,DIV,H,divArr

You can then cycle through each one with a Repeat/Until loop:

If>divArr_count>0
  Let>k=0
  Repeat>k
    Let>k=k+1
    Let>this_div_html=divArr_%k%
    //
    // do something with it
    // e.g. use RegEx or substring searching to determine
    // if this is the DIV you want and extract from it
    //
  Until>k=divArr_count
Endif

To further identify the tag you are interested in, or find the data you want, you can use RegEx, EasyPatterns, or string functions.

Macro Scheduler 13.0.06 and above also has a function called IETagEvent which will let you simulate a Click on a given tag, focus it, or modify its value. So once you have identified a tag using IEGetTags and your Repeat/Until loop you can click on it, focus it or modify its value (e.g. for form fields).

Macro Scheduler 12.1.6 – Now With Macro Recording in Script Editor

Filed under: Announcements, Macro Recorder — Marcus Tettmar @ 10:59 am

Macro Scheduler 12.1.6 is now available. This release includes a bonus new feature: Access to the Macro Recorder from within the editor.

This means you can now record steps within the editor, with the recorded code being inserted at the current cursor position. This also means the macro recorder can be invoked multiple times to insert recorded steps at whatever point in the script you like.

So you could build up recorded scripts step by step or add recorded code to a script you have already written. You might have a macro which gets to a specific point in a script and then you want to record some keystrokes against that application, then manually add some more code afterwards. By being able to access the Macro Recorder from the Editor and insert recorded steps on the fly, building up scripts like this becomes much easier.

Check it out. Look for the Macro Recorder icon on the Editor toolbar and under the Tools menu. Download from the usual locations (links below).

Registered Downloads/Upgrades | Evaluation Downloads | New License Sales

May 4, 2011

Macro Scheduler 12.1.5 Available

Filed under: Announcements — Marcus Tettmar @ 11:22 am

Macro Scheduler 12.1.5 is now available with the following fixes since my last update announcement:

  • Fixed: FTP_TIMEOUT not properly setting the connection timeout.
  • Fixed: Script Encryption not working and causing corrupt scripts.

Workflow Designer and the SDK have also been updated to the same MacroScript version.


April 26, 2011

Authenticate Your EXEs – Discounted Code Signing

Filed under: Announcements, General — Marcus Tettmar @ 10:14 am

What is Code Signing?

Since XP, when you download an executable file from the Internet the browser checks the file’s Authenticode signature. This verifies who the publisher is. You get a dialog asking if you wish to download software from this publisher. If there is no signature the warning is more severe and it says something like:

The publisher could not be verified. Are you sure you want to run this software? This file does not have a valid digital signature that verifies its publisher. You should only run software from publishers you trust.

In some cases you will also get a similar warning when running applications that haven’t been signed, especially if the executable resides on a network drive. Apps that have been signed are trusted more by the operating system. Vista and Windows 7 are fussier, and certain types of app must be signed.

Code signing protects against tampering and impersonation. If a signed app is tampered with or modified in some way the signature becomes invalid and so the user will be warned when they try to run it.

How does it work?

A publisher applies for a digital certificate from a Certification Authority like Comodo or Verisign. Using the Microsoft Authenticode tools the publisher can sign their applications with their digital certificate. The signing tool computes a hash of the code, signs that hash with the publisher’s private key, and appends the resulting signature to the executable. If the code is later modified, the hash no longer matches and the signature becomes invalid.

Should I sign EXEs Compiled with Macro Scheduler?

If you distribute compiled macros to others, or let people download them from the web you should definitely be signing them. Users can then see who the publisher is and be sure that the file hasn’t been modified in any way, and will no longer see the unknown publisher warning presented by the web browser/operating system.

So how do I sign my EXEs?

First you need to obtain an Authenticode Certificate. We have negotiated a very helpful 10% discount for our customers off the price of Comodo Code Signing certificates supplied by K Software, an official Comodo Reseller. K Software prices are already extremely competitive and now, as a Macro Scheduler user, you get an extra 10% off.

The certificate used to sign our software, including Macro Scheduler was supplied by K Software. So you know you are in good company! 🙂

To find out more and place an order visit K Software’s Code Signing page. To get your 10% discount, log in to the Macro Scheduler Registered Customer area to obtain your special discount code.

You also need the code signing tools. These come with the Microsoft Platform SDK and can be downloaded here:
Platform SDK Redistributable: CAPICOM

Once installed, launch SignTool.exe to sign your EXE. For command line options see: Sign Tool (SignTool.exe)

For more step-by-step help, Jeff Wilcox has written an excellent article about code signing and Authenticode. It covers everything from the order process through to the tools you’ll need to do the signing. Read the article about code signing here.

April 5, 2011

Another AV False Positive – McAfee

Filed under: Announcements — Marcus Tettmar @ 8:53 am

Several customers have written to us in the last couple of days reporting that the latest version of McAfee is detecting a “trojan” in the compiler (msrt.exe) that shipped with Macro Scheduler Pro v11. It also reports a virus in macros compiled with that version of the compiler.

The virus reported is “Generic.dx!xdn”.

The same version of McAfee does NOT report an issue with the Macro Scheduler 12 compiler. It seems to be particular to v11.

This is a FALSE POSITIVE. We have submitted the v11 compiler to virustotal.com and ALL other AV vendors report it as clean.

Unfortunately McAfee is quarantining this file and preventing our customers from using the software and their compiled macros.

I have submitted a false positive dispute to McAfee and I would ask all customers affected by this to do the same. Details on how to report a false positive can be found here:

https://community.mcafee.com/docs/DOC-1041

There is nothing that we at MJT Net can do to prevent this false positive apart from submitting a claim to McAfee. We are at their mercy. My experience is that they usually fix these issues quickly and I would hope that the next definitions update solves the problem.

However, once McAfee has updated their database you may need to reinstall Macro Scheduler v11 and may need to recompile your macros unless it is possible to recover the quarantined files. We are happy to help with this but you may need to contact McAfee for assistance with recovering files from quarantine.

Given my last blog post it feels like AV vendors are out to get us at the moment! It is most frustrating.

But it’s not only us. Just the other day I read this report about a “keylogger” being wrongly reported on Samsung laptops by the “VIPRE Antivirus Software”. The false positive could be reproduced simply by creating a new folder called “SL” anywhere on the PC!

March 24, 2011

False Positives – Preying on Fear and Ruining Reputations

Filed under: Uncategorized — Marcus Tettmar @ 5:09 pm

Update: 25/03/2011 1420 GMT – Symantec have just emailed me to say that this detection has been removed and will not be present in the next definition update.

Fake viruses are one thing. I recently helped out four people who fell victim to the fake “System Tool” virus, which pretends that your PC has a virus, prevents the computer from being used, and tries to get people to visit its website and hand over their credit card details. They prey on fear.

But legitimate anti-virus vendors aren’t an awful lot better. I know a number of people who bought a home PC with Norton pre-installed. They get a free 12 month subscription for virus definitions. But they don’t know that. Most of them have no idea that an anti-virus product even needs to download new updates. Then 12 months later they get a nasty looking warning saying that their PC is unprotected and now they have to pay for a new subscription. Frightened that something nasty will happen to their PC they pony up.

What they didn’t realise is that there are cheaper/better and even free alternatives. When I tell them they seem pretty angry.

Now it seems Norton have decided that small software companies are not to be trusted and are scaring people into deleting perfectly good software.

I recently received reports from a couple of trial-downloaders saying that their Norton/Symantec software reports a possible virus in Macro Scheduler.

The “virus” is: ws.reputation.1

Details of this threat can be found here. I quote:

“WS.Reputation.1 is a detection for files that have a low reputation score based on analyzing data from Symantec’s community of users and therefore are likely to be security risks. Detections of this type are based on Symantec’s reputation-based security technology. Because this detection is based on a reputation score, it does not represent a specific class of threat like adware or spyware, but instead applies to all threat categories.

The reputation-based system uses “the wisdom of crowds” (Symantec’s tens of millions of end users) connected to cloud-based intelligence to compute a reputation score for an application, and in the process identify malicious software in an entirely new way beyond traditional signatures and behavior-based detection techniques.”

In other words it seems to be saying:

“Because only a few of our users have used this product, it must be dangerous, though we have no specific idea why.”

Isn’t there a catch 22 here? Since insufficient people are using it to deem it safe, Norton blocks it, which means no further people CAN use it, which means the number of people using it won’t grow, which means its reputation gets worse. A new file needs lots of people to use it for Norton to pass it, but if they block it new people can’t use it? It’s daft and very unfair.

And we’ve been in business and selling Macro Scheduler since 1997! If you’re a start-up with a new product I guess you’re going to have trouble getting the average home PC user to install your software since so many of them use Norton.

I wonder what Peter Norton would make of this.

If you use Norton – in fact even if you don’t – please send them a false positive report by going to:
https://submit.symantec.com/false_positive/

March 23, 2011

Podcasts: Macro Scheduler Consultant Spotlight with Gary DalSanto

Filed under: Announcements, Podcasts — Marcus Tettmar @ 9:53 am

In our latest podcast series Tracy talks to Gary DalSanto of Inventive Software Designs who is one of our partner consultants providing customers with custom automation solutions and helping them with their Macro Scheduler script development.

Gary has been working with Macro Scheduler since 2006 on a large variety of different automation scenarios. His first experience of Macro Scheduler was when he converted a large scale IBM Rational Robot project over to Macro Scheduler and found that Macro Scheduler matched the functionality at a much lower cost.

I quickly found out that not only Macro Scheduler could perform all the functionality that we were using with the higher priced commercial tool, but it could also do it for a fraction of the cost. So from that point on, I was pretty much sold.

Since then Gary has used Macro Scheduler in projects ranging from data migration in Telecoms companies to automating ordering systems in a College Bookstore.

So to summarize it all up, by automating all these processes for them over the course of less than a year, the store turned a profit for the first time in the history of the store.

Gary has worked with many MJT Net customers, building custom automation solutions for them using Macro Scheduler, and providing assistance with their own scripts.

Podcast: Gary DalSanto – Part 1: Telecoms Company Data Transfer

[audio:http://www.mjtnet.com/podcasts/GaryDalSantoPart1.mp3]

Podcast: Gary DalSanto – Part 2: Bookstore Order Processing Automation

[audio:http://www.mjtnet.com/podcasts/GaryDalSantoPart2.mp3]

More info and contact details for Gary are here.

February 23, 2011

New Video: Using The Debugger

Filed under: Announcements, Automation, Scripting — Marcus Tettmar @ 9:44 am

Macro Scheduler veteran John Brozycki has put together this fantastic video tutorial all about Macro Scheduler’s debugging capabilities. The video is 18 minutes long and demonstrates every debug feature, shows examples of their use, and talks about how useful the debugger can be for problem resolution as well as script creation. Take a look:

A larger version of the video can be found here.

John Brozycki is an information security professional who uses Macro Scheduler as a tool to accomplish a wide range of tasks in his daily activities. His personal web site is www.trueinsecurity.com.

I think this is an excellent tutorial which all script developers should benefit from. Thanks John!

February 10, 2011

Rewarding Feedback

Filed under: General — Marcus Tettmar @ 3:46 pm

I received this in an email yesterday and just had to share it:

“Thank you for a truly superior product! I used this at a previous place of employment and developed a system to autopopulate our very static software. When the software vender sent training staff they were so impressed they offered me a job! I now work for them and have you to thank for it!”

It’s not always easy running a small software company. But receiving feedback like that certainly makes up for a lot of hard work.

February 8, 2011

Undocumented Internal Dialog Event Parameters

Filed under: Scripting, Tutorials — Marcus Tettmar @ 3:35 pm

In this forum post Armsys asks how he can determine which key the user pressed in an OnKeyPress dialog event handler. The solution I posted reveals an undocumented feature: Internal event parameters.

While there is a sample macro called “Dialogs – MouseOver” which ships with Macro Scheduler and demonstrates these event parameters, they are missing from the help file.

So here’s a short 3 minute video showing how this sample script works and demonstrating how you can determine what event parameters are available for use.

(Don’t forget you can view full screen and/or change the quality with the options in the video control panel above).

If you’re completely new to custom dialogs you might also want to watch part 1 and part 2 of the custom dialog video tutorials first.