How to read out the DOM in IE11

Post by **Heinz57** » Tue Jul 28, 2015 6:53 am

Hello,

I have just finished my first two smaller macros and they work well.

I now want to use MacroScheduler for scraping web pages and ideally I would like to read out the DOM, which would be much better and safer than "OCRing" the browser's pixmap surface.

The relevant pages have no tables and very few ids or tag names; they are mainly built from <div class="some css class">blah blah</div> elements. Extracting the main id "maincontent" works, but the resulting data has no structure at all and therefore cannot be used properly.

My question is whether any of the following alternatives exists. I found none of them in the documentation or on the forum:

1) Extract divs by passing their CSS class names. This would not be a perfect solution for me, but much better than relying on the non-existent ids or names. Ideally, all occurrences of a given CSS class would be available in an array.

2) Navigate within the DOM, starting at some id and recursively looping through its child nodes, extracting either entire branches or leaf nodes one by one, ideally identifying nodes by their CSS classes.

3) I don't expect this, but the best solution would be to get a (sub)tree of data objects, starting at a given id or at the first occurrence of a CSS class, with all child nodes transferred to a tree of data variables that could then be processed in MacroScheduler. This would fully preserve the logical data structure of the browser's content.

4) Any other solution is welcome.
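
To make options 1 to 3 concrete, here is a minimal sketch of the idea in Python rather than in MacroScheduler. It uses only the standard library's html.parser and assumes the page's HTML is already available as a string. The names Node, TreeBuilder, find_by_id and find_by_class, and the sample markup, are hypothetical and serve only as illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the parsed tree (hypothetical helper type)."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text = ""  # text found directly inside this element

    @property
    def classes(self):
        return (self.attrs.get("class") or "").split()


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_by_id(node, wanted_id):
    """Option 2/3: depth-first search for the subtree with the given id."""
    if node.attrs.get("id") == wanted_id:
        return node
    for child in node.children:
        found = find_by_id(child, wanted_id)
        if found is not None:
            return found
    return None


def find_by_class(node, css_class):
    """Option 1: all nodes carrying css_class, in document order."""
    hits = [node] if css_class in node.classes else []
    for child in node.children:
        hits.extend(find_by_class(child, css_class))
    return hits


# Made-up sample page, only for illustration.
SAMPLE = """
<div id="maincontent">
  <div class="record"><div class="name">Alice</div><div class="price big">10</div></div>
  <div class="record"><div class="name">Bob</div><div class="price">20</div></div>
</div>
"""

main = find_by_id(parse(SAMPLE), "maincontent")
names = [n.text for n in find_by_class(main, "name")]
prices = [n.text for n in find_by_class(main, "price")]
print(names, prices)
```

Matching on individual class tokens (rather than on the whole class attribute string) means that an element with class="price big" is still found when searching for "price".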

BTW: We had already implemented reading the entire DOM via the COM interface into a tree of our own DOM data objects, which are then traversed to extract the data. But for several reasons we want (and need) to move away from this older solution.

My plan is to use the SDK and to run all these actions from my application software. In this respect scraping IE by reading the DOM is the core issue.

Best regards
Heinz57

Post by **Marcus Tettmar** » Tue Jul 28, 2015 7:19 am

You can do 1 using IEGetTagsByAttrib.

Or you could extract the containing element and then use RegEx to parse/loop through.
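
As a rough illustration of the RegEx route, here is a sketch in Python (not MacroScheduler). It assumes the containing element's HTML has already been extracted into a string; the sample markup and the names container_html and LEAF_DIV are made up for the example.

```python
import re

# Made-up HTML of an already extracted containing element.
container_html = (
    '<div class="item"><div class="title">First</div><div class="qty">3</div></div>'
    '<div class="item"><div class="title">Second</div><div class="qty">7</div></div>'
)

# Matches innermost divs only: the body must not contain another tag.
LEAF_DIV = re.compile(r'<div\s+class="([^"]*)"\s*>([^<]*)</div>', re.IGNORECASE)

# Loop through all matches and collect (class name, text) pairs.
pairs = [(m.group(1), m.group(2).strip()) for m in LEAF_DIV.finditer(container_html)]

for css_class, text in pairs:
    print(css_class, text)
```

A pattern like this only copes with simple leaf elements; nested or irregular markup quickly becomes fragile to handle with regular expressions.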

Post by **Heinz57** » Tue Jul 28, 2015 7:45 am

Hello and thank you for the quick answer.

I tried this, but I did not see any way to select DOM nodes by their CSS class names.

Am I missing anything?

And RegEx won't help me in interpreting and classifying the data and linking it to instance variables.

Post by **Marcus Tettmar** » Wed Jul 29, 2015 10:50 am

You can use any attribute you like:

IEGetTagsByAttrib>www.domain.com,DIV,CLASSNAME=some classname,H,aDivs

Post by **Heinz57** » Thu Jul 30, 2015 6:53 pm

I have tried it and it works nicely!

My logical table consists of several main <div> elements, each containing a couple of sub-divs with different CSS class names. I was able to extract these into 5 arrays, one array per CSS class. It is then easy to combine their elements into data "objects" and thereby bind together the data elements of one logical record.
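
The combining step can be sketched as follows, in Python rather than MacroScheduler. It assumes each per-class array holds exactly one value per record, in the same order; the function name combine_columns and the sample values are hypothetical.

```python
from typing import Dict, List


def combine_columns(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn parallel per-class arrays into one dict per logical record."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # A missing sub-div in any record would silently shift the data.
        raise ValueError("arrays differ in length: %s" % sorted(lengths))
    fields = list(columns)
    return [dict(zip(fields, row)) for row in zip(*columns.values())]


# Made-up arrays, one per CSS class.
records = combine_columns({
    "name": ["Alice", "Bob"],
    "price": ["10", "20"],
})
print(records)
```

The length check matters because this approach relies on position alone: if one record lacks a sub-div of some class, every later record would be paired with the wrong value.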

Great!

Pardon the remark, but a few more words in the documentation would sometimes help users and save a lot of support effort, as in this case. Sparse or minimalistic documentation is never fruitful.

A hint for other readers who might have the same requirements:

For navigating complex DOM structures (see my original post) and extracting less simple DOM elements, I am investigating these other methods that could be useful:

Insert JavaScript code into the IE DOM, which then performs the complex DOM navigation that MacroScheduler cannot do.
Add an IE "Browser Helper Object" (possibly written in VB or some .NET language) to IE, which does the same.
We have already navigated the DOM through the COM interface from our own application, but there are reasons why we would like to limit or even avoid this in the future.