Technical support and scripting issues
Moderators: JRL, Dorian (MJT support)
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Tue Oct 01, 2013 7:37 am
Hi,
does anyone know how one can scrape the URLs from a Google query?
Can't seem to find an answer.
Any help appreciated.
-
CyberCitizen
- Automation Wizard
- Posts: 724
- Joined: Sun Jun 20, 2004 7:06 am
- Location: Adelaide, South Australia
Post
by CyberCitizen » Tue Oct 01, 2013 2:07 pm
Not in a position to test at the moment, but couldn't you do an HTTPRequest, save the results to a file (or even keep them as a variable), and then use the RegEx Easy Pattern [URL] to return the links?
Or are you talking about scraping the data from each page of the search results once you have the links?
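Something along these lines might work as a starting point. This is only a rough, untested sketch of the HTTPRequest-plus-Easy-Pattern idea; the search URL and variable names here are just placeholders, and the third RegEx parameter is set to 1 to enable Easy Pattern mode:
Code: Select all
Let>HTTP_SSL=1
//fetch the results page into a variable
HTTPRequest>https://www.google.co.uk/search?q=example,,GET,,res
//Easy Patterns flag = 1, so [URL] matches any URL in the page
RegEx>[URL],res,1,matches,numMatches,0
If>numMatches>0
  MessageModal>First link found: %matches_1%
Endif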
FIREFIGHTER
-
Dorian (MJT support)
- Automation Wizard
- Posts: 1414
- Joined: Sun Nov 03, 2002 3:19 am
Post
by Dorian (MJT support) » Wed Oct 02, 2013 11:53 pm
Cyber's solution is the best one: use HTTPRequest and RegEx. The challenge with Google is that the layout sometimes changes, causing the RegEx to fail. Plus it litters the results with paid ads and Wikipedia links.
But it is certainly possible. We just finished writing a custom script for someone which did exactly this.
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Thu Oct 03, 2013 8:10 am
Thank you for your replies.
I've tried HTTPRequest with several RegEx patterns.
I almost always end up with 5-20 URLs, but they are all
http://www.google.
This is my latest regex
Code: Select all
Let>HTTP_SSL=1
HTTPRequest>https://www.google.co.uk/#q=filetype:swf+dentist+london,,GET,,res
Let>Pat=[paste from http://alanstorm.com/url_regex_explained]
RegEx>Pat,res,0,match,num,,,
mdl>match
The whole pattern can't be shown here, but it's from
http://alanstorm.com/url_regex_explained
I look for the results in the debugger.
Can't figure this out.

-
Marcus Tettmar
- Site Admin
- Posts: 7395
- Joined: Thu Sep 19, 2002 3:00 pm
- Location: Dorset, UK
-
Contact:
Post
by Marcus Tettmar » Thu Oct 03, 2013 1:48 pm
Why try to scrape the Google front end? It was designed for a human to look at and has all kinds of dynamic content, adverts and so on.
Instead you should use Google's simple API, which returns plain results with less clutter.
Try this example:
Code: Select all
//location of output file
Let>out_file=%SCRIPT_DIR%\google_results.txt
//specify the search term:
Let>theSearchTerm=Nord Keyboards
//specify how many results you want:
Let>theQuantityWanted=20
//replace spaces in search term with + symbol
StringReplace>theSearchTerm,SPACE,+,theSearchTerm
DeleteFile>out_file
Let>start=1
Let>numresults=0
Label>get_results
//Uses Google's Ajax API - Simpler For Parsing - returns 8 at a time (use start=)
Let>url=http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%theSearchTerm%&rsz=large&start=%start%
HTTPRequest>url,,GET,,htmlRES
//pull out all the URLs
RegEx>(?<="url":").*?(?="),htmlRes,0,URLs,numURLS,0
//loop through all the URLs
Let>sres=0
Repeat>sres
Let>sres=sres+1
Let>this_url=URLS_%sres%
If>numresults<theQuantityWanted
WriteLn>out_file,wlres,this_url
Endif
Let>numresults=numresults+1
Until>sres=numURLS
If>numresults<theQuantityWanted
Let>start=start+8
Goto>get_results
Endif
ExecuteFile>out_file
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Thu Oct 03, 2013 6:21 pm
Thanks Marcus!
Just what I need
