Technical support and scripting issues
Moderators: JRL, Dorian (MJT support)
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Tue Oct 01, 2013 7:37 am
Hi,
does anyone know how one can scrape the URLs from a Google query?
Can't seem to find an answer.
Any help appreciated.
-
CyberCitizen
- Automation Wizard
- Posts: 724
- Joined: Sun Jun 20, 2004 7:06 am
- Location: Adelaide, South Australia
Post
by CyberCitizen » Tue Oct 01, 2013 2:07 pm
Not in a position to test at the moment, but couldn't you do an HTTPRequest, save the results to a file (or even keep them as a variable), and then use the RegEx Easy Pattern [URL] to return the links?
Or are you talking about scraping the data from each page of the search results once you have the links?
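Something along these lines might work as a starting point. This is only a rough, untested sketch of the HTTPRequest-plus-Easy-Pattern idea; the search URL and variable names here are just placeholders, and the third RegEx parameter is set to 1 to enable Easy Pattern mode:
Code: Select all
Let>HTTP_SSL=1
//fetch the results page into a variable
HTTPRequest>https://www.google.co.uk/search?q=example,,GET,,res
//Easy Patterns flag = 1, so [URL] matches any URL in the page
RegEx>[URL],res,1,matches,numMatches,0
If>numMatches>0
  MessageModal>First link found: %matches_1%
Endif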
FIREFIGHTER
-
Dorian (MJT support)
- Automation Wizard
- Posts: 1414
- Joined: Sun Nov 03, 2002 3:19 am
Post
by Dorian (MJT support) » Wed Oct 02, 2013 11:53 pm
Cyber's solution is the best one: use HTTPRequest and RegEx. The challenge with Google is that the layout sometimes changes, causing the RegEx to fail. Plus it litters the results with paid ads and Wikipedia links.
But it is certainly possible. We just finished writing a custom script for someone which did exactly this.
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Thu Oct 03, 2013 8:10 am
Thank you for your replies.
I've tried HTTPRequest with several RegEx patterns.
I almost always end up with 5-20 URLs, but they are all
http://www.google.
This is my latest regex
Code: Select all
Let>HTTP_SSL=1
HTTPRequest>https://www.google.co.uk/#q=filetype:swf+dentist+london,,GET,,res
Let>Pat=[paste from http://alanstorm.com/url_regex_explained]
RegEx>Pat,res,0,match,num,,,
mdl>match
The whole pattern can't be shown here, but it's from
http://alanstorm.com/url_regex_explained
I look for the results in the debugger.
Can't figure this out.

-
Marcus Tettmar
- Site Admin
- Posts: 7395
- Joined: Thu Sep 19, 2002 3:00 pm
- Location: Dorset, UK
-
Contact:
Post
by Marcus Tettmar » Thu Oct 03, 2013 1:48 pm
Why try to scrape the Google front end? It was designed for a human to look at and has all kinds of dynamic content, adverts and so on.
Instead you should use Google's simple API, which returns plain results with less clutter.
Try this example:
Code: Select all
//location of output file
Let>out_file=%SCRIPT_DIR%\google_results.txt
//specify the search term:
Let>theSearchTerm=Nord Keyboards
//specify how many results you want:
Let>theQuantityWanted=20
//replace spaces in search term with + symbol
StringReplace>theSearchTerm,SPACE,+,theSearchTerm
DeleteFile>out_file
Let>start=1
Let>numresults=0
Label>get_results
//Uses Google's Ajax API - Simpler For Parsing - returns 8 at a time (use start=)
Let>url=http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%theSearchTerm%&rsz=large&start=%start%
HTTPRequest>url,,GET,,htmlRES
//pull out all the URLs
RegEx>(?<="url":").*?(?="),htmlRes,0,URLs,numURLS,0
//loop through all the URLs
Let>sres=0
Repeat>sres
Let>sres=sres+1
Let>this_url=URLS_%sres%
If>numresults<theQuantityWanted
WriteLn>out_file,wlres,this_url
Endif
Let>numresults=numresults+1
Until>sres=numURLS
If>numresults<theQuantityWanted
Let>start=start+8
Goto>get_results
Endif
ExecuteFile>out_file
-
Semper
- Junior Coder
- Posts: 30
- Joined: Mon Feb 25, 2008 3:28 pm
Post
by Semper » Thu Oct 03, 2013 6:21 pm
Thanks Marcus!
Just what I need
