One of my sons has a need to process websites like these:
(an easier one) https://partners.sophos.com/english/dir ... ?l=Germany
(a harder one) https://www.cloudtango.org/
to pick out the company name, address, phone number, etc. and put them into a spreadsheet.
To date I've been doing this by using Macro Scheduler (MS) to copy each page into a text file, and C# to parse the information dumped into that text file.
It doesn't have to be done like that; what's needed is website => spreadsheet.
The difficulty is that the company records on a website are not always consistent (e.g. a phone number might be missing, which breaks the regular layout), and typically more time is spent checking and accommodating the anomalies than processing the 98% of good records.
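One way to keep a missing phone number from shifting the other fields is to classify each line of a record by what it looks like rather than by its position. The following is only a minimal Python sketch of that idea, not the C# parser mentioned above. It assumes a text dump in which records are separated by blank lines and the first line of each record is the company name; `PHONE_RE`, `parse_record`, `parse_dump`, `incomplete`, `to_csv` and the sample data are all hypothetical.

```python
import csv
import io
import re

FIELDS = ["name", "address", "phone"]

# Assumption: a phone line is an optional "+" followed by at least 7
# digits, spaces, slashes, dashes or parentheses, and nothing else.
PHONE_RE = re.compile(r"^\+?[\d\s()/-]{7,}$")


def parse_record(block):
    """Classify the lines of one record by content instead of position."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    record = dict.fromkeys(FIELDS, "")
    if not lines:
        return record
    record["name"] = lines[0]
    address_parts = []
    for line in lines[1:]:
        if PHONE_RE.match(line) and not record["phone"]:
            record["phone"] = line
        else:
            address_parts.append(line)
    record["address"] = ", ".join(address_parts)
    return record


def parse_dump(text):
    """Split the dump on blank lines and parse each block as one record."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [parse_record(block) for block in blocks]


def incomplete(records):
    """Return the records that still need a manual check."""
    return [r for r in records if not all(r[f] for f in FIELDS)]


def to_csv(records):
    """Render the records as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Made-up sample: the second record has no phone number.
SAMPLE_DUMP = """Acme GmbH
Hauptstrasse 1
10115 Berlin
+49 30 1234567

Beta AG
Marktplatz 5
80331 Muenchen
"""

if __name__ == "__main__":
    records = parse_dump(SAMPLE_DUMP)
    print(to_csv(records))
    print("Needs checking:", [r["name"] for r in incomplete(records)])
```

The point of `incomplete` is that the good records go straight through and only the flagged ones need a manual look.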
I need to move on to other areas of his business, so I said I would post on the forum here to see if anyone would be interested in picking up this work. The projects generally arrive in bursts, but on average I would say one project every 2 or 3 weeks, each taking 1 to 3 days depending on the difficulty.
Preferred would be someone who could do the whole website-to-spreadsheet route, but failing that, website-to-text-file would be a step in the right direction.
Paul
Work Opportunity
- Dorian (MJT support)
Re: Work Opportunity
I just replied to your support request regarding this.
Re: Work Opportunity
Thanks Dorian. I've replied to your email.
Paul
Re: Work Opportunity
I think this has been sorted already, but it was a good training exercise, so I did some work on getting the data from the simpler website. I'm posting the script in case anybody is interested. The result can be copied from the message box and pasted into Excel.
Code:
// Needs to be adjusted with location of chromedriver.exe
Let>CHROMEDRIVER_EXE=C:\Users\Christer\Desktop\ChromeFile\chromedriver.exe
// Start session
ChromeStart>session_id
// Navigate to site
Let>URL1=https://partners.sophos.com/english/directory/search?l=Germany
ChromeNavigate>session_id,url,URL1
// Get source data
ChromeGetInfo>session_id,source,strResult
// Get all records
Let>tmp0=(?ms)plSearch\.allResults = \[\K.+(?=\}\];\RplSearch.pagination =)
RegEx>tmp0,strResult,0,m,nm,0
// Remove the JSON key names and the unwanted fields (partner URL, tier logo), leaving only the values
Let>tmp0=\{"Name":|"MailingStreet":|"MailingCityStatePostalCode":|"MailingCountry":|"Phone":|"Website":|"ViewPartnerUrl":".+?",|"TierLogoUrl":|"TierLogoName":|"/images/icons.+?",
RegEx>tmp0,m_1,0,m2,nm2,1,,strResult
// Create one company per line
Let>tmp0=\},
RegEx>tmp0,strResult,0,m3,nm3,1,CRLF,strResult
// Adj null -> empty
Let>tmp0=null
RegEx>tmp0,strResult,0,m4,nm4,1,,strResult
// Add initial/ending space to phone numbers to avoid excel formatting
Let>tmp0=(?m-s)(("[^"]+",){4})\K(?P<Phone>"[^"]+")
RegEx>tmp0,strResult,0,m5,nm5,1, $<Phone> ,strResult
// Adj \u0026 -> &
StringReplace>strResult,\u0026,&,strResult
// Adj \u0027 -> ' (U+0027 is the apostrophe)
StringReplace>strResult,\u0027,',strResult
// Close session
ChromeQuit>session_id
MDL>strResult
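For comparison, the same extraction can be sketched in Python. This is a rough illustration under stated assumptions and not a replacement for the script above: it assumes, as the script's first regular expression does, that the page source contains `plSearch.allResults = [...]` followed by `plSearch.pagination`, and it takes the field names from the script. Instead of stripping the key names with regular expressions, it parses the array as JSON, which also decodes escapes such as `\u0026` and keeps `null` fields in their own column. `extract_records`, `to_rows` and the sample page source are hypothetical.

```python
import json
import re

# Field names taken from the script above; the column order is a choice.
FIELDS = [
    "Name",
    "MailingStreet",
    "MailingCityStatePostalCode",
    "MailingCountry",
    "Phone",
    "Website",
]

# Assumption: the array ends with "];" directly before plSearch.pagination.
ARRAY_RE = re.compile(
    r"plSearch\.allResults\s*=\s*(\[.*?\]);\s*plSearch\.pagination",
    re.DOTALL,
)


def extract_records(page_source):
    """Find the embedded JSON array and parse it into a list of dicts."""
    match = ARRAY_RE.search(page_source)
    if match is None:
        raise ValueError("plSearch.allResults array not found")
    return json.loads(match.group(1))


def to_rows(records):
    """Build one tab-separated line per company; null becomes empty."""
    rows = []
    for rec in records:
        values = []
        for field in FIELDS:
            value = rec.get(field)
            values.append("" if value is None else str(value))
        rows.append("\t".join(values))
    return "\n".join(rows)


# Made-up page source in the shape the script's regex expects.
SAMPLE_SOURCE = r'''
plSearch.allResults = [{"Name":"Acme \u0026 Co GmbH","MailingStreet":"Hauptstrasse 1","MailingCityStatePostalCode":"10115 Berlin","MailingCountry":"Germany","Phone":"+49 30 1234567","Website":"https://example.com"},{"Name":"Beta AG","MailingStreet":null,"MailingCityStatePostalCode":"80331 Muenchen","MailingCountry":"Germany","Phone":null,"Website":null}];
plSearch.pagination = {};
'''

if __name__ == "__main__":
    print(to_rows(extract_records(SAMPLE_SOURCE)))
```

Because each value is looked up by key, a company with no phone number or website simply gets empty cells. Note that the sketch does not add the padding spaces around phone numbers that the script uses to stop Excel reformatting them.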
Re: Work Opportunity
There are 3 or 4 current projects that Dorian is doing. Would you be interested in any future ones? As I say, they are likely to be intermittent.
Also, I notice you're in Sweden. George will likely need a small amount of Swedish translation (of job titles). Would you be interested in that?
Paul