Getting Closer...Need help with RegEx...

rjw524
Pro Scripter
Posts: 104
Joined: Wed May 09, 2012 9:45 pm
Location: Michigan

Getting Closer...Need help with RegEx...

Post by rjw524 » Tue Jan 27, 2015 9:05 am

Well, thanks to several of the moderators (JRL, Bob, et al.), other regular contributors, the archives (and of course, Marcus), I'm getting a little better every day with this.

I'm starting to play with regular expressions, and I'm seeing their power almost instantly.

I'm still having difficulty writing them, however, and I have to put together a few for a pretty large text file that I created by copying the text of a website to which I have access and pasting it into a text file.

I decided to do so after giving it a go with IEGetTags and IEGetTagsByAttrib without much success. So instead I just automated the "view source" step: copy the whole thing to the clipboard and paste it into a Notepad file.

Now, what I need to do is probably pretty simple in terms of RegEx, but it's still a bit beyond me at this point. I've listed the steps below:

STEP 1: EXTRACT THE NEEDED DATA:

I need to extract EACH OCCURRENCE of the following data. (In actual practice this text file will contain hundreds, possibly thousands, of occurrences if I paste each page of the results into one big text file. For the purposes of this post, however, I only included the snippets I'd need to parse out.)

Actual Text File String:
<span class="given-name">John Smith</span>
What I need:
John Smith

Actual Text File String:
<dd class="location first">
Greater Detroit Area
</dd>
What I Need:
Greater Detroit Area

Actual Text File String:
<dd class="industry">
Accounting
</dd>

What I Need:
Accounting

Actual Text File String:
<dd class="current">
<span class='block'>Manager of <strong>Accounting</strong> and <strong>Financial</strong> <strong>Reporting</strong> at <strong>ACME Corporation</strong> <strong>Inc</strong></span><span class='block'>Independent Consultant at BeautiControl/True to You Beauty by Sandy</span>
</dd>

What I Need:
Manager of Accounting and Financial Reporting

What I Need:
ACME Corporation Inc.

STEP 2: PLACING THESE PARSED OUT VALUES INTO A CSV OR EXCEL FILE IN A ROW BY ROW FORMAT
(IDEALLY WITHOUT HAVING TO OPEN EXCEL)

COL A, COL B, COL C, COL D, COL E
ROW 1 John Smith, Greater Detroit Area, Accounting, Mgr of Accounting and Financial Reporting, ACME Corporation Inc.
ROW 2
ROW 3
...
ROW 500

Any ideas on how to structure this in MS 13?

I'm under the gun here, somewhat, so any help would be greatly appreciated!

R.J.

(P.S. I've tried attaching the text file, but the site wouldn't let me do it.)

Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK

Re: Getting Closer...Need help with RegEx...

Post by Marcus Tettmar » Tue Jan 27, 2015 11:34 am

Hi,

1. Step through this:

Code:

Let>str1=<span class="given-name">John Smith</span>
RegEx>(?<=name">).*?(?=</span),str1,0,matches,nm,0
Let>name=matches_1
MessageModal>name

Let>str2=<dd class="location first">Greater Detroit Area</dd>
RegEx>(?<=first">).*?(?=</dd),str2,0,matches,nm,0
Let>location=matches_1
MessageModal>location

Let>str3=<dd class="industry">Accounting</dd>
RegEx>(?<=industry">).*?(?=</dd),str3,0,matches,nm,0
Let>industry=matches_1
MessageModal>industry

Let>str4=<dd class="current"><span class='block'>Manager of <strong>Accounting</strong> and <strong>Financial</strong> <strong>Reporting</strong> at <strong>ACME Corporation</strong> <strong>Inc</strong></span><span class='block'>Independent Consultant at BeautiControl/True to You Beauty by Sandy</span></dd>

RegEx>(?<=block'>).*?(?=</strong> at),str4,0,matches,nm,0
Let>title=matches_1
StringReplace>title,<strong>,,title
StringReplace>title,</strong>,,title
MessageModal>title

RegEx>(?<=at <strong>).*?(?=</strong>),str4,0,matches,nm,0
Let>cmpy=matches_1
MessageModal>cmpy

2. CSV is just plain text, so to write a new row to a CSV file just use the plain old WriteLn command:

Code:

Let>line_of_data="%name%","%location%","%industry%","%title%","%cmpy%"
WriteLn>c:\bla\mycsvfile.csv,result,line_of_data
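
One difference from the original post is worth noting. There, the location and industry text sits on its own line between the <dd> tags, whereas str2 and str3 above are single-line strings. By default . does not match a line break, so against the real file those two patterns may find nothing. Below is a minimal, untested sketch of one way around it, assuming the saved file uses Windows (CRLF) line endings. (?s) is the standard PCRE flag that lets . match line breaks, and CRLF is Macro Scheduler's built-in line-break variable.

Code:

//Assumption: the text sits on its own line between the tags, as in the original post
Let>str2=<dd class="location first">%CRLF%Greater Detroit Area%CRLF%</dd>
//(?s) lets . match line breaks, so the match can span all three lines
RegEx>(?s)(?<=first">).*?(?=</dd),str2,0,matches,nm,0
Let>location=matches_1
//Strip the line breaks that were captured along with the text
StringReplace>location,%CRLF%,,location
MessageModal>location

The same (?s) prefix and StringReplace line should carry over to the industry pattern. If the file turns out to use a different line ending, the StringReplace search value would need to change to match.
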
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar



Re: Getting Closer...Need help with RegEx...

Post by rjw524 » Tue Jan 27, 2015 3:02 pm

Marcus,

This is a great help! Thanks,

My only question is this:

I literally have hundreds if not thousands of these in an individual file all with different names, locations, current companies, etc. Will the examples you've written above help me find all of them? (I only ask because I don't see any "Let>k=0...Let k=k+1" repeat type coding.).

Once again, thanks for the huge assist here!

rjw


Re: Getting Closer...Need help with RegEx...

Post by Marcus Tettmar » Tue Jan 27, 2015 3:09 pm

I've only answered your initial questions - I've given you some example RegEx code for the patterns you specified and showed you how to write to a CSV file. I didn't write anything about looping through files.


Re: Getting Closer...Need help with RegEx...

Post by rjw524 » Tue Jan 27, 2015 4:03 pm

Marcus Tettmar wrote: I've only answered your initial questions - I've given you some example RegEx code for the patterns you specified and showed you how to write to a CSV file. I didn't write anything about looping through files.

Hi Marcus,

Ok, that's what I thought. That's good though, I'll give it a shot and see what I can do.

Thanks!


Re: Getting Closer...Need help with RegEx...

Post by rjw524 » Tue Jan 27, 2015 4:37 pm

Hi Marcus,

Ok, I'm not getting it correct. This is something I would normally play around with and study a bit more, but I'm kinda under a deadline, so I'm sorry for coming back to you so quickly.

I need to know how to code those RegEx strings above assuming the values (e.g. "John Smith", "ACME Corporation Inc.", "Manager of Accounting and Financial Reporting", "Greater Detroit Area", etc.) are variables and constantly changing.

I've tried doing some LabelToVar codes, but they're not working. (y'know, because I still suck, lol)

Can you give a quick assist?

(also, I need to loop it)

Thanks...(so ashamed)

rjw

JRL
Automation Wizard
Posts: 3532
Joined: Mon Jan 10, 2005 6:22 pm
Location: Iowa

Re: Getting Closer...Need help with RegEx...

Post by JRL » Tue Jan 27, 2015 5:33 pm

I'm pitiful with regex, but Marcus has already supplied the regex you need. The regex doesn't just work on the sample string you provided; run against the whole web page source, it finds every match of the pattern and creates an array of all of them. If there are 200 names it will find all 200 names and put them into an array. We can then access the array(s) and construct your csv file.

I modified the regex and came up with this code, but it is untested and could have typos. Step through it in the editor before running.

Hope it makes sense.


Code:

//Use the regexes against the variable that contains the source code of the entire web page
//Let's just say you called it "Str1" and that it is already in your script.




RegEx>(?<=name">).*?(?=</span),str1,0,vName,nm,0

RegEx>(?<=first">).*?(?=</dd),str1,0,vLocation,nm,0

RegEx>(?<=industry">).*?(?=</dd),str1,0,vIndustry,nm,0

RegEx>(?<=block'>).*?(?=</strong> at),str1,0,vTitle,nm,0

RegEx>(?<=at <strong>).*?(?=</strong>),str1,0,vCompany,nm,0


Let>kk=0
Repeat>kk
  Add>kk,1
  Let>Name=vName_%kk%
  Let>Location=vLocation_%kk%
  Let>Industry=vIndustry_%kk%
  Let>Title=vTitle_%kk%
    StringReplace>Title,<strong>,,Title
    StringReplace>Title,</strong>,,Title
  Let>Company=vCompany_%kk%
  WriteLn>%desktop_dir%\NewCSVFile.csv,wres,"%Name%"%comma%"%Location%"%comma%"%Industry%"%comma%"%Title%"%comma%"%Company%"
Until>kk=nm
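
A caveat on the loop above, offered as an untested sketch and not as a definitive fix. nm is overwritten by each RegEx line, so the loop runs for as many rows as the last pattern (company) matched, and the five arrays only line up if every pattern matched the same number of times. If nm is 0, kk never equals it and the loop will not stop. Giving each RegEx its own counter makes both cases visible. The names nmName and nmCompany are made up for this example, str1 is assumed to hold the page source already, and the row is cut down to two columns for brevity.

Code:

//Assumption: str1 already holds the page source, as in the script above
//nmName and nmCompany are illustrative names, one counter per pattern
RegEx>(?<=name">).*?(?=</span),str1,0,vName,nmName,0
RegEx>(?<=at <strong>).*?(?=</strong>),str1,0,vCompany,nmCompany,0

//Rows only line up if both patterns matched the same number of times
If>nmName<>nmCompany
  MessageModal>Names found: %nmName%  Companies found: %nmCompany%
Endif

//Only loop when there is at least one match, otherwise the Until condition is never true
If>nmName>0
  Let>kk=0
  Repeat>kk
    Add>kk,1
    Let>Name=vName_%kk%
    Let>Company=vCompany_%kk%
    WriteLn>%desktop_dir%\NewCSVFile.csv,wres,"%Name%"%comma%"%Company%"
  Until>kk=nmName
Endif

If the counts differ, one likely reason is a record that is missing a field (for example, a person with no current position), in which case matching record by record may be safer than matching field by field.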


Re: Getting Closer...Need help with RegEx...

Post by rjw524 » Tue Jan 27, 2015 6:12 pm

Huge Assist, JRL!

I think it's close!

I made the following change:

Since the macro has to read an entire .txt file, I just included a "ReadFile" command with the actual text file and a result variable named "str1".

I must have botched something once again, because the csv file it creates at the end is basically the entire unparsed text file in csv format. I know it's something on my end, because:

a) it's ALWAYS something on my end
and
b) Marcus' RegEx expressions did work in the one-shot examples above.

Code:

//Use the regexes against the variable that contains the source code of the entire web page
//Let's just say you called it "Str1" and that it is already in your script.

ReadFile>C:\Users\Gabby\Documents\Rashod\Macros\MS13_Extraction_Using_REGEX_Test.txt,str1

RegEx>(?<=name">).*?(?=</span),str1,0,vName,nm,0,,
RegEx>(?<=first">).*?(?=</dd),str1,0,vLocation,nm,0
RegEx>(?<=industry">).*?(?=</dd),str1,0,vIndustry,nm,0
RegEx>(?<=block'>).*?(?=</strong> at),str1,0,vTitle,nm,0
RegEx>(?<=at <strong>).*?(?=</strong>),str1,0,vCompany,nm,0

Let>kk=0
Repeat>kk
  Add>kk,1
  Let>Name=vName_%kk%
  Let>Location=vLocation_%kk%
  Let>Industry=vIndustry_%kk%
  Let>Title=vTitle_%kk%
    StringReplace>Title,<strong>,,Title
    StringReplace>Title,</strong>,,Title
  Let>Company=vCompany_%kk%
  WriteLn>C:\Users\Gabby\Documents\Rashod\Macros\JRL_RegEx_LinkedIn_Extraction_Test.csv,wres,"%Name%"%comma%"%Location%"%comma%"%Industry%"%comma%"%Title%"%comma%"%Company%"
Until>kk=nm

So where did I blow it?


Re: Getting Closer...Need help with RegEx...

Post by rjw524 » Thu Jan 29, 2015 6:04 pm

I was able to get this to work out great.

Thanks again for all the help, JRL & Marcus!

rjw
