RegEx expression hangs MS

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Thu Aug 20, 2009 7:10 pm

I am trying to clean up HTML files with lots of text in them.

I use the following regex to find paragraph errors. Basically, paragraphs that start with a lower case letter should be appended to the line above, keeping the HTML tags. The code below works fine in other regex programs I have used but hangs in MS.

Code:

Let>vF=(?-i)(<DIV>    .*?)</DIV>\r\n<DIV>    ([a-z])
Let>vR=$1 $2
RegEx>%vF%,%r%,0,matches,matchesnum,1,%vR%,r

If I take out the (?-i) switch it runs, but then it also finds paragraphs that start with upper case, which is not what I need. Any ideas?

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Thu Aug 20, 2009 10:00 pm

Can you also provide a sample of the source text, %r%? It will help us diagnose your problem.

I would also suggest that you change %r% to something other than a single-letter variable, not just in this case but in all of your scripts. The reasons have been discussed here many times. I see that "\r" is in your Find string, and that might be a potential problem.

Make a naming convention for yourself. I usually start all my variables in Macro Scheduler with the letter "v......". For RegEx commands, for example, I usually use vHaystack, vNeedle and vReplacement to help distinguish what I am searching, what I am searching for, and the final replacement if there is one.
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Fri Aug 21, 2009 3:09 pm

Thanks for the insight, Bob.

You can use this sample for %res%

http://pastebin.com/m51fe1c9d

I have cleaned up the code a little but still have the same issue. Please use this code to debug.

Code:

'Find bad sentences
'Let>vF=(?-i)(<P>.*?)</P>\r\n<P>([a-z])
'Let>vR=$1 $2
'RegEx>%vF%,%res%,0,matches,matchesnum,1,%vR%,res

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Fri Aug 21, 2009 4:03 pm

Thanks for the "sample". Sure glad the 2651 lines were available vs. the entire data source :shock:

I used both blocks of data. Which one should the test be done on?

Could you also clarify what you are doing:
"I use the following regex to find paragraph errors. Basically, paragraphs that start with a lower case letter should be appended to the line above, keeping the HTML tags."

I am not sure what you mean by a "paragraph". Using another RegEx tool I did some quick tests....

1. I could not find any "paragraph" that started with vs. . (HTML paragraph tags)
2. I also could not find any followed by a lower case character. (HTML paragraph beginnings).
3. I did find instances with a period followed by a space and then a lower case character vs. an upper case char...here is example:
"This pursuit lasted nearly three-quarters of an hour, without the frigate gaining two yards on the cetacean. it te we should never come up with it." (Non HTML paragraphs)
I guess all of these could be interpreted as a paragraph.

I could use a really simple example of "before" and "after" text so I can properly evaluate the RegEx.

I will try to look at this over the weekend, sorry for the delay....
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Fri Aug 21, 2009 5:00 pm

OK, I shortened the sample (sorry, I just cut and pasted the whole thing before).

http://pastebin.com/m708cefc8

This one has errors in it.

So this is what I am trying to do.

When looking at the HTML in a browser you will see that some paragraphs are broken, meaning that the paragraph ends and then the next one begins mid-sentence. So I am trying to use regex to find where a paragraph begins with a lower case letter and append it to the paragraph above.


So if you had

The quick brown fox
jumped over the fence.

It would be changed to

The quick brown fox jumped over the fence.

Hope that makes sense.
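
The before/after example above can be tried outside Macro Scheduler. The following is a minimal Python illustration, not Macro Scheduler code. The function name `join_broken_lines` is made up for this sketch, and it assumes that a line starting with a lower case ASCII letter always continues the line before it.

```python
import re

def join_broken_lines(text):
    """Join any line that starts with a lower case letter onto the previous line."""
    # \r?\n accepts both Windows (CRLF) and Unix (LF) line endings.
    # The lookahead (?=[a-z]) checks the next character without consuming it.
    # Python's re module is case sensitive by default, so no (?-i) switch is needed.
    return re.sub(r"\r?\n(?=[a-z])", " ", text)

sample = "The quick brown fox\njumped over the fence."
print(join_broken_lines(sample))
```

Lines that start with an upper case letter are left alone, so ordinary paragraph breaks survive.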

Now that I am playing with it a little more and using a shorter example, I think the problem is that the regex is taking too much. I seem to be getting multiple lines rather than just two. I'm not sure why it is doing that; this is the same regex I have used in TextPad, and it works fine there on megabyte-sized files.
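
One possible explanation for the "taking too much" behavior, offered as an assumption rather than a confirmed fact about the regex engine inside Macro Scheduler: if the dot is allowed to match line breaks, the lazy `.*?` can run past the first `</P>` until it reaches a boundary that is followed by a lower case letter. The Python sketch below reproduces that effect with `re.DOTALL`. The sample text is invented for the illustration.

```python
import re

# Invented sample: only the third paragraph starts with a lower case letter.
text = (
    "<P>First paragraph.</P>\r\n"
    "<P>Second paragraph starts</P>\r\n"
    "<P>and continues here.</P>"
)

pattern = r"(<P>.*?)</P>\r\n<P>([a-z])"

# Default mode: the dot stops at \n, so group 1 stays inside one paragraph.
per_line = re.search(pattern, text)

# DOTALL mode: the dot also matches \n, so group 1 can span several paragraphs.
dot_all = re.search(pattern, text, re.DOTALL)

print(repr(per_line.group(1)))
print(repr(dot_all.group(1)))
```

In the DOTALL case the first group starts at the first `<P>` and swallows the whole first paragraph as well, which would look like "multiple lines rather than just two".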

Thanks for the help
-Joe

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Fri Aug 21, 2009 6:01 pm

I did a search in TextPad for [a-z] and j was not found in the original samples.

So, you want to find all lower case and following text thru and append it to the previous ... text.

Thanks for the clarification.....will work on it when I can.

PS..The TextPad RegEx tools are very limited, and will not cover multiple lines. I use TextPad every day.
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Fri Aug 21, 2009 9:34 pm

Don't have access to Macro Scheduler right now, so I haven't tried this against your sample yet, but I think this should work:

Search for: (?-i)\n([a-z])
Replace with: %space%\1
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Fri Aug 21, 2009 10:01 pm

That's weird. TextPad definitely does multiple lines; I do it all the time.

In TextPad

Find what: (.*)\n([a-z])
Replace with: /1 /2

Match Case X
Regular expression X

on this text http://pastebin.com/m708cefc8

It finds three items in TextPad, always selecting 2 lines, and the replace connects the two lines.

When I try to do the same in MS it just hangs.

Hope that is a little more clear?

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Fri Aug 21, 2009 10:16 pm

So I can get the code to work in MS with the same sample, but the large one hangs.

The large one has no matches so it should not even be doing anything.
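
An input with no matches is not necessarily cheap for a regex engine, because the engine may have to try every starting point before it can give up. As a hedged illustration (again assuming the dot can match line breaks, which is not confirmed for Macro Scheduler), the Python sketch below times the original pattern shape against an invented text in which no paragraph starts with a lower case letter. The helper name `time_search` and the paragraph count are arbitrary, and timings vary by machine, so none are quoted here.

```python
import re
import time

def time_search(pattern, text, flags=0):
    """Return (match, seconds) for a single re.search call."""
    start = time.perf_counter()
    match = re.search(pattern, text, flags)
    return match, time.perf_counter() - start

# Invented input: every paragraph starts with an upper case letter, so nothing matches.
text = "\r\n".join("<P>Paragraph number %d.</P>" % i for i in range(300))

lazy = r"(<P>.*?)</P>\r\n<P>([a-z])"   # original shape: lazy prefix before the boundary
short = r"</P>\r\n<P>([a-z])"          # boundary only, no open-ended prefix

# With DOTALL, each <P> start lets .*? walk to the end of the text before failing.
m1, t1 = time_search(lazy, text, re.DOTALL)
m2, t2 = time_search(short, text)

print("lazy + DOTALL: match=%s, %.4f s" % (m1, t1))
print("boundary only: match=%s, %.4f s" % (m2, t2))
```

If this is what happens in MS, it would also fit the observation that removing (?-i) makes the script run: with case ignored, nearly every paragraph boundary matches straight away, so the lazy prefix never has to travel far.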

Bob Hansen
Automation Wizard
Posts: 2475
Joined: Tue Sep 24, 2002 3:47 am
Location: Salem, New Hampshire, US

Post by Bob Hansen » Sat Aug 22, 2009 4:43 am

I don't mean to drag this out, but I am still confused about the samples you have referenced. You are showing two large blocks of data. Which one should I use? I find it easier to work with sample data first and then apply the result to the real thing.

I am going to use a simple sample like this:
This is part of line one.
This is part of line two
and this is part of line two also.
This is part of line three.
This is part of line four
and this is part of line four also.

This works in TextPad:
Search for: (.*)\n([a-z])
Replace with: \1 \2 (note reverse slash from your example)

My earlier example also works in TextPad, a little easier:
Search for: \n([a-z])
Replace with: _\1 (Underscore "_" is for space character)

Both end up with this:
I am going to use a simple sample like this:
This is part of line one.
This is part of line two and this is part of line two also.
This is part of line three.
This is part of line four and this is part of line four also.


Both expressions do multiple lines in TextPad only because we have specified \n. But it will not handle \n+ or \n*; you need to specify the exact number of lines. That is what I meant by my comment. Their WildEdit product does scan multiple lines though. Anyway, enough about TextPad. Let's get back to Macro Scheduler.....

This works fine in Macro Scheduler, based on my easier expression:

Code:

Let>vHaystack=<P>This is part of line one.</P>%CRLF%<P>This is part of line two</P>%CRLF%<P>and this is part of line two also.</P>%CRLF%<P>This is part of line three.</P>%CRLF%<P>This is part of line four</P>%CRLF%<P>and this is part of line four also.</P>
Let>vNeedle=(?-i)</P>%CRLF%<P>([a-z])
Let>vReplacement=%space%$1
RegEx>%vNeedle%,%vHaystack%,0,vMatch,vMatchNumber,1,%vReplacement%,vResult
MessageModal>%vResult%
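
For readers who want to check the same needle outside Macro Scheduler, here is a rough Python equivalent of the script above. It is only a sketch: Python's `re.sub` stands in for `RegEx>`, the string `"\r\n"` stands in for %CRLF%, and `\1` takes the place of `$1`. Python is case sensitive by default, so the (?-i) switch is left out.

```python
import re

CRLF = "\r\n"

# Same sample text as the Macro Scheduler script, one paragraph per line.
haystack = CRLF.join([
    "<P>This is part of line one.</P>",
    "<P>This is part of line two</P>",
    "<P>and this is part of line two also.</P>",
    "<P>This is part of line three.</P>",
    "<P>This is part of line four</P>",
    "<P>and this is part of line four also.</P>",
])

# Match only the boundary between two paragraphs when the second one
# starts with a lower case letter, then keep that letter after a space.
needle = r"</P>\r\n<P>([a-z])"
result = re.sub(needle, r" \1", haystack)

print(result)
```

Because the needle contains no open-ended `.*?` prefix, the engine only has to inspect each paragraph boundary once.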

So, if this is hanging in Macro Scheduler I suspect that you need to make the Needle into a file and then do a ReadFile like this:
ReadFile>vNeedle

Make sure the line numbers are not included. My example is not using "\n" because the Macro Scheduler does not handle multiple lines in the Haystack. Must be put all on one line and use %CRLF%. But if this was in a file with multiple lines, you could replace the %CRLF% with \n in the Needle.

Hopefully this has made sense to you?
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!

JBurger
Junior Coder
Posts: 33
Joined: Wed Nov 12, 2003 7:16 pm
Location: NY

Post by JBurger » Wed Aug 26, 2009 2:13 pm

I apologize for not getting back to you earlier, I did not receive an email that you had posted.

OK, I tried your code with the CRLF and it seems to work fine. I will use that going forward, although I still think that something is wrong with MS, as IMO the code I was using should work. I am just not sure why it doesn't.

Thank you so much for your help, Bob!

-Joe
