Removing the first character in a very large file

Post by terencepjf » Thu Nov 21, 2013 12:58 pm

Hi; I have an MS script that dumps contents from an Oracle database via sqlplus, and for some reason it puts a funky character at the start of the file that makes the contents unreadable.

-I042,I042:I(A),AMPS,IA,4,07/21/2013 06:00 AM,88.812897
I042,I042:I(B),AMPS,IB,4,07/21/2013 06:00 AM,75.329751
(The - is really an extended ASCII character, code 254 or something like it.)

I subsequently extended the script to read the file, strip out this char and write the result to a new file. This works if the file is small, but it gets inefficient as the data grows!

How can I just delete/replace this ONE char without all the read/write overhead?

Thanks
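
Before stripping anything, it can help to confirm what the stray character actually is. A minimal sketch in Python (not the language of the original script; the helper name `leading_bytes` is made up for illustration) that returns the numeric values of the first few bytes of a file:

```python
def leading_bytes(path, count=4):
    """Return the first `count` bytes of the file as a list of integers."""
    # Binary mode, so no decoding is applied to the stray character.
    with open(path, "rb") as handle:
        return list(handle.read(count))
```

Calling `leading_bytes` on the dump file and seeing 254 as the first value would confirm the guess above; any other value means the cleanup should target that byte instead.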

Post by Rain » Thu Nov 21, 2013 3:18 pm

Have you tried Windows PowerShell? I've tested the script below with 100K and 1 million lines:
100K lines took roughly 18 seconds.
1 million lines took roughly 174 seconds.

Code:

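Timer>StartTimer
Let>InputFile=%DESKTOP_DIR%\temp.txt
Let>OutputFile=%DESKTOP_DIR%\out.txt
Let>RP_WINDOWMODE=0
// Wait for the command to finish before stopping the timer
Let>RP_WAIT=1
// Note: -replace '-', '' strips every "-" on every line, not only the first character of the file
Run>powershell.exe Get-Content %InputFile% | ForEach-Object {$_ -replace '-', ''} | Set-Content %OutputFile%
Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed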

Maybe someone has a faster solution.
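
The one-liner above passes every line through a replace and removes every "-" it finds. If only the very first byte has to go, the rest of the file can be copied untouched in large chunks. A rough Python sketch of that idea, assuming the stray character is exactly one byte (as with code 254 in a single-byte encoding); `drop_first_byte` is a hypothetical helper name, not something from the posts:

```python
import shutil


def drop_first_byte(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src to dst without the first byte; return the byte that was skipped."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Consume the single leading byte so the copy starts at offset 1.
        skipped = src.read(1)
        # Stream the remainder in chunks; the file is never loaded whole.
        shutil.copyfileobj(src, dst, chunk_size)
    return skipped
```

This still writes a new file, so it does not avoid the read/write entirely, but it does no per-line work and leaves every other "-" in the data alone.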

Post by terencepjf » Thu Nov 21, 2013 4:46 pm

Thanks Rain; I've steered clear of PowerShell as we are still on XP Pro, but it's time to dive right in based on this need.

Post by JRL » Thu Nov 21, 2013 8:23 pm

This took 52 seconds on a file of just over a million lines (1075076). It could be a better method or just a difference between computers. I can't run Rain's because I can't find powershell.exe.

It reads the first line of the file, uses MidStr> to remove the first character, and writes that line to a new output file. It then uses DOS "type | find" to append the rest of the input file to the output file.

Code:

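Timer>StartTimer
Let>InputFile=%DESKTOP_DIR%\temp.txt
Let>OutputFile=%DESKTOP_DIR%\out.txt

// Read line 1, drop its first character, write it as the first line of the output file
ReadLn>InputFile,1,res
MidStr>res,2,999999,res
WriteLn>OutputFile,wres,res

Let>RP_Windowmode=0
Let>RP_Wait=1

// Append every input line that does not contain the cleaned first line.
// Caveat: find /v also drops any later line that contains that same text.
RunProgram>cmd /c type "%InputFile%" | find /v "%res%" >> "%OutputFile%"

Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed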
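
For comparison, the same "fix the first line, then append the rest" idea can be sketched in Python without filtering on line content, so a later line that happens to match the first one is kept. This is only an illustration: the function name is hypothetical, and the `latin-1` default is an assumption chosen because it maps every byte to exactly one character.

```python
def strip_first_char_of_first_line(src_path, dst_path, encoding="latin-1"):
    """Write src to dst with the first character of the first line removed."""
    # newline="" keeps the original line endings on both read and write.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        first = src.readline()
        dst.write(first[1:])  # the first line minus its leading character
        for line in src:      # every remaining line is copied unchanged
            dst.write(line)
```

Lines are streamed one at a time, so memory use stays flat however large the dump gets.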

Post by terencepjf » Tue Nov 26, 2013 7:05 pm

Thanks to all for the suggestions - I got PowerShell to work!

Post by hagchr » Wed Nov 27, 2013 1:38 pm

Hi, I was curious to see if one could use RegEx to solve it. I'm not sure whether there is an upper limit once the file gets much larger, but for one million lines it completes in around 1 second.

Code:

Let>InputFile=C:\Users\Christer\Documents\testfile.txt
Let>OutputFile=C:\Users\Christer\Documents\resfile.txt

Timer>StartTimer

// Load the whole file into memory
ReadFile>InputFile,strInput
// (?s) lets . match line breaks; the lookbehind starts the match right after the first "-"
RegEx>(?s)(?<=-).+,strInput,0,Matches,NumMatches,0,,
WriteLn>OutputFile,nWLNRes,Matches_1

Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed
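
One thing to watch with the pattern above: it keeps everything after the first "-" wherever that occurs, so a file that does not begin with the marker would lose everything up to its first "-". A small Python sketch of an anchored variant that only touches a marker at the very start of the text; the names are made up for illustration, and "-" stands in for the real stray character as in the original post:

```python
import re

# \A matches only at the very start of the text, so nothing else is touched.
LEADING_MARKER = re.compile(r"\A-")


def strip_leading_marker(text):
    """Remove a single leading "-" if present; otherwise return text unchanged."""
    return LEADING_MARKER.sub("", text, count=1)
```

Like the RegEx version, this works on the whole file contents held in memory, so the same question about very large files applies.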
