Need Advice for reading a large CSV file
Moderators: JRL, Dorian (MJT support)
Need Advice for reading a large CSV file
With MS I have used a couple of different methods to read CSV and Excel files. However, the files were relatively small, not more than 5,000 records.
This time I need to read one that has over 100,000 records, with each record containing only two values. "The classic look up a customer number and get their name."
I don't have access to the CSV file, as it is considered confidential by the owner, so I have no way of testing the large file. Any suggestions as to performance? I don't want to pick one method and have it be so wrong that it takes 30 seconds to find one record. I don't mind 3-4 seconds.
I would prefer to use CSVFileToArray, read the file once, and then have a loop that keeps accessing the array, as opposed to using DBQuery and hitting the file each time. I don't want to leave the file in Excel and use the XLGetCell function, even though the odds are 50 percent that it would not have to read all the cells. My concern with that approach is someone opening the file and then having issues with it being locked.
A couple of usage questions might help guide your ideas...
How will the entries be accessed?
- Sequential (this would be too easy)
- Random
How often will you be reading the file?
- Will you search through 4 or 5 times and then be done?
- Or are you going to read thousands of entries a day?
How frequently will you be reading the file?
- Read an entry and then 5 minutes later read another entry?
- Or read an entry, process it, and then read the next entry milliseconds later?
How often will the file be updated?
- Will you get a copy and work from that for the whole day?
- Or will updates be live and you are working from the updated list?
Is the end user's equipment reasonably robust, or are they stuck with Windows XP and only 1 GB of RAM?
Speed of reading file
These are patient numbers, and they will be accessed at random. They may be sequential in the CSV file, but they will not be retrieved sequentially.
This is the difficult part - how often it will be run - as a decision has not been made. I would suggest that they run it at the end of the day, and then it does not matter. However, they may be running it every 5 minutes, doing 10 lookups, then nothing for an hour, and then 5, 10, 30 or more lookups.
So they would read an entry, process it once found (a couple of seconds), read another, etc., five or ten times, and then nothing until the script is run again, which could be minutes, hours or days later.
It will make more sense when I explain what is going on. Files are being scanned and named with a patient number and document type, using a cover page that contains barcodes. This utility is to read the patient number, pull the patient name out of the CSV file, and rename and move the file.
Originally it was not a big deal, as they planned on exporting the patient names and numbers to an Excel spreadsheet for the next week's appointments. To create the barcodes they need to export this list anyway (so they should have a small list). Then they would scan the old files with the barcodes and import them into their new system after the name was added. (I won't bother you with all the reasons for not putting the name in the barcode.)
So if they had 1,000 appointments, the list would only contain 1,000 records. However, they would prefer to automate as much as possible, and so use the entire list, which means the 100,000 or so old patient files.
They change their mind on the workflow all the time. I would not do it this way to begin with. I would create a utility where, when an appointment is made, the user enters the name and number, the cover pages are printed, and the CSV file is updated; the user could then just take the file to the copier and scan it. Then, once it is processed, the line would be deleted from the CSV file.
However, at this point that is not the case. I have written the script using CSVFileToArray and ran it with a test file containing 100,000 lines; it takes about 40 seconds to locate a record on an old 32-bit machine running Vista with 2 GB of memory.
I was curious whether, if I used DBQuery, it would take 2 seconds or something else dramatic.
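For illustration, here is a minimal sketch of that kind of linear scan - not the actual script, and assuming a hypothetical two-column file C:\patients.csv (patient number, name) and that CSVFileToArray> fills in recs_row_col variables plus a recs_count row count. Scanning 100,000 rows this way is a full pass per lookup, which is consistent with the 40-second timing above.
Code:
//Sketch of a linear-scan lookup over a CSVFileToArray> result.
//Assumes recs_<row>_<col> elements and recs_count as the row count.
CSVFileToArray>C:\patients.csv,recs
Let>target=100659
Let>found_name=
Let>row=0
Repeat>row
  Let>row=row+1
  //Compare column 1 of the current row against the target number
  If>recs_%row%_1=%target%
    Let>found_name=recs_%row%_2
    //Jump the counter to the end so the loop terminates early
    Let>row=%recs_count%
  Endif
Until>row,%recs_count%
//found_name stays empty if no row matched
MDL>Patient name: %found_name%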
Hi kpassaur,
40 seconds huh... ouch. Here are some thoughts for faster record retrieval:
- Read all the data into a single variable using ReadFile>
- Then use a RegEx> command to locate the record you want by matching the patient number, and also match all non-CRLF chars after it, which gets you everything up to the end of the line, i.e. the entire patient record.
Hmm... Just for fun I simulated this. I built a file called patient_records.csv
The file has 100,000 lines (records), and all lines are the same except for the second-to-last line, which is the target patient we are trying to look up.
Code:
100000,John Doe,123 Main Street,Any Town,Country,comment
100000,John Doe,123 Main Street,Any Town,Country,comment
100000,John Doe,123 Main Street,Any Town,Country,comment
100659,Target Patient,40 seconds is way too long Street,Any Town,Country,yes we can
100000,John Doe,123 Main Street,Any Town,Country,comment
Here is the code which illustrates the technique:
Code:
ReadFile>C:\patient_records.csv,patient_records
Label>next_lookup
Input>patient_number,Records Search - Enter six digit patient number or Q to Quit:,100659
Trim>patient_number,patient_number
UpperCase>patient_number,patient_number
If>patient_number=Q,end
Length>patient_number,patient_number_length
If>patient_number_length=6,length_ok
//If we clicked Cancel, length will be 0 so end
If>patient_number_length=0,end
MDL>Please enter a SIX digit patient number...
Goto>next_lookup
Label>length_ok
//Now verify the six chars are all digits
Let>pattern=[0-9]{6,6}
RegEx>pattern,patient_number,0,matches,num,0
If>num=1,yes_six_digits
MDL>Numbers Only - enter a SIX digit patient number please...
Goto>next_lookup
Label>yes_six_digits
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num=1,yes_one_matching_patient_record_found
If>num=0
MDL>No record matching patient number %patient_number% could be found!
Else
MDL>%num% separate records matching patient number %patient_number% were found! Is this a problem?
EndIf
Goto>next_lookup
Label>yes_one_matching_patient_record_found
MDL>Patient Record:%CRLF%%CRLF%%matches_1%
Goto>next_lookup
Label>end
As for performance, searching for patient 100659 came back instantly for me... well under a second. I would be curious how it would perform on your target machine.
For fun, search for patient number 100000 and it will bring an "error condition" to your attention... multiple matching records.
I hope you found this useful... take care.
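One refinement worth considering, as a sketch only: the pattern above will match the patient number anywhere in a line, so a six-digit number that happens to appear inside an address or comment field would also count as a match. Assuming the patient number is always the first CSV field and that the RegEx> engine supports the (?m) multiline modifier, the match can be anchored to the start of a line:
Code:
//Hedged variation: (?m) makes ^ match at the start of every line, and the
//trailing comma requires the number to be the entire first field
Let>pattern=(?m)^%patient_number%,[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num>0
  MDL>First-field match: %matches_1%
Else
  MDL>No record found for patient number %patient_number%
Endif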
Searching CSV
Thanks, I will give it a go. I still have to figure out the logic you used to get the line.
Just to finish out your request, here's a sample that uses DBQuery. Using jpuziano's format, I constructed a fake 100,000-line CSV file with one line of usable data.
The format of the CSV file is important. If the patient number in the CSV has quotes around it, use the remarked (commented-out) line instead. Also note that the CSV file must have headers. The first line in the fake file I created is:
PATIENT_NO,NAME,ADDRESS,CITY,LOCATION,NOTES
Code:
Let>PatientNo=100659
Let>FileLocationPath=C:\
Let>CSVFileName=KeithTest.csv
Timer>StartTime
GoSub>GetDataFromCSV
Timer>EndTime
Let>TotalTime={(%EndTime%-%StartTime%)/1000}
MDL>Time to Process = %TotalTime%%crlf%Patient Name = %Res_1_Name%
SRT>GetDataFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Let>sql=select * from %CSVFileName% where PATIENT_NO = %PatientNo%
//Let>sql=select * from %CSVFileName% where PATIENT_NO = '%PatientNo%'
DBQuery>dbH,sql,res,numrecords,numfields,1
DBClose>dbH
END>GetDataFromCSV
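For the stop-and-go lookup pattern described earlier (a handful of lookups, then nothing for an hour), one tweak worth trying - sketched here under the assumption that much of the per-lookup cost is connection setup rather than the SELECT itself - is to connect once and reuse the handle for every lookup, closing it only when the run is finished:
Code:
//Sketch: connect once, then run many lookups against the same handle.
//Same fake file and connection string as the sample above.
Let>FileLocationPath=C:\
Let>CSVFileName=KeithTest.csv
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Label>lookup_loop
Input>PatientNo,Enter patient number or Q to quit:,100659
If>PatientNo=Q,done
Let>sql=select * from %CSVFileName% where PATIENT_NO = %PatientNo%
DBQuery>dbH,sql,res,numrecords,numfields,1
If>numrecords=1
  MDL>Patient Name = %Res_1_Name%
Else
  MDL>%numrecords% record(s) found for patient number %PatientNo%
Endif
Goto>lookup_loop
Label>done
DBClose>dbH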
Running this outside the editor I was consistently hitting 0.4 seconds; inside the editor, 0.6 seconds.
I added a timer to jpuziano's script and ran it also. I got inconsistent times, but always under 0.1 seconds, several times as little as 0.03 seconds.
Here is jpuziano's script with the timer added:
Code:
ReadFile>C:\Keithtest.csv,patient_records
Label>next_lookup
Input>patient_number,Records Search - Enter six digit patient number or Q to Quit:,100659
Timer>StartTime
Trim>patient_number,patient_number
UpperCase>patient_number,patient_number
If>patient_number=Q,end
Length>patient_number,patient_number_length
If>patient_number_length=6,length_ok
//If we clicked Cancel, length will be 0 so end
If>patient_number_length=0,end
MDL>Please enter a SIX digit patient number...
Goto>next_lookup
Label>length_ok
//Now verify the six chars are all digits
Let>pattern=[0-9]{6,6}
RegEx>pattern,patient_number,0,matches,num,0
If>num=1,yes_six_digits
MDL>Numbers Only - enter a SIX digit patient number please...
Goto>next_lookup
Label>yes_six_digits
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num=1,yes_one_matching_patient_record_found
If>num=0
MDL>No record matching patient number %patient_number% could be found!
Else
MDL>%num% separate records matching patient number %patient_number% were found! Is this a problem?
EndIf
Goto>next_lookup
Label>yes_one_matching_patient_record_found
Timer>EndTime
Let>TotalTime={(%EndTime%-%StartTime%)/1000}
MDL>Time to Process = %TotalTime%%crlf%Patient Record:%CRLF%%CRLF%%matches_1%
Goto>next_lookup
Label>end
kpassaur wrote: Great, looks like DBQuery is the way to go!
Yes... no... maybe... re-read my post. jpuziano's method is at least 5 times faster than DBQuery.
DBQuery is fast too. To a user there's not really much difference between a tenth of a second and half a second. Of course, you can use whichever method you're most comfortable with.
Just want to make sure you understand that DBQuery is not as fast as RegEx in this instance.
Re: DBQuery
Hi kpassaur,
kpassaur wrote: Great, looks like DBQuery is the way to go!
Not sure why you'd conclude that... because according to JRL's timings, my method is from 4 to 20 times faster than DBQuery.
Also, I just now saw JRL's second post... Yes, a few tenths of a second may not matter, but who knows what would happen with a much larger recordset. It is good to know about both methods.
Anyway, both methods will work. Thanks JRL for posting the DBQuery method and timing data.
Below is a reply I composed earlier (before JRL's posts) that explains my method in a bit more detail.
This was an interesting example kpassaur, thanks for posting!
Hi kpassaur,
kpassaur wrote: I still have to figure out the logic you used to get the line.
Retrieving the line (patient record) is done in the following script lines:
Code:
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
- Once the following line runs...
Let>pattern=%patient_number%[^\x0D\x0A]*
...the value of the pattern variable will be:
100659[^\x0D\x0A]*
- for those not familiar with RegEx (Regular Expressions), here is a breakdown of the pattern 100659[^\x0D\x0A]*:
- 100659 matches that number exactly
- the square brackets [ ] define a character class
- when the first char in the character class is ^ it negates the class... meaning we want to match all chars EXCEPT the ones in the character class
- \x0D represents a Carriage Return char using its hex value
- \x0A represents a Linefeed char using its hex value
- so the character class [^\x0D\x0A] matches all chars except CR or LF
- the * at the end is for repetition; it modifies the expression before it (our character class) and means match "zero or more times", so we will be matching as many non-CR and non-LF chars as possible
So again, the start of this pattern matches the patient number entered, and the rest of the pattern matches all chars to the end of the line, not including CR or LF chars... the line we want will end up in variable matches_1.
Note the following line:
Let>matches_1=
The above line isn't strictly required in this example. However, I found out the hard way that when you use RegEx> multiple times to search for various things and one of them doesn't find a match, matches_1 will still contain whatever it held from an earlier match. That threw me for a loop once, so as good practice I always use a Let> statement to set it to nothing before running a RegEx> line.
Also, it is good practice to check the value of the "number of matches" variable (I used num above) after running the RegEx> line, as it will ALWAYS tell you whether you found one or more matches... rather than looking at matches_1 right after the match.
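As a quick self-contained test of both points, with two lines of made-up sample data standing in for the file contents:
Code:
//Two-record sample "file" held in a variable; CRLF is the built-in newline
Let>sample=100001,Alice Smith%CRLF%100659,Bob Jones
Let>pattern=100659[^\x0D\x0A]*
//Clear matches_1 first so a stale value from an earlier match can't fool us
Let>matches_1=
RegEx>pattern,sample,0,matches,num,0
//num reports how many matches were found; matches_1 holds the first one
MDL>Matches found: %num%%CRLF%First match: %matches_1%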
Note that if you wanted to, you could:
- take the line you found
- break it up into separate field values
- display those field values in a dialog
- allow the user to add, edit or delete the values
- allow the user to click a button to update the patient_records variable and then write that back to your csv file on disk
You could just use StringReplace> to replace the old value of the line (now stored in matches_1) with the new value you would like it to be, built up from the values in the fields of the dialog after the user has finished editing the record.
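A minimal sketch of that write-back idea, assuming a hypothetical variable new_line already holds the edited record built from the dialog fields:
Code:
//Swap the edited line into the in-memory copy of the file
StringReplace>patient_records,matches_1,new_line,patient_records
//WriteLn> appends, so remove the old file before writing the updated contents
DeleteFile>C:\patient_records.csv
WriteLn>C:\patient_records.csv,wres,patient_records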
Anyway... I am curious. Was this or JRL's method any faster on your target machines?
Please let us know and take care.
jpuziano wrote: Yes a few tenths of a second may not matter but who knows what would happen with a much larger recordset. It is good to know about both methods.
Results with 100k and 1 million patients:
Total Patients: 100,000
- JP's method: Time to Process = 0.037 seconds
- JRL's method: Time to Process = 0.45 seconds
Total Patients: 1,000,000
- JP's method: Time to Process = 0.363 seconds
- JRL's method: Time to Process = 4.18 seconds
Re: DBQuery
Rain wrote: Results with 100k and 1 million patients...
Vindicated... Thanks Rain!