Need Advice for reading a large CSV file
Moderators: JRL, Dorian (MJT support)
Need Advice for reading a large CSV file
With MS I have used a couple of different methods to read CSV and Excel files. However, the files were relatively small, not more than 5,000 records.
This time I need to read one that has over 100,000 records, with each record containing only two values. "The classic look up a customer number and get their name."
I don't have access to the CSV file, as it is considered confidential by the owner, so I have no way of testing the large file. Any suggestions as to performance? I don't want to pick one method and have it be so wrong that it takes 30 seconds to find one record. I don't mind 3-4 seconds.
I would prefer to use CSVFileToArray, read the file once, and then have a loop that keeps accessing the array, as opposed to using DBQuery and hitting the file each time. I don't want to leave the file in Excel and use the XLGetCell function, even though the odds are 50 percent that it would not have to read all the cells. My concern with that approach is someone opening the file and then having issues with it being locked.
A couple of usage questions might help guide your ideas...
How will the entries be accessed?
- Sequential (this would be too easy)
- Random
How often will you be reading the file?
- Will you search through 4 or 5 times and then be done?
- Or are you going to read thousands of entries a day?
How frequently will you be reading the file?
- Read an entry and then 5 minutes later read another entry?
- Or read an entry, process it, and then read the next entry milliseconds later?
How often will the file be updated?
- Will you get a copy and work from that for the whole day?
- Or will updates be live and you are working from the updated list?
Is the end user's equipment reasonably robust, or are they stuck with Windows XP and only 1 GB of RAM?
Speed of reading file
These are patient numbers, and they will be accessed at random. They may be sequential in the CSV file, but they will not be retrieved sequentially.
This is the difficult part - how often it will be run - as a decision has not been made. I would suggest that they run it at the end of the day, and then it does not matter. However, they may be running it every 5 minutes, doing 10 lookups, then nothing for an hour, and then 5, 10, 30 or more lookups.
So they would read an entry, process it once found (a couple of seconds), read another, etc., five or ten times, and then nothing until the script is run again, which could be minutes, hours or days later.
It will make more sense when I explain what is going on. Files are being scanned and named with a patient number and document type, using a cover page that contains barcodes. This utility is to read the patient number, pull the patient name out of the CSV file, and rename and move the file.
Originally it was not a big deal, as they planned on exporting the patient names and numbers to an Excel spreadsheet for the next week's appointments. To create the barcodes they need to export this list anyway (so they should have a small list). Then they would scan the old files with the barcodes and import them into their new system after the name was added. (I won't bother you with all the reasons for not putting the name in the barcode.)
So if they had 1,000 appointments, the list would only contain 1,000 records. However, they would prefer to automate as much as possible, and so use the entire list, which means the 100,000 or so old patient files.
They change their mind on the workflow all the time. I would not do it this way to begin with. I would create a utility where, when an appointment is made, the user enters the name and number, the cover pages are printed, and the CSV file is updated; the user could then just take the file to the copier and scan it. Then, once it is processed, the line would be deleted from the CSV file.
However, at this point that is not the case. I have written the script using CSVFileToArray and ran it with a test file containing 100,000 lines; it takes about 40 seconds to locate a record on an old 32-bit machine running Vista with 2 GB of memory.
I was curious whether, if I used DBQuery, it would take 2 seconds or something else dramatic.
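For illustration, here is a minimal sketch of that kind of linear scan - not the actual script, and assuming a hypothetical two-column file C:\patients.csv (patient number, name) and that CSVFileToArray> fills in recs_row_col variables plus a recs_count row count. Scanning 100,000 rows this way is a full pass per lookup, which is consistent with the 40-second timing above.
Code:
//Sketch of a linear-scan lookup over a CSVFileToArray> result.
//Assumes recs_<row>_<col> elements and recs_count as the row count.
CSVFileToArray>C:\patients.csv,recs
Let>target=100659
Let>found_name=
Let>row=0
Repeat>row
  Let>row=row+1
  //Compare column 1 of the current row against the target number
  If>recs_%row%_1=%target%
    Let>found_name=recs_%row%_2
    //Jump the counter to the end so the loop terminates early
    Let>row=%recs_count%
  Endif
Until>row,%recs_count%
//found_name stays empty if no row matched
MDL>Patient name: %found_name%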
Hi kpassaur,
40 seconds huh... ouch. Here are some thoughts for faster record retrieval:
- Read all the data into a single variable using ReadFile>
- Then use a RegEx> command to locate the record you want by matching the patient number, and also match all non-CRLF chars after it, which gets you everything up to the end of the line, i.e. the entire patient record.
Hmm... Just for fun I simulated this. I built a file called patient_records.csv
The file has 100,000 lines (records), and all lines are the same except for the second-to-last line, which is the target patient we are trying to look up.
Code:
100000,John Doe,123 Main Street,Any Town,Country,comment
100000,John Doe,123 Main Street,Any Town,Country,comment
100000,John Doe,123 Main Street,Any Town,Country,comment
100659,Target Patient,40 seconds is way too long Street,Any Town,Country,yes we can
100000,John Doe,123 Main Street,Any Town,Country,comment
Here is the code which illustrates the technique:
Code:
ReadFile>C:\patient_records.csv,patient_records
Label>next_lookup
Input>patient_number,Records Search - Enter six digit patient number or Q to Quit:,100659
Trim>patient_number,patient_number
UpperCase>patient_number,patient_number
If>patient_number=Q,end
Length>patient_number,patient_number_length
If>patient_number_length=6,length_ok
//If we clicked Cancel, length will be 0 so end
If>patient_number_length=0,end
MDL>Please enter a SIX digit patient number...
Goto>next_lookup
Label>length_ok
//Now verify the six chars are all digits
Let>pattern=[0-9]{6,6}
RegEx>pattern,patient_number,0,matches,num,0
If>num=1,yes_six_digits
MDL>Numbers Only - enter a SIX digit patient number please...
Goto>next_lookup
Label>yes_six_digits
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num=1,yes_one_matching_patient_record_found
If>num=0
MDL>No record matching patient number %patient_number% could be found!
Else
MDL>%num% separate records matching patient number %patient_number% were found! Is this a problem?
EndIf
Goto>next_lookup
Label>yes_one_matching_patient_record_found
MDL>Patient Record:%CRLF%%CRLF%%matches_1%
Goto>next_lookup
Label>end
As for performance, searching for patient 100659 came back instantly for me... well under a second. I would be curious how it would perform on your target machine.
For fun, search for patient number 100000 and it will bring an "error condition" to your attention... multiple matching records.
I hope you found this useful... take care.
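One refinement worth considering, as a sketch only: the pattern above will match the patient number anywhere in a line, so a six-digit number that happens to appear inside an address or comment field would also count as a match. Assuming the patient number is always the first CSV field and that the RegEx> engine supports the (?m) multiline modifier, the match can be anchored to the start of a line:
Code:
//Hedged variation: (?m) makes ^ match at the start of every line, and the
//trailing comma requires the number to be the entire first field
Let>pattern=(?m)^%patient_number%,[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num>0
  MDL>First-field match: %matches_1%
Else
  MDL>No record found for patient number %patient_number%
Endif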
Searching CSV
Thanks, I will give it a go. I still have to figure out the logic you used to get the line.
Just to finish out your request, here's a sample that uses DBQuery. Using jpuziano's format, I constructed a fake 100,000-line CSV file with one line of usable data.
The format of the CSV file is important. If the patient number in the CSV has quotes around it, use the remarked (commented-out) line instead. Also note that the CSV file must have headers. The first line in the fake file I created is:
PATIENT_NO,NAME,ADDRESS,CITY,LOCATION,NOTES
Code:
Let>PatientNo=100659
Let>FileLocationPath=C:\
Let>CSVFileName=KeithTest.csv
Timer>StartTime
GoSub>GetDataFromCSV
Timer>EndTime
Let>TotalTime={(%EndTime%-%StartTime%)/1000}
MDL>Time to Process = %TotalTime%%crlf%Patient Name = %Res_1_Name%
SRT>GetDataFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Let>sql=select * from %CSVFileName% where PATIENT_NO = %PatientNo%
//Let>sql=select * from %CSVFileName% where PATIENT_NO = '%PatientNo%'
DBQuery>dbH,sql,res,numrecords,numfields,1
DBClose>dbH
END>GetDataFromCSV
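For the stop-and-go lookup pattern described earlier (a handful of lookups, then nothing for an hour), one tweak worth trying - sketched here under the assumption that much of the per-lookup cost is connection setup rather than the SELECT itself - is to connect once and reuse the handle for every lookup, closing it only when the run is finished:
Code:
//Sketch: connect once, then run many lookups against the same handle.
//Same fake file and connection string as the sample above.
Let>FileLocationPath=C:\
Let>CSVFileName=KeithTest.csv
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Label>lookup_loop
Input>PatientNo,Enter patient number or Q to quit:,100659
If>PatientNo=Q,done
Let>sql=select * from %CSVFileName% where PATIENT_NO = %PatientNo%
DBQuery>dbH,sql,res,numrecords,numfields,1
If>numrecords=1
  MDL>Patient Name = %Res_1_Name%
Else
  MDL>%numrecords% record(s) found for patient number %PatientNo%
Endif
Goto>lookup_loop
Label>done
DBClose>dbH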
Running this outside the editor I was consistently hitting 0.4 seconds; inside the editor, 0.6 seconds.
I added a timer to jpuziano's script and ran it also. I got inconsistent times, but always under 0.1 seconds, several times as little as 0.03 seconds.
Here is jpuziano's script with the timer added:
Code:
ReadFile>C:\Keithtest.csv,patient_records
Label>next_lookup
Input>patient_number,Records Search - Enter six digit patient number or Q to Quit:,100659
Timer>StartTime
Trim>patient_number,patient_number
UpperCase>patient_number,patient_number
If>patient_number=Q,end
Length>patient_number,patient_number_length
If>patient_number_length=6,length_ok
//If we clicked Cancel, length will be 0 so end
If>patient_number_length=0,end
MDL>Please enter a SIX digit patient number...
Goto>next_lookup
Label>length_ok
//Now verify the six chars are all digits
Let>pattern=[0-9]{6,6}
RegEx>pattern,patient_number,0,matches,num,0
If>num=1,yes_six_digits
MDL>Numbers Only - enter a SIX digit patient number please...
Goto>next_lookup
Label>yes_six_digits
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
If>num=1,yes_one_matching_patient_record_found
If>num=0
MDL>No record matching patient number %patient_number% could be found!
Else
MDL>%num% separate records matching patient number %patient_number% were found! Is this a problem?
EndIf
Goto>next_lookup
Label>yes_one_matching_patient_record_found
Timer>EndTime
Let>TotalTime={(%EndTime%-%StartTime%)/1000}
MDL>Time to Process = %TotalTime%%crlf%Patient Record:%CRLF%%CRLF%%matches_1%
Goto>next_lookup
Label>end
kpassaur wrote: Great, looks like DBQuery is the way to go!
Yes... no... maybe... re-read my post. jpuziano's method is at least 5 times faster than DBQuery.
DBQuery is fast too. To a user there's not really much difference between a tenth of a second and half a second. Of course, you can use whichever method you're most comfortable with.
Just want to make sure you understand that DBQuery is not as fast as RegEx in this instance.
Re: DBQuery
Hi kpassaur,
kpassaur wrote: Great, looks like DBQuery is the way to go!
Not sure why you'd conclude that... because according to JRL's timings, my method is from 4 to 20 times faster than DBQuery.
Also, I just now saw JRL's second post... Yes, a few tenths of a second may not matter, but who knows what would happen with a much larger recordset. It is good to know about both methods.
Anyway, both methods will work. Thanks JRL for posting the DBQuery method and timing data.
Below is a reply I composed earlier (before JRL's posts) that explains my method in a bit more detail.
This was an interesting example kpassaur, thanks for posting!
Hi kpassaur,
kpassaur wrote: I still have to figure out the logic you used to get the line.
Retrieving the line (patient record) is done in the following script lines:
Code:
//Perform the lookup
Let>pattern=%patient_number%[^\x0D\x0A]*
Let>matches_1=
RegEx>pattern,patient_records,0,matches,num,0
- Once the following line runs...
Let>pattern=%patient_number%[^\x0D\x0A]*
...the value of the pattern variable will be:
100659[^\x0D\x0A]*
- for those not familiar with RegEx (Regular Expressions), here is a breakdown of the pattern 100659[^\x0D\x0A]*:
- 100659 matches that number exactly
- the square brackets [ ] define a character class
- when the first char in the character class is ^ it negates the class... meaning we want to match all chars EXCEPT the ones in the character class
- \x0D represents a Carriage Return char using its hex value
- \x0A represents a Linefeed char using its hex value
- so the character class [^\x0D\x0A] matches all chars except CR or LF
- the * at the end is for repetition; it modifies the expression before it (our character class) and means match "zero or more times", so we will be matching as many non-CR and non-LF chars as possible
So again, the start of this pattern matches the patient number entered, and the rest of the pattern matches all chars to the end of the line, not including CR or LF chars... the line we want will end up in variable matches_1.
Note the following line:
Let>matches_1=
The above line isn't strictly required in this example. However, I found out the hard way that when you use RegEx> multiple times to search for various things and one of them doesn't find a match, matches_1 will still contain whatever it held from an earlier match. That threw me for a loop once, so as good practice I always use a Let> statement to set it to nothing before running a RegEx> line.
Also, it is good practice to check the value of the "number of matches" variable (I used num above) after running the RegEx> line, as it will ALWAYS tell you whether you found one or more matches... rather than looking at matches_1 right after the match.
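As a quick self-contained test of both points, with two lines of made-up sample data standing in for the file contents:
Code:
//Two-record sample "file" held in a variable; CRLF is the built-in newline
Let>sample=100001,Alice Smith%CRLF%100659,Bob Jones
Let>pattern=100659[^\x0D\x0A]*
//Clear matches_1 first so a stale value from an earlier match can't fool us
Let>matches_1=
RegEx>pattern,sample,0,matches,num,0
//num reports how many matches were found; matches_1 holds the first one
MDL>Matches found: %num%%CRLF%First match: %matches_1%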
Note that if you wanted to, you could:
- take the line you found
- break it up into separate field values
- display those field values in a dialog
- allow the user to add, edit or delete the values
- allow the user to click a button to update the patient_records variable and then write that back to your csv file on disk
You could just use StringReplace> to replace the old value of the line (now stored in matches_1) with the new value you would like it to be, built up from the values in the fields of the dialog after the user has finished editing the record.
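A minimal sketch of that write-back idea, assuming a hypothetical variable new_line already holds the edited record built from the dialog fields:
Code:
//Swap the edited line into the in-memory copy of the file
StringReplace>patient_records,matches_1,new_line,patient_records
//WriteLn> appends, so remove the old file before writing the updated contents
DeleteFile>C:\patient_records.csv
WriteLn>C:\patient_records.csv,wres,patient_records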
Anyway... I am curious. Was this or JRL's method any faster on your target machines?
Please let us know and take care.
jpuziano wrote: Yes a few tenths of a second may not matter but who knows what would happen with a much larger recordset. It is good to know about both methods.
Results with 100k and 1 million patients:
Total Patients: 100,000
- JP's method: Time to Process = 0.037 seconds
- JRL's method: Time to Process = 0.45 seconds
Total Patients: 1,000,000
- JP's method: Time to Process = 0.363 seconds
- JRL's method: Time to Process = 4.18 seconds
Re: DBQuery
Rain wrote: Results with 100k and 1 million patients...
Vindicated... Thanks Rain!