January 28, 2010

Force DBQuery to Read CSV Columns as Text

Filed under: Scripting — Marcus Tettmar @ 1:04 pm

In More on Reading and Writing CSV Files and various other posts I have demonstrated how CSV data can be read via a SQL SELECT statement using DBQuery.

In order to do this we used an ADO connection string in the DBConnect line:

Let>ConStr=Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\My Documents;Extended Properties='text;HDR=NO;FMT=Delimited'
DBConnect>ConStr,dbH

“Proper” CSV files use the double quote character to delimit strings, but there are many CSV files that do not contain these delimiters. When string delimiters aren’t used, that should normally mean the field is numeric, but that isn’t always the case. So ADO sometimes has to guess whether a column is a string or a number. By default it does this by looking at the first 8 rows of data, which may not always be enough.

Watch what happens if we try to read in this file, which contains IP addresses:

abc.pluto.server.com,18.140.1.2
zxy.saturn.server.com,18.120.2.1
fyg.pluto.server.com,18.5.6.2

We’ll read it in with:

DBQuery>dbH,SELECT * FROM servers.csv,fields,num_recs,num_fields

Something weird happens. Look in the watch list and you’ll see the IP addresses have become:

18.1401
18.1202
18.562

ADO has decided that the IP addresses are numbers. It didn’t see any quote marks, and after analysing the rows it decided they were meant to be numbers. Bonus points go to anyone who can tell me why it has transformed the numbers in that way.

As mentioned above, ADO looks at the first 8 rows. If the majority of those rows are obviously strings, or contain string delimiters, it will treat the column as text. Try delimiting just one row of the above three: you’ll still get numbers. Delimit two and the problem is solved. Where the data types are mixed, ADO chooses the type of the majority.

But we can’t go editing the CSV files to correct them.

What we can do is use a Schema.ini file. Using a Schema.ini file we can tell ADO what data types should be used for each column. So for our Servers.csv example above we could make a Schema.ini file as follows:

[Servers.csv]
Col1=ServerName Text
Col2=IPAddress Text

This file should be in the same folder as the CSV file.
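Schema.ini can also state the file format and whether the first row is a header, alongside the column definitions. As I read the Microsoft documentation (linked below), a more explicit version of the above would be:

```ini
[Servers.csv]
ColNameHeader=False
Format=CSVDelimited
Col1=ServerName Text
Col2=IPAddress Text
```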

Now, when we run our query we get the correct values, because we’ve forced ADO to treat the columns as text. Another advantage is that we have given our columns names, which is useful because our file doesn’t have a header row. This means we could set the Fieldnames parameter of our DBQuery line to 1 to return our array using the field names.
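For example, with Fieldnames set to 1 the array elements can be referenced by the column names we defined in Schema.ini. A quick sketch (this assumes the Schema.ini above, and that the array elements are named arrayname_row_fieldname — check the DBQuery help topic for your version to confirm the naming):

```
//Fieldnames parameter set to 1: reference values by field name
DBQuery>dbH,SELECT * FROM servers.csv,fields,num_recs,num_fields,1
//first row, referenced by name rather than column number
MessageModal>Server: %fields_1_ServerName% IP: %fields_1_IPAddress%
```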

For more information on Schema.ini options for text files see:
http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx

Most of the time you’ll probably be working with CSV files that you know about, so manually creating a schema.ini file once is no issue. But you could have the macro create it for you using WriteLn and/or EditIniFile.

What if your macro needs to read in a CSV file and you don’t already know how many fields there are, but you want to force everything to text? Well, how about doing this:

//Create schema.ini file
//Note: schema.ini must be in the same folder as the CSV file
//(here we assume servers.csv lives in %SCRIPT_DIR%)
//delete any schema.ini left over from a previous run
DeleteFile>%SCRIPT_DIR%\schema.ini
//query once without a schema just to get the column count
DBQuery>dbH,select * from servers.csv,fields,num_recs,num_fields,0
WriteLn>%SCRIPT_DIR%\schema.ini,res,[servers.csv]
Let>k=0
Repeat>k
  Let>k=k+1
  EditIniFile>%SCRIPT_DIR%\schema.ini,servers.csv,Col%k%,Col%k% Text
Until>k=num_fields

//now requery using the schema.ini file
DBQuery>dbH,select * from servers.csv,fields,num_recs,num_fields,1
...

So this code uses a DBQuery to get the column count, then creates a schema.ini in which each column is named Coln, where n is the column number, and is set to Text.


Side note: There’s another option called MaxScanRows which is supposed to help by telling ADO how many rows to scan, so that instead of looking at only 8 rows you can tell it to look at more, or fewer. According to the docs, setting it to zero will cause it to scan all rows. Given that in our IP address experiment ADO took the IP addresses as numbers, and we only had 3 rows, which is fewer than the default 8, I would have expected setting MaxScanRows to 0 to make no difference. In fact it causes the IP addresses to come back as text, so it seems to solve our initial problem, but I’m not sure why.

Googling MaxScanRows reveals a number of forum and newsgroup posts where others have had problems, and there seems to be some confusion over exactly what it does. If we use MaxScanRows without ColNameHeader it also causes ADO to treat the first line as a header line. And setting ColNameHeader to False, while ensuring the first line is not seen as a header, brings back the original problem, where the IP addresses are converted to numbers. It all seems a bit strange. Hence my preference for explicitly setting the data type of each column.
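For anyone who wants to repeat the experiment, these MaxScanRows variations are just Schema.ini entries in the same folder as the CSV. A minimal version of what I tried (remove the ColNameHeader line to see the first row treated as a header; leave it in and, in my tests, the IP addresses come back as numbers again):

```ini
[Servers.csv]
MaxScanRows=0
ColNameHeader=False
```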