Parse csv files

R

rob

Here is what I want to do. From the internet I download some data that
are in csv format. All data will be in one long string. Now I need to
extract every cell. The problem ist that some cell content contains
commas and/or double quotes. Some of the cell contents that contain
double quotes use double double quotes and others don't, i.e. some look
like

"this "item" is bad", "this item is ok"

and others like

"this ""item"" is bad", "this item is ok"

There is also a chance that some cell are in double quotes (if they
contains commas or double quotes) and others are not in double quotes
(if they do not contain commas or double quotes). Considering all this
(and possibly more stuff) parsing becomes non trivial.

As a first approach I stored the content downloaded into a file and
then use odbc like this:

connectionString = @"Driver={Microsoft Text Driver (*.txt;
*.csv)};DBQ=" + Path.GetDirectoryName(filename);
connection = new OdbcConnection(connectionString);
connection.Open();
command = new OdbcCommand("Select * FROM " +
Path.GetFileName(filename), connection);
reader = command.ExecuteReader();

Unfortunately, this approach does not work for the above scenarios.
Excel reads the files in question just fine, though. So my question is
what is the best approach to read csv files, preferably without having
to create temporary files?

Thanks
 
B

Bryan Hessey

My understanding is that a .csv file has data fields separated by
comma character, and if the field contains a comma then that field i
enclosed in quotes (being double-quotes), and if the field contain
quotes then those are indicated by two consecutive quotes.

The site http://en.wikipedia.org/wiki/Comma-separated_values perhap
better explains .csv files, and has a pointer to required drivers.

Hope this helps.

--
 
R

rob

Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks
 
R

rob

Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks
 
A

AA2e72E

An (Excel) CSV file is:
1. simply a text file with columns delimited by a comma
2. simply a text file with columns delimited by a comma and text values
enclosed in double quotes.
In 1, a missing value is indicated by two contiguous commas, and in 2. a
missing value is indicated by ,"",
Your CSV file is much more complicated (dos not follow the patter 1 or 2)
and you appear to be using ADO.NET. Although you have ruled it out, you might
have to pre-process your file as 1 or 2 above before you can execute SQL
queries. You might find that when the SQL query works, it ignores the first
row (unless it contains or you add column names in the first row).
 
J

james

rob said:
Bryan,

This is also my understanding of csv files (though some people say that
for double quotes you don't need double double quotes). In any case,
the problem is that most cells use double double quotes for double
quotes. Unfortunately, some cells don't seem to be formated right and
don't enclose double quotes into double double quotes as outlined in my
example. I know that Excel can load it just fine but with odbc it does
not work. Any idea if the parsing algorithm used by Excel is somehow
accessible through the .NET framework (I was hoping Excel uses odbc).

Thanks


Is this CSV file(s) going to be an on-going thing? If not, and since you
say it
works fine in Excel, how about opening it in Excel and then re-exporting it
to
either another CSV file or TabDelimited file and then running your code
against that?
That way, you may be able to get the data out in the formatting you need for
your
application.
james
 
R

rob

This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of
the cell content contains tabs as well. Another option would be to open
the file in Excel (programatically) and then read out the cells. This
is a huge overkill, though. Worse it's something that will run on a
server. Although it usually runs during off-peak times I still don't
like the idea doing it this way. Therefore, I might really end up
writing my own parser. Thanks for the input, though. It's definitely
appreciated.
 
J

james

rob said:
This will be an ongoing thing. I was thinking about opening it in Excel
(programatically) and store it with tabs but there is a chance some of
the cell content contains tabs as well. Another option would be to open
the file in Excel (programatically) and then read out the cells. This
is a huge overkill, though. Worse it's something that will run on a
server. Although it usually runs during off-peak times I still don't
like the idea doing it this way. Therefore, I might really end up
writing my own parser. Thanks for the input, though. It's definitely
appreciated.

Your welcome. Sorry I couldn't offer any better suggestion. It does sound as
though you will have to write some sort of parser yourself. The scary part
is
you have no control over how the original file is created and what sort of
input
it allows. And with that being the case, it seems that writing a parser
routine that
will consistantly return the correct output, will be tough. Especially if
the app will
be running unattended. Otherwise you could write an editor of sorts that
would allow
a user to make some decisions on what to keep and what to discard. That
would go
a long way towards increasing the dependability of the app. (as long as the
end uers
understands what needs to be retained and what needs to be discarded)
I don't envy you on this one!!!
james
 
R

rob

Jason,

Thanks a lot for this link. It sounds promising and I'll give it a try.
I hope that'll safe my day...:)
 
R

rob

I did try the CsvReader and it fails with the exact same item as in the
odbc approach. One additional thing I realized is that many items/cells
do have words in double quotes. As far as I know double quotes should
be put in double double quotes. The data I have sometimes does that but
often it does not. Both approaches (odbc, cvsReader) work fine (can
read though don't render correctly) if the cell in question is the last
one. If it is not the last one both odbc and cvsReader get thrown off,
though.

So the conclusion here is that from all I know only Excel can handle
things in a way that works out but that parser is not directly
accessible in the .NET framework. Therefore, I will have to write my
own parser or maybe modify cvsReader.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top