[R] reading csv files

Jim Lemon jim at bitwrit.com.au
Sat Feb 6 01:16:01 CET 2010


On 02/06/2010 09:05 AM, analyst41 at hotmail.com wrote:
>
>
> On Feb 5, 8:57 am, Barry Rowlingson <b.rowling... at lancaster.ac.uk>
> wrote:
>> On Fri, Feb 5, 2010 at 10:23 AM, analys... at hotmail.com
>> <analys... at hotmail.com> wrote:
>>> the csv files are downloaded from a database and it looks like some
>>> character fields contain the CR-LF sequence within them.
>>
>>> This causes R to see a new record/row and the number of rows it sees
>>> is different (usually higher) from the number of rows actually
>>> extracted.
>>
>>   Hard to tell without an example, but I just tried this in a file:
>>
>> 1,2,"this
>> is a test",99
>> 2,3,"oneliner",45
>>
>> and:
>>
>>> read.table("test.csv",sep=",")
>>
>>    V1 V2              V3 V4
>> 1  1  2 this\nis a test 99
>> 2  2  3        oneliner 45
>>
>> seemed to work. But if your strings aren't "quoted" (hard to tell
>> without an example) then you might have to find another way. Hard to
>> tell without an example.
>>
>> Barry
>>
>> ______________________________________________
>> R-h... at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
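
Barry's test above can be reproduced without creating the file by hand. This is only a sketch: the temporary file stands in for his test.csv, and stringsAsFactors = FALSE is an added assumption so that V3 comes back as plain character data.

```r
# Write Barry's three physical lines to a temporary file (stand-in for test.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c('1,2,"this', 'is a test",99', '2,3,"oneliner",45'), tmp)

# The line break inside the quoted field stays part of the field,
# so the three physical lines are read as two records.
df <- read.table(tmp, sep = ",", stringsAsFactors = FALSE)
nrow(df)
df$V3[1]
```

This only works because the field is wrapped in double quotes; an unquoted field containing a line break would still be split.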
>
>
> Here is a hex dump (please ignore the '>' at the start of each line)
> of the file that results from extracting two rows.
>
>
>> EF BB BF 64 65 73 63 72-69 70 74 69 6F 6E 0D 0A   ...description..
>> 22 3C 73 74 72 6F 6E 67-3E 55 6E 6B 6E 6F 77 6E   "<strong>Unknown
>> 20 41 6E 79 74 69 6D 65-2C 20 41 6E 79 77 68 65    Anytime, Anywhe
>> 72 65 20 4C 65 61 72 6E-69 6E 67 3C 62 72 20 2F   re Learning<br /
>> 3E 0D 0A 3C 2F 73 74 72-6F 6E 67 3E 20 54 68 65   >..</strong> The
>> 20 61 6E 73 77 65 72 20-69 73 20 55 6E 6B 6E 6F    answer is Unkno
>> 77 6E 2E 20 3C 73 74 72-6F 6E 67 3E 20 79 6F 75   wn. <strong> you
>> 20 63 61 6E 20 73 74 61-72 74 20 61 6E 64 20 66    can start and f
>> 69 6E 69 73 68 20 69 6E-20 6C 65 73 73 20 74 68   inish in less th
>> 65 6E 20 31 37 20 6D 6F-6E 74 68 73 2E 3C 2F 73   en 17 months.</s
>> 74 72 6F 6E 67 3E 20 3C-62 72 20 2F 3E 0D 0A 3C   trong> <br />..<
>> 62 72 20 2F 3E 0D 0A 55-6E 6B 6E 6F 77 6E 20 61   br />..Unknown a
>> 62 6F 75 74 20 65 6E 73-75 72 69 6E 67 20 79 6F   bout ensuring yo
>> 75 20 6C 65 61 72 6E 20-2E 22 0D 0A 03 D8 26 8A   u learn ."....&.
>
>
>
> R, Fortran and Excel see five lines, but the database has only two
> lines.
>
Okay, you have five CR-LF pairs, two of which are EORs. Going by the
dump, the three CR-LF pairs that follow <br /> sit inside the quoted
field, and the two EORs are the CR-LF after the header and the one
after the closing quote. So it should be possible to preserve the EORs
while changing the others to something like "~" or deleting them. As I
said previously, the regexperts can work out a way to distinguish the
CR-LF pairs that are _not_ in an EOR sequence.
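
One possible way to make that distinction is sketched below. It assumes the CR-LF pairs to change are exactly those that fall between double quotes, which is what the dump suggests for this field but may not hold for every export. The helper name fix_embedded_crlf and the sample text are made up for illustration.

```r
# Hypothetical helper (not from the thread): replace CR-LF pairs that fall
# inside double-quoted fields and leave the record-ending pairs alone.
fix_embedded_crlf <- function(txt, replacement = "~") {
  # Each match is one double-quoted run; [^"] also matches CR and LF.
  # A doubled quote ("") inside a field just yields two adjacent runs.
  m <- gregexpr('"[^"]*"', txt, perl = TRUE)
  quoted <- regmatches(txt, m)
  regmatches(txt, m) <- lapply(
    quoted,
    function(q) gsub("\r\n", replacement, q, fixed = TRUE)
  )
  txt
}

# Made-up sample shaped like the dump: a header line, then one quoted field
# with three embedded CR-LF pairs.
sample_txt <- paste0(
  "description\r\n",
  "\"<strong>A<br />\r\n</strong> B <br />\r\n<br />\r\nC .\"\r\n"
)

fixed <- fix_embedded_crlf(sample_txt)
records <- strsplit(fixed, "\r\n", fixed = TRUE)[[1]]
length(records)  # header plus one data record
```

For a real file, the whole content could be read as one string first, for example with readChar(path, file.info(path)$size, useBytes = TRUE), where path is whatever the export is called. Passing replacement = "" deletes the embedded pairs instead of marking them.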

You might want to think about dumping the control characters as well.
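
As a rough sketch of that clean-up, the bytes can be filtered before any parsing. The function name strip_control_bytes is hypothetical, and the choice to keep TAB, CR and LF and to drop a leading UTF-8 byte-order mark (the EF BB BF at the start of the dump) is an assumption about what "control characters" should cover here.

```r
# Hypothetical helper (not from the thread): drop a leading UTF-8 BOM and
# ASCII control bytes, keeping TAB, LF and CR so record structure survives.
strip_control_bytes <- function(bytes) {
  stopifnot(is.raw(bytes))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  code <- as.integer(bytes)
  # Keep printable ASCII and bytes >= 0x80 (possibly part of UTF-8 text);
  # drop 0x00-0x1F except TAB/LF/CR, and drop DEL (0x7F).
  keep <- (code >= 0x20 | code %in% c(0x09, 0x0A, 0x0D)) & code != 0x7F
  bytes[keep]
}

# Made-up input: BOM, "de", CR-LF, a stray 0x03 control byte, then "&".
input <- as.raw(c(0xEF, 0xBB, 0xBF, 0x64, 0x65, 0x0D, 0x0A, 0x03, 0x26))
cleaned <- strip_control_bytes(input)
rawToChar(cleaned)
```

The raw bytes of a real file could be obtained with readBin(path, "raw", file.info(path)$size), again with path standing for the actual file name.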

Jim


