[R] Problem reading mixed CSV file

Ashish Agarwal ashish.agarwala at gmail.com
Tue Mar 20 16:43:37 CET 2012


The file is 20 MB and has 2 million rows.
I understand that I have two different formats: 6 columns and 7 columns.
How do I read the chunks into different files using scan while modifying
the skip and nlines parameters?
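
For reference, here is a rough sketch of how the count.fields / scan idea
quoted below might look. It is not a tested solution for the real file. The
sample data, the variable names (runs, skips, chunks, byWidth) and the output
file names are all made up for illustration, and it assumes the file has no
blank lines or '#' comment lines, which count.fields would skip by default.

```r
# Hypothetical sample file standing in for the real 2-million-row file.
fileName <- tempfile(fileext = ".txt")
writeLines(c(
  "a,b,Boston,1968,13,0",
  "a,b,Chicago,1967,44,0",
  "a,b,Raleigh, North Carol,1968,25,0",
  "a,b,Providence,1968,38,0"
), fileName)

# Number of comma-separated fields on every line.
nFields <- count.fields(fileName, sep = ",")

# Consecutive lines with the same field count form one chunk.
runs  <- rle(nFields)
skips <- cumsum(c(0, head(runs$lengths, -1)))  # lines preceding each chunk

# Read each chunk with scan, adjusting skip and nlines per chunk.
chunks <- lapply(seq_along(runs$lengths), function(k) {
  n <- runs$values[k]
  scan(fileName,
       what = setNames(rep(list(""), n), paste0("V", seq_len(n))),
       sep = ",", skip = skips[k], nlines = runs$lengths[k],
       strip.white = TRUE, quiet = TRUE)
})

# Combine the chunks that share a field count into one data frame each.
byWidth <- lapply(split(chunks, runs$values), function(parts) {
  do.call(rbind, lapply(parts, as.data.frame, stringsAsFactors = FALSE))
})

# Write one file per format (output names are placeholders).
outFiles <- file.path(tempdir(), paste0("fields_", names(byWidth), ".csv"))
for (k in seq_along(byWidth)) {
  write.csv(byWidth[[k]], outFiles[k], row.names = FALSE)
}
```

Each scan call re-reads the file from the top to honour skip, so if the two
formats alternate often this could be slow on 2 million rows. In that case it
may be simpler to call readLines once, split the lines by nFields, and pass
each group to read.table through its text argument.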

On Mon, Mar 19, 2012 at 3:59 PM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
>
> I would follow Jim's suggestion,
> nFields <- count.fields(fileName, sep = ',')
> count the fields, and read chunks into different files by using scan while
> modifying the skip and nlines parameters. However, if only a few lines
> differ, it would be better to correct those few lines manually in
> some suitable editor.
>
> Elaborating an omnipotent function for reading any kind of
> corrupted/nonstandard file seems to me worthwhile only if you expect to
> read such files many times.
>
> Regards
> Petr
>
>
>>
>>
>>
>> On Sat, Mar 17, 2012 at 4:54 AM, jim holtman <jholtman at gmail.com> wrote:
>> > Here is a solution that looks for the line with 7 elements and inserts
>> > the quotes:
>> >
>> >
>> >> fileName <- '/temp/text.txt'
>> >> input <- readLines(fileName)
>> >> # count the fields to find 7
>> >> nFields <- count.fields(fileName, sep = ',')
>> >> # now fix the data
>> >> for (i in which(nFields == 7)){
>> > +     # split on comma
>> > +     z <- strsplit(input[i], ',')[[1]]
>> > +     input[i] <- paste(z[1], z[2]
>> > +         , paste('"', z[3], ',', z[4], '"', sep = '') # put on quotes
>> > +         , z[5], z[6], z[7], sep = ','
>> > +         )
>> > + }
>> >>
>> >> # now read in the data
>> >> result <- read.table(textConnection(input), sep = ',')
>> >>
>> >>         result
>> >                         V1       V2                   V3   V4 V5 V6
>> > 1                                                         1968 21  0
>> > 2                                                  Boston 1968 13  0
>> > 3                                                  Boston 1968 18  0
>> > 4                                                 Chicago 1967 44  0
>> > 5                                              Providence 1968 17  0
>> > 6                                              Providence 1969 48  0
>> > 7                                                   Binky 1968 24  0
>> > 8                                                 Chicago 1968 23  0
>> > 9                                                   Dally 1968  7  0
>> > 10                                   Raleigh, North Carol 1968 25  0
>> > 11 Addy ABC-Dogs Stars-W8.1                    Providence 1968 38  0
>> > 12              DEF_REQPRF/                     Dartmouth 1967 31  1
>> > 13                       PL                               1967 38  1
>> > 14                       XY PopatLal                      1967  5  1
>> > 15                       XY PopatLal                      1967  6  8
>> > 16                       XY PopatLal                      1967  7  7
>> > 17                       XY PopatLal                      1967  9  1
>> > 18                       XY PopatLal                      1967 10  1
>> > 19                       XY PopatLal                      1967 13  1
>> > 20                       XY PopatLal               Boston 1967  6  1
>> > 21                       XY PopatLal               Boston 1967  7 11
>> > 22                       XY PopatLal               Boston 1967  9  2
>> > 23                       XY PopatLal               Boston 1967 10  3
>> > 24                       XY PopatLal               Boston 1967  7  2
>> >>
>> >
>> >
>> > On Fri, Mar 16, 2012 at 2:17 PM, Ashish Agarwal
>> > <ashish.agarwala at gmail.com> wrote:
>> >> I have a file with 5000 records, and editing that file is not easy.
>> >> Is there any way to read line 10 differently to account for changes in
>> >> the third field?
>> >>
>> >> On Fri, Mar 16, 2012 at 11:35 PM, Peter Ehlers <ehlers at ucalgary.ca> wrote:
>> >>> On 2012-03-16 10:48, Ashish Agarwal wrote:
>> >>>>
>> >>>> Line 10 has City and State, and those too are separated by a comma.
>> >>>> How can I read line 10 differently from the other lines?
>> >>>
>> >>>
>> >>> Edit the file and put quotes around the city-state combination:
>> >>>  "Raleigh, North Carol"
>> >>>
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>> >
>> >
>> > --
>> > Jim Holtman
>> > Data Munger Guru
>> >
>> > What is the problem that you are trying to solve?
>> > Tell me what you want to do, not how you want to do it.
>>
>


