[R] Problem reading mixed CSV file

jim holtman jholtman at gmail.com
Wed Mar 21 03:09:08 CET 2012


On Tue, Mar 20, 2012 at 3:17 PM, Ashish Agarwal
<ashish.agarwala at gmail.com> wrote:
> Given x<- count.fields(..) could you pls help in following:
> 1. how to create a data vector with data being line numbers of original file
> where x==6?

That is what the expression

 writeLines(input[x == 6], con = '6fields.csv')

is doing.  'x == 6' is a logical vector with TRUE at the position of
each line that has 6 fields, so indexing 'input' with it extracts only
those lines, which are then written to the output file.  If you want
the line numbers themselves, 'which(x == 6)' returns them.  You
probably need to read the section on index vectors in the
"An Introduction to R" manual.
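
As a rough sketch of the indexing described above (the temporary file
and its three lines are made up for illustration, they are not the
poster's data), something like this should show both the line numbers
and the lines themselves:

```r
# Hypothetical sample data: a small stand-in for the real file
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,1968,21,0",
             "a,b,Raleigh, North Carol,1968,25,0",
             "d,e,f,1967,31,1"), tmp)

x <- count.fields(tmp, sep = ",")   # number of fields on each line
input <- readLines(tmp)             # the raw lines themselves

# count.fields() skips blank lines by default, so check that the two
# vectors line up before indexing one with the other
stopifnot(length(x) == length(input))

lines6 <- which(x == 6)             # line numbers with 6 fields
lines7 <- which(x == 7)             # line numbers with 7 fields

input[x == 6]                       # logical indexing: the lines themselves
input[lines6]                       # same lines, selected by line number
```

If the real file contains blank lines, passing blank.lines.skip = FALSE
to count.fields should keep 'x' aligned with the result of readLines.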


> 2. what is the way to read only the nth line (only) of an input file into a
> data vector with first three attributes to be read as string, 4th
> being categorical, 5th and 6th being numeric with width 10?

You might want to give an example of what the line looks like.  I would
use 'readLines' to read in the file, index to the 'nth' line, and then
parse it using 'strsplit' or 'regexpr', depending on its complexity.
The details depend on the format of the line, which has not been
provided.
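
Purely as an illustration, and assuming a simple comma-separated line
with six fields (the sample lines and the column names V1..V6 below are
invented, since the real format was not posted), the readLines/strsplit
approach might look like this:

```r
# Hypothetical file contents; the real line format was not posted
tmp <- tempfile(fileext = ".csv")
writeLines(c("XY,PopatLal,Boston,1967,6,1",
             "PL,,Dartmouth,1967,31,1",
             "XY,PopatLal,Boston,1967,10,3"), tmp)

n <- 2                                   # which line to pick out
input <- readLines(tmp)

# split the nth line on commas; note that strsplit drops a trailing
# empty field, so a line ending in ',' would come back one field short
z <- strsplit(input[n], ",", fixed = TRUE)[[1]]

rec <- data.frame(
    V1 = z[1], V2 = z[2], V3 = z[3],     # kept as character strings
    V4 = factor(z[4]),                   # categorical
    V5 = as.numeric(z[5]),               # numeric
    V6 = as.numeric(z[6]),               # numeric
    stringsAsFactors = FALSE
)
str(rec)
```

For a large file, readLines(tmp, n = n)[n] stops reading after the nth
line instead of pulling in the whole file.  A line with embedded commas
or quotes would need more careful parsing than a plain strsplit.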


>
>
> On Tue, Mar 20, 2012 at 9:37 PM, jim holtman <jholtman at gmail.com> wrote:
>> use 'count.fields' to determine which lines have 6 and 7 fields in them.
>>
>> then use 'readLines' to read in the entire file and then use the data
>> from count.fields to write out to separate files:
>>
>> x <- count.fields(...)
>> input <- readLines(..)
>> writeLines(input[x == 6], con = '6fields.csv')
>> writeLines(input[x == 7], con = '7fields.csv')
>>
>> On Tue, Mar 20, 2012 at 11:43 AM, Ashish Agarwal
>> <ashish.agarwala at gmail.com> wrote:
>>> The file is 20MB having 2 Million rows.
>>> I understand that I have two different formats: 6 columns and 7 columns.
>>> How do I read chunks to different files by using scan with modifying
>>> skip and nlines parameters?
>>>
>>> On Mon, Mar 19, 2012 at 3:59 PM, Petr PIKAL <petr.pikal at precheza.cz>
>>> wrote:
>>>>
>>>> I would follow Jim's suggestion,
>>>> nFields <- count.fields(fileName, sep = ',')
>>>> count fields and read chunks to different files by using scan with
>>>> modifying skip and nlines parameters. However if there are only a few
>>>> lines which differ it would be better to correct those few lines
>>>> manually in some suitable editor.
>>>>
>>>> Elaborating an omnipotent function for reading any kind of
>>>> corrupted/nonstandard file seems to me worthwhile only if you expect
>>>> to read such files many times.
>>>>
>>>> Regards
>>>> Petr
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Mar 17, 2012 at 4:54 AM, jim holtman <jholtman at gmail.com>
>>>>> wrote:
>>>>> > Here is a solution that looks for the line with 7 elements and
>>>>> > inserts the quotes:
>>>>> >
>>>>> >
>>>>> >> fileName <- '/temp/text.txt'
>>>>> >> input <- readLines(fileName)
>>>>> >> # count the fields to find 7
>>>>> >> nFields <- count.fields(fileName, sep = ',')
>>>>> >> # now fix the data
>>>>> >> for (i in which(nFields == 7)){
>>>>> > +     # split on comma
>>>>> > +     z <- strsplit(input[i], ',')[[1]]
>>>>> > +     input[i] <- paste(z[1], z[2]
>>>>> > +         , paste('"', z[3], ',', z[4], '"', sep = '') # put on quotes
>>>>> > +         , z[5], z[6], z[7], sep = ','
>>>>> > +         )
>>>>> > + }
>>>>> >>
>>>>> >> # now read in the data
>>>>> >> result <- read.table(textConnection(input), sep = ',')
>>>>> >>
>>>>> >>         result
>>>>> >                         V1       V2                   V3   V4 V5 V6
>>>>> > 1                                                         1968 21  0
>>>>> > 2                                                  Boston 1968 13  0
>>>>> > 3                                                  Boston 1968 18  0
>>>>> > 4                                                 Chicago 1967 44  0
>>>>> > 5                                              Providence 1968 17  0
>>>>> > 6                                              Providence 1969 48  0
>>>>> > 7                                                   Binky 1968 24  0
>>>>> > 8                                                 Chicago 1968 23  0
>>>>> > 9                                                   Dally 1968  7  0
>>>>> > 10                                   Raleigh, North Carol 1968 25  0
>>>>> > 11 Addy ABC-Dogs Stars-W8.1                    Providence 1968 38  0
>>>>> > 12              DEF_REQPRF/                     Dartmouth 1967 31  1
>>>>> > 13                       PL                               1967 38  1
>>>>> > 14                       XY PopatLal                      1967  5  1
>>>>> > 15                       XY PopatLal                      1967  6  8
>>>>> > 16                       XY PopatLal                      1967  7  7
>>>>> > 17                       XY PopatLal                      1967  9  1
>>>>> > 18                       XY PopatLal                      1967 10  1
>>>>> > 19                       XY PopatLal                      1967 13  1
>>>>> > 20                       XY PopatLal               Boston 1967  6  1
>>>>> > 21                       XY PopatLal               Boston 1967  7 11
>>>>> > 22                       XY PopatLal               Boston 1967  9  2
>>>>> > 23                       XY PopatLal               Boston 1967 10  3
>>>>> > 24                       XY PopatLal               Boston 1967  7  2
>>>>> >>
>>>>> >
>>>>> >
>>>>> > On Fri, Mar 16, 2012 at 2:17 PM, Ashish Agarwal
>>>>> > <ashish.agarwala at gmail.com> wrote:
>>>>> >> I have a file that is 5000 records and to edit that file is not
>>>>> >> easy. Is there any way to read line 10 differently to account for
>>>>> >> changes in the third field?
>>>>> >>
>>>>> >> On Fri, Mar 16, 2012 at 11:35 PM, Peter Ehlers <ehlers at ucalgary.ca>
>>>> wrote:
>>>>> >>> On 2012-03-16 10:48, Ashish Agarwal wrote:
>>>>> >>>>
>>>>> >>>> Line 10 has City and State, themselves separated by a comma. For
>>>>> >>>> line 10 how can I read differently as compared to the other lines?
>>>>> >>>
>>>>> >>>
>>>>> >>> Edit the file and put quotes around the city-state combination:
>>>>> >>>  "Raleigh, North Carol"
>>>>> >>>
>>>>> >>
>>>>> >> ______________________________________________
>>>>> >> R-help at r-project.org mailing list
>>>>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> >> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>>> >> and provide commented, minimal, self-contained, reproducible code.
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Jim Holtman
>>>>> > Data Munger Guru
>>>>> >
>>>>> > What is the problem that you are trying to solve?
>>>>> > Tell me what you want to do, not how you want to do it.
>>>>>
>>>>
>>
>>
>>
>
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


