[R] help with read.csv() for files with different number of columns

Tue Aug 29 23:59:35 CEST 2017

Hi Ace,
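You don't need to know that ahead of time. You can read the file first to
find the maximum number of fields:

# return the largest number of fields found on any line
max_fields<-function(file,sep=" ") {
 rlines<-readLines(file)
 # split each line on the separator and count the pieces
 return(max(lengths(strsplit(rlines,sep))))
}
# the file name must be passed as a quoted string
nmax<-max_fields("test.txt","\t")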
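
The same idea can be sketched with count.fields() from base R (utils), which
returns the number of fields on each line, so its maximum can be passed
straight to col.names. The read.table documentation says the number of
columns is taken from the first five lines of input unless col.names is
longer, which is why supplying col.names helps here. In the sketch below the
temporary file and its three records are only an illustration built from the
sample in this thread, and quote="" and comment.char="" are my assumption
that the gene names should be read literally.

```r
# Write a small tab-separated sample to a temporary file (illustrative only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("1100001G20Rik%%%\tNmi",
             "1500015O10Rik%%%\tFoxi1\tAscl3\tSirt3",
             "1810011O10Rik%%%\tPpp1r13b\tBpnt1\tCdkn2c\tFoxc1\tSox10\tSmarca2"),
           tmp)

# count.fields() gives the number of fields on each line
nfields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
nmax <- max(nfields)

# Supply nmax column names so that every record stays on a single row;
# fill = TRUE pads the shorter records
testdf <- read.table(tmp, header = FALSE, sep = "\t", fill = TRUE,
                     quote = "", comment.char = "",
                     col.names = paste0("V", seq_len(nmax)),
                     stringsAsFactors = FALSE)
dim(testdf)
```

For the real file, replace tmp with "test.txt". The padded cells should come
back as zero length strings in character columns.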

Jim

On Wed, Aug 30, 2017 at 2:22 AM, Fix Ace <acefix at rocketmail.com> wrote:
> Thank you very much! Looks like I have to know the length of each record
> ahead of time.
>
> Ace
>
>
> On Monday, August 28, 2017 12:56 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
>
>
> Hi Ace,
> With tabs as separators:
>
> testdf<-read.table("test.txt",header=FALSE,fill=TRUE,sep="\t",
> col.names=paste("V",1:19,sep=""),stringsAsFactors=FALSE)
>
> Also note that I got the number of columns wrong the first time.
>
> Jim
>
>
> On Mon, Aug 28, 2017 at 12:56 PM, Fix Ace <acefix at rocketmail.com> wrote:
>> Hi, Jim,
>>
>> Thank you very much for pointing out the format issue. Here is the
>> original text:
>>
>> ===
>> I have a text file (test.txt) whose rows have different numbers of columns:
>>
>> 0610007P14Rik%%% Tcf19 Gtf2i
>> 0610010O12Rik%%% Ivns1abp Etv6
>> 1100001G20Rik%%% Nmi
>> 1500015O10Rik%%% Foxi1 Ascl3 Sirt3
>> 1700003E16Rik%%% Ascl2 Ifnar2
>> 1700028J19Rik%%% Musk Nfe2l3
>> 1810011O10Rik%%% Ppp1r13b Bpnt1 Cdkn2c Foxc1 Sox10 Smarca2
>> 1810019D21Rik%%% Asb8
>> 1810037I17Rik%%% Zfp612
>> 1810055G02Rik%%% Nkx2-3 Maged1 Runx1 Ugp2 Elk4 Spdef Tcf19 Isl2 Gtf2i Ctnnbl1 Tcea3 Ank2 Zfp612 Creb3l1 Nupr1 3632451O06Rik Creb3l4 Lass6
>>
>> I would like to read it into R using
>>
>>> test=read.csv("test.txt",sep="\t",header=FALSE)
>>
>> However, when I check the R object "test", I find that all the rows have
>> 5 columns:
>>
>>> test
>>                  V1            V2      V3     V4      V5
>> 1  0610007P14Rik%%%         Tcf19   Gtf2i
>> 2  0610010O12Rik%%%      Ivns1abp    Etv6
>> 3  1100001G20Rik%%%           Nmi
>> 4  1500015O10Rik%%%         Foxi1   Ascl3  Sirt3
>> 5  1700003E16Rik%%%         Ascl2  Ifnar2
>> 6  1700028J19Rik%%%          Musk  Nfe2l3
>> 7  1810011O10Rik%%%      Ppp1r13b   Bpnt1 Cdkn2c   Foxc1
>> 8             Sox10       Smarca2
>> 9  1810019D21Rik%%%          Asb8
>> 10 1810037I17Rik%%%        Zfp612
>> 11 1810055G02Rik%%%        Nkx2-3  Maged1  Runx1    Ugp2
>> 12             Elk4         Spdef   Tcf19   Isl2   Gtf2i
>> 13          Ctnnbl1         Tcea3    Ank2 Zfp612 Creb3l1
>> 14            Nupr1 3632451O06Rik Creb3l4  Lass6
>>
>> Basically it breaks some rows into more than one row. For example, row 7
>> in the original record becomes two rows. It looks like "test" always has
>> 5 columns.
>>
>> How does this happen? How should I fix it so that one record becomes one
>> row in the R object?
>>
>> ==
>>
>> Please let me know if it is readable now. Thank you very much for your
>> time!
>>
>> Kind regards,
>>
>> Ace
>>
>>
>> On Sunday, August 27, 2017 7:25 PM, Jim Lemon <drjimlemon at gmail.com>
>> wrote:
>>
>>
>> Hi Ace,
>> As your example seems to have spaces as separators,
>>
>> testdf<-read.table("test.txt",header=FALSE,fill=TRUE,
>> col.names=paste("V",1:14,sep=""),stringsAsFactors=FALSE)
>>
>> By specifying the number of columns with "col.names" and using
>> "fill=TRUE" you can get a data frame with zero length strings where
>> values are missing in the input file.
>>
>> Jim
>>
>> On Mon, Aug 28, 2017 at 6:25 AM, Fix Ace via R-help
>> <r-help at r-project.org> wrote:
>>> Dear R community,
>>> I have a text file (test.txt) whose rows have different numbers of columns:
>>> 0610007P14Rik%%% Tcf19 Gtf2i
>>> 0610010O12Rik%%% Ivns1abp Etv6
>>> 1100001G20Rik%%% Nmi
>>> 1500015O10Rik%%% Foxi1 Ascl3 Sirt3
>>> 1700003E16Rik%%% Ascl2 Ifnar2
>>> 1700028J19Rik%%% Musk Nfe2l3
>>> 1810011O10Rik%%% Ppp1r13b Bpnt1 Cdkn2c Foxc1 Sox10 Smarca2
>>> 1810019D21Rik%%% Asb8
>>> 1810037I17Rik%%% Zfp612
>>> 1810055G02Rik%%% Nkx2-3 Maged1 Runx1 Ugp2 Elk4 Spdef Tcf19 Isl2 Gtf2i Ctnnbl1 Tcea3 Ank2 Zfp612 Creb3l1 Nupr1 3632451O06Rik Creb3l4 Lass6
>>> I would like to read it into R using
>>>  > test=read.csv("test.txt",sep="\t",header=FALSE)
>>> However, when I check the R object "test", I find that all the rows have
>>> 5 columns:
>>>> test
>>>                  V1            V2      V3     V4      V5
>>> 1  0610007P14Rik%%%         Tcf19   Gtf2i
>>> 2  0610010O12Rik%%%      Ivns1abp    Etv6
>>> 3  1100001G20Rik%%%           Nmi
>>> 4  1500015O10Rik%%%         Foxi1   Ascl3  Sirt3
>>> 5  1700003E16Rik%%%         Ascl2  Ifnar2
>>> 6  1700028J19Rik%%%          Musk  Nfe2l3
>>> 7  1810011O10Rik%%%      Ppp1r13b   Bpnt1 Cdkn2c   Foxc1
>>> 8             Sox10       Smarca2
>>> 9  1810019D21Rik%%%          Asb8
>>> 10 1810037I17Rik%%%        Zfp612
>>> 11 1810055G02Rik%%%        Nkx2-3  Maged1  Runx1    Ugp2
>>> 12             Elk4         Spdef   Tcf19   Isl2   Gtf2i
>>> 13          Ctnnbl1         Tcea3    Ank2 Zfp612 Creb3l1
>>> 14            Nupr1 3632451O06Rik Creb3l4  Lass6
>>> Basically it breaks some rows into more than one row. For example, row 7
>>> in the original record becomes two rows. It looks like "test" always has
>>> 5 columns.
>>> How does this happen? How should I fix it so that one record becomes one
>>> row in the R object?
>>> Thank you very much!
>>> Ace