[R] Function read.table(…) reads in only 40% of my table's lines

Asis Hallab asis.hallab at gmail.com
Mon Aug 5 14:45:10 CEST 2013


Dear Jim,


2013/8/5 jim holtman <jholtman at gmail.com>:
> Couple of things to try.  May have an extra quote, so put:
>
> quote = ''

Thank you very much. That did the trick.

Much obliged!
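
For anyone finding this thread later, here is a minimal sketch of why
quote = '' probably helped. It assumes the file contains stray apostrophes
inside unquoted fields. The real data is not public, so the three sample
lines and the temporary file are made up for illustration.

```r
# Hypothetical TAB-delimited sample; the apostrophes in lines 1 and 3
# stand in for whatever stray quote characters the real file holds.
sample_lines <- c(
  "chr1\tgene\t5' UTR",
  "chr1\tgene\tmiddle",
  "chr1\tgene\t3' UTR"
)
tmp <- tempfile(fileext = ".gff")
writeLines(sample_lines, tmp)

# Default quote = "\"'": the apostrophe on line 1 opens a quoted string
# that is only closed by the apostrophe on line 3, so lines get merged.
with_quotes <- read.table(tmp, sep = "\t")

# quote = "" switches quote handling off; comment.char = "" does the
# same for '#', which would otherwise start a comment.
without_quotes <- read.table(tmp, sep = "\t", quote = "", comment.char = "")

nrow(with_quotes)     # fewer rows than the file has lines
nrow(without_quotes)  # one row per line
```

No error or warning is raised as long as the quotes happen to pair up,
which would match the silent loss of rows described in the original post.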

>
> as one of the parameters.  Also, might have comments, so try:
>
> comment.char = ""
>
> Take a look at your file and determine what line was the last complete one
> and see if there might be a problem in that line, or preceding ones.
>
> On Mon, Aug 5, 2013 at 7:11 AM, Asis Hallab <asis.hallab at gmail.com> wrote:
>>
>> Dear R experts,
>>
>> I have a large table saved in a file called "plant_genome.gff". The
>> file has 481848 lines in nine TAB-delimited columns and is
>> 53 megabytes in size.
>> For anyone who might know the GFF3 format: The table holds a plant
>> genome's annotation.
>>
>> If I read in the table with
>> read.table( "plant_genome.gff" )
>> I get the following error
>> "line 2 did not have 12 elements".
>>
>> If I read in the table with
>> read.table( "plant_genome.gff", sep="\t" )
>> no error or warning is given, but my resulting table has only 193547
>> instead of the expected 481848 rows! 60% of the lines are omitted.
>>
>> Also passing in the arguments
>> as.is = TRUE
>> or setting the columns' classes with
>> colClasses = c( "character", …, "integer", "integer", "numeric",
>> "character", … )
>>    # columns 4 and 5 are integers, column 6 is numeric, all others
>>    # are characters
>> does not resolve the problem.
>>
>> If I read in the file with readLines and then manually split them using
>> strsplit(…)
>> and combine them into a data.frame with
>> as.data.frame( do.call( "rbind", splitted.lines ), colClasses=…)
>> I get the expected and correct data.frame, representing my GFF3 data.
>>
>> My questions are:
>> 1) Am I using read.table wrong, or did I miss something in the
>> documentation?
>> 2) Or is this a known problem with large TAB-delimited tables whose
>> columns contain white space and are not surrounded by quotes?
>>
>> Unfortunately due to the unpublished nature of the plant genome I am
>> not allowed to give access to the GFF table that causes this problem.
>>
>> Any ideas, hints, help - or comments on my stupidity having missed
>> something important - will be much appreciated!
>>
>> Cheers!
>>
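
The readLines/strsplit workaround described above can be sketched roughly
as follows. The three sample rows and the temporary file are hypothetical
stand-ins for the unpublished GFF3 data. As far as I know, as.data.frame()
has no colClasses argument, so this sketch converts the columns afterwards
and then cross-checks against read.table with quoting and comments disabled.

```r
# Hypothetical stand-in for the GFF file: nine TAB-delimited, unquoted
# columns, with apostrophes in the last column.
tmp <- tempfile(fileext = ".gff")
writeLines(c(
  "chr1\tsrc\tgene\t1\t90\t0.5\t+\t.\tNote=5' end",
  "chr1\tsrc\tmRNA\t1\t90\t0.7\t+\t.\tNote=plain",
  "chr1\tsrc\texon\t10\t50\t0.9\t+\t.\tNote=3' end"
), tmp)

raw_lines <- readLines(tmp)

# Split every line on literal TABs and stack the pieces into a matrix.
split_lines <- strsplit(raw_lines, "\t", fixed = TRUE)
stopifnot(all(lengths(split_lines) == 9))  # GFF3 has nine columns
gff <- as.data.frame(do.call(rbind, split_lines), stringsAsFactors = FALSE)

# Convert the typed columns by hand: 4 and 5 integer, 6 numeric.
gff[[4]] <- as.integer(gff[[4]])
gff[[5]] <- as.integer(gff[[5]])
gff[[6]] <- as.numeric(gff[[6]])

# Cross-check: read.table with quote and comment handling switched off.
gff2 <- read.table(tmp, sep = "\t", quote = "", comment.char = "",
                   colClasses = c(rep("character", 3), "integer", "integer",
                                  "numeric", rep("character", 3)))

# Both routes should yield one row per line of the file.
nrow(gff) == length(raw_lines)
nrow(gff2) == length(raw_lines)
```

Comparing nrow() of the result with length(readLines(...)) is a cheap way
to notice silently dropped lines on the real file.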


