[R] negative vector length when merging data frames

Rui Barradas ru|pb@rr@d@@ @end|ng |rom @@po@pt
Thu Oct 24 21:17:16 CEST 2019


Hello,

sqldf::sqldf sometimes needs less memory than merge(). You could try

library(sqldf)

sqldf('select l4.*, asign.gene, asign.chr_pos, asign.`p.val.Retina`
       from l4
       inner join asign
       on X1 = asign.chr and X2 = asign.pos')
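
To try the query on something small first, here is a minimal sketch. The two toy data frames are made up for illustration and only mimic the key columns of the real data; the requireNamespace() guard and the explicit l4. prefixes on the join columns are additions for the example, not part of the query above.

```r
# Toy stand-ins for the real tables (hypothetical values)
l4 <- data.frame(X1 = c("chr1", "chr1", "chr2"),
                 X2 = c(100L, 200L, 100L),
                 pval_nominal = c(0.1, 0.2, 0.3),
                 stringsAsFactors = FALSE)
asign <- data.frame(gene = c("g1", "g2", "g3"),
                    chr = c("chr1", "chr2", "chr2"),
                    pos = c(100L, 200L, 300L),
                    p.val.Retina = c(0.5, 0.6, 0.7),
                    stringsAsFactors = FALSE)

# Only run the query if sqldf is installed
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  # Inner join on the (chromosome, position) pair
  res_sql <- sqldf('select l4.*, asign.gene, asign.`p.val.Retina`
                    from l4
                    inner join asign
                    on l4.X1 = asign.chr and l4.X2 = asign.pos')
  print(res_sql)
}
```

Only the pair (chr1, 100) occurs in both toy tables, so the join should return that single row.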

Or you can filter the rows that match first, then merge the results.
Something along the lines of

# read in only the columns needed with fread, it's fast
l4join <- data.table::fread(l4_file, select = c("X1", "X2"))
ajoin <- data.table::fread(asign_file, select = c("chr", "pos"))

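# flag the rows whose key values occur on the other side
# (each column is tested on its own, so this keeps a superset of the
# true matches; merge() below does the exact pairing)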
i1 <- (l4join$X1 %in% ajoin$chr) & (l4join$X2 %in% ajoin$pos)
i2 <- (ajoin$chr %in% l4join$X1) & (ajoin$pos %in% l4join$X2)

rm(l4join, ajoin)   # no longer needed, free the memory

# now read the full files
l4 <- data.table::fread(l4_file)
asign <- data.table::fread(asign_file)

# extract the relevant rows and merge
res <- l4[i1, ]
# with = FALSE makes data.table treat the character vector as column names
res2 <- asign[i2, setdiff(names(asign), names(l4)), with = FALSE]
merge(res, res2, by.x = c("X1", "X2"), by.y = c("chr", "pos"))
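
A minimal sketch of the filter-then-merge idea on toy data. The small data frames are made up for illustration, and plain base R data frames stand in for the fread results, so no packages are needed. It shows that the %in% filter keeps a superset of the matching rows and that merging the filtered rows gives the same result as merging the full tables.

```r
# Toy stand-ins for the real tables (hypothetical values)
l4 <- data.frame(X1 = c("chr1", "chr1", "chr2"),
                 X2 = c(100L, 200L, 100L),
                 pval_nominal = c(0.1, 0.2, 0.3),
                 stringsAsFactors = FALSE)
asign <- data.frame(gene = c("g1", "g2", "g3"),
                    chr = c("chr1", "chr2", "chr2"),
                    pos = c(100L, 200L, 300L),
                    p.val.Retina = c(0.5, 0.6, 0.7),
                    stringsAsFactors = FALSE)

# Column-wise pre-filter: a row is kept if each key value occurs
# somewhere on the other side, not necessarily in the same row
i1 <- (l4$X1 %in% asign$chr) & (l4$X2 %in% asign$pos)
i2 <- (asign$chr %in% l4$X1) & (asign$pos %in% l4$X2)

# Merge only the pre-filtered rows; merge() does the exact pairing
m_filtered <- merge(l4[i1, ], asign[i2, ],
                    by.x = c("X1", "X2"), by.y = c("chr", "pos"))

# Reference result: merge the full tables directly
m_full <- merge(l4, asign,
                by.x = c("X1", "X2"), by.y = c("chr", "pos"))

print(sum(i1))   # rows of l4 that survive the pre-filter
print(sum(i2))   # rows of asign that survive the pre-filter
print(isTRUE(all.equal(m_filtered, m_full)))
```

On real data the memory saving depends on how many rows the pre-filter removes. When the chromosome column has only a few distinct values, most of the filtering comes from the position column.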


Hope this helps,

Rui Barradas

At 00:08 on 24/10/19, Ana Marija wrote:
> Hi Jim,
> 
> I think one of the issues is that the data frames are so big:
>> dim(l4)
> [1] 166941635         8
>> dim(asign)
> [1] 107371528         5
> 
> so my example would not reproduce the error
> 
> On Wed, Oct 23, 2019 at 6:05 PM Jim Lemon <drjimlemon using gmail.com> wrote:
>>
>> Hi Ana,
>> When I run this example taken from your email:
>>
>> l4<-read.table(text="X1 X2 X3 X4  X5 variant_id pval_nominal gene_id.LCL
>> chr1 13550  G  A b38 1:13550:G:A     0.375614 ENSG00000227232
>> chr1 14671  G  C b38 1:14671:G:C     0.474708 ENSG00000227232
>> chr1 14677  G  A b38 1:14677:G:A     0.699887 ENSG00000227232
>> chr1 16841  G  T b38 1:16841:G:T     0.127895 ENSG00000227232
>> chr1 16856  A  G b38 1:16856:A:G     0.627822 ENSG00000227232
>> chr1 17005  A  G b38 1:17005:A:G     0.802803 ENSG00000227232",
>> header=TRUE,stringsAsFactors=FALSE)
>> asign<-read.table(text="gene  chr  chr_pos   pos p.val.Retina
>> ENSG00000227232 chr1           1:10177:A:AC 10177     0.381708
>> ENSG00000227232 chr1 rs145072688:10352:T:TA 10352     0.959523
>> ENSG00000227232 chr1            1:11008:C:G 11008     0.218132
>> ENSG00000227232 chr1            1:11012:C:G 11012     0.218132
>> ENSG00000227232 chr1            1:13110:G:A 13110     0.998262
>> ENSG00000227232 chr1  rs201725126:13116:T:G 13116     0.438572",
>> header=TRUE,stringsAsFactors=FALSE)
>> merge(l4, asign, by.x=c("X1", "X2"), by.y=c("chr", "pos"))
>>   [1] X1           X2           X3           X4           X5
>> [6] variant_id   pval_nominal gene_id.LCL  gene         chr_pos
>> [11] p.val.Retina
>> <0 rows> (or 0-length row.names)
>>
>> It works okay, but there are no matches in the join. So I can't even
>> guess what the problem is.
>>
>> Jim
>>
>> On Thu, Oct 24, 2019 at 9:33 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I have two data frames like this:
>>>
>>>> head(l4)
>>>      X1    X2 X3 X4  X5  variant_id pval_nominal     gene_id.LCL
>>> 1 chr1 13550  G  A b38 1:13550:G:A     0.375614 ENSG00000227232
>>> 2 chr1 14671  G  C b38 1:14671:G:C     0.474708 ENSG00000227232
>>> 3 chr1 14677  G  A b38 1:14677:G:A     0.699887 ENSG00000227232
>>> 4 chr1 16841  G  T b38 1:16841:G:T     0.127895 ENSG00000227232
>>> 5 chr1 16856  A  G b38 1:16856:A:G     0.627822 ENSG00000227232
>>> 6 chr1 17005  A  G b38 1:17005:A:G     0.802803 ENSG00000227232
>>>> head(asign)
>>>                gene  chr                chr_pos   pos p.val.Retina
>>> 1: ENSG00000227232 chr1           1:10177:A:AC 10177     0.381708
>>> 2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352     0.959523
>>> 3: ENSG00000227232 chr1            1:11008:C:G 11008     0.218132
>>> 4: ENSG00000227232 chr1            1:11012:C:G 11012     0.218132
>>> 5: ENSG00000227232 chr1            1:13110:G:A 13110     0.998262
>>> 6: ENSG00000227232 chr1  rs201725126:13116:T:G 13116     0.438572
>>>> m = merge(l4, asign, by.x=c("X1", "X2"), by.y=c("chr", "pos"))
>>> Error in merge.data.frame(l4, asign, by.x = c("X1", "X2"), by.y = c("chr",  :
>>>    negative length vectors are not allowed
>>>> sapply(l4,class)
>>>            X1           X2           X3           X4           X5   variant_id
>>>   "character"  "character"  "character"  "character"  "character"  "character"
>>> pval_nominal  gene_id.LCL
>>>     "numeric"  "character"
>>>> sapply(asign,class)
>>>          gene          chr      chr_pos          pos p.val.Retina
>>>   "character"  "character"  "character"  "character"  "character"
>>>
>>> Please advise as to why I am getting this error when merging?
>>>
>>> Thanks
>>> Ana
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.