[R] Efficient way to create new column based on comparison with another dataframe

Ulrik Stervbo ulrik.stervbo at gmail.com
Sat Jan 30 07:34:45 CET 2016


Hi Gaius,

Could you use data.table and loop over the rows of the small Chr.Arms table
instead?

library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

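# Collect the SNPs of each arm in a list, then stack them once at the end
Arms <- vector("list", nrow(Chr.Arms))
for(i in seq_len(nrow(Chr.Arms))){
  cur.row <- Chr.Arms[i, ]
  # Match the chromosome as well as the position range
  Arm <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start & Position <= cur.row$End]
  Arms[[i]] <- Arm[ , Arm := cur.row$Arm]
}
Arms <- rbindlist(Arms)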

# Or use plyr to loop over each chromosome arm
library(plyr)
Arms <- ddply(Chr.Arms, .variables = c("Chr", "Arm"), function(cur.row, mapfile){
  # Match the chromosome as well as the position range
  mapfile <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start & Position <= cur.row$End]
  mapfile <- mapfile[ , Arm := cur.row$Arm][]
  return(mapfile)
}, mapfile = mapfile)

I have only just started using data.table, and I suspect the code above can
be improved considerably. Perhaps the loop can be dropped entirely?
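
One way the loop might be dropped is a non-equi update join. This is only a
sketch, and it assumes a data.table version that supports non-equi joins in
`on=` (1.9.8 or later, as far as I know). The toy tables are rebuilt here
without keys so the block runs on its own; the closed interval
Start <= Position <= End is also an assumption about how the arm boundaries
should be treated.

```r
library(data.table)

# Same toy data as above; no key is needed for this approach
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# Update join: for each row of Chr.Arms, find the SNPs on the same
# chromosome whose Position lies within [Start, End], and write that
# arm's label into a new column of mapfile by reference.
# i.Arm refers to the Arm column of Chr.Arms (the "i" table).
mapfile[Chr.Arms, Arm := i.Arm,
        on = .(Chr, Position >= Start, Position <= End)]

print(mapfile)
```

SNPs that fall outside every arm interval would be left with NA in Arm, and
the row order of mapfile is unchanged because the column is added in place.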

Hope this helps
Ulrik

On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
wrote:

> I have two dataframes. One has chromosome arm information, and the other
> has SNP position information. I am trying to assign each SNP an arm
> identity.  I'd like to create this new column based on comparing it to the
> reference file.
>
> *1) Mapfile (has millions of rows)*
>
> Name    Chr   Position
> S1      1      3000
> S2      1      6000
> S3      1      1000
>
> *2) Chr.Arms   file (has 39 rows)*
>
> Chr    Arm    Start   End
> 1      p      0       5000
> 1      q      5001    10000
>
>
> *R Script that works, but slow:*
> Arms  <- c()
> for (line in 1:nrow(Mapfile)){
>       Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>  Mapfile$Position[line] > Chr.Arms$Start &  Mapfile$Position[line] <
> Chr.Arms$End]
> }
> Mapfile$Arm <- Arms
>
>
> *Output Table:*
>
> Name   Chr   Position   Arm
> S1      1     3000      p
> S2      1     6000      q
> S3      1     1000      p
>
>
> In words: I want each line to look up the location (1) find the right Chr,
> 2) find the line where START < POSITION < END), then get the ARM
> information and place it in a new column.
>
> This R script works, but surely there is a more time/processing efficient
> way to do it.
>
> Thanks in advance for any help,
> Gaius
