[R] Efficient way to create new column based on comparison with another dataframe

Gaius Augustus gaiusjaugustus at gmail.com
Sun Jan 31 20:17:07 CET 2016


Thanks Denes,
I should have thought of foverlaps as an option.  I wonder how fast it is
compared to my solution!

My particular solution does not need data.table in order to work.  It just
loops through Chr.Arms (the chromosome arms, which always has 39 rows) and
assigns the proper arm to all rows of mapfile that lie between Start and
End on a particular Chr.  This is in contrast to my first solution, where I
was looping through mapfile (which could be millions of rows) and assigning
each row one at a time.  That's why I used data.frame.
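
For reference, here is a minimal base-R sketch of that loop.  The second
chromosome and the S4 row are made up purely for illustration (they are not
part of the original example), and Arm is pre-filled with NA so that any
position falling outside every interval stays NA.

```r
# Toy data; the Chr 2 rows and S4 are hypothetical additions
mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 1, 1, 2),
                      Position = c(3000, 6000, 1000, 700),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 4001),
                       End = c(5000, 10000, 4000, 9000),
                       stringsAsFactors = FALSE)

# Pre-allocate the new column so unmatched rows remain NA
mapfile$Arm <- NA_character_

# Loop over the small table; each pass is one vectorised comparison
# against the large table, so row order in mapfile is never changed
for (i in seq_len(nrow(Chr.Arms))) {
  cur.row <- Chr.Arms[i, ]
  hit <- mapfile$Chr == cur.row$Chr &
    mapfile$Position >= cur.row$Start &
    mapfile$Position <= cur.row$End
  mapfile$Arm[hit] <- cur.row$Arm
}

mapfile
```

The number of passes depends only on nrow(Chr.Arms), which is why this scales
so much better than looping over mapfile itself.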

For some reason, data.table was acting funny yesterday on the computer I
remote into, so I need to figure out why once I can get back on it.  Then
I want to time my solution against one using foverlaps to see which is faster.
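
A rough sketch of how that timing comparison might be set up, assuming
data.table is installed.  The simulated data, the size n, and the helper
names loop_assign() and overlap_assign() are all made up for illustration;
the real timings will depend on the machine and the actual mapfile.

```r
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical size; the real mapfile has millions of rows

# Simulated inputs: every position falls inside exactly one arm
Chr.Arms <- data.table(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 5001),
                       End = c(5000, 10000, 5000, 10000))
mapfile <- data.table(Name = paste0("S", seq_len(n)),
                      Chr = as.numeric(sample(1:2, n, replace = TRUE)),
                      Position = as.numeric(sample(0:10000, n, replace = TRUE)))

# Base-R version: loop over the rows of the small table
loop_assign <- function(mapfile, Chr.Arms) {
  df <- as.data.frame(mapfile)
  arms <- as.data.frame(Chr.Arms)
  df$Arm <- NA_character_
  for (i in seq_len(nrow(arms))) {
    cur.row <- arms[i, ]
    hit <- df$Chr == cur.row$Chr &
      df$Position >= cur.row$Start &
      df$Position <= cur.row$End
    df$Arm[hit] <- cur.row$Arm
  }
  df
}

# foverlaps version, following the steps Denes posted
overlap_assign <- function(mapfile, Chr.Arms) {
  x <- copy(mapfile)   # copy so the caller's tables are not modified
  y <- copy(Chr.Arms)
  x[, Position2 := Position]
  setkey(x, Chr, Position, Position2)
  setkey(y, Chr, Start, End)
  res <- foverlaps(x, y, type = "within")
  res[, Position2 := NULL]
  res
}

print(system.time(loop.res <- loop_assign(mapfile, Chr.Arms)))
print(system.time(fo.res <- overlap_assign(mapfile, Chr.Arms)))

# foverlaps returns rows in key order, so align by Name before comparing
same <- identical(loop.res$Arm,
                  fo.res$Arm[match(loop.res$Name, fo.res$Name)])
same
```

Checking that both approaches give the same Arm for every Name seems worth
doing before trusting either timing.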

Thanks,
Gaius

On Sun, Jan 31, 2016 at 2:17 AM, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:

> Hi,
>
> I have not followed this thread from the beginning, but have you tried the
> foverlaps() function from the data.table package?
>
> Something along the lines of:
>
> ---
> # create the tables (use as.data.table() or setDT() if you
> # start with a data.frame)
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
>                       Position = c(3000, 6000, 1000))
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
>                        Start = c(0, 5001), End = c(5000, 10000))
>
> # add a dummy variable to be able to define Position as an interval
> mapfile[, Position2 := Position]
>
> # add keys
> setkey(mapfile, Chr, Position, Position2)
> setkey(Chr.Arms, Chr, Start, End)
>
> # use data.table::foverlaps (see ?foverlaps)
> mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")
>
> # remove the dummy variable
> mapfile[, Position2 := NULL]
>
> # recreate original order
> setorder(mapfile, Chr, Name)
>
> ---
>
> BTW, there is a typo in your *SOLUTION*. I guess you wanted to write
> data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000,
> 1000), key = "Chr") instead of data.frame(Name = c("S1", "S2", "S3"), Chr =
> 1, Position = c(3000, 6000, 1000), key = "Chr").
>
> HTH,
>   Denes
>
>
>
> On 01/30/2016 07:48 PM, Gaius Augustus wrote:
>
>> I'll look into the Intervals idea.  The data.table code posted might not
>> work (because I don't believe it would put the rows in the correct order
>> if the chromosomes are interspersed); however, it did make me think about
>> possibly assigning based on values...
>>
>> *SOLUTION*
>>
>> mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position =
>> c(3000, 6000, 1000), key = "Chr")
>> Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End
>> = c(5000, 10000), key = "Chr")
>>
>> for(i in 1:nrow(Chr.Arms)){
>>    cur.row <- Chr.Arms[i, ]
>>    mapfile$Arm[ mapfile$Chr == cur.row$Chr & mapfile$Position >=
>> cur.row$Start & mapfile$Position <= cur.row$End] <- cur.row$Arm
>> }
>>
>> This took out the need for the intermediate table/vector.  This worked for
>> me, and was VERY fast.  Took <5 minutes on a dataframe with 35 million
>> rows.
>>
>> Thanks for the help,
>> Gaius
>>
>> On Sat, Jan 30, 2016 at 10:50 AM, Gaius Augustus
>> <gaiusjaugustus at gmail.com> wrote:
>>
>>> I'll look into the Intervals idea.  The data.table code posted might not
>>> work (because I don't believe it would put the rows in the correct order
>>> if the chromosomes are interspersed); however, it did make me think about
>>> possibly assigning based on values...
>>>
>>> Something like:
>>> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position =
>>> c(3000, 6000, 1000), key = "Chr")
>>> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
>>> End
>>> = c(5000, 10000), key = "Chr")
>>>
>>> for(i in 1:nrow(Chr.Arms)){
>>>    cur.row <- Chr.Arms[i, ]
>>>    mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start & Position <=
>>> cur.row$End] <- Chr.Arms$Arm
>>> }
>>>
>>> This might take out the need for the intermediate table/vector.  Not sure
>>> yet if it'll work, but we'll see.  I'm interested to know if anyone else
>>> has any ideas, too.
>>>
>>> Thanks,
>>> Gaius
>>>
>>> On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo
>>> <ulrik.stervbo at gmail.com> wrote:
>>>
>>>> Hi Gaius,
>>>>
>>>> Could you use data.table and loop over the small Chr.arms?
>>>>
>>>> library(data.table)
>>>> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position =
>>>> c(3000, 6000, 1000), key = "Chr")
>>>> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
>>>> End = c(5000, 10000), key = "Chr")
>>>>
>>>> Arms <- data.table()
>>>> for(i in 1:nrow(Chr.Arms)){
>>>>    cur.row <- Chr.Arms[i, ]
>>>>    # match on chromosome as well as position
>>>>    Arm <- mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start &
>>>>                    Position <= cur.row$End]
>>>>    Arm <- Arm[ , Arm:=cur.row$Arm][]
>>>>    Arms <- rbind(Arms, Arm)
>>>> }
>>>>
>>>> # Or use plyr to loop over each possible arm
>>>> library(plyr)
>>>> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
>>>>    mapfile <- mapfile[ Position >= cur.row$Start & Position <=
>>>> cur.row$End]
>>>>    mapfile <- mapfile[ , Arm:=cur.row$Arm][]
>>>>    return(mapfile)
>>>> }, mapfile = mapfile)
>>>>
>>>> I have just started to use the data.table and I have the feeling the
>>>> code
>>>> above can be greatly improved - maybe the loop can be dropped entirely?
>>>>
>>>> Hope this helps
>>>> Ulrik
>>>>
>>>> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
>>>> wrote:
>>>>
>>>>> I have two dataframes. One has chromosome arm information, and the other
>>>>> has SNP position information. I am trying to assign each SNP an arm
>>>>> identity.  I'd like to create this new column by comparing each SNP
>>>>> against the reference file.
>>>>>
>>>>> *1) Mapfile (has millions of rows)*
>>>>>
>>>>> Name    Chr   Position
>>>>> S1      1      3000
>>>>> S2      1      6000
>>>>> S3      1      1000
>>>>>
>>>>> *2) Chr.Arms   file (has 39 rows)*
>>>>>
>>>>> Chr    Arm    Start   End
>>>>> 1      p      0       5000
>>>>> 1      q      5001    10000
>>>>>
>>>>>
>>>>> *R Script that works, but slow:*
>>>>> Arms  <- c()
>>>>> for (line in 1:nrow(Mapfile)){
>>>>>        Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>>>>>   Mapfile$Position[line] > Chr.Arms$Start &  Mapfile$Position[line] <
>>>>> Chr.Arms$End]
>>>>> }
>>>>> Mapfile$Arm <- Arms
>>>>>
>>>>>
>>>>> *Output Table:*
>>>>>
>>>>> Name   Chr   Position   Arm
>>>>> S1      1     3000      p
>>>>> S2      1     6000      q
>>>>> S3      1     1000      p
>>>>>
>>>>>
>>>>> In words: for each line, I want to (1) find the right Chr, (2) find the
>>>>> row where START < POSITION < END, and then take the ARM value and place
>>>>> it in a new column.
>>>>>
>>>>> This R script works, but surely there is a more time/processing
>>>>> efficient way to do it.
>>>>>
>>>>> Thanks in advance for any help,
>>>>> Gaius
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>>
