[BioC] Determining an overlapping annotation data subset (overlap/overlaps)

Herve Pages hpages at fhcrc.org
Tue Aug 7 03:01:34 CEST 2007


Herve Pages wrote:
> Hi Stephen,
> 
>> A <- data.frame(start=(1:5)*10L, end=(4:8)*10L)
>> A
>   start end
> 1    10  40
> 2    20  50
> 3    30  60
> 4    40  70
> 5    50  80
> 
>> B <- data.frame(start=c(31L, 39L, 80L), end=c(60L, 40L, 84L))
>> B
>   start end
> 1    31  60
> 2    39  40
> 3    80  84
> 
> You can create a logical vector with one element per row of A: for each
> A-row it says whether any B-row lies entirely inside it:
> 
>   contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
>                             A$start, A$end)

This will be TRUE for A-rows that have at least 1 B-row within their limits.
To select the A-rows that _overlap_ with at least 1 B-row, use:

  contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$end & B$start <= Aend),
                            A$start, A$end)
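
To see how the two tests differ on the example data above, here is a minimal
sketch that runs both side by side. The name 'overlaps_a_Brow' is introduced
here only to keep the two results apart; it does not appear in the original
code. Intervals are assumed to be closed, so ranges that merely touch at an
endpoint count as overlapping.

```r
# Same example data as in the quoted message.
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# Containment: at least one B-row lies entirely within the A-row.
contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
                          A$start, A$end)

# Overlap: at least one B-row shares a position with the A-row.
# ('overlaps_a_Brow' is an illustrative name, not from the original post.)
overlaps_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$end & B$start <= Aend),
                          A$start, A$end)

contains_a_Brow        # TRUE TRUE TRUE FALSE FALSE
overlaps_a_Brow        # TRUE TRUE TRUE TRUE TRUE

A[contains_a_Brow, ]   # rows 1 to 3 of A
A[overlaps_a_Brow, ]   # all five rows of A
```

On this data the containment test drops A-rows 4 and 5, since no B-row fits
completely inside them, while the overlap test keeps every row because B-row 1
(31 to 60) intersects all five A intervals.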

H.


> 
> Then use this logical vector to subset A:
> 
>   A[contains_a_Brow, ]
> 
> Cheers,
> H.
> 
> Stephen Montgomery wrote:
>> Hello Bioconductor -
>>
>> Apologies as this is a fairly rookie bioinformatics based R question, but I
>> am trying to determine if there is a R one-liner to extract a subset of
>> a data frame which possesses annotation contained within it that has
>> been stored in another data frame?  (For example extracting genomic
>> intervals which contain certain features/annotation)
>>
>> Such that:
>> If I have dataframe "A" possessing an "id", "start", and "end"; And
>> dataframe "B" also possessing an "id", "start", and "end"; The output is
>> all the rows of A which contain an entry of B (B$start, B$end) within
>> A$start and A$end.
>>
>> I have tried my own fairly uninformed variants like this to no avail
>> A[length(B[B$start <= A$end & B$end >= A$start]) > 0,]
>> I fear the solution will be trivial but as yet it has eluded me. :/
>>
>> Thanks for any help!  (Theoretically, I can also see doing this in its
>> own function by creating a vector of counts for each member of "A" and
>> then reporting those that are non-zero but I was wondering if there was
>> a more succinct and likely efficient way)
>>
>> Thanks again,
>> Stephen
>>
>>
>>
>> Stephen Montgomery, B.A.Sc., Ph.D.
>> Postdoctoral Researcher, Team 16
>> Wellcome Trust Sanger Institute
>> Hinxton, Cambridge CB10 1SA
>> Phone: 44-1223-834244 (ext 7297)
>> Skype: stephen.b.montgomery
>>  
>>
>>
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor