[R] Form groups of lines and select specific values

arun smartpink111 at yahoo.com
Wed Feb 12 06:37:27 CET 2014


Hi,

I'm not sure I understand the question correctly, but
maybe this helps:

dat <- read.table(text="type1      chrx         startx          endx          chry           starty      endy         type2
gain_765   chr15     9681969       9685418      chr15        9660912    9712719    loss_1136
gain_766   chr15     9706682       9852347      chr15        9660912    9712719    loss_1136
gain_766   chr15     9706682       9852347      chr15        9765125    9863990    loss_765
gain_780   chr20     9706682       9852347      ch20         9765125    9863990    loss_769
gain_760   chr15     9706682       9852347      chr15        9660912    9712719    loss_1137
gain_760   chr15     9706682       9852347      chr15        9765125    9863990    loss_763",sep="",header=TRUE,stringsAsFactors=FALSE) 
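# Treat each run of identical consecutive "chry" values as one group
indx <- rle(dat$chry)$lengths
indx1 <- cumsum(indx)        # last row of each group
indx2 <- indx1 - (indx - 1)  # first row of each group
# "chrx" taken from the first row of each group, as requested
chr <- dat$chrx[indx2]
# start: smaller of startx/starty on the first row of the group
start <- pmin(dat$startx[indx2], dat$starty[indx2])
# end: larger of endx/endy on the last row of the group
end <- pmax(dat$endx[indx1], dat$endy[indx1])
dat2 <- data.frame(chr, start, end, stringsAsFactors=FALSE)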

dat2
#    chr   start     end
#1 chr15 9660912 9863990
#2 chr20 9706682 9863990
#3 chr15 9660912 9863990
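
The code above reads the start from the first row and the end from the
last row of each run. If the extremes can fall on other rows of a group,
a variation along the following lines might be safer. It is only a
sketch: the names run_id and res are my own, the data frame is rebuilt
here so the block runs on its own, and groups are still assumed to be
runs of consecutive identical "chry" values.

```r
# Example data from the thread, rebuilt so this block is self-contained
dat <- data.frame(
  chrx   = c("chr15", "chr15", "chr15", "chr20", "chr15", "chr15"),
  startx = c(9681969, 9706682, 9706682, 9706682, 9706682, 9706682),
  endx   = c(9685418, 9852347, 9852347, 9852347, 9852347, 9852347),
  chry   = c("chr15", "chr15", "chr15", "ch20", "chr15", "chr15"),
  starty = c(9660912, 9660912, 9765125, 9765125, 9660912, 9765125),
  endy   = c(9712719, 9712719, 9863990, 9863990, 9712719, 9863990),
  stringsAsFactors = FALSE
)

# One id per run of consecutive identical "chry" values (assumed grouping)
runs   <- rle(dat$chry)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

res <- data.frame(
  # "chrx" from the first row of each run
  chr   = dat$chrx[!duplicated(run_id)],
  # smallest start and largest end over ALL rows of the run
  start = as.vector(tapply(pmin(dat$startx, dat$starty), run_id, min)),
  end   = as.vector(tapply(pmax(dat$endx, dat$endy), run_id, max)),
  stringsAsFactors = FALSE
)
res
```

For the example data this gives the same three rows as dat2.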
A.K.



I would like to form groups of lines based on the interconnection (in 
both directions) between the "type1" column and the "type2" column. The 
logic is: if a string in "type1" is on the same line as a string in 
"type2", they are in the same group. Likewise, if a "type2" string 
appears on more than one line, all of those lines are in the same group. 

Please take a look at the first 3 lines: "gain_765" and 
"loss_1136" are related. However, "loss_1136" is also related to 
"gain_766", and subsequently "gain_766" is related to "loss_765". So 
this is my group: 1- "gain_765", 2- "loss_1136", 3- "gain_766", 
4- "loss_765". 
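
The grouping rule described above could be sketched as a small label
propagation: every line starts in its own group, and lines that share a
"type1" or a "type2" string repeatedly take the smallest group id among
them until nothing changes. This is only an illustration, not a tested
solution for the full data set. group_rows() is a hypothetical helper
name, and the sketch assumes that "type1" and "type2" never contain the
same string (gain_* versus loss_* in the example).

```r
# The two label columns from the example data
type1 <- c("gain_765", "gain_766", "gain_766",
           "gain_780", "gain_760", "gain_760")
type2 <- c("loss_1136", "loss_1136", "loss_765",
           "loss_769", "loss_1137", "loss_763")

# Hypothetical helper: returns one group id per line
group_rows <- function(a, b) {
  grp <- seq_along(a)                 # every line starts in its own group
  repeat {
    old <- grp
    grp <- ave(grp, a, FUN = min)     # lines sharing a type1 string merge
    grp <- ave(grp, b, FUN = min)     # lines sharing a type2 string merge
    if (all(grp == old)) break        # stop once no id changed
  }
  match(grp, unique(grp))             # renumber as 1, 2, 3, ...
}

grp <- group_rows(type1, type2)
grp
```

The resulting ids put lines 1 to 3 in one group, line 4 in a second and
lines 5 and 6 in a third. They could then be passed to tapply() or
aggregate() to take the minimum start and maximum end per group.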

Inside this group I want to make a new line with the string in 
"chrx" from the first line of the group, the lowest value in "startx" 
and "starty", and the largest value in "endx" and "endy". Here is an 
example of my data: 

 type1      chrx         startx          endx          chry           starty      endy         type2 
gain_765   chr15     9681969       9685418      chr15        9660912    9712719    loss_1136 
gain_766   chr15     9706682       9852347      chr15        9660912    9712719    loss_1136 
gain_766   chr15     9706682       9852347      chr15        9765125    9863990    loss_765 
gain_780   chr20     9706682       9852347      ch20         9765125    9863990    loss_769 
gain_760   chr15     9706682       9852347      chr15        9660912    9712719    loss_1137 
gain_760   chr15     9706682       9852347      chr15        9765125    9863990    loss_763 

For the first group (lines 1 to 3), this is the expected output: 

     chr       start        end 
   chr15    9660912   9863990 

Now, please take a look at line 4: "gain_780" is related only 
to "loss_769". For this group (just line 4) the expected output 
is: 

     chr         start        end 
    chr20     9706682   9863990 

For lines 5 and 6, the group is formed by "gain_760", "loss_1137" and "loss_763". In this last case the expected output is: 

     chr         start         end 
    chr15     9660912   9863990 

However, I have many such cases across thousands of lines. Therefore, I need all the results in a single output, like this: 

      chr       start         end 
     chr15    9660912   9863990 
     chr20    9706682   9863990 
     chr15    9660912   9863990 

Cheers.


