[R] create list of names where two df contain == values

Rob Griffin robgriffin247 at hotmail.com
Wed Nov 16 16:35:29 CET 2011


OK, thanks for looking into this so far. I seem to have confused you all a 
little, though, so I think I need to make this a bit clearer:

In the real situation:
df.1 is 271*13891 and contains (amongst others) columns with Flybase.CG, 
rMF, and Affyid values.
df.2 is 14*12572 and is made from a subset of df.1 with the rows that had 
duplicated Flybase.CG values removed; df.2 also includes the rMF column.
Because df.2 is made from the non-duplicated values, it is shorter.

I now need to put the Affyid column from df.1 into df.2.

My idea is:
to match on a value that is unique to each row (within its column) but 
shared between both datasets (rMF contains such numbers),
then get R to copy the corresponding Affyid value (an alphanumeric id) from 
df.1 and place it in df.2$Affy (or at least into a list which I could then 
put into a column) for all "shared" rMF values, ignoring all others.

For example, df.1 and df.2 both contain the rMF value 0.3393211, which 
corresponds to the same data point, which in df.1 has this Affyid: 1638273_at

If you imagine the two rMF columns lined up next to each other, they start 
the same and run in the same order, but df.2's has had "random" points 
removed (as was the aim of making df.2), so from the first removed point 
onwards the two lists no longer line up.
What R needs to do is go down the df.2 rMF list one by one and, for each 
df.2 rMF, check the entire df.1 rMF list for a match, then take the 
corresponding Affyid.

Again: the rMF value 0.3393211 appears in both df.1 and df.2 and 
corresponds to the same sample point (Affyid 1638273_at in df.1), but it 
occurs on different rows in the two data frames.

Is that a bit clearer? I know this is pretty complex.

David, your idea with ifelse worked for the first few lines, but as soon as 
it got to a point where one of the Flybase.CG values had been removed during 
the process of making df.2, the two data frames got out of line and it just 
gave NA from there on.
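
The row-by-row lookup described above (for each df.2 rMF, search all of 
df.1's rMF and take the corresponding Affyid) might be sketched with base 
R's match(), which returns positions rather than comparing rows side by 
side. This is only a sketch on invented toy data: the values and the small 
data frames below are assumptions for illustration, not the real dataset. 
It also assumes the rMF values in df.2 were copied unchanged from df.1, so 
exact matching is safe.

```r
# Toy stand-ins for the real data; all values here are invented.
df.1 <- data.frame(
  Flybase.CG = c("CG1", "CG2", "CG2", "CG3", "CG4", "CG4"),
  rMF        = c(0.11, 0.22, 0.33, 0.44, 0.55, 0.66),
  Affyid     = c("100_at", "101_at", "102_at", "103_at", "104_at", "105_at"),
  stringsAsFactors = FALSE
)

# df.2: rows with a duplicated Flybase.CG dropped, Affyid column left out
df.2 <- df.1[!duplicated(df.1$Flybase.CG), c("Flybase.CG", "rMF")]

# For each df.2 row, the position of the same rMF value in df.1
# (NA where df.1 has no such value)
idx <- match(df.2$rMF, df.1$rMF)

# Copy the Affyid found at those positions into df.2
df.2$Affy <- df.1$Affyid[idx]
df.2
```

Because match() works on positions, it does not matter that the two data 
frames have different numbers of rows. If the rMF values had been 
recomputed or rounded separately in the two data frames, exact matching 
could miss pairs, and matching on a non-numeric key would be safer.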


Rob





-----Original Message----- 
From: Dennis Murphy
Sent: Wednesday, November 16, 2011 4:03 PM
To: Rob Griffin
Cc: r-help at r-project.org
Subject: Re: [R] create list of names where two df contain == values

Hi:

I think you're overthinking this problem. As is usually the case in R,
a vectorized solution is clearer and provides more easily understood
code.

It's not obvious to me exactly what you want, so we'll try a couple of
variations on the same idea. Equality of floating point numbers is a
difficult computational problem (see R FAQ 7.31), but if it makes
sense to define a threshold difference between floating numbers that
practically equates to zero, then you're in business. In your example,
the difference in numb1 for letter h in the two data frames is far
from zero, so define 'equal' to be an absolute difference < 10^-6. Then:

# Return the entire matching data frame
df.1[abs(df.1$numb1 - df.2$numb1) < 0.000001, ]
   Letters     numb1 extra.col    id
1        a 0.3735462         1 CG234
2        b 1.1836433         2 CG232
3        c 0.1643714         3 CG441
4        d 2.5952808         4 CG128
5        e 1.3295078         5 CG125
6        f 0.1795316         6 CG182
7        g 1.4874291         7 CG982
9        i 1.5757814         9 CG282
10       j 0.6946116        10 CG154

# Return the matching letters only as a vector:
df.1[abs(df.1$numb1 - df.2$numb1) < 0.000001, 'Letters' ]

If you want the latter object to remain a data frame, use drop = FALSE
as an extra argument after 'Letters'. If you want to create a list
object such that each letter comprises a different list component,
then the following will do - the as.character() part coerces the
factor Letters into a character object:

as.list(as.character(df.1[abs(df.1$numb1 - df.2$numb1) < 0.000001,
             'Letters' ]))
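
As a small illustration of the floating point issue mentioned above (R FAQ 
7.31), the following sketch compares two numbers that look equal when 
printed but are stored differently. The variable names x and y are only 
for illustration; the tolerance 1e-6 is the same threshold used above.

```r
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison fails because of binary rounding error
exact <- x == y

# Comparison with an explicit threshold, as in the code above
within_tol <- abs(x - y) < 1e-6

# all.equal() applies a default tolerance; wrap it in isTRUE() because it
# returns a character description rather than FALSE when values differ
near <- isTRUE(all.equal(x, y))

c(exact = exact, within_tol = within_tol, near = near)
```

Note that the elementwise comparison df.1$numb1 - df.2$numb1 assumes both 
data frames have the same number of rows in the same order; with vectors 
of different lengths R recycles the shorter one, so the result would not 
be a row-for-row comparison.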

HTH,
Dennis


On Wed, Nov 16, 2011 at 5:03 AM, Rob Griffin <robgriffin247 at hotmail.com> 
wrote:
> Hello again... sorry to be posting yet again, but I hadn't anticipated
> this problem.
>
> I am now trying to put the names found in one column of data frame 1 (let's
> call it df.1[,1]) into a list, taken from the rows where the values in
> df.1[,2] match values in a column of another data frame (df.2[3]).
> I tried to write this function so that it put the list of names (called
> Iffy) where the 2 criteria (df.1[141] and df.2[21]) matched, but I think
> it's too complex for a beginner R-enthusiast.
>
> ify<-function(x,y,a,b,c) if(x[[,a]]==y[[,b]]) {list(x[[,c]])} else {NULL}
> Iffy<-apply(  df.1,  1,  FUN=ify,  x=df.1,  y=df.2,  a=2,  b=3,  c=1  )
>
> But this didn't work... Error in FUN(newX[, i], ...) : unused argument(s)
> (newX[, i])
>
>
> Here is a dataset that replicates the problem, you'll notice the "h"
> criteria values are different between the two dataframes and therefore it
> would produce a list  of the 9 letters where the two criteria columns
> matched (a,b,c,d,e,f,g,i,j):
>
>
>
> df.1<-data.frame(rep(letters[1:10]))
> colnames(df.1)[1]<-("Letters")
> set.seed(1)
> df.1$numb1<-rnorm(10,1,1)
> df.1$extra.col<-c(1,2,3,4,5,6,7,8,9,10)
> df.1$id<-c("CG234","CG232","CG441","CG128","CG125","CG182","CG982","CG541","CG282","CG154")
> df.1
>
> df.2<-data.frame(rep(letters[1:10]))
> colnames(df.2)[1]<-("Letters")
> set.seed(1)
> df.2$extra.col<-c(1,2,3,4,5,6,7,8,9,10)
> df.2$numb1<-rnorm(10,1,1)
> df.2$id<-c("CG234","CG232","CG441","CG128","CG125","CG182","CG982","CG541","CG282","CG154")
> df.2[8,3]<-12
>
> df.1
> df.2
>
>
>
>
> Your patience is much appreciated,
> Rob
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


