[R] problem with merge

Don MacQueen macq at llnl.gov
Tue Mar 18 16:16:09 CET 2008


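Do any of your Entrez.Gene values appear in more than one row in
either of the data frames? merge() returns one row for every matching
pair of rows, so a value that is repeated in either data frame gives
you more rows than there are distinct shared values.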

For example:

>  dfa <- data.frame(a = c('A','A','B','C'), x = 1:4)
>  dfb <- data.frame(a = c('A','B','B','D'), y = 1:4)
>
>  merge(dfa,dfb)
   a x y
1 A 1 1
2 A 2 1
3 B 3 2
4 B 3 3

>  length(intersect(dfa$a, dfb$a))
[1] 2
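
To check this on your own data, something along these lines might help. It is only a sketch using the same toy data frames as in the example (stringsAsFactors = FALSE is added so the key column is character; dup_a, dup_b, shared, n_a, n_b and expected_rows are names made up for the illustration). It lists the keys that repeat, then works out how many rows merge() should return: the sum over shared keys of (count in dfa) * (count in dfb).

```r
# Toy data frames from the example; keys kept as character, not factor
dfa <- data.frame(a = c("A", "A", "B", "C"), x = 1:4, stringsAsFactors = FALSE)
dfb <- data.frame(a = c("A", "B", "B", "D"), y = 1:4, stringsAsFactors = FALSE)

# Keys that occur in more than one row of each data frame
dup_a <- unique(dfa$a[duplicated(dfa$a)])
dup_b <- unique(dfb$a[duplicated(dfb$a)])

# Keys present in both data frames
shared <- intersect(dfa$a, dfb$a)

# How many times each shared key occurs in each data frame
n_a <- as.vector(table(dfa$a)[shared])
n_b <- as.vector(table(dfb$a)[shared])

# Every row of dfa pairs with every row of dfb that has the same key
expected_rows <- sum(n_a * n_b)

merged <- merge(dfa, dfb, by = "a")
stopifnot(nrow(merged) == expected_rows)
```

With your data, substitute TL01.data and LC01.data for dfa and dfb, and Entrez.Gene for a. If one row per gene is what you want, dropping repeats before merging, e.g. dfa[!duplicated(dfa$a), ], is one option, but which of the repeated rows to keep is a decision only you can make.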


-Don

At 8:40 PM -0400 3/17/08, Mark W Kimpel wrote:
>I have used merge regularly and thought I understood how it worked, but
>I must not. I have two dataframes with identical colnames from two
>different experiments, TL01 and LC01. Each dataframe has a column named
>"Entrez.Gene", which I have converted to "as.character" just to make
>sure merge is not looking at factor levels. Because I have done some
>filtering, the Entrez.Gene values in each experiment overlap but are not
>identical. I want to produce a summary report with only those
>identifiers found in each experiment. I could do this with intersect and
>matching, but I thought merge could easily do this.
>
>Below is my code and sessionInfo. For some reason there are over twice
>as many rows as I would expect. I can't quite figure out which arguments
>I have screwed up. What am I missing? It has to be something simple, I'm
>just not seeing it.  Thanks, Mark
>
>  > TL01.LC01.data <- merge(TL01.data, LC01.data, by = "Entrez.Gene",
>all.x = FALSE, all.y = FALSE, suffixes = c(".TL01",".LC01"))
>  > length(intersect(TL01.data$Entrez.Gene, LC01.data$Entrez.Gene))
>[1] 13401
>  > dim(TL01.LC01.data)
>[1] 29471    57
>  > dim(TL01.data)
>[1] 16479    29
>  > dim(LC01.data)
>[1] 16479    29
>--
>  > sessionInfo()
>R version 2.7.0 Under development (unstable) (2008-03-05 r44683)
>x86_64-unknown-linux-gnu
>
>locale:
>LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
>attached base packages:
>[1] splines   tools     stats     graphics  grDevices datasets  utils
>[8] methods   base
>
>other attached packages:
> [1] affycoretools_1.11.4 annaffy_1.11.5       KEGG.db_2.1.3
> [4] gcrma_2.11.4         matchprobes_1.11.1   biomaRt_1.13.9
> [7] RCurl_0.8-3          GOstats_2.5.2        Category_2.5.7
>[10] genefilter_1.17.12   survival_2.34        RBGL_1.15.7
>[13] annotate_1.17.11     xtable_1.5-2         GO.db_2.1.3
>[16] AnnotationDbi_1.1.26 RSQLite_0.6-8        DBI_0.2-4
>[19] graph_1.17.17        limma_2.13.6         affy_1.17.9
>[22] preprocessCore_1.1.5 affyio_1.7.15        Biobase_1.99.2
>
>loaded via a namespace (and not attached):
>[1] cluster_1.11.10 XML_1.93-2
>
>Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
>Indiana University School of Medicine
>
>15032 Hunter Court, Westfield, IN  46074
>
>(317) 490-5129 Work, & Mobile & VoiceMail
>(317) 204-4202 Home (no voice mail please)
>
>mwkimpel<at>gmail<dot>com
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062


