[BioC] Problem using the %in% command

Wed Feb 20 18:29:19 CET 2008

Hi Paul -- I saw this on the R mailing list, too.  Such
'cross-posting' is discouraged (though in this case you get answers
that you wouldn't have got if you'd restricted yourself to just one
list!)

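I wonder if your problem was splitting the 'genes' character string
with "," rather than ", *" or ",[[:blank:]]*"? Splitting on "," alone
would leave a leading blank on every gene after the first (" gene19"
rather than "gene19"), so %in% would find no exact match. Whatever the
case, if you have a data frame (from Jim Holtman's reply on the R list)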

> func_gen
   Function                             x
1 Function1 gene5, gene19, gene22, gene23
2 Function2          gene1, gene7, gene19
3 Function3   gene2, gene3, gene7, gene23

I would have created a named list associating function and gene name:

> fids <- strsplit(as.character(func_gen[["x"]]), ",[[:blank:]]*")
> names(fids) <- func_gen[["Function"]]

and converted this to an incidence matrix:

> uids <- unique(unlist(fids))
> incidence <- sapply(fids, "%in%", x=uids)
> rownames(incidence) <- uids
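
The steps above can be tried end to end with a minimal base R sketch.
The data frame is rebuilt by hand from the printout (an assumption
about how it was created; stringsAsFactors = FALSE is set so that
strsplit() receives character input), and the 'bad' and 'incidence_df'
names are only for illustration. It also shows the effect of splitting
on a bare ",":

```r
# Rebuild the example data frame by hand (assumed to match the printout).
func_gen <- data.frame(
    Function = c("Function1", "Function2", "Function3"),
    x = c("gene5, gene19, gene22, gene23",
          "gene1, gene7, gene19",
          "gene2, gene3, gene7, gene23"),
    stringsAsFactors = FALSE)

# Splitting on "," alone keeps the blank that follows each comma,
# so " gene19" is not the same string as "gene19".
bad <- strsplit(func_gen[["x"]], ",")
"gene19" %in% bad[[1]]

# Splitting on a comma plus optional blanks removes the padding.
fids <- strsplit(func_gen[["x"]], ",[[:blank:]]*")
names(fids) <- func_gen[["Function"]]
"gene19" %in% fids[[1]]

# Logical incidence matrix: one row per gene, one column per function.
uids <- unique(unlist(fids))
incidence <- sapply(fids, "%in%", x = uids)
rownames(incidence) <- uids

# The same information as a TRUE/FALSE data frame.
incidence_df <- as.data.frame(incidence)
```

The first %in% test should give FALSE and the second TRUE, which is
the difference between the two ways of splitting.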

Since these seem like gene sets, and your work flow might continue
along these lines, it might be convenient to represent your data as a
gene set collection

> library(GSEABase)
> gs <- mapply(GeneSet, fids, setName=names(fids))
> gsc <- GeneSetCollection(gs)

and then let the package do the clever operation

> incidence(gsc)
          gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
Function1     1      1      1      1     0     0     0     0
Function2     0      1      0      0     1     1     0     0
Function3     0      0      0      1     0     1     1     1
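
If GSEABase is not at hand, a matrix with the same layout (functions as
rows, genes as columns, 1/0 entries) can be sketched in base R. This is
only an illustration that mirrors the printout above, not how the
package computes it; the 'by_function' name is made up for the example:

```r
# Gene sets as a named list (copied from the example data above).
fids <- list(
    Function1 = c("gene5", "gene19", "gene22", "gene23"),
    Function2 = c("gene1", "gene7", "gene19"),
    Function3 = c("gene2", "gene3", "gene7", "gene23"))
uids <- unique(unlist(fids))

# sapply() gives one column per function; t() turns that into one row
# per function. as.integer() converts TRUE/FALSE to 1/0.
by_function <- t(sapply(fids, function(ids) as.integer(uids %in% ids)))
colnames(by_function) <- uids
by_function
```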

Martin

Paul Christoph Schröder <pschrode at alumni.unav.es> writes:

> Hello all!
>
> I have the following problem with the %in% command:
>  
> 1) I have a data frame that consists of functions (rows) and genes 
> (columns). The whole has been loaded with the "read.delim" command 
> because of gene-duplications between the different rows.
> 2) Now, there is another data frame that contains all the genes (only 
> the genes and without duplicates) from all the functions of the above 
> data frame.
>
> What I want to do now is to use the "%in%" command to obtain a
> TRUE-FALSE data frame. This should be a data frame where, for every
> function, some genes are TRUE and some are FALSE, depending on whether
> or not they were in the specific function when matched against the
> "all genes" data frame.
>
> The main problem I have is the way the genes are stored in the first
> data frame. I used the "unlist" command to separate them at the commas
> ",". But every time I do the match between the first and second data
> frame it returns FALSE for every gene in every function.
>
> Can anyone please give me a hint on how to handle the problem?
> Thank you very much in advance!
>
> Paul

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center