[BioC] Problem using the %in% command
mtmorgan at fhcrc.org
Wed Feb 20 18:29:19 CET 2008
Hi Paul -- I saw this on the R mailing list, too. Such
'cross-posting' is discouraged (though in this case you get answers
that you wouldn't have got if you'd restricted yourself to just one
list).
I wonder if your problem was splitting the 'genes' character string
with "," rather than ", *" or ",[[:blank:]]*" ? Whatever the case, if
you have a data frame (from Jim Holtman's reply on the R list)
> func_gen
   Function                             x
1 Function1 gene5, gene19, gene22, gene23
2 Function2          gene1, gene7, gene19
3 Function3   gene2, gene3, gene7, gene23
I would have created a named list associating function and gene name:
> fids <- sapply(func_gen[["x"]], strsplit, ",[[:blank:]]*")
> names(fids) <- func_gen[["Function"]]
and converted this to an incidence matrix:
> uids <- unique(unlist(fids))
> incidence <- sapply(fids, "%in%", x=uids)
> rownames(incidence) <- uids
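The base-R steps above can be tried end to end with a small sketch. The data frame here is a hypothetical reconstruction (the column names "Function" and "x" follow the code above; stringsAsFactors = FALSE is an assumption so that strsplit sees character data), and 'naive' is an illustrative name. It also shows why splitting on "," alone could make every %in% test come back FALSE:

```r
# Hypothetical reconstruction of the example data; only the column
# names are taken from the code above.
func_gen <- data.frame(
    Function = c("Function1", "Function2", "Function3"),
    x = c("gene5, gene19, gene22, gene23",
          "gene1, gene7, gene19",
          "gene2, gene3, gene7, gene23"),
    stringsAsFactors = FALSE
)

# Splitting on "," alone keeps the leading blank on every gene but the first
naive <- strsplit(func_gen[["x"]], ",")
naive[[2]]                        # "gene1"   " gene7"  " gene19"
"gene7" %in% naive[[2]]           # FALSE, because " gene7" != "gene7"

# Splitting on a comma plus any following blanks gives clean identifiers
fids <- strsplit(func_gen[["x"]], ",[[:blank:]]*")
names(fids) <- func_gen[["Function"]]
"gene7" %in% fids[["Function2"]]  # TRUE

# One logical column per function, one row per unique gene
uids <- unique(unlist(fids))
incidence <- sapply(fids, function(ids) uids %in% ids)
rownames(incidence) <- uids
```

Calling strsplit directly on the character column is a slight variation on the sapply call above; it returns the same kind of list, one character vector per function.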
Since these seem like gene sets, and your work flow might continue
along these lines, it might be convenient to represent your data as a
gene set collection
> library(GSEABase)
> gs <- mapply(GeneSet, fids, setName=names(fids))
> gsc <- GeneSetCollection(gs)
and then let the package do the clever operation
> incidence(gsc)
          gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
Function1     1      1      1      1     0     0     0     0
Function2     0      1      0      0     1     1     0     0
Function3     0      0      0      1     0     1     1     1
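The table above has functions as rows and genes as columns, with 0/1 entries, whereas the base-R matrix built earlier is logical with genes as rows. A minimal sketch of getting the same layout without any package, assuming the three gene lists from the example (the names 'by_gene' and 'by_function' are illustrative only):

```r
# Gene lists as in the example; typed in directly so the block stands alone
fids <- list(
    Function1 = c("gene5", "gene19", "gene22", "gene23"),
    Function2 = c("gene1", "gene7", "gene19"),
    Function3 = c("gene2", "gene3", "gene7", "gene23")
)
uids <- unique(unlist(fids))

# vapply pins the result to one logical vector of length(uids) per function
by_gene <- vapply(fids, function(ids) uids %in% ids, logical(length(uids)))
rownames(by_gene) <- uids

# Transpose to functions-by-genes and turn TRUE/FALSE into 1/0
by_function <- t(by_gene)
storage.mode(by_function) <- "integer"
by_function
```

Setting storage.mode keeps the dimnames, so the printed result should carry the same row and column labels as the table above.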
Paul Christoph Schröder <pschrode at alumni.unav.es> writes:
> Hello all!
> I have the following problem with the %in% command:
> 1) I have a data frame that consists of functions (rows) and genes
> (columns). The whole has been loaded with the "read.delim" command
> because of gene-duplications between the different rows.
> 2) Now, there is another data frame that contains all the genes (only
> the genes and without duplicates) from all the functions of the above
> data frame.
> What I want to do now is to use the "%in%" command to obtain a
> TRUE-FALSE data frame. This should be a data frame where, for every
> function, each gene is TRUE or FALSE depending on whether it belongs
> to that function when matched against the "all genes"
> data frame.
> The main problem I have is the way the genes are stored in the first data
> frame. I used the "unlist" command to separate them at the commas ",".
> But every time I match the first data frame against the second, it
> returns FALSE for every gene in every function.
> Can anyone please give me a hint on how to handle the problem?
> Thank you very much in advance!
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793