[BioC] Problem using the %in% command

Oleg Sklyar osklyar at ebi.ac.uk
Wed Feb 20 19:08:22 CET 2008


One option is to split on space (" "), then paste the pieces back 
together with collapse="" and then split on ",". This removes all 
spaces, whether before or after a comma, in one go. Oleg
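
A minimal sketch of those three steps in base R, for illustration only. 
The input string `x` is made up, and `no_spaces` and `genes` are just 
names chosen for this example:

```r
# Hypothetical input with inconsistent spacing around the commas
x <- "gene5, gene19 ,gene22,  gene23"

# Steps 1 and 2: split on single spaces, then glue the pieces back
# together with no separator, which drops every space
no_spaces <- paste(strsplit(x, " ", fixed = TRUE)[[1]], collapse = "")

# Step 3: split the space-free string on commas
genes <- strsplit(no_spaces, ",", fixed = TRUE)[[1]]
print(genes)
```

Removing the spaces with gsub(" ", "", x, fixed = TRUE) before splitting 
on "," should give the same result in a single call.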

Martin Morgan wrote:
> Hi Paul -- I saw this on the R mailing list, too.  Such
> 'cross-posting' is discouraged (though in this case you get answers
> that you wouldn't have got if you'd restricted yourself to just one
> list!)
> 
> I wonder if your problem was splitting the 'genes' character string
> with "," rather than ", *" or ",[[:blank:]]*" ? Whatever the case, if
> you have a data frame (from Jim Holtman's reply on the R list)
> 
>> func_gen
>    Function                             x
> 1 Function1 gene5, gene19, gene22, gene23
> 2 Function2          gene1, gene7, gene19
> 3 Function3   gene2, gene3, gene7, gene23
> 
> I would have created a named list associating function and gene name:
> 
>> fids <- sapply(func_gen[["x"]], strsplit, ",[[:blank:]]*")
>> names(fids) <- func_gen[["Function"]]
> 
> and converted this to an incidence matrix:
> 
>> uids <- unique(unlist(fids))
>> incidence <- sapply(fids, "%in%", x=uids)
>> rownames(incidence) <- uids
> 
> Since these seem like gene sets, and your work flow might continue
> along these lines, it might be convenient to represent your data as a
> gene set collection
> 
>> library(GSEABase)
>> gs <- mapply(GeneSet, fids, setName=names(fids))
>> gsc <- GeneSetCollection(gs)
> 
> and then let the package do the clever operation
> 
>> incidence(gsc)
>           gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
> Function1     1      1      1      1     0     0     0     0
> Function2     0      1      0      0     1     1     0     0
> Function3     0      0      0      1     0     1     1     1
> 
> Martin
> 
> Paul Christoph Schröder <pschrode at alumni.unav.es> writes:
> 
>> Hello all!
>>
>> I have the following problem with the %in% command:
>>  
>> 1) I have a data frame that consists of functions (rows) and genes 
>> (columns). The whole has been loaded with the "read.delim" command 
>> because of gene-duplications between the different rows.
>> 2) Now, there is another data frame that contains all the genes (only 
>> the genes and without duplicates) from all the functions of the above 
>> data frame.
>>
>> What I want to do now is to use the "%in%" command to obtain a 
>> TRUE-FALSE data frame. This should be a data frame where, for every 
>> function, some genes are TRUE and some are FALSE depending on whether 
>> or not they were in the specific function when matched against the 
>> "all genes" data frame.
>>
>> The main problem I have is how the genes are stored in the first data 
>> frame. I used the "unlist" command to separate them at the commas (","). 
>> But every time I do the match between the first and second data frame it 
>> returns FALSE for every gene in every function.
>>
>> Can anyone please give me a hint on how to handle the problem?
>> Thank you very much in advance!
>>
>> Paul
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
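
To illustrate why the choice of split pattern matters for %in%, here is 
a small sketch. It shows one plausible cause of the FALSE matches, not 
necessarily the one in Paul's data. The vectors `all_genes` and `x` are 
invented for the example:

```r
# Hypothetical "all genes" vector and one comma-separated gene string
all_genes <- c("gene1", "gene7", "gene19")
x <- "gene1, gene7, gene19"

# Splitting on "," alone keeps the blank after each comma, so every
# element after the first starts with a space and fails to match
naive <- strsplit(x, ",")[[1]]
print(all_genes %in% naive)

# Letting the pattern swallow the blanks gives clean gene names
clean <- strsplit(x, ",[[:blank:]]*")[[1]]
print(all_genes %in% clean)
```

%in% compares strings exactly, so " gene7" and "gene7" are different 
values as far as the match is concerned.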

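For anyone who wants to try the base R route without GSEABase, the 
following self-contained sketch rebuilds the example data frame from 
the quoted message and derives a TRUE/FALSE table from it. It is a 
variation on Martin's code rather than a copy: it calls strsplit() 
directly on the column, assumes the column is character (hence 
stringsAsFactors = FALSE), and transposes the result so that functions 
are rows, as in the incidence(gsc) output.

```r
# Example data as shown in the quoted message, kept as character columns
func_gen <- data.frame(
  Function = c("Function1", "Function2", "Function3"),
  x = c("gene5, gene19, gene22, gene23",
        "gene1, gene7, gene19",
        "gene2, gene3, gene7, gene23"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of gene names per function
fids <- strsplit(func_gen[["x"]], ",[[:blank:]]*")
names(fids) <- func_gen[["Function"]]

# All distinct genes, in order of first appearance
uids <- unique(unlist(fids))

# One logical row per function, one column per gene
incidence <- t(sapply(fids, function(genes) uids %in% genes))
colnames(incidence) <- uids
print(incidence)
```

as.data.frame(incidence) turns the logical matrix into the TRUE-FALSE 
data frame asked for in the original question.
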
-- 
Dr Oleg Sklyar * EBI-EMBL, Cambridge CB10 1SD, UK * +44-1223-494466


