[BioC] Problems with iteration (sapply) over RNAStringSet

Thu Jul 5 23:15:54 CEST 2012

Hi,

I want to iterate over an RNAStringSet (rs) to do a calculation for each
of the sequences in the form of:

    1) get the sequence
    2) do the calculations
    3) plot the results and
    4) use the sequence name (from names(rs)) in plot legends and titles,
    e.g.
    plot(x, main = paste(sequence_name, 'in condition X', sep = '  ')).

The name I want to use is the first field of the FASTA description line;
I don't want the rest of the description. However, extracting the name
does not work as I expected.
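
To make the goal concrete, here is a minimal base R sketch of the name
extraction on a plain character vector. The vector `desc` is a hypothetical
stand-in for names(rs), and `first_field` is a name made up for illustration;
nothing here touches a Biostrings object.

```r
# Hypothetical stand-in for names(rs): one FASTA description per sequence
desc <- c("Gene1 Description", "Gene2 Description",
          "Gene3 Description", "Gene4 Description")

# Split each description on spaces and keep only the first field
first_field <- vapply(strsplit(desc, split = " ", fixed = TRUE),
                      function(parts) parts[1], character(1))
print(first_field)
```

This is the result I would like to obtain for every sequence in the set.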

The input FASTA file looks like this:

> Gene1 Description
UUUUUUUUUUUUUUUUUUUUUUU
> Gene2 Description
AAAAAAAAAAAAAAAAAAAAAAA
> Gene3 Description
GGGGGGGGGGGGGGGGGGGGGGG
> Gene4 Description
CCCCCCCCCCCCCCCCCCCCCCC

library("Biostrings")
rs = read.RNAStringSet('test.fa')

R> rs
  A RNAStringSet instance of length 4
    width seq                                               names               
[1]    23 UUUUUUUUUUUUUUUUUUUUUUU                           Gene1 Description
[2]    23 AAAAAAAAAAAAAAAAAAAAAAA                           Gene2 Description
[3]    23 GGGGGGGGGGGGGGGGGGGGGGG                           Gene3 Description
[4]    23 CCCCCCCCCCCCCCCCCCCCCCC                           Gene4 Description

The following commands return what I was expecting:

R> strsplit(names(rs), split = ' ')[[1]][1]
[1] "Gene1"
R> strsplit(toString(rs), split = ',')[[1]][1]
[1] "UUUUUUUUUUUUUUUUUUUUUUU"

To iterate I wrote this function:

myFun = function(x){
  name = strsplit(names(x), split = ' ')[[1]][1]
  seq = strsplit(toString(x), split = ',')[[1]][1]
  names(seq) = name
  return(seq)
}

However, this returns an error:

R> myFun = function(x){
+   name = strsplit(names(x), split = ' ')[[1]][1]
+   seq = strsplit(toString(x), split = ',')[[1]][1]
+   names(seq) = name
+   return(seq)
+ }
R> sapply(rs, myFun)
Error in strsplit(names(x), split = " ") : non-character argument
Calls: sapply ... lapply -> lapply -> lapply -> FUN -> FUN -> strsplit
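
One possible reading of this error, which is an unconfirmed assumption: if
sapply hands each element to myFun as a single sequence that carries no names,
then names(x) is NULL inside the function. The base R sketch below only shows
that strsplit fails with the same kind of message when given NULL; the bare
string `x` is a stand-in, not a Biostrings object.

```r
# Stand-in for a single element: a bare string with no names attribute
x <- "UUUUUUUUUUUUUUUUUUUUUUU"
stopifnot(is.null(names(x)))

# strsplit() on NULL raises an error; capture its message instead of stopping
msg <- tryCatch(strsplit(names(x), split = " "),
                error = function(e) conditionMessage(e))
print(msg)
```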

Simplifying the function to

R> myFun = function(x){
+   seq = strsplit(toString(x), split = ',')[[1]][1]
+ }

returns the sequences, but labeled with the full names as entered in the
original FASTA file.

R> sapply(rs, myFun)
        Gene1 Description         Gene2 Description         Gene3 Description 
"UUUUUUUUUUUUUUUUUUUUUUU" "AAAAAAAAAAAAAAAAAAAAAAA" "GGGGGGGGGGGGGGGGGGGGGGG" 
        Gene4 Description 
"CCCCCCCCCCCCCCCCCCCCCCC" 

I would appreciate it if anyone could offer a solution or explain why
strsplit does not work inside the loop (sapply).
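
For reference, this is a sketch of the behaviour I am after, written in base R
only. It loops over positions instead of over the elements, so the name lookup
does not depend on what the element itself carries. `seqs` is a hypothetical
named character vector standing in for the sequences (I assume, without having
checked, that as.character(rs) would give something similar); `short_names`,
`result` and `titles` are illustrative names.

```r
# Hypothetical stand-in for the sequences, named by full FASTA description
seqs <- c("Gene1 Description" = "UUUUUUUUUUUUUUUUUUUUUUU",
          "Gene2 Description" = "AAAAAAAAAAAAAAAAAAAAAAA")

# First field of every description, computed once outside the loop
short_names <- sub(" .*$", "", names(seqs))

# Same sequences, relabeled with the short names
result <- setNames(unname(seqs), short_names)

# Iterate over positions so each title is built from the matching short name
titles <- vapply(seq_along(seqs), function(i) {
  paste(short_names[i], "in condition X", sep = "  ")
}, character(1))
print(result)
print(titles)
```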

Thank you!
Kemal

R> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] illuminaHumanv4.db_1.14.0 org.Hs.eg.db_2.7.1       
 [3] RSQLite_0.11.1            DBI_0.2-5                
 [5] AnnotationDbi_1.18.1      beadarray_2.6.0          
 [7] Biobase_2.16.0            ShortRead_1.14.4         
 [9] latticeExtra_0.6-19       RColorBrewer_1.0-5       
[11] Rsamtools_1.8.5           lattice_0.20-6           
[13] GenomicRanges_1.8.7       ggplot2_0.9.1            
[15] edgeR_2.6.7               limma_3.12.1             
[17] Biostrings_2.24.1         IRanges_1.14.3           
[19] BiocGenerics_0.2.0        colorout_0.9-9           

loaded via a namespace (and not attached):
 [1] BeadDataPackR_1.8.0 bitops_1.0-4.1      colorspace_1.1-1   
 [4] dichromat_1.2-4     digest_0.5.2        grid_2.15.0        
 [7] hwriter_1.3         labeling_0.1        MASS_7.3-18        
[10] memoise_0.1         munsell_0.3         plyr_1.7.1         
[13] proto_0.3-9.2       reshape2_1.2.1      scales_0.2.1       
[16] stats4_2.15.0       stringr_0.6         tools_2.15.0       
[19] zlibbioc_1.2.0