[BioC] forge BSgenome data package

Hervé Pagès hpages at fhcrc.org
Fri Jan 28 19:49:21 CET 2011


Hi Steve,

On 01/27/2011 10:37 PM, Steve Shen wrote:
> Hi Herve,
>
> Sorry for bothering you again. With you help, I was able to build this
> BSgenome data package. Now it brings me another level of problem. How do
> I manipulate the sequences since it is quite different from the examples
> in your documentation? For example, how can I find the sequences that
> length is 1000bp or less? I have issued couple of commands and list the
> output below. Thanks so much for your help. By the way, do you guys
> offer a training class in a near future?
>
> Steve
>
>  > seq <- CFlo$cflo_v3.3.fold
>  > str(seq)
> Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
>    ..@ pool           :Formal class 'SharedRaw_Pool' [package "IRanges"]
> with 2 slots
>    .. .. ..@ xp_list                    :List of 1
>    .. .. .. ..$ :<externalptr>
>    .. .. ..@ .link_to_cached_object_list:List of 1
>    .. .. .. ..$ :<environment: 0x7a68ec0>
>    ..@ ranges         :Formal class 'GroupedIRanges' [package "IRanges"]
> with 7 slots
>    .. .. ..@ group          : int [1:24026] 1 1 1 1 1 1 1 1 1 1 ...
>    .. .. ..@ start          : int [1:24026] 1 372 921 1459 2139 3015
> 17388 34650 67069 78840 ...
>    .. .. ..@ width          : int [1:24026] 371 549 538 680 876 14373
> 17262 32419 11771 16416 ...
>    .. .. ..@ NAMES          : chr [1:24026] "scaffold9" "scaffold12"
> "scaffold16" "scaffold2" ...
>    .. .. ..@ elementMetadata: NULL
>    .. .. ..@ elementType    : chr "integer"
>    .. .. ..@ metadata       : list()
>    ..@ elementMetadata: NULL
>    ..@ elementType    : chr "DNAString"
>    ..@ metadata       : list()
>
>  > seq
>    A DNAStringSet instance of length 24026
>            width seq                                         names
>      [1]     371 GTATATAAATAAAAATCTTA...AAACTGTCATACAAACAGTT scaffold9
>      [2]     549 TATCTATATTTTATAAATTA...GAAATTTTTATAAGCTTATT scaffold12
>      [3]     538 ATAATGAGCAAGAATAAATT...TACATATTAGTTGGTTCTCT scaffold16
>      [4]     680 CGCCAATCTATGAAGACGAA...GATCATGAATTCAAATGGGA scaffold2
>      [5]     876 CACATGATTTATTTGCAGAA...CGCCGATTGTACATGTGATA scaffold4
>      [6]   14373 CGTTCGGTCTGCTCAATGTT...ACTGCCGCATGTTTAGCATT scaffold20
>      [7]   17262 ATTATCTACAAGCTTGGTAA...TAGGAGCCGCCCGACCGCTC scaffold10
>      [8]   32419 AGAGAAGGAACAGAGAAGAA...TACTTTTGTTATTTTAGAAG scaffold11
>      [9]   11771 CAATGTCAAACTGATAACAT...GTCTGATTAGGATTCATCGC scaffold24
>      ...     ... ...
> [24018]    5878 TTTTTTTTTTTTTCGAAAAT...ATAGTTATTTTACATATTAG C4054188
> [24019]    5945 TTATATTAGATTCTTAAAGT...AAATAATAAATAATAAATAA C4054414
> [24020]    5777 ATAATAATACCGTTCACAAG...CGATGCGTCCTTTTCTTTTT C4053796
> [24021]    6301 TTTATTATTTATTAAGCTTT...TATTTTACGTAGCGAAAAAA C4055556
> [24022]    6987 TAAAAAAAATAGTTATATTT...AGAAATTTATTATTATTATT C4057106
> [24023]    7532 TCTCAGGGGCTGGTCTGCCT...CGGACGCGTGGAGAGAGTCA C4058062
> [24024]    9748 TCCTCGCGCTTACCGTTTGA...AAAAATTTGGTTGCGTTAAT C4059958
> [24025]   32760 GCTTCCGAGGTCGGGGACGC...AACGACGGTTCGCCTGCTAT scaffold5109
> [24026]   18211 TTATTTATTAAAATAAATAA...AGCGAGGGTGCGGCCCCGGC scaffold4322
>

'seq' is a DNAStringSet object so you could start by looking at the
man page for those objects (?DNAStringSet) for the basics of how to
manipulate yours. This is a good starting point. From there you
will eventually end up visiting other man pages in the Biostrings
package by following the suggestions given in the "See also" section.
Biostrings has a lot of man pages! Most man pages provide a lot of
examples that you will definitely want to look at.

For your particular problem of selecting the elements that are
<= 1000bp, just do 'seq[width(seq) <= 1000]'.

Please see our website for course announcements:

   http://bioconductor.org/

Cheers,
H.


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list