[BioC] Extarcting large number of sequences using BSgenome package

Hervé Pagès hpages at fhcrc.org
Tue Aug 16 17:40:20 CEST 2011


Hi Viritha,

On 11-08-16 08:01 AM, viritha kaza wrote:
> Hi group,
> I am interested in retrieving about 2000 sequences with the specific
> chromosome number,start and end site.
> I was thinking of using BSgenome package for this.
>
>
>> source("http://bioconductor.org/biocLite.R")
>
>> biocLite("BSgenome","BSgenome.Hsapiens.UCSC.hg19")
>
>> library(BSgenome)
>
>> library("BSgenome.Hsapiens.UCSC.hg19")
>
>> myseq<- getSeq(Hsapiens,"chr2",start=10000,end=10020)
>
> # this would work for specific values.
>
> #So I tried to use a dataframe where it would retrieve chromosome number
> from by using
>
>> full<-as.matrix(read.table(“test_seq.txt”,sep=’\t’,quote=””,header=True,
> as.is=TRUE))

Why are you doing as.matrix here? Do you *really* want all the columns
to be coerced to the same type? (which in your case is probably
character because you have already one column of that type).

>
>> full.df<-data.frame(full)

So you're back to a data.frame, except that now all the columns are
character vectors :-/

>
> # test_seq contains this information
>
> #chromosome Start End
> #chr2 10000 10020
> #chr3 10000 10020

Why not copy and paste what you get when you just type full.df? That
way would know exactly what you ended up with (if full.df is too big,
just show us the output of head(full.df)).
Anyway, it looks like you'll have to get rid of those #. Otherwise
passing Hsapiens,full.df$chromosome to getSeq() with those # in it
will surely cause problems.

>
>> myseq<- getSeq(Hsapiens,full.df$chromosome,start=10000,end=10020)
>
> #but then when I use start=full.df$Start. It naturally throws an error
> saying 'start' must be a vector of integers
>
> Questions:
>
> How Do I handle this?

By having full.df$Start be a vector of integers and also by reading
the man page for getSeq().

>
> Does start here mean that each chromosome numbering starts from 1?

Yes, the first base at the 5' end of the + strand (or 3' end of the -
strand) is considered to be position 1.

>
> How do I split each sequence retrieved and create as fasta format (>)with
> sequence name attached to them retrieved from my input file?

How do you want to "split" each sequence retrieved? As a good starting
point you could have a look at write.XStringSet(). Note that
the names you put on your DNAStringSet object will be used as the
record description lines in the resulting FASTA file.

Cheers,
H.

>
>
>
>> sessionInfo()
>
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg19_1.3.17 BSgenome_1.20.0
> [3] Biostrings_2.20.1                  GenomicRanges_1.4.6
> [5] IRanges_1.10.4
>
> loaded via a namespace (and not attached):
> [1] tools_2.13.0
>
>
>
> Waiting for your suggestions,
>
> Thank you,
>
> Viritha
>
> 	[[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list