[BioC] Query

Martin Morgan mtmorgan at fhcrc.org
Wed Feb 24 15:53:49 CET 2010


On 02/23/2010 09:43 PM, Yogesh Kumar wrote:
> Respected Sir/Madam,
> 
> Good morning, I just started my PhD in Next generation sequencing datasets..
> I have one query , can you help me to handle this query.
> 
> we  have sequence of 1MB region from Human chr 22 (haploid). we want to cut
> the length of max 450-500bp  ( given use Normal distribution, Mean 500 and
> SD=?). *I want to know how can we generate reads from our sequence*   (
> paired end of 75bp at both ends, Assuming  perfect reads). we want to do
> simulation of NGS run and reassemble it  again  without using reference
> sequence. Can we use R for simulation work also? if yes please provide me
> which package we can use for it and possible send me R script too.

Good morning --

>From here

  http://bioconductor.org/packages/release/BiocViews.html

The BSgenome software package and the appropriate BSgenome annotation
(e.g., BSgenome.Hsapiens.UCSC.hg19) would be a good starting point for
reference sequence. The Biostrings and IRanges software packages,
especially 'Views', coupled with standard R random number generators
e.g., rnorm would be a good place to simulate reads. These could be
written as fasta files. Basic code might be

  library(BSgenome.Hsapiens.UCSC.hg19)
  N <- 1e8 # 10M reads
  start <- 5e7 + runif(N, 0, 1e7) # start of fragment, genome coords
  width <- rnorm(N, 500, 50) # width of fragment
  v <- Views(Hsapiens[["chr1"]], start=start, width=width)

Each element of v is a fragment

  r1 <- narrow(v, 1, 80)
  r2 <- narrow(v, width(v) - 80 + 1, width(v))

r1 and r2 are the pairs of each read. The 10M reads didn't quite fit in
my 8Gb of RAM, so likely I'd need to generate reads in batches of, e.g.,
5M reads.

Good luck.

Martin

> 
> 1. To establish simulation of reassembling sequence from NGS data. This will
> build from re-assembling a simple sequence of 1 Mb with no repeats in the
> haploid state, to inclusion of genetic variation and polyploidy.
>        -simulate a NGS run from a 1 Mb segment of human with little/no
> repeats. Average fragment size 500 bp with normal distribution. Paired end
> with 75 bp reads. Assume perfect sequencing. Check out other simulation
> methods
>        - align the reads back to the 1 Mb sequence. How much variation in
> coverage
>        - reassemble the reads WITHOUT using the reference sequence.
> 
> 
> 
> 
> http://maq.sourceforge.net/maq-manpage.shtml#7
> 
> I think we can use above link but not be able to work out . can you check it
> and send me solution
> 
> Regards
> Yogesh Kumar
> University of Otago
> Dunedin, NZ
> 0064226149242
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list