[BioC] forgeSeqFiles

Hervé Pagès hpages at fhcrc.org
Wed Dec 25 09:49:18 CET 2013


Hi Melo,

On 12/22/2013 11:47 PM, Melo [guest] wrote:
>
> Hello everyone,
>
> I am very new to R as well as Bioconductor but found them very helpful, so I'm trying to use them.
> Now I need to make a BSgenome package for my own organism. Therefore I have made a folder named seqs_srcdir which contains 17,000 files (one gene_sequence per file),

Trying to forge a BSgenome package from gene sequences is a bad
idea. A lot of tools won't operate properly on such a package.

A BSgenome data package is intended to represent the full genome of a
given organism. The sequences in such a package are chromosomes and/or
scaffolds and/or whatever other sequences are considered to constitute
the genome assembly of the organism. A lot of tools that operate on
BSgenome objects assume that. For example, it's easy to extract the
gene sequences from a BSgenome object if you know the gene coordinates
with respect to the assembly. The gene/transcript/exon/CDS coordinates
are often stored in a GFF file or similar and can be imported into BioC
with tools like makeTranscriptDbFromGFF() followed by a call to genes(),
transcripts(), exons(), etc., which return the coordinates in
a GRanges or GRangesList object. Then use getSeq() on the BSgenome
and GRanges objects to extract the sequences as a DNAStringSet object.
See ?makeTranscriptDbFromGFF and ?transcripts in the GenomicFeatures
package for the details.
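
As a rough sketch of that workflow: the helper name extract_gene_seqs()
and its two arguments are made up for illustration, and it assumes the
sequence names used in the GFF file match those of the BSgenome object.

```r
library(GenomicFeatures)  # makeTranscriptDbFromGFF(), genes()
library(BSgenome)         # getSeq() method for BSgenome objects

## Hypothetical helper, not part of any package:
##   genome   - a BSgenome object for the full assembly
##   gff_file - path to a GFF3 file annotated on that same assembly
extract_gene_seqs <- function(genome, gff_file) {
    ## Import the annotations (this function was renamed
    ## makeTxDbFromGFF() in later GenomicFeatures releases).
    txdb <- makeTranscriptDbFromGFF(gff_file, format="gff3")
    gene_ranges <- genes(txdb)    # GRanges, one range per gene
    getSeq(genome, gene_ranges)   # DNAStringSet, one sequence per gene
}
```

Swapping genes() for transcripts() or exons() inside the helper would
return the sequences of those features instead.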

But if you have 17,000 files, one gene sequence per file, you can
load them directly into a DNAStringSet object by calling
readDNAStringSet() on the character vector containing the 17,000 file
paths. You can (and should) completely bypass the BSgenome data package
in that case.
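
Something along these lines should work. The two tiny FASTA files and
the temporary folder below are made up so the example can be run as is;
in practice seqs_srcdir would point to the folder that holds the real
files, and the ".fa" extension is an assumption about how they are named.

```r
library(Biostrings)

## Stand-in for the real seqs_srcdir: two made-up FASTA files in a
## temporary folder, one gene sequence per file.
seqs_srcdir <- file.path(tempdir(), "seqs_srcdir")
dir.create(seqs_srcdir, showWarnings=FALSE)
writeLines(c(">geneA", "ACGTACGT"), file.path(seqs_srcdir, "geneA.fa"))
writeLines(c(">geneB", "TTGACC"), file.path(seqs_srcdir, "geneB.fa"))

## Collect all the file paths, then load everything in a single call.
fasta_files <- list.files(seqs_srcdir, pattern="\\.fa$", full.names=TRUE)
gene_seqs <- readDNAStringSet(fasta_files)

gene_seqs          # a DNAStringSet with one element per gene
width(gene_seqs)   # the length of each gene sequence
```

The names of the resulting DNAStringSet are taken from the FASTA header
lines, so individual genes can be retrieved with gene_seqs["geneA"].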

> but when I downloaded, and even unzipped BSgenome

This is not the recommended way to install a BioC package. Please
always use biocLite() for that. See:

   http://bioconductor.org/install/

> I kept getting the  Error: could not find function "forgeSeqFiles".
> I know I need to make the seed files so I gave the command line:
>
> forgeSeqlengthsFile(seqnames, prefix="pi1>", suffix=".fa", seqs_srcdir="/Users/Me/Documents/Microarray", seqs_destdir="/Users/Me/Documents/Microarray/Seeds", verbose=TRUE)
>
> Error: could not find function "forgeSeqFiles".

You need to call forgeBSgenomeDataPkg() to forge a BSgenome
data package, not forgeSeqlengthsFile(). Please make sure you
follow the instructions in the BSgenomeForge vignette, where the
whole process of forging a BSgenome data package is explained.

>
> I would appreciate it if you could please advise,
> Melo
>
>   -- output of sessionInfo():
>
> forgeSeqlengthsFile(seqnames, prefix="pi1>", suffix=".fa", seqs_srcdir="/Users/Me/Documents/Microarray", seqs_destdir="/Users/Me/Documents/Microarray/Seeds", verbose=TRUE)
>
> Error: could not find function "forgeSeqFiles".

This doesn't look like the output of sessionInfo(). The output of
sessionInfo() is... well... the output you get when you run the
sessionInfo() command. Please always provide this information.

Thanks!
H.

>
> --
> Sent via the guest posting facility at bioconductor.org.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


