[BioC] forge BSgenome data package

Thu Jan 20 06:20:39 CET 2011

Steve,

On 01/19/2011 04:37 PM, Steve Shen wrote:
> Hi Herve,
>
> Thank you very much. I tried carefully with correct path and file name
> by triggering auto-complete feature. But it didn't help either. It may
> be something wrong with my seed file, so I also appended the file below.
> Please have a look. Thanks in advance,
>
> Steve
>
>  > forgeBSgenomeDataPkg("/home/steve/Data/Genomes/seed")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'as.list'
>  > forgeBSgenomeDataPkg("seed")
> Error in as.list(.readSeedFile(x, verbose = verbose)) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'as.list'

An easy way to check if your seed file is a valid DCF file is:

   > read.dcf("seed")
   Error in read.dcf("seed") :
     Line starting 'I'm a broken DCF fil ...' is malformed!

Again, the error message you get when calling forgeBSgenomeDataPkg() on
a broken DCF file should return exactly that, but, unfortunately, it
doesn't with the current version of BSgenome. Updated versions of
BSgenome (1.18.3 in release and 1.19.3 in devel) are correcting this
and will become available tomorrow around noon, and midnight,
respectively (Seattle time).

>
> Seed file starts:
>
> Package: BSgenome.CFlo.YU.v3
> Title: Camponotus floridanus (Ants) full genome (YU version V3.3)
> Description: Camponotus floridanus (Ants) full genome as provided by YU
> (V3.3, Jan. 2011)
   ^^^^^^^^^^^^^^^^^
Is this on a line of its own in your file? This might be the problem.

Please consult the help for the read.dcf() function (?read.dcf from
your R session) for how to format a DCF file. Note that a 'tag:value'
should either be placed on a single line, or, if the value spans
several lines, then the extra lines (aka "continuation lines") must
start with a whitespace. Like this:

Description: Camponotus floridanus (Ants) full genome as provided
     by YU (V3.3, Jan. 2011)

(I'm using 4 whitespaces here but 1 is enough. Using a <TAB> would
also be OK.)

> Version: 1.0.0
> organism: camponotus floridanus
> species: Ant
> provider: YU
> provider_version: Assembly V3.3
> release_date: Jan, 2011
> release_name: Ant Genome Reference Consortium
> source_url: NA
> organism_biocview: C_flo
> BSgenomeObjname: CFlo
> seqnames: "cflo_v3.3.fold"
> mseqnames: character(0)
> nmask_per_seq:2
>
> seqs_srcdir: /home/steve/Data/Genomes/

So you have a FASTA file named "cflo_v3.3.fold.fa" located in
/home/steve/Data/Genomes/ that contains a *single* FASTA record
that corresponds to the cflo_v3.3.fold sequence? Please double check
this. (You can see the list of records contained in a FASTA file with
the fasta.info() function from Biostrings.)

H.

> # masks_srcdir: /home/steve/Data/Masks_src
> # AGAPSfiles_name: cflo_v3.3.fa.masked
>
>
> 2011/1/19 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Steve,
>
>
>     On 01/19/2011 01:04 PM, Steve Shen wrote:
>
>         Hi Herve,
>
>         Thank you so much for your quick reply. Your vignette is pretty
>         clear. I
>         think the problem is on my side. I just lack of experience on
>         dealing
>         with this issue and maybe the file format doesn't fit. I sort of
>         gave up
>         using the seed file but just using low level commands. I still have
>         errors. I really have no idea what is going on. Your help will
>         be much
>         appreciated.
>
>         Best,
>         Steve
>
>         1. with seed file, I got
>          > forgeBSgenomeDataPkg("./cflo_seed.R")
>         Error in as.list(.readSeedFile(x, verbose = verbose)) :
>            error in evaluating the argument 'x' in selecting a method for
>         function 'as.list'
>
>
>     I agree that the error message is kind of obscure but the problem
>     is probably that the specified path to your seed file is invalid.
>     I'm about to commit a change to the BSgenomeForge code that will
>     produce this error message instead:
>
>      > forgeBSgenomeDataPkg("aaaaa")
>     Error in .readSeedFile(x, verbose = verbose) :
>       seed file 'aaaaa' not found
>
>     An easy way to make sure the path is valid is to press <TAB> when
>     you are in the middle of typing the path: this will trigger the
>     auto-completion feature.
>
>
>
>         2. with command forgeSeqlengthsFile,
>          > forgeSeqlengthsFile("cflo_v3.3.fold", prefix="",
>         suffix=".fa",seqs_srcdir=".", seqs_destdir=".", verbose=TRUE)
>         Saving 'seqlengths' object to compressed data file
>         './seqlengths.rda'...
>         DONE
>         Warning messages:
>         1: In FUN("cflo_v3.3.fold"[[1L]], ...) :
>            In file './cflo_v3.3.fold.fa': 24026 sequences found, using first
>         sequence only
>         2: In FUN("cflo_v3.3.fold"[[1L]], ...) :
>            In file './cflo_v3.3.fold.fa': sequence description
>         "scaffold9scaffold12scaffold16scaffold2scaffold4scaffold20scaffold10scaffold11scaffold24scaffold1scaffold3scaffold6scaffold15scaffold28scaffold30scaffold22scaffold18scaffold21scaffold7scaffold32scaffold36scaffold13scaffold23scaffold39scaffold19scaffold41scaffold29scaffold37scaffold33scaffold45scaffold38scaffold17scaffold5scaffold46scaffold48scaffold31scaffold25scaffold51scaffold49scaffold44scaffold47scaffold54scaffold43scaffold55scaffold60scaffold35scaffold42scaffold63scaffold53scaffold50scaffold40scaffold67scaffold61scaffold69scaffold34scaffold70scaffold27scaffold73scaffold8scaffold75scaffold71scaffold74scaffold14scaffold66scaffold65scaffold80scaffold62scaffold76scaffold79scaffold85scaffold78scaffold86scaffold88scaffold81scaffold87scaffold26scaffold92scaffold93scaffold94scaffold59scaffold52scaffold84scaffold77scaffold89scaffold100scaffold97scaffold99scaffold64scaffold103scaffold90scaffold106scaffold56scaffold98scaffold109scaffold68sc
>         [... truncated]
>
>         3. with forgeSeqfiles,
>          > forgeSeqFiles("cflo_v3.3.fold", mseqnames=NULL, prefix="",
>         suffix=".fa",seqs_srcdir=".", seqs_destdir="./BSgenome",
>         verbose=TRUE)
>         Loading FASTA file './cflo_v3.3.fold.fa' in 'cflo_v3.3.fold'
>         object... DONE
>         Saving 'cflo_v3.3.fold' object to compressed data file
>         './BSgenome/cflo_v3.3.fold.rda'... DONE
>         Warning message:
>         In .forgeSeqFile(name, prefix, suffix, seqs_srcdir, seqs_destdir,  :
>            file contains 24026 sequences, using the first sequence only
>
>
>     Sorry but I can only try to help if you stick to the vignette and
>     use a seed file. Using low-level functions like forgeSeqlengthsFile()
>     or forgeSeqFiles() is undocumented/unsupported. The reason for this
>     is that this route is *much* more complicated and error-prone
>     than using a seed file! This is exactly why seed files where
>     invented: to make the whole process much easier for you, and to
>     make troubleshooting much easier for us.
>
>     Cheers,
>     H.
>
>
>
>
>         2011/1/19 Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>>>
>
>
>             Hi Steve,
>
>             There are several things that look wrong with your seed file.
>
>             1. The # must be the first character in a line to make it a
>                line of comment (ignored).
>                For example, those 2 lines will certainly not be interpreted
>                as you might expect:
>
>
>                  mseqnames: NA #paste("upstream", c("1000", "2000",
>         "5000"), sep="")
>
>                  source_url: NA #
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>
>             2. If you don't have multiple sequences, specify:
>
>                  mseqnames: character(0)
>
>             3. As explained in the BSgenomeForge vignette, the default for
>                seqfiles_prefix is .fa so this won't work:
>
>                  seqnames: "clof_v3.fa"
>
>                Unless your file is named clof_v3.fa.fa?
>                If your file is named clof_v3.fa, then you should specify:
>
>                  seqnames: "clof_v3"
>
>             4. This doesn't look like a valid file of assembly gaps:
>
>
>                  AGAPSfiles_name: clof_v3.fa.masked
>
>                Please refer to the BSgenomeForge vignette for what kind
>         of masks
>                and what file formats are supported.
>
>             5. Having this
>
>
>                  PkgExamples: Hsapiens
>                         seqlengths(Hsapiens)
>                         Hsapiens$chr1  # same as Hsapiens[["chr1"]]
>
>                in a seed file for clof obviously doesn't make sense and your
>                package won't pass 'R CMD check' because it will contain
>         broken
>                examples.
>
>             There might be other problems with your seed file. All you
>         need to do
>             is read and follow carefully the instructions described in the
>             BSgenomeForge vignette. Let me know if things are not clear
>         in the
>             vignette and I'll try to improve it. Thanks!
>
>             H.
>
>
>
>             On 01/18/2011 05:32 PM, Steve Shen wrote:
>
>                 Hi list,
>
>                 This wasn't sent out with a none registered ids. I have a
>                 problem with
>                 forging a new BSgenome data package. The sequence data
>         file is
>                 clof.v3.fa
>                 and the mask file is clof.v3.fa.masked. Below are seed file,
>                 command, error
>                 and sessioninfo. Your help will be much appreciated.
>
>                 Thanks a lot,
>                 Steve
>
>                 Package: BSgenome.Clof.yu.v3
>                 Title: Clof (insects) full genome (version 3)
>                 Description: Clof (insects) full genome as provided by
>         YU (V3.3,
>                 Jan. 2011)
>                 # and will store in Biostrings objects.
>                 Version: 1.0.0
>                 organism: clof
>                 species: insect
>                 provider: YU
>                 provider_version: Assembly V3.3
>                 release_date: Jan, 2011
>                 release_name: insects Genome Reference Consortium
>                 source_url: NA #
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>                 organism_biocview: Clof
>                 BSgenomeObjname: Clof
>                 seqnames: "clof_v3.fa"
>                 mseqnames: NA #paste("upstream", c("1000", "2000",
>         "5000"), sep="")
>                 nmask_per_seq: 2
>                 #SrcDataFiles1: sequences: chromFa.zip, upstream1000.zip,
>                 upstream2000.zip,
>                 upstream5000.zip
>                      #from
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
>                 #SrcDataFiles2: AGAPS masks:
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
>                      #RM masks:
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromOut.tar.gz
>                      #TRF masks:
>         http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromTrf.tar.gz
>                 PkgExamples: Hsapiens
>                      seqlengths(Hsapiens)
>                      Hsapiens$chr1  # same as Hsapiens[["chr1"]]
>                 seqs_srcdir: /home/steve/Data/Genomes
>                 masks_srcdir: /home/steve/Data/Genomes
>                 AGAPSfiles_name: clof_v3.fa.masked
>
>                 The command, error and sessionInfo are below
>
>                     forgeBSgenomeDataPkg("Clof_seed.R", seqs_srcdir=".",
>                     masks_srcdir=".",
>
>                 destdir=".", verbose=TRUE)
>                 Loading required package: Biobase
>
>                 Welcome to Bioconductor
>
>                    Vignettes contain introductory material. To view, type
>         'openVignette()'. To cite Bioconductor, see
>         'citation("Biobase")' and for packages 'citation(pkgname)'.
>
>
>                 Attaching package: 'Biobase'
>
>                 The following object(s) are masked from 'package:IRanges':
>
>                      updateObject
>
>                 Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir,
>                 masks_srcdir =
>                 masks_srcdir,  :
>                    values for symbols NMASKPERSEQ are not single strings
>                 In addition: Warning message:
>                 In storage.mode(x$nmask_per_seq)<- "integer" : NAs
>         introduced by
>                 coercion
>
>                     sessionInfo()
>
>                 R version 2.12.1 (2010-12-16)
>                 Platform: x86_64-pc-linux-gnu (64-bit)
>
>                 locale:
>                   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>                   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>                   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>                   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>                 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>                 attached base packages:
>                 [1] stats     graphics  grDevices utils     datasets
>           methods   base
>
>                 other attached packages:
>                 [1] Biobase_2.6.1       BSgenome_1.18.2
>         Biostrings_2.18.0
>                 [4] GenomicRanges_1.2.1 IRanges_1.8.8
>
>                 loaded via a namespace (and not attached):
>                 [1] tools_2.12.1
>
>                         [[alternative HTML version deleted]]
>
>                 _______________________________________________
>                 Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>>
>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>                 Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>             --
>             Hervé Pagès
>
>             Program in Computational Biology
>             Division of Public Health Sciences
>             Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N, M2-B876
>             P.O. Box 19024
>             Seattle, WA 98109-1024
>
>             E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>             Phone:  (206) 667-5791
>             Fax:    (206) 667-1319
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319