[BioC] singe_sequences.fa.gz file in Bsgenome.Hsapiens.NCBI.GRCh38 is too big

Wed Jun 18 22:01:25 CEST 2014

Hi again,

I got a little bit confused and didn't realize that I was answering
such an old post (from April) and that you are also the person who
reported the following issues on the bioc-devel list in April (the
second one was forwarded to the list by Michael):

   https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005570.html

   https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005591.html

I hope everything works now with the updated BSgenome packages.

Please let me know if you still run into issues with the new packages
(version 1.3.1000 or higher).

Thanks,
H.

On 06/18/2014 11:34 AM, Hervé Pagès wrote:
> Hi Sean,
>
> On 04/15/2014 11:30 PM, Sean Li [guest] wrote:
>>
>> singe_sequences.fa.gz file in BSgenome.Hsapiens.NCBI.GRCh38 is too big
>> to load. Why can't you separate it into several files as
>> BSgenome.Hsapiens.UCSC.hg19 does?
>
> How are you trying to access the genome sequences in
> BSgenome.Hsapiens.NCBI.GRCh38?
>
> Note that the singe_sequences.fa.gz file is internal to the package
> and you should avoid accessing it directly. The "normal" way to
> access the genome sequences is via [[ or getSeq().
> Use [[ to load a given chromosome:
>
>    library(BSgenome.Hsapiens.NCBI.GRCh38)
>    genome <- BSgenome.Hsapiens.NCBI.GRCh38
>    genome[["1"]]  # loads chromosome 1 only
>
> Use getSeq() to extract a set of regions (typically specified via
> a GRanges object).
>
> Trying to load the entire genome requires R to allocate more than
> 3 GB of RAM, which I don't think is possible on your platform
> (32-bit Windows). That's just the size of the human genome once in
> memory (i.e. in a DNAStringSet object), and whatever format is used to
> store it on disk (a single file or one file per chromosome) won't
> change that.
>
> Anyway, because of other issues with singe_sequences.fa.gz, today
> BSgenome.Hsapiens.NCBI.GRCh38 will be updated with a new version that
> uses one file per chromosome.
>
> Cheers,
> H.
>
>>
>>   -- output of sessionInfo():
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=Chinese_People's Republic of China.936
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>
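
For reference, here is a minimal sketch of the two access routes mentioned
in the quoted message ([[ for one chromosome, getSeq() for a set of
regions). It assumes the updated BSgenome.Hsapiens.NCBI.GRCh38 package is
installed; the coordinates in the GRanges object are arbitrary values
chosen for illustration, not regions of any particular interest.

```r
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(GenomicRanges)

genome <- BSgenome.Hsapiens.NCBI.GRCh38

# Load a single chromosome (a DNAString); only this sequence is read
# into memory, not the whole genome.
chr1 <- genome[["1"]]

# Describe a few regions of interest. These positions are made up for
# the example; NCBI-style sequence names ("1", "2") are used, not "chr1".
regions <- GRanges(
    seqnames = c("1", "2"),
    ranges   = IRanges(start = c(1000001, 2000001), width = 50)
)

# Extract only those regions; the result is a DNAStringSet with one
# sequence per range.
seqs <- getSeq(genome, regions)
```

Either route keeps memory use far below what loading all chromosomes at
once would need, which should matter on a 32-bit platform.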

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319