[BioC] Limit on number of sequence files for forging a BSgenome

Tue Apr 2 07:22:32 CEST 2013

Hi Marco,

Sorry for the delay in answering this. Seems like you are hitting
a limitation in the implementation of the read.dcf() function.

Having so many entries in the 'seqnames' field of your seed file is
probably a bad idea anyway. This field is typically expected to list
the chromosomes or "top-level sequences" so it will always have a
small number of entries (i.e. < 100).
Things like contigs or scaffolds, which are generally counted
by hundreds or thousands, should go in the 'mseqnames' field.
Here they are grouped and stored in 1 (or a small number) of
DNAStringSet objects (i.e. < 100) but each object can contain
millions of sequences. One entry per object should be reported
in the 'mseqnames' field.

See for example the 'chrUn.scaffolds' multiple sequence in the
BSgenome.Btaurus.UCSC.bosTau4 package, or the 'Zv9_NA' and
'Zv9_scaffold' multiple sequences in the BSgenome.Drerio.UCSC.danRer7
package.

See the BSgenomeForge vignette for details about to handle the
'mseqnames' field when forging your own package.

Hope this helps. Please let me know if you have further questions.

H.

On 03/29/2013 12:10 PM, Blanchette, Marco wrote:
> I traced back the error "Error: Line longer than buffer size" that I am
> getting from forgeBSgenomeDataPkg() to a call to read.dcf() made in the
> forgeBSgenomeDataPkg() that is used to read the seed file. I came to the
> realization that there is an upper limit to the number of character
> allowed per single line for the DCF files.
>
> For instance:
>
> This works
> cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file="test.dc
> f");t <- read.dcf("test.dcf")
>
> While this breaks with the same error I get from forgeBSgenomeDataPkg()
> cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file="test.dc
> f");t <- read.dcf("test.dcf")
>
> Since the seqnames: field I creates in my seed file contains several
> thousands entries, I am busting that upper limit. I can reproduce the
> error just by trying to read the seed file with read.dcf("mySeedFile.txt")
>
> At this point, I am not sure if there is an easy workaround and whether
> this should be consider a bug in BSgenome or read.dcf() that should be
> reported...
>
> Advise?
>
> --  Marco Blanchette, Ph.D.
> Stowers Institute for Medical Research
> 1000 East 50th Street
> Kansas City MO 64110
> www.stowers.org
>
>
> Tel: 816-926-4071
> Cell: 816-726-8419
> Fax: 816-926-2018
>
>
>
>
>
>
> On 3/28/13 11:58 AM, "Blanchette, Marco" <MAB at stowers.org> wrote:
>
>> Kasper,
>>
>> I see your line of thought, is there a particular fasta file causing
>> forgeBSgenomeDataPkg() to break?
>>
>> The answer is no. Once I reach a certain number of fasta files, adding one
>> more contig breaks the function. For instance, taking the first 454
>> contigs of C. brenneri breaks while removing the last or the first fasta
>> file from the list (keeping only 453) compile without a problem (neither
>> the last or the first fasta files are responsible for breaking the
>> function, the number of file is the trigger)
>>
>> What's even more puzzling is that the number that breaks is not a fixed
>> number. Selecting a random selection of contigs or changing genome will
>> change the number that triggers the function to break... However it's
>> always around 440 files, which might be due to the size of the fasta files
>> being all of very similar sizes.
>>
>> Any clues?
>>
>>
>> --  Marco Blanchette, Ph.D.
>> Stowers Institute for Medical Research
>> 1000 East 50th Street
>> Kansas City MO 64110
>> www.stowers.org
>>
>>
>> Tel: 816-926-4071
>> Cell: 816-726-8419
>> Fax: 816-926-2018
>>
>>
>>
>>
>>
>>
>> On 3/27/13 8:22 PM, "Kasper Daniel Hansen" <kasperdanielhansen at gmail.com>
>> wrote:
>>
>>> Marco,
>>>
>>> You are probably right in diagnosing the problem, but sometimes I
>>> think I have seen FASTA files with the entire sequence on a single
>>> line, instead of (say) 80 nucleotides and then a newline.  I could
>>> believe that a really long contig on a single line without a newline,
>>> could cause an error like this. You could quickly check if there is a
>>> suspicious file by
>>>   wc -l *
>>> and look for files with #lines like 2-3.  Somehow 460 seems a weird
>>> number to fail at.
>>>
>>> This may not be your problem, and I am sure Herve will respond in due
>>> time.
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Mar 27, 2013 at 4:28 PM, Blanchette, Marco <MAB at stowers.org>
>>> wrote:
>>>> Hi,
>>>>
>>>> Is there a maximum number of sequence files (chromosomes or contigs in
>>>> my case) that can be fed to the forgeBSgenomeDataPkg() function? I am
>>>> trying to build a BSgenome for C. brenneri and C. japonica available
>>> >from EnsemblGenomes. These genomes are made from thousands of contigs
>>>> with genes annotated to them. Currently, I get the following error when
>>>> running "Error: Line longer than buffer size" when running on the full
>>>> set of contigs. However, it works fine on a seed file containing a
>>>> subset of the contigs (I can forge a genome with 450 contigs but not
>>>> with 460!)
>>>>
>>>> Any suggestions will be appreciated (I can provide a toy example but I
>>>> am not sure what would be the merit of it at this point)
>>>>
>>>> Thanks
>>>>
>>>> --  Marco Blanchette, Ph.D.
>>>> Stowers Institute for Medical Research
>>>> 1000 East 50th Street
>>>> Kansas City MO 64110
>>>> www.stowers.org
>>>>
>>>> Tel: 816-926-4071
>>>> Cell: 816-726-8419
>>>> Fax: 816-926-2018
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319