[BioC] Limit on number of sequence files for forging a BSgenome

Fri Mar 29 22:08:43 CET 2013

Advice: repost this to R-devel with a new subject and an example demonstrating the bug in the base package.

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Blanchette, Marco
Sent: Friday, March 29, 2013 2:11 PM
To: Blanchette, Marco; Kasper Daniel Hansen
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Limit on number of sequence files for forging a BSgenome

I traced the error "Error: Line longer than buffer size" that I am
getting from forgeBSgenomeDataPkg() back to the read.dcf() call that
forgeBSgenomeDataPkg() uses to read the seed file. It turns out that
there is an upper limit on the number of characters allowed on a single
line of a DCF file.

For instance:

This works
cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file="test.dcf");t <- read.dcf("test.dcf")

While this breaks with the same error I get from forgeBSgenomeDataPkg()
cat("test:",paste(sample(letters,8185,TRUE),collapse=""),"\n",file="test.dcf");t <- read.dcf("test.dcf")

Since the seqnames: field I create in my seed file contains several
thousand entries, I am exceeding that upper limit. I can reproduce the
error just by reading the seed file with read.dcf("mySeedFile.txt").
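
A quick way to check whether a given seed file runs into this is to measure its line lengths before calling read.dcf(). The sketch below is only illustrative: long_lines() is a hypothetical helper, the 8192-byte default is an assumption inferred from the read.dcf() test above rather than a documented limit, and the demo seed file is made up.

```r
# Hypothetical helper: list the lines of a DCF-style file whose length
# reaches a given limit. The 8192-byte default is an assumption, not a
# documented constant.
long_lines <- function(path, limit = 8192L) {
  n <- nchar(readLines(path, warn = FALSE), type = "bytes")
  too_long <- n >= limit
  data.frame(line = which(too_long), bytes = n[too_long])
}

# Made-up seed file with one short field and one very long field.
seed <- tempfile(fileext = ".txt")
writeLines(c("organism: Caenorhabditis brenneri",
             paste("seqnames:", paste(rep("x", 9000), collapse = ""))),
           seed)

report <- long_lines(seed)
print(report)
```

Any row in the report points at a field that would need to be shortened or split before the file can be read.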

At this point, I am not sure whether there is an easy workaround, or
whether this should be considered a bug in BSgenome or in read.dcf() that
should be reported...
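
One possible workaround, sketched here under assumptions: the DCF format allows a field value to continue on following lines that start with whitespace, and read.dcf() joins those lines back into one value. Wrapping the seqnames: expression this way keeps every physical line short. Whether forgeBSgenomeDataPkg() accepts a multi-line seqnames: value has not been verified here, and the contig names below are invented for illustration.

```r
# Invented contig names, each quoted so that the field is a valid R expression.
ids <- sprintf('"contig%d"', 1:3000)

# Put 10 names on each physical line.
chunks <- split(ids, ceiling(seq_along(ids) / 10))
rows <- vapply(chunks, paste, character(1), collapse = ", ")

# Continuation lines must start with whitespace in a DCF file.
dcf <- c("seqnames: c(",
         paste0(" ", rows, collapse = ",\n"),
         " )")

path <- tempfile(fileext = ".dcf")
writeLines(dcf, path)

# No physical line comes anywhere near the length limit.
longest <- max(nchar(readLines(path)))

# read.dcf() returns the joined value; parsing it yields the names.
seed <- read.dcf(path)
seqnames <- eval(parse(text = seed[1, "seqnames"]))
length(seqnames)
```

The same wrapping could be applied by hand to an existing seed file, as long as each continuation line begins with a space or tab.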

Advice?

--  Marco Blanchette, Ph.D.
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018

On 3/28/13 11:58 AM, "Blanchette, Marco" <MAB at stowers.org> wrote:

>Kasper,
>
>I see your line of thought: is there a particular fasta file causing
>forgeBSgenomeDataPkg() to break?
>
>The answer is no. Once I reach a certain number of fasta files, adding one
>more contig breaks the function. For instance, taking the first 454
>contigs of C. brenneri breaks, while removing either the last or the first
>fasta file from the list (keeping only 453) compiles without a problem.
>Neither the last nor the first fasta file is responsible for breaking the
>function; the number of files is the trigger.
>
>What's even more puzzling is that the number that breaks is not fixed.
>Taking a random selection of contigs or changing genome changes the
>number that triggers the failure... However, it's always around 440
>files, which might be because the fasta files are all of very similar
>size.
>
>Any clues? 
>
>
>On 3/27/13 8:22 PM, "Kasper Daniel Hansen" <kasperdanielhansen at gmail.com>
>wrote:
>
>>Marco,
>>
>>You are probably right in diagnosing the problem, but sometimes I
>>think I have seen FASTA files with the entire sequence on a single
>>line, instead of (say) 80 nucleotides and then a newline.  I could
>>believe that a really long contig on a single line without a newline,
>>could cause an error like this. You could quickly check if there is a
>>suspicious file by
>>  wc -l *
>>and look for files with only 2-3 lines.  Somehow 460 seems a weird
>>number to fail at.
>>
>>This may not be your problem, and I am sure Herve will respond in due
>>time.
>>
>>Best,
>>Kasper
>>
>>On Wed, Mar 27, 2013 at 4:28 PM, Blanchette, Marco <MAB at stowers.org>
>>wrote:
>>> Hi,
>>>
>>> Is there a maximum number of sequence files (chromosomes or contigs in
>>>my case) that can be fed to the forgeBSgenomeDataPkg() function? I am
>>>trying to build a BSgenome for C. brenneri and C. japonica available
>>>from EnsemblGenomes. These genomes are made from thousands of contigs
>>>with genes annotated to them. Currently, I get the following error when
>>>running on the full set of contigs: "Error: Line longer than buffer
>>>size". However, it works fine on a seed file containing a
>>>subset of the contigs (I can forge a genome with 450 contigs but not
>>>with 460!)
>>>
>>> Any suggestions will be appreciated (I can provide a toy example but I
>>>am not sure what would be the merit of it at this point)
>>>
>>> Thanks
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>>http://news.gmane.org/gmane.science.biology.informatics.conductor
>
