[BioC] Rsubread - featureCounts for extracting read counts from De novo assembled transcripts - EdgeR DE analysis

Sun Aug 4 01:36:42 CEST 2013

Dear Alan,

Thank you for using Rsubread. The number of transcripts in the annotation
file would not be an issue -- our featureCounts function supports
unlimited number of annotations and chromosomes/transcripts (actually, at
most 2 billions).

I have just tried to run the in-development version of Rsubread to assign
reads in your last mail, using the two-line annotation in the other mail.
The program finished with no errors. The version number I tested was
"Rsubread_1.11.10".

If this is the version you are using, could you please run this command?

$ cat my_annotations.txt | awk 'BEGIN{max_feature=0; max_chro=0}
length($1)>max_feature{max_feature=length($1) }
length($2)>max_chro{max_chro=length($2)} END{print max_feature, max_chro}'

where "my_annotations.txt" is the annotation file in SAF format. This
command will give the length of the longest feature name and the longest
chromosome name. If either of them is longer than 99, it is still too long
to our program.

Thank you.

Cheers,

Yang

> Dear Wei,
>
> I tried with the devel version but still the problem persists (R crashed).
> Does this got to do with the large number of transcripts (129310) in the
> file?
>
> Could you also please comment on whether if the count data obtained from
> Rsubread processed on de novo assembled transcripts be used for edgeR
> differential expression analysis. Or using RSEM to obtain read count
> information is the way to go?
>
> Thank you,
> Alan
>
>
> On Fri, Aug 2, 2013 at 8:35 PM, Wei Shi <shi at wehi.edu.au> wrote:
>
>> Dear Alan,
>>
>> Thanks for sending through the data which helped us to identify the
>> problem. The problem was that featureCounts did not allow the chromosome
>> names in the provided annotation file to be longer than 48 characters.
>> We
>> have increased this limit to 100 characters. This solved the problem and
>> featureCounts worked fine for your data now from our testing.
>>
>> We have committed the changes to bioc devel repository and the latest
>> version should available in a couple of days (Rsubread v1.11.11).
>>
>> Note that we have also made some changes to parameters used by
>> featureCounts. The 'annot' parameter is now changed to 'annot.ext' and
>> the
>> 'genome' parameter changed to 'annot.inbuilt'. This makes it more
>> clearer
>> on how to use our in-built annotations and how to provide external
>> annotations.
>>
>> Let me know if the problem you encountered persists.
>>
>>
>> Best wishes,
>>
>> Wei
>>
>> On Aug 2, 2013, at 2:06 PM, Alan Smith wrote:
>>
>> Hi Wei,
>>
>> Thanks for looking into it. Here are few lines from my SAM file:
>>
>> @HD VN:1.0 SO:unsorted
>> @SQ SN:Locus_1_Transcript_1/43_Confidence_0.175_Length_1072 LN:1072
>> @SQ SN:Locus_1_Transcript_6/43_Confidence_0.105_Length_219 LN:219
>> @SQ SN:Locus_1_Transcript_8/43_Confidence_0.088_Length_258 LN:258
>> @SQ SN:Locus_1_Transcript_10/43_Confidence_0.088_Length_421 LN:421
>> @SQ SN:Locus_1_Transcript_14/43_Confidence_0.105_Length_217 LN:217
>> @SQ SN:Locus_1_Transcript_15/43_Confidence_0.105_Length_392 LN:392
>> @SQ SN:Locus_1_Transcript_20/43_Confidence_0.123_Length_547 LN:547
>> @SQ SN:Locus_1_Transcript_21/43_Confidence_0.105_Length_408 LN:408
>> @SQ SN:Locus_1_Transcript_23/43_Confidence_0.070_Length_246 LN:246
>>
>> *This goes for 129310 lines (number of assembled transcripts) then at
>> the end of the file:*
>>
>>
>> HWI-EAS146_0380:5:100:14414:21375#0/1	16	Locus_21817_Transcript_4/5_Confidence_0.200_Length_1641	933	0	26M	*	0	0	TCTCGGGGCTCCACTGGAGATGGTCC	^Qa]cf]\U\Z_Zbbeccfaf]`][S	AS:i:-12	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:0C0A24	YT:Z:UU
>> HWI-EAS146_0380:5:100:14769:21375#0/1	16	Locus_3777_Transcript_1/1_Confidence_0.000_Length_2776	283	42	26M	*	0	0	ATCACATCTCTTTACAGAACCCACTT	c]fbdbafdaeafdfdffeffa_aa^	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:26	YT:Z:UU
>> HWI-EAS146_0380:5:100:14923:21376#0/1	4	*	0	0	*	*	0	0	ACGACGAAGAAGATGACGACGAGGAC	^^\^Y^W[]aa_cW^ddb``a`WcfR	YT:Z:UU
>> HWI-EAS146_0380:5:100:15383:21374#0/1	4	*	0	0	*	*	0	0	CAACCTTCACCGTGGGTTCTAATGTA	YY\\_fff[fd^^^]ZZ]^Wc]c]_a	YT:Z:UU
>> HWI-EAS146_0380:5:100:15873:21375#0/1	4	*	0	0	*	*	0	0	CCTGCTTAAAGAGATCGGAAGAGCGG	`_Z^acRcccffYf[ca]d]ff[cdf	YT:Z:UU
>> HWI-EAS146_0380:5:100:16039:21373#0/1	16	Locus_4044_Transcript_6/48_Confidence_0.066_Length_596	412	42	26M	*	0	0	TGCAAGCATTCCTCCTGTGGAGGGGT	hghedhgdhcgceffdhhhhgeeeee	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:26	YT:Z:UU
>> HWI-EAS146_0380:5:100:16121:21377#0/1	16	Locus_2303_Transcript_7/11_Confidence_0.500_Length_1785	1287	42	26M	*	0	0	GCTCCGTTTGGATCCGAGAATAGCAC	fchhhghhfhgghhfgghhhh\a\\\	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:26	YT:Z:UU
>> HWI-EAS146_0380:5:100:16914:21379#0/1	4	*	0	0	*	*	0	0	CCATTCATCATCTTGATACTGCAGAT	fffffhhghfghhhhhhhhhhhehea	YT:Z:UU
>> HWI-EAS146_0380:5:100:16964:21378#0/1	0	Locus_17940_Transcript_2/7_Confidence_0.538_Length_1249	584	42	26M	*	0	0	CTGTAGCCTTTCAGTGAGCTGAGAAA	`b^``effffgfgfdgegggedgggg	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:26	YT:Z:UU
>> HWI-EAS146_0380:5:100:17356:21373#0/1	4	*	0	0	*	*	0	0	GCCATCAAACCTTACCAATGCTGAGA	dacccf[afffffffffdfff]ffca	YT:Z:UU
>> HWI-EAS146_0380:5:100:17415:21379#0/1	4	*	0	0	*	*	0	0	CTAGATGCCTCTTCAGGACTTGAGAA	a`a`aggggffffffgaggdfcdfdg	YT:Z:UU
>>
>>
>> Thanks again,
>>
>> Alan
>>
>>
>>
>>
>>
>> On Thu, Aug 1, 2013 at 10:55 PM, Wei Shi <shi at wehi.edu.au> wrote:
>>
>>> Dear Alan,
>>>
>>> Could you please provide us a few reads in your SAM file so that we can
>>> reproduce the problem? Sorry for your crashed R session.
>>>
>>> Best wishes,
>>> Wei
>>>
>>> On Aug 2, 2013, at 12:47 PM, Alan Smith wrote:
>>>
>>> > Hello,
>>> >
>>> > I'm trying to use featureCounts to extract read counts per transcript
>>> from
>>> > the SAM file (generated using Bowtie2) mapped to de novo assembled
>>> > transcripts (for DE analysis). I made annotation file as per the SAF
>>> > requirements:
>>> >
>>> > Eg:
>>> >
>>> > *GeneID Chr Start End Strand*
>>> > Locus_47_Transcript_16/31_Confidence_0.158_Length_1485
>>> > Locus_47_Transcript_16/31_Confidence_0.158_Length_1485 1 1485 +
>>> > Locus_58_Transcript_85/85_Confidence_0.017_Length_650
>>> > Locus_58_Transcript_85/85_Confidence_0.017_Length_650 1 650 +
>>> >
>>> > Here I made both the GeneID and the Chr columns same with *Start*
>>> being
>>> 1
>>> > and *End* as the length of the transcript and the Strand as + for all
>>> > transcripts.
>>> >
>>> > When I used the following code, it does nothing for a while and then
>>> R
>>> > crashes.
>>> >
>>> > library(Rsubread)
>>> > counts <-
>>> >
>>> featureCounts(files="AssembledTranscriptome-LMcontrol1.sam",annot="AnnotRsubreadSAF.txt",isGTFAnnotationFile=FALSE)
>>> >
>>> > Rsubread works great with GTF as annot parameter on genome mapped SAM
>>> files
>>> > but not with the SAF, as I tried here.
>>> >
>>> > Please let me know where I'm wrong here or if anyone tried other
>>> packages
>>> > that serves the purpose.
>>> >
>>> > sessionInfo()
>>> > R version 3.0.1 (2013-05-16)
>>> > Platform: x86_64-redhat-linux-gnu (64-bit)
>>> >
>>> > locale:
>>> > [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> > LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> > [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C
>>> >          LC_NAME=C
>>> > [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> >
>>> > attached base packages:
>>> > [1] splines   parallel  grid      stats     graphics  grDevices utils
>>> > datasets  methods   base
>>> >
>>> > other attached packages:
>>> > [1] Rsamtools_1.12.3                    edgeR_3.2.4
>>> >  Rsubread_1.10.5
>>> > [4] ChIPpeakAnno_2.8.0                  GenomicFeatures_1.12.3
>>> > limma_3.16.7
>>> > [7] org.Hs.eg.db_2.9.0                  GO.db_2.9.0
>>> >  RSQLite_0.11.4
>>> > [10] DBI_0.2-7                           AnnotationDbi_1.22.6
>>> > BSgenome.Ecoli.NCBI.20080805_1.3.17
>>> > [13] BSgenome_1.28.0                     GenomicRanges_1.12.4
>>> > Biostrings_2.28.0
>>> > [16] IRanges_1.18.2                      multtest_2.16.0
>>> >  Biobase_2.20.1
>>> > [19] biomaRt_2.16.0                      BiocGenerics_0.6.0
>>> > VennDiagram_1.6.4
>>> > [22] sendmailR_1.1-2                     base64enc_0.1-1
>>> >  BiocInstaller_1.10.3
>>> >
>>> > loaded via a namespace (and not attached):
>>> > [1] bitops_1.0-5       MASS_7.3-26        RCurl_1.95-4.1
>>> > rtracklayer_1.20.4 stats4_3.0.1
>>> > [6] survival_2.37-4    tools_3.0.1        XML_3.98-1.1
>>> zlibbioc_1.6.0
>>> >
>>> >       [[alternative HTML version deleted]]
>>> >
>>> > _______________________________________________
>>> > Bioconductor mailing list
>>> > Bioconductor at r-project.org
>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> > Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for
>>> the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission
>>> of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for
>> the
>> addressee.
>> You must not disclose, forward, print or use it without the permission
>> of
>> the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}