[BioC] bed file around the TSSs

Sat Oct 16 23:49:32 CEST 2010

as usual my suggestions were a bit more involved than necessary.  the
point is that the thing you get from the
transcripts() method applied to a makeTranscriptsDb will have
"tx_name" as the name of the elementMetadata and
this metadata will not be propagated by export(); if it had name
"name", it would be propagated.  so

> f19 = loadFeatures("hg19.txdb.sqlite")  # serialized via saveFeatures, but perhaps a long time ago... so the message:
The TranscriptDb object has been updated to the latest schema version 1.0.
The updated object can be saved using saveFeatures()
> t19 = transcripts(f19)  # get transcripts GRanges
> names(elementMetadata(t19))   # what are the metadata names
[1] "tx_id"   "tx_name"
> names(elementMetadata(t19))[2] = "name"   # reset a name
> export(t19, "t19.bed")    # do not use file=
> cat(readLines("t19.bed", n=5), sep="\n")       # what you asked for, minus the header line
chr1    11873   14409   uc001aaa.3      0       +
chr1    11873   14409   uc010nxq.1      0       +
chr1    11873   14409   uc010nxr.1      0       +
chr1    69090   70008   uc001aal.1      0       +
chr1    321083  321114  uc001aaq.1      0       +

On Sat, Oct 16, 2010 at 5:10 PM, joseph <jdsandjd at yahoo.com> wrote:
> Hi Vincent
> Thank you, that was very helpful.
> Can you please expand a bit more  on step#3 about coercing to RangedData and
> changing tx_name to name?
> Joseph
>
> ________________________________
> From: Vincent Carey <stvjc at channing.harvard.edu>
> To: joseph <jdsandjd at yahoo.com>
> Cc: bioconductor at stat.math.ethz.ch
> Sent: Wed, October 13, 2010 2:05:31 PM
> Subject: Re: [BioC] bed file around the TSSs
>
> There are various approaches but basic tools for one solution are
> 1) use GenomicFeatures::makeTranscriptTableFromUCSC(genome, table) to
> acquire
> a SQLite database that has relevant information
> 2) extract the GRanges object corresponding to transcripts and their
> genomic coordinates
> using the transcripts() method
> 3) set the names of the GRanges using something like names(gr) =
> elementMetadata(gr)$tx_name -- but see below; you might instead coerce
> to RangedData and just change the column names
> 4) revise the coordinate information according to your wishes
> 5) use the rtracklayer export method on the revised GRanges or
> RangedData structure to create the bed file
>
> some concrete results, based on
>
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2010-10-12 r53293)
> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats    graphics  grDevices datasets  tools    utils    methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.9.9      RCurl_1.4-3            bitops_1.0-4.1
> [4] GenomicFeatures_1.1.12 GenomicRanges_1.1.29  IRanges_1.7.35
> [7] weaver_1.15.0          codetools_0.2-2        digest_0.4.2
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.17.7    Biobase_2.9.2      Biostrings_2.17.47 DBI_0.2-5
> [5] RSQLite_0.9-2      XML_3.1-1          biomaRt_2.5.1
>
>> t1 = makeTranscriptDbFromUCSC("hg19", "refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> tt1 = transcripts(t1)
>> tt1
> GRanges with 37195 ranges and 2 elementMetadata values
>         seqnames              ranges strand  |    tx_id      tx_name
>           <Rle>            <IRanges>  <Rle>  | <integer>  <character>
>     [1]    chr1    [ 69091,  70008]      +  |      980 NM_001005484
>     [2]    chr1    [323892, 328581]      +  |      985    NR_028327
>     [3]    chr1    [323892, 328581]      +  |      986    NR_028322
>     [4]    chr1    [323892, 328581]      +  |      987    NR_028325
>     [5]    chr1    [367659, 368597]      +  |      984 NM_001005277
>
>> names(tt1) = make.names(elementMetadata(tt1)$tx_name, unique=TRUE)  # why?
>> tt1[1:5]
> GRanges with 5 ranges and 2 elementMetadata values
>             seqnames          ranges strand |    tx_id      tx_name
>                 <Rle>        <IRanges>  <Rle> | <integer>  <character>
> NM_001005484    chr1 [ 69091,  70008]      + |      980 NM_001005484
>   NR_028327    chr1 [323892, 328581]      + |      985    NR_028327
>   NR_028322    chr1 [323892, 328581]      + |      986    NR_028322
>   NR_028325    chr1 [323892, 328581]      + |      987    NR_028325
> NM_001005277    chr1 [367659, 368597]      + |      984 NM_001005277
>
> seqlengths
>                   chr1                  chr2 ... chr18_gl000207_random
>             249250621            243199373 ...                  4262
>
> observe what happens when the tx_names are duplicated
>
>> tt1[42:44]
> GRanges with 3 ranges and 2 elementMetadata values
>             seqnames            ranges strand |    tx_id    tx_name
>               <Rle>          <IRanges>  <Rle> | <integer> <character>
>   NR_002946    chr1 [1568159, 1570027]      + |      1053  NR_002946
> NR_002946.1    chr1 [1631378, 1633247]      + |      1063  NR_002946
>   NM_138705    chr1 [1846266, 1848733]      + |      1066  NM_138705
>
> seqlengths
>                   chr1                  chr2 ... chr18_gl000207_random
>             249250621            243199373 ...                  4262
>
> if you coerce to RangedData and change tx_name to name you can
> probably avoid the munging of the transcript names,
> which in this case, if you use the approach above, could be pretty
> confusing if .1 etc are used to indicate versions at refseq
>
>> export(tt1, "tt1.bed")
>> cat(readLines("tt1.bed", n=5), sep="\n")
> chr1    69090  70008  NM_001005484    0      +
> chr1    323891  328581  NR_028327      0      +
> chr1    323891  328581  NR_028322      0      +
> chr1    323891  328581  NR_028325      0      +
> chr1    367658  368597  NM_001005277    0      +
>
> i have skipped the modifications to the coordinates (you will need to
> program
> that conditional on strand, presumably) and the setting of the header line.
>
>
>
>
> On Wed, Oct 13, 2010 at 3:53 PM, joseph <jdsandjd at yahoo.com> wrote:
>> Hi
>> I would highly appreciate if you could show me how to create a bed file
>> around
>> the TSSs from UCSC databases such as ensembl or refSeq genes. I need 350
>> nucleotides upstream and 150 nucleotides downstream of TSSs. The bed file
>> should
>> look like below, where:
>> chromStart is 350 nucleotides upstream of TSS
>> chromEnd is 150 nucleotides downstream of TSS
>> name is Name of gene or transcript_id depending on the database.
>>
>>
>>
>> chrom    chromStart    chromEnd    name        score    strand
>> chr1    67051159    67163158    NM_024763    0    -
>> chr1    67075869    67163158    NM_207014    0    +
>> chr1    16762998    16812569    NM_017940    0    -
>>
>> Thanks
>> Joseph
>>
>>
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>