[BioC] Repeat masker sequences as GRanges object

Sat Sep 13 04:43:30 CEST 2014

The table browser stuff just grew out of the HTTP interface to the actual
browser component. Of course, I wouldn't be opposed to an extension of
rtracklayer that accessed the database via SQL.

On Fri, Sep 12, 2014 at 3:06 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi Michael,
>
> On 09/12/2014 01:47 PM, Michael Lawrence wrote:
>
>>
>>
>> On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>> wrote:
>>
>>     Hi,
>>
>>
>>     On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>>
>>         Hi Hermann,
>>
>>         How about this:
>>
>>             library(AnnotationHub)
>>             hub <- AnnotationHub()
>>             hub$goldenpath.hg19.database.__rmsk_0.0.1.RData
>>
>>
>>         GRanges with 5298130 ranges and 2 metadata columns:
>>                                    seqnames               ranges strand
>>  |
>>         name
>>                                       <Rle>            <IRanges>  <Rle>
>>  |
>>         <character>
>>                   [1]                  chr1 [16777161, 16777470]      +
>>  |
>>         AluSp
>>                   [2]                  chr1 [25165801, 25166089]      -
>>  |
>>         AluY
>>                   [3]                  chr1 [33553607, 33554646]      +
>>  |
>>         L2b
>>                   [4]                  chr1 [50330064, 50332153]      +
>>  |
>>         L1PA10
>>                   [5]                  chr1 [58720068, 58720973]      -
>>  |
>>         L1PA2
>>                   ...                   ...                  ...    ...
>> ...
>>         ...
>>             [5298126] chr21_gl000210_random       [25379, 25875]      +
>>  |
>>         MER74B
>>             [5298127] chr21_gl000210_random       [26438, 26596]      -
>>  |
>>         MIRc
>>             [5298128] chr21_gl000210_random       [26882, 27022]      -
>>  |
>>         MIRc
>>             [5298129] chr21_gl000210_random       [27297, 27447]      +
>>  |
>>         HAL1-2a_MD
>>             [5298130] chr21_gl000210_random       [27469, 27682]      +
>>  |
>>         HAL1-2a_MD
>>                           score
>>                       <numeric>
>>                   [1]      2147
>>                   [2]      2626
>>                   [3]       626
>>                   [4]     12545
>>                   [5]      8050
>>                   ...       ...
>>             [5298126]      1674
>>             [5298127]       308
>>             [5298128]       475
>>             [5298129]       371
>>             [5298130]       370
>>             ---
>>             seqlengths:
>>                               chr1                  chr2 ...
>>         chr18_gl000207_random
>>                          249250621             243199373 ...
>>                4262
>>
>>         This is a GRanges of all features from UCSC's Repeat Masker table.
>>
>>
>>     Nice to see the "name" and "score" metadata cols but I wonder why the
>>     original UCSC names for these cols (which are "repName" and "swScore")
>>     were not preserved. Also, other UCSC cols in the rmsk table at UCSC
>>     might be of interest (e.g. "repClass" and "repFamily").
>>
>>     FWIW, here is one way to get these cols:
>>
>>        local_file <- tempfile()
>>
>>     download.file("http://__hgdownload.soe.ucsc.edu/__
>> goldenPath/hg19/database/rmsk.__txt.gz
>>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz
>> >",
>>     local_file)
>>
>>        ## Get the col names from
>>        ##
>>     http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql
>>
>>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql>
>>        COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
>>                      "genoName", "genoStart", "genoEnd", "genoLeft",
>>                      "strand", "repName", "repClass", "repFamily",
>>                      "repStart", "repEnd", "repLeft", "id")
>>
>>        library(GenomicRanges)
>>        df <- read.table(local_file, col.names=COLNAMES)
>>        rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>>                    seqnames.field="genoName",
>>                    start.field="genoStart",
>>                    end.field="genoEnd",
>>                    strand.field="strand",
>>                    starts.in.df.are.0based=TRUE)
>>
>>     Then:
>>
>>        > head(rmsk)
>>        GRanges with 6 ranges and 13 metadata columns:
>>              seqnames         ranges strand |       bin   swScore
>>     milliDiv  milliDel
>>                 <Rle>      <IRanges>  <Rle> | <integer> <integer>
>>     <integer> <integer>
>>          [1]     chr1 [10001, 10468]      + |       585      1504
>>     13         4
>>          [2]     chr1 [10469, 11447]      - |       585      3612
>>       114       270
>>          [3]     chr1 [11504, 11675]      - |       585       437
>>       235       186
>>          [4]     chr1 [11678, 11780]      - |       585       239
>>       294        19
>>          [5]     chr1 [15265, 15355]      - |       585       318
>>       230        38
>>          [6]     chr1 [16713, 16749]      + |       585       203
>>       162         0
>>               milliIns   genoLeft   repName      repClass     repFamily
>>     repStart
>>              <integer>  <integer>  <factor>      <factor>      <factor>
>>     <integer>
>>          [1]        13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
>>         1
>>          [2]        13 -249239174      TAR1     Satellite          telo
>>     -399
>>          [3]        35 -249238946      L1MC          LINE            L1
>>     -2236
>>          [4]        10 -249238841     MER5B           DNA   hAT-Charlie
>>       -74
>>          [5]         0 -249235266      MIR3          SINE           MIR
>>     -119
>>          [6]         0 -249233872    (TGG)n Simple_repeat Simple_repeat
>>         1
>>                 repEnd   repLeft        id
>>              <integer> <integer> <integer>
>>          [1]       463         0         1
>>          [2]      1712       483         2
>>          [3]      5646      5449         3
>>          [4]       104         1         4
>>          [5]       143        49         5
>>          [6]        37         0         6
>>
>>     I also tried with rtracklayer but got the following error:
>>
>>        > library(rtracklayer)
>>        > session <- browserSession()
>>        > genome(session) <- "hg19"
>>        > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
>>        > system.time(rmsk <- getTable(query))
>>        Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>     na.strings,  :
>>          line 3813855 did not have 17 elements
>>        Timing stopped at: 551.027 7.24 1304.386
>>
>>     Could be due to my flaky internet connection though...
>>
>>
>>
>> This is just hitting up against the row count limit of the table
>> browser. rtracklayer uses the table browser as an abstraction, but it
>> has its limitations.
>>
>
> Thanks for the reminder about the row count limit of UCSC table browser.
>
> How hard would it be to modify getTable() to query directly the MySQL
> server at UCSC? That would not only work around the row count limit of
> the table browser, but would also make getTable() at least 10x or 20x
> faster. According to my timings, that would make getTable() almost as
> fast as doing hub$goldenpath.hg19.database.rmsk_0.0.1.RData (46.5 s
> vs 40.5 s) even though the comparison is unfair because getTable()
> retrieves the 17 cols instead of only 6 when downloading from
> AnnotationHub.
>
> That would also benefit things like makeTranscriptDbFromUCSC()
> and makeFeatureDbFromUCSC(), which spend most of their time in
> getTable().
>
> Thanks,
> H.
>
>
>> Anyway, it would be great if the AnnotationHub track could be improved.
>> Those extra fields are indeed useful.
>>
>>     Cheers,
>>     H.
>>
>>      > sessionInfo()
>>     R version 3.1.1 (2014-07-10)
>>     Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>     locale:
>>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>     attached base packages:
>>     [1] parallel  stats     graphics  grDevices utils     datasets
>> methods
>>     [8] base
>>
>>     other attached packages:
>>     [1] rtracklayer_1.25.16   GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
>>     [4] IRanges_1.99.28       S4Vectors_0.2.3       BiocGenerics_0.11.5
>>
>>     loaded via a namespace (and not attached):
>>       [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.99.19
>>       [4] Biostrings_2.33.14       bitops_1.0-6             brew_1.0-6
>>       [7] checkmate_1.4            codetools_0.2-9          DBI_0.3.0
>>     [10] digest_0.6.4             fail_1.2                 foreach_1.4.2
>>     [13] GenomicAlignments_1.1.29 iterators_1.0.7          RCurl_1.95-4.3
>>     [16] Rsamtools_1.17.33        RSQLite_0.11.4           sendmailR_1.1-2
>>     [19] stats4_3.1.1             stringr_0.6.2            tools_3.1.1
>>     [22] XML_3.98-1.1             XVector_0.5.8            zlibbioc_1.11.1
>>
>>
>>         Best,
>>
>>         Jim
>>
>>
>>
>>
>>         On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois
>>         <hnorpois at gmail.com <mailto:hnorpois at gmail.com>> wrote:
>>
>>             Hello,
>>
>>             I would like to have repeat sequences as GRanges object
>>             I started with ...
>>
>>             library (BSgenome.Hsapiens.UCSC.hg19)
>>             ch1 <- Hsapiens$chr1
>>             active (masks (ch1))
>>             AGAPS   AMB    RM   TRF
>>                TRUE  TRUE FALSE FALSE
>>             active (masks(ch1))["RM"] <- TRUE
>>             active (masks (ch1))
>>             AGAPS   AMB    RM   TRF
>>                TRUE  TRUE  TRUE FALSE
>>
>>             Can anyboldy give me a hint how to continue.
>>
>>             Thanks
>>             hermann
>>
>>                       [[alternative HTML version deleted]]
>>
>>             _________________________________________________
>>             Bioconductor mailing list
>>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org
>> >
>>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>             Search the archives:
>>             http://news.gmane.org/gmane.__science.biology.informatics.__
>> conductor
>>             <http://news.gmane.org/gmane.science.biology.informatics.
>> conductor>
>>
>>
>>
>>
>>
>>     --
>>     Hervé Pagès
>>
>>     Program in Computational Biology
>>     Division of Public Health Sciences
>>     Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N, M1-B514
>>     P.O. Box 19024
>>     Seattle, WA 98109-1024
>>
>>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>     _________________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>     Search the archives:
>>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

	[[alternative HTML version deleted]]