[BioC] Repeat masker sequences as GRanges object

Hervé Pagès hpages at fhcrc.org
Sat Sep 13 00:06:40 CEST 2014


Hi Michael,

On 09/12/2014 01:47 PM, Michael Lawrence wrote:
>
>
> On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi,
>
>
>     On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>
>         Hi Hermann,
>
>         How about this:
>
>             library(AnnotationHub)
>             hub <- AnnotationHub()
>             hub$goldenpath.hg19.database.__rmsk_0.0.1.RData
>
>         GRanges with 5298130 ranges and 2 metadata columns:
>                                    seqnames               ranges strand   |
>         name
>                                       <Rle>            <IRanges>  <Rle>   |
>         <character>
>                   [1]                  chr1 [16777161, 16777470]      +   |
>         AluSp
>                   [2]                  chr1 [25165801, 25166089]      -   |
>         AluY
>                   [3]                  chr1 [33553607, 33554646]      +   |
>         L2b
>                   [4]                  chr1 [50330064, 50332153]      +   |
>         L1PA10
>                   [5]                  chr1 [58720068, 58720973]      -   |
>         L1PA2
>                   ...                   ...                  ...    ... ...
>         ...
>             [5298126] chr21_gl000210_random       [25379, 25875]      +   |
>         MER74B
>             [5298127] chr21_gl000210_random       [26438, 26596]      -   |
>         MIRc
>             [5298128] chr21_gl000210_random       [26882, 27022]      -   |
>         MIRc
>             [5298129] chr21_gl000210_random       [27297, 27447]      +   |
>         HAL1-2a_MD
>             [5298130] chr21_gl000210_random       [27469, 27682]      +   |
>         HAL1-2a_MD
>                           score
>                       <numeric>
>                   [1]      2147
>                   [2]      2626
>                   [3]       626
>                   [4]     12545
>                   [5]      8050
>                   ...       ...
>             [5298126]      1674
>             [5298127]       308
>             [5298128]       475
>             [5298129]       371
>             [5298130]       370
>             ---
>             seqlengths:
>                               chr1                  chr2 ...
>         chr18_gl000207_random
>                          249250621             243199373 ...
>                4262
>
>         This is a GRanges of all features from UCSC's Repeat Masker table.
>
>
>     Nice to see the "name" and "score" metadata cols but I wonder why the
>     original UCSC names for these cols (which are "repName" and "swScore")
>     were not preserved. Also, other UCSC cols in the rmsk table at UCSC
>     might be of interest (e.g. "repClass" and "repFamily").
>
>     FWIW, here is one way to get these cols:
>
>        local_file <- tempfile()
>
>     download.file("http://__hgdownload.soe.ucsc.edu/__goldenPath/hg19/database/rmsk.__txt.gz
>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz>",
>     local_file)
>
>        ## Get the col names from
>        ##
>     http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql
>     <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql>
>        COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
>                      "genoName", "genoStart", "genoEnd", "genoLeft",
>                      "strand", "repName", "repClass", "repFamily",
>                      "repStart", "repEnd", "repLeft", "id")
>
>        library(GenomicRanges)
>        df <- read.table(local_file, col.names=COLNAMES)
>        rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>                    seqnames.field="genoName",
>                    start.field="genoStart",
>                    end.field="genoEnd",
>                    strand.field="strand",
>                    starts.in.df.are.0based=TRUE)
>
>     Then:
>
>        > head(rmsk)
>        GRanges with 6 ranges and 13 metadata columns:
>              seqnames         ranges strand |       bin   swScore
>     milliDiv  milliDel
>                 <Rle>      <IRanges>  <Rle> | <integer> <integer>
>     <integer> <integer>
>          [1]     chr1 [10001, 10468]      + |       585      1504
>     13         4
>          [2]     chr1 [10469, 11447]      - |       585      3612
>       114       270
>          [3]     chr1 [11504, 11675]      - |       585       437
>       235       186
>          [4]     chr1 [11678, 11780]      - |       585       239
>       294        19
>          [5]     chr1 [15265, 15355]      - |       585       318
>       230        38
>          [6]     chr1 [16713, 16749]      + |       585       203
>       162         0
>               milliIns   genoLeft   repName      repClass     repFamily
>     repStart
>              <integer>  <integer>  <factor>      <factor>      <factor>
>     <integer>
>          [1]        13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
>         1
>          [2]        13 -249239174      TAR1     Satellite          telo
>     -399
>          [3]        35 -249238946      L1MC          LINE            L1
>     -2236
>          [4]        10 -249238841     MER5B           DNA   hAT-Charlie
>       -74
>          [5]         0 -249235266      MIR3          SINE           MIR
>     -119
>          [6]         0 -249233872    (TGG)n Simple_repeat Simple_repeat
>         1
>                 repEnd   repLeft        id
>              <integer> <integer> <integer>
>          [1]       463         0         1
>          [2]      1712       483         2
>          [3]      5646      5449         3
>          [4]       104         1         4
>          [5]       143        49         5
>          [6]        37         0         6
>
>     I also tried with rtracklayer but got the following error:
>
>        > library(rtracklayer)
>        > session <- browserSession()
>        > genome(session) <- "hg19"
>        > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
>        > system.time(rmsk <- getTable(query))
>        Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>     na.strings,  :
>          line 3813855 did not have 17 elements
>        Timing stopped at: 551.027 7.24 1304.386
>
>     Could be due to my flaky internet connection though...
>
>
>
> This is just hitting up against the row count limit of the table
> browser. rtracklayer uses the table browser as an abstraction, but it
> has its limitations.

Thanks for the reminder about the row count limit of UCSC table browser.

How hard would it be to modify getTable() to query directly the MySQL
server at UCSC? That would not only work around the row count limit of
the table browser, but would also make getTable() at least 10x or 20x
faster. According to my timings, that would make getTable() almost as
fast as doing hub$goldenpath.hg19.database.rmsk_0.0.1.RData (46.5 s
vs 40.5 s) even though the comparison is unfair because getTable()
retrieves the 17 cols instead of only 6 when downloading from
AnnotationHub.

That would also benefit things like makeTranscriptDbFromUCSC()
and makeFeatureDbFromUCSC(), which spend most of their time in
getTable().

Thanks,
H.

>
> Anyway, it would be great if the AnnotationHub track could be improved.
> Those extra fields are indeed useful.
>
>     Cheers,
>     H.
>
>      > sessionInfo()
>     R version 3.1.1 (2014-07-10)
>     Platform: x86_64-unknown-linux-gnu (64-bit)
>
>     locale:
>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>       [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>       [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>       [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>     attached base packages:
>     [1] parallel  stats     graphics  grDevices utils     datasets  methods
>     [8] base
>
>     other attached packages:
>     [1] rtracklayer_1.25.16   GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
>     [4] IRanges_1.99.28       S4Vectors_0.2.3       BiocGenerics_0.11.5
>
>     loaded via a namespace (and not attached):
>       [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.99.19
>       [4] Biostrings_2.33.14       bitops_1.0-6             brew_1.0-6
>       [7] checkmate_1.4            codetools_0.2-9          DBI_0.3.0
>     [10] digest_0.6.4             fail_1.2                 foreach_1.4.2
>     [13] GenomicAlignments_1.1.29 iterators_1.0.7          RCurl_1.95-4.3
>     [16] Rsamtools_1.17.33        RSQLite_0.11.4           sendmailR_1.1-2
>     [19] stats4_3.1.1             stringr_0.6.2            tools_3.1.1
>     [22] XML_3.98-1.1             XVector_0.5.8            zlibbioc_1.11.1
>
>
>         Best,
>
>         Jim
>
>
>
>
>         On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois
>         <hnorpois at gmail.com <mailto:hnorpois at gmail.com>> wrote:
>
>             Hello,
>
>             I would like to have repeat sequences as GRanges object
>             I started with ...
>
>             library (BSgenome.Hsapiens.UCSC.hg19)
>             ch1 <- Hsapiens$chr1
>             active (masks (ch1))
>             AGAPS   AMB    RM   TRF
>                TRUE  TRUE FALSE FALSE
>             active (masks(ch1))["RM"] <- TRUE
>             active (masks (ch1))
>             AGAPS   AMB    RM   TRF
>                TRUE  TRUE  TRUE FALSE
>
>             Can anyboldy give me a hint how to continue.
>
>             Thanks
>             hermann
>
>                       [[alternative HTML version deleted]]
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>     Search the archives:
>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list