[BioC] Repeat masker sequences as GRanges object

Fri Sep 12 21:29:25 CEST 2014

Pfft. Downloading. Ain't nobody got time fo' dat! ;-D

library(RMySQL)
library(GenomicRanges)
con <- dbConnect("MySQL", user="genome", host="genome-mysql.cse.ucsc.edu",
db="hg19")
rmsk <- dbGetQuery(con, "select * from rmsk;")
 rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
              seqnames.field="genoName",
              start.field="genoStart",
              end.field="genoEnd",
              strand.field="strand",
              starts.in.df.are.0based=TRUE)

Best,

Jim

On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi,
>
> On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>
>> Hi Hermann,
>>
>> How about this:
>>
>>  library(AnnotationHub)
>>> hub <- AnnotationHub()
>>> hub$goldenpath.hg19.database.rmsk_0.0.1.RData
>>>
>> GRanges with 5298130 ranges and 2 metadata columns:
>>                           seqnames               ranges strand   |
>> name
>>                              <Rle>            <IRanges>  <Rle>   |
>> <character>
>>          [1]                  chr1 [16777161, 16777470]      +   |
>> AluSp
>>          [2]                  chr1 [25165801, 25166089]      -   |
>> AluY
>>          [3]                  chr1 [33553607, 33554646]      +   |
>> L2b
>>          [4]                  chr1 [50330064, 50332153]      +   |
>> L1PA10
>>          [5]                  chr1 [58720068, 58720973]      -   |
>> L1PA2
>>          ...                   ...                  ...    ... ...
>> ...
>>    [5298126] chr21_gl000210_random       [25379, 25875]      +   |
>> MER74B
>>    [5298127] chr21_gl000210_random       [26438, 26596]      -   |
>> MIRc
>>    [5298128] chr21_gl000210_random       [26882, 27022]      -   |
>> MIRc
>>    [5298129] chr21_gl000210_random       [27297, 27447]      +   |
>> HAL1-2a_MD
>>    [5298130] chr21_gl000210_random       [27469, 27682]      +   |
>> HAL1-2a_MD
>>                  score
>>              <numeric>
>>          [1]      2147
>>          [2]      2626
>>          [3]       626
>>          [4]     12545
>>          [5]      8050
>>          ...       ...
>>    [5298126]      1674
>>    [5298127]       308
>>    [5298128]       475
>>    [5298129]       371
>>    [5298130]       370
>>    ---
>>    seqlengths:
>>                      chr1                  chr2 ... chr18_gl000207_random
>>                 249250621             243199373 ...                  4262
>>
>> This is a GRanges of all features from UCSC's Repeat Masker table.
>>
>
> Nice to see the "name" and "score" metadata cols but I wonder why the
> original UCSC names for these cols (which are "repName" and "swScore")
> were not preserved. Also, other UCSC cols in the rmsk table at UCSC
> might be of interest (e.g. "repClass" and "repFamily").
>
> FWIW, here is one way to get these cols:
>
>   local_file <- tempfile()
>
> download.file("http://hgdownload.soe.ucsc.edu/
> goldenPath/hg19/database/rmsk.txt.gz", local_file)
>
>   ## Get the col names from
>   ##   http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql
>   COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
>                 "genoName", "genoStart", "genoEnd", "genoLeft",
>                 "strand", "repName", "repClass", "repFamily",
>                 "repStart", "repEnd", "repLeft", "id")
>
>   library(GenomicRanges)
>   df <- read.table(local_file, col.names=COLNAMES)
>   rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>               seqnames.field="genoName",
>               start.field="genoStart",
>               end.field="genoEnd",
>               strand.field="strand",
>               starts.in.df.are.0based=TRUE)
>
> Then:
>
>   > head(rmsk)
>   GRanges with 6 ranges and 13 metadata columns:
>         seqnames         ranges strand |       bin   swScore  milliDiv
> milliDel
>            <Rle>      <IRanges>  <Rle> | <integer> <integer> <integer>
> <integer>
>     [1]     chr1 [10001, 10468]      + |       585      1504        13
>      4
>     [2]     chr1 [10469, 11447]      - |       585      3612       114
>    270
>     [3]     chr1 [11504, 11675]      - |       585       437       235
>    186
>     [4]     chr1 [11678, 11780]      - |       585       239       294
>     19
>     [5]     chr1 [15265, 15355]      - |       585       318       230
>     38
>     [6]     chr1 [16713, 16749]      + |       585       203       162
>      0
>          milliIns   genoLeft   repName      repClass     repFamily repStart
>         <integer>  <integer>  <factor>      <factor>      <factor>
> <integer>
>     [1]        13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat     1
>     [2]        13 -249239174      TAR1     Satellite          telo  -399
>     [3]        35 -249238946      L1MC          LINE            L1 -2236
>     [4]        10 -249238841     MER5B           DNA   hAT-Charlie   -74
>     [5]         0 -249235266      MIR3          SINE           MIR  -119
>     [6]         0 -249233872    (TGG)n Simple_repeat Simple_repeat     1
>            repEnd   repLeft        id
>         <integer> <integer> <integer>
>     [1]       463         0         1
>     [2]      1712       483         2
>     [3]      5646      5449         3
>     [4]       104         1         4
>     [5]       143        49         5
>     [6]        37         0         6
>
> I also tried with rtracklayer but got the following error:
>
>   > library(rtracklayer)
>   > session <- browserSession()
>   > genome(session) <- "hg19"
>   > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
>   > system.time(rmsk <- getTable(query))
>   Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> na.strings,  :
>     line 3813855 did not have 17 elements
>   Timing stopped at: 551.027 7.24 1304.386
>
> Could be due to my flaky internet connection though...
>
> Cheers,
> H.
>
> > sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.25.16   GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
> [4] IRanges_1.99.28       S4Vectors_0.2.3       BiocGenerics_0.11.5
>
> loaded via a namespace (and not attached):
>  [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.99.19
>  [4] Biostrings_2.33.14       bitops_1.0-6             brew_1.0-6
>  [7] checkmate_1.4            codetools_0.2-9          DBI_0.3.0
> [10] digest_0.6.4             fail_1.2                 foreach_1.4.2
> [13] GenomicAlignments_1.1.29 iterators_1.0.7          RCurl_1.95-4.3
> [16] Rsamtools_1.17.33        RSQLite_0.11.4           sendmailR_1.1-2
> [19] stats4_3.1.1             stringr_0.6.2            tools_3.1.1
> [22] XML_3.98-1.1             XVector_0.5.8            zlibbioc_1.11.1
>
>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at gmail.com>
>> wrote:
>>
>>  Hello,
>>>
>>> I would like to have repeat sequences as GRanges object
>>> I started with ...
>>>
>>> library (BSgenome.Hsapiens.UCSC.hg19)
>>> ch1 <- Hsapiens$chr1
>>> active (masks (ch1))
>>> AGAPS   AMB    RM   TRF
>>>   TRUE  TRUE FALSE FALSE
>>> active (masks(ch1))["RM"] <- TRUE
>>> active (masks (ch1))
>>> AGAPS   AMB    RM   TRF
>>>   TRUE  TRUE  TRUE FALSE
>>>
>>> Can anyboldy give me a hint how to continue.
>>>
>>> Thanks
>>> hermann
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

	[[alternative HTML version deleted]]