[BioC] Repeat masker sequences as GRanges object
James W. MacDonald
jmacdon at uw.edu
Fri Sep 12 21:29:25 CEST 2014
Pfft. Downloading. Ain't nobody got time fo' dat! ;-D
library(RMySQL)
library(GenomicRanges)
con <- dbConnect("MySQL", user="genome", host="genome-mysql.cse.ucsc.edu",
db="hg19")
rmsk <- dbGetQuery(con, "select * from rmsk;")
rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
seqnames.field="genoName",
start.field="genoStart",
end.field="genoEnd",
strand.field="strand",
starts.in.df.are.0based=TRUE)
Best,
Jim
On Fri, Sep 12, 2014 at 2:08 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi,
>
> On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>
>> Hi Hermann,
>>
>> How about this:
>>
>> library(AnnotationHub)
>>> hub <- AnnotationHub()
>>> hub$goldenpath.hg19.database.rmsk_0.0.1.RData
>>>
>> GRanges with 5298130 ranges and 2 metadata columns:
>> seqnames ranges strand |
>> name
>> <Rle> <IRanges> <Rle> |
>> <character>
>> [1] chr1 [16777161, 16777470] + |
>> AluSp
>> [2] chr1 [25165801, 25166089] - |
>> AluY
>> [3] chr1 [33553607, 33554646] + |
>> L2b
>> [4] chr1 [50330064, 50332153] + |
>> L1PA10
>> [5] chr1 [58720068, 58720973] - |
>> L1PA2
>> ... ... ... ... ...
>> ...
>> [5298126] chr21_gl000210_random [25379, 25875] + |
>> MER74B
>> [5298127] chr21_gl000210_random [26438, 26596] - |
>> MIRc
>> [5298128] chr21_gl000210_random [26882, 27022] - |
>> MIRc
>> [5298129] chr21_gl000210_random [27297, 27447] + |
>> HAL1-2a_MD
>> [5298130] chr21_gl000210_random [27469, 27682] + |
>> HAL1-2a_MD
>> score
>> <numeric>
>> [1] 2147
>> [2] 2626
>> [3] 626
>> [4] 12545
>> [5] 8050
>> ... ...
>> [5298126] 1674
>> [5298127] 308
>> [5298128] 475
>> [5298129] 371
>> [5298130] 370
>> ---
>> seqlengths:
>> chr1 chr2 ... chr18_gl000207_random
>> 249250621 243199373 ... 4262
>>
>> This is a GRanges of all features from UCSC's Repeat Masker table.
>>
>
> Nice to see the "name" and "score" metadata cols but I wonder why the
> original UCSC names for these cols (which are "repName" and "swScore")
> were not preserved. Also, other UCSC cols in the rmsk table at UCSC
> might be of interest (e.g. "repClass" and "repFamily").
>
> FWIW, here is one way to get these cols:
>
> local_file <- tempfile()
>
> download.file("http://hgdownload.soe.ucsc.edu/
> goldenPath/hg19/database/rmsk.txt.gz", local_file)
>
> ## Get the col names from
> ## http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql
> COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
> "genoName", "genoStart", "genoEnd", "genoLeft",
> "strand", "repName", "repClass", "repFamily",
> "repStart", "repEnd", "repLeft", "id")
>
> library(GenomicRanges)
> df <- read.table(local_file, col.names=COLNAMES)
> rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
> seqnames.field="genoName",
> start.field="genoStart",
> end.field="genoEnd",
> strand.field="strand",
> starts.in.df.are.0based=TRUE)
>
> Then:
>
> > head(rmsk)
> GRanges with 6 ranges and 13 metadata columns:
> seqnames ranges strand | bin swScore milliDiv
> milliDel
> <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
> <integer>
> [1] chr1 [10001, 10468] + | 585 1504 13
> 4
> [2] chr1 [10469, 11447] - | 585 3612 114
> 270
> [3] chr1 [11504, 11675] - | 585 437 235
> 186
> [4] chr1 [11678, 11780] - | 585 239 294
> 19
> [5] chr1 [15265, 15355] - | 585 318 230
> 38
> [6] chr1 [16713, 16749] + | 585 203 162
> 0
> milliIns genoLeft repName repClass repFamily repStart
> <integer> <integer> <factor> <factor> <factor>
> <integer>
> [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat 1
> [2] 13 -249239174 TAR1 Satellite telo -399
> [3] 35 -249238946 L1MC LINE L1 -2236
> [4] 10 -249238841 MER5B DNA hAT-Charlie -74
> [5] 0 -249235266 MIR3 SINE MIR -119
> [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat 1
> repEnd repLeft id
> <integer> <integer> <integer>
> [1] 463 0 1
> [2] 1712 483 2
> [3] 5646 5449 3
> [4] 104 1 4
> [5] 143 49 5
> [6] 37 0 6
>
> I also tried with rtracklayer but got the following error:
>
> > library(rtracklayer)
> > session <- browserSession()
> > genome(session) <- "hg19"
> > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
> > system.time(rmsk <- getTable(query))
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> na.strings, :
> line 3813855 did not have 17 elements
> Timing stopped at: 551.027 7.24 1304.386
>
> Could be due to my flaky internet connection though...
>
> Cheers,
> H.
>
> > sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
> [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19
> [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6
> [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0
> [10] digest_0.6.4 fail_1.2 foreach_1.4.2
> [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3
> [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2
> [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
> [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1
>
>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois <hnorpois at gmail.com>
>> wrote:
>>
>> Hello,
>>>
>>> I would like to have repeat sequences as GRanges object
>>> I started with ...
>>>
>>> library (BSgenome.Hsapiens.UCSC.hg19)
>>> ch1 <- Hsapiens$chr1
>>> active (masks (ch1))
>>> AGAPS AMB RM TRF
>>> TRUE TRUE FALSE FALSE
>>> active (masks(ch1))["RM"] <- TRUE
>>> active (masks (ch1))
>>> AGAPS AMB RM TRF
>>> TRUE TRUE TRUE FALSE
>>>
>>> Can anyboldy give me a hint how to continue.
>>>
>>> Thanks
>>> hermann
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
[[alternative HTML version deleted]]
More information about the Bioconductor
mailing list