[BioC] Repeat masker sequences as GRanges object
Michael Lawrence
lawrence.michael at gene.com
Sat Sep 13 04:43:30 CEST 2014
The table browser stuff just grew out of the HTTP interface to the actual
browser component. Of course, I wouldn't be opposed to an extension of
rtracklayer that accessed the database via SQL.
On Fri, Sep 12, 2014 at 3:06 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Michael,
>
> On 09/12/2014 01:47 PM, Michael Lawrence wrote:
>
>>
>>
>> On Fri, Sep 12, 2014 at 11:08 AM, Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>> wrote:
>>
>> Hi,
>>
>>
>> On 09/12/2014 06:30 AM, James W. MacDonald wrote:
>>
>> Hi Hermann,
>>
>> How about this:
>>
>> library(AnnotationHub)
>> hub <- AnnotationHub()
>> hub$goldenpath.hg19.database.__rmsk_0.0.1.RData
>>
>>
>> GRanges with 5298130 ranges and 2 metadata columns:
>> seqnames ranges strand
>> |
>> name
>> <Rle> <IRanges> <Rle>
>> |
>> <character>
>> [1] chr1 [16777161, 16777470] +
>> |
>> AluSp
>> [2] chr1 [25165801, 25166089] -
>> |
>> AluY
>> [3] chr1 [33553607, 33554646] +
>> |
>> L2b
>> [4] chr1 [50330064, 50332153] +
>> |
>> L1PA10
>> [5] chr1 [58720068, 58720973] -
>> |
>> L1PA2
>> ... ... ... ...
>> ...
>> ...
>> [5298126] chr21_gl000210_random [25379, 25875] +
>> |
>> MER74B
>> [5298127] chr21_gl000210_random [26438, 26596] -
>> |
>> MIRc
>> [5298128] chr21_gl000210_random [26882, 27022] -
>> |
>> MIRc
>> [5298129] chr21_gl000210_random [27297, 27447] +
>> |
>> HAL1-2a_MD
>> [5298130] chr21_gl000210_random [27469, 27682] +
>> |
>> HAL1-2a_MD
>> score
>> <numeric>
>> [1] 2147
>> [2] 2626
>> [3] 626
>> [4] 12545
>> [5] 8050
>> ... ...
>> [5298126] 1674
>> [5298127] 308
>> [5298128] 475
>> [5298129] 371
>> [5298130] 370
>> ---
>> seqlengths:
>> chr1 chr2 ...
>> chr18_gl000207_random
>> 249250621 243199373 ...
>> 4262
>>
>> This is a GRanges of all features from UCSC's Repeat Masker table.
>>
>>
>> Nice to see the "name" and "score" metadata cols but I wonder why the
>> original UCSC names for these cols (which are "repName" and "swScore")
>> were not preserved. Also, other UCSC cols in the rmsk table at UCSC
>> might be of interest (e.g. "repClass" and "repFamily").
>>
>> FWIW, here is one way to get these cols:
>>
>> local_file <- tempfile()
>>
>> download.file("http://__hgdownload.soe.ucsc.edu/__
>> goldenPath/hg19/database/rmsk.__txt.gz
>> <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz
>> >",
>> local_file)
>>
>> ## Get the col names from
>> ##
>> http://hgdownload.soe.ucsc.__edu/goldenPath/hg19/database/__rmsk.sql
>>
>> <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql>
>> COLNAMES <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
>> "genoName", "genoStart", "genoEnd", "genoLeft",
>> "strand", "repName", "repClass", "repFamily",
>> "repStart", "repEnd", "repLeft", "id")
>>
>> library(GenomicRanges)
>> df <- read.table(local_file, col.names=COLNAMES)
>> rmsk <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
>> seqnames.field="genoName",
>> start.field="genoStart",
>> end.field="genoEnd",
>> strand.field="strand",
>> starts.in.df.are.0based=TRUE)
>>
>> Then:
>>
>> > head(rmsk)
>> GRanges with 6 ranges and 13 metadata columns:
>> seqnames ranges strand | bin swScore
>> milliDiv milliDel
>> <Rle> <IRanges> <Rle> | <integer> <integer>
>> <integer> <integer>
>> [1] chr1 [10001, 10468] + | 585 1504
>> 13 4
>> [2] chr1 [10469, 11447] - | 585 3612
>> 114 270
>> [3] chr1 [11504, 11675] - | 585 437
>> 235 186
>> [4] chr1 [11678, 11780] - | 585 239
>> 294 19
>> [5] chr1 [15265, 15355] - | 585 318
>> 230 38
>> [6] chr1 [16713, 16749] + | 585 203
>> 162 0
>> milliIns genoLeft repName repClass repFamily
>> repStart
>> <integer> <integer> <factor> <factor> <factor>
>> <integer>
>> [1] 13 -249240153 (CCCTAA)n Simple_repeat Simple_repeat
>> 1
>> [2] 13 -249239174 TAR1 Satellite telo
>> -399
>> [3] 35 -249238946 L1MC LINE L1
>> -2236
>> [4] 10 -249238841 MER5B DNA hAT-Charlie
>> -74
>> [5] 0 -249235266 MIR3 SINE MIR
>> -119
>> [6] 0 -249233872 (TGG)n Simple_repeat Simple_repeat
>> 1
>> repEnd repLeft id
>> <integer> <integer> <integer>
>> [1] 463 0 1
>> [2] 1712 483 2
>> [3] 5646 5449 3
>> [4] 104 1 4
>> [5] 143 49 5
>> [6] 37 0 6
>>
>> I also tried with rtracklayer but got the following error:
>>
>> > library(rtracklayer)
>> > session <- browserSession()
>> > genome(session) <- "hg19"
>> > query <- ucscTableQuery(session, "RepeatMasker", table="rmsk")
>> > system.time(rmsk <- getTable(query))
>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>> na.strings, :
>> line 3813855 did not have 17 elements
>> Timing stopped at: 551.027 7.24 1304.386
>>
>> Could be due to my flaky internet connection though...
>>
>>
>>
>> This is just hitting up against the row count limit of the table
>> browser. rtracklayer uses the table browser as an abstraction, but it
>> has its limitations.
>>
>
> Thanks for the reminder about the row count limit of UCSC table browser.
>
> How hard would it be to modify getTable() to query directly the MySQL
> server at UCSC? That would not only work around the row count limit of
> the table browser, but would also make getTable() at least 10x or 20x
> faster. According to my timings, that would make getTable() almost as
> fast as doing hub$goldenpath.hg19.database.rmsk_0.0.1.RData (46.5 s
> vs 40.5 s) even though the comparison is unfair because getTable()
> retrieves the 17 cols instead of only 6 when downloading from
> AnnotationHub.
>
> That would also benefit things like makeTranscriptDbFromUCSC()
> and makeFeatureDbFromUCSC(), which spend most of their time in
> getTable().
>
> Thanks,
> H.
>
>
>> Anyway, it would be great if the AnnotationHub track could be improved.
>> Those extra fields are indeed useful.
>>
>> Cheers,
>> H.
>>
>> > sessionInfo()
>> R version 3.1.1 (2014-07-10)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
>> methods
>> [8] base
>>
>> other attached packages:
>> [1] rtracklayer_1.25.16 GenomicRanges_1.17.40 GenomeInfoDb_1.1.19
>> [4] IRanges_1.99.28 S4Vectors_0.2.3 BiocGenerics_0.11.5
>>
>> loaded via a namespace (and not attached):
>> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.99.19
>> [4] Biostrings_2.33.14 bitops_1.0-6 brew_1.0-6
>> [7] checkmate_1.4 codetools_0.2-9 DBI_0.3.0
>> [10] digest_0.6.4 fail_1.2 foreach_1.4.2
>> [13] GenomicAlignments_1.1.29 iterators_1.0.7 RCurl_1.95-4.3
>> [16] Rsamtools_1.17.33 RSQLite_0.11.4 sendmailR_1.1-2
>> [19] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
>> [22] XML_3.98-1.1 XVector_0.5.8 zlibbioc_1.11.1
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>> On Thu, Sep 11, 2014 at 3:16 AM, Hermann Norpois
>> <hnorpois at gmail.com <mailto:hnorpois at gmail.com>> wrote:
>>
>> Hello,
>>
>> I would like to have repeat sequences as GRanges object
>> I started with ...
>>
>> library (BSgenome.Hsapiens.UCSC.hg19)
>> ch1 <- Hsapiens$chr1
>> active (masks (ch1))
>> AGAPS AMB RM TRF
>> TRUE TRUE FALSE FALSE
>> active (masks(ch1))["RM"] <- TRUE
>> active (masks (ch1))
>> AGAPS AMB RM TRF
>> TRUE TRUE TRUE FALSE
>>
>> Can anyboldy give me a hint how to continue.
>>
>> Thanks
>> hermann
>>
>> [[alternative HTML version deleted]]
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org
>> >
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.__science.biology.informatics.__
>> conductor
>> <http://news.gmane.org/gmane.science.biology.informatics.
>> conductor>
>>
>>
>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]
More information about the Bioconductor
mailing list