[R] Using read.table for importing gz file

David Winsemius dw|n@em|u@ @end|ng |rom comc@@t@net
Sun Aug 11 02:47:11 CEST 2019


Further note:


After three minutes of waiting (not a particularly long wait, in my 
opinion), I get this:


 > z <- read.table(
     text = readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz'))),
     header = TRUE, sep = "\t")
 > dim(z)
[1] 485577    686

So almost half a million lines of data in a rather wide dataset for an 
incompletely described file.


I'd say R seems to be "working" properly.
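
If the wait matters, one variation that may be worth trying is to save 
the compressed file to disk first and hand read.table() a gzfile() 
connection, so the text does not have to pass through readLines() and 
text=. This is only a sketch: a tiny temporary file with made-up probe 
and sample IDs stands in for the real download, and the commented-out 
download.file() call is what I would presumably use for the real URL.

```r
# Build a tiny tab-separated, gzip-compressed file that stands in for the
# real download (hypothetical IDs and values, not taken from the real file).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01",
             "cg00000001\t0.40\t0.93",
             "cg00000002\t0.02\tNA"), con)
close(con)

# For the real data one would first fetch the file (not run here):
# download.file(url, tmp, mode = "wb")

# read.table() accepts an unopened gzfile() connection; it opens the
# connection, reads the decompressed text, and closes it again.
small <- read.table(gzfile(tmp), header = TRUE, sep = "\t")
dim(small)
```

A side benefit of keeping the file on disk is that a failed parse does 
not mean downloading everything again.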

data.table::fread was more informative about the process but achieved 
basically the same result in about one sixth of the time:


  ?fread
system.time(
  z <- fread('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz',
             sep = "\t")
)

#-----------

[100%] Downloaded 597770433 bytes...
    user  system elapsed
  20.682   3.322  29.292

 > dim(z)
[1] 485577    686
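
The dimensions agree, but the column names probably do not. The str() 
output in the quoted message below shows that read.table() turned the 
hyphenated sample IDs from the header (TCGA-E1-5319-01) into 
TCGA.E1.5319.01, because its check.names argument defaults to TRUE. 
fread() defaults to check.names = FALSE, so it should keep the hyphens. 
A small base-R sketch of the difference, using made-up IDs:

```r
# One header line and one data line with hyphenated (hypothetical) sample IDs.
txt <- c("sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01",
         "cg00000001\t0.40\t0.93")

# Default: check.names = TRUE runs the header through make.names(),
# which replaces "-" with ".".
a <- read.table(text = txt, header = TRUE, sep = "\t")
names(a)

# check.names = FALSE keeps the IDs exactly as they appear in the file.
b <- read.table(text = txt, header = TRUE, sep = "\t", check.names = FALSE)
names(b)
```

With the hyphens kept, columns have to be addressed with backticks or 
with [[ ]], e.g. b[["TCGA-AA-0001-01"]].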

-- 

David.

On 8/10/19 5:32 PM, David Winsemius wrote:
> Well, let's see about "rules" ... you posted in HTML when this is a 
> plain text mailing list, and then you replied only to me when you are 
> supposed to reply to the list (so I'm putting the list address back in 
> my reply):
>
>
> When I copied your code and then attempted to do a bit of debugging I 
> get:
>
>
> > z <- 
> readLines(gzcon(url(“https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz”)), 
> n = 100)
> Error: unexpected input in "z <- readLines(gzcon(url(�"
>
> # that was because you had "smart-quotes" rather than ASCII quotes:
>
>
> > z <- readLines(gzcon(url( 
> 'https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz' 
> )), n = 100)
> > z[1:10]
>  [1] 
> "sample\tTCGA-E1-5319-01\tTCGA-HT-7693-01\tTCGA-CS-6665-01\tTCGA-S9-A7J2-01\tTCGA-FG-A6J3-01\tTCGA-FG-6688-01\tTCGA-S9-A6TX-01\tTCGA-VM-A8C8-01\tTCGA-74-6577-01\tTCGA-06-AABW-11\tTCGA-06-0125-02\tTCGA-HT-A74L-01\tTCGA-26-A7UX-01\tTCGA-DU-A5TS-01\tTCGA-06-6388-01\tTCGA-DB-A4XA-01\tTCGA-06-A7TL-01\tTCGA-HT-A4DV-01\tTCGA-TQ-A7RP-01\tTCGA-E1-5311-01\tTCGA-28-5213-01\tTCGA-E1-A7YI-01\tTCGA-E1-5305-01\tTCGA-F6-A8O4-01\tTCGA-HT-8113-01\tTCGA-DH-A66G-01\tTCGA-76-4932-01\t
>
> Snipped hundreds of lines. So this seems to indicate that this is a 
> tab separated file. Don't you have some documentation to refer to?
>
>
> This seems possibly useful:
>
>
> > z <- read.table( 
> text=readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz')), 
> n = 100), header=TRUE, sep="\t")
> > str(z)
> 'data.frame':    99 obs. of  686 variables:
>  $ sample         : Factor w/ 99 levels "cg00036732","cg00651829",..: 
> 53 2 60 41 16 13 37 20 70 21 ...
>  $ TCGA.E1.5319.01: num  0.4019 0.0215 0.053 0.0453 0.515 ...
>  $ TCGA.HT.7693.01: num  0.9364 0.0216 0.0547 0.0819 0.6129 ...
>  $ TCGA.CS.6665.01: num  0.0345 0.0164 0.0719 0.0497 0.6648 ...
>  $ TCGA.S9.A7J2.01: num  0.0295 0.0168 0.0421 0.0867 0.1657 ...
>  $ TCGA.FG.A6J3.01: num  0.0248 0.0161 0.0556 0.0902 0.5042 ...
>  $ TCGA.FG.6688.01: num  0.0203 0.0179 0.0321 0.0513 0.1075 ...
>  $ TCGA.S9.A6TX.01: num  0.0378 0.0199 0.0623 0.0992 0.7662 ...
>  $ TCGA.VM.A8C8.01: num  0.0271 0.0172 0.0466 0.0564 0.3478 ...
>  $ TCGA.74.6577.01: num  0.0237 0.0193 0.0196 0.0961 0.1242 ...
>  $ TCGA.06.AABW.11: num  0.0323 0.0156 0.0395 0.0708 0.1136 ...
>  $ TCGA.06.0125.02: num  0.0238 0.0181 0.039 0.068 0.0796 ...
>  $ TCGA.HT.A74L.01: num  0.7409 0.0221 0.0596 0.0765 0.8157 ...
>
> #snipped the output
>
> # there seemed to be 686 columns
>
>
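
The 100-line peek in the quoted message can also be done without 
readLines(): read.table() has an nrows argument that stops after the 
requested number of data rows. A sketch on a made-up local file (the 
column names S1 and S2 and the values are placeholders, not real data):

```r
# Write a small gzip-compressed, tab-separated file: 1 header + 500 data rows.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("sample\tS1\tS2",
             paste0("cg", 1:500, "\t", 0.1, "\t", 0.2)), con)
close(con)

# nrows counts data rows only, so 99 here corresponds to the first 100 lines
# of the file (header plus 99 observations).
head_only <- read.table(gzfile(tmp), header = TRUE, sep = "\t", nrows = 99)
dim(head_only)
```

Checking str() on such a subset is a cheap way to confirm the separator 
and column types before committing to the full read.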


