[R] Problems with time formats when importing data using readHTMLTable

Cristina Silva csilva at ipma.pt
Tue Jul 11 19:00:20 CEST 2017


Dear David,

Thank you very much. You have solved my problem, not of importing in 
the right format but the date conversion. It works for me, and if I 
specify the timezone tz='GMT', I get exactly the same time of the 
webpage.

Sincerely yours,

Cristina

On Tue, 11 Jul 2017 07:15:43 -0700, David Winsemius wrote:
>> On Jul 11, 2017, at 6:25 AM, Cristina Silva <csilva at ipma.pt> wrote:
>>
>> Dear Jeff,
>>
>> I am sorry, I didn't notice that it was not plain text. I hope that 
>> it is now in the correct format. I explain the problem again, now with 
>> more detais.
>> I am collecting the track positions of our research vessel from 
>> www.marinetraffic.com.  In the page, the data appear in a table:
>>
>> Timestamp     Source     Speed (kn)     Latitude (°) Longitude (°)   
>>  Course (°)     Show on Map
>> 2017-07-11 10:57 (UTC)    Terr-AIS     8.6    37.36228 -9.176811    
>> 200
>> 2017-07-11 10:54 (UTC)    Terr-AIS     8.4    37.36875 -9.174048    
>> 200
>> 2017-07-11 09:50 (UTC)    Terr-AIS     8    37.51499 -9.184502    
>> 149
>> 2017-07-11 09:08 (UTC)    Terr-AIS     7.7    37.60513 -9.228263    
>> 169
>> 2017-07-11 09:05 (UTC)    Terr-AIS     8    37.61168 -9.229627    
>> 167
>> 2017-07-11 09:03 (UTC)    Terr-AIS     6.9    37.61626 -9.230456    
>> 132
>> 2017-07-11 08:59 (UTC)    Terr-AIS     3.4    37.61758 -9.237928    
>> 101
>> 2017-07-11 08:57 (UTC)    Terr-AIS     3.6    37.61808 -9.240235    
>> 116
>> 2017-07-11 08:47 (UTC)    Terr-AIS     3.3    37.62267 -9.249762    
>> 112
>> 2017-07-11 08:44 (UTC)    Terr-AIS     3.2    37.6241 -9.25348    
>> 115
>> 2017-07-11 08:40 (UTC)    Terr-AIS     3.3    37.62557 -9.257295    
>> 112
>> 2017-07-11 08:32 (UTC)    Terr-AIS     3.4    37.62903 -9.266028    
>> 111
>> 2017-07-11 08:28 (UTC)    Terr-AIS     3.3    37.63051 -9.269725    
>> 114
>> 2017-07-11 08:25 (UTC)    Terr-AIS     3.3    37.63207 -9.273642    
>> 120
>> 2017-07-11 08:20 (UTC)    Terr-AIS     3.6    37.63395 -9.278254    
>> 118
>> 2017-07-11 08:16 (UTC)    Terr-AIS     3.3    37.63586 -9.282853    
>> 116
>> 2017-07-11 08:08 (UTC)    Terr-AIS     3.2    37.6392 -9.291157    
>> 118
>> 2017-07-11 07:38 (UTC)    Sat-AIS
>> 2017-07-11 06:58 (UTC)    Terr-AIS     3    37.68459 -9.321658    
>> 159
>> 2017-07-11 06:53 (UTC)    Terr-AIS     3.3    37.68881 -9.324172    
>> 150
>>
>> To get this table I had to copy manually from the webpage, paste in 
>> a text file, delete information on figures, links, etc. and then 
>> import to R. This works well, I read the csv file and have no problems 
>> with date formats. But instead of 20 records, I have to import e.g. 
>> 500 records, i.e. 10 pages of 50 records each, or more. I want to 
>> import these data in a more automated and quick procedure. I have done 
>> this before with no problems, but probably the webpage changed some 
>> permissions.
>>
>> Here is the script used to get the data directly from the webpage 
>> into R, for a sample of 20 records (10 per page):
>>
>> library(XML)
>> x <- list()
>> for (i in 1:2)
>> {
>>    x[i]<- 
>> readHTMLTable(paste('http://www.marinetraffic.com/en/ais/index/positions/all/shipid:318358/mmsi:263601000/shipname:NORUEGA/per_page:10/page:', 
>> i, sep=''))
>> }
>>
>> ais <- do.call('rbind', x)
>> ais <- ais[,-7]
>>
>> and I got the following table:
>>
>>> ais
>>                                     Timestamp Source                 
>>        Speed (kn) Latitude (°) Longitude (°) Course (°)
>> 1            149977066014997706604 minutes ago Terr-AIS              
>>                  8.6      37.36228 -9.176811         200
>
> There's undoubtedly a way to extract the time info from this HTML
> code (but I'm not the one to provide that solution.) This is the 
> first
> entry in the webpage source when viewed right now:
>
> <td>
> <time class="toggle_time utc_time hide-me formatTime"
> data-overrideAbbreviation="empty" data-time="1499780737"
> data-timezone="0"><span>1499780737</span></time><time
> class="toggle_time my_time hide-me formatTime"
> data-overrideAbbreviation="MyT" data-time="1499780737"
> data-timezone="MyT"><span>1499780737</span></time><time
> class="toggle_time dif_time hide"><span>6 minutes
> ago</span></time>	</td>
>
> So the POSIX-style time is givne twice (with no separator and then an
> offset in minutes is calculated. If I were given the records as you
> have them now I would extract the leading 10 digits and convert to
> POSIXct
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","", Timestamp)), 
> origin="1970-01-01")
>
> as.POSIXct(as.numeric(gsub("(^.{10}).+","\\1", ais$Timestamp)),
> origin="1970-01-01")
>  [1] "2017-07-11 07:09:57 PDT" "2017-07-11 06:45:37 PDT" "2017-07-11
> 06:40:34 PDT" "2017-07-11 06:34:46 PDT" "2017-07-11 06:32:26 PDT"
>  [6] "2017-07-11 06:29:34 PDT" "2017-07-11 06:26:37 PDT" "2017-07-11
> 06:20:24 PDT" "2017-07-11 06:17:18 PDT" "2017-07-11 06:12:37 PDT"
> [11] "2017-07-11 06:08:24 PDT" "2017-07-11 06:05:04 PDT" "2017-07-11
> 06:03:27 PDT" "2017-07-11 05:59:37 PDT" "2017-07-11 05:54:37 PDT"
> [16] "2017-07-11 05:50:48 PDT" "2017-07-11 05:44:53 PDT" "2017-07-11
> 05:29:17 PDT" "2017-07-11 05:26:44 PDT" "2017-07-11 04:30:08 PDT"
>
> -- David.
>
>
>
>> 2            149977048714997704876 minutes ago Terr-AIS              
>>                  8.4      37.36875 -9.174048         200
>> 3   149976661414997666141 hour, 11 minutes ago Terr-AIS              
>>                  8.0      37.51499 -9.184502         149
>> 4   149976410714997641071 hour, 53 minutes ago Terr-AIS              
>>                  7.7      37.60513 -9.228263         169
>> 5   149976392714997639271 hour, 56 minutes ago Terr-AIS              
>>                  8.0      37.61168 -9.229627         167
>> 6   149976378014997637801 hour, 58 minutes ago Terr-AIS              
>>                  6.9      37.61626 -9.230456         132
>> 7   149976354014997635402 hours, 2 minutes ago Terr-AIS              
>>                  3.4      37.61758 -9.237928         101
>> 8   149976342014997634202 hours, 4 minutes ago Terr-AIS              
>>                  3.6      37.61808 -9.240235         116
>> 9  149976286114997628612 hours, 14 minutes ago Terr-AIS              
>>                  3.3      37.62267 -9.249762         112
>> 10 149976264714997626472 hours, 17 minutes ago Terr-AIS              
>>                  3.2       37.6241 -9.25348         115
>> 11 149976243014997624302 hours, 21 minutes ago Terr-AIS              
>>                  3.3      37.62557 -9.257295         112
>> 12 149976193714997619372 hours, 29 minutes ago Terr-AIS              
>>                  3.4      37.62903 -9.266028         111
>> 13 149976172814997617282 hours, 32 minutes ago Terr-AIS              
>>                  3.3      37.63051 -9.269725         114
>> 14 149976150714997615072 hours, 36 minutes ago Terr-AIS              
>>                  3.3      37.63207 -9.273642         120
>> 15 149976124714997612472 hours, 40 minutes ago Terr-AIS              
>>                  3.6      37.63395 -9.278254         118
>> 16 149976098714997609872 hours, 45 minutes ago Terr-AIS              
>>                  3.3      37.63586 -9.282853         116
>> 17 149976051014997605102 hours, 53 minutes ago Terr-AIS              
>>                  3.2       37.6392 -9.291157         118
>> 18 149975870414997587043 hours, 23 minutes ago  Sat-AIS Add to SAT 
>> Fleet for undelayed data          <NA> <NA>        <NA>
>> 19  149975631414997563144 hours, 3 minutes ago Terr-AIS              
>>                  3.0      37.68459 -9.321658         159
>> 20  149975598814997559884 hours, 8 minutes ago Terr-AIS              
>>                  3.3      37.68881 -9.324172         150
>>
>> The records correspond exactly to the first table. I have tried to 
>> change the colClasses to character, but the results are always the 
>> same. My question is: how to change the Timestamp to a POSIXlt POSIXct 
>> format, either in the importing phase or within R?
>>
>> I hope that the question is clear now.
>>
>> Cristina
>>
>>
>> On Mon, 10 Jul 2017 09:18:05 -0700, Jeff Newmiller wrote:
>>> Not reproducible. [1][2][3] If our answers don't seem to apply to
>>> your situation, it will likely be because you did not explain your
>>> question clearly.
>>>
>>> Not plain text. This is a plain text mailing list,  and the 
>>> best-case
>>> scenario when you let your email program send HTML is that what you
>>> saw is not what we see (worst case is your email is scrambled on 
>>> our
>>> end).
>>>
>>> Have you read the documentation for the function you are using? In
>>> particular, what about the colClasses argument? If you don't let
>>> readHTMLTable guess what the format is (have it read in as 
>>> character
>>> data) then you have a fighting chance to get it right yourself, 
>>> e.g.
>>>
>>> as.POSIXct( "2017-07-10 14:04 (UTC)", format="%Y-%m-%d %H:%M 
>>> (UTC)",
>>> tz="UTC" )
>>>
>>> -----
>>>
>>> [1]
>>> 
>>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>>
>>> [2] http://adv-r.had.co.nz/Reproducibility.html
>>>
>>> [3] https://cran.r-project.org/web/packages/reprex/index.html
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 10, 2017 8:31:30 AM PDT, Cristina Silva <csilva at ipma.pt> 
>>> wrote:
>>>> Hi,
>>>>
>>>> I am extracting positions data from the marine traffic website. 
>>>> The
>>>> table has a "Timestamp" column which, in the browser, appears with 
>>>> the
>>>> format yyyy-mm-dd HH:MM (UTC), e.g. 2017-07-10 14:04 (UTC).
>>>>
>>>> When I import the table, the same date "2017-07-10 14:04 (UTC)" 
>>>> appears
>>>>
>>>> as "1499696500149969650021 minutes ago", This is the more recent 
>>>> date
>>>> and time. Older records, as e.g. "2017-07-09 17:02 (UTC)" appear 
>>>> as
>>>> e.g.
>>>> "1499619726149961972621 hours, 59 minutes ago".
>>>>
>>>> I don't know how to convert these data to the time formats used in 
>>>> R
>>>> (POSIXct).
>>>>
>>>> The script is very simple and worked before:
>>>>
>>>> library(XML)
>>>> x <- readHTMLTable('url')
>>>>
>>>> where the 'url' is the link to the website with the specification 
>>>> of
>>>> the
>>>> vessel.
>>>>
>>>> I appreciate any help.
>>>>
>>>> Cristina
>>>>

>
> David Winsemius
> Alameda, CA, USA

-- 
Cristina Silva
Divisão de Modelação e Gestão de Recursos Pesqueiros
Avenida de Brasília
1449-006 Lisboa
@: csilva at ipma.pt
#: +351 213027096



More information about the R-help mailing list