[R] Fwd: Reading very large text files into R

Thu Sep 29 17:14:51 CEST 2022

On Thu, 29 Sep 2022, Nick Wray writes:

> ---------- Forwarded message ---------
> From: Nick Wray <nickmwray using gmail.com>
> Date: Thu, 29 Sept 2022 at 15:32
> Subject: Re: [R] Reading very large text files into R
> To: Ben Tupper <btupper using bigelow.org>
>
>
> Hi Ben
> Below is an example of the text (the snippet is also attached as a text
> file). It is the "B", of which there are quite a few scattered throughout
> the text doc, that causes the error message on reading in (btw I don't
> need the "RAIN" column, the 1's after it, or the last four elements).
>
> 1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 226918, RAIN, 1, 1, WAHRAIN, 5124, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 228562, RAIN, 1, 1, WAHRAIN, 491, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 231581, RAIN, 1, 1, WAHRAIN, 5213, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 232671, RAIN, 1, 1, WAHRAIN, 487, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 232913, RAIN, 1, 1, WAHRAIN, 5243, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B
> 1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 235389, RAIN, 1, 1, WAHRAIN, 5279, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 236466, RAIN, 1, 1, WAHRAIN, 497, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 243350, RAIN, 1, 1, SREW, 484, 1001, 0, , 9, 0, , ,
> 1980-01-01 10:00, 243350, RAIN, 1, 1, WAHRAIN, 484, 1001, 0, 0, 9, 9, , ,
>
> Thanks Nick
>
> On Thu, 29 Sept 2022 at 15:12, Ben Tupper <btupper using bigelow.org> wrote:
>
>> Hi Nick,
>>
>> It's hard to know without seeing at least a snippet of the data.
>> Could you do the following and paste the result into a plain text
>> email?  If you don't set your email client to plain text (from rich
>> text or html) then we are apt to see a jumble of output on our email
>> clients.
>>
>>
>> ## start
>> x <- readLines(filename, n = 20)
>> cat(x, sep = "\n")
>> ## end
>>
>> Cheers,
>> Ben
>>
>>
>> On Thu, Sep 29, 2022 at 9:54 AM Nick Wray <nickmwray using gmail.com> wrote:
>> >
>> > Hello. I may be offending the R purists with this question, but it is
>> > linked to R, as will become clear. I have very large data sets from the
>> > UK Met Office in Notepad (plain text) form. Unfortunately, I can’t read
>> > them directly into R because, although most lines in the text doc
>> > consist of 15 elements, every so often there is a sixteenth one. R
>> > gives me an error message because it has assumed that every line has
>> > 15 elements and then finds one with more. I have tried playing around
>> > with the text document, inserting an extra element into the top line
>> > etc., but to no avail.
>> >
>> > Also unfortunately you need access permission from the Met Office to get
>> > the files in question so this link probably won’t work:
>> >
>> > https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>> >
>> > So what I have done is simply to copy and paste the text docs into an
>> > Excel csv and then read them in, which is time-consuming but works.
>> > However, the later datasets are over the Excel limit of 1048576 lines.
>> > I can paste in the first 1048576 lines, but isolating the remainder of
>> > the text doc to paste it into a second csv doc is proving very
>> > difficult; the only way I have found is to scroll down by hand, and
>> > that’s taking ages. I cannot find another way of editing the Notepad
>> > text doc to get rid of the part which I have already copied and pasted.
>> >
>> > Can anyone help with a) ideally being able to simply read the text
>> > tables into R, or b) suggest a way of editing out the bits of the text
>> > file I have already pasted in without laborious scrolling?
>> >
>> > Thanks Nick Wray
>> >

[...]

>>
>> --
>> Ben Tupper (he/him)
>> Bigelow Laboratory for Ocean Science
>> East Boothbay, Maine
>> http://www.bigelow.org/
>> https://eco.bigelow.org
>>
>

Maybe I have missed it, but could you please show how
you tried to read the table?

When I use your file with 

    read.table("sample text.txt", header = FALSE, sep = ",")

I get

    ##                  V1     V2    V3 V4 V5       V6   V7   V8 V9 V10   V11 V12 V13 V14 V15
    ## 1  1980-01-01 10:00 225620  RAIN  1  1  WAHRAIN 5091 1001  0  NA     9   0  NA  NA    
    ## 2  1980-01-01 10:00 226918  RAIN  1  1  WAHRAIN 5124 1001  0  NA     9   0  NA  NA    
    ## .....
    ## 7  1980-01-01 10:00 234362  RAIN  1  1  WAHRAIN 5265 1001  0  NA 10009   0  NA  NA   B
    ## 8  1980-01-01 10:00 234682  RAIN  1  1  WAHRAIN 5271 1001  0  NA     9   0  NA  NA    
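
If the full file still fails, here is a minimal sketch of how one might
check it. read.table() decides the number of columns from the first five
lines of input, so a longer line further down can trigger an error;
count.fields() shows how many fields each line has. The three rows are
copied from the sample above. The value max_cols = 16 is only an assumption
taken from the description of "a sixteenth" element, and the temporary file
stands in for the real data file.

```r
## Three rows copied from the sample in this thread, written to a temp file
rows <- c(
  "1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  "1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B",
  "1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,"
)
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

## Diagnose: number of comma-separated fields on each line
n_fields <- count.fields(tmp, sep = ",")
print(table(n_fields))

## Plain read, as in the call above (strip.white drops the padding blanks)
df <- read.table(tmp, header = FALSE, sep = ",", strip.white = TRUE)

## Assumption: no line in the full file has more than 16 fields.
## Naming that many columns and setting fill = TRUE pads shorter lines with NA.
max_cols <- 16
df_wide <- read.table(tmp, header = FALSE, sep = ",", strip.white = TRUE,
                      fill = TRUE,
                      col.names = paste0("V", seq_len(max_cols)))
str(df_wide)
```

On the real file, replace tmp with the file name and set max_cols to
max(count.fields(filename, sep = ",")); unwanted columns can then be
dropped by index.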

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net