[R] [External] Fwd: Reading very large text files into R

Richard M. Heiberger rmh @end|ng |rom temp|e@edu
Thu Sep 29 19:27:03 CEST 2022


I think you need the
  fill=TRUE
argument to read.table(). See
  ?read.table
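
A minimal sketch of that suggestion, using made-up rows rather than the Met Office file. The names txt, n_fields and dat are illustrative only. read.table() sizes the table from the first five lines of input, so the sketch also passes col.names sized from count.fields(); a longer row further down then fits into one row instead of being wrapped.

```r
# Hypothetical input: five rows with 3 fields, then one row with 4.
txt <- c(
  "a, 1, x",
  "b, 2, y",
  "c, 3, z",
  "d, 4, w",
  "e, 5, v",
  "f, 6, u, EXTRA"
)

# count.fields() reports how many fields each line has.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",")
close(con)

# fill = TRUE adds blank fields to short rows; col.names makes
# read.table() allocate as many columns as the longest row needs.
dat <- read.table(
  text = txt,
  header = FALSE,
  sep = ",",
  fill = TRUE,
  strip.white = TRUE,
  col.names = paste0("V", seq_len(max(n_fields)))
)

str(dat)
```

For a file on disk, the file name would go in place of text = txt, and count.fields() can be given the file name directly.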

> On Sep 29, 2022, at 11:14, Enrico Schumann <es using enricoschumann.net> wrote:
> 
> On Thu, 29 Sep 2022, Nick Wray writes:
> 
>> ---------- Forwarded message ---------
>> From: Nick Wray <nickmwray using gmail.com>
>> Date: Thu, 29 Sept 2022 at 15:32
>> Subject: Re: [R] Reading very large text files into R
>> To: Ben Tupper <btupper using bigelow.org>
>> 
>> 
>> Hi Ben
>> Below is an example of the text (also attached as a text file). It's the
>> "B", of which there are quite a few scattered throughout the text doc,
>> that causes the error message on reading in (btw I don't need the "RAIN"
>> column or the 1's after it or the last four elements).
>> 
>> 1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 226918, RAIN, 1, 1, WAHRAIN, 5124, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 228562, RAIN, 1, 1, WAHRAIN, 491, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 231581, RAIN, 1, 1, WAHRAIN, 5213, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 232671, RAIN, 1, 1, WAHRAIN, 487, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 232913, RAIN, 1, 1, WAHRAIN, 5243, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B
>> 1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 235389, RAIN, 1, 1, WAHRAIN, 5279, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 236466, RAIN, 1, 1, WAHRAIN, 497, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 243350, RAIN, 1, 1, SREW, 484, 1001, 0, , 9, 0, , ,
>> 1980-01-01 10:00, 243350, RAIN, 1, 1, WAHRAIN, 484, 1001, 0, 0, 9, 9, , ,
>> 
>> Thanks Nick
>> 
>> On Thu, 29 Sept 2022 at 15:12, Ben Tupper <btupper using bigelow.org> wrote:
>> 
>>> Hi Nick,
>>> 
>>> It's hard to know without seeing at least a snippet of the data.
>>> Could you do the following and paste the result into a plain text
>>> email? If you don't set your email client to plain text (from rich
>>> text or html) then we are apt to see a jumble of output on our email
>>> clients.
>>> 
>>> 
>>> ## start
>>> x <- readLines(filename, n = 20)
>>> cat(x, sep = "\n")
>>> ## end
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>> On Thu, Sep 29, 2022 at 9:54 AM Nick Wray <nickmwray using gmail.com> wrote:
>>>> 
>>>> Hello I may be offending the R purists with this question but it is
>>>> linked to R, as will become clear. I have very large data sets from the
>>>> UK Met Office in notepad form. Unfortunately, I can’t read them directly
>>>> into R because, for some reason, although most lines in the text doc
>>>> consist of 15 elements, every so often there is a sixteenth one and R
>>>> doesn’t like this and gives me an error message because it has assumed
>>>> that every line has 15 elements and doesn’t like finding one with more.
>>>> I have tried playing around with the text document, inserting an extra
>>>> element into the top line etc, but to no avail.
>>>> 
>>>> Also unfortunately you need access permission from the Met Office to get
>>>> the files in question so this link probably won’t work:
>>>> 
>>>> https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>>>> 
>>>> So what I have done is simply to copy and paste the text docs into excel
>>>> csv and then read them in, which is time-consuming but works. However
>>>> the later datasets are over the excel limit of 1048576 lines. I can paste
>>>> in the first 1048576 lines but then trying to isolate the remainder of
>>>> the text doc to paste it into a second csv doc is proving v difficult –
>>>> the only way I have found is to scroll down by hand and that’s taking
>>>> ages. I cannot find another way of editing the notepad text doc to get
>>>> rid of the part which I have already copied and pasted.
>>>> 
>>>> Can anyone help with a) ideally being able to simply read the text tables
>>>> into R or b) suggest a way of editing out the bits of the text file I
>>>> have already pasted in without laborious scrolling?
>>>> 
>>>> Thanks Nick Wray
>>>> 
> 
> [...]
> 
>>> 
>>> --
>>> Ben Tupper (he/him)
>>> Bigelow Laboratory for Ocean Science
>>> East Boothbay, Maine
>>> http://www.bigelow.org/
>>> https://eco.bigelow.org/
>>> 
>> 
> 
> Maybe I have missed it, but could you please show how
> you tried to read the table?
> 
> When I use your file with 
> 
> read.table("sample text.txt", header = FALSE, sep = ",")
> 
> I get
> 
> ##                 V1     V2   V3 V4 V5      V6   V7   V8 V9 V10   V11 V12 V13 V14 V15
> ## 1 1980-01-01 10:00 225620 RAIN  1  1 WAHRAIN 5091 1001  0  NA     9   0  NA  NA
> ## 2 1980-01-01 10:00 226918 RAIN  1  1 WAHRAIN 5124 1001  0  NA     9   0  NA  NA
> ## ## .....
> ## 7 1980-01-01 10:00 234362 RAIN  1  1 WAHRAIN 5265 1001  0  NA 10009   0  NA  NA   B
> ## 8 1980-01-01 10:00 234682 RAIN  1  1 WAHRAIN 5271 1001  0  NA     9   0  NA  NA
> 
> 
> 
> -- 
> Enrico Schumann
> Lucerne, Switzerland
> http://enricoschumann.net/
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
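
On the remark in the quoted message about not needing the "RAIN" column, the 1's after it or the last four elements: a sketch using the colClasses argument of read.table(), where "NULL" skips a column. The two rows are copied from the sample above. Which positions count as unwanted (3 to 5 and 12 to 15 here) is an assumption read off that sample, and the names sample_rows, classes and dat are illustrative.

```r
# Two rows from the quoted sample, each with 15 comma-separated fields.
sample_rows <- c(
  "1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  "1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B"
)

# NA lets read.table() guess a column's type; "NULL" drops the column.
# Assumption: fields 3-5 and 12-15 are the ones that are not needed.
classes <- rep(NA_character_, 15)
classes[c(3:5, 12:15)] <- "NULL"

dat <- read.table(
  text = sample_rows,
  header = FALSE,
  sep = ",",
  strip.white = TRUE,
  colClasses = classes
)

str(dat)
```

Skipped columns are never stored, which also reduces memory use on large files.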


