[R] Reading very large text files into R

Fri Sep 30 13:26:47 CEST 2022

Hi Nick,
   Can you post one line of data with 15 entries followed by the next line of data with 16 entries? 
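
Something along these lines might pull out such a pair without any scrolling. It is only a sketch: I am assuming the fields are comma-separated, and `path` here points at a small made-up file standing in for yours, so change the separator and the path to match the real data.

```r
# Hypothetical stand-in for the real file: made-up comma-separated rows,
# 15 fields each except the third row, which has 16.
path <- tempfile(fileext = ".txt")
rows <- vapply(1:5, function(i) {
  n <- if (i == 3) 16 else 15
  paste(rep(i, n), collapse = ",")
}, character(1))
writeLines(rows, path)

# count.fields() reports the number of fields on each line.
# blank.lines.skip = FALSE keeps positions aligned with readLines().
n_fields <- count.fields(path, sep = ",", quote = "", comment.char = "",
                         blank.lines.skip = FALSE)

# Position of the first 16-field line that directly follows a 15-field line.
hit <- which(n_fields[-1] == 16 & n_fields[-length(n_fields)] == 15)[1] + 1

# Read only as far as needed and keep the two raw lines to post.
pair <- readLines(path, n = hit)[c(hit - 1, hit)]
print(pair)
```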

Tim

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Richard O'Keefe
Sent: Friday, September 30, 2022 12:08 AM
To: Nick Wray <nickmwray using gmail.com>
Cc: r-help using r-project.org
Subject: Re: [R] Reading very large text files into R

If I had this problem, in the old days I'd've whipped up a tiny AWK script.  These days I might use xsv or qsv.
BUT
first I would want to know why these extra fields are present and what they signify.  Are they good data that happen not to be described in the documentation?  Do they represent a defect in the generation process?  What other discrepancies are there?  If the data *format* cannot be fully trusted, what does that say about the data *content*?  Do other data sets from the same source have the same issue?  Is it possible to compare this version of the data with an earlier version?
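
Some of those questions can be explored from inside R before anything is "fixed". What follows is a rough sketch and not a recipe: the comma separator, the temporary file and its made-up rows are my assumptions standing in for the Met Office file. It tabulates how many lines have how many fields, then reads everything with room for the widest line so the rows carrying the extra field can be inspected.

```r
# Hypothetical stand-in for the real file: made-up comma-separated rows.
path <- tempfile(fileext = ".txt")
writeLines(c(
  paste(1:15, collapse = ","),
  paste(1:16, collapse = ","),   # the odd line out
  paste(1:15, collapse = ",")
), path)

# How many lines have how many fields?
n_fields <- count.fields(path, sep = ",", quote = "", comment.char = "")
field_counts <- table(n_fields)
print(field_counts)

# Read with one column per field of the widest line; fill = TRUE pads
# the shorter lines instead of stopping with an error.
dat <- read.table(path, sep = ",", header = FALSE, quote = "",
                  comment.char = "", fill = TRUE,
                  col.names = paste0("V", seq_len(max(n_fields))))

# The rows that carry the extra field, for a closer look.
extra <- dat[n_fields == max(n_fields), ]
print(extra)
```

Supplying `col.names` explicitly matters here, because otherwise read.table() decides the number of columns from the first few lines only.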

On Fri, 30 Sept 2022 at 02:54, Nick Wray <nickmwray using gmail.com> wrote:

> Hello, I may be offending the R purists with this question, but it is
> linked to R, as will become clear.  I have very large data sets from
> the UK Met Office as plain text (Notepad) files.  Unfortunately, I can't
> read them directly into R: although most lines in the text file have
> 15 elements, every so often a line has a sixteenth, and R gives me an
> error message because it has assumed that every line has 15 elements.
> I have tried editing the text file, e.g. inserting an extra element into
> the top line, but to no avail.
>
> Also unfortunately you need access permission from the Met Office to 
> get the files in question so this link probably won't work:
>
> https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1
>
> So what I have done is simply to copy and paste the text files into
> Excel csv files and then read those in, which is time-consuming but works.
> However, the later datasets are over the Excel limit of 1048576 rows.
> I can paste in the first 1048576 lines, but isolating the remainder of
> the text file to paste it into a second csv file is proving very
> difficult.  The only way I have found is to scroll down by hand, and
> that's taking ages.  I cannot find another way of editing the text file
> to get rid of the part which I have already copied and pasted.
>
> Can anyone help by a) ideally, showing how to read the text tables
> directly into R, or b) suggesting a way of editing out the parts of the
> text file I have already pasted in, without laborious scrolling?
>
> Thanks Nick Wray
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.r-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
