[R] Reading very large text files into R

Fri Sep 30 21:39:53 CEST 2022

Hello, and thanks again for all the suggestions.  The irony is that for the
datasets I'm using, fill=TRUE, as Ivan suggested in the first instance,
seems to work fine.  They're not particularly sophisticated datasets, and
although I don't know what the extra Bs mean (the first one, as Avi says,
occurs quite late on), I don't really need to know: all I need is the
date/time/station id/rainfall accumulation, and that's obvious once I've
read the dataset in.  It has been interesting, however, to see the takes of
people who have a far deeper and wider understanding of R than I do, and
an education in itself...

Nick

On Fri, 30 Sept 2022 at 20:16, <avi.e.gross using gmail.com> wrote:

> Tim and others,
>
> A point to consider is that there are various algorithms in the functions
> used to read in formatted data into data.frame form and they vary. Some do
> a look-ahead of some size to determine things and if they find a column
> that LOOKS LIKE all integers for say the first thousand lines, they go and
> read in that column as integer. If the first floating point value is
> thousands of lines further along, things may go wrong.
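One way to sidestep type guessing altogether is to declare the column types up front. The following is only a minimal sketch: the column names `id` and `rain` and the four lines of data are made up for illustration, and `colClasses` is the standard argument of `read.table()`/`read.csv()`.

```r
# Hypothetical data: the second column looks like integers until the last row.
txt <- c("id,rain", "1,0", "2,3", "3,1.5")

# Declaring the types explicitly means nothing depends on what the first
# rows happen to look like.
dat <- read.csv(text = txt, colClasses = c("integer", "numeric"))
str(dat)
```

With explicit types, a value that does not fit the declared type should surface as an error at read time rather than as a silently mis-typed column.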
>
> So asking for line/row 16 to have an extra 16th entry/column may work fine
> for an algorithm that looks ahead and concludes there are 16 columns
> throughout. Yet a file where the first time a sixteenth entry is seen is at
> line/row 31,459 may well just set the algorithm to expect exactly 15
> columns and then be surprised as noted above.
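For base R specifically, read.table() is documented as working out the number of columns from the first five lines of input unless `col.names` is longer. A rough sketch of one way around that is to scan the whole file with count.fields() first. The file below is invented for illustration (20 lines, with a stray 16th field on line 18) and is not the actual rainfall data.

```r
# Hypothetical file: 15 comma-separated fields per line, except one late
# line that carries an extra 16th field.
tmp <- tempfile(fileext = ".txt")
rows <- vapply(1:20, function(i) paste(rep(i, 15), collapse = ","),
               character(1))
rows[18] <- paste0(rows[18], ",B")
writeLines(rows, tmp)

# count.fields() scans every line, not just the first few.
n_fields <- count.fields(tmp, sep = ",")
max_cols <- max(n_fields)
which(n_fields == max_cols)  # line numbers that carry the extra field

# Supplying max_cols column names fixes the width of the data frame;
# fill = TRUE then adds blank fields to the shorter rows.
dat <- read.table(tmp, sep = ",", fill = TRUE,
                  col.names = paste0("V", seq_len(max_cols)))
dim(dat)
```

The same `n_fields` vector also answers the question of which rows are the odd ones out, which may be worth knowing before deciding whether to keep the extra column.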
>
> I have stayed out of this discussion and others have supplied pretty much
> what I would have said. I also see the data as flawed and ask which rows
> are the valid ones. If a sixteenth column is allowed, it would be better
> if all other rows had an empty sixteenth column. If not allowed, none
> should have it.
>
> The approach I might take, again as others have noted, is to preprocess the
> data file using some form of stream editor such as AWK that automagically
> reads in a line at a time and parses lines into a collection of tokens
> based on what separates them such as a comma. You can then either write
> out just the first 15 to the output stream if your choice is to ignore a
> spurious sixteenth, or write out all sixteen for every line, with the last
> being some form of null most of the time. And, of course, to be more
> general, you could make two passes through the file with the first one
> determining the maximum number of entries as well as what the most common
> number of entries is, and a second pass using that info to normalize the
> file the way you want. And note some of what was mentioned could often be
> done in this preprocessing such as removing any columns you do not want to
> read into R later. Do note such filters may need to handle edge cases like
> skipping comment lines or treating the row of headers differently.
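The two-pass idea can be sketched in R rather than AWK. This is a simplified illustration only: `normalize_fields()` is a hypothetical helper, not an existing function, it works on lines already held in memory rather than streaming a file, and it does not handle header rows, comment lines, or quoted separators.

```r
# Hypothetical helper: pad or truncate every line to `width` fields.
# Note: strsplit() drops a trailing empty field, so "a,b," counts as 2.
normalize_fields <- function(lines, sep = ",", width = NULL) {
  parts <- strsplit(lines, sep, fixed = TRUE)

  # Pass 1: how many fields at most, and how many most commonly?
  counts <- lengths(parts)
  max_n <- max(counts)
  common_n <- as.integer(names(which.max(table(counts))))
  if (is.null(width)) width <- max_n

  # Pass 2: force every line to exactly `width` fields.
  fixed <- vapply(parts, function(p) {
    length(p) <- width   # truncates, or pads with NA
    p[is.na(p)] <- ""    # padded entries become empty fields
    paste(p, collapse = sep)
  }, character(1))

  list(lines = fixed, max = max_n, most_common = common_n)
}

lines <- c("a,b,c", "d,e,f,B", "g,h,i")
res <- normalize_fields(lines, width = 3)  # drop the spurious 4th field
res$lines
padded <- normalize_fields(lines)          # pad every line to 4 fields
padded$lines
```

Whether to truncate to the most common width or pad to the maximum is the judgment call described above; the helper returns both counts so the choice can be made after the first pass.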
>
> As some have shown, you can create your own filters within a language like
> R too and either read in lines and pre-process them as discussed or
> continue on to making your own data.frame and skip the read.table() type
> of functionality. For very large files, though, having multiple variations
> in memory at once may be an issue, especially if they are not removed and
> further processing and analysis continues.
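To keep memory use down when filtering inside R, the file can be read through a connection a chunk at a time, so only one chunk is ever held in memory. The sketch below assumes a plain comma-separated file with no quoted commas; `filter_in_chunks()` and its arguments are hypothetical names chosen for illustration.

```r
# Hypothetical helper: copy `path` to `out`, keeping only the first `keep`
# fields of each line and reading `chunk_size` lines at a time.
filter_in_chunks <- function(path, out, keep = 15, sep = ",",
                             chunk_size = 10000) {
  con_in <- file(path, open = "r")
  con_out <- file(out, open = "w")
  on.exit({
    close(con_in)
    close(con_out)
  })

  repeat {
    # Reads the next chunk; returns character(0) at end of file.
    chunk <- readLines(con_in, n = chunk_size)
    if (length(chunk) == 0) break

    parts <- strsplit(chunk, sep, fixed = TRUE)
    trimmed <- vapply(parts,
                      function(p) paste(head(p, keep), collapse = sep),
                      character(1))
    writeLines(trimmed, con_out)
  }
  invisible(out)
}

# Small made-up example: keep 3 fields, read 2 lines per chunk.
src <- tempfile()
dst <- tempfile()
writeLines(c("1,2,3", "4,5,6,B", "7,8,9"), src)
filter_in_chunks(src, dst, keep = 3, chunk_size = 2)
readLines(dst)
```

The cleaned file can then be passed to read.table() or read.csv() as usual, and the intermediate objects inside the function are released when it returns.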
>
> Perhaps it might be sensible to contact those maintaining the data, point
> out the anomaly, and ask if their files might alternatively be saved in a
> format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Ebert,Timothy Aaron
> Sent: Friday, September 30, 2022 7:27 AM
> To: Richard O'Keefe <raoknz using gmail.com>; Nick Wray <nickmwray using gmail.com>
> Cc: r-help using r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Hi Nick,
>    Can you post one line of data with 15 entries followed by the next line
> of data with 16 entries?
>
> Tim
>
