[R] Huge Dataset Dates Span two Lines

Collin Lynch cflynch at ncsu.edu
Thu Jan 8 20:37:43 CET 2015


You might consider using something other than R to clean the file, and even
to load it.  I regularly use Python to preprocess data for R and often feed
it to R directly via the rpy2 interface.  If the dates are delimited by
some feature (e.g. ") you could potentially use the Python csv library to
load the file directly and then either send the result to R or dump it in a
clean form.  Alternatively, you could use a simple script to iterate over the
file and remove the newlines in those cases without doing any other
processing.  A regular
expression of the form:

"([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})( )+(\n?)([0-9]{2,2}:[0-9]{2,2}:[0-9]{2,2})"

should match the date/time strings even with an embedded newline.  You
could use this to detect those cases in the file and then replace the
newline with a single space.
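
For a file of many gigabytes it may be easier to apply the same idea one line
at a time rather than running a regular expression over the whole file in
memory.  The following is a minimal sketch, not a tested solution for your
data.  It assumes the date is the last thing on the broken line and the time
is the first thing on the next one.  The names rejoin_split_dates and
clean_file are made up for illustration, and the patterns are a line-based
variant of the expression above, not the same expression.

```python
import re

# Assumption: a broken record ends with a bare date (optionally followed by
# spaces or tabs), and the next physical line starts with the time.
DATE_AT_END = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}[ \t]*$")
TIME_AT_START = re.compile(r"[ \t]*[0-9]{2}:[0-9]{2}:[0-9]{2}")


def rejoin_split_dates(lines):
    """Yield lines without their line endings, gluing a trailing
    'YYYY-MM-DD' to a following 'HH:MM:SS' with a single space."""
    pending = None  # a line that ended in a bare date, awaiting its time
    for raw in lines:
        line = raw.rstrip("\r\n")
        if pending is not None:
            if TIME_AT_START.match(line):
                # The date really was split: join the two halves.
                line = pending.rstrip(" \t") + " " + line.lstrip(" \t")
            else:
                # False alarm: the date was simply the last field.
                yield pending
            pending = None
        if DATE_AT_END.search(line):
            pending = line  # hold it until the next line has been seen
        else:
            yield line
    if pending is not None:
        yield pending


def clean_file(src_path, dst_path):
    """Stream src_path to dst_path, repairing split date/time fields.
    Both paths are hypothetical and supplied by the caller."""
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        for line in rejoin_split_dates(src):
            dst.write(line + "\n")
```

Only one line is held in memory at a time, so the size of the file should
not matter.  The cleaned file could then be read in R as an ordinary
asterisk-delimited table.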

On Thu, Jan 8, 2015 at 1:20 PM, DVL <daniel.van.lunen at state.ma.us> wrote:

> I'm trying to import a multi-gigabyte .txt file to analyze. It is asterisk
> delimited. I'm having an issue with the date field in the dataset. In the
> first 165 lines dates are listed as:
> YYYY-MM-DD HH:MM:SS
>
> Then on the 166th line and in other places the date spans two lines:
> YYYY-MM-DD
> HH:MM:SS
>
> This causes a problem because R thinks it has reached the end of a row in
> the table. How can I solve this?
