[R] how to separate string from numbers in a large txt file

Michael Boulineau m|ch@e|@p@bou||ne@u @end|ng |rom gm@||@com
Thu May 16 21:30:13 CEST 2019


Thanks for this tip on etiquette, David. I will be sure not to do that again.

I tried read.fwf (it is in the utils package, not foreign), with code like this:

 d <- read.fwf("hangouts-conversation.txt",
                widths= c(10,10,20,40),
                col.names=c("date","time","person","comment"),
                strip.white=TRUE)

But it threw this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 6347 did not have 4 elements

Interestingly, though, the error only happened when I increased the
widths. But I had to increase them, or else I couldn't "see"
anything: the comment column was so narrow that almost none of the
text was being captured, so to speak.
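
A minimal sketch of why a single set of widths is hard to pick, assuming each record starts with a 10-character date, a space, an 8-character time, and another space (which is what the samples in this thread show). Only those two leading fields are fixed width; the rest of the line varies in length, so it may be easier to cut the prefix with substr() and leave the remainder whole. The vector x is made-up sample data in the same shape as the file.

```r
# Two sample lines in the format shown in this thread (illustrative data)
x <- c("2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
       "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting")

date <- substr(x, 1, 10)    # characters 1-10: the date
time <- substr(x, 12, 19)   # characters 12-19: the time
rest <- substring(x, 21)    # everything after the second space

# The remainder differs in length from line to line,
# so no fixed width for "person" or "comment" fits every record
nchar(rest)
```
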

It seems like what's throwing me is that there's no comma that
demarcates the end of the text proper. For example:

2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
lots of Starbucks in my day2016-07-01 15:35:47

It was interesting, too, when I pasted the text into the email, it
self-formatted into the way I wanted it to look. I had to manually
make it look like it does above, since that's the way that it looks in
the txt file. I wonder if it's being organized by XML or something.

Anyways, there's always a space between the two angle brackets, just
like there is right now: <John Doe> See. Space. And there's always a
space between the date and time. Like this. 2016-07-01 15:34:30 See.
Space. But there's never a space between the end of the comment and
the next date. Like this: We were in a starbucks2016-07-01 15:35:02
See. starbucks and 2016 are smooshed together.
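
The structure described above can be sketched in base R as follows. This is only an illustration under assumptions: the text really is one run-together string, every record starts with a timestamp of the form YYYY-MM-DD HH:MM:SS, and no comment contains something that itself looks like a timestamp. The names txt, stamp, records, and d are made up for the example.

```r
# Sample text with records run together, as in the file (illustrative data)
txt <- paste0(
  "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
  "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
  "2016-07-01 15:35:09 <Jane Doe> You must want coffees"
)

# Pattern for one timestamp
stamp <- "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Put a newline in front of every timestamp, then split on newlines
marked  <- gsub(paste0("(", stamp, ")"), "\n\\1", txt)
records <- strsplit(marked, "\n", fixed = TRUE)[[1]]
records <- records[nzchar(records)]   # drop the empty first piece

# Capture date, time, person, and comment from each record
pat <- "^([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) <([^>]*)> ?(.*)$"
m   <- regmatches(records, regexec(pat, records))

# Each element of m is c(full match, date, time, person, comment)
d <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                   stringsAsFactors = FALSE)
names(d) <- c("date", "time", "person", "comment")
d
```

With a real file, txt could come from something like paste(readLines("hangouts-conversation.txt"), collapse = " "), though records that fail to match pat would need to be checked before the rbind step.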

This code is also on the table right now.

library(stringr)  # provides str_extract()

a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
                quote="\"", comment.char="", fill=TRUE)

# columns 1 to 9 of the table read in above
h <- cbind(a[,1:2], a[,3:5], a[,6:9])

# gsub() and str_extract() work on character data, so convert the data frame first
aa <- gsub("[^[:digit:]]", "", as.matrix(h))
my.data.num <- as.numeric(str_extract(as.matrix(h), "[0-9]+"))

Those last lines are a work in progress. I wish I could attach a
picture of what it looks like when it's read into a data frame.
The fill=TRUE helped to get the data into a table that kind of sort of
works, but the comments keep bleeding into the date and time columns.
It's like

2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
over               there
2016-07-01 15:59:27 <Jane Doe> It confuses me :(

And then, maybe, the "Seriously" will be in a column all to itself, as
will the "I've" and the "never" etc.
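
One way to keep a comment in a single field is to read whole lines rather than whitespace-separated fields, and to glue any line that does not start with a timestamp onto the record before it. The sketch below assumes the wrapped "over there" line really is a separate line in the file (it may instead be an artifact of read.table). The names lines, starts, and merged are made up for the example.

```r
# Lines as they might come back from readLines() (illustrative data)
lines <- c("2016-07-01 15:59:17 <Jane Doe> Seriously I've never been",
           "over               there",
           "2016-07-01 15:59:27 <Jane Doe> It confuses me :(")

# TRUE where a line begins a new record
stamp  <- "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} "
starts <- grepl(stamp, lines)

# cumsum() gives each record and its continuation lines the same group id
group  <- cumsum(starts)
merged <- unname(as.vector(tapply(lines, group, paste, collapse = " ")))

# Collapse runs of spaces left over from the wrapping
merged <- gsub(" +", " ", merged)
merged
```
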

I will use a regular expression if I have to, but it would be nice to
keep the dates and times on there. Originally, I thought they were
meaningless, but I've since changed my mind on that count. The time of
day isn't so important. But, especially since, say, Gmail itself knows
how to quickly recognize what it is, I know it can be done. I know
this data has structure to it.
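
If the dates and times are kept as columns, they can be turned into real date-time values. A minimal sketch, assuming the columns are called date and time and treating the timestamps as UTC (the thread does not say which time zone the log uses, so tz = "UTC" is only a placeholder). The data frame d is made-up sample data.

```r
# Small data frame in the shape discussed in this thread (illustrative data)
d <- data.frame(
  date    = c("2016-07-01", "2016-07-01"),
  time    = c("15:59:17", "15:59:27"),
  person  = c("Jane Doe", "Jane Doe"),
  comment = c("Seriously I've never been over there", "It confuses me :("),
  stringsAsFactors = FALSE
)

# Combine date and time into one POSIXct column (time zone is an assumption)
d$when <- as.POSIXct(paste(d$date, d$time),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# The calendar day alone, if the time of day is not needed
d$day <- as.Date(d$when)

# Seconds elapsed between consecutive messages
gap <- as.numeric(difftime(d$when[2], d$when[1], units = "secs"))
gap
```
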

Michael



On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius using comcast.net> wrote:
>
>
> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> > I have a wild and crazy text file, the head of which looks like this:
> >
> > 2016-07-01 02:50:35 <john> hey
> > 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> > 2016-07-01 02:51:45 <john> thinking about my boo
> > 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> > 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> > 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> > 2016-07-01 02:54:17 <john> just know it's london
> > 2016-07-01 02:56:44 <jane> you are probably asleep
> > 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> > 2016-07-01 02:58:56 <jone>
> > 2016-07-01 02:59:34 <jane>
> > 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>
> Looks entirely not-"crazy". Typical log file format.
>
> Two possibilities: 1) Use `read.fwf` (in the utils package); 2) Use regex
> (i.e. the `sub` function) to strip everything up to the "<". Read
> `?regex`. Since "<" is not a metacharacter, you could use a pattern
> ".+<" and replace with "".
>
> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> at least within hours of each, is considered poor manners.
>
>
> --
>
> David.
>
> >
> > It goes on for a while. It's a big file. But I feel like it's going to
> > be difficult to annotate with the coreNLP library or package. I'm
> > doing natural language processing. In other words, I'm curious as to
> > how I would shave off the dates, that is, to make it look like:
> >
> > <john> hey
> > <jane> waiting for plane to Edinburgh
> >   <john> thinking about my boo
> > <jane> nothing crappy has happened, not really
> > <john> plane went by pretty fast, didn't sleep
> > <jane> no idea what time it is or where I am really
> > <john> just know it's london
> > <jane> you are probably asleep
> > <jane> I hope fish was fishy in a good eay
> >   <jone>
> > <jane>
> > <john> British security is a little more rigorous...
> >
> > To be clear, then, I'm trying to clean a large text file by writing a
> > regular expression such that I create a new object with no numbers or
> > dates.
> >
> > Michael
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.


