[R] how to separate string from numbers in a large txt file

David Winsemius <dwinsemius using comcast.net>
Thu May 16 22:05:10 CEST 2019


On 5/16/19 12:30 PM, Michael Boulineau wrote:
> Thanks for this tip on etiquette, David. I will be sure and not do that again.
>
> I tried read.fwf (from the utils package), with code like this:
>
>   d <- read.fwf("hangouts-conversation.txt",
>                  widths= c(10,10,20,40),
>                  col.names=c("date","time","person","comment"),
>                  strip.white=TRUE)
>
> But it threw this error:
>
> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
>    line 6347 did not have 4 elements


So what does line 6347 look like? (Use `readLines` and print it out.)
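
A minimal sketch of that check, using a small temporary file as a stand-in for the real one. The file contents and the index 2 are made up for illustration; with the real data you would read "hangouts-conversation.txt" and look at element 6347.

```r
# Stand-in for the real file: three log lines written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("2016-07-01 02:50:35 <john> hey",
             "2016-07-01 02:58:56 <jone>",
             "2016-07-01 03:02:48 <john> British security is a little more rigorous..."),
           tmp)

# Read every line as-is into a character vector, with no parsing into columns.
chrvec <- readLines(tmp)

# Print the suspect line (6347 in the real file; 2 in this toy example).
suspect <- chrvec[2]
print(suspect)

# nchar() shows how wide the line really is, which matters for fixed widths.
nchar(suspect)
```

Looking at the raw line and its width is usually enough to see why a fixed-width reader could not find four fields in it.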

>
> Interestingly, though, the error only happened when I increased the
> width size. But I had to increase the size, or else I couldn't "see"
> anything.  The comment was so small that nothing was being captured by
> the size of the column, so to speak.
>
> It seems like what's throwing me is that there's no comma that
> demarcates the end of the text proper. For example:

Not sure why you thought there should be a comma. Lines usually end
with a <cr> and/or an <lf>.


Once you have the raw text in a character vector from `readLines` named, 
say, 'chrvec', then you could selectively substitute commas for spaces 
with regex. (Now that you no longer desire to remove the dates and times.)

sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)

This will not do any replacements when the pattern is not matched. See 
this test:


 > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
 > newvec
  [1] "2016-07-01,02:50:35,<john>,hey"
  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
  [7] "2016-07-01,02:54:17,<john>,just know it's london"
  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."


You should probably remove the "empty comment" lines.
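
One way to do that, sketched under the assumption that every complete line has the shape date, time, <name>, comment: keep only the lines that match the pattern with `grepl`, then split them into columns with `strcapture` from utils (available in R 3.4.0 and later). Splitting on the captured groups avoids relying on the commas, since the comments themselves contain commas. The toy input and the column names are illustrative.

```r
# Toy input standing in for the result of readLines().
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:52:07 <jane> nothing crappy has happened, not really",
            "2016-07-01 02:58:56 <jone>",
            "2016-07-01 02:59:34 <jane>")

# Same pattern as in the sub() call: 10-char date, 8-char time, <name>, comment.
pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# Drop the "empty comment" lines, i.e. those the pattern does not match.
complete <- chrvec[grepl(pat, chrvec)]

# Split each remaining line into its four captured fields.
d <- strcapture(pat, complete,
                proto = data.frame(date = character(), time = character(),
                                   person = character(), comment = character(),
                                   stringsAsFactors = FALSE))
str(d)
```

If a comment could itself contain a ">" the greedy `<.+>` would need tightening, for example to `<[^>]+>`.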


-- 

David.

>
> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> lots of Starbucks in my day2016-07-01 15:35:47
>
> It was interesting, too, when I pasted the text into the email, it
> self-formatted into the way I wanted it to look. I had to manually
> make it look like it does above, since that's the way that it looks in
> the txt file. I wonder if it's being organized by XML or something.
>
> Anyways, there's always a space between the two sideways carets, just
> like there is right now: <John Doe> See. Space. And there's always a
> space between the date and time. Like this. 2016-07-01 15:34:30 See.
> Space. But there's never a space between the end of the comment and
> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> See. starbucks and 2016 are smooshed together.
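
If the file really does run the entries together like that, one possible approach is to put a line break in front of every timestamp and then split on it. This is a rough sketch, assuming each entry starts with a "YYYY-MM-DD HH:MM:SS <" stamp and that no comment contains such a stamp itself. `blob` is a made-up sample built from the text quoted above.

```r
# Three entries run together with no line breaks between them.
blob <- paste0("2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
               "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
               "2016-07-01 15:35:09 <Jane Doe> You must want coffees")

# A timestamp followed by the opening "<" of the name, captured as group 1.
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} <)"

# Insert a newline before every timestamp, then split on the newlines.
marked <- gsub(stamp, "\n\\1", blob)
entries <- strsplit(marked, "\n", fixed = TRUE)[[1]]

# The first piece is empty because the text starts with a timestamp.
entries <- entries[nzchar(entries)]
entries
```

The resulting vector has one element per entry, so it can be handled the same way as the output of `readLines`.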
>
> This code is also on the table right now too.
>
> a <- read.table("E:/working
> directory/-189/hangouts-conversation2.txt", quote="\"",
> comment.char="", fill=TRUE)
>
> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
>
> aa<-gsub("[^[:digit:]]","",h)
> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>
> Those last lines are a work in progress. I wish I could import a
> picture of what it looks like when it's translated into a data frame.
> The fill=TRUE helped to get the data in table that kind of sort of
> works, but the comments keep bleeding into the date and time column.
> It's like
>
> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> over               there
> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>
> And then, maybe, the "seriously" will be in a column all to itself, as
> will be the "I've" and the "never" etc.
>
> I will use a regular expression if I have to, but it would be nice to
> keep the dates and times on there. Originally, I thought they were
> meaningless, but I've since changed my mind on that count. The time of
> day isn't so important. But, especially since, say, Gmail itself knows
> how to quickly recognize what it is, I know it can be done. I know
> this data has structure to it.
>
> Michael
>
>
>
> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius using comcast.net> wrote:
>>
>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>> I have a wild and crazy text file, the head of which looks like this:
>>>
>>> 2016-07-01 02:50:35 <john> hey
>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>> 2016-07-01 02:54:17 <john> just know it's london
>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>> 2016-07-01 02:58:56 <jone>
>>> 2016-07-01 02:59:34 <jane>
>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>> Looks entirely not-"crazy". Typical log file format.
>>
>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
>> (i.e. the `sub` function) to strip everything up to the "<". Read
>> `?regex`. Since "<" is not a metacharacter you could use a pattern
>> ".+<" and replace with "<".
>>
>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
>> at least within hours of each, is considered poor manners.
>>
>>
>> --
>>
>> David.
>>
>>> It goes on for a while. It's a big file. But I feel like it's going to
>>> be difficult to annotate with the coreNLP library or package. I'm
>>> doing natural language processing. In other words, I'm curious as to
>>> how I would shave off the dates, that is, to make it look like:
>>>
>>> <john> hey
>>> <jane> waiting for plane to Edinburgh
>>>    <john> thinking about my boo
>>> <jane> nothing crappy has happened, not really
>>> <john> plane went by pretty fast, didn't sleep
>>> <jane> no idea what time it is or where I am really
>>> <john> just know it's london
>>> <jane> you are probably asleep
>>> <jane> I hope fish was fishy in a good eay
>>>    <jone>
>>> <jane>
>>> <john> British security is a little more rigorous...
>>>
>>> To be clear, then, I'm trying to clean a large text file by writing a
>>> regular expression, such that I create a new object with no numbers or
>>> dates.
>>>
>>> Michael
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.


