[R] Reading a file w/ two delimiters

Bert Gunter gunter.berton at gene.com
Fri Nov 18 19:44:38 CET 2011


David:

As you now realize, "\t" is a perfectly legal single tab character (and
likewise "\n" etc. are single characters).

Now consider:

> gsub("\\","a","\\")
Error in gsub("\\", "a", "\\") :
  invalid regular expression '\', reason 'Trailing backslash'

BUT

> gsub("\\\\","a","\\")
[1] "a"

???

The issue is that there are two levels of escapes here -- the R parser's
and the regular expression engine's. The R parser reads "\\" as a single
backslash character, as in the third argument of gsub above. In the first,
incorrect version, the pattern is therefore passed on to the regular
expression engine as a single backslash, which is meaningless to it:
a backslash has to be followed by something to escape. For example, a
backreference would be something like "\\2" = "backslash 2."

The second incantation's first argument is correct and is passed on to
the regular expression engine as "backslash backslash," which it
interprets as an escaped "\", i.e. a literal "\", per the
documentation.
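
The two levels can be checked directly at the console. This is only a
minimal sketch, assuming R's default regular expression engine; the
variable names are made up for illustration, and fixed = TRUE is shown as
one way to skip the regexp level altogether:

```r
# Level 1: the R parser. The literal "\\" is ONE character, a backslash.
one_backslash <- "\\"
n_chars <- nchar(one_backslash)

# Level 2: the regexp engine. A lone backslash is not a valid pattern,
# so gsub() fails; the failure is caught here instead of stopping the script.
lone_fails <- tryCatch(
  { gsub("\\", "a", one_backslash); FALSE },
  error   = function(e) TRUE,
  warning = function(w) TRUE
)

# Four backslashes in the source -> two in the string -> one literal
# backslash as far as the regexp engine is concerned.
escaped <- gsub("\\\\", "a", one_backslash)

# fixed = TRUE treats the pattern as a plain string, so the single
# backslash needs no second level of escaping.
literal <- gsub("\\", "a", one_backslash, fixed = TRUE)

print(c(n_chars = n_chars, lone_fails = lone_fails))
print(c(escaped = escaped, literal = literal))
```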

So what about :

> cat(z)
ab      cd>
> cat(sub("\\\t","\n",z))
ab
cd>

R passes "backslash tab_character" to the regexp engine, which also
looks to me like an error. However, this may be one of those
"implementation dependent" details mentioned in the Help file. It
seems to me that the engine sees a meaningless escape sequence and
just throws away the escape to interpret the character literally. As
support for this, consider "\h": it is not a meaningful escape sequence
to the R parser, yet the regexp engine accepts "backslash h" and matches
a literal "h":

> gsub("\\h","a","\h")
Error: '\h' is an unrecognized escape in character string starting "\h"

and

> gsub("\\h","a","h")
[1] "a"

But I may be wrong, and I am hoping that this post will prompt someone
more knowledgeable than I to respond (if only just to confirm my
"explanation" if it's correct).
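
For anyone who wants to probe this, the three spellings of the tab pattern
can be compared side by side. This is only a sketch: z is assumed to be
"ab\tcd" (its definition is not shown above), the variable names are made
up, and the last case relies on the engine behaviour guessed at above
rather than on anything I can point to in the documentation.

```r
z <- "ab\tcd"   # assumed value of z: "ab", a tab, "cd"

tab_only      <- "\t"     # 1 character: a tab
backslash_t   <- "\\t"    # 2 characters: backslash, "t"
backslash_tab <- "\\\t"   # 2 characters: backslash, tab

lens <- nchar(c(tab_only, backslash_t, backslash_tab))

# The parser already produced a tab; the engine matches it literally.
r1 <- sub(tab_only, "\n", z)

# The engine receives backslash-t and interprets it as TAB (see ?regex).
r2 <- sub(backslash_t, "\n", z)

# The engine receives backslash-tab; apparently the escape is dropped
# and the tab is matched literally (implementation dependent).
r3 <- sub(backslash_tab, "\n", z)

print(lens)
print(c(r1, r2, r3))
```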

Cheers,
Bert





On Fri, Nov 18, 2011 at 7:26 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Nov 18, 2011, at 9:28 AM, jim holtman wrote:
>
>> It is pretty straightforward in R:
>>
>>> x <-
>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>> closeAllConnections()
>>> # convert tabs to newlines
>>> x <- gsub("\t", "\n", x)
>
> Did the rules get liberalized for escaping patterns? Or have I been
> unnecessarily expending backslashes all these years? I thought that one
> needed 3 backslashes. This code does work and I am wondering if/when I
> "didn't get the memo". (I do see that there is a line early in the ?regex
> page that suggests I have been deluded all along.)
>
> "The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as
> LF, \r as CR and \t as TAB."
>
>> x <-
>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>> closeAllConnections()
>> # convert tabs to newlines
>> x2 <- gsub("\\\t", "\n", x)
>> x2
> [1] "sadf|asdf|asdf\nqwer|qwer|qwer\nzxcv|zxcv|zxfcgv"
>
> So I guess my question is (now) why the triple-slash technique even works?
>
> --
> David.
>
>
>
>>> # write out to a temp file and then read in as a data frame
>>> myFile <- tempfile()
>>> writeLines(x, con = myFile)
>>> x.df <- read.table(myFile, sep = "|")
>>>
>>>
>>> x.df
>>
>>   V1   V2     V3
>> 1 sadf asdf   asdf
>> 2 qwer qwer   qwer
>> 3 zxcv zxcv zxfcgv
>>>
>>
>> On Fri, Nov 18, 2011 at 9:13 AM, Langston, Jim
>> <Jim.Langston at compuware.com> wrote:
>>>
>>> Thanks Paul,
>>>
>>> That's the path I was marching down, I was hoping for something
>>> a little cleaner, I do the same with Perl or Java.
>>>
>>> Jim
>>>
>>> On 11/18/11 8:35 AM, "Paul Hiemstra" <paul.hiemstra at knmi.nl> wrote:
>>>
>>>> Hi Jim,
>>>>
>>>> You can read the text file using readLines. This puts each line in the
>>>> file into an element of a character vector. Then you can go through the
>>>> lines manually (e.g. using grep, sub, strsplit) and create your data.frame.
>>>>
>>>> cheers,
>>>> Paul
>>>>
>>>> On 11/18/2011 12:37 PM, Langston, Jim wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been scratching and poking, but basically, the file I need to read
>>>>> has two delimiters that I need to contend with. The first is that the
>>>>> file contains tabs (\t) instead of newlines (\n), and the second is that
>>>>> the fields have | for the separators. I can easily do a read if I first
>>>>> convert the \t to \n and then use read.table to get the file read with
>>>>> the | separator. But what I would really like to do is do this all
>>>>> within R. I have a lot of files to read and do analysis on.
>>>>>
>>>>> I can read the data into a table using the \t as delimiter, but can't
>>>>> figure out how to take that table data and use the | for separation.
>>>>> I've looked at string splits, etc., but haven't figured out how to split
>>>>> the whole table.
>>>>>
>>>>> Any thoughts? Hints?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>> --
>>>> Paul Hiemstra, Ph.D.
>>>> Global Climate Division
>>>> Royal Netherlands Meteorological Institute (KNMI)
>>>> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
>>>> P.O. Box 201 | 3730 AE | De Bilt
>>>> tel: +31 30 2206 494
>>>>
>>>> http://intamap.geo.uu.nl/~paul
>>>> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>
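
Coming back to the original question, the conversion can also be done
without writing the converted lines back out to disk. A sketch, assuming
the data look like the one-line sample used earlier in the thread; the
file below is a stand-in created with tempfile(), and records is just an
illustrative name:

```r
# Stand-in for the real file: tab-separated records, "|"-separated fields.
myFile <- tempfile()
writeLines("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv", con = myFile)

# Split every line on literal tabs; fixed = TRUE sidesteps regexp escaping.
records <- unlist(strsplit(readLines(myFile), "\t", fixed = TRUE))

# Parse the records directly via the 'text' argument of read.table().
x.df <- read.table(text = records, sep = "|", stringsAsFactors = FALSE)
unlink(myFile)

print(x.df)
```

With many files, the same three steps could be wrapped in a function and
applied with lapply() over the file names.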



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


