[R] Reading a file w/ two delimiters

Bert Gunter gunter.berton at gene.com
Fri Nov 18 19:48:25 CET 2011


... I failed to correctly paste the first line of an example:

On Fri, Nov 18, 2011 at 10:44 AM, Bert Gunter <bgunter at gene.com> wrote:
> David:
>
> As you now realize "\t" etc. is a perfectly legal single tab character.
>
> Now consider:
-------------  left this out --------------
> gsub("\\","a","\\")
-----------------------------------------------
> Error in gsub("\\", "a", "\\") :
>  invalid regular expression '\', reason 'Trailing backslash'
>
> BUT
>
>> gsub("\\\\","a","\\")
> [1] "a"
>
> ???
>
> The issue is that there are two levels of escapes here: the R parser's
> and the regular expression engine's. The R parser reads "\\" as a single
> backslash character, both in the third argument of gsub above and in the
> pattern. In the first, incorrect version, the pattern therefore reaches
> the regular expression engine as a lone backslash, which is meaningless
> to it because nothing follows to be escaped. By contrast, a
> backreference would be written "\\2", which the engine sees as
> "backslash 2".
>
> The second incantation's first argument is correct and is passed on to
> the regular expression engine as "backslash backslash", which it
> interprets as an escaped "\", i.e. a literal "\", per the
> documentation.
>
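To make the two levels concrete, here is a minimal sketch, assuming a current R session. The variable names (one, two, res_regex, res_fixed, err) are mine and purely illustrative. It counts the characters the parser produces and then runs the same substitution through the regex engine and with fixed = TRUE, which skips the regex level altogether.

```r
# Level 1: the R parser. "\\" in source code is ONE backslash character.
one <- "\\"
two <- "\\\\"
nchar(one)  # 1
nchar(two)  # 2

# Level 2: the regex engine. It needs backslash-backslash to mean
# "a literal backslash", so the pattern must hold two characters.
res_regex <- gsub(two, "a", one)

# Alternative: bypass the regex engine with fixed = TRUE, so only the
# parser-level escape is needed.
res_fixed <- gsub(one, "a", one, fixed = TRUE)

# A one-character backslash pattern is an invalid regex; capture the
# error message instead of halting.
err <- tryCatch(gsub(one, "a", one),
                error = function(e) conditionMessage(e))
```

If the aim is only to replace a literal character, fixed = TRUE is probably the less error-prone choice, since there is then just one level of escaping to think about.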
> So what about :
>
>> cat(z)
> ab      cd>
>> cat(sub("\\\t","\n",z))
> ab
> cd>
>
> R passes "backslash tab_character" to the regexp engine, which also
> looks like an error to me. However, this may be one of those
> "implementation dependent" details mentioned in the Help file. It
> seems to me that the engine sees a meaningless escape sequence and
> just throws away the escape to interpret the character literally. As
> support for this, "\h" is not a meaningful escape sequence in R, so
> the parser rejects it:
>
>> gsub("\\h","a","\h")
> Error: '\h' is an unrecognized escape in character string starting "\h"
>
> yet the regexp engine accepts "backslash h" and treats it as a literal h:
>
>> gsub("\\h","a","h")
> [1] "a"
>
> But I may be wrong, and I am hoping that this post will prompt someone
> more knowledgeable than I to respond (if only just to confirm my
> "explanation" if it's correct).
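A small sketch of the tab case, again assuming a current R session and using illustrative names of my own (z is as in the example above; pat3, r1 to r4 are hypothetical). It shows what the parser actually hands over for "\\\t", and three unambiguous ways to match a tab. The last call is the triple-backslash form from the thread. Its result is left unstated here because, per ?regex, escaping a non-metacharacter is implementation-dependent.

```r
z <- "ab\tcd"

# "\\\t" is parsed by R into two characters: a backslash (code 92)
# followed by a real tab character (code 9).
pat3 <- "\\\t"
codes <- utf8ToInt(pat3)

# Three unambiguous ways to match the tab:
r1 <- sub("\t", "\n", z)                # pattern is a literal tab character
r2 <- sub("\\t", "\n", z)               # pattern is the regex escape \t
r3 <- sub("\t", "\n", z, fixed = TRUE)  # no regex engine involved

# The form discussed above: backslash + tab character. What the engine
# does with this escape is implementation-dependent.
r4 <- sub(pat3, "\n", z)
```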
>
> Cheers,
> Bert
>
>
>
>
>
> On Fri, Nov 18, 2011 at 7:26 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Nov 18, 2011, at 9:28 AM, jim holtman wrote:
>>
>>> It is pretty straightforward in R:
>>>
>>>> x <-
>>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>>> closeAllConnections()
>>>> # convert tabs to newlines
>>>> x <- gsub("\t", "\n", x)
>>
>> Did the rules get liberalized for escaping patterns? Or have I been
>> unnecessarily expending backslashes all these years? I thought that one
>> needed 3 backslashes. This code does work and I am wondering if/when I
>> "didn't get the memo". (I do see that there is a line early in the ?regex
>> page that suggests I have been deluded all along.)
>>
>> "The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as
>> LF, \r as CR and \t as TAB."
>>
>>> x <-
>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>> closeAllConnections()
>>> # convert tabs to newlines
>>> x2 <- gsub("\\\t", "\n", x)
>>> x2
>> [1] "sadf|asdf|asdf\nqwer|qwer|qwer\nzxcv|zxcv|zxfcgv"
>>
>> So I guess my question is (now): why does the triple-backslash technique
>> even work?
>>
>> --
>> David.
>>
>>
>>
>>>> # write out to a temp file and then read in as a data frame
>>>> myFile <- tempfile()
>>>> writeLines(x, con = myFile)
>>>> x.df <- read.table(myFile, sep = "|")
>>>>
>>>>
>>>> x.df
>>>
>>>   V1   V2     V3
>>> 1 sadf asdf   asdf
>>> 2 qwer qwer   qwer
>>> 3 zxcv zxcv zxfcgv
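A variant of the approach above that avoids the temporary file, offered as a sketch rather than a definitive recipe. It assumes R 2.14.0 or later, where read.table accepts a text argument. The helper read_tab_pipe and the names raw, records and x.df2 are hypothetical and not from the thread.

```r
# Sample data from the thread: tabs separate records, | separates fields.
raw <- "sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"

# Turn each tab into a newline; fixed = TRUE skips the regex engine.
records <- gsub("\t", "\n", raw, fixed = TRUE)

# Parse the text directly instead of writing a temp file first.
x.df <- read.table(text = records, sep = "|", stringsAsFactors = FALSE)

# Hypothetical helper (not from the thread): same steps applied to a file.
read_tab_pipe <- function(path) {
  # warn = FALSE because such a file may lack a final newline
  txt <- readLines(path, warn = FALSE)
  read.table(text = gsub("\t", "\n", txt, fixed = TRUE),
             sep = "|", stringsAsFactors = FALSE)
}

# Usage on a throwaway file
tmp <- tempfile()
writeLines(raw, tmp)
x.df2 <- read_tab_pipe(tmp)
unlink(tmp)
```

With many files, something like lapply(file_names, read_tab_pipe) would apply the helper to each one (file_names being a character vector of paths you supply). Note that read.table's defaults still treat quote characters and # specially, so fields containing them may need quote = "" and comment.char = "".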
>>>>
>>>
>>> On Fri, Nov 18, 2011 at 9:13 AM, Langston, Jim
>>> <Jim.Langston at compuware.com> wrote:
>>>>
>>>> Thanks Paul,
>>>>
>>>> That's the path I was marching down. I was hoping for something
>>>> a little cleaner; I do the same with Perl or Java.
>>>>
>>>> Jim
>>>>
>>>> On 11/18/11 8:35 AM, "Paul Hiemstra" <paul.hiemstra at knmi.nl> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> You can read the text file using readLines. This puts each line in the
>>>>> file into an element of a character vector. Then you can go through the lines
>>>>> manually (e.g. using grep, sub, strsplit) and create your data.frame.
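The manual route described here can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: txt stands in for the result of readLines on the real file, every record is assumed to have the same number of fields, and the names records, fields, m and df are hypothetical.

```r
# Stand-in for readLines("yourfile"): one line holding tab-separated records.
txt <- "sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"

# First delimiter: split the line(s) into records on tabs.
records <- unlist(strsplit(txt, "\t", fixed = TRUE))

# Second delimiter: split each record into fields on "|".
# fixed = TRUE matters here because "|" is a regex metacharacter.
fields <- strsplit(records, "|", fixed = TRUE)

# Stack the per-record field vectors into a matrix, then a data frame.
# Assumes all records have the same number of fields.
m <- do.call(rbind, fields)
df <- as.data.frame(m, stringsAsFactors = FALSE)
```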
>>>>>
>>>>> cheers,
>>>>> Paul
>>>>>
>>>>> On 11/18/2011 12:37 PM, Langston, Jim wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've been scratching and poking, but basically, the file I need to read
>>>>>> has two delimiters that I need to contend with. The first is that the
>>>>>> file contains tabs (\t) instead of newlines (\n), and the second is that
>>>>>> the fields have | for the separators. I can easily do a read if I first
>>>>>> convert the \t to \n and then use read.table to get the file read with
>>>>>> the | separator. But what I would really like to do is do this all
>>>>>> within R. I have a lot of files to read and do analysis on.
>>>>>>
>>>>>> I can read the data into a table using the \t as delimiter, but can't
>>>>>> figure out how to take that table data and use the | for separation.
>>>>>> I've looked at string splits, etc., but haven't figured out how to split
>>>>>> the whole table.
>>>>>>
>>>>>> Any thoughts? Hints?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>> The contents of this e-mail are intended for the named addressee only. It
>>>> contains information that may be confidential. Unless you are the named
>>>> addressee or an authorized designee, you may not copy or use it, or disclose
>>>> it to anyone else. If you received it in error please notify us immediately
>>>> and then destroy it.
>>>>
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>> --
>>>>> Paul Hiemstra, Ph.D.
>>>>> Global Climate Division
>>>>> Royal Netherlands Meteorological Institute (KNMI)
>>>>> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
>>>>> P.O. Box 201 | 3730 AE | De Bilt
>>>>> tel: +31 30 2206 494
>>>>>
>>>>> http://intamap.geo.uu.nl/~paul
>>>>> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>





