[R] Reading a file w/ two delimiters

Bert Gunter gunter.berton at gene.com
Fri Nov 18 19:56:16 CET 2011


... and yet another line I left out below!  I apologize for this baloney!

On Fri, Nov 18, 2011 at 10:48 AM, Bert Gunter <bgunter at gene.com> wrote:
> ... I failed to correctly paste the first line of an example:
>
> On Fri, Nov 18, 2011 at 10:44 AM, Bert Gunter <bgunter at gene.com> wrote:
>> David:
>>
>> As you now realize, "\t" is a perfectly legal single tab character, and
>> likewise for "\n" etc.
>>
>> Now consider:
> -------------  left this out --------------
>> gsub("\\","a","\\")
> -----------------------------------------------
>> Error in gsub("\\", "a", "\\") :
>>  invalid regular expression '\', reason 'Trailing backslash'
>>
>> BUT
>>
>>> gsub("\\\\","a","\\")
>> [1] "a"
>>
>> ???
>>
>> The issue is that there are two levels of escaping here -- the R parser's
>> and the regular expression engine's. The R parser reads "\\" as a single
>> backslash character, in the first argument of gsub as well as the third.
>> In the first, incorrect version, that single backslash is passed on to
>> the regular expression engine as the whole pattern. A lone backslash is
>> meaningless to the engine, which expects it to escape something: a
>> backreference, for example, would be something like "\\2" =
>> "backslash 2."
>>
>> The second incantation's first argument is correct and is passed on to
>> the regular expression engine as "backslash backslash," which it
>> interprets as an escaped "\", i.e. a literal "\", per the
>> documentation.
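The two levels described above can be inspected separately. What follows is a minimal sketch, not code from the thread: it assumes R's default regular expression engine, and the variable names are made up for illustration. nchar() shows what the R parser leaves in a string, and fixed = TRUE (a standard gsub argument) skips the regex level altogether.

```r
# Level 1: the R parser. Each "\\" in source code is ONE backslash character.
one_backslash   <- "\\"
two_backslashes <- "\\\\"
n_one <- nchar(one_backslash)
n_two <- nchar(two_backslashes)

# Level 2: the regex engine. The pattern must be an escaped backslash,
# i.e. two backslash characters, which is written "\\\\" in R source.
res_regex <- gsub("\\\\", "a", one_backslash)

# With fixed = TRUE the pattern is a literal string, so no regex escaping
# is needed and a single backslash character is a valid pattern.
res_fixed <- gsub("\\", "a", one_backslash, fixed = TRUE)

# The invalid pattern from the example above fails at the regex level;
# the error is caught here so the script keeps running.
res_bad <- tryCatch(gsub("\\", "a", one_backslash),
                    error = function(e) "regex error")

print(c(n_one, n_two))
print(c(res_regex, res_fixed, res_bad))
```

If only literal replacement is wanted, fixed = TRUE is probably the simplest way to sidestep the second level of escaping.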
>>
>> So what about :
>>
---------------  also left this out ---------------
z <- "ab\tcd"
-----------------------------------------------
>>> cat(z)
>> ab      cd>
>>> cat(sub("\\\t","\n",z))
>> ab
>> cd>
>>
>> R passes "backslash tab_character" to the regexp engine, which also
>> looks like an error to me. However, this may be one of those
>> "implementation dependent" details mentioned in the Help file. It
>> seems to me that the engine sees a meaningless escape sequence and
>> just throws away the escape to interpret the character literally. As
>> support for this, "\h" is not a meaningful escape sequence in R:
>>
>>> gsub("\\h","a","\h")
>> Error: '\h' is an unrecognized escape in character string starting "\h"
>>
>> and
>>
>>> gsub("\\h","a","h")
>> [1] "a"
>>
>> But I may be wrong, and I am hoping that this post will prompt someone
>> more knowledgeable than I to respond (if only just to confirm my
>> "explanation" if it's correct).
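The three spellings of the tab pattern discussed in this thread can be compared side by side. This is a sketch under the assumption that the default regex engine is used, as in the examples above; other engines (e.g. perl = TRUE) may treat unknown escapes differently. The variable names are illustrative only.

```r
z <- "ab\tcd"  # the R parser turns \t into one tab character

# Pattern is a literal tab character (the parser did the work).
res_plain <- sub("\t", "\n", z)

# Pattern is backslash + "t"; the regex engine interprets \t as TAB.
res_double <- sub("\\t", "\n", z)

# Pattern is backslash + tab character; the default engine appears to
# drop the meaningless escape and match the tab literally.
res_triple <- sub("\\\t", "\n", z)

# fixed = TRUE avoids the question: no regex interpretation at all.
res_fixed <- sub("\t", "\n", z, fixed = TRUE)

# Number of characters the regex engine receives for each pattern.
print(c(nchar("\t"), nchar("\\t"), nchar("\\\t")))
print(identical(res_plain, res_double))
```

The first two spellings rely on documented behaviour; the third seems to depend on how the engine handles an escape it does not recognize, so it is probably best avoided in new code.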
>>
>> Cheers,
>> Bert
>>
>>
>>
>>
>>
>> On Fri, Nov 18, 2011 at 7:26 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Nov 18, 2011, at 9:28 AM, jim holtman wrote:
>>>
>>>> It is pretty straightforward in R:
>>>>
>>>>> x <-
>>>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>>>> closeAllConnections()
>>>>> # convert tabs to newlines
>>>>> x <- gsub("\t", "\n", x)
>>>
>>> Did the rules get liberalized for escaping patterns? Or have I been
>>> unnecessarily expending backslashes all these years? I thought that one
>>> needed 3 backslashes. This code does work and I am wondering if/when I
>>> "didn't get the memo". (I do see that there is a line early in the ?regex
>>> page that suggests I have been deluded all along.)
>>>
>>> "The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as
>>> LF, \r as CR and \t as TAB."
>>>
>>>> x <-
>>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>>> closeAllConnections()
>>>> # convert tabs to newlines
>>>> x2 <- gsub("\\\t", "\n", x)
>>>> x2
>>> [1] "sadf|asdf|asdf\nqwer|qwer|qwer\nzxcv|zxcv|zxfcgv"
>>>
>>> So I guess my question is (now) why the triple-slash technique even works?
>>>
>>> --
>>> David.
>>>
>>>
>>>
>>>>> # write out to a temp file and then read in as a data frame
>>>>> myFile <- tempfile()
>>>>> writeLines(x, con = myFile)
>>>>> x.df <- read.table(myFile, sep = "|")
>>>>>
>>>>>
>>>>> x.df
>>>>
>>>>   V1   V2     V3
>>>> 1 sadf asdf   asdf
>>>> 2 qwer qwer   qwer
>>>> 3 zxcv zxcv zxfcgv
>>>>>
>>>>
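The temporary file in the approach above can probably be avoided. The following is a sketch, not the method posted in this thread: it assumes R >= 2.14.0, where read.table() accepts a text argument, and it reuses the sample string from above. The names raw and records are illustrative; for a real file, the single long line would first be read in, e.g. with readLines().

```r
# One long line: records separated by tabs, fields separated by "|".
raw <- "sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"

# Split the line into records on the tab character (literal match).
records <- strsplit(raw, "\t", fixed = TRUE)[[1]]

# Parse the records as "|"-separated fields, without a temporary file.
x.df <- read.table(text = records, sep = "|", stringsAsFactors = FALSE)

print(x.df)
```

Splitting with fixed = TRUE also keeps the regex escaping question out of the picture entirely.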
>>>> On Fri, Nov 18, 2011 at 9:13 AM, Langston, Jim
>>>> <Jim.Langston at compuware.com> wrote:
>>>>>
>>>>> Thanks Paul,
>>>>>
>>>>> That's the path I was marching down. I was hoping for something
>>>>> a little cleaner; I do the same with Perl or Java.
>>>>>
>>>>> Jim
>>>>>
>>>>> On 11/18/11 8:35 AM, "Paul Hiemstra" <paul.hiemstra at knmi.nl> wrote:
>>>>>
>>>>>> Hi Jim,
>>>>>>
>>>>>> You can read the text file using readLines. This puts each line in the
>>>>>> file into an element of a character vector. Then you can go through the
>>>>>> lines manually (e.g. using grep, sub, strsplit) and create your data.frame.
>>>>>>
>>>>>> cheers,
>>>>>> Paul
>>>>>>
>>>>>> On 11/18/2011 12:37 PM, Langston, Jim wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been scratching and poking, but basically, the file I need to
>>>>>>> read has two delimiters that I need to contend with. The first is
>>>>>>> that the file contains tabs (\t) instead of newlines (\n), and the
>>>>>>> second is that the fields have | for the separators. I can easily do
>>>>>>> a read if I first convert the \t to \n and then use read.table to get
>>>>>>> the file read with the | separator. But what I would really like to
>>>>>>> do is do this all within R. I have a lot of files to read and do
>>>>>>> analysis on.
>>>>>>>
>>>>>>> I can read the data into a table using \t as the delimiter, but can't
>>>>>>> figure out how to take that table data and use the | for separation.
>>>>>>> I've looked at string splits, etc., but haven't figured out how to
>>>>>>> split the whole table.
>>>>>>>
>>>>>>> Any thoughts? Hints?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Paul Hiemstra, Ph.D.
>>>>>> Global Climate Division
>>>>>> Royal Netherlands Meteorological Institute (KNMI)
>>>>>> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
>>>>>> P.O. Box 201 | 3730 AE | De Bilt
>>>>>> tel: +31 30 2206 494
>>>>>>
>>>>>> http://intamap.geo.uu.nl/~paul
>>>>>> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jim Holtman
>>>> Data Munger Guru
>>>>
>>>> What is the problem that you are trying to solve?
>>>> Tell me what you want to do, not how you want to do it.
>>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>>
>>
>>
>>
>> --
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>>
>> Internal Contact Info:
>> Phone: 467-7374
>> Website:
>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>>





