[R] Pattern match

David Winsemius dwinsemius at comcast.net
Thu Apr 21 14:30:28 CEST 2011


On Apr 21, 2011, at 5:27 AM, neetika nath wrote:

> Thank you Dennis,
>
> Yes, the problem is the input file. I have an .rdf file whose format is the
> same as the one I posted earlier. If I open that file in Notepad++, the lines
> are broken with CR+LF characters. Do you have any suggestion for retrieving
> the SpeciesScientific information without changing the input file?

You might consider attaching the original file named with an extension  
of `.txt`, since your verbal description does not match your included  
example. What I see after the various servers have passed this around  
and inserted line-ends is the string `SpeciesScientific` in the first  
line, rather than in the third.
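
If the trailing `+` on each physical line is a continuation marker (an assumption based on the description in the original post, not something the file itself confirms), the pieces can be rejoined in memory so that the input file is never modified. The sketch below uses a character vector as a stand-in for what `readLines()` would return; the file name in the comment is hypothetical, and the four lines are my reading of the posted example before mail wrapping.

```r
# Stand-in for: lines <- readLines("input.rdf")   (hypothetical file name)
# readLines() accepts LF, CRLF or CR as the end-of-line marker.
lines <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Assumption: a "+" at the end of a line only marks a continuation.
# Remove it (and a stray carriage return, if one survived), then join.
record <- paste(sub("\\+\r?$", "", lines), collapse = "")

# Find every SpeciesScientific=(...) field in the rejoined record.
hits <- regmatches(record, gregexpr("SpeciesScientific=\\([^)]*\\)", record))[[1]]

# Keep only the text between the parentheses.
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", hits)
species
```

Joining first means the pattern never has to account for a keyword split across two lines, which is the case described for the third line.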

-- 
David

-- 
>
> Thank you
>
> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmuser at gmail.com>  
> wrote:
>
>> Hi:
>>
>> This is a bit of a roundabout approach; I'm sure that folks with regex
>> expertise will trump this in a heartbeat. I modified the last piece of
>> the string a bit to accommodate the approach below. Depending on where
>> the strings have line breaks, you may have some odd '\n' characters
>> inserted.
>>
>> # Step 1: read the input as a single character string
>> # (built with paste0() so that mail wrapping cannot break it; the one
>> # embedded '\n' is the line break that shows up in the output below)
>> u <- paste0(
>>   "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);",
>>   "ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);",
>>   "EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);",
>>   "Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);",
>>   "CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter\n",
>>   "cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)")
>>
>> # Step 2: Split input lines by the ';' delimiter and then use lapply()
>> # to split variable names from values.
>> # This results in a nested list for ulist2.
>> ulist <- strsplit(u, ';')
>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>
>> # Step 3: Break out the results into a matrix whose first column is
>> # the variable name and whose second column is the value (with parens
>> # included). This avoids dealing with nested lists.
>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>
>> # Step 4: Strip off the parens
>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>> colnames(w) <- c('Name', 'Value')
>> w
>>     Name                 Value
>> [1,] "SpeciesCommon"      "Human"
>> [2,] "SpeciesScientific"  "Homo sapiens"
>> [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>> [4,] "BondInvolved"       "C-H"
>> [5,] "EzCatDBID"          "S00343"
>> [6,] "BondFormed"         "O-H,O-H"
>> [7,] "Bond"               "255B"
>> [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>> [9,] "CatalyticSwissProt" "P25006"
>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>> [11,] "SpeciesCommon"      "Bacteria"
>> [12,] "Reactive"           "Ce+"
>>
>> # Step 5: Subset out the values of the SpeciesScientific variables
>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>                        Value
>> 2                 Homo sapiens
>> 10 Achromobacter\ncycloclastes
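
The `\n` left inside the second value can be cleaned up after the fact. This is a minimal sketch, assuming that an embedded line break should simply become a single space; the small matrix is a hypothetical stand-in for two rows of `w` from Step 4.

```r
# Hypothetical stand-in for two rows of the matrix w built in Step 4.
w <- matrix(c("SpeciesScientific", "Homo sapiens",
              "SpeciesScientific", "Achromobacter\ncycloclastes"),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("Name", "Value")))

# Replace each run of whitespace, including "\n", with one space.
w[, "Value"] <- gsub("[[:space:]]+", " ", w[, "Value"])

# Same selection as Step 5, done with matrix indexing.
w[w[, "Name"] == "SpeciesScientific", "Value"]
```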
>>
>>
>> One possible 'advantage' of this approach is that if you have a number
>> of string records of this type, you can create nested lists for each
>> string and then manipulate the lists to get what you need. Hopefully
>> you can use some of these ideas for other purposes as well.
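
The multi-record idea in the paragraph above might look roughly like the following. It is only a sketch: `parse_record()` is a hypothetical helper, the two records are shortened made-up examples in the same `Name=(Value);` layout, and each record is assumed to contain at least one SpeciesScientific field.

```r
# Hypothetical helper: turn one "Name=(Value);..." record into a data frame.
parse_record <- function(rec) {
  fields <- strsplit(rec, ";", fixed = TRUE)[[1]]
  # Split each field at its first "=" only.
  name  <- sub("=.*$", "", fields)
  value <- sub("^[^=]*=", "", fields)
  # Drop only the outer parentheses, keeping inner ones such as "Cu(II)".
  value <- sub("^\\((.*)\\)$", "\\1", value)
  data.frame(Name = name, Value = value, stringsAsFactors = FALSE)
}

# Shortened, made-up records for illustration.
records <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);Bond=(255B)",
  "Cofactors=(Cu(II),CU,501,A);SpeciesScientific=(Achromobacter cycloclastes)"
)

# One data frame per record, kept together in a list.
parsed <- lapply(records, parse_record)

# First SpeciesScientific value from each record.
species <- vapply(parsed,
                  function(d) d$Value[d$Name == "SpeciesScientific"][1],
                  character(1))
species
```

Unlike Step 4, this version removes only the outermost parentheses, so a value such as `Cu(II)` keeps its inner ones; whether that is preferable depends on how the values are used later.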
>>
>> Dennis
>>
>>
>>
>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi at gmail.com> wrote:
>>> Hi ALL,
>>>
>>> I have a very simple question regarding pattern matching. Could anyone
>>> tell me how I can use R to retrieve a string pattern from a text file?
>>> For example, my file contains the following information:
>>>
>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>
>>> and I want to extract the “SpeciesScientific = (?)” information from this
>>> file. The problem is in the 3rd line, where the word SpeciesScientific is
>>> split by the +.
>>>
>>> Could anyone help me please?
>>> Thank you
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>

David Winsemius, MD
West Hartford, CT


