[R] extracting characters from a string

Thu Jan 24 07:37:34 CET 2013

Hi David,

The problem could be stray spaces in the data, or something else entirely.
Suppose, for example, that some strings have a space at the beginning or the end:
pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D ')

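pubnew<-rbind(pub1, pub2, pub3)
dat1<-read.table(text=pubnew,sep=",",fill=TRUE,stringsAsFactors=F) # rebuild dat1 from pubnew, as in my previous message
res<-as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub("^ | $","",gsub("[A-Za-z]+$","",gsub(" $","",x))))),stringsAsFactors=F)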
str(res)
#'data.frame':    3 obs. of  4 variables:
# $ V1: chr  "Brown" "Benigni" "Arstra"
# $ V2: chr  "Santos" "" "Van den Hoops"
# $ V3: chr  "Rome" "" "lamarque"
# $ V4: chr  "Don Juan" "" ""
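
To see what each of the three nested gsub() calls above contributes, it can help to run them one at a time on a single string. This is only a sketch on a made-up value: x is a hypothetical stand-in for one cell of dat1 that has a space at both ends.

```r
# Hypothetical single cell with a leading and a trailing space
x <- " lamarque D "

s1 <- gsub(" $", "", x)            # drop the trailing space      -> " lamarque D"
s2 <- gsub("[A-Za-z]+$", "", s1)   # drop the final run of letters -> " lamarque "
s3 <- gsub("^ | $", "", s2)        # drop a space at either end    -> "lamarque"

print(c(s1, s2, s3))
```

The order matters: the trailing space has to go first, otherwise the initials are not at the very end of the string and "[A-Za-z]+$" has nothing to match.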

#If I used the previous solution:
as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))),stringsAsFactors=F)
       V1            V2         V3       V4
1   Brown        Santos       Rome Don Juan
2 Benigni                                  
3  Arstra Van den Hoops lamarque D  # initial present.
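
The reason the initial survives in row 3 can be checked on two small strings. This is a sketch with hypothetical values (not taken from your real data): the only difference between them is the trailing space.

```r
# Two hypothetical cells: without and with a trailing space
x <- c(" lamarque D", " lamarque D ")

# Inner gsub: remove a leading space, or a run of word characters at the very end
step1 <- gsub("^ |\\w+$", "", x)
# Outer gsub: remove one trailing space
step2 <- gsub(" $", "", step1)

print(step2)
# [1] "lamarque"   "lamarque D"
```

In the second string the last character is a space, so "\\w+$" finds no word characters directly before the end and the initial is left in place.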

I tried this case with Rui's solution:
fun2(pubnew)
#[[1]]
#[1] " Brown"   "Santos"   "Rome"     "Don Juan"

#[[2]]
#[1] "Benigni"
#
#[[3]]
#[1] "Arstra"        "Van den Hoops" "lamarque D"   # initials present.
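
fun2 itself is not shown in this message, so the following is not Rui's function. It is only a sketch of one possible list-based approach that tolerates stray spaces, assuming the initials are one or two upper-case letters at the end of each name (as in your description). The names strip_initials, authors and res2 are made up for this example, and trimws() and lengths() need R 3.2.0 or later.

```r
pubnew <- c("Brown DK, Santos R, Rome DF, Don Juan X",
            "Benigni D",
            "Arstra SD, Van den Hoops DD, lamarque D ")

# Hypothetical helper: trim both ends first, then drop 1-2 trailing capitals
strip_initials <- function(x) {
  sub("[[:space:]]+[[:upper:]]{1,2}$", "", trimws(x))
}

# One character vector of last names per publication
authors <- lapply(strsplit(pubnew, ","), strip_initials)

# Pad the shorter vectors with "" so every publication has the same length
n <- max(lengths(authors))
padded <- lapply(authors, function(a) c(a, rep("", n - length(a))))

# Publications in rows, authors in columns
res2 <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(res2)
```

Trimming before removing the initials means a space at either end of a name no longer changes the result.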

As Rui's solution works for you, the problem might be something else.
A.K.

________________________________
From: Biau David <djmbiau at yahoo.fr>
To: arun <smartpink111 at yahoo.com> 
Sent: Thursday, January 24, 2013 12:40 AM
Subject: Re: [R] extracting characters from a string

Thanks a lot. It doesn't entirely work yet, probably because of the format of the data I import. I have to look into it and, thanks to your explanation, I should be able to find the problem in the data.

David

>________________________________
> From: arun <smartpink111 at yahoo.com>
>To: Biau David <djmbiau at yahoo.fr> 
>Sent: Wednesday, January 23, 2013 7:06 PM
>Subject: Re: [R] extracting characters from a string
> 
>Hi David,
>
>I forgot about the explanation part.
>dat1<-read.table(text=pub,sep=",",fill=TRUE,stringsAsFactors=F) # Here I converted it to a data frame, splitting on ","; fill=TRUE is needed because the lines have unequal numbers of authors
>as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))),stringsAsFactors=F)
>
>#splitting the code into smaller pieces:
> lapply(dat1,function(x) gsub("^ |\\w+$","",x)) # lapply() applies the function to each column of the data frame and returns a list with one element per column. The pattern (first pair of double quotes) matches a space at the start of the string or the last run of word characters at the end of the string; both are removed because the replacement (second pair of double quotes) is empty.
>$V1
>[1] "Brown "   "Benigni " "Arstra " 
>
>$V2
>[1] "Santos "        ""               "Van den Hoops "
>
>$V3
>[1] "Rome "     ""          "lamarque "
>
>$V4
>[1] "Don Juan " ""          ""         
>lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x))) # I used a second gsub because there are some spaces at the end e.g. "Brown "
>$V1
>[1] "Brown"   "Benigni" "Arstra" 
>
>$V2
>[1] "Santos"        ""              "Van den Hoops"
>
>$V3
>[1] "Rome"     ""         "lamarque"
>
>$V4
>[1] "Don Juan" ""         ""        
>
>do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))) #bind by columns
>     V1        V2              V3         V4        
>[1,] "Brown"   "Santos"        "Rome"     "Don Juan"
>[2,] "Benigni" ""              ""         ""        
>[3,] "Arstra"  "Van den Hoops" "lamarque" ""        
>
>Hope it helps.
>A.K.
>
>----- Original Message -----
>From: Biau David <djmbiau at yahoo.fr>
>To: r help list <r-help at r-project.org>
>Cc: 
>Sent: Wednesday, January 23, 2013 12:38 PM
>Subject: [R] extracting characters from a string
>
>Dear All,
>
>I have a data frame of vectors of publication names such as 'pub':
>
>pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
>pub2 <- c('Benigni D')
>pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')
>
>pub <- rbind(pub1, pub2, pub3)
>
>
>I would like to construct a data frame containing only the authors' last names, with each last name in its own column and each publication in its own row. Basically I want to get rid of the initials (max 2, always before a comma) and the spaces surrounding each last name. I would like to avoid a loop.
>
>PS: If I could have even a short explanation of the code that extracts the values from the character string, that would also be great!
>
> 
>David
>
>    [[alternative HTML version deleted]]
>
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.