[R] Splitting dataframes and cleaning extraneous characters

arun smartpink111 at yahoo.com
Wed Jul 17 19:47:00 CEST 2013


Hi,
One problem with using ?substring() is that the result depends on the number of digits or characters in the prefix.

For example:
substring("-005-190",6)
#[1] "190"
substring("-0057-190",6)
#[1] "-190"

#whereas

gsub("^-[^-]*-","","-0057-190")
#[1] "190"

Probably, your dataset doesn't have that sort of problem.
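
If the prefix length does vary, the same pattern can be applied to the whole column at once, since sub() is vectorised. A minimal sketch; the values in Project_NBR below are made up for illustration (only "-005-190" comes from your post):

```r
# Hypothetical sample values; only "-005-190" appears in the original question
Project_NBR <- c("-005-190", "-0057-190", "-005-2041")

# Strip the leading "-xxx-" prefix, however many characters it holds
cleaned <- sub("^-[^-]*-", "", Project_NBR)
cleaned
#[1] "190"  "190"  "2041"

# Fixed-position extraction only works when the prefix is exactly 5 characters
by_position <- substring(Project_NBR, 6)
by_position
#[1] "190"  "-190" "2041"
```

sub() is enough here because the pattern is anchored with ^ and can match only once per string; gsub() gives the same result.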

dat1<- read.table(text="
project boro
123 m
134 k
123 m
123 m
543 q
543 q
134 k
",sep="",header=TRUE,stringsAsFactors=FALSE)
res<-split(dat1,gsub("\\.","",as.character(interaction(dat1[,2],dat1[,1]))))
res
#$k134
#  project boro
#2     134    k
#7     134    k
#
#$m123
#  project boro
#1     123    m
#3     123    m
#4     123    m
#
#$q543
#  project boro
#5     543    q
#6     543    q
str(res$k134)
#'data.frame':    2 obs. of  2 variables:
# $ project: int  134 134
# $ boro   : chr  "k" "k"
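
The group names can also be built directly with paste0(), which avoids creating the "." separator and then removing it. A minimal sketch on the same toy data; res2 is just an illustrative name, and the list2env() step is optional and assumes you really want separate objects in your workspace:

```r
# Same toy data as above, built without read.table()
dat1 <- data.frame(project = c(123L, 134L, 123L, 123L, 543L, 543L, 134L),
                   boro = c("m", "k", "m", "m", "q", "q", "k"),
                   stringsAsFactors = FALSE)

# Name each group boro followed by project, e.g. "m123"
res2 <- split(dat1, paste0(dat1$boro, dat1$project))
names(res2)
#[1] "k134" "m123" "q543"

# Optional: copy each list element into the workspace as its own data frame
list2env(res2, envir = .GlobalEnv)
nrow(m123)
#[1] 3
```

With about 400 groups it is usually easier to keep them in the list and work on them with lapply() than to manage 400 separate objects.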
A.K.



I was able to split the extraneous stuff using 

a<-substring(Project_NBR, first=6) 

and then cbind to add the edited column to the df. I have a sample but I am not sure how to provide it to you. I will try to produce an example that's similar to what I have:

project	boro 
123	m 
134	k 
123	m 
123 	m 
543	q 
543	q 
134	k 


Basically I am trying to subset the data frame according to project and boro, with the name of each subset being boro-project (e.g. m123, k134).

I hope this provides more clarity to my problem. 


----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc: 
Sent: Wednesday, July 17, 2013 11:06 AM
Subject: Re: Splitting dataframes and cleaning extraneous characters

Hi,
You could try:
?split()
split(ats,ats$Project_NBR)
You also mentioned two columns:

split(ats,list(ats$col1, ats$col2))
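
Note that splitting on a list of two columns creates one group for every combination of their values, including combinations that never occur; drop=TRUE keeps only the ones present in the data. A minimal sketch, with a made-up ats (col1 and col2 are placeholders for your real columns):

```r
# Hypothetical stand-in for the real data
ats <- data.frame(col1 = c("m", "k", "m", "q"),
                  col2 = c(123L, 134L, 123L, 543L),
                  stringsAsFactors = FALSE)

# Every col1 x col2 combination, including empty data frames
all_combos <- split(ats, list(ats$col1, ats$col2))
length(all_combos)
#[1] 9

# Only the combinations that actually occur
used_only <- split(ats, list(ats$col1, ats$col2), drop = TRUE)
length(used_only)
#[1] 3
```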

You should have provided an example dataset using ?dput() ( dput(head(data,10)) ) for testing.
Also,

gsub("^-[^-]*-","","-005-190")
#[1] "190"
A.K.




Problem: I have a large data set and need to separate it based on factors in 2 columns. The final output would be a collection of dataframes renamed to the corresponding factor levels.

So far I know that for each corresponding factor I can execute 

x190<-ats[which(Project_NBR=='-005-190'),] 

However, there are about 400 factors needing to be separated. Also, I would like to remove the "-005-". Any guidance will be greatly appreciated.


