[R] query on converting survey data from one structure to another

Jonathan Baron baron at psych.upenn.edu
Wed Aug 20 13:03:31 CEST 2003


On 08/20/03 07:13, Vumani Dlamini wrote:
>Dear R users,
>
>I am trying to convert a dataset from one format to several rectangular 
>datasets. A consultant helped design the data entry program for our survey 
>using Delphi/Pascal and for each household the information is stored in a 
>file called "EA-HM-HH.TXT" where EA is the enumeration area number, HM is 
>the homestead number and HH is the household number. Within this file the 
>data is stored as follows,
>
>
>###### file="5677-001-001.TXT" ######
>EAnumber=5677
>HMnumber=001
>HHnumber=001
># Demographics section
>Dserial=01   #first person in household
>Dage=56
>Dsex=1
>Dserial=02   #second person in household
>Dage=44
>Dsex=2
>Dserial=03   #second person in household
>Dage=7
>Dsex=2
># Agricultural inputs section
>Amaize=200
>Apumpkins=50
>###### end of file ########
>
>Note that in the demograpics section there may only be less than 3 or more 
>people in some households. I would like to create a file for the 
>demographics section which is as follows
>
>EAnumber|HMnumber|HHnumber|Dserial|Dage|Dsex
>5677 001 001 056 1
>5677 001 001 044 2
>5677 001 001 007 2
>
>and for the agricultural inputs I would like to have,
>
>EAnumber|HMnumber|HHnumber|Amaize|Apumpkins
>5677 001 001 0200
>5677 001 001 0050
>
>There are several similar files where the EA number, HM number or HH number 
>changes, thus I would also like to know whether it is possible to create a 
>script where for all household there is only one dataset for demographics, 
>and one for agricultural inputs.

The string handling functions in R should be able to handle this
about as easly as could the string functions in (say) Perl, but
I'm not going to do it for you.  It would take me at least an
hour, probably 2 hours.  If I were doing it - and I may not be
doing it the best way - I would first put all the little files
together into one big file f1 with the Unix command cat.  Then I
would read in f1 in a way that gives me a vector of strings, with
one line corresponding to each element of the vector:

d1 <- as.vector(as.matrix(read.table(f1,sep="\n",quote="")))

is how I've done this before, but I'm not sure this is all
necessary.

Then I would start a file d2 for output and write one line with
the names of the variables:

write.table("EAnumber HMnumber HHnumber Dserial Dage Dsex",file="d2"),
            quote=F,row.names=F,col.names=F,sep="")

(I'm not sure this command will work.  I'd try it first by itself.)

Then I would write a loop to process d1 line by line, with
sub-loops to deal with the family members in one of the original
files.  Here is part of a similar loop I wrote for another
purpose.  The variable "onoff" indicated whether I was still
within one of my original records (like one of your little
files).  You would probably want two such variables, one to
indicate the family, one for each of its members.

d2 <- {} # set up a vector for the output, undetermined length
for (i in 1:length(d1)) {
  s1 <- substr(d1[i],1,4)
  if (s1=="age=") {onoff <- TRUE; j <- j+1; d2[j] <- ""}
  if (onoff==1 & s1!="aqua" & s1!="comm" & s1!="apay")
    {d2[j] <- paste(d2[j],strsplit(d1[i],"=")[[1]][2],collapse=" ")}
  if (s1==f2) {onoff <- FALSE}
}

The following command is to replace plus signs with spaces.  It
just shows you how to use gsub().

d2 <- gsub("\\+"," ",d2)

Finally, I would append the vector d2 to the file I had written:

write.table(d2,file="d2",append=T,quote=F,row.names=F,col.names=F,sep="")

I have not tested any of this recently.  You will have to build
it up step by step and test it at each point.  But perhaps this
will get you started.  Also look at the functions grep() and
regexpr(), which might come in hand.

-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page:            http://www.sas.upenn.edu/~baron
R page:               http://finzi.psych.upenn.edu/




More information about the R-help mailing list