[R] How would I analyse data like this?

laurent.duperval at microcell.ca
Wed Mar 19 18:40:20 CET 2003


On 19 Mar, james.holtman at convergys.com wrote:
> Have you tried:
>       data <- read.table("data.dat", header=TRUE, sep="|", as.is=TRUE)
> 

Yes I did. However, it takes a LOT more time because of the date/time
string. The result looks like this:


str(data)
`data.frame':	317437 obs. of  8 variables:
 $ phone   : num  1.52e+10 1.42e+10 1.82e+10 1.65e+10 1.65e+10 ...
 $ state   : int  3 3 3 3 3 3 3 3 3 3 ...
 $ code    : int  983 983 983 983 3000 983 983 983 983 5203 ...
 $ amount  : int  1000 1000 2500 2500 2500 1000 1000 2500 2500 2500 ...
 $ left    : int  260 0 0 25 0 1260 273 0 0 0 ...
 $ channel : Factor w/ 5 levels "CSR","IN","IVR",..: 2 5 4 2 3 2 2 3 4 3 ...
 $ time    : Factor w/ 312198 levels "2002-10-16 ..",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ mtd     : Factor w/ 2 levels "C","D": 1 1 1 1 1 1 1 1 1 1 ...

I think the 312198 factor levels for the time column are wrong. Also, the
phone column is really a string, not a number. I didn't see how to specify
that with read.table(). (In my original post, I think I forgot to mention
that I had over 300,000 entries in my file.)
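
For what it's worth, one way to control the column types is the colClasses
argument of read.table(). The sketch below is only an illustration: the two
sample rows are made up, and the time format "YYYY-MM-DD HH:MM:SS" is an
assumption, since the real values are truncated in the str() output above.

```r
# Hypothetical two-row sample in the same pipe-separated layout.
# Assumption: timestamps look like "YYYY-MM-DD HH:MM:SS".
txt <- "phone|state|code|amount|left|channel|time|mtd
15145550101|3|983|1000|260|IN|2002-10-16 08:15:00|C
14165550102|3|983|1000|0|IVR|2002-10-16 08:15:02|C"

# Read phone and time as plain character strings; columns not named in
# colClasses are still type-guessed as usual.
data <- read.table(text = txt, header = TRUE, sep = "|",
                   colClasses = c(phone = "character", time = "character"))

# Convert the time strings to a real date/time class in one vectorised call.
data$time <- as.POSIXct(data$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

str(data)
```

With a real file, text = txt would be replaced by the file name, e.g.
"data.dat".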

> change your 'br' range to:
>       br=c(0,50,100,150,200,250,300,350,400,450,500,1e10)
> to make sure that you include everything in the last range.

I tried that, but the result is a graph that is too wide. It treats the range
as numerical values instead of bins. Well, to me, anyway. If it's acceptable
policy, I can post a screenshot of the result here (about 25K). Everything is
bunched up on the left, but the right portion is much larger and contains
nothing.
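
A possible way around the overly wide last bar is to bin the values with
cut() and plot the counts per bin, so that every bin gets the same width
regardless of its numeric range. This is a sketch only: it assumes the values
being binned are the "left" column (the first ten values from the str()
output are used), and uses Inf rather than 1e10 as the open upper limit.

```r
# First ten values of the "left" column, taken from the str() output.
# Assumption: this is the column being binned.
left <- c(260, 0, 0, 25, 0, 1260, 273, 0, 0, 0)

# Breaks every 50 up to 500, then one open-ended bin for everything above.
br <- c(seq(0, 500, by = 50), Inf)

# right = FALSE gives intervals [0,50), [50,100), ..., [500,Inf),
# so a value of exactly 0 lands in the first bin.
bins <- cut(left, breaks = br, right = FALSE)

# One count per bin; empty bins are kept with a count of 0.
counts <- table(bins)
print(counts)

# Each bin becomes one bar of equal width (only drawn interactively here).
if (interactive()) barplot(counts, las = 2)
```

Because the bins are treated as categories rather than positions on a numeric
axis, the last bin no longer stretches the plot.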

>> - How do I count the number of times channel "IN" occurs with code = 983?
>>   How about if I want to combine IN and code=983 or 982 or 981?
> sum(data$channel == "IN" & data$code %in% c(983, 982, 981))

Thanks, I'll try that.
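
For the record, a small self-contained sketch of this kind of count, using a
made-up data frame (the rows are hypothetical). The comparison has to use the
elementwise operator &, because && is meant for single logical values and
does not work row by row.

```r
# Hypothetical sample rows; only the column names match the real data.
d <- data.frame(channel = c("IN", "IVR", "IN", "CSR", "IN"),
                code    = c(983, 983, 981, 982, 3000))

# Rows where channel is "IN" and code is exactly 983.
n_in_983 <- sum(d$channel == "IN" & d$code == 983)

# Rows where channel is "IN" and code is any of 981, 982, 983.
n_in_98x <- sum(d$channel == "IN" & d$code %in% c(981, 982, 983))

print(c(n_in_983, n_in_98x))
```

sum() works here because TRUE counts as 1 and FALSE as 0.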

>>
>> - Finally (for today at least) how do I count the number of times code=983
>>   and date=2003-03-16 (without the time) occur. I'm hoping this will also
>>   help me build histograms for days of the week and for hours of the day.
> You need to split off the date from that column with:
> 
> # get just the date
> data$date <- unlist(lapply(strsplit(data$time, " "), function(x) x[1]))
> # compute all the counts at once into a matrix
> counts <- table(list(data$date, data$code))
> 

Ok, I'll try all this.
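
A rough sketch of how the date, hour and weekday could be pulled out of the
timestamps in one go, assuming again that they look like
"YYYY-MM-DD HH:MM:SS" (the sample rows below are made up):

```r
# Hypothetical sample rows; the timestamp format is an assumption.
d <- data.frame(time = c("2003-03-16 08:15:00", "2003-03-16 17:40:10",
                         "2003-03-17 08:05:30", "2003-03-16 08:59:59"),
                code = c(983, 983, 982, 5203),
                stringsAsFactors = FALSE)

# Parse once, then derive the pieces from the parsed value.
stamp  <- as.POSIXct(d$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
d$date <- format(stamp, "%Y-%m-%d")   # date without the time
d$hour <- format(stamp, "%H")         # hour of day as "00".."23"
d$wday <- as.POSIXlt(stamp)$wday      # day of week, 0 = Sunday

# Number of rows with code 983 on 2003-03-16.
n <- sum(d$code == 983 & d$date == "2003-03-16")

# Counts per hour and per weekday, usable as input for bar plots.
by_hour <- table(d$hour)
by_wday <- table(d$wday)

print(n)
print(by_hour)
print(by_wday)
```

The tables by_hour and by_wday can be passed to barplot() to get the
per-hour and per-weekday pictures.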

While I was writing this message, a few more answers came in. Let me try
all those before I reply to them.


Thanks to all,

L

-- 
Laurent Duperval <laurent.duperval at microcell.ca>

"I'm not going to do my maths homework. Look at these unsolved problems. Here's a number in mortal combat with another. One of them is going to get subtracted. But why? What will be left of him? If I answered these, it would kill the suspense. It would resolve the conflict and turn intriguing possibilities into boring old facts."
"I never really thought about the literary possibilities of maths."
"I prefer to savour the mystery."
                                           -Calvin & Hobbes


