[R] Conditional subsetting

Gabor Grothendieck ggrothendieck at gmail.com
Wed Aug 8 07:37:43 CEST 2007


Another way to do this is by using the SQLite select statements making
use of the sqldf package.

First let us assume that the alphabetic order of the Quality levels
reflects their desirability as well.  In the example data this is true
but if its not true in the whole data set where there may be additional
levels of Quality then it would need to be recoded first.

The first select statement sorts the data and the second one just takes
the last row in each group.  (It seems
that when using the group by clause that it takes the last row of each group
but I could not actually find this documented so beware.)  Note
that Time and Date names are changed by the underlying software
which is why we must refer to them using Time__1 and Date__1.

Lines <- "DeployID Date.Time LocationQuality Latitude Longitude

STM05-1 28/02/2005 17:35 Good -35.562 177.158
STM05-1 28/02/2005 19:44 Good -35.487 177.129
STM05-1 28/02/2005 23:01 Unknown -35.399 177.064
STM05-1 01/03/2005 07:28 Unknown -34.978 177.268
STM05-1 01/03/2005 18:06 Poor -34.799 177.027
STM05-1 01/03/2005 18:47 Poor -34.85 177.059
STM05-2 28/02/2005 12:49 Good -35.928 177.328
STM05-2 28/02/2005 21:23 Poor -35.926 177.314
"

# in next line replace textConnection(Lines) with "myfile.dat"
library(sqldf)
DF <- read.table(textConnection(Lines), skip = 1,  as.is = TRUE,
  col.names = c("Id", "Date", "Time", "Quality", "Lat", "Long"))

DFo <- sqldf("select * from DF order by
  substr(Date__1, 7, 4) || substr(Date__1, 4, 2) || substr(Date__1, 1, 2) DESC,
  Quality DESC,
  abs(substr(Time__1, 1, 2) + substr(Time__1, 4, 2) /60 - 12) DESC")
sqldf("select * from DFo group by Date__1")

####
# Here is a second different way to do it.
# Another way to do it also using sqldf is via nested selects like this using
# the same DF as above

sqldf("select * from DF u
 where abs(substr(Time__1, 1, 2) + substr(Time__1, 4, 2) /60 - 12) =
   (select min(abs(substr(Time__1, 1, 2) + substr(Time__1, 4, 2) /60 - 12))
     from DF x where Quality =
       (select min(Quality) from DF y
          where x.Date__1 = y.Date__1) and x.Date__1 = u.Date__1)")


On 8/7/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Read the data in and transform the Date, Time and Quality columns
> so that Date and Time are the respective chron classes (see RNews 4/1 Help
> Desk article) and Quality is a factor such that lower levels are better.
>
> Then sort the data frame by Date, Quality and time from noon.  That would
> actually be the default ordering if the shown factor levels are the only ones
> since they naturally fall into that order but we have explicitly ordered them
> in case your extension to other levels no longer obeys that property.
>
> Finally take the first row from each Date group.
>
> Lines <- "DeployID Date.Time LocationQuality Latitude Longitude
> STM05-1 28/02/2005 17:35 Good -35.562 177.158
> STM05-1 28/02/2005 19:44 Good -35.487 177.129
> STM05-1 28/02/2005 23:01 Unknown -35.399 177.064
> STM05-1 01/03/2005 07:28 Unknown -34.978 177.268
> STM05-1 01/03/2005 18:06 Poor -34.799 177.027
> STM05-1 01/03/2005 18:47 Poor -34.85 177.059
> STM05-2 28/02/2005 12:49 Good -35.928 177.328
> STM05-2 28/02/2005 21:23 Poor -35.926 177.314
> "
>
> library(chron)
> # in next line replace textConnection(Lines) with "myfile.dat"
> DF <- read.table(textConnection(Lines), skip = 1,  as.is = TRUE,
>  col.names = c("Id", "Date", "Time", "Quality", "Lat", "Long"))
>
> DF <- transform(DF,
>        Date = chron(Date, format = "d/m/y"),
>        Time = times(paste(Time, "00", sep = ":")),
>        Quality = factor(Quality, levels = c("Good", "Poor", "Unknown")))
>
> o <- order(DF$Date, as.numeric(DF$Quality), abs(DF$Time - times("12:00:00")))
> DF <- DF[o,]
>
> DF[tapply(row.names(DF), DF$Date, head, 1), ]
>
> # The last line above could alternately be written like this:
> do.call("rbind", by(DF, DF$Date, head, 1))
>
>
>
> On 8/7/07, Tim Sippel <tsip002 at ec.auckland.ac.nz> wrote:
> > Thanks Ross.  Yes, upon your remark it seems that maybe I need to be
> > selecting first for the best available location, and then by the time stamp
> > nearest to 12pm.  It seems like tapply(), or aggregate() or aggregate.ts()
> > might be applicable, but I'm working through trying different functions and
> > have yet to find what works.  Suggestions on what function to start with
> > would be appreciated.
> >
> >
> >
> > Tim Sippel (MSc)
> >
> > School of Biological Sciences
> >
> > Auckland University
> >
> > Private Bag 92019
> >
> > Auckland 1020
> >
> > New Zealand
> >
> > +64-9-373-7599 ext. 84589 (work)
> >
> > +64-9-373-7668 (Fax)
> >
> > +64-21-593-001 (mobile)
> >
> >
> >
> >  _____
> >
> > From: Ross Darnell [mailto:r.darnell at uq.edu.au]
> > Sent: 08 August 2007 12:43
> > To: Tim Sippel; r-help at stat.math.ethz.ch
> > Subject: RE: [R] Conditional subsetting
> >
> >
> >
> >
> >
> > Assuming *A* animal can only be in one location at any one time, I don't
> > understand how once you have selected the location nearest 12pm how you can
> > select based on the type of location?
> > Do you mean to select on location before timei.e. from the best location
> > visited that day which was closest to 12pm?
> >
> > Perhaps for condition 2  for a animal which.min(time-1200) (replace 1200
> > with properly defined timestamp representing 12:00pm)
> >
> > Ross Darnell
> >
> >
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch on behalf of Tim Sippel
> > Sent: Wed 08-Aug-07 9:37 AM
> > To: r-help at stat.math.ethz.ch
> > Subject: [R] Conditional subsetting
> >
> > Hello-
> >
> > Upon searching the email archives and reading documentation I haven't found
> > what I'm looking for.  I have a dataset with a time/date stamp on a series
> > of geographic locations (lat/lon) associated with the movements of animals.
> > On some days there are numerous locations and some days there is only one
> > location.  Associated with each location is an indication that the quality
> > of the location is either "good", "poor", or "unknown".  For each animal, I
> > want to extract only one location for each day.  And I want the location
> > extracted to be nearest to 12pm on any given day, and highest quality
> > possible.
> >
> > So the order of priority for my extraction will be: 1. only one location per
> > animal each day; 2. the location selected be as close to 12pm as possible;
> > 3. the selected location comes from the best locations available (ie. take
> > from pool of "good" locations first, or select from "poor" locations if a
> > "good" location isn't available, or "unknown" if nothing else is available).
> >
> >
> >
> >
> > I think aspect of this task that has me particularly stumped is how to
> > select only one location for each day, and for that location to be a close
> > to 12pm as possible.
> >
> >
> >
> > An example of my dataset follows:
> >
> >
> >
> >
> >
> > DeployID
> >
> >
> >
> > Date.Time
> >
> >
> >
> > LocationQuality
> >
> > Latitude
> >
> > Longitude
> >
> >
> > STM05-1
> >
> >
> >
> > 28/02/2005 17:35
> >
> >
> >
> > Good
> >
> > -35.562
> >
> > 177.158
> >
> >
> > STM05-1
> >
> >
> >
> > 28/02/2005 19:44
> >
> >
> >
> > Good
> >
> > -35.487
> >
> > 177.129
> >
> >
> > STM05-1
> >
> >
> >
> > 28/02/2005 23:01
> >
> >
> >
> > Unknown
> >
> > -35.399
> >
> > 177.064
> >
> >
> > STM05-1
> >
> >
> >
> > 01/03/2005 07:28
> >
> >
> >
> > Unknown
> >
> > -34.978
> >
> > 177.268
> >
> >
> > STM05-1
> >
> >
> >
> > 01/03/2005 18:06
> >
> >
> >
> > Poor
> >
> > -34.799
> >
> > 177.027
> >
> >
> > STM05-1
> >
> >
> >
> > 01/03/2005 18:47
> >
> >
> >
> > Poor
> >
> > -34.85
> >
> > 177.059
> >
> >
> > STM05-2
> >
> >
> >
> > 28/02/2005 12:49
> >
> >
> >
> > Good
> >
> > -35.928
> >
> > 177.328
> >
> >
> > STM05-2
> >
> >
> >
> > 28/02/2005 21:23
> >
> >
> >
> > Poor
> >
> > -35.926
> >
> > 177.314
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Many thanks for your input.  I'm using R 2.5.1 on Windows XP.
> >
> >
> >
> > Cheers,
> >
> >
> >
> > Tim Sippel (MSc)
> >
> > School of Biological Sciences
> >
> > Auckland University
> >
> > Private Bag 92019
> >
> > Auckland 1020
> >
> > New Zealand
> >
> > +64-9-373-7599 ext. 84589 (work)
> >
> > +64-9-373-7668 (Fax)
> >
> > +64-21-593-001 (mobile)
> >
> >
> >
> >
> >        [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >        [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>



More information about the R-help mailing list