[R] unique/subset problem

lalitha viswanath lalithaviswanath at yahoo.com
Fri Jan 26 21:43:13 CET 2007


Hi
I read in my dataset using
dt <- read.table("filename")
Calling unique(levels(dt$genome1)) yields the
following:

 [1] "aero"      "aful"      "aquae"     "atum_D"   
 [5] "bbur"      "bhal"      "bmel"      "bsub"     
 [9] "buch"      "cace"      "ccre"      "cglu"     
[13] "cjej"      "cper"      "cpneuA"    "cpneuC"   
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"     
[21] "hinf"      "hpyl"      "linn"      "llact"    
[25] "lmon"      "mgen"      "mjan"      "mlep"     
[29] "mlot"      "mpneu"     "mpul"      "mthe"     
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"  
[37] "paer"      "paero"     "pmul"      "pyro"     
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
[45] "sent"      "smel"      "spneu"     "spyo"     
[49] "ssol"      "stok"      "styp"      "synecho"  
[53] "tacid"     "tmar"      "tpal"      "tvol"     
[57] "uure"      "vcho"      "xfas"      "ypes"     

It shows 60 genomes, which is correct.

I extracted a subset as follows:
possible_relatives_subset <- subset(dt, Y < -5)
The results are pasted below:
     genome1   genome2 parameterX          Y
21       sent ecoliO157  0.00590 -200.633493
22       sent      paer  0.18603 -100.200570
27       styp ecoliO157  0.00484 -240.708645
28       styp      paer  0.18497 -30.250127
41       paer      sent  0.18603 -60.200570
44       paer      styp  0.18497 -80.250127
49       paer      hinf  0.18913 -90.056333
53       paer      vcho  0.18703 -10.153929
55       paer      pmul  0.18587 -100.208042
67       paer      buch  0.21485  -80.898667
70       paer      ypes  0.18460 -107.267454
82       paer      xfas  0.26268  -61.920552
95       hinf ecoliO157  0.07654 -163.018417
96       hinf      paer  0.18913 -10.056333
103      vcho ecoliO157  0.09518 -140.921153
104      vcho      paer  0.18703 -10.153929
107      pmul ecoliO157  0.07328 -165.215225
108      pmul      paer  0.18587 -10.208042
131      buch ecoliO157  0.15412 -11.746939
132      buch      paer  0.21485  -8.898667
137      ypes ecoliO157  0.02705 -19.171851
138      ypes      paer  0.18460 -10.267454
171 ecoliO157      sent  0.00590 -20.633493
174 ecoliO157      styp  0.00484 -20.708645
179 ecoliO157      hinf  0.07654 -6.018417
183 ecoliO157      vcho  0.09518 -14.921153
185 ecoliO157      pmul  0.07328 -6.215225
197 ecoliO157      buch  0.15412 -11.746939
200 ecoliO157      ypes  0.02705 -9.171851
211 ecoliO157      xfas  0.25833  -71.091552
217      xfas ecoliO157  0.25833  -75.091552
218      xfas      paer  0.26268  -64.920552

Even a cursory look shows that the subset results
contain far fewer unique genomes (around 8 to 10).
However, when I call
unique(levels(possible_relatives_subset$genome1)), I
still get all 60:

 [1] "aero"      "aful"      "aquae"     "atum_D"   
 [5] "bbur"      "bhal"      "bmel"      "bsub"     
 [9] "buch"      "cace"      "ccre"      "cglu"     
[13] "cjej"      "cper"      "cpneuA"    "cpneuC"   
[17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"     
[21] "hinf"      "hpyl"      "linn"      "llact"    
[25] "lmon"      "mgen"      "mjan"      "mlep"     
[29] "mlot"      "mpneu"     "mpul"      "mthe"     
[33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"  
[37] "paer"      "paero"     "pmul"      "pyro"     
[41] "rcon"      "rpxx"      "saur_mu50" "saur_n315"
[45] "sent"      "smel"      "spneu"     "spyo"     
[49] "ssol"      "stok"      "styp"      "synecho"  
[53] "tacid"     "tmar"      "tpal"      "tvol"     
[57] "uure"      "vcho"      "xfas"      "ypes"     

Where am I going wrong?
I also tried calling unique without levels(), which
gives me the following response:

[1] sent      styp      paer      hinf      vcho     
[6] pmul      buch      ypes      ecoliO157 xfas     
60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
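
For illustration, the two results above can be reproduced on a small
made-up data frame. This is only a sketch under stated assumptions: the
Y values are invented, only five of the genome names are used, and
genome1 is assumed to be stored as a factor (which the "60 Levels" line
above suggests). levels() returns the factor's full level set, which
subset() does not shrink, whereas rebuilding the column with factor()
keeps only the levels that still occur in the remaining rows.

```r
# Hypothetical toy data standing in for the real file; genome names are
# borrowed from the listing above, the Y values are made up.
dt <- data.frame(
  genome1 = factor(c("aero", "sent", "styp", "paer", "bsub")),
  Y       = c(-1, -200.6, -240.7, -100.2, -0.5)
)

possible_relatives_subset <- subset(dt, Y < -5)

# levels() reports the factor's full level set, which subset() leaves
# untouched, so all five original names come back.
all_levels <- levels(possible_relatives_subset$genome1)

# Rebuilding the factor keeps only the levels still present in the rows.
present <- levels(factor(possible_relatives_subset$genome1))

# The same count, taken from the values rather than from the level set.
n_present <- length(unique(as.character(possible_relatives_subset$genome1)))

print(all_levels)
print(present)
print(n_present)
```

In newer versions of R, droplevels() applied to the subset should give
the same effect for every factor column at once.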

--- Weiwei Shi <helprhelp at gmail.com> wrote:

> Then you need to provide more details about the
> calls you made and your dataset.
> For example, you can show us the output of
> str(prunedrelatives, 1)
> 
> How did you call unique on prunedrelatives, and so on?
> I made a test data set and it gave me what you wanted
> (omitted here).
> 
> On 1/26/07, lalitha viswanath
> <lalithaviswanath at yahoo.com> wrote:
> > Hi
> > The pruned dataset has 8 unique genomes in it, while
> > the dataset before pruning has 65 unique genomes in it.
> > However, calling unique on the pruned dataset seems to
> > return 65 no matter what.
> >
> > Any assistance in this matter would be appreciated.
> >
> > Thanks
> > Lalitha
> > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Even if you removed many genome1 rows by setting
> > > score < -5, that does not necessarily change the set
> > > of unique values.
> > >
> > > To check this, you can do something like
> > > p0 <- unique(dataset[dataset$score < -5, "genome1"])
> > > # same as subset
> > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > >
> > > setdiff(p1, p0)
> > >
> > > If the output above is empty, then every genome in the
> > > removed rows also appears in the rows you kept, so
> > > removing those rows does not change the set of unique
> > > genomes.
> > >
> > > HTH,
> > >
> > > weiwei
> > >
> > >
> > >
> > > On 1/25/07, lalitha viswanath
> > > <lalithaviswanath at yahoo.com> wrote:
> > > > Hi
> > > > I am new to R programming and am using subset to
> > > > extract part of a data set as follows:
> > > >
> > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > prunedrelatives <- subset(dataset, score < -5);
> > > >
> > > > However, when I use unique to find the number of unique
> > > > genomes now present in prunedrelatives, I get results
> > > > identical to calling unique(dataset$genome1), although
> > > > subset has eliminated many genomes and records.
> > > >
> > > > I would greatly appreciate your input about using
> > > > "unique" correctly in this regard.
> > > >
> > > > Thanks
> > > > Lalitha
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > >
> > >
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > > Research Scientist
> > > GeneGO, Inc.
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> >
> >
> >
> >
> >
> >
> 
> 
> 



 


