[R] unique/subset problem

Weiwei Shi helprhelp at gmail.com
Fri Jan 26 21:51:27 CET 2007


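Check
?read.table

and add as.is=TRUE to the call. Strings are then read as character
instead of factor, so the subset no longer carries the full set of
factor levels (levels() reports every level of the original column,
whether or not it still occurs in the data).

Then repeat your work.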

For example
> x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10)
> str(x0,1)
`data.frame':	10 obs. of  7 variables:
 $ V1: Factor w/ 10 levels "-4086733916",..: 10 9 8 7 6 5 4 3 2 1
 $ V2: Factor w/ 10 levels "-1963744741",..: 10 8 7 4 5 6 3 9 1 2
 $ V3: Factor w/ 7 levels "-1687428658",..: 7 4 4 2 5 1 6 6 3 4
 $ V4: Factor w/ 2 levels "5","MECHANISM": 2 1 1 1 1 1 1 1 1 1
 $ V5: Factor w/ 2 levels "0","TYPE": 2 1 1 1 1 1 1 1 1 1
 $ V6: Factor w/ 2 levels "USER_","alexey": 1 2 2 2 2 2 2 2 2 2
 $ V7: Factor w/ 2 levels "3","TRUST": 2 1 1 1 1 1 1 1 1 1
> x0 <- read.table("~/Documents/tox/noodles/four_sheets_orig/reg_r2.txt", sep="\t", nrows=10, as.is=T)
> str(x0,1)
`data.frame':	10 obs. of  7 variables:
 $ V1: chr  "LINK_ID" "-4293537751" "-4247422653" "-4223137153" ...
 $ V2: chr  "ID1" "65259" "1020286" "-518245428" ...
 $ V3: chr  "ID2" "6436" "6436" "-2099509019" ...
 $ V4: chr  "MECHANISM" "5" "5" "5" ...
 $ V5: chr  "TYPE" "0" "0" "0" ...
 $ V6: chr  "USER_" "alexey" "alexey" "alexey" ...
 $ V7: chr  "TRUST" "3" "3" "3" ...
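
The same effect can be seen on a small made-up table. This is only a sketch: the four rows below are hypothetical and are not taken from the poster's file, and stringsAsFactors = TRUE is spelled out so that the factor case behaves the same on R versions where it is no longer the default.

```r
# Hypothetical miniature of the data; column names follow the thread.
txt <- "genome1 genome2 Y
sent ecoliO157 -200.6
paer sent -60.2
aero aful -1.0
bsub bhal -2.5"

# Read strings as factors: the levels survive subsetting.
dt_f  <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
sub_f <- subset(dt_f, Y < -5)
levels(sub_f$genome1)    # still lists all four original levels
unique(sub_f$genome1)    # only the values present, but carries all levels

# Read strings as character: there are no levels to carry over.
dt_c  <- read.table(text = txt, header = TRUE, as.is = TRUE)
sub_c <- subset(dt_c, Y < -5)
unique(sub_c$genome1)    # just the genomes left in the subset
```

With the character version, length(unique(sub_c$genome1)) counts only the genomes that remain after subsetting.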

HTH,

weiwei

On 1/26/07, lalitha viswanath <lalithaviswanath at yahoo.com> wrote:
> Hi
> I read in my dataset using
> dt <- read.table("filename")
> Calling unique(levels(dt$genome1)) yields the
> following:
>
>  [1] "aero"      "aful"      "aquae"     "atum_D"    "bbur"      "bhal"      "bmel"      "bsub"
>  [9] "buch"      "cace"      "ccre"      "cglu"      "cjej"      "cper"      "cpneuA"    "cpneuC"
> [17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"      "hinf"      "hpyl"      "linn"      "llact"
> [25] "lmon"      "mgen"      "mjan"      "mlep"      "mlot"      "mpneu"     "mpul"      "mthe"
> [33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"   "paer"      "paero"     "pmul"      "pyro"
> [41] "rcon"      "rpxx"      "saur_mu50" "saur_n315" "sent"      "smel"      "spneu"     "spyo"
> [49] "ssol"      "stok"      "styp"      "synecho"   "tacid"     "tmar"      "tpal"      "tvol"
> [57] "uure"      "vcho"      "xfas"      "ypes"
>
> It shows 60 genomes, which is correct.
>
> I extracted a subset as follows
> possible_relatives_subset <- subset(dt, Y < -5)
> I am pasting the results below
>      genome1   genome2 parameterX          Y
> 21       sent ecoliO157  0.00590 -200.633493
> 22       sent      paer  0.18603 -100.200570
> 27       styp ecoliO157  0.00484 -240.708645
> 28       styp      paer  0.18497 -30.250127
> 41       paer      sent  0.18603 -60.200570
> 44       paer      styp  0.18497 -80.250127
> 49       paer      hinf  0.18913 -90.056333
> 53       paer      vcho  0.18703 -10.153929
> 55       paer      pmul  0.18587 -100.208042
> 67       paer      buch  0.21485  -80.898667
> 70       paer      ypes  0.18460 -107.267454
> 82       paer      xfas  0.26268  -61.920552
> 95       hinf ecoliO157  0.07654 -163.018417
> 96       hinf      paer  0.18913 -10.056333
> 103      vcho ecoliO157  0.09518 -140.921153
> 104      vcho      paer  0.18703 -10.153929
> 107      pmul ecoliO157  0.07328 -165.215225
> 108      pmul      paer  0.18587 -10.208042
> 131      buch ecoliO157  0.15412 -11.746939
> 132      buch      paer  0.21485  -8.898667
> 137      ypes ecoliO157  0.02705 -19.171851
> 138      ypes      paer  0.18460 -10.267454
> 171 ecoliO157      sent  0.00590 -20.633493
> 174 ecoliO157      styp  0.00484 -20.708645
> 179 ecoliO157      hinf  0.07654 -6.018417
> 183 ecoliO157      vcho  0.09518 -14.921153
> 185 ecoliO157      pmul  0.07328 -6.215225
> 197 ecoliO157      buch  0.15412 -11.746939
> 200 ecoliO157      ypes  0.02705 -9.171851
> 211 ecoliO157      xfas  0.25833  -71.091552
> 217      xfas ecoliO157  0.25833  -75.091552
> 218      xfas      paer  0.26268  -64.920552
>
> I think  even a cursory look will tell us that there
> are not as many unique genomes in the subset results.
> (around 8/10).
> However when I do
> unique(levels(possible_relatives_subset$genome1)), I
> get
>
>  [1] "aero"      "aful"      "aquae"     "atum_D"    "bbur"      "bhal"      "bmel"      "bsub"
>  [9] "buch"      "cace"      "ccre"      "cglu"      "cjej"      "cper"      "cpneuA"    "cpneuC"
> [17] "cpneuJ"    "ctraM"     "ecoliO157" "hbsp"      "hinf"      "hpyl"      "linn"      "llact"
> [25] "lmon"      "mgen"      "mjan"      "mlep"      "mlot"      "mpneu"     "mpul"      "mthe"
> [33] "mtub"      "mtub_cdc"  "nost"      "pabyssi"   "paer"      "paero"     "pmul"      "pyro"
> [41] "rcon"      "rpxx"      "saur_mu50" "saur_n315" "sent"      "smel"      "spneu"     "spyo"
> [49] "ssol"      "stok"      "styp"      "synecho"   "tacid"     "tmar"      "tpal"      "tvol"
> [57] "uure"      "vcho"      "xfas"      "ypes"
>
> Where am I going wrong?
> I tried calling unique without the levels too, which
> gives me the following response
>
>  [1] sent      styp      paer      hinf      vcho      pmul      buch      ypes      ecoliO157 xfas
> 60 Levels: aero aful aquae atum_D bbur bhal bmel bsub buch cace ccre cglu cjej cper cpneuA ... ypes
>
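
If the data have already been read in as factors, the unused levels can also be dropped after subsetting. A rough sketch on a made-up factor follows (the values are hypothetical; droplevels() assumes R 2.12.0 or later, while factor() and as.character() work on older versions too):

```r
# Hypothetical factor standing in for dt$genome1.
g    <- factor(c("sent", "paer", "aero", "bsub"))
kept <- g[c(1, 2)]             # keep two values, as a subset would

nlevels(kept)                  # still 4: levels are not dropped by subsetting
levels(factor(kept))           # re-creating the factor drops unused levels
levels(droplevels(kept))       # same result via droplevels()
unique(as.character(kept))     # or ignore the levels and work with strings
```

Calling unique() on the values rather than on levels(), as in the last output quoted above, already gives the genomes present; the "60 Levels" line is only the attribute that the factor keeps.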
> --- Weiwei Shi <helprhelp at gmail.com> wrote:
>
> > Then you need to provide more details about the
> > calls you made and your dataset.
> > For example, you can show us the output of
> > str(prunedrelatives, 1)
> >
> > How did you call unique on prunedrelatives, and so on?
> > I made a test dataset and it gave me what you wanted
> > (omitted here).
> >
> > On 1/26/07, lalitha viswanath
> > <lalithaviswanath at yahoo.com> wrote:
> > > Hi
> > > The pruned dataset has 8 unique genomes in it, while
> > > the dataset before pruning has 65 unique genomes in it.
> > > However, calling unique on the pruned dataset seems to
> > > return 65 no matter what.
> > >
> > > Any assistance in this matter would be appreciated.
> > >
> > > Thanks
> > > Lalitha
> > > --- Weiwei Shi <helprhelp at gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Even if you removed "many" genome1 rows by requiring score < -5,
> > > > that does not necessarily change the set of unique genomes.
> > > >
> > > > To check this, you can do something like
> > > > p0 <- unique(dataset[dataset$score < -5, "genome1"])   # same as subset
> > > > p1 <- unique(dataset[dataset$score >= -5, "genome1"])
> > > >
> > > > setdiff(p1, p0)
> > > >
> > > > If the output above is empty, it means that although you removed
> > > > many rows, every genome1 value in the removed rows still occurs
> > > > in the subset, so the uniqueness does not change.
> > > >
> > > > HTH,
> > > >
> > > > weiwei
> > > >
> > > >
> > > >
> > > > On 1/25/07, lalitha viswanath
> > > > <lalithaviswanath at yahoo.com> wrote:
> > > > > Hi
> > > > > I am new to R programming and am using subset to
> > > > > extract part of a dataset as follows:
> > > > >
> > > > > names(dataset) = c("genome1","genome2","dist","score");
> > > > > prunedrelatives <- subset(dataset, score < -5);
> > > > >
> > > > > However, when I use unique to find the number of unique
> > > > > genomes now present in prunedrelatives, I get results
> > > > > identical to calling unique(dataset$genome1), although
> > > > > subset has eliminated many genomes and records.
> > > > >
> > > > > I would greatly appreciate your input about using
> > > > > "unique" correctly in this regard.
> > > > >
> > > > > Thanks
> > > > > Lalitha
> > > >
> > > > --
> > > > Weiwei Shi, Ph.D
> > > > Research Scientist
> > > > GeneGO, Inc.
> > > >
> > > > "Did you always know?"
> > > > "No, I did not. But I believed..."
> > > > ---Matrix III


-- 
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III


