[R] How to do the same thing for all levels of a column?

Tue Jul 24 18:17:18 CEST 2012

The OP's request is a bit ambiguous to me: at a given residue, do you
wish to calculate the proportions for only those amino acids that
appear at that residue, or do you wish to include the proportions for
all amino acids, some of which might then be 0?

Assuming the former, then I don't think one needs to go to the lengths
described by John below.

Using your example (thanks!), the following seems to suffice:

> sapply(myfile[, -c(1, 2)], function(x) prop.table(table(x)))

$X1
x
   L    R    T
0.50 0.25 0.25

$X2
x
   E    M
0.75 0.25

$X3
x
   N    Y
0.25 0.75

$X4
x
   I    L    Q
0.25 0.50 0.25

$X5
x
   I    V
0.75 0.25

$X6
x
   P    S
0.75 0.25

$X7
x
   D    E    G
0.25 0.50 0.25

$X8
x
   A    C
0.75 0.25

This could, of course, then be modified to add zero proportions for
all non-appearing amino acids.
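
A minimal sketch of that modification, assuming the residues are drawn
from the 20 standard one-letter amino acid codes (the vector aa and the
rebuilt data frame are illustrative, not part of the original post).
Re-levelling each column against the full alphabet makes table() report
a count of 0 for residues that never occur at a position:

```r
# Sample data re-typed from the dput() output quoted below
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"),
  X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"),
  X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"),
  X8 = c("A", "A", "C", "A")
)

# Assumed alphabet: the 20 standard amino acids, one-letter codes
aa <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Every column now yields a table of the same length, so sapply()
# simplifies the result to a matrix: one row per amino acid,
# one column per position
props <- sapply(myfile[, -c(1, 2)],
                function(x) prop.table(table(factor(x, levels = aa))))
```

Because all columns share the same levels, props can be indexed directly,
e.g. props["L", "X1"], and each column still sums to 1.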

-- Cheers,
Bert
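
P.S. The sapply() call above counts every protein equally, whereas the
script in the original post weights each residue by its Time_zero value.
If the weighted ratio is what is wanted, something along these lines may
do; it is only a sketch in base R, and the name weighted and the
two-column data frame are illustrative:

```r
# First two residue positions of the posted example, plus the weights
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E")
)

# For each residue column, sum Time_zero within each amino acid and
# divide by the overall Time_zero total
weighted <- lapply(myfile[, -c(1, 2)], function(x)
  tapply(myfile$Time_zero, x, sum) / sum(myfile$Time_zero))
```

Here weighted$X1["L"] corresponds to ratio_L in the original script, and
the ratios within each position sum to 1.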

On Tue, Jul 24, 2012 at 8:18 AM, John Kane <jrkrideau at inbox.com> wrote:
>
>    I think this does what you want using two packages, plyr and reshape2, that
>    you may have to install.  If so, install.packages(c("plyr", "reshape2"))
>    should do the trick.
>    library(plyr)
>    library(reshape2)
>    # using supplied file 'myfile' from below
>    time0total = sum(myfile[,2])
>    mydata  <-  myfile[, 2:10]
>    md1  <-  melt(mydata, id = "Time_zero")
>    ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
>
>
>    John Kane
>    Kingston ON Canada
>
>    -----Original Message-----
>    From: zj29 at cornell.edu
>    Sent: Tue, 24 Jul 2012 10:25:21 -0400
>    To: jrkrideau at inbox.com
>    Subject: Re: [R] How to do the same thing for all levels of a column?
>
>    Hi John,
>    Thank you for the tips. My apologies about the unreadable sample data...
>    So here is the output of the sample data, and hopefully it works this time
>    :)
>    myfile <- structure(list(Proteins = structure(1:4, .Label = c("p1", "p2",
>    "p3", "p4"), class = "factor"), Time_zero = c(0.0050723, 0.0002731,
>    9.76e-05, 0.0002077), X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L",
>    "R", "T"), class = "factor"), X2 = structure(c(1L, 1L, 2L, 1L
>    ), .Label = c("E", "M"), class = "factor"), X3 = structure(c(2L,
>    1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"), X4 = structure(c(1L,
>    2L, 3L, 2L), .Label = c("I", "L", "Q"), class = "factor"), X5 = structure(c(1L,
>    2L, 1L, 1L), .Label = c("I", "V"), class = "factor"), X6 = structure(c(1L,
>    1L, 1L, 2L), .Label = c("P", "S"), class = "factor"), X7 = structure(c(1L,
>    3L, 2L, 2L), .Label = c("D", "E", "G"), class = "factor"), X8 = structure(c(1L,
>    1L, 2L, 1L), .Label = c("A", "C"), class = "factor")), .Names = c("Proteins",
>    "Time_zero", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"), row.names = c(NA,
>    4L), class = "data.frame")
>    And here is my original question:
>    Basically, I have a bunch of protein sequences composed of different amino
>    acid residues, and each residue is represented by an uppercase letter. I
>    want  to  calculate the ratio of different amino acid residues at each
>    position of the proteins.
>
>    If  I  name  this table as myfile.txt, I have the following scripts to
>    calculate the ratio of each amino acid residue at position 1:
>
>    # showing levels of the 3rd column, which means the types of residues
>
>    >myfile[,3]
>
>
>    # calculating the ratio of L
>
>    >list=c(which(myfile[,3]=="L"))
>
>    >time0total=sum(myfile[,2])
>
>    >AA_L=0
>
>    >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>
>    >ratio_L=AA_L/time0total
>
>
>    So how can I write a script to do the same thing for the other two levels (T
>    and R) in column 3, and also do this for every column that contains amino
>    acid residues?
>
>    Thanks a lot!
>
>    Regards,
>
>    Zhao
>    2012/7/24 John Kane <[1]jrkrideau at inbox.com>
>
>      First thing is to supply the data in a usable format.  As is, it is
>      essentially unreadable.  All R-beginners do this. :)
>      Have a look at the dput function  (?dput) for a good way to supply sample
>      data in an email.
>      If you have a large dataset probably a few dozen lines of data would be
>      fine.
>      Something like dput(head(mydata)) should be fine.  Just copy and paste the
>      output into your email.
>      Welcome to R.  I think you will like it.
>      John Kane
>      Kingston ON Canada
>
>    > -----Original Message-----
>    > From: [2]zj29 at cornell.edu
>    > Sent: Mon, 23 Jul 2012 18:01:11 -0400
>    > To: [3]r-help at r-project.org
>    > Subject: [R] How to do the same thing for all levels of a column?
>    >
>    > Dear all,
>    >
>    >
>    >
>    > I am an R beginner, and I am looking for a way to do the same thing for
>    > all levels of a column in a table.
>    >
>    >
>    >
>    > Basically, I have a bunch of protein sequences composed of different
>    > amino
>    > acid residues, and each residue is represented by an uppercase letter. I
>    > want to calculate the ratio of different amino acid residues at each
>    > position of the proteins. Here is an example table:
>    >
>    > Proteins  Time_zero  1  2  3  4  5  6  7  8
>    > p1        0.0050723  L  E  Y  I  I  P  D  A
>    > p2        0.0002731  T  E  N  L  V  P  G  A
>    > p3        9.757E-05  L  M  Y  Q  I  P  E  C
>    > p4        0.0002077  R  E  Y  L  I  S  E  A
>    >
>    >
>    >
>    > If I name this table as myfile.txt, I have the following scripts to
>    > calculate the ratio of each amino acid residue at position 1:
>    >
>    > # showing levels of the 3rd column, which means the types of residues
>    >
>    > >myfile[,3]
>    >
>    >
>    >
>    > # calculating the ratio of L
>    >
>    > >list=c(which(myfile[,3]=="L"))
>    >
>    > >time0total=sum(myfile[,2])
>    >
>    > >AA_L=0
>    >
>    > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>    >
>    > >ratio_L=AA_L/time0total
>    >
>    >
>    >
>    > So how can I write a script to do the same thing for the other two levels
>    > (T and R) in column 3, and also do this for every column that contains
>    > amino acid residues?
>    >
>    >
>    >
>    > Many thanks for any help you could give me on this topic! :)
>    >
>    >
>    >
>    > Regards,
>    >
>    > Zhao
>    > --
>    > Zhao JIN
>    > Ph.D. Candidate
>    > Ruth Ley Lab
>    > 467 Biotech
>    > Field of Microbiology, Cornell University
>    > Lab: 607.255.4954
>    > Cell: 412.889.3675
>    >
>
>      >       [[alternative HTML version deleted]]
>      >
>      > ______________________________________________
>      > [4]R-help at r-project.org mailing list
>      > [5]https://stat.ethz.ch/mailman/listinfo/r-help
>      > PLEASE do read the posting guide
>      > [6]http://www.R-project.org/posting-guide.html
>      > and provide commented, minimal, self-contained, reproducible code.
>
>    --
>    Zhao JIN
>    Ph.D. Candidate
>    Ruth Ley Lab
>    467 Biotech
>    Field of Microbiology, Cornell University
>    Lab: 607.255.4954
>    Cell: 412.889.3675
>
> References
>
>    1. mailto:jrkrideau at inbox.com
>    2. mailto:zj29 at cornell.edu
>    3. mailto:r-help at r-project.org
>    4. mailto:R-help at r-project.org
>    5. https://stat.ethz.ch/mailman/listinfo/r-help
>    6. http://www.R-project.org/posting-guide.html
>    7. http://www.inbox.com/marineaquarium
>    8. http://www.inbox.com/earth
>    9. http://www.inbox.com/earth

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm