[R] Calculating symbol (letter) frequencies

Mon Jan 3 18:01:31 CET 2005

Hello:

I am attempting to use R to analyze amino acid frequencies in aligned
protein sequences and need some help. So far, I have imported my sequence
alignment into a data frame (let's call it "alignment") with each site in one
column, so that I have a data frame consisting of columns of letters (the 21
amino acid symbols plus "-") with row names being the corresponding protein
names. summary(alignment) gives me the counts of symbols in each column.

Now I would like to convert these counts into frequencies and perform
calculations on these frequencies. Do I need to place the individual
elements of summary(alignment) in a separate data frame to perform
calculations on them? What I've been thinking of doing is creating a data
frame of symbol frequencies with each column corresponding to a column in
the sequence alignment. If it makes sense to do this, how do I extract these
data into a data frame so that I can perform further analyses on these
frequencies?

I've tried DF1 <- data.frame(alignment, row.names=AA), where AA is a
character vector of amino acid symbols plus "-", but the error message tells
me that the "row names supplied are of the wrong length". Since not all of
the symbols are present in each column of "alignment", this makes some sense
to me, as each summary(alignment[[i]]) varies in length. Also, I would need
to match up the individual symbol entries in each summary(alignment[[i]])
with the corresponding row in the new data frame (which I believe can be
done efficiently with indexing, but I can't put my finger on an appropriate
example of how to do this). I have looked at the package Biostrings on the
Bioconductor site, but it doesn't appear to work with amino acid sequence
alignments.

So my questions to the R-help community are: Can I do various statistical
calculations on and using summary(alignment[[i]]), or do I need a separate
data frame? If I should be using a separate data frame for symbol
frequencies, how do I extract these from the data? Should I try to extract
this from summary, or is there a more efficient way to calculate symbol
frequencies?
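
For concreteness, the kind of frequency table described above might be
sketched as follows. This is only an illustration under assumptions, not a
tested solution for real alignments: the toy "alignment" data frame, its
site and protein names, and the contents of AA (the 20 standard amino acid
letters plus "-") are made up here. The idea is that converting each column
to a factor with a fixed set of levels makes table() report zero counts too,
so every column yields a count vector of the same length and in the same
order.

```r
# Toy alignment (made-up data): rows are proteins, columns are sites.
alignment <- data.frame(
  site1 = c("A", "A", "G", "-"),
  site2 = c("L", "L", "L", "L"),
  site3 = c("K", "R", "K", "-"),
  row.names = c("prot1", "prot2", "prot3", "prot4"),
  stringsAsFactors = FALSE
)

# Assumed symbol set: 20 standard amino acid letters plus the gap symbol.
# Any symbol missing from AA becomes NA and is dropped from the counts,
# so extend this vector to cover every symbol used in the alignment.
AA <- c(strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]], "-")

# Count each symbol in each column. Fixed factor levels keep the zero
# counts, so the result is a length(AA) x ncol(alignment) matrix.
counts <- sapply(alignment, function(site) table(factor(site, levels = AA)))

# Convert counts to relative frequencies within each column.
freqs <- prop.table(counts, margin = 2)

# Optional: a data frame with symbols as row names, one column per site.
freq_df <- as.data.frame(freqs)

freqs["A", "site1"]   # 2 of the 4 sequences have "A" at site1: 0.5
```

Because the rows of freqs are always in the order of AA, entries can be
looked up by symbol name (for example freqs["K", ]) rather than by matching
up the varying output of summary() for each column.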

Thanks,
Kurt Wollenberg, PhD
Tufts Center for Vision Research 
New England Medical Center
750 Washington St, Box 450 
Boston, MA, USA
kwollenberg at tufts-nemc.org 
617-636-8945 (Fax)
617-636-9028 (Lab)

The most exciting phrase to hear in science, the one that heralds new
discoveries, is not "Eureka!" (I found it!) but  "That's funny ..." 
--Isaac Asimov
