[R] Calculating symbol (letter) frequencies

Mon Jan 3 19:14:16 CET 2005

My best advice:

Yours is a complicated question, but it is probably the wrong question. I
suspect that you are trying to reinvent the wheel: I believe there has been a
lot of work in the statistical community on sequence alignment (OK, maybe for
DNA rather than amino acids), and undoubtedly much more has been published in
the biological/bioinformatics literature that I'm unaware of. I think you
should collaborate with a statistician at your institution to help you access
that literature and its methods, which are almost certainly implementable and
probably already implemented in R, perhaps even in Bioconductor (despite your
lack of success there thus far).

My far poorer advice:

See ?lapply and relatives, as well as ?table and the links therein. Also
?factor may be of interest.

lapply(alignment,function(x)table(x)/length(x)) 

will return a list with one component per site (column), each component
being a table of the relative frequencies of the letters at that site. You can
then process these results further as you like. Is this the sort of thing
that you want?
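
A minimal sketch of the above on made-up data, in case it helps. The toy
"alignment", its site and protein names, and the shortened symbol set are
all assumptions for illustration, not your real data. The second half uses
?factor with a fixed set of levels so that every site yields a table of the
same length, which lets sapply() return a symbols-by-sites matrix instead
of a ragged list.

```r
# Toy alignment: 4 sequences (rows) by 3 sites (columns); hypothetical data.
alignment <- data.frame(
  site1 = c("A", "A", "G", "-"),
  site2 = c("L", "L", "L", "L"),
  site3 = c("K", "R", "K", "-"),
  row.names = c("prot1", "prot2", "prot3", "prot4"),
  stringsAsFactors = FALSE
)

# One frequency table per site. The tables differ in length because
# each site contains a different set of symbols.
freq_list <- lapply(alignment, function(x) table(x) / length(x))

# Shortened alphabet for the example; replace with the full symbol set.
symbols <- c("A", "G", "K", "L", "R", "-")

# Giving every column the same factor levels makes all tables the same
# length (absent symbols get frequency 0), so sapply() can simplify the
# result to a matrix with one row per symbol and one column per site.
freq_mat <- sapply(alignment, function(x) {
  table(factor(x, levels = symbols)) / length(x)
})

freq_mat["A", "site1"]  # frequency of "A" at the first site
colSums(freq_mat)       # each column should sum to 1
```

If a data frame is preferred over a matrix, as.data.frame(freq_mat) should
convert it while keeping the symbols as row names.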

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Wollenberg, Kurt R
> Sent: Monday, January 03, 2005 9:02 AM
> To: 'r-help at stat.math.ethz.ch'
> Subject: [R] Calculating symbol (letter) frequencies
> 
> Hello:
> 
> I am attempting to use R to analyze amino acid frequencies in aligned
> protein sequences and need some help. So far, I have imported my
> sequence alignment into a data frame (let's call it "alignment") with
> each site in one column, so that I have a data frame consisting of
> columns of letters (the 21 amino acid symbols plus "-") with row names
> being the corresponding protein names. summary(alignment) gives me the
> counts of symbols in each column. Now I would like to convert these
> counts into frequencies and perform calculations on these frequencies.
> Do I need to place the individual elements in summary(alignment) in a
> separate data frame to perform calculations on them? What I've been
> thinking of doing is creating a data frame of symbol frequencies with
> each column corresponding to a column in the sequence alignment. If it
> makes sense to do this, how do I extract these data into a data frame
> so that I can perform further analyses on these frequencies? I've tried
> DF1 <- data.frame(alignment, row.names=AA), where AA is a character
> vector of amino acid symbols plus "-", but the error message tells me
> that the "row names supplied are of the wrong length". As not all of
> the symbols are present in each column of "alignment" this makes some
> sense to me, as each summary(alignment[[i]]) varies in length. Also, I
> would need to match up the individual symbol entries in each
> summary(alignment[[i]]) with the corresponding row in the new data
> frame (which I believe can be done efficiently with indexing, but I
> can't put my finger on an appropriate example of how to do this). I
> have looked at the package Biostrings on the Bioconductor site but it
> doesn't appear to work with amino acid sequence alignments. So my
> questions to the R-help community are: Can I do various statistical
> calculations on and using summary(alignment[[i]]) or do I need a
> separate data frame? If I should be using a separate data frame for
> symbol frequencies, how do I extract these from the data? Should I try
> to extract this from summary or is there a more efficient way to
> calculate symbol frequencies?
> 
> Thanks,
> Kurt Wollenberg, PhD
> Tufts Center for Vision Research 
> New England Medical Center
> 750 Washington St, Box 450 
> Boston, MA, USA
> kwollenberg at tufts-nemc.org 
> 617-636-8945 (Fax)
> 617-636-9028 (Lab)
> 
> The most exciting phrase to hear in science, the one that heralds new
> discoveries, is not "Eureka!" (I found it!) but  "That's funny ..." 
> --Isaac Asimov
> 