[R] Random assignment

Fri Oct 15 17:59:12 CEST 2010

Hi Michael,

Thanks very much for the feedback and taking time to look at the paper.

I am looking to do something similar to the paper (I am working at Kew Gardens, where the red listing has taken place, so I have all the red list data available for the monocots, my group).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-1.pdf
Type: application/pdf
Size: 93611 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101015/fa6d3e77/attachment.pdf>
-------------- next part --------------

They do indeed differentiate between high-risk (red listed) and low-risk species within a family (black line on graph above), but they also do some simple simulation experiments using binomial expectation to see if risk is random with respect to taxonomy (red line on graph above). At least that is how I have interpreted it (page 1049, second column, halfway down). I am happy with generating the black line and playing around with p-values, but am unsure how to simulate the "red" line.
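
For illustration, the binomial randomization described in the paper's method (10,000 randomizations, each species with a 7.48% chance of being at risk) could be sketched in R as below. This is only a minimal sketch, not a reconstruction of the red line in the graph: the data frame `families`, its column names and the family sizes are hypothetical, apart from the 80-species Bromeliaceae figure used elsewhere in this thread.

```r
set.seed(42)                      # fixed seed so the sketch is repeatable

p_risk <- 0.0748                  # per-species chance of being at risk
n_sims <- 10000                   # number of randomizations

# Hypothetical family sizes; only the 80-species figure comes from this thread
families <- data.frame(
  family   = c("FamilyA", "Bromeliaceae", "FamilyC"),
  nspecies = c(1, 80, 150)
)

# One column per family, one row per randomization
sims <- sapply(families$nspecies,
               function(n) rbinom(n_sims, size = n, prob = p_risk))
colnames(sims) <- families$family

# Simulated summaries next to the analytic binomial mean and variance
summary_tab <- data.frame(
  family        = families$family,
  sim_mean      = colMeans(sims),
  analytic_mean = families$nspecies * p_risk,
  sim_var       = apply(sims, 2, var),
  analytic_var  = families$nspecies * p_risk * (1 - p_risk)
)
print(summary_tab)
```

With this many randomizations the simulated mean and variance should sit close to n * p and n * p * (1 - p), so the simulation mainly adds the full distribution of counts per family (the columns of `sims`) rather than a different expectation.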

John

On 15 Oct 2010, at 14:18, Michael Bedward wrote:

Hello again John,

I was going to suggest that you just use qbinom to generate the
expected number of extinctions. For example, for the family with 80
spp the central 95% expectation is:

qbinom(c(0.025, 0.975), 80, 0.0748)

which gives 2 - 11 spp.

If you wanted to look across a large number of families you'd need
to deal with multiple comparison error but as a quick first look it
might be helpful.
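
As a rough sketch of that quick first look across several families, the qbinom call above can be applied to a whole column of family sizes, with an upper-tail binomial probability and a multiple-comparison adjustment added. The data frame, its column names, the observed counts and the choice of the Holm method are all assumptions made for illustration; only p = 0.0748 and the 80-species family come from this thread.

```r
p_risk <- 0.0748

# Hypothetical data: species counts and observed at-risk counts per family
families <- data.frame(
  family   = c("FamilyA", "FamilyB", "FamilyC"),
  nspecies = c(20, 80, 150),
  observed = c(1, 15, 12)
)

# Central 95% range of the number at risk expected under the binomial null
families$lower <- qbinom(0.025, families$nspecies, p_risk)
families$upper <- qbinom(0.975, families$nspecies, p_risk)

# Upper-tail probability P(X >= observed); pbinom with lower.tail = FALSE
# returns P(X > q), hence the "observed - 1"
families$p_upper <- pbinom(families$observed - 1, families$nspecies, p_risk,
                           lower.tail = FALSE)

# Holm adjustment, one possible way to handle testing many families at once
families$p_adj <- p.adjust(families$p_upper, method = "holm")

print(families)
```

Families whose observed count falls above `upper`, or whose adjusted p-value is small, would be candidates for having disproportionately many at-risk species.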

However, I've just got a copy of the paper and it seems that the
authors are calculating something different to a simple binomial
expectation: they are differentiating between high-risk (red listed)
and low-risk species within a family. They state that this equation
(expressed here in R-ese)...

choose(N, R) * p^R * b^(N - R)

...gives the probability of an entire family becoming extinct, where
N is number of spp in family; R is number of those that are red
listed; p is extinction probability for red list spp (presumably over
some period but I haven't read the paper properly yet); b is
extinction probability for other spp.

Then, in their simulations they hold b constant but play around with a
range of values for p.
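
To make that concrete, the expression above can be wrapped in a small function and evaluated over a range of p with b held fixed. This is only a sketch of the expression as quoted in this message; the function name `family_prob` and every numeric value below are hypothetical and are not taken from the paper.

```r
# The expression quoted above, wrapped as a function
family_prob <- function(N, R, p, b) {
  choose(N, R) * p^R * b^(N - R)
}

N <- 10                               # hypothetical number of species in the family
R <- 3                                # hypothetical number of red-listed species
b <- 0.05                             # hypothetical extinction probability, other spp
p_values <- seq(0.1, 0.9, by = 0.2)   # hypothetical range of values for p

# b is held constant while p varies
result <- data.frame(p    = p_values,
                     prob = family_prob(N, R, p_values, b))
print(result)
```

Because p only enters through p^R, the value rises steadily with p when N, R and b are fixed.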

So this sounds a bit different to what you originally posted as your
objective (?)

Michael

On 15 October 2010 22:49, Michael Bedward <michael.bedward at gmail.com> wrote:
> Hi John,
> 
> The word "species" attracted my attention :)
> 
> Like Dennis, I'm not sure I understand your idea properly. In
> particular, I don't see what you need the simulation for.
> 
> If family F has Fn species, your random expectation is that p * Fn of
> them will be at risk (p = 0.0748). The variance on that expectation
> will be p * (1-p) * Fn.
> 
> If you do your simulation that's the result you'll get.  Perhaps to
> initially identify families with disproportionate observed extinction
> rates all you need is the dbinom function?
> 
> Michael
> 
> 
> On 15 October 2010 22:29, John Haart <another83 at me.com> wrote:
>> Hi Dennis and list,
>>
>> Thanks for this, and sorry for not providing enough information.
>>
>> First let me put the study into a bit more context:
>>
>> I know the number of species at risk in each family. What I am asking is: "Is risk random according to family, or do certain families have a disproportionate number of at-risk species?"
>>
>> My idea was to randomly allocate risk to the families based on the criteria below (binomial(nspecies, 0.0748)) and then compare this to the "true data" and see if there was a significant difference.
>>
>> So in answer to your questions (assuming my method is correct!):
>> 
>>> Is this over all families, or within a particular family? If the former, why
>>> does a distinction of family matter?
>> 
>> Within a particular family. This is because I am looking to see if risk in the "observed" data set is random with respect to family, so this will provide the baseline to compare against.
>>
>>> I guess you've stated the p, but what's the n? The number of species in each
>>> family?
>>
>> This varies widely: for instance, I have some families that are monotypic (with 1 species) and other families with 100+ species.
>>
>>
>>> Assuming you have multiple families, do you want separate simulations per
>>> family, or do you want to do some sort of weighting (perhaps proportional to
>>> size) over all families?
>>
>> I am assuming I want some sort of weighting. This is because I want to calculate the number of species expected to be at risk in EACH family under the random binomial distribution (assuming every species has a 7.48% chance of being at risk).
>> 
>> Thanks
>> 
>> John
>> 
>> 
>> 
>> 
>> On 15 Oct 2010, at 11:19, Dennis Murphy wrote:
>> 
>> Hi:
>> 
>> I don't believe you've provided quite enough information just yet...
>> 
>> On Fri, Oct 15, 2010 at 2:22 AM, John Haart <another83 at me.com> wrote:
>> 
>>> Dear List,
>>> 
>>> I am doing some simulation in R and need basic help!
>>> 
>>> I have a list of animal families for which i know the number of species in
>>> each family.
>>> 
>>> I am working under the assumption that a species has a 7.48% chance of
>>> being at risk.
>>> 
>> 
>> Is this over all families, or within a particular family? If the former, why
>> does a distinction of family matter?
>> 
>>> 
>>> I want to simulate the number of species expected to be at risk under a
>>> random binomial distribution with 10,000 randomizations.
>>> 
>> 
>> I guess you've stated the p, but what's the n? The number of species in each
>> family? If you're simulating on a family by family basis, then it would seem
>> that a binomial(nspecies, 0.0748) distribution would be the reference.
>> Assuming you have multiple families, do you want separate simulations per
>> family, or do you want to do some sort of weighting (perhaps proportional to
>> size) over all families? The latter is doable, but it would require a
>> two-stage simulation: one to randomly select a family and then to randomly
>> select a species.
>> 
>> Dennis
>> 
>> 
>>> 
>>> I am relatively new to this field and would greatly appreciate an
>>> "idiot-proof" response, i.e. how should the data be entered into R? I was
>>> thinking of using read.table, header = T, where the table has F = Family
>>> Name, and SP = Number of species in that family?
>>> 
>>> John
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
> 

On 15 October 2010 23:34, Michael Bedward <michael.bedward at gmail.com> wrote:
> Hi John,
> 
> I haven't read that particular paper but in answer to your question...
> 
>> So if I do this for all the families it will be the same as doing the simulation experiment
>> outlined in the method above?
> 
> Yes :)
> 
> Michael
> 
> 
> On 15 October 2010 23:18, John Haart <another83 at me.com> wrote:
>> Hi Michael,
>> 
>> Thanks for this. The reason I am following this approach is that it appeared in a paper I was reading, and I thought it was an interesting angle to take.
>> 
>> The paper is
>> 
>> Vamosi & Wilson, 2008. Nonrandom extinction leads to elevated loss of angiosperm evolutionary history. Ecology Letters, (2008) 11: 1047-1053.
>> 
>> and the specific method i am following states :-
>> 
>>> We calculated the number of species expected to be at risk in each family under a random binomial distribution in 10 000 randomizations [generated using R version 2.6.0 (R Development Team 2007)] assuming every species has a 7.48% chance of being at risk.
>> 
>> I guess the reason I am doing the simulation is because I am not hugely statistically minded and the paper was asking the same question I am interested in answering :).
>> 
>> So following your approach -
>> 
>>> if family F has Fn species, your random expectation is that p * Fn of
>>> them will be at risk (p = 0.0748). The variance on that expectation
>>> will be p * (1-p) * Fn.
>> 
>> 
>> Family F = Bromeliaceae, with Fn = 80, p = 0.0748
>> random expectation = p * Fn = (0.0748 * 80) = 5.984
>> variance = p * (1-p) * Fn = (0.0748 * 0.9252) * 80 = 5.5363968
>>
>> So the random expectation is that the Bromeliaceae will have 6 species at risk, if risk is assigned randomly?
>>
>> So if I do this for all the families it will be the same as doing the simulation experiment outlined in the method above?
>> 
>> Thanks
>> 
>> John