[R] father and son heights

Gabor Grothendieck ggrothendieck at myway.com
Sun Feb 15 20:57:10 CET 2004

According to:


there are actually two father/son height datasets.  One was
collected by Galton.  Apparently Pearson used that data but 
also collected and used a second dataset together with Alice Lee 
in roughly the same time frame.

Date:   Sun, 15 Feb 2004 15:30:43 -0400 (AST) 
From:   Rolf Turner <rolf at math.unb.ca>
To:   <loraine at loraine.net> 
Cc:   <r-help at stat.math.ethz.ch> 
Subject:   Re: [R] father and son heights 

Ann Loraine wrote:

> I'm looking for Pearson's father and son height data.
> ........... It's a data set that is used to teach Pearson's 
> correlation coefficient in a popular statistics textbook - "Statistics" 
> by Freedman, Pisani, et al.
> It contains over a thousand measurements of sons' and their fathers' 
> heights.
> I would like to find it in electronic form so that I can use it to 
> prepare figures and examples for a lecture.
> If anyone knows where I could find it, please let me know. I've done a 
> few Google searches but haven't had any luck so far. I also used the 
> data() command to look through R's built-in data sets and couldn't find 
> it. Any suggestions would be most welcome!

I believe that you have been searching under the wrong name. The
data are most closely associated with Galton (the bloke to whom the
word ``regression'' is due) rather than with Pearson.

A search on

     Galton height

led me immediately to


where the data appear to be readily available.

I ***presume*** that these are the data you seek, although there are
only 930 observations, not ``over a thousand''. (Close, but!)

The data are given to a limited accuracy, which induces a strangely
grid-like appearance when they are plotted, but that is presumably
the nature of this data set. They were apparently taken from a table
prepared by Galton. Values which were originally given in Galton's
table as ``>= 73.7'' or ``<= 61.7'' are truncated to their respective
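
The grid-like appearance described above can be reproduced with
made-up numbers. What follows is only a rough sketch: the heights are
simulated (they are NOT Galton's data), and the mean, spread, slope,
and the rounding to whole inches are all assumptions chosen for
illustration. Only the 61.7 and 73.7 bounds come from the description
above.

```python
import random

LOW, HIGH = 61.7, 73.7  # open-ended table categories mentioned above


def coarsen(height):
    """Clamp to [LOW, HIGH], then round to the nearest whole inch (assumed binning)."""
    clamped = min(max(height, LOW), HIGH)
    return round(clamped)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 930

# Synthetic heights in inches; the parameters are arbitrary assumptions.
parents = [rng.gauss(68.0, 2.0) for _ in range(n)]
children = [68.0 + 0.65 * (p - 68.0) + rng.gauss(0.0, 2.0) for p in parents]

raw_points = set(zip(parents, children))
grid_points = {(coarsen(p), coarsen(c)) for p, c in zip(parents, children)}

print(len(raw_points), "distinct raw points")
print(len(grid_points), "distinct points after coarsening")
```

With 13 possible coarsened values on each axis there can be at most
169 distinct points, so many observations land on top of one another.
When plotting such data, adding a small random jitter to each value is
a common way to make the overlapping points visible.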

One thing that puzzles me: The documentation says that the data
pertain to 928 children, yet there are 930 data points. (????)
I can't find an explanation in the documentation. Maybe I'm just
blind. Or thick.


                         Rolf Turner
                         rolf at math.unb.ca
