[R] Referencing variable names rather than column numbers

John-Paul Ferguson ferguson_john-paul at gsb.stanford.edu
Sat Dec 5 18:16:31 CET 2009


Holy Cats, those were four quick responses! And the question,
basically, is answered:

1. When in doubt, try quoting column names where you would try using
unquoted column indexes.
2. subset() seems, overall, the most flexible analog to Stata's
variable-referencing syntax.
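
For the archives, here is a minimal sketch of both points. It uses the
built-in iris data rather than pollute so that it runs without the
download, and the object names (num, by_index and so on) are only for
illustration.

```r
# The built-in iris data stands in for pollute, which is not bundled with R.
num <- iris[, 1:4]  # numeric columns only; Species is a factor

# 1. Quoted column names in place of numeric indexes.
by_index <- cor(num[, c(1, 3)])
by_name  <- cor(num[, c("Sepal.Length", "Petal.Length")])

# 2. subset() accepts unquoted names, including name ranges, in 'select'.
by_subset <- cor(subset(iris, select = Sepal.Length:Petal.Length))
by_range  <- cor(iris[, 1:3])

# Each pair should produce the same correlation matrix.
stopifnot(identical(by_index, by_name),
          identical(by_subset, by_range))
```

With pollute, the same pattern would presumably read
cor(pollute[, c("Temp", "Population", "Rain")]) and
cor(subset(pollute, select = Pollution:Industry)).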

I appreciate the help. I'm encouraging several of my PhD students to
pick up R, given the research that they are doing, but it seems wrong
to make them do that without learning it myself. Humbling to be back
at this level of basic interface interaction, but very good to know
that a resource like this list exists.

Best,
John-Paul

2009/12/5 Marc Schwartz <marc_schwartz at me.com>:
> Alternatively, you can use subset(), which supports the ":" operator
> for the 'select' argument:
>
>  > cor(subset(iris, select = Sepal.Length:Petal.Length))
>              Sepal.Length Sepal.Width Petal.Length
> Sepal.Length    1.0000000  -0.1175698    0.8717538
> Sepal.Width    -0.1175698   1.0000000   -0.4284401
> Petal.Length    0.8717538  -0.4284401    1.0000000
>
>
> which is equivalent to:
>
>  > cor(iris[, 1:3])
>              Sepal.Length Sepal.Width Petal.Length
> Sepal.Length    1.0000000  -0.1175698    0.8717538
> Sepal.Width    -0.1175698   1.0000000   -0.4284401
> Petal.Length    0.8717538  -0.4284401    1.0000000
>
>
> So for the pollute data:
>
>   cor(subset(pollute, select = Pollution:Industry))
>
> should work.
>
> Note also that the 'select' argument to subset can take non-contiguous
> column names:
>
> # Skip 'Sepal.Width'
>  > cor(subset(iris, select = c(Sepal.Length, Petal.Length:Petal.Width)))
>              Sepal.Length Petal.Length Petal.Width
> Sepal.Length    1.0000000    0.8717538   0.8179411
> Petal.Length    0.8717538    1.0000000   0.9628654
> Petal.Width     0.8179411    0.9628654   1.0000000
>
> So you have the option of specifying, by name, multiple series of
> contiguous and non-contiguous column names.
>
> See ?subset
>
> HTH,
>
> Marc Schwartz
>
>
> On Dec 5, 2009, at 10:43 AM, Ista Zahn wrote:
>
>> As baptiste noted, you can do
>>
>> cor(pollute[ ,c("Pollution","Temp","Industry")]).
>>
>> But
>>
>> cor(pollute[,"Pollution":"Industry"])
>>
>> will not work. For that you can do
>>
>> cor(pollute[, which(names(pollute) == "Pollution"):which(names(pollute) == "Industry")])
>>
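
A possibly tidier way to write that lookup, offered as a sketch rather
than something proposed in this thread: match() returns the positions of
both names in one call. The helper name col_range is hypothetical, and
iris again stands in for pollute.

```r
# col_range() is a hypothetical helper, not a base R function.
# It returns the columns of df from the one named 'from' to the one named 'to'.
col_range <- function(df, from, to) {
  pos <- match(c(from, to), names(df))   # positions of the two names
  if (anyNA(pos)) stop("column name not found")
  df[, pos[1]:pos[2], drop = FALSE]
}

r <- cor(col_range(iris, "Sepal.Length", "Petal.Length"))
```

For the data discussed here, the call would be
cor(col_range(pollute, "Pollution", "Industry")).
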
>> -Ista
>>
>> On Sat, Dec 5, 2009 at 11:22 AM, John-Paul Ferguson
>> <ferguson_john-paul at gsb.stanford.edu> wrote:
>>> I apologize for how basic a question this is. I am a Stata user who
>>> has begun using R, and the syntax differences still trip me up. The
>>> most basic questions, involving as they do general terms, can be the
>>> hardest to find solutions for through search.
>>>
>>> Assume for the moment that I have a dataset that contains seven
>>> variables: Pollution, Temp, Industry, Population, Wind, Rain and
>>> Wet.days. (This actual dataset is taken from Michael Crawley's
>>> "Statistics: An Introduction Using R" and is available as
>>> "pollute.txt" in
>>> http://www.bio.ic.ac.uk/research/crawley/statistics/data/zipped.zip.)
>>> Assume I have attached pollute. Then
>>>
>>> cor(pollute)
>>>
>>> will give me the correlation table for these seven variables. If I
>>> would prefer only to see the correlations between, say, Pollution,
>>> Temp and Industry, I can get that with
>>>
>>> cor(pollute[,1:3])
>>>
>>> or with
>>>
>>> cor(pollute[1:3])
>>>
>>> Similarly, I can see the correlations between Temp, Population and
>>> Rain with
>>>
>>> cor(pollute[,c(2,4,6)])
>>>
>>> or with
>>>
>>> cor(pollute[c(2,4,6)])
>>>
>>> This is fine for a seven-variable dataset. When I have 250 variables,
>>> though, I start to pale at looking up column indexes over and over. I
>>> know from reading the list archives that I can extract the column
>>> index of Industry, for example, by typing
>>>
>>> which("Industry"==names(pollute))
>>>
>>> but doing that before each command seems dire. Accustomed to Stata
>>> as I am, I am inclined to check the correlation of the first three or
>>> the second, fourth and sixth columns by substituting the column names
>>> for the column indexes--something like the following:
>>>
>>> cor(pollute[Pollution:Industry])
>>> cor(pollute[c(Temp,Population,Rain)])
>>>
>>> These however throw errors.
>>>
>>> I know that many commands in R are perfectly happy to take variable
>>> names--the regression models, for example--but that some do not. And
>>> so I ask you two general questions:
>>>
>>> 1. Is there a syntax for referring to variable names rather than
>>> column indexes in situations like these?
>>> 2. Is there something that I should look for in a command's help file
>>> that often indicates whether it can take column names rather than
>>> indexes?
>>>
>>> Again, apologies for asking something that has likely been asked
>>> before. I would appreciate any suggestions that you have.
>>>
>>> Best,
>>> John-Paul Ferguson
>>> Assistant Professor of Organizational Behavior
>>> Stanford University Graduate School of Business
>>> 518 Memorial Way, K313
>>> Stanford, CA 94305
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Ista Zahn
>> Graduate student
>> University of Rochester
>> Department of Clinical and Social Psychology
>> http://yourpsyche.org
>>
>
>



