[R] Recoding variables based on reference values in data frame

Rui Barradas ruipbarradas at sapo.pt
Tue Jul 2 23:21:38 CEST 2013


Hello,

If you have read in the data as factors (stringsAsFactors = TRUE, the 
default), change the function to the following.


fun <- function(x){
	x[x %in% c("nc", "_")] <- NA
	MM <- paste0(as.character(x[1]), as.character(x[1]))  # Major Major
	Mm <- paste0(as.character(x[1]), as.character(x[2]))  # Major minor
	mm <- paste0(as.character(x[2]), as.character(x[2]))  # minor minor
	x[x == MM] <- 0
	x[x == Mm] <- 1
	x[x == mm] <- 2
	x
}



Rui Barradas

Em 02-07-2013 22:15, Rui Barradas escreveu:
> Hello,
>
> I'm not sure I understood, but try the following.
>
>
> Kgeno <- read.table(text = "
> SNP_ID SNP1 SNP2 SNP3 SNP4
> Maj_Allele C G  C  A
> Min_Allele T A T G
> ID1 CC     GG     CT     AA
> ID2 CC     GG     CC AA
> ID3 CC     GG    nc    AA
> ID4 _ _ _ _
> ID5 CC     GG     CC     AA
> ID6 CC     GG     CC     AA
> ID7 CC     GG     CT     AA
> ID8 _ _ _ _
> ID9 CT     GG     CC AG
> ID10 CC     GG     CC     AA
> ID11 CC     GG     CT     AA
> ID12 _ _ _ _
> ID13 CC     GG     CC     AA
> ", header = TRUE, stringsAsFactors = FALSE)
>
> dat
>
> fun <- function(x){
>      x[x %in% c("nc", "_")] <- NA
>      MM <- paste0(x[1], x[1])  # Major Major
>      Mm <- paste0(x[1], x[2])  # Major minor
>      mm <- paste0(x[2], x[2])  # minor minor
>      x[x == MM] <- 0
>      x[x == Mm] <- 1
>      x[x == mm] <- 2
>      x
> }
>
> Kgeno[, -1] <- sapply(Kgeno[, -1], fun)
> Kgeno
>
>
> Also, the best way to post data is by using ?dput.
>
> dput(head(Kgeno[, 1:5], 30))  # post the output of this
>
>
> Hope this helps,
>
> Rui Barradas
>
> Em 02-07-2013 21:46, kathleen askland escreveu:
>> I'm new to R (previously used SAS primarily) and I have a genetics data
>> frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3,
>> ...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of
>> the data is shown below:
>>    SNP_ID SNP1 SNP2 SNP3 SNP4  Maj_Allele C G  C  A  Min_Allele T A T
>> G  ID1
>> CC     GG     CT     AA      ID2 CC     GG     CC AA      ID3 CC     GG
>> nc
>> AA      ID4 _ _ _ _  ID5 CC     GG     CC     AA      ID6 CC
>> GG     CC
>>       AA      ID7 CC     GG     CT     AA      ID8 _ _ _ _  ID9 CT     GG
>> CC AG      ID10 CC     GG     CC     AA      ID11 CC     GG     CT     AA
>>        ID12 _ _ _ _  ID13 CC     GG     CC     AA
>> The name of the data file is Kgeno.
>> What I would like to do is recode all of the genotype values to standard
>> integer notation, based on their values relative to the reference rows
>> (Maj_Allele and Min_Allele). Standard notation sums the total of minor
>> alleles in the genotype, so values can be 0, 1 or 2.
>>
>> Here are the changes I want to make:
>> 1. If the genotype= "nc" or '_" then set equal to NA.
>> 2. If genotype value = a character string comprised of two consecutive
>> major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0.
>> 3. If genotype  value= c(Maj_Allele, Min_Allele) then set equal to 1.
>> 4. If genotype  value = c(Min_Allele, Min_Allele) then set equal to 2.
>>
>> I've tried the following ifelse processing but get error (Warning:
>> Executed
>> script did not end with R session at the top-level prompt.  Top-level
>> state
>> will be restored) and can't seem to fix the code properly. I've
>> counted the
>> parentheses. Also, not sure if it would execute properly if I could
>> fix it.
>>
>> # change 'nc' and '_' to NA, else leave as is:
>> Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2])
>> Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2])
>>
>> #convert genotype strings in the first data column to numeric values
>> #(two
>> major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else
>> #leave as
>> is (to preserve NA values).
>>
>> Kgeno[,2] <-
>>
>> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
>> Kgeno[1,2]), sep=""), 0,
>>
>> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
>> Kgeno[2,2]), sep=""), 1,
>>
>> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character(
>> Kgeno[2,2]), sep=""), 2,
>>              Kgeno[,2])))
>>
>>
>> Finally, if above code were corrected, this would only change the first
>> column of data, but I would like to change all 3000+ columns in the same
>> way.
>>
>> I would greatly appreciate some suggestions on how to proceed.
>>
>> Thank you,
>>
>> Kathleen
>>
>> ---
>> Kathleen Askland, MD
>> Assistant Professor
>> Department of Psychiatry & Human Behavior
>> The Warren Alpert School of Medicine
>> Brown University/Butler Hospital
>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list