[BioC] Formatting in R

Stephanie M. Gogarten sdmorris at u.washington.edu
Thu Oct 3 20:28:49 CEST 2013


Hi Avoks,

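This is more of an R-help question than a Bioconductor question, but 
here's something that I think will work for you.  Step 1 indexes the 
GENE, snp, and END lines of your first input file to pair each SNP with 
its gene, and step 2 joins the result to your second file with the 
"merge" function in R; see the man page (?merge) for more details.  SNPs 
missing from the second file get NA for Pval and Fst.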

 > # Step 1: read the set file and locate the GENE, snp, and END lines
 > f1 <- readLines("f1.txt")
 > gene.index <- grep("^GENE", f1)
 > snp.index <- grep("^snp", f1)
 > end.index <- grep("^END", f1)
 > snps <- f1[snp.index]
 > genes <- character(length(snps))
 > # Label each SNP with the gene whose GENE..END block contains it
 > for (i in seq_along(gene.index)) {
+   snps.in.gene <- snp.index > gene.index[i] & snp.index < end.index[i]
+   genes[snps.in.gene] <- f1[gene.index[i]]
+ }
 > snp.by.gene <- data.frame(GENE=genes, SNP=snps, stringsAsFactors=FALSE)
 > snp.by.gene
     GENE    SNP
1  GENE1 snp001
2  GENE1 snp002
3  GENE1 snp003
4  GENE1 snp004
5  GENE2 snp005
6  GENE2 snp006
7  GENE2 snp007
8  GENE2 snp008
9  GENE3 snp009
10 GENE3 snp010
11 GENE3 snp011
12 GENE3 snp012
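
If the set file is large, the loop can probably be avoided.  Here is a 
minimal sketch using findInterval(), assuming every snp line comes after 
a GENE line and inside that gene's block (so the END lines are not 
needed).  The in-memory f1 vector and the "block" variable are made up 
for illustration and are not your real data.

```r
# Hypothetical in-memory copy of a small set file, so no f1.txt is needed
f1 <- c("GENE1", "snp001", "snp002", "", "END",
        "GENE2", "snp005", "snp006", "", "END")

gene.index <- grep("^GENE", f1)
snp.index <- grep("^snp", f1)

# For each SNP line, findInterval() gives the position in gene.index of the
# last GENE header at or before it (gene.index is sorted increasing).
# Assumption: no snp line appears before the first GENE header.
block <- findInterval(snp.index, gene.index)

snp.by.gene <- data.frame(GENE = f1[gene.index[block]],
                          SNP = f1[snp.index],
                          stringsAsFactors = FALSE)
print(snp.by.gene)
```

This should give the same two-column layout as the loop, but it has not 
been checked against malformed set files (for example a missing END).
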
 > f2 <- read.table("f2.txt", as.is=TRUE, header=TRUE)
 > f2
      SNP  Pval  Fst
1 snp001 5e-04 0.25
2 snp002 3e-04 0.75
3 snp003 1e-04 0.65
4 snp004 1e-05 0.30
5 snp005 6e-05 0.50
6 snp006 4e-04 0.10
7 snp007 3e-05 0.60
8 snp008 2e-04 0.75
 > snp.table <- merge(snp.by.gene, f2, by="SNP", all.x=TRUE)
 > snp.table
       SNP  GENE  Pval  Fst
1  snp001 GENE1 5e-04 0.25
2  snp002 GENE1 3e-04 0.75
3  snp003 GENE1 1e-04 0.65
4  snp004 GENE1 1e-05 0.30
5  snp005 GENE2 6e-05 0.50
6  snp006 GENE2 4e-04 0.10
7  snp007 GENE2 3e-05 0.60
8  snp008 GENE2 2e-04 0.75
9  snp009 GENE3    NA   NA
10 snp010 GENE3    NA   NA
11 snp011 GENE3    NA   NA
12 snp012 GENE3    NA   NA
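
The all.x=TRUE argument is what keeps the GENE3 SNPs in the result even 
though they are not in f2.  As a rough illustration of the difference, 
here is a sketch with small made-up data frames (hypothetical stand-ins 
for snp.by.gene and f2); the names "left", "inner", and "unmatched" are 
just for this example.

```r
# Hypothetical miniature versions of snp.by.gene and f2
snp.by.gene <- data.frame(GENE = c("GENE1", "GENE1", "GENE3"),
                          SNP = c("snp001", "snp002", "snp009"),
                          stringsAsFactors = FALSE)
f2 <- data.frame(SNP = c("snp001", "snp002"),
                 Pval = c(5e-04, 3e-04),
                 Fst = c(0.25, 0.75),
                 stringsAsFactors = FALSE)

# all.x = TRUE: keep every row of snp.by.gene, filling NA where f2 has no match
left <- merge(snp.by.gene, f2, by = "SNP", all.x = TRUE)

# Default (all = FALSE): keep only SNPs present in both data frames
inner <- merge(snp.by.gene, f2, by = "SNP")

# SNPs with no statistics in f2 (assumes f2 itself has no missing Pval)
unmatched <- left$SNP[is.na(left$Pval)]
```

If you only want SNPs that have a Pval and Fst, drop all.x=TRUE and use 
the default merge.
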

Stephanie

On 10/3/13 12:23 AM, Ovokeraye Achinike-Oduaran wrote:
> Hi all,
>
> I have two files I need to merge but first I need to reformat one of
> them...it looks like this:
>
> GENE1
> snp001
> snp002
> snp003
> snp004
>
> END
> GENE2
> snp005
> snp006
> snp007
> snp008
>
> END
>
> GENE3
> snp009
> snp010
> snp011
> snp012
>
> END
>
> It's pretty much a set file from Plink. What I'd like to do is to have the
> file looking like this:
>
> GENE1  snp001
> GENE1  snp002
> GENE1  snp003
> GENE1  snp004
> GENE2  snp005
> GENE2  snp006
> GENE2  snp007
> GENE2  snp008
> GENE3  snp009
> GENE3  snp010
> ...
>
> The second file contains most of the same SNPs, but with additional
> data, so I'd like to do something like what can be done in Microsoft
> Access: merge the two files using the common unique identifier, the SNPs.
>
> SNP     Pval     Fst
> snp001  0.0005   0.25
> snp002  0.0003   0.75
> snp003  0.0001   0.65
> snp004  0.00001  0.3
> snp005  0.00006  0.5
> snp006  0.0004   0.1
> snp007  0.00003  0.6
> snp008  0.0002   0.75
>
> Any help with this in R will be greatly appreciated.
>
> Thanks.
>
> Avoks


