[R] How to match strings in two files and replace strings?

Jim Lemon
Tue Mar 31 04:24:18 CEST 2020


Hi Ana,
This seems to work on your sample data. The merge leaves the matching
rs ID in column V5 of newout, so it shouldn't be too hard to do the
renaming and reordering of columns.

output11.frq<-read.table(text="CHR  SNP A1 A2  MAF  NCHROBS
1      1:775852:T:C    T    C       0.1707     3444
1     1:1120590:A:C    C    A      0.08753     3496
1     1:1145994:T:C    C    T       0.1765     3496
1     1:1148494:A:G    A    G       0.1059     3464
1     1:1201155:C:T    T    C      0.07923     3496",
header=TRUE,stringsAsFactors=FALSE)

marker_info<-read.csv(text="1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018",
header=FALSE,stringsAsFactors=FALSE)
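# create new columns for the merge: a "chromosome:position" key in each data frame
# in output11.frq, keep the first two ":"-separated fields of SNP
output11.frq$match_col<-sapply(strsplit(output11.frq$SNP,":"),
 function(x) paste(x[1:2],collapse=":"))
# in marker_info, V1 is the chromosome and V2 the position
marker_info$match_col<-paste(marker_info$V1,marker_info$V2,sep=":")
# merge to get the result (only rows with a match are kept; V5 holds the rs ID)
newout<-merge(output11.frq,marker_info[,c("V5","match_col")],by="match_col")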
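The renaming and reordering could be sketched roughly as below. This is only one way to do it. The small data frame at the top is a stand-in for the merged newout (the two matching rows from the sample data), rebuilt here so the snippet runs on its own, and the names `pos` and `result` are mine. Sorting by chromosome and position is an assumption about the order you want.

```r
# Stand-in for the merged data frame (two matching rows from the sample data),
# rebuilt here only so that this snippet runs on its own.
newout <- data.frame(
  match_col = c("1:1201155", "1:775852"),
  CHR = c(1L, 1L),
  SNP = c("1:1201155:C:T", "1:775852:T:C"),
  A1 = c("T", "T"),
  A2 = c("C", "C"),
  MAF = c(0.07923, 0.1707),
  NCHROBS = c(3496L, 3444L),
  V5 = c("rs4245756", "rs2980300"),
  stringsAsFactors = FALSE
)

# Overwrite the SNP column with the rs ID that came from marker_info
newout$SNP <- newout$V5

# merge() sorts on the character key, so "1:1201155" comes before "1:775852";
# recover the numeric position from the key and sort on chromosome and position
pos <- as.numeric(sub("^[^:]*:", "", newout$match_col))
newout <- newout[order(newout$CHR, pos), ]

# Keep the original columns in their original order, dropping the helper columns
result <- newout[, c("CHR", "SNP", "A1", "A2", "MAF", "NCHROBS")]
rownames(result) <- NULL
print(result)
```

If the original row order of output11.frq matters more than genomic order, an alternative would be to add a row index column before the merge and sort on that afterwards.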

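For the full-size files, something along these lines might replace the inline text= data. It is a sketch under assumptions: the 24 comment lines in marker-info are skipped by count because I don't know which character marks them, output11.frq is taken to be whitespace-separated (tabs or spaces) with a header line, and the temporary files written at the top are shortened stand-ins for the real ones (`frq_file` and `info_file` are hypothetical names).

```r
# Write small stand-in files so the example is self-contained; with the real
# data, point frq_file and info_file at output11.frq and marker-info instead.
frq_file <- tempfile(fileext = ".frq")
info_file <- tempfile()
writeLines(c(
  " CHR               SNP   A1   A2          MAF  NCHROBS",
  "   1      1:775852:T:C    T    C       0.1707     3444",
  "   1     1:1120590:A:C    C    A      0.08753     3496"
), frq_file)
writeLines(c(
  rep("# placeholder comment line", 24),  # stands in for the 24 commented lines
  # marker-info rows, shortened to their first 8 fields for brevity
  "1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A",
  "1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C"
), info_file)

# The default sep = "" in read.table splits on any run of spaces or tabs
output11.frq <- read.table(frq_file, header = TRUE, stringsAsFactors = FALSE)

# read.csv does not treat "#" as a comment by default, so skip the lines by count
marker_info <- read.csv(info_file, header = FALSE, skip = 24,
                        stringsAsFactors = FALSE)

# Build the "chromosome:position" key in both data frames and merge on it
output11.frq$match_col <- sapply(strsplit(output11.frq$SNP, ":"),
                                 function(x) paste(x[1:2], collapse = ":"))
marker_info$match_col <- paste(marker_info$V1, marker_info$V2, sep = ":")
newout <- merge(output11.frq, marker_info[, c("V5", "match_col")],
                by = "match_col")
print(newout)
```

With several hundred thousand rows in each file, merge() on a single character key should still be manageable, though I have not timed it on data of that size.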
Jim

On Tue, Mar 31, 2020 at 11:09 AM Ana Marija wrote:
>
> I have a file like this: (has 308545 lines)
>
>     head output11.frq
>      CHR               SNP   A1   A2          MAF  NCHROBS
>        1      1:775852:T:C    T    C       0.1707     3444
>        1     1:1120590:A:C    C    A      0.08753     3496
>        1     1:1145994:T:C    C    T       0.1765     3496
>        1     1:1148494:A:G    A    G       0.1059     3464
>        1     1:1201155:C:T    T    C      0.07923     3496
>     ...
>
> The other file (marker-info) is comma-separated, starts with 24
> commented lines, and looks like this (500593 lines in total):
>
>     1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
>     1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
>     1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
>     1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
>     1,1201155,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
>     ...
>
> I want to replace the second column of output11.frq with the 5th
> column of marker-info wherever the 1st and 2nd columns of marker-info
> (chromosome and position) match the start of the SNP ID. For this
> example, the resulting output11.frq would look like this:
>
>     1      rs2980300    T    C       0.1707     3444
>     1      rs4245756    T    C      0.07923     3496
>
> I tried doing this with awk in bash but got an empty file:
>
>     vi tst.awk
>     NR==FNR { map[$1,$2]=$5; next }
>     ($1,$4) in map { $2=map[$1,$4]; print }
>     awk -f tst.awk FS=',' marker-info FS='\t' output11.frq  > output11X.frq
>
> Can this be done in R?
>
> Thanks
> Ana