[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.

Sat Oct 18 06:59:10 CEST 2014

Thank you! That was an easy and fast solution!

May I post a follow-up question? (I am not sure whether this should be posted as a new question instead; I am posting it here and can re-post it if this is the wrong place to ask.) I am ever so grateful for your help!
/Anna

######################### FOLLOW-UP QUESTION ####################################

df1 <- data.frame(Identifier = c("M123.B23.VJHJ", "M123.B24.VJHJ",
                                 "M123.B23.VLKE", "M123.B23.HKJH",
                                 "M123.B24.LKJH"),
                  Sequence = c("ATATATATATA", "ATATATATATA",
                               "ATATAGCATATA", "ATATATAGGGTA",
                               "ATCGCGCGAATA"))

# as a follow-up question:
# How can I split the identifier in df1 above into several columns based on the 
# separating dots? The real data includes thousands of rows.
# This is what I want it to look like in the end:

df1_solution <- data.frame(Identifier1 = c("M123", "M123", "M123", "M123", "M123"),
                           Identifier2 = c("B23", "B24", "B23", "B23", "B24"),
                           Identifier3 = c("VJHJ", "VJHJ", "VLKE", "HKJH", "LKJH"),
                           Sequence = c("ATATATATATA", "ATATATATATA",
                                        "ATATAGCATATA", "ATATATAGGGTA",
                                        "ATCGCGCGAATA"))

# I am very grateful for your help! I am no whiz at R and everything I know
# is self-taught. Therefore, some basics can turn out to be quite some
# obstacles for me.
# /Anna
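
One possible way to approach the follow-up question is sketched below in base R. It reuses strsplit, this time on the values of the Identifier column rather than on a column name. The object names parts, id_cols and df1_split are made up for illustration, and the sketch assumes that every identifier contains the same number of dot-separated fields.

```r
# Sketch only: base R, no extra packages.
# `parts`, `id_cols` and `df1_split` are illustrative names, not from the post.
df1 <- data.frame(
  Identifier = c("M123.B23.VJHJ", "M123.B24.VJHJ", "M123.B23.VLKE",
                 "M123.B23.HKJH", "M123.B24.LKJH"),
  Sequence   = c("ATATATATATA", "ATATATATATA", "ATATAGCATATA",
                 "ATATATAGGGTA", "ATCGCGCGAATA"),
  stringsAsFactors = FALSE
)

# Split every identifier at the literal dots.
# fixed = TRUE treats "." as a plain character, so no regex escaping is needed.
# as.character() guards against the column being stored as a factor.
parts <- strsplit(as.character(df1$Identifier), split = ".", fixed = TRUE)

# Assumption: all identifiers have the same number of fields.
stopifnot(length(unique(lengths(parts))) == 1)

# Stack the pieces into a character matrix, one row per identifier.
id_cols <- do.call(rbind, parts)
colnames(id_cols) <- paste0("Identifier", seq_len(ncol(id_cols)))

# Combine the new identifier columns with the untouched Sequence column.
df1_split <- data.frame(id_cols, Sequence = df1$Sequence,
                        stringsAsFactors = FALSE)
```

Because strsplit is vectorised over its input, the same lines should work unchanged on thousands of rows. If some identifiers had fewer fields than others, the stopifnot check would fail and the rows would need padding before rbind.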

><((((º>`•. . • `•. .• `•. . ><((((º>`•. . • `•. .• `•. .><((((º>`•. . • `•. .• `•. .><((((º>

Anna Zakrisson Braeunlich
PhD student

Department of Ecology, Environment and Plant Sciences
Stockholm University
Svante Arrheniusv. 21A
SE-106 91 Stockholm
Sweden/Sverige

Lives in Berlin.
For paper mail:
Katzbachstr. 21
D-10965, Berlin
Germany/Deutschland

E-mail: anna.zakrisson at su.se
Tel work: +49-(0)3091541281
Mobile: +49-(0)15777374888
LinkedIn: http://se.linkedin.com/pub/anna-zakrisson-braeunlich/33/5a2/51b

><((((º>`•. . • `•. .• `•. . ><((((º>`•. . • `•. .• `•. .><((((º>`•. . • `•. .• `•. .><((((º>

________________________________________
From: Ista Zahn [istazahn at gmail.com]
Sent: 13 October 2014 15:42
To: Anna Zakrisson Braeunlich
Cc: r-help at r-project.org
Subject: Re: [R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.

Hi Anna,

On Sun, Oct 12, 2014 at 3:24 AM, Anna Zakrisson Braeunlich
<anna.zakrisson at su.se> wrote:
> Hi,
>
> I have a question about how to split a factor name into different columns. I have metabarcoding data and need to merge the FASTA file with the taxonomy and count table files (data frames). To be able to do this merge, I need to isolate the common identifier, which unfortunately is baked in with a lot of other labels in the factor name, e.g.:
> sequence identifier: M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2
>
> I want to split this name at every "." to get several columns:
> column1: M01271_77_000000000
> column2: A8J0P_1_1101_10150_1525
> column3: 1
> column4: 322519
> column5: sample_1
> column6: sample_2
>
> I must add that I have no influence on how these names are given. This is how they are supplied by Illumina MiSeq. I just need to be able to deal with it.
>
> Here is some extremely simplified dummy data to further show the issue at hand:
>
> df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
>                   Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
> df2 <- data.frame(cbind(B = 13:22, K = rnorm(10)),
>                   Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))
>
> # I have metabarcoding data with one FASTA-file, one count table and one taxonomy file
> # Above dummy data is just showing the issue at hand. I want to be able to merge my three
> # original data frames (here, the dummy data is only two dataframes). The problem is that
> # the only identifier that is common to the dataframes is "hidden" in the
> # factor name, e.g. Z.identifierA.B1298712 and Q.identifierA.B4668726. I hence need to be able
> # to split this name up into different columns to get "identifierA" alone as one column name
> # Then I can merge the dataframes.
> # How can I do this in R? I know that it can be done in Excel, but I would like to
> # produce a complete R-script to get a fast pipeline and avoid copy and paste errors.
> # This is what I want it to look like:
>
> df1.goal <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
>                   Z = factor(rep(LETTERS[1:2], each = 5)),
>                   identifierA = factor(rep(LETTERS[1:2], each = 5)),
>                   B1298712 = factor(rep(LETTERS[1:2], each = 5)))

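Use strsplit to separate the components, something like:

# Split the name of the third column at each literal dot
separateNames <- strsplit(names(df1)[3], split = "\\.")[[1]]
# Add one copy of that column under each component name
for(name in separateNames) {
    df1[[name]] <- df1[[3]]
}
# Drop the original column with the combined name
df1[[3]] <- NULL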
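
As a quick check of the splitting step on its own, the sketch below applies strsplit to the example sequence identifier quoted in the question. The escaping matters because an unescaped "." is a regular expression that matches any character; passing fixed = TRUE is an alternative spelling. The labels column1 to column6 are only illustrative and mirror the layout asked for in the question.

```r
# The example identifier quoted in the original question.
id <- "M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2"

# "\\." is a regular expression for a literal dot.
pieces <- strsplit(id, split = "\\.")[[1]]

# Equivalent call without regex escaping.
pieces_fixed <- strsplit(id, split = ".", fixed = TRUE)[[1]]

# Illustrative labels only, following the column1..column6 layout in the question.
names(pieces) <- paste0("column", seq_along(pieces))
print(pieces)
```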

Best,
Ista

>
> # Many thanks and kind regards
> Anna Zakrisson
>