[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sun Oct 19 08:25:45 CEST 2014


On Sat, 18 Oct 2014, Anna Zakrisson Braeunlich wrote:

> Thank you! That was an easy and fast solution!

If it was so easy, why couldn't you adapt Ista's solution? I suspect 
it is because you don't understand his suggestion.

> May I post a follow-up question? (I am not sure whether this should 
> rather be posted as a new question, but I post it here and can 
> re-post it if this is the wrong place to ask.) I am ever so 
> grateful for your help!

This probably should have been a new thread, but I will bite anyway.

> /Anna
>
>
> ######################### FOLLOW-UP QUESTION ####################################
>
> df1 <- data.frame(cbind(Identifier = c("M123.B23.VJHJ", "M123.B24.VJHJ",
>                                       "M123.B23.VLKE", "M123.B23.HKJH",
>                                       "M123.B24.LKJH"),
>                                       Sequence = c("ATATATATATA", "ATATATATATA",
>                                                    "ATATAGCATATA", "ATATATAGGGTA",
>                                                    "ATCGCGCGAATA")))

Just because R has a habit of making factors at the drop of a hat doesn't 
mean you have to accept what it does when your data are not yet ready to 
be treated as factors. When you read the data in via read.csv you can use 
the colClasses argument or the stringsAsFactors argument to stop that. You 
can also use stringsAsFactors when you create the data frame, and you can 
always go back and turn any particular column into a factor after you are 
done manipulating characters.
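
A small sketch of both options, using made-up CSV text passed through the 
text argument of read.csv so that it runs without a file (the names 
csv_text, dta and dta2 are only for illustration):

```r
# hypothetical CSV content, supplied inline so the example is self-contained
csv_text <- "Identifier,Sequence
M123.B23.VJHJ,ATATATATATA
M123.B24.VJHJ,ATATATATATA"

# option 1: turn off automatic factor conversion
dta <- read.csv( text = csv_text, stringsAsFactors = FALSE )
str( dta )    # look at the column types

# option 2: state the column classes explicitly
dta2 <- read.csv( text = csv_text, colClasses = "character" )
str( dta2 )

# convert to factor only once the string manipulation is finished
dta$Identifier <- factor( dta$Identifier )
```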

Also, somewhere you picked up the bad habit of using cbind before you make 
your data frame... that is almost never a good idea, because in data 
frames each column can have its own storage mode, while cbind creates a 
matrix where every element must have the same storage mode.

df1 <- data.frame( Identifier = c( "M123.B23.VJHJ", "M123.B24.VJHJ"
                                  , "M123.B23.VLKE", "M123.B23.HKJH"
                                  , "M123.B24.LKJH" )
                  , Sequence = c( "ATATATATATA", "ATATATATATA"
                                , "ATATAGCATATA", "ATATATAGGGTA"
                                , "ATCGCGCGAATA" )
                  , stringsAsFactors=FALSE
                  )
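
To see what cbind does to mixed columns, here is a toy comparison (the 
objects m, bad and good are made up for this illustration and are not part 
of your data):

```r
# cbind builds a matrix, so the integers are coerced to character
m <- cbind( X = 1:3, Name = c( "a", "b", "c" ) )
str( m )

# a data frame built from that matrix inherits the damage
bad <- data.frame( m, stringsAsFactors = FALSE )

# building the data frame directly keeps each column's own type
good <- data.frame( X = 1:3
                  , Name = c( "a", "b", "c" )
                  , stringsAsFactors = FALSE
                  )
str( bad )
str( good )
```

Comparing the two str() outputs shows X stored as character in bad and as 
integer in good.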


>
> # as a follow-up question:
> # How can I split the identifier in df1 above into several columns based on the
> # separating dots? The real data includes thousands of rows.
> # This is what I want it to look like in the end:
>
> df1_solution <- data.frame(cbind(Identifier1 = c("M123", "M123",
>                                       "M123", "M123",
>                                       "M123"),
>                        Identifier2 = c("B23", "B24", "B23", "B23", "B24"),
>                        Identifier3 = c("VJHJ", "VJHJ", "VLKE", "HKJH", "LKJH"),
>                        Sequence = c("ATATATATATA", "ATATATATATA",
>                                     "ATATAGCATATA", "ATATATAGGGTA",
>                                     "ATCGCGCGAATA")))

df1_solution <- data.frame( Identifier1 = c( "M123", "M123"
                                            , "M123", "M123"
                                            , "M123" )
                           , Identifier2 = c( "B23", "B24", "B23"
                                            , "B23", "B24" )
                           , Identifier3 = c( "VJHJ", "VJHJ"
                                            , "VLKE", "HKJH", "LKJH")
                           , Sequence = c( "ATATATATATA", "ATATATATATA"
                                         , "ATATAGCATATA", "ATATATAGGGTA"
                                         , "ATCGCGCGAATA" )
                           , stringsAsFactors=FALSE
                          )


> # I am very grateful for your help! I am no whiz at R and everything I know
> # is self-taught. Therefore, some basics can turn out to be quite some
> # obstacles for me.
> # /Anna

Pretty much all of us are here to teach ourselves R, Anna. Keep reading 
other people's questions. Learn to try each fragment alone at the command 
line to figure out what is happening. Use the str() function frequently.

# the basic split
parts <- strsplit( as.character( df1$Identifier ), ".", fixed=TRUE )
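
The fixed=TRUE matters because strsplit treats the split argument as a 
regular expression by default, and "." in a regular expression matches any 
character. A quick check on a single identifier (id, a, b and d are 
throwaway names for this sketch):

```r
id <- "M123.B23.VJHJ"

# literal dot: three pieces
a <- strsplit( id, ".", fixed = TRUE )[[ 1 ]]

# escaped dot in a regular expression: same three pieces
b <- strsplit( id, "\\." )[[ 1 ]]

# unescaped dot: every character is treated as a separator,
# so only empty strings are left
d <- strsplit( id, "." )[[ 1 ]]

str( a )
str( b )
str( d )
```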

# extension of Ista's approach to assembly
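# (the result is sized from df1, since ans1 does not exist yet, and
# NA_character_ makes the columns character from the start)
ans1 <- data.frame( Identifier1 = rep( NA_character_, nrow( df1 ) )
                   , Identifier2 = rep( NA_character_, nrow( df1 ) )
                   , Identifier3 = rep( NA_character_, nrow( df1 ) )
                   , stringsAsFactors = FALSE
                   )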
# note all memory is pre-allocated above... avoid successively
# accumulating rows with rbind... that would be very slow
for ( rw in seq_along( df1$Identifier ) ) {
   v <- parts[[ rw ]]
   ans1[ rw, "Identifier1" ] <- v[ 1 ]
   ans1[ rw, "Identifier2" ] <- v[ 2 ]
   ans1[ rw, "Identifier3" ] <- v[ 3 ]
}
ans1$Sequence <- df1$Sequence

#---- alternative method of assembly
# uses "list-to-data.frame apply" from plyr package
library(plyr)
ans2 <- ldply( parts
              , function( v ) { # called once for each item in parts list
                       # all single-row data frames created in this
                       # function are concatenated at once by ldply to
                       # make one big data.frame "ans2"
                       data.frame( Identifier1=v[1]
                                 , Identifier2=v[2]
                                 , Identifier3=v[3]
                                 , stringsAsFactors=FALSE
                                 )
                }
              )
ans2$Sequence <- df1$Sequence
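
A third possibility, sketched here with base R only, is to bind the pieces 
into a character matrix and pull the columns out of it. This assumes every 
identifier has exactly three dot-separated parts; if some had fewer, rbind 
would recycle values with a warning. The names m and ans3 are only for 
illustration.

```r
df1 <- data.frame( Identifier = c( "M123.B23.VJHJ", "M123.B24.VJHJ"
                                  , "M123.B23.VLKE", "M123.B23.HKJH"
                                  , "M123.B24.LKJH" )
                  , Sequence = c( "ATATATATATA", "ATATATATATA"
                                , "ATATAGCATATA", "ATATATAGGGTA"
                                , "ATCGCGCGAATA" )
                  , stringsAsFactors = FALSE
                  )
parts <- strsplit( df1$Identifier, ".", fixed = TRUE )

# one row per identifier, one column per piece
m <- do.call( rbind, parts )

ans3 <- data.frame( Identifier1 = m[ , 1 ]
                  , Identifier2 = m[ , 2 ]
                  , Identifier3 = m[ , 3 ]
                  , Sequence = df1$Sequence
                  , stringsAsFactors = FALSE
                  )
str( ans3 )
```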


>
> Anna Zakrisson Braeunlich
> PhD student
>
> Department of Ecology, Environment and Plant Sciences
> Stockholm University
> Svante Arrheniusv. 21A
> SE-106 91 Stockholm
> Sweden/Sverige
>
> Lives in Berlin.
> For paper mail:
> Katzbachstr. 21
> D-10965, Berlin
> Germany/Deutschland
>
> E-mail: anna.zakrisson at su.se
> Tel work: +49-(0)3091541281
> Mobile: +49-(0)15777374888
> LinkedIn: http://se.linkedin.com/pub/anna-zakrisson-braeunlich/33/5a2/51b
>
>
> ________________________________________
> From: Ista Zahn [istazahn at gmail.com]
> Sent: 13 October 2014 15:42
> To: Anna Zakrisson Braeunlich
> Cc: r-help at r-project.org
> Subject: Re: [R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.
>
> Hi Anna,
>
>
> On Sun, Oct 12, 2014 at 3:24 AM, Anna Zakrisson Braeunlich
> <anna.zakrisson at su.se> wrote:
>> Hi,
>>
>> I have a question how to split a factor name into different columns. I have metabarcoding data and need to merge the FASTA-file with the taxonomy- and counttable files (dataframes). To be able to do this merge, I need to isolate the common identifier, that unfortunately is baked in with a lot of other labels in the factor name eg:
>> sequence identifier: M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2
>>
>> I want to split this name at every "." to get several columns:
>> column1: M01271_77_000000000
>> column2: A8J0P_1_1101_10150_1525
>> column3: 1
>> column4: 322519
>> column5: sample_1
>> column6: sample_2
>>
>> I must add that I have no influence on how these names are given. This is how they are supplied from Illumina Miseq. I just need to be able to deal with it.
>>
>> Here is some extremely simplified dummy data to further show the issue at hand:
>>
>> df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
>>                   Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
>> df2 <- data.frame(cbind(B = 13:22, K = rnorm(10)),
>>                   Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))
>>
>> # I have metabarcoding data with one FASTA-file, one count table and one taxonomy file
>> # Above dummy data is just showing the issue at hand. I want to be able to merge my three
>> # original data frames (here, the dummy data is only two dataframes). The problem is that
>> # the only identifier that is common for the dataframes is "hidden" in the
>> # factor name eg: Z.identifierA.B1298712 and Q.identifierA.B4668726. I hence need to be able
>> # to split this name up into different columns to get "identifierA" alone as one column name
>> # Then I can merge the dataframes.
>> # How can I do this in R. I know that it can be done in excel, but I would like to
>> # produce a complete R-script to get a fast pipeline and avoid copy and paste errors.
>> # This is what I want it to look like:
>>
>> df1.goal <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
>>                   Z = factor(rep(LETTERS[1:2], each = 5)),
>>                   identifierA = factor(rep(LETTERS[1:2], each = 5)),
>>                   B1298712 = factor(rep(LETTERS[1:2], each = 5)))
>
> Use strsplit to separate the components, something like
>
> separateNames <- strsplit(names(df1)[3], split = "\\.")[[1]]
> for(name in separateNames) {
>    df1[[name]] <- df1[[3]]
> }
> df1[[3]] <- NULL
>
> Best,
> Ista
>
>>
>> # Many thanks and with kind regards
>> Anna Zakrisson
>>
>>         [[alternative HTML version deleted]]
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


