[R] Problem with parsing a dataset - help earnestly sought

Jan van der Laan djvanderlaan at gmail.com
Fri Apr 23 08:57:22 CEST 2010


Something like this?

# Remove everything after ; to give the status
status    <- sub(';.*$', '', data$cancer.problems)

# Remove everything before the last ; to give tissue
# In case a no ; in the string this goes wrong; correct
tissue    <- sub('^.*;[ \n]*', '', data$cancer.problems)
tissue[! grepl(';', data$cancer.problems)] <- ''

# Select the part between ;'s to give age
indices   <- regexpr(';.*;', data$cancer.problems)
lengths   <- attr(indices, "match.length")
age       <- rep(NA, length(data$cancer.problems))
age[indices>0] <- substring(data$cancer.problems[indices>0],
    indices[indices>0]+1, indices[indices>0]+lengths[indices>0]-2)


Hope it helps.
Regards,
Jan




On Fri, Apr 23, 2010 at 6:24 AM, Min-Han Tan <minhan.science at gmail.com> wrote:
> Dear fellow R-help members,
>
> I hope to seek your advice on how to parse/manage a dataset with hundreds of
> columns. Two examples of these columns, 'cancer.problems', and
> 'neuro.problems' are depicted below. Essentially, I need to parse this into
> a useful dataset, and unfortunately, I am not familiar with perl or any such
> language.
>
> data <- data.frame(id=c(1:10))
> data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y;
> LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS")
> data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","","Y; DX AGE: 74;
> STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS")
>
> As can be seen, the semi-colon delimiter follows its own set of rules, which
> are internally consistent - with all 3 elements of data, it should be
> "Status; Age; Tissue Type". However, if there is only tissue type, it is"
> Status; Tissue Type", without the trailing semi-colon. However, if there is
> Age available, it is "Status; Age;".
>
> The main challenge for me is how to parse/convert this dataset into a useful
> and consistent data.frame, or list, where I can capture Status, Age and
> Tissue Type as separate fields. Due to the varying application of the
> delimiter, I cannot use strsplit consistently. I have tried a convoluted
> method by identifying "AGE" as the character string identifying 3 element
> fields per below, but faced problems with unlist, given the empty fields.
>
> age.present <- grepl("AGE",data[,2])
> data.3column <- strsplit(data[age.present,2],";")
> data.2column <- strsplit(data[!age.present,2],";")
> data$cancer.status[age.present] <- unlist(data.3column)
> [(1:sum(age.present)*3)-2]
> ...
>
> Your advice is earnestly sought.
>
> Thanks.
>
> Min-Han
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list