[R] Finding strings in a dataset

Tuhin Chakraborty tuh|nch@kr@borty50 @end|ng |rom gm@||@com
Mon May 17 08:44:47 CEST 2021


Thank you. This possibly will work.
Tuhin Chakraborty
PhD
Geology & Geophysics
Indian Institute Of Technology, Kharagpur
Kharagpur-721302


On Sun, May 16, 2021 at 1:58 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:

> Hello,
>
> You can also create an extra column with the column names corresponding
> to the column col. I believe this extra column is not needed and with a
> big data set it's even a waste of time and memory space but the code
> below creates it.
>
>
> res <- which(found, arr.ind = TRUE)
> res <- as.data.frame(res)
> res$col_name <- names(df1)[ res$col ]
>
>
> With a big data set the first res is a numeric matrix, and its access
> and extraction are faster; matrix operations are generally faster than
> data.frame operations.
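As a rough sketch of how the (row, col) matrix can be put to use, the following rebuilds the example data quoted further down and adds both the column name and the offending cell value to the result. The `value` column, the name `res_df` and the explicit `stringsAsFactors = FALSE` are additions made for this illustration and are not part of the code above. Column names show up as `LI.PPM.` and so on because read.table sanitises them by default.

```r
txt <- "
LI(PPM) SC(PPM) TI(PPM) V(PPM)
3.1/0.5 ? ? ?
? ? 0.2/0.3 ?
? 2.8/0.75 ? >0.2
0.0389 108.6591 0.0214 85.18818
0.0688 146.1739 0.0117 108.0221
0.0265 121.3268 0.00749 85.34932
0.139901 125.3066 0.00984 97.23175
"
# stringsAsFactors = FALSE is the default from R 4.0.0; it is spelled out here
# because as.numeric() on a factor returns level codes instead of NA.
df1 <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# TRUE where a cell cannot be coerced to numeric
found <- sapply(df1, function(x) suppressWarnings(is.na(as.numeric(x))))

# Matrix with one (row, col) pair per TRUE cell
res <- which(found, arr.ind = TRUE)

# Indexing a matrix with a two-column index matrix picks one element per pair
values <- as.matrix(df1)[res]

# Assumption for illustration: collect positions, names and values together
res_df <- as.data.frame(res)
res_df$col_name <- names(df1)[res_df$col]
res_df$value <- values
res_df
```

The values are extracted from the matrix form before the data.frame is built, which keeps the lookup itself a plain matrix operation.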
>
> Hope this helps,
>
> Rui Barradas
>
> At 08:30 on 16/05/21, Rui Barradas wrote:
> > Hello,
> >
> > The data makes it clearer.
> > Do you want to know where the values that cannot be coerced to
> > numeric are?
> > The auxiliary function f outputs a logical vector, sapply applies it
> > column by column and which(., arr.ind) gives the TRUE values as (row,
> > col) pairs.
> >
> >
> > txt <- "
> > LI(PPM) SC(PPM) TI(PPM) V(PPM)
> > 3.1/0.5 ? ? ?
> > ? ? 0.2/0.3 ?
> > ? 2.8/0.75 ? >0.2
> > 0.0389 108.6591 0.0214 85.18818
> > 0.0688 146.1739 0.0117 108.0221
> > 0.0265 121.3268 0.00749 85.34932
> > 0.139901 125.3066 0.00984 97.23175
> > "
> > df1 <- read.table(text = txt, header = TRUE)
> > df1
> >
> > f <- function(x){
> >    suppressWarnings(is.na(as.numeric(x)))
> > }
> > found <- sapply(df1, f)
> > which(found, arr.ind = TRUE)
> >
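Note that f as written also flags the ? cells, since as.numeric("?") is NA. If the ? entries are only placeholders for missing values (an assumption; the thread does not say so), a variant along the following lines reports only the genuinely non-numeric strings such as 3.1/0.5 and >0.2. The use of `na.strings = "?"` and the names `is_text` and `strings` are choices made for this sketch.

```r
txt <- "
LI(PPM) SC(PPM) TI(PPM) V(PPM)
3.1/0.5 ? ? ?
? ? 0.2/0.3 ?
? 2.8/0.75 ? >0.2
0.0389 108.6591 0.0214 85.18818
0.0688 146.1739 0.0117 108.0221
0.0265 121.3268 0.00749 85.34932
0.139901 125.3066 0.00984 97.23175
"
# Assumption: "?" marks a missing value, so it is read in as NA
df1 <- read.table(text = txt, header = TRUE, na.strings = "?",
                  stringsAsFactors = FALSE)

# TRUE only for cells that are present but cannot be coerced to numeric
is_text <- function(x) {
  !is.na(x) & suppressWarnings(is.na(as.numeric(x)))
}

found <- sapply(df1, is_text)
res <- which(found, arr.ind = TRUE)

# One line per offending cell: row number, column name and the raw text
strings <- data.frame(row = res[, "row"],
                      column = names(df1)[res[, "col"]],
                      value = as.matrix(df1)[res],
                      stringsAsFactors = FALSE)
strings
```

On the seven example rows this should leave four cells, one per column, instead of the twelve that the unfiltered version reports.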
> >
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> >
> > At 06:31 on 16/05/21, Tuhin Chakraborty wrote:
> >> Thank you everyone, for the very helpful suggestions. I understand that
> >> my question is not altogether clear. So let me share an example.
> >> The below is a part of a dataset; there are around 40000 rows.
> >> LI(PPM) SC(PPM) TI(PPM) V(PPM)
> >> 3.1/0.5 ? ? ?
> >> ? ? 0.2/0.3 ?
> >> ? 2.8/0.75 ? >0.2
> >> 0.0389 108.6591 0.0214 85.18818
> >> 0.0688 146.1739 0.0117 108.0221
> >> 0.0265 121.3268 0.00749 85.34932
> >> 0.139901 125.3066 0.00984 97.23175
> >>
> >> Now the values 0.2/0.3 and >0.2 are treated as strings. When I use the
> >> spec(Dataset) function in R, it shows me which columns contain strings;
> >> for example, it tells me that LI(PPM), SC(PPM) etc. contain strings. But I
> >> would like to know if there is some way to learn exactly where the string
> >> values are, like for LI(PPM) in the top row. As this is a huge dataset, it
> >> is difficult to go through all the rows manually.
> >> Thank you again and in anticipation.
> >> Tuhin
> >>
> >>
> >>
> >> On Sun, May 16, 2021 at 4:25 AM Avi Gross via R-help <r-help using r-project.org> wrote:
> >>
> >>> Tuhin,
> >>>
> >>> What do you mean by a 2-D dataset? You say some columns contain strings,
> >>> so it does not sound like you are using a matrix, as then ALL columns
> >>> would be of the same type.
> >>>
> >>> So are you using a data.frame or tibble or something you made on your own?
> >>>
> >>> Can you address one column at a time and would that be of type vector?
> >>> Some methods work fairly easily on those and some also on lists.
> >>>
> >>> Once you have that vector, there are quite a few ways to find what you
> >>> want. Is it fixed text, like looking for an exact full match, so it would
> >>> be something like "theta" to be matched in full, or would you want to
> >>> match "the" so that both "theta" and "lathe" would match? Or are you
> >>> matching a pattern that is more complex, like looking for all text that
> >>> has two vowels in a row in it?
> >>>
> >>> Once you figure out what you have and what you want, how do you want to
> >>> identify what you are looking for? Will there be one match or possibly
> >>> many or even all? Many methods will return a TRUE/FALSE vector of the
> >>> same length or the integer offset of a match such as telling you it is
> >>> the fifth item.
> >>>
> >>> R has collections of string functions including in packages like
> >>> stringr/stringi that deal well with many things you might need. For
> >>> matching patterns, there is a family of functions using "grep" and so on.
> >>>
> >>> Good luck.
> >>>
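The distinctions drawn above (exact versus partial match, pattern match, logical vector versus integer position) can be sketched in base R as follows. The vector `x` and the result names are made up for this illustration; the "two vowels in a row" pattern is one plausible reading of that example, not a definitive one.

```r
# Hypothetical example vector
x <- c("theta", "lathe", "alpha", "queue")

# Exact, full-string match
exact_lgl <- x == "theta"          # logical vector, same length as x
exact_pos <- which(exact_lgl)      # integer position(s) of the match

# Fixed substring match: "the" occurs inside both "theta" and "lathe"
sub_lgl <- grepl("the", x, fixed = TRUE)   # logical vector
sub_pos <- grep("the", x, fixed = TRUE)    # integer positions

# Regular expression: two lower-case vowels in a row
vowel_lgl <- grepl("[aeiou]{2}", x)

exact_pos
sub_pos
x[vowel_lgl]
```

grepl() gives the TRUE/FALSE form and grep() the integer offsets; which() converts from the first to the second.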
> >>> -----Original Message-----
> >>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Tuhin Chakraborty
> >>> Sent: Saturday, May 15, 2021 1:08 PM
> >>> To: r-help using r-project.org
> >>> Subject: [R] Finding strings in a dataset
> >>>
> >>> Hi,
> >>> How can I find the location of string data in my 2D dataset? spec(Dataset)
> >>> will reveal the columns that contain the strings. But can I know where
> >>> exactly the string values are in the column?
> >>>
> >>>          [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
>
