[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R

Bert Gunter bgunter.4567 at gmail.com
Sun Sep 25 06:03:57 CEST 2016


"So I want a **Fuzzy logic approach** to..."

That is a nearly meaningless buzzword.

I suggest you search for "fuzzy logic" on the rseek.org website and see
if any of the hits there do whatever it is that you have in mind.
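
On the narrower point that agrep() reports which vector elements match
rather than where in each string the match sits, base R's aregexec() may
be worth a look: it is the approximate-matching counterpart of regexec()
and its result can be passed to regmatches(). What follows is only a
minimal sketch under assumptions: the vector `purpose` is made up from
three of the sample strings in the quoted message, and a tolerance of one
edit is an arbitrary choice that would need tuning on the real data.

```r
# Hypothetical sample, copied from the strings in the quoted message
purpose <- c("FV SPLIT Rs.10 to RE.1",
             "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
             "BONUS 4:1")

# Approximate match of "SPLIT", allowing at most one edit
# (assumption: one edit is enough to catch variants such as "SPLT")
m <- aregexec("SPLIT", purpose, max.distance = 1L)

# Start position of the match inside each string; -1 means no match
start <- vapply(m, function(x) as.integer(x[1]), integer(1))

# Which elements of the vector matched at all
hit <- start > 0

# The text that was actually matched, NA where nothing matched
matched <- vapply(regmatches(purpose, m),
                  function(x) if (length(x)) x[1] else NA_character_,
                  character(1))

# Remainder of each matching string from the match onwards, which is
# where the numeric values of interest would be looked for
rest <- ifelse(hit, substring(purpose, pmax(start, 1L)), NA_character_)
```

Whether approximate matching is really needed here, as opposed to a short
list of known abbreviations, is something only the data can tell.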

Cheers,
Bert




Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Sat, Sep 24, 2016 at 11:49 AM, Aarushi Kaushal
<kaushalaarushi at gmail.com> wrote:
> Hey there,
>
> I work for an organisation named Bullero Capital Pvt. Ltd. in New Delhi,
> which is involved in financial services, portfolio management to be
> precise. We have recently started building a database in R for all the
> stocks etc., so that they can be processed automatically and analyzed for
> future investment purposes (the related data is already available and in
> our possession).
>
> A colleague and I are currently at the data cleaning stage, where we need
> to organize and format the data the way we want it in the database. The
> problem lies in the notation and symbols used in the original csv data
> files acquired from the government website: we have to do approximate
> matching (for efficiency) and then extract only the numeric values from
> the character strings in the respective columns of the data frame.
>
> 1.) As of now we are looking at using the agrep function to detect and
> locate the pattern matches, namely DIVIDEND, SPLIT, BONUS
>
> 2.) From there on, carry out the extraction of the respective numeric values
> associated with these actions into the corresponding columns:
> BONUS_NUM (numerator of the ratio), BONUS_DEN (denominator of the ratio),
> SPLIT_NUM (numerator of the ratio), SPLIT_DEN (denominator of the ratio),
> Final Dividend, Interim Dividend & Special Dividend.
>
>
> COLUMN PURPOSE
>
>    1. DIVIDEND-RE.1/- PER SHARE
>    2. AGM/DIV-RS.3.50 PER SHARE
>    3. SPL DIV-RS.2.70 PER SHARE
>    4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
>    5. FV SPLIT Rs.10 to RE.1
>    6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
>    7. BONUS 4:1
>    8. DIV:10%
>
> Ex.
> DIVIDEND-RE.1/- PER SHARE
> FINAL_DIV-1
>
> AGM/DIV-RS.3.50 PER SHARE
> FINAL_DIV-3.50
>
> SPL DIV-RS.2.70 PER SHARE
> SPECIAL DIV-2.70
>
> Ex.
> FV SPLIT Rs.10 to RE.1
> SPLIT_NUM - 1
> SPLIT_DEN - 10
>
> Ex. BONUS 4:1
> BONUS_NUM - 4
> BONUS_DEN - 1
>
> However, the problem is that agrep returns the indices of the matching
> vector elements rather than the position of the match within each string,
> which makes it cumbersome to extract the numeric values following the
> respective matches.
> So I want a Fuzzy logic approach to
>
>    - check for the presence of SPLIT, DIVIDEND, BONUS
>    - return the index of whichever cell in the column PURPOSE of the data
>    frame the pattern match occurs in
>    - return the position of that pattern within the string, to extract the
>    numerical value following the matched pattern
>
> Basically, is there any way in R to match the patterns approximately and
> get back the position of each match within its string? (That way, if the
> symbols change in the future according to the government website's
> notation in the data storage, or the format/positioning/spacing changes,
> the code could account for all those changes automatically.)
> I am attaching below the .csv file consisting of just the column we need to
> carry out the cleaning in for your convenience.
>
> It would be very helpful, if we could get some guidance as to how to
> proceed further at the earliest.
>
> regards,
> aarushi kaushal
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
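
For the extraction step described in the quoted message, ordinary regular
expressions may already go a long way once the relevant rows have been
identified. The sketch below is an illustration only: `purpose` is a
made-up vector built from the sample strings above, and the number pattern
assumes that values are plain digits with an optional decimal part, which
may not hold for the full file.

```r
# Hypothetical sample, copied from the strings in the quoted message
purpose <- c("DIVIDEND-RE.1/- PER SHARE",
             "AGM/DIV-RS.3.50 PER SHARE",
             "FV SPLIT Rs.10 to RE.1",
             "BONUS 4:1")

# All numbers in each string: a run of digits, optionally followed by
# a decimal point and more digits
num_pattern <- "[0-9]+(\\.[0-9]+)?"
nums <- lapply(regmatches(purpose, gregexpr(num_pattern, purpose)),
               as.numeric)

# Dividend amount: first number in the string
final_div <- nums[[2]][1]

# Split: "Rs.10 to RE.1" gives old face value first, new one second,
# following the SPLIT_NUM / SPLIT_DEN example in the quoted message
split_den <- nums[[3]][1]
split_num <- nums[[3]][2]

# Bonus ratio "a:b": regexec() returns the whole match plus both groups
ratio <- regmatches(purpose, regexec("([0-9]+) *: *([0-9]+)", purpose))
bonus_num <- as.numeric(ratio[[4]][2])
bonus_den <- as.numeric(ratio[[4]][3])
```

Strings that combine several actions, such as a bonus plus a split, would
need to be separated at the "+" first so that each number is attributed to
the right action.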