[R] Frequency of Combinations

Thierry Onkelinx thierry.onkelinx at inbo.be
Mon Apr 24 16:43:06 CEST 2017


Dear Mustafa,

I'd recommend the packages readxls to import the data, tidyr to transform
the data into long format and dplyr to select the data.

1. read the data into R with read_excel()
2. transform sheet 1 into a long format with gather(). The result is one
row for each patient / drug combination
3. select the relevant drugs in sheet 2 with filter()
4. join long sheet 1 and filtered sheet2 with inner_join()
5. summarise() the drug codes and names after group_by(patient_id)
6. count() the number of drug codes.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2017-04-24 7:07 GMT+02:00 abo dalash <abo_dlsh op hotmail.com>:

> Hi there
>
> I have data set with 500,000 patients (rows) and the first column is
> Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20 so
> each row starts with the patient ID and the remaining of cells in the row
> are codes for names of the treatments taken by the patient. Number of
> treatments differ between patients. For example, there are patients with 3
> treatments only and patients with 20 drugs. The unique number of treatments
> in the entire data set is about 6700 drugs. However, I'm interested in
> studying only 128 drugs, these drugs are listed in a second sheet as code
> numbers associated with their meanings (names of drugs representing the
> code). I'm interested in identifying the most frequently used DRUG
> COMBINATIONS between only the 128 drugs among the 6700 drugs. The structure
> of the Excell file to be used in analysis is like this:
>
> -Sheet 1( the entire data set):-
> 1   Patient ID    Drug1    Drug2 ....    Drug 20.
> 2  1125            45            46                55
> 3  1126              60           55                 45
> .
> .
> 500,000
>
> -Sheet 2 (list of codes meanings for only the drugs of interest):
>
> 1    Drug code             meaning
> 2    45                            Simvastatin
> 3    55                            Aspirin
> 4    60                            Paracetamol
> .
> 128
>
> The desired output I'm looking for :
>
>  Drug codes      Meaning          Frequency
> 45+55               Simvastatin              2
>                          +Aspirin
> 60+55             Aspirin+                       1
>                         Paracetamol
> 60+45             Simvastatin+                1
>                         Paracetamol
>
> Please note the the final output does not include any combination
> containing drug 46 as this is not in the list of drugs preferred to be
> studied which are mentioned in sheet 2.
>
> Could you please help me which R codes and packages should be used to run
> this analyisis?
>
> Regards
> Mustafa
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help op r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list