[R] Query about calculating the monthly average of daily data columns

Fri Sep 13 15:33:11 CEST 2019

Hi

I am almost 100% sure that you would spare yourself much trouble if you changed your date column to real date

?as.Date

reshape your wide format to long one
library(reshape2)
?melt

to get 3 column data.frame with one date column, one country column and one value column

use ?aggregate and ?format to get summary value

something like
aggregate(value column, list(format(date column, "%m.%Y"), country column), mean)

But if you insist to scratch your left ear with right hand accross your head, you could continue your way.

Cheers
Petr

> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Subhamitra
> Patra
> Sent: Friday, September 13, 2019 3:20 PM
> To: Jim Lemon <drjimlemon using gmail.com>; r-help mailing list <r-help using r-
> project.org>
> Subject: Re: [R] Query about calculating the monthly average of daily data
> columns
>
> Dear Sir,
>
> Yes, I understood the logic. But, still, I have a few queries that I mentioned
> below your answers.
>
> "# if you only have to get the monthly averages, it can be done this way
> > spdat$month<-sapply(strsplit(spdat$dates,"-"),"["*,2*)
> > spdat$year<-sapply(strsplit(spdat$dates,"-"),"[",*3*)"
> >
> > B. Here, I need to define the no. of months, and years separately, right?
> > or else what 2, and 3 (in bold) indicates?
> >
>
> To get the grouping variable of sequential months that you want, you only
> need the month and year values of the dates in the first column. First I used
> the "strsplit" function to split the date field at the hyphens, then used
> "sapply" to extract ("[") the second (month) and *third (year)* parts as two
> new columns. Because you have more than one year of data, you need the
> year values or you will group all Januarys, all Februarys and so on.
> Notice how I pass both of the new columns as a list (a data frame is a type of
> list) in the call to get the mean of each month.
>
> 1. Here, as per my understanding, the "3" indicates the 3rd year, right?
> But, you showed an average for 2 months of the same year. Then, what "3"
> in the  spdat$year object indicate?
>
>
> C. From this part, I got the exact average values of both January and
> > February of 1994 for country A, and B. But, in code, I have a query
> > that I need to define  spdat$returnA, and  spdat$returnB separately
> > before writing this code, right? Like this, I need to define for each
> > 84 countries separately with their respective number of months, and
> > years before writing this code, right?
> >
>
> I don't think so. Because I don't know what your data looks like, I am
> guessing that for each row, it has columns for each of the 84 countries. I
> don't know what these columns are named, either. Maybe:
>
> date             Australia   Belarus   ...    Zambia
> 01/01/1994   20             21                 22
> ...
>
> Here, due to my misunderstanding about the code, I was wrong. But, what
> data structure you guessed, it is absolutely right that for each row, I have
> columns for each of the 84 countries. So, I think, I need to define the date
> column with no. of months, and years once for all the countries.
> Therefore, I got my answer to the first and third question in the previous
> email (what you suggested) that I no need to define the column of each
> country, as the date, and no. of observations are same for all countries.
> But, the no. of days are different for each month, and similarly, for each
> year. So, I think I need to define date for each year separately.  Hence, I have
> given an example of 12 months, for 2 years (i.e. 1994, and 1995), and have
> written the following code. Please correct me in case I am wrong.
>
>  spdat<-data.frame(
>
> dates=paste(c(1:21,1:20,1:23,1:21,1:22,1:22,1:21,1:23,1:22,1:21,1:22,1:22),c(r
> ep(1,21),rep(2,20),
> rep(3,23), rep(4,21),
> rep(5,22),rep(6,22),rep(7,21),rep(8,23),rep(9,22),rep(10,21),rep(11,22),rep(12
> ,22)
> ),rep(1994,260)
>  dates1=
> paste(c(1:22,1:20,1:23,1:20,1:23,1:22,1:21,1:23,1:21,1:22,1:22,1:21),c(rep(1,2
> 2),rep(2,20),
> rep(3,23), rep(4,20),
> rep(5,23),rep(6,22),rep(7,21),rep(8,23),rep(9,21),rep(10,21),rep(11,22),rep(12
> ,21)
> ),rep(1995,259) ,sep="-")
>
> Concerning the exporting of structure of the dataset to excel, I will have
> 12*84 matrix. But, please suggest me the way to proceed for the large
> sample. I have mentioned below what I understood from your code. Please
> correct me if I am wrong.
> 1. I need to define the date for each year as the no. of days in each month
> are different for each year (as mentioned in my above code). For instance, in
> my data file, Jan 1994 has 21 days while Jan 1995 has 22 days.
> 2. Need to define the date column as character.
> 3. Need to define the monthly average for each month, and year. So, now
> code will be as follows.
> spdat$month<-sapply(strsplit(spdat$dates,"-"),"[",2,3,4,5,6,7,8,9,10,11,12)
>   %%%%As I need all months average sequentially.
> spdat$year<-sapply(strsplit(spdat$dates,"-"),"[",3)
>
> Here, this meaning of "3", I am really unable to get.
>
> 4. Need to define each country with each month and year as mentioned in
> the last part of your code.
>
> Please suggest me in this regard.
>
> Thank you.
>
>
>
>
>
>
>
> [image: Mailtrack]
> <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_ca
> mpaign=signaturevirality5&>
> Sender
> notified by
> Mailtrack
> <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_ca
> mpaign=signaturevirality5&>
> 09/13/19,
> 06:41:41 PM
>
> On Fri, Sep 13, 2019 at 4:24 PM Jim Lemon <drjimlemon using gmail.com> wrote:
>
> > Hi Subhamitra,
> > I'll try to write my answers adjacent to your questions below.
> >
> > On Fri, Sep 13, 2019 at 6:08 PM Subhamitra Patra <
> > subhamitra.patra using gmail.com> wrote:
> >
> >> Dear Sir,
> >>
> >> Thank you very much for your suggestion.
> >>
> >> Yes, your suggested code worked. But, actually, I have data from 3rd
> >> January 1994 to 3rd August 2017 for very large (i.e. for 84
> >> countries) sample. From this, I have given the example of the years
> >> up to 2000. Before applying the same code for the long 24 years, I
> >> want to learn the logic behind the code. Actually, some part of the
> >> code is not understandable to me which I mentioned in the bold letter as
> follows.
> >>
> >> "spdat<-data.frame(
> >>   dates=paste(c(1:30,1:28),c(rep(1,30),rep(2,28)),rep(1994,58),sep="-"),
> >>   returnA=sample(*15:50*,58,TRUE),returnB=sample(*10:45*,58,TRUE))"
> >>
> >> A. Here, I need to define the no. of days in a month, and the no. of
> >> countries name separately, right? But, what is meant by 15:50, and
> >> 10:45 in return A, and B respectively?
> >>
> >
> > To paraphrase Donald Trump, this is FAKE DATA! I have no idea what the
> > real values of return are, so I made them up using the "sample" function.
> > However, this is not meant to mislead anyone, just to show how
> > whatever numbers are in your data can be used in calculations. The
> > colon (":") operator creates a sequence of numbers starting with the
> > one to the left and ending with the one to the right.
> >
> >>
> >> "# if you only have to get the monthly averages, it can be done this
> >> way
> >> spdat$month<-sapply(strsplit(spdat$dates,"-"),"["*,2*)
> >> spdat$year<-sapply(strsplit(spdat$dates,"-"),"[",*3*)"
> >>
> >> B. Here, I need to define the no. of months, and years separately, right?
> >> or else what 2, and 3 (in bold) indicates?
> >>
> >
> > To get the grouping variable of sequential months that you want, you
> > only need the month and year values of the dates in the first column.
> > First I used the "strsplit" function to split the date field at the
> > hyphens, then used "sapply" to extract ("[") the second (month) and
> > third (year) parts as two new columns. Because you have more than one
> > year of data, you need the year values or you will group all Januarys,
> > all Februarys and so on. Notice how I pass both of the new columns as
> > a list (a data frame is a type of
> > list) in the call to get the mean of each month.
> >
> >>
> >> "# get the averages by month and year - is this correct?
> >> monthlyA<-by(*spdat$returnA*,spdat[,c("month","year")],mean)
> >> monthlyB<-by(*spdat$returnB*,spdat[,c("month","year")],mean)"
> >>
> >> C. From this part, I got the exact average values of both January and
> >> February of 1994 for country A, and B. But, in code, I have a query
> >> that I need to define  spdat$returnA, and  spdat$returnB separately
> >> before writing this code, right? Like this, I need to define for each
> >> 84 countries separately with their respective number of months, and
> >> years before writing this code, right?
> >>
> >
> > I don't think so. Because I don't know what your data looks like, I am
> > guessing that for each row, it has columns for each of the 84
> > countries. I don't know what these columns are named, either. Maybe:
> >
> > date             Australia   Belarus   ...    Zambia
> > 01/01/1994   20             21                 22
> > ...
> >
> >
> >> Yes, after obtaining the monthly average for each country's data, I
> >> need to use them for further calculations. So, I want to export the
> >> result to excel. But, until understanding the code, I think I willn't
> >> able to apply for the entire sample, and cannot be able to discuss
> >> the format of the resulted column to export to excel.
> >>
> >
> > Say that we perform the grouped mean calculation for the first two
> > country columns like this:
> > monmeans<-sapply(spdat[,2:3],by,spdat[,c("month","year")],mean)
> > monmeans
> >     Australia  Belarus
> > [1,]  29.70000 30.43333
> > [2,]  34.17857 27.39286
> >
> > We are presented with a 2x2 matrix of monthly means in just the format
> > someone might use for importing into Excel. The first row is January
> > 1994, the second February 1994 and so on. By expanding the columns to
> > include all the countries in your data, You should have the result you want.
> >
> > Jim
> >
>
>
> --
> *Best Regards,*
> *Subhamitra Patra*
> *Phd. Research Scholar*
> *Department of Humanities and Social Sciences* *Indian Institute of
> Technology, Kharagpur*
> *INDIA*
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
Osobní údaje: Informace o zpracování a ochraně osobních údajů obchodních partnerů PRECHEZA a.s. jsou zveřejněny na: https://www.precheza.cz/zasady-ochrany-osobnich-udaju/ | Information about processing and protection of business partner’s personal data are available on website: https://www.precheza.cz/en/personal-data-protection-principles/
Důvěrnost: Tento e-mail a jakékoliv k němu připojené dokumenty jsou důvěrné a podléhají tomuto právně závaznému prohláąení o vyloučení odpovědnosti: https://www.precheza.cz/01-dovetek/ | This email and any documents attached to it may be confidential and are subject to the legally binding disclaimer: https://www.precheza.cz/en/01-disclaimer/