[R] Query about calculating the monthly average of daily data columns

Sat Sep 14 02:35:29 CEST 2019

Sorry, forgot to include the list.

On Sat, Sep 14, 2019 at 10:27 AM Jim Lemon <drjimlemon using gmail.com> wrote:
>
> See inline
>
> On Fri, Sep 13, 2019 at 11:20 PM Subhamitra Patra <subhamitra.patra using gmail.com> wrote:
>>
>> Dear Sir,
>>
>> Yes, I understand the logic, but I still have a few queries, which I have added below your answers.
>>
>>> "# if you only have to get the monthly averages, it can be done this way
>>> spdat$month<-sapply(strsplit(spdat$dates,"-"),"[",2)
>>> spdat$year<-sapply(strsplit(spdat$dates,"-"),"[",3)"
>>>
>>> B. Here, I need to define the number of months and years separately, right? Or else, what do the 2 and 3 (in bold) indicate?
>>
>>
>> To get the grouping variable of sequential months that you want, you only need the month and year values of the dates in the first column. First I used the "strsplit" function to split the date field at the hyphens, then used "sapply" to extract ("[") the second (month) and third (year) parts as two new columns. Because you have more than one year of data, you need the year values or you will group all Januarys, all Februarys and so on. Notice how I pass both of the new columns as a list (a data frame is a type of list) in the call to get the mean of each month.
>>
>> 1. Here, as per my understanding, the "3" indicates the 3rd year, right? But you showed an average for 2 months of the same year. Then what does the "3" in the spdat$year object indicate?
>
>
> No, as I explained in the initial email and below, the "strsplit" function takes one or more strings (your dates) and breaks them at the specified character ("-"). So
>
> strsplit("1-1-1994","-")
> [[1]]
> [1] "1"    "1"    "1994"
>
> That is passed to the "sapply" function that applies the extraction ("[") operator to the result of "strsplit". The "3" indicates that you want to extract the third element, in this case, the year.
>
> > sapply(strsplit("1-1-1994","-"),"[",3)
> [1] "1994"
>
> So by splitting the dates and extracting the second (month) and third (year) element from each date, we have all the information needed to create a grouping variable for monthly averages.
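The same two steps can be sketched on a whole vector of dates at once. This is only an illustration: the dates below are made up, and the names "parts", "month", "year" and "group" are placeholders, not anything from the original data file.

```r
# Illustrative dates only; the real ones come from the data file
dates <- c("3-1-1994", "4-1-1994", "1-2-1994", "2-1-1995")

# Split each date at the hyphens; the result is a list of character vectors
parts <- strsplit(dates, "-")

# Take the 2nd (month) and 3rd (year) element of every split date
month <- sapply(parts, "[", 2)
year <- sapply(parts, "[", 3)

# One label per year-month combination, so that January 1994 and
# January 1995 end up in different groups
group <- paste(year, month, sep = "-")
print(group)
```

Pasting year and month together is just one way to see the grouping; passing the two columns as a list to the averaging function has the same effect.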
>
>>
>>
>>> C. From this part, I got the exact average values of both January and February of 1994 for countries A and B. But I have a query about the code: I need to define spdat$returnA and spdat$returnB separately before writing this code, right? In the same way, I need to define each of the 84 countries separately, with their respective number of months and years, before writing this code, right?
>>
>>
>> I don't think so. Because I don't know what your data looks like, I am guessing that for each row, it has columns for each of the 84 countries. I don't know what these columns are named, either. Maybe:
>>
>> date         Australia   Belarus   ...   Zambia
>> 01/01/1994   20          21              22
>> ...
>>
>> Here I was wrong because I misunderstood the code. But the data structure you guessed is exactly right: each row has a column for each of the 84 countries. So I think I need to define the date column, with the number of months and years, only once for all the countries. That answers the first and third questions in my previous email: as you suggested, I do not need to define a column for each country, because the dates and the number of observations are the same for all countries. However, the number of days differs from month to month and from year to year, so I think I need to define the dates for each year separately. I have therefore written the following code as an example covering 12 months for 2 years (1994 and 1995). Please correct me if I am wrong.
>>
>>  spdat<-data.frame(
>>   dates=paste(c(1:21,1:20,1:23,1:21,1:22,1:22,1:21,1:23,1:22,1:21,1:22,1:22),c(rep(1,21),rep(2,20),rep(3,23), rep(4,21), rep(5,22),rep(6,22),rep(7,21),rep(8,23),rep(9,22),rep(10,21),rep(11,22),rep(12,22)),rep(1994,260)
>>  dates1=paste(c(1:22,1:20,1:23,1:20,1:23,1:22,1:21,1:23,1:21,1:22,1:22,1:21),c(rep(1,22),rep(2,20),rep(3,23), rep(4,20), rep(5,23),rep(6,22),rep(7,21),rep(8,23),rep(9,21),rep(10,21),rep(11,22),rep(12,21)),rep(1995,259) ,sep="-")
>>
> First, you don't have to recreate the data that you already have. I did because I don't have it and have to guess what it looks like. Remember neither I nor any of the others who have offered help have your data or even a representative sample. If you tried the code above, you surely must know that it doesn't work. I could create code that would produce the dates from 1-1-1994 to 31-12-1995 or any other stretch you would like, but it would only confuse you more. _You already have the dates in your data file._ What I have shown you is how to use those dates to create the grouping variable that you want.
>
>> Concerning exporting the structure of the dataset to Excel, I will have a 12*84 matrix. But please suggest how to proceed for the large sample. Below is what I understood from your code. Please correct me if I am wrong.
>> 1. I need to define the dates for each year, as the number of days in each month differs from year to year (as in my code above). For instance, in my data file, Jan 1994 has 21 days while Jan 1995 has 22 days.
>> 2. I need to define the date column as character.
>> 3. I need to define the monthly average for each month and year. So the code will now be as follows.
>> spdat$month<-sapply(strsplit(spdat$dates,"-"),"[",2,3,4,5,6,7,8,9,10,11,12)    %%%%As I need all months average sequentially.
>> spdat$year<-sapply(strsplit(spdat$dates,"-"),"[",3)
>>
>> Here, I am really unable to understand the meaning of "3".
>
>
> You have missed the point here, as above. I didn't mean to suggest that you had to recreate the dates that you already have. What I did was to show how you could use the dates that you already have to create a grouping variable for your calculation.
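A small sketch may help show why the number of days per month never has to be typed in. It assumes a data frame called spdat with a character column "dates" in day-month-year form and one return column; the values are invented for illustration and "returnA" simply follows the name used earlier in the thread.

```r
# Toy data with unequal numbers of rows per month (hypothetical values)
spdat <- data.frame(
  dates = c("3-1-1994", "4-1-1994", "5-1-1994",
            "1-2-1994", "2-2-1994", "3-1-1995"),
  returnA = c(0.1, 0.3, 0.2, 0.4, 0.6, 0.5),
  stringsAsFactors = FALSE  # strsplit needs character strings, not factors
)

# Month and year taken from the dates that are already in the data
parts <- strsplit(spdat$dates, "-")
spdat$month <- as.integer(sapply(parts, "[", 2))
spdat$year <- as.integer(sapply(parts, "[", 3))

# The number of rows in each year/month falls out of the data itself
days_per_month <- table(spdat$year, spdat$month)
print(days_per_month)

# The mean is taken over however many rows each year/month happens to have
monthly_mean <- tapply(spdat$returnA, list(spdat$year, spdat$month), mean)
print(monthly_mean)
```

A year/month combination with no rows at all (February 1995 in this toy example) shows up as a count of 0 and a mean of NA.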
>>
>> 4. I need to define each country with each month and year, as in the last part of your code.
>>
> What I did was to add the month and year of each row as two separate columns of data. You should be able to see that by looking at spdat after the "strsplit/sapply" operation. Then you have in each row the returns from your 84 countries _and_ the month/year for that row. When I used the "by" function to get the monthly means from spdat, I showed you how to use the same code on your data frame, after creating the month and year columns, to get the monthly average return for the 84 countries for as many years as you have. As this should return a matrix of successive months as rows and countries as columns, you should easily be able to import this into Excel.
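As a rough sketch of the whole sequence, the code below uses "aggregate" rather than the "by" call from the earlier message, because it returns a plain data frame that is easy to write out. Everything in it is an assumption for illustration: the two country columns stand in for the 84, the values are invented, and the output file is a temporary placeholder path.

```r
# Hypothetical data: two country columns standing in for the 84
spdat <- data.frame(
  dates = c("3-1-1994", "4-1-1994", "1-2-1994",
            "2-2-1994", "3-1-1995", "4-1-1995"),
  Australia = c(1, 3, 5, 7, 9, 11),
  Belarus = c(2, 4, 6, 8, 10, 12),
  stringsAsFactors = FALSE
)

# Add the month and year of each row as two new columns
parts <- strsplit(spdat$dates, "-")
spdat$month <- as.integer(sapply(parts, "[", 2))
spdat$year <- as.integer(sapply(parts, "[", 3))

# Every column that is not a date or grouping column is treated as a country
countries <- setdiff(names(spdat), c("dates", "month", "year"))

# Mean of each country column within each year/month combination
monthly <- aggregate(spdat[countries],
                     by = list(year = spdat$year, month = spdat$month),
                     FUN = mean)

# Put the rows in calendar order: successive months as rows
monthly <- monthly[order(monthly$year, monthly$month), ]
rownames(monthly) <- NULL
print(monthly)

# Write a CSV file that Excel can open (placeholder path)
out_file <- tempfile(fileext = ".csv")
write.csv(monthly, out_file, row.names = FALSE)
```

Because the country columns are picked up by name, the same lines should work unchanged whether there are 2 countries or 84, as long as the date column is called "dates".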
>
> The reason I did it this way was to illustrate how to define the grouping variable from the existing information and perform an easily understood calculation. I thought that using methods that automatically perform the operations I used might allow you to get the result without understanding how you had gotten it. Like the data I didn't have, I took a guess at how much my example would help you understand what was happening and give you the skills to do it yourself. I do hope that you get to the point where you are able to think of my example as unsophisticated, for I could have done it with dplyr/ts/reshape and the rest.
>
> Jim