[R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Fri Jun 10 21:19:10 CEST 2022

Hi Michael,

Thanks for your input on this. It's appreciated!

 I think I did not make this very clear in my post here, but I am
trying to model day-to-day variation in deaths and temperature. In
other words, if people get exposed to high temperature today, are
there increased numbers of deaths over the following 6 days? Because
I'm trying to model day-to-day variation, I'm not interested in
seasonal effects (it's a nuisance variable), so I am trying to remove
seasonality.There's a somewhat better post of my problem on Coss
Validated: https://stats.stackexchange.com/questions/577790/high-concurvity-collinearity-between-time-and-temperature-in-gam-predicting-dea

The deaths on the 15th of each month are huge errors. The data is from
a low-income country with less than adequate registration of deaths.
This data was collected by interviewers visiting housholds at half
yearly, or even yearly (or larger) intervals and asking if any deaths
had occurred over this period. If people could not remember the
specific date on which a death had occurred, it was - as a rule -
registered on the 15th of the month in which they though the death had
occurred. So the number of deaths on the 15th of each month does not
contribute anything useful to the model. It bears no relation to the
temperature on that day. It carries seasonal information, but I'm not
interested in that, and in fact need to remove that.  Hence why I
modelled those days as a factor variable with 169 levels (14 years of
data x 12 months +1 for all non-heaping days). Because there is only
one data point for each day, the idea was that this would remove any
effect of heaping. Following on from Simon Wood's comments, I'm now
thinking about using interpolation of deaths on those days, rather
than trying to control for heaping in the model. Removing rows of data
for each 15th day of the month doesn't seem right to me, because the
model includes distributed lag. Because of this distributed lag I am
still interested in the temperature on the 15th of each month, because
that they may have lagged effect over the following few days. I just
want to get rid of the error in deaths on the 15th, which I think I
can do by interpolation.

The ACF showed a peak at 15 (albeit barely significant). I don't
understand why that would happen given that heap as a factor with 169
levels should have effectively removed the effect of heaping, but it's
too much of a coincidence, so hopefully interpolation will help with
this too.

As for the concurvity between time and temperature, am going to see if
Simon's suggestion of adding an autoregressive term AR1 term does
anything to reduce it.

Thanks to both you and Simon for your input!

Jade

On Fri, 10 Jun 2022 at 16:45, Michael Dewey <lists using dewey.myzen.co.uk> wrote:
>
> Dear Jade
>
> It seems to me that there are several issues here.
>
> 1 - if you fit a separate parameter for each heaping day you efectively
> remove them from the model altogether. If they do carry meaningful
> information then that is undesirable.
>
> 2 - if the reason that there ae so many 15s is because people state,
> "Oh, it was in July" and either they or the interviewere impute 15
> because it si the month mid-point then it begins to look as though
> attempting to use daily data is fraught with problems and monthly might
> be better.
>
> 3 - if you fit a binary heap variable which is zero except for the 15 of
> the month and 1 for the 15 then you are taking out any general overall
> tendency to report 15 but you are allowing for the possibility that the
> relative size of heaping days can vary during the study period.
>
> I am not an expert on this sort of model, that is Simon clearly, but it
> seems to me that, as usual, we have a mix of statistical and scientific
> questions here and you may need to rethink your sceientific model in the
> light of the issue with your data.
>
> Michael
>
> On 10/06/2022 10:30, jade.shodan using googlemail.com wrote:
> > Hi Michael,
> >
> > I don't think my reply to your email came through to the list, so am
> > resending (see below). Problems with subscription have now hopefully
> > been resolved. Apologies if this is a double posting!
> >
> > On Thu, 9 Jun 2022 at 15:27, jade.shodan using googlemail.com
> > <jade.shodan using googlemail.com> wrote:
> >>
> >> Hi Michael,
> >>
> >> Thanks for the reply! When I ran the gam  with the gam() function, the
> >> model worked fine with heap having 169 levels. The same model with
> >> bam() however, fails.  I don't understand the difference between bam()
> >> and gam() at all (other than computational efficiency), but could the
> >> fact that each level only has 1 data point be the reason for it?
> >>
> >> These heaping days are very large measurement errors I need to get rid
> >> off, so I just want to take them out of the model altogether. (My data
> >> is quite noisy already, because date of death is based on memory
> >> recall, rather than formal death registration, due to the data being
> >> from a low income country). Median of deaths is approx. 2 per day, but
> >> on heaping days it can be as high as 50 or so.
> >>
> >> My understanding was that coding with 169 levels would effectively
> >> take these measurements out of the model (but do correct me if I'm
> >> wrong!)  I originally coded heap as a binary variable with 0 for
> >> non-heaping days and 1 for heaping days, but was told that meant that
> >> I was assuming the effect was the same for all heaping days. If I
> >> coded with 12 or 14 levels, wouldn't that leave a lot of noise in the
> >> data?
> >>
> >> Jade
> >>
> >> On Thu, 9 Jun 2022 at 14:52, Michael Dewey <lists using dewey.myzen.co.uk> wrote:
> >>>
> >>> Dear Jade
> >>>
> >>> Do you really need to fit a separate parameter for each heaping day? Can
> >>> you not just make it a binary predictor or a categorical one with fewer
> >>> levels, perhaps 14 (for heaping in each year) or 12 (for each calendar
> >>> month). I have no idea whether that would help but it seems worth a try.
> >>>
> >>> Michael
> >>>
> >>> On 08/06/2022 18:15, jade.shodan--- via R-help wrote:
> >>>> Hi Simon,
> >>>>
> >>>> Thanks so much for this!! I have two follow up questions, if you don't mind.
> >>>>
> >>>> 1. Does including an autoregressive term not adjust away part of the
> >>>> effect of the response in a distributed lag model (where the outcome
> >>>> accumulates over time)?
> >>>> 2. I've tried to fit a model using bam (just a first attempt without
> >>>> AR term), but including the factor variable heap creates errors:
> >>>>
> >>>> bam0 <- bam(deaths~te(year, month, week, weekday,
> >>>> bs=c("cr","cc","cc","cc"), k = c(4,5,5,5)) + heap +
> >>>>                         te(temp_max, lag, k=c(8, 3)) +
> >>>>                         te(precip_daily_total, lag, k=c(8, 3)),
> >>>>                         data = dat, family = nb, method = 'fREML',
> >>>> select = TRUE, discrete = TRUE,
> >>>>                         knots = list(month = c(0.5, 12.5), week = c(0.5,
> >>>> 52.5), weekday = c(0, 6.5)))
> >>>>
> >>>> This model results in errors:
> >>>>
> >>>> Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w,  :
> >>>>     step failure in theta estimation
> >>>> Warning in sqrt(family$dev.resids(object$y, object$fitted.values,
> >>>> object$prior.weights)) :
> >>>>     NaNs produced
> >>>>
> >>>>
> >>>> Including heap as as.numeric(heap) runs the model without error
> >>>> messages or warnings, but model diagnostics look terrible, and it also
> >>>> doesn't make sense (to me) to make heap a numeric. The factor variable
> >>>> heap (with 169 levels) codes the fact that all deaths for which no
> >>>> date was known, were registered on the 15th day of each month. I've
> >>>> coded all non-heaping days as 0. All heaping days were coded as a
> >>>> value between 1-168. The time series spans 14 years, so a heaping day
> >>>> in each month results in 14*12 levels = 168, plus one level for
> >>>> non-heaping days.
> >>>>
> >>>> So my second question is: Does bam allow factor variables? And if not,
> >>>> how should I model this heaping on the 15th day of the month instead?
> >>>>
> >>>> With thanks,
> >>>>
> >>>> Jade
> >>>>
> >>>> On Wed, 8 Jun 2022 at 12:05, Simon Wood <simon.wood using bath.edu> wrote:
> >>>>>
> >>>>> I would not worry too much about high concurvity between variables like
> >>>>> temperature and time. This just reflects the fact that temperature has a
> >>>>> strong temporal pattern.
> >>>>>
> >>>>> I would also not be too worried about the low p-values on the k check.
> >>>>> The check only looks for pattern in the residuals when they are ordered
> >>>>> with respect to the variables of a smooth. When you have time series
> >>>>> data and some smooths involve time then it's hard not to pick up some
> >>>>> degree of residual auto-correlation, which you often would not want to
> >>>>> model with a higher rank smoother.
> >>>>>
> >>>>> The NAs  for the distributed lag terms just reflect the fact that there
> >>>>> is no obvious way to order the residuals w.r.t. the covariates for such
> >>>>> terms, so the simple check for residual pattern is not really possible.
> >>>>>
> >>>>> One simple approach is to fit using bam(...,discrete=TRUE) which will
> >>>>> let you specify an AR1 parameter to mop up some of the residual
> >>>>> auto-correlation without resorting to a high rank smooth that then does
> >>>>> all the work of the covariates as well. The AR1 parameter can be set by
> >>>>> looking at the ACF of the residuals of the model without this. You need
> >>>>> to look at the ACF of suitably standardized residuals to check how well
> >>>>> this has worked.
> >>>>>
> >>>>> best,
> >>>>>
> >>>>> Simon
> >>>>>
> >>>>> On 05/06/2022 20:01, jade.shodan--- via R-help wrote:
> >>>>>> Hello everyone,
> >>>>>>
> >>>>>> A few days ago I asked a question about concurvity in a GAM (the
> >>>>>> anologue of collinearity in a GLM) implemented in mgcv. I think my
> >>>>>> question was a bit unfocussed, so I am retrying again, but with
> >>>>>> additional information included about the autocorrelation function. I
> >>>>>> have also posted about this on Cross Validated. Given all the model
> >>>>>> output, it might make for easier
> >>>>>> reading:https://stats.stackexchange.com/questions/577790/high-concurvity-collinearity-between-time-and-temperature-in-gam-predicting-dea
> >>>>>>
> >>>>>> As mentioned previously, I have problems with concurvity in my thesis
> >>>>>> research, and don't have access to a statistician who works with time
> >>>>>> series, GAMs or R. I'd be very grateful for any (partial) answer,
> >>>>>> however short. I'll gladly return the favour where I can! For really
> >>>>>> helpful input I'd be more than happy to offer co-authorship on
> >>>>>> publication. Deadlines are very close, and I'm heading towards having
> >>>>>> no results at all if I can't solve this concurvity issue :(
> >>>>>>
> >>>>>> I'm using GAMs to try to understand the relationship between deaths
> >>>>>> and heat-related variables (e.g. temperature and humidity), using
> >>>>>> daily time series over a 14-year period from a tropical, low-income
> >>>>>> country. My aim is to understand the relationship between these
> >>>>>> variables and deaths, rather than pure prediction performance.
> >>>>>>
> >>>>>> The GAMs include distributed lag models (set up as 7-column matrices,
> >>>>>> see code at bottom of post), since deaths may occur over several days
> >>>>>> following exposure.
> >>>>>>
> >>>>>> Simple GAMs with just time, lagged temperature and lagged
> >>>>>> precipitation (a potential confounder) show very high concurvity
> >>>>>> between lagged temperature and time, regardless of the many different
> >>>>>> ways I have tried to decompose time. The autocorrelation functions
> >>>>>> (ACF) however, shows values close to zero, only just about breaching
> >>>>>> the 'significance line' in a few instances. It does show patterning
> >>>>>> though, although the regularity is difficult to define.
> >>>>>>
> >>>>>> My questions are:
> >>>>>> 1) Should I be worried about the high concurvity, or can I ignore it
> >>>>>> given the mostly non-significant ACF? I've read dozens of
> >>>>>> heat-mortality modelling studies and none report on concurvity between
> >>>>>> weather variables and time (though one 2012 paper discussed
> >>>>>> autocorrelation).
> >>>>>>
> >>>>>> 2) If I cannot ignore it, what should I do to resolve it? Would
> >>>>>> including an autoregressive term be appropriate, and if so, where can
> >>>>>> I find a coded example of how to do this? I've also come across
> >>>>>> sequential regression][1]. Would this be more or less appropriate? If
> >>>>>> appropriate, a pointer to an example would be really appreciated!
> >>>>>>
> >>>>>> Some example GAMs are specified as follows:
> >>>>>> ```r
> >>>>>> conc38b <- gam(deaths~te(year, month, week, weekday,
> >>>>>> bs=c("cr","cc","cc","cc")) + heap +
> >>>>>>                          te(temp_max, lag, k=c(10, 3)) +
> >>>>>>                          te(precip_daily_total, lag, k=c(10, 3)),
> >>>>>>                          data = dat, family = nb, method = 'REML', select = TRUE,
> >>>>>>                          knots = list(month = c(0.5, 12.5), week = c(0.5,
> >>>>>> 52.5), weekday = c(0, 6.5)))
> >>>>>> ```
> >>>>>> Concurvity for the above model between (temp_max, lag) and (year,
> >>>>>> month, week, weekday) is 0.91:
> >>>>>>
> >>>>>> ```r
> >>>>>> $worst
> >>>>>>                                        para te(year,month,week,weekday)
> >>>>>> te(temp_max,lag) te(precip_daily_total,lag)
> >>>>>> para                        1.000000e+00                1.125625e-29
> >>>>>>         0.3150073                  0.6666348
> >>>>>> te(year,month,week,weekday) 1.400648e-29                1.000000e+00
> >>>>>>         0.9060552                  0.6652313
> >>>>>> te(temp_max,lag)            3.152795e-01                8.998113e-01
> >>>>>>         1.0000000                  0.5781015
> >>>>>> te(precip_daily_total,lag)  6.666348e-01                6.652313e-01
> >>>>>>         0.5805159                  1.0000000
> >>>>>> ```
> >>>>>>
> >>>>>> Output from ```gam.check()```:
> >>>>>> ```r
> >>>>>> Method: REML   Optimizer: outer newton
> >>>>>> full convergence after 16 iterations.
> >>>>>> Gradient range [-0.01467332,0.003096643]
> >>>>>> (score 8915.994 & scale 1).
> >>>>>> Hessian positive definite, eigenvalue range [5.048053e-05,26.50175].
> >>>>>> Model rank =  544 / 544
> >>>>>>
> >>>>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
> >>>>>> indicate that k is too low, especially if edf is close to k'.
> >>>>>>
> >>>>>>                                      k'      edf k-index p-value
> >>>>>> te(year,month,week,weekday) 319.0000  26.6531    0.96    0.06 .
> >>>>>> te(temp_max,lag)             29.0000   3.3681      NA      NA
> >>>>>> te(precip_daily_total,lag)   27.0000   0.0051      NA      NA
> >>>>>> ---
> >>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>>>> ```
> >>>>>>
> >>>>>> Some output from ```summary(conc38b)```:
> >>>>>> ```r
> >>>>>> Approximate significance of smooth terms:
> >>>>>>                                      edf Ref.df  Chi.sq p-value
> >>>>>> te(year,month,week,weekday) 26.653127    319 166.803 < 2e-16 ***
> >>>>>> te(temp_max,lag)             3.368129     27  11.130 0.00145 **
> >>>>>> te(precip_daily_total,lag)   0.005104     27   0.002 0.69636
> >>>>>> ---
> >>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>>>>
> >>>>>> R-sq.(adj) =  0.839   Deviance explained = 53.3%
> >>>>>> -REML =   8916  Scale est. = 1         n = 5107
> >>>>>> ```
> >>>>>>
> >>>>>>
> >>>>>> Below are the ACF plots (note limit y-axis = 0.1 for clarity of
> >>>>>> pattern). They show peaks at 5 and 15, and then there seems to be a
> >>>>>> recurring pattern at multiples of approx. 30 (suggesting month is not
> >>>>>> modelled adequately?). Not sure what would cause the spikes at 5 and
> >>>>>> 15. There is heaping of deaths on the 15th day of each month, to which
> >>>>>> deaths with unknown date were allocated. This heaping was modelled
> >>>>>> with categorical variable/ factor ```heap``` with 169 levels (0 for
> >>>>>> all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent
> >>>>>> heaping day over the 14-year period):
> >>>>>>
> >>>>>>      [2]: https://i.stack.imgur.com/FzKyM.png
> >>>>>>      [3]: https://i.stack.imgur.com/fE3aL.png
> >>>>>>
> >>>>>>
> >>>>>> I get an identical looking ACF when I decompose time into (year,
> >>>>>> month, monthday) as in model conc39 below, although concurvity between
> >>>>>> (temp_max, lag) and the time term has now dropped somewhat to 0.83:
> >>>>>>
> >>>>>> ```r
> >>>>>> conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
> >>>>>>                         te(temp_max, lag, k=c(10, 4)) +
> >>>>>>                         te(precip_daily_total, lag, k=c(10, 4)),
> >>>>>>                         data = dat, family = nb, method = 'REML', select = TRUE,
> >>>>>>                         knots = list(month = c(0.5, 12.5)))
> >>>>>> ```
> >>>>>> ```r
> >>>>>>
> >>>>>> Method: REML   Optimizer: outer newton
> >>>>>> full convergence after 14 iterations.
> >>>>>> Gradient range [-0.001578187,6.155096e-05]
> >>>>>> (score 8915.763 & scale 1).
> >>>>>> Hessian positive definite, eigenvalue range [1.894391e-05,26.99215].
> >>>>>> Model rank =  323 / 323
> >>>>>>
> >>>>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
> >>>>>> indicate that k is too low, especially if edf is close to k'.
> >>>>>>
> >>>>>>                                    k'     edf k-index p-value
> >>>>>> te(year,month,monthday)    79.0000 25.0437    0.93  <2e-16 ***
> >>>>>> te(temp_max,lag)           39.0000  4.0875      NA      NA
> >>>>>> te(precip_daily_total,lag) 36.0000  0.0107      NA      NA
> >>>>>> ---
> >>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>>>> ```
> >>>>>> Some output from ```summary(conc39)```:
> >>>>>> ```r
> >>>>>> Approximate significance of smooth terms:
> >>>>>>                                    edf Ref.df  Chi.sq  p-value
> >>>>>> te(year,month,monthday)    38.75573     99 187.231  < 2e-16 ***
> >>>>>> te(temp_max,lag)            4.06437     37  25.927 1.66e-06 ***
> >>>>>> te(precip_daily_total,lag)  0.01173     36   0.008    0.557
> >>>>>> ---
> >>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>>>>
> >>>>>> R-sq.(adj) =  0.839   Deviance explained = 53.8%
> >>>>>> -REML =   8915  Scale est. = 1         n = 5107
> >>>>>> ```
> >>>>>>
> >>>>>>
> >>>>>> ```r
> >>>>>> $worst
> >>>>>>                                       para te(year,month,monthday)
> >>>>>> te(temp_max,lag) te(precip_daily_total,lag)
> >>>>>> para                       1.000000e+00            3.261007e-31
> >>>>>> 0.3313549                  0.6666532
> >>>>>> te(year,month,monthday)    3.060763e-31            1.000000e+00
> >>>>>> 0.8266086                  0.5670777
> >>>>>> te(temp_max,lag)           3.331014e-01            8.225942e-01
> >>>>>> 1.0000000                  0.5840875
> >>>>>> te(precip_daily_total,lag) 6.666532e-01            5.670777e-01
> >>>>>> 0.5939380                  1.0000000
> >>>>>> ```
> >>>>>>
> >>>>>> Modelling time as ```te(year, doy)``` with a cyclic spline for doy and
> >>>>>> various choices for k or as ```s(time)``` with various k does not
> >>>>>> reduce concurvity either.
> >>>>>>
> >>>>>>
> >>>>>> The default approach in time series studies of heat-mortality is to
> >>>>>> model time with fixed df, generally between 7-10 df per year of data.
> >>>>>> I am, however, apprehensive about this approach because a) mortality
> >>>>>> profiles vary with locality due to sociodemographic and environmental
> >>>>>> characteristics and b) the choice of df is based on higher income
> >>>>>> countries (where nearly all these studies have been done) with
> >>>>>> different mortality profiles and so may not be appropriate for
> >>>>>> tropical, low-income countries.
> >>>>>>
> >>>>>> Although the approach of fixing (high) df does remove more temporal
> >>>>>> patterns from the ACF (see model and output below), concurvity between
> >>>>>> time and lagged temperature has now risen to 0.99! Moreover,
> >>>>>> temperature (which has been a consistent, highly significant predictor
> >>>>>> in every model of the tens (hundreds?) I have run, has now turned
> >>>>>> non-significant. I am guessing this is because time is now a very
> >>>>>> wiggly function that not only models/ removes seasonal variation, but
> >>>>>> also some of the day-to-day variation that is needed for the
> >>>>>> temperature smooth  :
> >>>>>>
> >>>>>> ```r
> >>>>>> conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
> >>>>>>                          te(temp_max, lag, k=c(10,3)) +
> >>>>>>                          te(precip_daily_total, lag, k=c(10,3)),
> >>>>>>                          data = dat, family = nb, method = 'REML', select = TRUE)
> >>>>>> ```
> >>>>>> Output from ```gam.check(conc20a, rep = 1000)```:
> >>>>>>
> >>>>>> ```r
> >>>>>> Method: REML   Optimizer: outer newton
> >>>>>> full convergence after 9 iterations.
> >>>>>> Gradient range [-0.0008983099,9.546022e-05]
> >>>>>> (score 8750.13 & scale 1).
> >>>>>> Hessian positive definite, eigenvalue range [0.0001420112,15.40832].
> >>>>>> Model rank =  336 / 336
> >>>>>>
> >>>>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
> >>>>>> indicate that k is too low, especially if edf is close to k'.
> >>>>>>
> >>>>>>                                     k'      edf k-index p-value
> >>>>>> s(time)                    111.0000 111.0000    0.98    0.56
> >>>>>> te(temp_max,lag)            29.0000   0.6548      NA      NA
> >>>>>> te(precip_daily_total,lag)  27.0000   0.0046      NA      NA
> >>>>>> ```
> >>>>>> Output from ```concurvity(conc20a, full=FALSE)$worst```:
> >>>>>>
> >>>>>> ```r
> >>>>>>                                       para      s(time) te(temp_max,lag)
> >>>>>> te(precip_daily_total,lag)
> >>>>>> para                       1.000000e+00 2.462064e-19        0.3165236
> >>>>>>                    0.6666348
> >>>>>> s(time)                    2.462398e-19 1.000000e+00        0.9930674
> >>>>>>                    0.6879284
> >>>>>> te(temp_max,lag)           3.170844e-01 9.356384e-01        1.0000000
> >>>>>>                    0.5788711
> >>>>>> te(precip_daily_total,lag) 6.666348e-01 6.879284e-01        0.5788381
> >>>>>>                    1.0000000
> >>>>>>
> >>>>>> ```
> >>>>>>
> >>>>>> Some output from ```summary(conc20a)```:
> >>>>>> ```r
> >>>>>> Approximate significance of smooth terms:
> >>>>>>                                     edf Ref.df  Chi.sq p-value
> >>>>>> s(time)                    1.110e+02    111 419.375  <2e-16 ***
> >>>>>> te(temp_max,lag)           6.548e-01     27   0.895   0.249
> >>>>>> te(precip_daily_total,lag) 4.598e-03     27   0.002   0.868
> >>>>>> ---
> >>>>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>>>>
> >>>>>> R-sq.(adj) =  0.843   Deviance explained = 56.1%
> >>>>>> -REML = 8750.1  Scale est. = 1         n = 5107
> >>>>>> ```
> >>>>>>
> >>>>>> ACF functions:
> >>>>>>
> >>>>>> [4]: https://i.stack.imgur.com/7nbXS.png
> >>>>>> [5]: https://i.stack.imgur.com/pNnZU.png
> >>>>>>
> >>>>>> Data can be found on my [GitHub][6] site in the file
> >>>>>> [data_cross_validated_post2.rds][7]. A csv version is also available.
> >>>>>> This is my code:
> >>>>>>
> >>>>>> ```r
> >>>>>> library(readr)
> >>>>>> library(mgcv)
> >>>>>>
> >>>>>> df <- read_rds("data_crossvalidated_post2.rds")
> >>>>>>
> >>>>>> # Create matrices for lagged weather variables (6 day lags) based on
> >>>>>> example by Simon Wood
> >>>>>> # in his 2017 book ("Generalized additive models: an introduction with
> >>>>>> R", p. 349) and
> >>>>>> # gamair package documentation
> >>>>>> (https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54)
> >>>>>>
> >>>>>> lagard <- function(x,n.lag=7) {
> >>>>>> n <- length(x); X <- matrix(NA,n,n.lag)
> >>>>>> for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1]
> >>>>>> X
> >>>>>> }
> >>>>>>
> >>>>>> dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE),
> >>>>>> deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month,
> >>>>>> weekday = df$weekday, week = df$week, monthday = df$monthday, time =
> >>>>>> df$time, heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum
> >>>>>> = df$precip_hourly_dailysum)
> >>>>>> dat$temp_max <- lagard(df$temp_max)
> >>>>>> dat$temp_min <- lagard(df$temp_min)
> >>>>>> dat$temp_mean <- lagard(df$temp_mean)
> >>>>>> dat$wbgt_max <- lagard(df$wbgt_max)
> >>>>>> dat$wbgt_mean <- lagard(df$wbgt_mean)
> >>>>>> dat$wbgt_min <- lagard(df$wbgt_min)
> >>>>>> dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max)
> >>>>>> dat$sh_mean <- lagard(df$sh_mean)
> >>>>>> dat$solar_mean <- lagard(df$solar_mean)
> >>>>>> dat$wind2m_mean <- lagard(df$wind2m_mean)
> >>>>>> dat$sh_max <- lagard(df$sh_max)
> >>>>>> dat$solar_max <- lagard(df$solar_max)
> >>>>>> dat$wind2m_max <- lagard(df$wind2m_max)
> >>>>>> dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean)
> >>>>>> dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum)
> >>>>>> dat$precip_hourly <- lagard(df$precip_hourly)
> >>>>>> dat$precip_daily_total <- lagard( df$precip_daily_total)
> >>>>>> dat$temp <- lagard(df$temp)
> >>>>>> dat$sh <- lagard(df$sh)
> >>>>>> dat$rh <- lagard(df$rh)
> >>>>>> dat$solar <- lagard(df$solar)
> >>>>>> dat$wind2m <- lagard(df$wind2m)
> >>>>>>
> >>>>>>
> >>>>>> conc38b <- gam(deaths~te(year, month, week, weekday,
> >>>>>> bs=c("cr","cc","cc","cc")) + heap +
> >>>>>>                          te(temp_max, lag, k=c(10, 3)) +
> >>>>>>                          te(precip_daily_total, lag, k=c(10, 3)),
> >>>>>>                          data = dat, family = nb, method = 'REML', select = TRUE,
> >>>>>>                          knots = list(month = c(0.5, 12.5), week = c(0.5,
> >>>>>> 52.5), weekday = c(0, 6.5)))
> >>>>>>
> >>>>>> conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
> >>>>>>                         te(temp_max, lag, k=c(10, 4)) +
> >>>>>>                         te(precip_daily_total, lag, k=c(10, 4)),
> >>>>>>                         data = dat, family = nb, method = 'REML', select = TRUE,
> >>>>>>                         knots = list(month = c(0.5, 12.5)))
> >>>>>>
> >>>>>> conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
> >>>>>>                          te(temp_max, lag, k=c(10,3)) +
> >>>>>>                          te(precip_daily_total, lag, k=c(10,3)),
> >>>>>>                          data = dat, family = nb, method = 'REML', select = TRUE)
> >>>>>>
> >>>>>> ```
> >>>>>> Thank you if you've read this far!! :-))
> >>>>>>
> >>>>>>      [1]: https://scholar.google.co.uk/scholar?output=instlink&q=info:PKdjq7ZwozEJ:scholar.google.com/&hl=en&as_sdt=0,5&scillfp=17865929886710916120&oi=lle
> >>>>>>      [2]: https://i.stack.imgur.com/FzKyM.png
> >>>>>>      [3]: https://i.stack.imgur.com/fE3aL.png
> >>>>>>      [4]: https://i.stack.imgur.com/7nbXS.png
> >>>>>>      [5]: https://i.stack.imgur.com/pNnZU.png
> >>>>>>      [6]: https://github.com/JadeShodan/heat-mortality
> >>>>>>      [7]: https://github.com/JadeShodan/heat-mortality/blob/main/data_cross_validated_post2.rds
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
> >>>>> --
> >>>>> Simon Wood, School of Mathematics, University of Edinburgh,
> >>>>> https://www.maths.ed.ac.uk/~swood34/
> >>>>>
> >>>>
> >>>> ______________________________________________
> >>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>
> >>> --
> >>> Michael
> >>> http://www.dewey.myzen.co.uk/home.html
> >
>
> --
> Michael
> http://www.dewey.myzen.co.uk/home.html