Questions and solutions from a previous exam paper. Numbers in square brackets indicate marks.

## 1. lm, glsCorr, glsCorrVar [14]

#### (a) *Describe the predictive power* of the lm model, and *comment on model performance* over some covariate, e.g. time. [2]

[1]

- Reasonably poor;
- fairly low \(R^2\) (about 0.2);
- poor agreement between observed and fitted values.

[1] Observations prior to A are underestimated, those between A and B are overestimated, and those after B are underestimated.

#### (b) Three *assumptions* of the lm model; describe their *validity*. [3]

[1] **Normality assumption**: appears to be violated (Shapiro-Wilk test, \(H_0\): errors are normal).

[1] **Constant error variance assumption**: no evidence it is violated (Breusch-Pagan test, \(H_0\): variance is constant).

[1] **Independence assumption**: clearly violated; correlation is shown in the \(\text{acf}\) plot and by the Durbin-Watson test (\(H_0\): errors are independent; test statistic well below 2).
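As a sketch of how the Durbin-Watson statistic flags this, using made-up residual values (not the exam data):

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: the sum of squared successive differences
    of the residuals divided by the residual sum of squares. Values near 2
    suggest independence; values well below 2 suggest positive autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Hypothetical slowly-drifting residuals (positively autocorrelated) -> DW < 2
drifting = [0.9, 1.0, 1.1, 1.0, 0.9, -0.9, -1.0, -1.1, -1.0, -0.9]
# Hypothetical rapidly alternating residuals (negatively autocorrelated) -> DW > 2
alternating = [1.0, -1.0] * 5
```

A DW statistic clearly below 2, as reported in the answer, is what the drifting pattern produces.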

#### (c) Thinking about the source of the data, give another possible reason for the correlation. [1]

Realistic reason.

#### (d) Describe the mean-variance relationship underlying the BP test, and contrast it with that assumed by `glsVar`. [2]

[1] The Breusch-Pagan test fits \(r_i^2 = \alpha_0 + \alpha_1x_i + \gamma_i\) to determine the extent of agreement between the residual variance and the covariates \(x\), where \(r_i^2\) are the squared residuals and \(\alpha_0\), \(\alpha_1\) are estimated in the auxiliary model.

[1] The BP test assumes the test statistic \(NR^2\sim \chi^2_p\); in this case the degrees of freedom are \(p = 1\).

The GLS model instead assumes \(r^2_i \sim N(0,\sigma^2|\hat{y_i}|^{2m})\) (power variance function) or \(N(0,\sigma^2e^{2my_i})\) (exponential variance function).
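A minimal sketch of the BP auxiliary regression and its \(NR^2\) statistic, assuming a single covariate and using made-up residuals:

```python
def breusch_pagan_stat(x, resid):
    """NR^2 statistic from the auxiliary regression r_i^2 = a0 + a1*x_i + error.
    Under H0 (constant error variance) it is approximately chi-squared, 1 df."""
    n = len(x)
    r2 = [e ** 2 for e in resid]
    xbar = sum(x) / n
    ybar = sum(r2) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, r2))
    a1 = sxy / sxx                      # estimated slope alpha_1
    ss_res = sum((yi - (ybar + a1 * (xi - xbar))) ** 2 for xi, yi in zip(x, r2))
    ss_tot = sum((yi - ybar) ** 2 for yi in r2)
    if ss_tot == 0:                     # squared residuals perfectly constant
        return 0.0
    return n * (1 - ss_res / ss_tot)    # N * R^2 of the auxiliary regression

x = list(range(1, 11))
heteroskedastic = [0.1 * xi * (-1) ** xi for xi in x]  # spread grows with x
homoskedastic = [(-1.0) ** xi for xi in x]             # constant spread
```

With the heteroskedastic residuals the statistic is large relative to \(\chi^2_1\); with constant spread it collapses to zero.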

#### (e) What do the BP test and `glsVar` suggest about the mean-variance relationship? [3]

[1] BP test (checking for non-constant variance): no evidence for a *linear relationship* between the squared residuals and the covariate.

[1] `glsVar`: there appears to be a non-zero *power-based relationship*.

[1] The AIC is smaller when the power coefficient is fitted, and zero is not a plausible value for it in the gls model.

#### (f) Conclude which model is most defensible. [3]

[1] Overall conclusion: there is no/strong evidence for a change of RespVar over the ExpVars.

[1] The best-fitting model (based on AIC [1]) is `glsCorrVar`, which exhibits a large/small p-value (around 0.05) for the relationship. The models *which inappropriately ignore* the correlation/variance structure conclude that the ExpVar is (not) significant.

## 2. glm for Poisson(OD) [11]

#### (a) Relationship implied by the `glmPois` model on the RAW and LINK scale. Which is suitable? [3]

[1] `glmPois` assumes a *nonlinear* relationship between RespVar and the ExpVars on the raw scale and a *linear* relationship on the *log/sqrt* (link) scale.

[1] Not a good fit for this data; [1] the model only allows a **monotonic** relationship, but the underlying function appears to need inflection points.

#### (b) Mean-variance relationships underpinning `glmPois` and `glmPoisOD`. Which is more realistic? [3]

[1] `glmPois` assumes the fitted mean and the residual variance are **equal**.

[1] `glmPoisOD` assumes the residual variance is **proportional** to the fitted mean: \(\text{var} = \phi \lambda\).

[1] In this case the latter is more realistic, since the estimate of the dispersion parameter \(\hat\phi=399\) is much larger than 1.
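A sketch of how such a dispersion estimate is typically computed (sum of squared Pearson residuals over the residual degrees of freedom); the counts and fitted means below are hypothetical, not the exam data:

```python
def dispersion_estimate(y, mu, n_params):
    """Estimate the dispersion phi as the sum of squared Pearson residuals
    divided by the residual degrees of freedom; under a quasi-Poisson model
    Var(y) = phi * mu, so phi much greater than 1 signals overdispersion."""
    pearson_sq = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson_sq / (len(y) - n_params)

# Hypothetical counts far more variable than their fitted means -> phi >> 1
y = [0, 50, 5, 120, 2, 300]
mu = [20.0, 40.0, 60.0, 80.0, 100.0, 120.0]
phi_hat = dispersion_estimate(y, mu, n_params=2)
```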

#### (c) Contrast the conclusions of `glmPois`, `glmPoisOD` and `glsCorrVar`; which is most defensible? Suggest additional methods to improve the model. [3]

[1] `glmPois` and `glmPoisOD` suggest strong evidence for a negative/positive Resp-Exp relationship. `glsCorrVar` suggests this is not well-evidenced and could be **sampling variability** alone.

[1] I would use `glsCorrVar` to base my conclusion on: although the relationship looks nonlinear (even the Poisson-based fits are barely linear on the raw scale), the non-constant error variance and the correlation are modelled in its errors.

[1] A smoother-based function (splines) to improve the fitted relationship; GEEs (Generalized Estimating Equations), which are much like GLMs but allow non-independence.

#### (d) How would you investigate the effects of XXX or XXX on RespVar, if RespVar is available at both? [2]

[2] e.g. interactions, piecewise linear models, time as a factor variable.

## 3. glm for Binomial [15]

#### (a) Describe `fit.logit`, including the RespVar, linear predictor, random component and link function. [4]

[1] Linear predictor

[1] Link function

[1] Intercept and Error term

[1] Other β parameters

Assuming \(y_i \sim \text{Binomial}(n_i,p_i)\), where \(y_i\) is the number of observations with XXX out of the \(n_i\) observations, and \(p_i\) is the probability of XXX (being caught with fish in stomach), which is also the **RespVar** of the model, varying over the ExpVars: \[
p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}}
\] The **linear predictor** \(\eta_i\) is obtained by transforming the RespVar via the **link function**: \[
g(p_i)=\log{\left(\frac{p_i}{1-p_i}\right)}=\eta_i=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5
\] \(\beta_0\) is the intercept parameter (baseline), representing \(\text{female, <2.3m}\) and \(\text{george}\).

\(\beta_1\) is the coefficient for male (compared with female)

\(\beta_2\) is the coefficient for > 2.3m (compared with < 2.3m)

\(\beta_3\) is the coefficient for hancock (compared with george)

\(\beta_4\) is the coefficient for oklawaha (compared with george)

\(\beta_5\) is the coefficient for trafford (compared with george)

\(\epsilon_i\) is the **random component** - the binomial error term, i.e. the binomial variation of \(y_i\) about its mean \(n_ip_i\)

#### (b) Odds of Binomial GLM [2]

\[ \text{Odds} = \frac{p(\text{success})}{p(\text{failure})} = \frac{p_{i}}{1-p_{i}}= e^{\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5} \]

#### (c) Calculate odds using `fit.logit.best` [2]

\[ \text{odds(small,trafford)} = e^{\beta_0+\beta_4}=e^{-0.1218-1.3733} = 0.2242262 \]
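The same arithmetic, using the coefficient values quoted above:

```python
import math

# Coefficient values quoted in the answer above (hypothetical fit.logit.best)
beta_0 = -0.1218   # intercept (baseline: small, george)
beta_4 = -1.3733   # trafford effect relative to george

# Odds are the exponentiated linear predictor for this covariate combination
odds_small_trafford = math.exp(beta_0 + beta_4)
```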

#### (d) Multiplicative effect between odds for different levels [1]

[1] Compared with the baseline level, the odds of presence (success) vs absence (failure) at non-baseline level \(n\) are estimated to change by a factor of \(e^{\beta_n}\)

#### (e) Explain Deviance [4]

[1] Deviance provides a measure of discrepancy between the fitted model and the saturated model; the smaller \(D\), the better the model. \[ D = 2[l(\hat\beta_{sat},\phi) - l(\hat\beta,\phi)] \] [1] If the model is correct, \(D \sim \chi^2_{n-p-1}\) (for \(n\) observations and \(p\) predictors).

A \(\chi^2\) test (\(H_0\): the model is correct) will give a large p-value for a good fit.

[1] The \(\chi^2\) approximation for \(D\) is often poor for Binomial GLMs.

[1] Computing \(D\) involves \(\phi\), so if \(\phi\) is unknown we cannot use this result.

#### (f) What adjustment is made to the raw residuals, and why? [2]

[1] Adjustment. \[ \text{Pearson residuals} = \frac{\text{raw residual}}{\hat{\text{SD}}}=\frac{y-\hat{y}}{\sqrt{\text{Var}(y)}} \]

[1] Why. We would like to see no patterns in the Pearson residuals if the mean-variance relationship is appropriate.
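A sketch of the adjustment for a Binomial GLM, where \(\text{Var}(y) = np(1-p)\); the counts and fitted probabilities below are hypothetical:

```python
import math

def pearson_residuals_binomial(y, n, p_hat):
    """Divide each raw residual (y - n*p_hat) by the estimated binomial SD,
    sqrt(n * p_hat * (1 - p_hat)), so residuals are on a comparable scale."""
    return [(yi - ni * pi) / math.sqrt(ni * pi * (1 - pi))
            for yi, ni, pi in zip(y, n, p_hat)]

# Hypothetical data: 3 and 7 successes out of 10, fitted p = 0.5 for both
res = pearson_residuals_binomial([3, 7], [10, 10], [0.5, 0.5])
```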

## 4. glm for Multinomial [10]

#### (a) Nominal and Ordinal [1]

Nominal: the response categories have no natural order. e.g.

Ordinal: the order of the response categories matters. e.g.

#### (b) Assumptions of the multinomial GLM `fit.mult` [3]

[1] **Independent** observations from **Multinomial** distribution.

[1] Linear relationship with covars, on **(cumulative) log odds** scale.

[1] **IIA**: Independence from Irrelevant Alternatives, assuming the odds of one outcome vs another do not depend on which alternative outcomes are available.

#### (c) Model selection procedure to choose covariates [2]

[1] Fit models with all possible combinations of covariates (e.g. with `dredge`);

[1] use a fit criterion (e.g. AIC) to rank the models.
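The all-combinations step can be sketched as follows; the covariate names are hypothetical placeholders:

```python
from itertools import combinations

def all_covariate_subsets(covariates):
    """Enumerate every subset of the covariates (2^k candidate models),
    mirroring the all-subsets search that dredge performs; each subset
    would then be fitted and the fits ranked by AIC."""
    subsets = []
    for k in range(len(covariates) + 1):
        subsets.extend(combinations(covariates, k))
    return subsets

# 2^3 = 8 candidate models, from intercept-only () to the full model
candidate_models = all_covariate_subsets(["size", "sex", "lake"])
```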

#### (d) Calculate the response. Reptile; small; hancock [2]

\[ p_{ij} = \frac{e^{\eta_{ij}}}{1+\sum_{k=2}^J e^{\eta_{ik}}} \]

\(\eta_2 = -0.09083394 - 1.6583241 = -1.74915804\)

\(\eta_3 = -3.66588368 + 1.2431622 = -2.42272148\)

\(\eta_4 = -2.72380722 + 0.6952142 = -2.02859302\)

\(\eta_5 = -1.57283851 + 0.8262891 = -0.74654941\)

For reptile (category 3):

\[
p_{ij} = \frac{e^{-2.42272148}}{1+e^{-1.74915804}+e^{-2.42272148}+e^{-2.02859302}+e^{-0.74654941}}=0.04747016
\]
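The same calculation, using the \(\eta\) values quoted above (category 1 is the baseline with \(\eta_1 = 0\)):

```python
import math

# Linear predictors for a small individual at hancock, as quoted above;
# reptile is taken to be category 3, matching the numerator in the answer
eta = {2: -1.74915804, 3: -2.42272148, 4: -2.02859302, 5: -0.74654941}

denom = 1 + sum(math.exp(e) for e in eta.values())
p_reptile = math.exp(eta[3]) / denom
```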

#### (e) Assumption of a proportional odds model [2]

[1] **Proportional odds assumption** (specific to ordinal models):

the **slope** of each covariate relationship is the **same for every outcome level**.

[1] To test its validity:

fit a model that does not make the assumption, and use a model-selection statistic to compare the two.

General reasons for using a GLM instead of LR:

- the response variable is not guaranteed to change **linearly** with the explanatory variables;
- the response variable is naturally **bounded by some range**, and LR predictions can produce values outside this range;
- the errors are unlikely to be **normal with constant variance**.
