# MT5761 Statistics Modeling - Revision Note

Questions and solutions from a previous exam paper. Numbers in square brackets indicate marks.

## 1. lm, glsCorr, glsCorrVar [14]

#### (a) Describe the predictive power of an lm model. Comment on model performance over some covariates, e.g. time. [2]

[1]

• Reasonably poor;
• fairly low $R^2$ (≈ 0.2);
• poor agreement between observed and fitted values.

[1] Values prior to A are underestimated, those between A and B are overestimated, and those post B are underestimated.

#### (b) Three assumptions of the lm model; describe their validity. [3]

[1] Normality assumption: appears to be violated; Shapiro–Wilk test ($H_0$: errors are normal) rejects.

[1] Constant error variance assumption: no evidence it is violated; Breusch–Pagan test ($H_0$: variance is constant).

[1] Independence assumption: clearly violated; correlation shown in the $\text{acf}$ plot, and the Durbin–Watson test ($H_0$: errors are independent) gives a test statistic less than 2.
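The Durbin–Watson statistic referenced above can be computed directly from the residuals. A minimal sketch in Python, using hypothetical residuals (not the exam data): values near 2 indicate no lag-1 autocorrelation, values well below 2 indicate positive serial correlation.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals, divided by the residual sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Hypothetical, positively autocorrelated residuals: DW falls well below 2.
resid = [0.9, 0.8, 0.6, 0.5, 0.2, -0.1, -0.4, -0.5, -0.7, -0.8]
print(round(durbin_watson(resid), 3))  # → 0.107
```
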

#### (c) Think about the source of the data; give another possible reason for correlation. [1]

Realistic reason.

#### (d) Describe mean-var relationships underlying BP test, contrast with that assumed by glsVar. [2]

[1] The Breusch–Pagan test uses the auxiliary regression $r_i^2 = \alpha_0 + \alpha_1x_i + \gamma_i$ to determine the extent of agreement between the residual variance and the covariate $x$, where $r_i^2$ are the squared residuals and $\alpha_0$, $\alpha_1$ are estimated in the model.

[1] The BP test assumes the test statistic $NR^2\sim \chi^2_p$; in this case the degrees of freedom are $p = 1$.

The GLS model assumes $\epsilon_i \sim N(0,\sigma^2|\hat{y_i}|^{2m})$ (power of the fitted values) or $\epsilon_i \sim N(0,\sigma^2e^{2m\hat{y_i}})$ (exponential in the fitted values).
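As a check on the mechanics of part (d), a small sketch of the BP auxiliary regression: regress the squared residuals on $x$ by ordinary least squares and return $NR^2$, which is compared against $\chi^2_1$. All numbers are hypothetical.

```python
def bp_statistic(resid, x):
    """Breusch-Pagan style statistic: OLS of squared residuals on x,
    returning N * R^2 (compared against chi-squared with 1 df)."""
    r2 = [e ** 2 for e in resid]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(r2) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, r2))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    fitted = [ybar + (sxy / sxx) * (xi - xbar) for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(r2, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in r2)
    return n * (1 - ss_res / ss_tot)

# Hypothetical data where the residual spread grows with x.
x = [1, 2, 3, 4, 5, 6]
resid = [0.2, -0.5, 0.9, -1.2, 1.8, -2.1]
print(round(bp_statistic(resid, x), 3))  # → 5.493, above the chi^2_1 5% cutoff 3.84
```
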

#### (e) What do BP test and glsVar suggest about the mean-var relationship? [3]

[1] BP test (which checks for non-constant variance): no evidence for a linear relationship between the squared residuals and the covariate.

[1] glsVar: There appears to be a non-zero power-based relationship.

[1] The AIC is smaller when the power coefficient is fitted, and zero is not a plausible value for that coefficient in the gls model.

#### (f) Conclude which model is most defensible. [3]

[1] Overall conclusion: there is no/strong evidence for a change in RespVar over the ExpVars.

[1] The best-fitting model (based on AIC [1], it is glsCorrVar) exhibits a large/small p-value (relative to 0.05) for the relationship. Models which inappropriately ignore the correlation or non-constant variance conclude the ExpVar is (not) significant.

## 2. glm for Poisson(OD) [11]

#### (a) What relationship does glmPois assume between RespVar and ExpVars? Is it suitable for this data? [3]

[1] glmPois assumes a nonlinear relationship between RespVar and the ExpVars, which is linear on the log/sqrt (link) scale.

[1] Not good for this data; [1] the model only allows a monotonic relationship, but the underlying function seems to need inflection points.

#### (b) Mean-var relationships underpinning glmPois and glmPoisOD; which is more realistic? [3]

[1] glmPois assumes the fitted mean and the residual variance are equal: $\text{var} = \lambda$.

[1] glmPoisOD assumes the residual variance is proportional to the fitted mean: $\text{var} = \phi \lambda$.

[1] In this case the latter is more realistic, since the estimate of the dispersion parameter, $\hat\phi = 399$, is much larger than 1.
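A dispersion estimate of this kind is commonly computed as the Pearson statistic divided by the residual degrees of freedom. A minimal sketch with hypothetical counts and fitted means (not the exam data):

```python
def poisson_dispersion(y, mu, n_params):
    """Estimate the dispersion parameter as Pearson X^2 / (n - p):
    values far above 1 indicate overdispersion relative to Poisson."""
    x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return x2 / (len(y) - n_params)

# Hypothetical counts whose spread far exceeds the fitted means.
y = [120, 4, 310, 80, 15, 260]
mu = [100.0, 50.0, 200.0, 90.0, 60.0, 210.0]
print(round(poisson_dispersion(y, mu, 2), 2))  # → 38.4, well above 1
```
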

#### (c) Contrast the conclusions of glmPois, glmPoisOD and glsCorrVar; which is most defensible? Additional methods to improve. [3]

[1] glmPois and glmPoisOD suggest strong evidence for a negative/positive Resp–Exp relationship.

glsCorrVar suggests this is not well-evidenced and could be sampling variability alone.

[1] I would base my conclusion on glsCorrVar: although the relationship looks nonlinear (even the Poisson-based fits are barely linear on the raw scale), the non-constant error variance and the correlation are modelled in the errors.

[1] Smoother-based functions (splines) to improve the fitted relationship; GEEs (Generalized Estimating Equations), which are much like GLMs but allow non-independence.

#### (d) How to investigate the effects of XXX or XXX on RespVar? If RespVar available at both. [2]

[2] e.g. Interactions, piecewise linear models, time as a factor variable.

## 3. glm for Binomial [15]

#### (a) Write out the model, defining each component [3]

[1] Linear predictor

[1] Intercept and Error term

[1] Other β parameters

Assuming $y_i \sim \text{Binomial}(n_i,p_i)$, where $y_i$ is the number of observations with XXX out of the $n_i$ observations, and $p_i$ is the probability of XXX (being caught with fish in stomach), also the RespVar of the model, varying over the ExpVars:

$p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}}$

The linear predictor is obtained by transforming the RespVar with the link function:

$g(p_i)=\log{\left(\frac{p_i}{1-p_i}\right)}=\eta_i=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5$

$\beta_0$ is the intercept parameter (baseline), representing $\text{female, <2.3m}$ and $\text{george}$.

$\beta_1$ is the coefficient for male (compared with female)

$\beta_2$ is the coefficient for > 2.3m (compared with < 2.3m)

$\beta_3$ is the coefficient for hancock (compared with george)

$\beta_4$ is the coefficient for oklawaha (compared with george)

$\beta_5$ is the coefficient for trafford (compared with george)

$\epsilon_i$ is the random component - the binomial error term
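The inverse-link step above can be sketched numerically. The value $-0.1218$ is the fitted intercept quoted in part (c); everything else is plain arithmetic.

```python
import math

def inv_logit(eta):
    """Map a linear predictor back to a probability via the logistic link."""
    return math.exp(eta) / (1 + math.exp(eta))

# eta = 0 maps to p = 0.5; large |eta| pushes p toward the 0/1 bounds.
print(inv_logit(0.0))                    # → 0.5
print(round(inv_logit(-0.1218), 4))      # baseline (female, <2.3m, george) → 0.4696
```
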

#### (b) Odds of Binomial GLM [2]

$\text{Odds} = \frac{p(\text{success})}{p(\text{failure})} = \frac{p_{i}}{1-p_{i}}= e^{\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5}$

#### (c) Calculate odds use fit.logit.best [2]

$\text{odds(small, trafford)} = e^{\beta_0+\beta_5}=e^{-0.1218-1.3733} = 0.2242262$
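The same number can be reproduced directly from the coefficients quoted above:

```python
import math

# odds(small, trafford) = exp(intercept + trafford coefficient),
# using the fitted values quoted in the answer above.
beta_0 = -0.1218
beta_trafford = -1.3733
odds = math.exp(beta_0 + beta_trafford)
print(round(odds, 7))  # → 0.2242262
```
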

#### (d) Multiplicative effect between odds for different levels [1]

[1] Comparing non-baseline level $n$ with the baseline level, the odds of presence (success) vs absence (failure) are estimated to change by a factor of $e^{\beta_n}$.

#### (e) Explain Deviance [4]

[1] Deviance provides a measure of discrepancy between the fitted model and the saturated model; the smaller $D$ is, the better the model. $D = 2[l(\hat\beta_{sat},\phi) - l(\hat\beta,\phi)]$

[1] If the model is correct, $D \sim \chi^2_{n-p-1}$, with $n$ observations and $p$ predictors.

A $\chi^2$ test ($H_0$: the model is correct) will give a large p-value for a good-fitting model.

[1] The $\chi^2$ approximation of $D$ is often poor for Binomial GLMs.

[1] Computing $D$ involves $\phi$, so if $\phi$ is unknown we cannot use this result.
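A toy illustration of the deviance mechanics, with hypothetical maximised log-likelihoods and $\phi$ taken as 1: $D$ is compared against its $\chi^2$ degrees of freedom, $n - p - 1$.

```python
# Hypothetical fitted values (not from the exam paper):
l_sat, l_model = -12.4, -42.4   # saturated and fitted log-likelihoods
n_obs, n_pred = 40, 3           # sample size and number of predictors

D = 2 * (l_sat - l_model)       # deviance with phi = 1
df = n_obs - n_pred - 1         # chi^2 reference degrees of freedom
print(D, df)  # → 60.0 36; D far above df hints at lack of fit
              # (if the chi^2 approximation holds)
```
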

#### (f) adjustment to raw residuals and why [2]

[1] Adjustment: $\text{Pearson residuals} = \frac{\text{raw residual}}{\hat{\text{SD}}}=\frac{y-\hat{y}}{\sqrt{\text{Var}(y)}}$

[1] Why. We'd like to see no patterns in Pearson residuals if Mean-Var relationship is appropriate.
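A minimal sketch of this adjustment for a single binomial group, with hypothetical counts: the raw residual is divided by the estimated binomial standard deviation $\sqrt{n\hat{p}(1-\hat{p})}$.

```python
import math

def pearson_residual(y, n, p_hat):
    """Raw residual (y - n*p_hat) scaled by the estimated binomial SD."""
    return (y - n * p_hat) / math.sqrt(n * p_hat * (1 - p_hat))

# Hypothetical group: 7 successes out of 20 trials, fitted p = 0.25.
print(round(pearson_residual(7, 20, 0.25), 3))  # → 1.033
```
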

## 4. glm for Multinomial [10]

#### (a) Nominal and Ordinal [1]

Nominal: the response categories have no natural order. e.g.

Ordinal: the order of the response categories matters. e.g.

#### (b) Assumptions for multinomial GLM fit.mult [3]

[1] Independent observations from Multinomial distribution.

[1] Linear relationship with covars, on (cumulative) log odds scale.

[1] IIA: Independence of Irrelevant Alternatives. The odds of one outcome vs another are assumed not to depend on what alternative outcomes are available.

#### (c) Model selection procedure to choose covariates [2]

[1] Fit models with all possible combinations of covariates; dredge

[1] use fit criteria (e.g. AIC) to rank models.
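The ranking step can be sketched as follows. Model names and log-likelihoods are hypothetical; the only assumption is the standard formula $\text{AIC} = 2k - 2\log L$, where $k$ is the number of parameters.

```python
# Hypothetical candidate fits: name -> (maximised log-likelihood, n. parameters).
fits = {"lake": (-210.5, 5), "size": (-225.1, 3), "lake+size": (-204.2, 6)}

# AIC = 2k - 2*logLik; the smallest AIC is preferred.
aic = {name: 2 * k - 2 * ll for name, (ll, k) in fits.items()}
ranked = sorted(aic, key=aic.get)
print(ranked)  # → ['lake+size', 'lake', 'size']
```
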

#### (d) Calculate response. Reptile; small; hancock [2]

$p_{ij} = \frac{e^{\eta_{ij}}}{1+\sum_{k=2}^J e^{\eta_{ik}}}$

$\eta_2 = -0.09083394 - 1.6583241 = -1.74915804$

$\eta_3 = -3.66588368 + 1.2431622 = -2.42272148$

$\eta_4 = -2.72380722 + 0.6952142 = -2.02859302$

$\eta_5 = -1.57283851 + 0.8262891 = -0.74654941$

$p_{ij} = \frac{e^{-2.42272148}}{1+e^{-1.74915804}+e^{-2.42272148}+e^{-2.02859302}+e^{-0.74654941}}=0.04747016$
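The calculation above can be checked numerically: with category 1 as the baseline, each probability is $e^{\eta_j}$ over one plus the sum of $e^{\eta_k}$ across the non-baseline categories (a baseline-category logit, i.e. a softmax with the baseline's $\eta$ fixed at 0).

```python
import math

# Linear predictors for the non-baseline categories, as computed above.
eta = {2: -1.74915804, 3: -2.42272148, 4: -2.02859302, 5: -0.74654941}

denom = 1 + sum(math.exp(e) for e in eta.values())  # the "1" is the baseline
p_reptile = math.exp(eta[3]) / denom                # category 3 = reptile
print(round(p_reptile, 6))  # → 0.04747
```
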

#### (e) Assumption of a proportional odds model [2]

[1] Proportional odds assumption (especially for ordinal responses!):

The slope of the covariate relationship is the same for each outcome level.

[1] To test its validity:

Fit a model that does not make this assumption, and use a model selection statistic to choose between the two.

General reasons for using a GLM instead of LR (linear regression):

1. The Response Var is not guaranteed to change linearly with the Explanatory Vars;
2. The Response Var is naturally bounded within some range, and LR predictions can produce values outside that range;
3. Errors are unlikely to be normal with constant variance.