# MT5761 Statistical Modelling - Revision Note

Questions and solutions in previous exam paper. Numbers in square brackets indicate marks.

## 1. lm, glsCorr, glsCorrVar 

#### (a) Describe the predictive power of an lm model. Comment on model performance over some covariates, e.g. time.



• Reasonably poor;
• fairly low $R^2$ ($\approx 0.2$);
• poor agreement between observed and fitted values.

 Observations prior to A are underestimated, those between A and B are overestimated, and those post B are underestimated.
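To recall how the quoted $R^2$ is computed, here is a minimal pure-Python sketch with made-up observed and fitted values (the numbers are illustrative, not from the exam data): $R^2 = 1 - SS_{res}/SS_{tot}$, the proportion of response variance explained by the fit.

```python
# R^2 = 1 - SS_res / SS_tot on hypothetical data; a value near 0.2,
# as quoted in part (a), would indicate fairly poor predictive power.
observed = [3.1, 4.0, 2.8, 5.2, 4.4, 3.3]   # made-up response values
fitted   = [3.6, 3.7, 3.5, 4.1, 3.9, 3.6]   # made-up lm fitted values

mean_obs = sum(observed) / len(observed)
ss_tot = sum((y - mean_obs) ** 2 for y in observed)
ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```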

#### (b) Three assumptions of the lm model; describe their validity.

 Normality assumption: appears to be violated (Shapiro-Wilk test, $H_0$: normal errors).

 Constant error variance assumption: no evidence it is violated (Breusch-Pagan test, $H_0$: constant variance).

 Independence assumption: clearly violated; correlation is shown in the $\text{acf}$ plot and by the Durbin-Watson test ($H_0$: independent errors; a test statistic well below 2 indicates positive autocorrelation).
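As a reminder of the Durbin-Watson mechanics, a minimal sketch on made-up residuals (the series below is hypothetical and chosen to show positive autocorrelation): $DW = \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2$.

```python
# Durbin-Watson statistic on a hypothetical residual series.  Values near
# 2 are consistent with independence; values well below 2 suggest
# positive autocorrelation (as in part (b)).
residuals = [0.8, 0.9, 0.7, 0.6, -0.5, -0.7, -0.6, -0.4]  # made-up, trending

num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den
print(round(dw, 3))   # well below 2 for this autocorrelated series
```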

#### (c) Think about the source of the data; other reasons for correlation. 

Give a realistic, data-specific reason.

#### (d) Describe the mean-variance relationship underlying the BP test and contrast it with that assumed by glsVar. 

 The Breusch-Pagan test fits $r_i^2 = \alpha_0 + \alpha_1x_i + \gamma_i$ to determine the extent of agreement between the residual variance and the covariate $x$, where the $r_i^2$ are the squared residuals and $\alpha_0$, $\alpha_1$ are estimated in this auxiliary model.

 The BP test assumes the test statistic $NR^2\sim \chi^2_p$; in this case the degrees of freedom are $p = 1$.

The GLS model instead assumes $\epsilon_i \sim N(0,\sigma^2|\hat{y}_i|^{2m})$ (variance a power of the fitted mean) or $\epsilon_i \sim N(0,\sigma^2e^{2m\hat{y}_i})$ (an exponential function of the fitted mean).
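To make the BP construction concrete, a hedged pure-Python sketch on made-up data: regress the squared residuals on a single covariate, take the $R^2$ of that auxiliary regression, and form the statistic $NR^2$, which is compared against $\chi^2_1$ here.

```python
# Breusch-Pagan sketch on hypothetical data: auxiliary regression of
# squared residuals r_i^2 on x_i, then NR^2 compared to chi^2 with 1 df.
x  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # made-up covariate
r2 = [0.2, 0.5, 0.4, 1.1, 0.9, 1.6]          # made-up squared residuals

n = len(x)
mx, mr = sum(x) / n, sum(r2) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxr = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r2))
slope = sxr / sxx                             # least-squares slope
intercept = mr - slope * mx

fitted = [intercept + slope * xi for xi in x]
ss_res = sum((ri - fi) ** 2 for ri, fi in zip(r2, fitted))
ss_tot = sum((ri - mr) ** 2 for ri in r2)
r_squared = 1 - ss_res / ss_tot
bp_stat = n * r_squared                       # NR^2; large values reject H0
print(round(bp_stat, 3))
```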

#### (e) What do BP test and glsVar suggest about the mean-var relationship? 

 BP test (checking for non-constant variance): no evidence for a linear relationship between the squared residuals and the covariate.

 glsVar: there appears to be a non-zero power-based relationship.

 The AIC is smaller when the power coefficient is fitted, and zero is not a plausible value for that coefficient in the gls model.

#### (f) Conclude which model is most defensible. 

 Overall conclusion: there is no/strong evidence for a change of RespVar over the ExpVars.

 The best-fitting model (based on AIC: glsCorrVar) exhibits a large/small p-value (relative to 0.05) for the relationship. Models that inappropriately ignore the correlation or variance structure conclude the ExpVar is (not) significant.

## 2. glm for Poisson(OD) 

 glmPois assumes a nonlinear relationship between RespVar and the ExpVars on the raw scale, and a linear relationship on the log/sqrt (link) scale.

 Not a good fit for these data; the model allows a monotonic relationship, but the function appears to need inflection points.

#### (b) Mean-variance relationships underpinning glmPois and glmPoisOD. Which is more realistic? 

 glmPois assumes the fitted mean and the residual variance are equal: $\text{var} = \lambda$.

 glmPoisOD assumes the residual variance is proportional to the fitted mean: $\text{var} = \phi \lambda$.

 In this case the latter is more realistic, since the estimate of the dispersion parameter, $\hat\phi=399$, is much larger than 1.
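A minimal sketch of how a Pearson-based dispersion estimate arises (the counts, fitted means, and parameter count below are made up; the exam's $\hat\phi = 399$ came from the fitted model output): $\hat\phi = \sum_i (y_i - \mu_i)^2/\mu_i \,/\, (n - p - 1)$, i.e. Pearson $X^2$ over its residual degrees of freedom.

```python
# Dispersion estimate phi_hat on hypothetical Poisson-regression output.
# phi_hat >> 1 signals overdispersion (residual variance > fitted mean).
y  = [0, 12, 3, 40, 7, 95]               # made-up observed counts
mu = [5.0, 8.0, 6.0, 20.0, 9.0, 30.0]    # made-up fitted means
p = 1                                     # one slope parameter (assumed)

pearson_x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
phi_hat = pearson_x2 / (len(y) - p - 1)
print(round(phi_hat, 2))
```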

#### (c) Contrast the conclusions of glmPois, glmPoisOD and glsCorrVar; which is most defensible? Additional methods to improve. 

 glmPois and glmPoisOD suggest strong evidence for a negative/positive Resp-Exp relationship.

glsCorrVar suggests this is not well evidenced and could be due to sampling variability alone.

 I would base my conclusions on glsCorrVar: although the relationship looks nonlinear (even the Poisson-based fits are barely linear on the raw scale), the non-constant error variance and the correlation are modelled in the errors.

 Smoother-based functions (splines) could improve the fitted relationship; GEEs (Generalized Estimating Equations) are much like GLMs but allow non-independence.

#### (d) How to investigate the effects of XXX or XXX on RespVar, if RespVar is available at both? 

 e.g. interactions, piecewise linear models, time as a factor variable.

## 3. glm for Binomial 

- Linear predictor
- Intercept and error term
- Other β parameters

Assuming $y_i \sim \text{Binomial}(n_i,p_i)$, where $y_i$ is the number of observations with XXX out of the $n_i$ observations, and $p_i$ is the probability of XXX (being caught with fish in stomach), also the RespVar of the model, varying over the ExpVars:

$p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}} + \epsilon_i$

The linear predictor is obtained by transforming the RespVar with the link function:

$g(p_i)=\log{\left(\frac{p_i}{1-p_i}\right)}=\eta_i=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5$

$\beta_0$ is the intercept parameter (baseline), representing $\text{female, <2.3m}$ and $\text{george}$.

$\beta_1$ is the coefficient for male (compared with female)

$\beta_2$ is the coefficient for > 2.3m (compared with < 2.3m)

$\beta_3$ is the coefficient for hancock (compared with george)

$\beta_4$ is the coefficient for oklawaha (compared with george)

$\beta_5$ is the coefficient for trafford (compared with george)

$\epsilon_i$ is the random component, the binomial error term

#### (b) Odds of Binomial GLM 

$\text{Odds} = \frac{p(\text{success})}{p(\text{failure})} = \frac{p_{i}}{1-p_{i}}= e^{\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5}$
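A minimal sketch of the link between the linear predictor, the fitted probability, and the odds, using made-up coefficient values (not the fitted model): $p = e^\eta/(1+e^\eta)$ implies $p/(1-p) = e^\eta$.

```python
import math

# Inverse logit and odds on hypothetical coefficients: the odds on the
# response scale recover exp(eta) from the linear-predictor scale.
beta0, beta1 = -0.5, 1.2     # made-up coefficients
x1 = 1.0                     # made-up covariate value (e.g. an indicator)

eta = beta0 + beta1 * x1                    # linear predictor
p = math.exp(eta) / (1 + math.exp(eta))     # inverse logit: fitted probability
odds = p / (1 - p)                          # equals exp(eta) algebraically
print(round(odds, 6), round(math.exp(eta), 6))
```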

#### (c) Calculate odds using fit.logit.best 

$\text{odds(small,trafford)} = e^{\beta_0+\beta_4}=e^{-0.1218-1.3733} = 0.2242262$
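The arithmetic above can be checked directly (coefficient estimates as quoted; the indexing follows the reduced fit.logit.best model, not the full model in (a)):

```python
import math

# Reproducing part (c): odds(small, trafford) = exp(beta0 + beta4)
# with the quoted estimates from fit.logit.best.
beta0 = -0.1218
beta4 = -1.3733
odds = math.exp(beta0 + beta4)
print(round(odds, 7))   # ≈ 0.2242262, matching the value above
```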

#### (d) Multiplicative effect between odds for different levels 

 For the baseline level 0 compared with a non-base level $n$, the odds of presence (success) vs absence (failure) are estimated to change by a factor of $e^{\beta_n}$.

#### (e) Explain Deviance 

 Deviance provides a measure of discrepancy between the fitted model and the saturated model; the smaller $D$, the better the model:

$D = 2[l(\hat\beta_{sat},\phi) - l(\hat\beta,\phi)]$

 If the model is correct, $D \sim \chi^2_{n-p-1}$, with $n$ observations and $p$ predictors.

 A $\chi^2$ test ($H_0$: the model is correct) will give a large p-value for a good fit.

 The $\chi^2$ approximation of $D$ is often poor for Binomial GLMs.

 Computing $D$ involves $\phi$, so if it is unknown we cannot use this result.

#### (f) Adjustment to raw residuals and why 

 Adjustment: $\text{Pearson residuals} = \frac{\text{raw residual}}{\hat{\text{SD}}}=\frac{y-\hat{y}}{\sqrt{\widehat{\text{Var}}(y)}}$

 Why: if the mean-variance relationship is appropriate, the Pearson residuals should show no patterns (roughly constant spread).
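For the binomial case the adjustment divides by the binomial standard deviation; a minimal sketch with made-up numbers:

```python
import math

# Pearson residual for one binomial observation: (y - n*p_hat) divided by
# the estimated SD sqrt(n * p_hat * (1 - p_hat)) under the assumed
# mean-variance relationship.
y, n, p_hat = 7, 20, 0.25   # made-up: 7 successes out of 20, fitted p = 0.25

raw = y - n * p_hat
pearson = raw / math.sqrt(n * p_hat * (1 - p_hat))
print(round(pearson, 4))
```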

## 4. glm for Multinomial 

#### (a) Nominal and Ordinal 

Nominal: the response categories have no natural order. e.g.

Ordinal: the order of the response categories matters. e.g.

#### (b) Assumptions for multinomial GLM fit.mult 

 Independent observations from a Multinomial distribution.

 Linear relationship with the covariates, on the (cumulative) log-odds scale.

 IIA: Independence from Irrelevant Alternatives, assuming the odds of one outcome vs another do not depend on what alternative outcomes are available.

#### (c) Model Selection procedure to choose covariates

 Fit models with all possible combinations of covariates (dredge);

 use a fit criterion (e.g. AIC) to rank the models.
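The ranking step can be sketched with made-up log-likelihoods (dredge in R automates the fit-all-subsets part): $\text{AIC} = 2k - 2\,\text{logLik}$, smaller is better.

```python
# AIC ranking on hypothetical models: (number of parameters k, logLik).
# All values are made up for illustration.
models = {
    "mult1 (2 params)": (2, -110.3),
    "mult2 (4 params)": (4, -104.9),
    "mult3 (6 params)": (6, -104.1),
}

aic = {name: 2 * k - 2 * ll for name, (k, ll) in models.items()}
best = min(aic, key=aic.get)     # smallest AIC wins
print(sorted(aic.items(), key=lambda kv: kv[1]))
```

Note the extra parameters in "mult3" are only worth it if they buy enough log-likelihood; here they do not.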

#### (d) Calculate response. Reptile; small; hancock

$p_{ij} = \frac{e^{\eta_{ij}}}{1+\sum_{k=2}^J e^{\eta_{ik}}}$

$\eta_3 = -3.66588368 + 1.2431622 =-2.42272148$

$\eta_2 = -0.09083394 -1.6583241= -1.74915804$

$\eta_4 = -2.72380722 + 0.6952142 = -2.02859302$

$\eta_5 = -1.57283851 + 0.8262891 = -0.74654941$

$p_{ij} = \frac{e^{-2.42272148}}{1+e^{-1.74915804}+e^{-2.42272148}+e^{-2.02859302}+e^{-0.74654941}}=0.04747016$
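The calculation above can be reproduced directly (using the quoted linear-predictor values; the last one is $\eta_5$, and reptile is category $j = 3$):

```python
import math

# Baseline-category logit probabilities: p_ij = exp(eta_ij) / (1 + sum_k exp(eta_ik)).
# eta values as quoted for (reptile; small; hancock); category 1 is the baseline.
eta = {2: -1.74915804, 3: -2.42272148, 4: -2.02859302, 5: -0.74654941}

denom = 1 + sum(math.exp(e) for e in eta.values())
p_reptile = math.exp(eta[3]) / denom
print(round(p_reptile, 6))   # ≈ 0.047470, matching the value above
```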

#### (e) Assumption of a proportional odds model 

 Proportional odds assumption (especially for ordinal responses!):

 The slope of the covariate relationship is the same for each outcome level; only the intercepts differ.

 To test the validity:

 Fit a model that does not make this assumption and use a model-selection statistic (e.g. AIC) to choose between the two.

General reasons for using a GLM instead of LR (linear regression):

1. The Response Var is not guaranteed to change linearly with the Explanatory Vars;
2. The Response Var is naturally bounded by some range, and LR predictions can produce values outside this range;
3. The errors are unlikely to be normal with constant variance.
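Reason 2 can be illustrated with made-up coefficients: a linear model can predict a "probability" above 1, while the logistic inverse link keeps predictions inside $(0, 1)$.

```python
import math

# Hypothetical coefficients: for a large covariate value, the linear
# prediction escapes [0, 1]; the logistic GLM prediction cannot.
b0, b1 = 0.1, 0.3   # made-up coefficients
x = 4.0             # a large covariate value

linear_pred = b0 + b1 * x                         # LR prediction: exceeds 1
eta = b0 + b1 * x
glm_pred = math.exp(eta) / (1 + math.exp(eta))    # always in (0, 1)
print(round(linear_pred, 3), round(glm_pred, 3))
```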