MT5761 Statistical Modelling - Revision Note

Questions and solutions from a previous exam paper. Numbers in square brackets indicate marks.

1. lm, glsCorr, glsCorrVar [14]

(a) Describe the predictive power of the lm model. Comment on model performance over some covariates, e.g. time. [2]

[1]

  • Reasonably poor;
  • fairly low \(R^2\) (around 0.2);
  • poor agreement between observed and fitted values.

[1] Values prior to A are underestimated, those between A and B are overestimated, and those after B are underestimated.

(b) Three assumptions of the lm model; describe their validity. [3]

[1] Normality assumption: appears to be violated; Shapiro-Wilk test (\(H_0\): errors are normal).

[1] Constant error variance assumption: no evidence it is violated; Breusch-Pagan test (\(H_0\): constant variance).

[1] Independence assumption: clearly violated; correlation shown in the \(\text{acf}\) plot, and the Durbin-Watson test (\(H_0\): independent errors) gives a statistic well below 2. (All three checks are sketched below.)
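A minimal R sketch of these three checks, assuming a fitted model object `fit.lm` (the object, variable and data names are hypothetical) and the `lmtest` package:

```r
library(lmtest)   # provides bptest() and dwtest()

fit.lm <- lm(resp ~ expl, data = dat)   # hypothetical model and data names

shapiro.test(residuals(fit.lm))   # H0: errors are normal
bptest(fit.lm)                    # H0: constant error variance
dwtest(fit.lm)                    # H0: independent errors (statistic near 2)
acf(residuals(fit.lm))            # visual check for serial correlation
```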

(c) Think about the source of the data; suggest another plausible reason for the correlation. [1]

Any realistic reason, e.g. observations taken close together in time are likely to be similar.

(d) Describe the mean-variance relationship underlying the BP test, and contrast it with that assumed by glsVar. [2]

[1] The Breusch-Pagan test uses the auxiliary regression \(r_i^2 = \alpha_0 + \alpha_1x_i + \gamma_i\) to determine the extent of agreement between the residual variance and the covariates \(x\), where the \(r_i^2\) are the squared residuals and \(\alpha_0\) and \(\alpha_1\) are estimated in the model.

[1] The BP test assumes the test statistic \(NR^2\sim \chi^2_p\); in this case the degrees of freedom are \(p = 1\).

The GLS model instead assumes \(r_i \sim N(0,\sigma^2|\hat{y}_i|^{2m})\) (variance as a power of the fitted mean) or \(r_i \sim N(0,\sigma^2e^{2m\hat{y}_i})\) (an exponential form); both are sketched below.
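A hedged nlme sketch of these two variance structures (model, variable and data names are assumptions, not from the paper):

```r
library(nlme)

# Power-of-the-mean variance: Var(e_i) = sigma^2 * |fitted_i|^(2m)
fit.glsVar <- gls(resp ~ expl, data = dat,
                  weights = varPower(form = ~ fitted(.)))

# Exponential alternative: Var(e_i) = sigma^2 * exp(2m * fitted_i)
fit.glsVarExp <- gls(resp ~ expl, data = dat,
                     weights = varExp(form = ~ fitted(.)))

intervals(fit.glsVar)   # CI for the power m; zero outside the CI suggests non-constant variance
```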

(e) What do the BP test and glsVar suggest about the mean-variance relationship? [3]

[1] BP test (checks for non-constant variance): no evidence of a linear relationship between the squared residuals and the covariate.

[1] glsVar: there appears to be a non-zero, power-based mean-variance relationship.

[1] The AIC is smaller when the power coefficient is fitted, and zero is not a plausible value for that coefficient in the gls model.

(f) Conclude which model is most defensible. [3]

[1] Overall conclusion: there is no/strong evidence for a change in RespVar over the ExpVars.

[1] The best-fitting model (based on AIC [1], it is glsCorrVar) exhibits a large/small p-value (relative to 0.05) for the relationship. Models that inappropriately ignore the correlation or variance structure conclude that the ExpVar is (not) significant; a comparison sketch follows.
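For reference, a sketch of how that comparison might be run (object names assumed; `method = "ML"` so the AICs are comparable across model classes):

```r
library(nlme)

fit.glsCorr    <- gls(resp ~ expl, data = dat, method = "ML",
                      correlation = corAR1())
fit.glsCorrVar <- gls(resp ~ expl, data = dat, method = "ML",
                      correlation = corAR1(),
                      weights = varPower(form = ~ fitted(.)))

AIC(fit.lm, fit.glsCorr, fit.glsCorrVar)   # smaller is better
summary(fit.glsCorrVar)                    # p-value for the ExpVar coefficient
```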

2. glm for Poisson(OD) [11]

(a) What relationship does glmPois assume between RespVar and ExpVars, and is it suitable for these data? [3]

[1] glmPois assumes a nonlinear relationship between RespVar and the ExpVars, i.e. a linear relationship on the log/sqrt scale.

[1] Not good for these data; [1] the model only allows a monotonic relationship, but the underlying function appears to need inflection points.

(b) The mean-variance relationships underpinning glmPois and glmPoisOD; which is more realistic? [3]

[1] glmPois assumes the residual variance equals the fitted mean: \(\text{var} = \lambda\).

[1] glmPoisOD assumes the residual variance is proportional to the fitted mean: \(\text{var} = \phi \lambda\).

[1] In this case the latter is more realistic, since the estimate of the dispersion parameter, \(\hat\phi = 399\), is much larger than 1.
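In R the two fits differ only in the `family` argument; a sketch with assumed names, where `summary()` reports the dispersion estimate for the quasi-Poisson fit:

```r
fit.glmPois   <- glm(count ~ expl, family = poisson,      data = dat)
fit.glmPoisOD <- glm(count ~ expl, family = quasipoisson, data = dat)

summary(fit.glmPoisOD)$dispersion   # phi-hat; values far above 1 indicate overdispersion
```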

(c) Contrast the conclusions of glmPois, glmPoisOD and glsCorrVar; which is most defensible? Suggest additional methods to improve the model. [3]

[1] glmPois and glmPoisOD suggest strong evidence for a negative/positive Resp-Exp relationship.

glsCorrVar suggests this is not well evidenced and could be due to sampling variability alone.

[1] I would base my conclusion on glsCorrVar: although the relationship looks nonlinear (even the Poisson-based fits are barely linear on the raw scale), the non-constant error variance and the correlation are modelled in the errors.

[1] A smoother-based function (splines) could improve the fitted relationship; GEEs (Generalized Estimating Equations) are much like GLMs but allow non-independence. (Both are sketched below.)
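Sketches of both suggestions, assuming hypothetical variable names (`series` as the grouping variable for the GEE):

```r
library(mgcv)      # spline-based smooths
library(geepack)   # GEEs: GLM-like models allowing non-independence

# Smoother-based relationship (can bend through inflection points)
fit.gam <- gam(count ~ s(expl), family = poisson, data = dat)

# GEE with AR(1) working correlation within each series
fit.gee <- geeglm(count ~ expl, id = series, family = poisson,
                  corstr = "ar1", data = dat)
```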

(d) How would you investigate the effects of XXX or XXX on RespVar, if RespVar were available at both? [2]

[2] e.g. interactions, piecewise linear models, time as a factor variable (illustrated below).
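Illustrative model formulas for each suggestion (variable names and the breakpoint `k` are hypothetical):

```r
lm(resp ~ expl * group)              # interaction: the expl effect differs by group
lm(resp ~ expl + factor(time))       # time as a factor variable
lm(resp ~ expl + pmax(expl - k, 0))  # piecewise linear with a break at k
```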

3. glm for Binomial [15]

(a) Write down the fitted model, defining each component. [4]

[1] Linear predictor

[1] Link function

[1] Intercept and error term

[1] Other \(\beta\) parameters

Assume \(y_i \sim \text{Binomial}(n_i, p_i)\), where \(y_i\) is the number of observations with XXX out of the \(n_i\) observations, and \(p_i\) is the probability of XXX (being caught with fish in the stomach), which is also the RespVar of the model, varying over the ExpVars:
\[
p_i = \frac{e^{\eta_i}}{1+e^{\eta_i}} + \epsilon_i
\]
The linear predictor \(\eta_i\) is obtained by transforming the RespVar with the link function:
\[
g(p_i)=\log{\left(\frac{p_i}{1-p_i}\right)}=\eta_i=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5
\]
\(\beta_0\) is the intercept parameter (baseline), representing \(\text{female}\), \(<2.3\text{m}\) and \(\text{george}\).

\(\beta_1\) is the coefficient for male (compared with female).

\(\beta_2\) is the coefficient for \(>2.3\)m (compared with \(<2.3\)m).

\(\beta_3\) is the coefficient for hancock (compared with george).

\(\beta_4\) is the coefficient for oklawaha (compared with george).

\(\beta_5\) is the coefficient for trafford (compared with george).

\(\epsilon_i\) is the random component, the binomial error term.
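Under these definitions the model might be fitted as below; the data frame and column names are assumptions:

```r
# y successes out of n trials, modelled on the logit (log-odds) scale
fit.logit <- glm(cbind(caught, n - caught) ~ sex + size + lake,
                 family = binomial, data = gators)
summary(fit.logit)   # beta-hats on the log-odds scale
```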

(b) The odds in a binomial GLM. [2]

\[ \text{Odds} = \frac{p(\text{success})}{p(\text{failure})} = \frac{p_{i}}{1-p_{i}}= e^{\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\beta_5x_5} \]

(c) Calculate the odds using fit.logit.best. [2]

\[ \text{odds(small,trafford)} = e^{\beta_0+\beta_4}=e^{-0.1218-1.3733} = 0.2242262 \]
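The same quantity can be read off the fitted object; the coefficient name used below is an assumption about how the factor level is labelled:

```r
b <- coef(fit.logit.best)                   # named coefficient vector
exp(b["(Intercept)"] + b["laketrafford"])   # odds(small, trafford); approx. 0.224
exp(b)                                      # each level's multiplicative effect on the odds
```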

(d) The multiplicative effect on the odds between different levels. [1]

[1] Compared with the baseline level, for a non-baseline level \(n\) the odds of presence (success) vs absence (failure) are estimated to change by a factor of \(e^{\beta_n}\).

(e) Explain deviance. [4]

[1] Deviance provides a measure of the discrepancy between the fitted model and the saturated model; the smaller \(D\), the better the model:
\[
D = 2[l(\hat\beta_{sat},\phi) - l(\hat\beta,\phi)]
\]
[1] If the model is correct, \(D \sim \chi^2_{n-p-1}\), for \(n\) observations and \(p\) predictors.

A \(\chi^2\) test (\(H_0\): the model is correct) will give a large p-value for a well-fitting model (sketched below).

[1] The \(\chi^2\) approximation to \(D\) is often poor for binomial GLMs.

[1] Computing \(D\) involves \(\phi\), so if \(\phi\) is unknown we cannot use this result.
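When the approximation is usable, the goodness-of-fit test is a few lines of base R (object name assumed):

```r
D  <- deviance(fit.logit)      # residual deviance
df <- df.residual(fit.logit)   # n - p - 1
pchisq(D, df, lower.tail = FALSE)   # H0: model correct; large p-value = good fit
```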

(f) What adjustment is made to the raw residuals, and why? [2]

[1] Adjustment:
\[
\text{Pearson residuals} = \frac{\text{raw residual}}{\hat{\text{SD}}}=\frac{y-\hat{y}}{\sqrt{\widehat{\text{Var}}(y)}}
\]

[1] Why: we would like to see no pattern in the Pearson residuals if the mean-variance relationship is appropriate (see the plot sketch below).
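A quick residual check in R (object name assumed):

```r
pr <- residuals(fit.logit, type = "pearson")   # raw residuals / estimated SD
plot(fitted(fit.logit), pr,
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)   # hope to see no pattern about zero
```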

4. glm for Multinomial [10]

(a) Nominal and ordinal responses. [1]

Nominal: the response categories have no natural order, e.g. food type (fish, invertebrate, reptile, ...).

Ordinal: the order of the response categories matters, e.g. size classes ordered small < medium < large.

(b) Assumptions of the multinomial GLM fit.mult. [3]

[1] Independent observations from a Multinomial distribution.

[1] A linear relationship with the covariates, on the (cumulative) log-odds scale.

[1] IIA, Independence from Irrelevant Alternatives: the odds of one outcome vs another do not depend on which alternative outcomes are available.

(c) Model selection procedure to choose covariates. [2]

[1] Fit models with all possible combinations of covariates (dredge; see the sketch below).

[1] Use a fit criterion (e.g. AIC) to rank the models.
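A sketch of the procedure, assuming the model is fitted with `nnet::multinom` and ranked with `MuMIn::dredge` (formula and data names are hypothetical):

```r
library(nnet)    # multinom()
library(MuMIn)   # dredge()

# na.fail ensures every candidate model is fitted to identical data
fit.mult <- multinom(food ~ lake + size + sex, data = gators,
                     na.action = "na.fail")
dredge(fit.mult, rank = "AIC")   # all covariate subsets, ranked by AIC
```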

(d) Calculate the response for: Reptile; small; hancock. [2]

\[ p_{ij} = \frac{e^{\eta_{ij}}}{1+\sum_{k=2}^J e^{\eta_{ik}}} \]

\(\eta_2 = -0.09083394 - 1.6583241 = -1.74915804\)

\(\eta_3 = -3.66588368 + 1.2431622 = -2.42272148\)

\(\eta_4 = -2.72380722 + 0.6952142 = -2.02859302\)

\(\eta_5 = -1.57283851 + 0.8262891 = -0.74654941\)
\[ p_{ij} = \frac{e^{-2.42272148}}{1+e^{-1.74915804}+e^{-2.42272148}+e^{-2.02859302}+e^{-0.74654941}}=0.04747016 \]
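The hand calculation can be reproduced directly from the \(\eta\) values above; as a cross-check, `predict(fit.mult, newdata, type = "probs")` should return the same probabilities:

```r
# Linear predictors for the non-baseline categories (values from above)
eta <- c(cat2 = -1.74915804, cat3 = -2.42272148,
         cat4 = -2.02859302, cat5 = -0.74654941)

# Reptile is category 3 here
unname(exp(eta["cat3"]) / (1 + sum(exp(eta))))   # 0.04747016
```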

(e) The assumption of a proportional odds model. [2]

[1] Proportional odds assumption (especially for ordinal responses!): the slope of the covariate relationship is the same for each outcome level.

[1] To test its validity: fit a model that does not make the assumption, and use a model-selection statistic to choose between the two (see the sketch below).
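One way to run that comparison, assuming an ordered factor response and hypothetical names; `polr` imposes the proportional-odds constraint, `multinom` does not:

```r
library(MASS)   # polr(): proportional-odds model
library(nnet)   # multinom(): separate slopes per outcome level

fit.po   <- polr(food_ord ~ size + lake, data = gators)      # food_ord: ordered factor
fit.free <- multinom(food_ord ~ size + lake, data = gators)  # relaxes the constraint
AIC(fit.po, fit.free)   # prefer the simpler fit.po unless fit.free is clearly better
```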

General reasons for using a GLM instead of LR (linear regression):

  1. The Response Var is not guaranteed to change linearly with the Explanatory Vars;
  2. the Response Var is naturally bounded by some range, and LR predictions can produce values outside this range;
  3. errors are unlikely to be normal with constant variance.
