# MT4113 Statistical Computing - Revision Syllabus

Revision syllabus for MT4113 by lectures/chapters.

## Fundamental programming principles

### Computer (lecture 1)

• [x] Definition of a computer

• [x] Hardware, software

• [x] Programming languages: classification, generations,

• [x] compiled vs interpreted, which should be used when

##### Some Definitions
• Computer - A device that accepts information and manipulates it according to a sequence of instructions to produce output

• Hardware - the physical computer equipment, including CPU, memory (registers, RAM), mass storage (disks), and I/O devices (keyboard, screen, printer)

• Software - computer programs, sets of instructions operating the hardware. System software (Linux, Windows), application software (SAS, Word), programming software (R)

##### Programming Languages

Classification

• by level
• Low - close to hardware; machine language
• High - far from hardware, close to natural language; runs on different OSs
• by generality - higher-level languages are more specialized in application

• by generation - 1st to 5th

• by speed/efficiency - low-level is generally (but not always) faster

Generations

• 1st: Machine language - relatively few different instructions

• 2nd: Assembly language - human-readable machine language

• 3rd: CPU-independent languages - Fortran, C, Pascal, object-oriented languages (C++, Java); most software we use is written in a 3GL

• 4th: designed with a specific application in mind;
• database query languages (SQL)
• graphical user interfaces (GUI)
• maths languages (Mathematica, Maple)
• stats languages (R, SAS)
• 5th: for solving problems given a problem specification; no need to explicitly write an algorithm
• Prolog, used in artificial intelligence studies
##### Compiled vs Interpreted

Two approaches to turning languages into machine language (which must happen before code can be executed):

• Interpreted: translated line by line as it is entered, using an 'interpreter', and run straight away. Python, R

• Compiled: written and saved in a file, then translated into machine code all in one go using a 'compiler'.

##### Which and When
• simple, do-once tasks - GUI; but no reproducible trail

• more complex, run a few times - 4GL within an interpreter

• production code where efficiency is important - 3GL and compile, or prototype in a 4GL and rewrite the slow bits in a 3GL

### Algorithms (lecture 2)

• [x] Definition; important features; components of an algorithm

• [x] Examples, criticism of examples

• [x] Code and pseudocode

• [x] Control structures and vectorization in R

##### What's An Algorithm

Algorithm: an ordered sequence of unambiguous and well-defined instructions, performing some task and halting in finite time!

Important features

• an ordered sequence

• unambiguous and well-defined (clear and do-able without difficulty)

• performs the task, with nothing left out

• halts in finite time - the algorithm must terminate

Example

• boiling an egg!

Elements

• assigning values

• computation

• return values

Control Structures

• Conditional exec - if, switch case

• Iteration (looping) - repeat a set of instr
• fixed number
• variable number (warning! potential infinite loop)

Recursion - algorithms that call themselves

##### Code and Pseudocode

Code: instructions in a computer language that implement an algorithm.

Pseudocode: generic, informal, language-unspecified instructions that specify an algorithm.

Always write out pseudocode before coding!

##### Vectorization

Much faster than loops

Some variants of apply - sapply, lapply, tapply, mapply, %by%

### Modularization and good programming practice (lect 2, 3, prac 2, 3)

• [x] Modular programming, definition, advantages, interface, passing by reference and value, best practices, encapsulation

• [x] Environments, scoping, dynamic lookup

• [x] Strategies for code design

• [x] Coding conventions and style

• [x] Testing and debugging. Types of code errors; how to build tests; debugging strategies.

• [x] Online collaborative programming tools: git repositories, clone, pull, push, commit.

##### Modularization

Modular programming: split the program into discrete, readable blocks of code so that each block does a small amount.

• avoids repetition

• easier to test

• easier to update and propagate changes

• divide and conquer problems

##### Module interfaces
• passing by value
• makes copies, stored in a separate location from the original variable
• changes inside the function have no effect on the variable's value outside
• inefficient but safer
• passing by reference
• directly uses the original variable (location)
• changes to the variable within the function affect its value after the function runs
• efficient but dangerous

Most 3GLs allow both interfaces; in C one can specify the interface for individual parameters; base R does not allow passing by reference.
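The notes above describe 3GL interfaces; as an illustrative sketch, Python shows both behaviours in one language, since rebinding an argument acts value-like while mutating a shared object acts reference-like (the function names here are hypothetical):

```python
def increment_copy(x):
    # rebinding the local name has no effect outside (value-like)
    x = x + 1
    return x

def append_in_place(items):
    # mutating the shared object is visible outside (reference-like)
    items.append(99)

n = 5
increment_copy(n)
# n is still 5: the caller's variable is untouched

data = [1, 2, 3]
append_in_place(data)
# data is now [1, 2, 3, 99]: the change leaked out
```

This is why encapsulation (pass everything in as arguments, pass everything out with return) is safer than relying on side effects.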

##### Best practice/concept of Encapsulation
• pass everything required as an argument

• pass everything out using return()

• No global side effects within functions

##### Environment (Workspace?!)
• Base environment (contains the basic packages)

• Global environment (packages loaded, variables created); "reset" every time R is reopened

• Current environment (the environment inside a function)

##### Scoping
• Name masking: R looks up one level at a time for undefined names (so predefined variables, e.g. pi, can be masked/overwritten)

• For functions with the same name, R defaults to the version in the most recently loaded package or most recently defined function

• Each use of a function is independent of any previous use (its environment is wiped clean for the new call)

What's Dynamic Lookup

R looks up a name's value when the code is executed, not when the function is created.

##### Design Strategies
• Top-down (rigid) and bottom-up (iterative) approaches

• flowcharts for planning and documentation

• outlines or pseudocode (breaking big task into manageable bits)

##### Coding Conventions

• <- for assignment, = for arguments, == for equality tests

• function comments listing purpose, I/O

• indentation and spacing around operators and after commas

• lines <80 characters long

• meaningful names in consistent style

##### Testing and Debugging

Type of errors:

• syntax errors

• logic errors

• numerical errors (overflow/underflow/floating point problems)

How to build tests:

• what should the code produce?

• how does the code behave with special cases (0, negative, NA, gigantic)?

• build tests prior to building code (code is created to pass the tests)

Debugging strategies:

• trial and error

• make the arguments global and step through the function line by line

• print() objects as the function progresses
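Building tests before code, as described above, might look like this in Python (a hypothetical `safe_mean` helper, with `None` standing in for R's NA):

```python
def safe_mean(values):
    """Mean that handles the special cases the tests below probe."""
    cleaned = [v for v in values if v is not None]  # drop NA-like entries
    if not cleaned:
        return None  # empty input: nothing to average
    return sum(cleaned) / len(cleaned)

# Tests written *before* the implementation: what should the code produce?
assert safe_mean([1, 2, 3]) == 2      # ordinary input
assert safe_mean([-4, 4]) == 0        # negative values
assert safe_mean([5, None]) == 5      # missing value ignored
assert safe_mean([]) is None          # empty/degenerate case
```

The code is then written (or rewritten) until every assertion passes.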

### Computer arithmetic (lecture 4)

• [x] How numbers are stored on computers. Fixed point – addition, multiplication, overflow. Floating point – limitations, consequences of lack of accuracy.

• [x] Data size on computer.

• [x] Fixed and floating point numbers in R.

• [x] Living with floating point limitations – strategies to diagnose and avoid problems.

• [x] Character storage.

##### Fixed Point (32-bit integer in R)

• multiplication is just shift and add

• overflow: the sum exceeds the available number of bits!

Multiplying a binary number by $2^n$ shifts the point $n$ places right (appending zeros).

Dividing a binary number by $2^n$ shifts the point $n$ places left (discarding digits).

$n\times 12 = n \times 2^3 + n \times 2^2$

i.e. $n$ with 3 zeros appended, plus $n$ with 2 zeros appended.

• Pro - results are exact, and fast

• Con - limited applicability and range

In R, the largest 32-bit integer is $2^{31}-1$.
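The shift-and-add decomposition can be checked directly in Python, where `<<` and `>>` shift the bits (appending or discarding zeros):

```python
n = 7

# n * 12 decomposed as n * 2^3 + n * 2^2: two shifts and an add
shift_and_add = (n << 3) + (n << 2)
assert shift_and_add == n * 12   # both give 84

# multiplying by 2^k appends k zero bits: 101 -> 101000
assert (5 << 3) == 40
# integer division by 2^k discards the low k bits: 101001 -> 101
assert (41 >> 3) == 5
```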

##### Floating Point (numeric in R)

Represented as $S.F \times B^E$

$S$ - Sign

$F$ - Fraction, unsigned integer

$B$ - Base, 2 for computer

$E$ - Exponent, signed integer, shift left/right

e.g.

$-.110_2\times 2^{1011_2} = -.000110_2 = -0.09375$ (the exponent $1011_2$ is sign-and-magnitude for $-3$, so the point shifts three places left)

Limitations

• bits for $F$ (fraction) limits the accuracy

• bits for $E$ (exponent) limits the size

• $\epsilon = 2^{1-f}$ (machine epsilon, where $f$ is the number of fraction bits): the smallest relative spacing that can be distinguished

Consequences of lack of accuracy

• sums rarely add up exactly, due to rounding error

• the order of summation matters (an intermediate result must not overflow)

• subtraction of two nearly equal numbers eliminates the most significant digits

Pro - wide applicability, wider range

Con - results are not exact, and slower

##### In R
• the default numeric type

• 64 bits in total: 1 for sign, 11 for exponent, 53 for fraction (1 extra bit gained by a trick)

• largest $+.111..._{53} \times 2^{+1023} = 8.98 \times 10^{+307}$

• accuracy $\epsilon = 2^{1-53} = 2.20\times 10^{-16}$

• underflows are set to 0, with no warning
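Python's floats are the same IEEE 64-bit doubles, so these limits can be checked from the standard library (note $2^{1-53} = 2^{-52}$):

```python
import sys

eps = sys.float_info.epsilon   # machine epsilon
assert eps == 2.0 ** -52       # the notes' 2^(1-53)

# eps is the smallest spacing distinguishable around 1.0:
assert 1.0 + eps > 1.0
assert 1.0 + eps / 2 == 1.0    # half of eps is silently absorbed

# underflow quietly becomes zero, with no warning
assert 5e-324 / 4 == 0.0
```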

##### Living with the Limits of Floating Point
• re-work expressions to avoid overflow

• normalize: scale by dividing

• change exact tests to 'good enough' tests, from x == y to |x - y| < e

• be careful about the order of operations

• extended precision (if desperate)

Many of these involve a trade-off between accuracy and speed.
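A quick Python illustration of replacing an exact test with a 'good enough' test:

```python
import math

x = 0.1 + 0.2
# the exact test fails because of rounding error...
assert x != 0.3
# ...so replace x == y with |x - y| < e:
assert abs(x - 0.3) < 1e-12
# math.isclose applies the relative version of the same test
assert math.isclose(x, 0.3)
```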

##### Character Storage
• as binary, determined by character set

• ASCII, 7-bit, 128 unique chars

• Unicode, 16-bit, 65536 chars

##### Data Size
• bit: binary digit, smallest unit of storage on a machine, hold 0/1

• byte: 8 bits - hold a single character

• word: the natural data size of a computer; mostly 64 bits nowadays

• kilobyte (KB) $2^{10}$ = 1024 bytes

• megabyte (MB) $2^{20}$ = 1024 KB

• gigabyte (GB) $2^{30}$ = 1024 MB

## Computer-intensive statistics (lect 6, 7, 8)

An understanding of how these methods work, ability to reproduce the algorithms and apply to case studies, pros and cons – which to use when.

• [x] Overview of parametric and nonparametric methods for constructing tests and confidence intervals. Contexts include one sample, two sample, ANOVA, multiple regression.

• [x] Permutation methods – test and interval

• [x] Randomization methods – test and interval

• [x] Monte Carlo tests; Monte Carlo methods

• [x] Bootstrap – nonparametric (including balanced bootstrap) and parametric. Percentile confidence intervals.

#### Traditional Parametric and Non-parametric Methods

Traditional parametric method:

1. the PDF of the population is known, e.g. $f(\theta) \sim N(\mu, \sigma^2)$;

2. the distribution of the test statistic $T(x)$ is known.

Problems: results are not invariant to transformation; bias correction needed.

Traditional non-parametric methods avoid fully specifying $f()$; e.g. the Wilcoxon Signed Rank test:

1. $f()$ is assumed symmetric about some median m;

2. $T(X)$ is specified.

Cons: the Wilcoxon test, e.g., requires formulating $H_0$ in terms of the median; less powerful than a t-test when the parametric assumptions are met.

#### Computer-intensive Statistical Method

Avoid fully specifying either $f()$ and/or $T()$, e.g.

• Permutation-based approach

• Randomization-based approach

• Monte-Carlo-based version of t test (avoid $T()$, but not $f(\theta)$)

• Par and Non-Par Bootstrap CI (avoid $T()$, $f(\theta)$ depends)

##### Permutation Methods

Assume the $x_i$ are symmetric about $\mu_0$. Given 12 observations, there are $2^{12} = 4096$ sign assignments (permutations). The $p$-value is the proportion of permutations whose $|\bar{X}|$ is greater than or equal to the observed $|\bar{x}|$ - how extreme the observed mean is. E.g. if 48 out of 4096 are as extreme, p = 48/4096 = 0.012.

Permutation tests

• pick $H_0$ and a test statistic $T(X)$

• under $H_0$ all permutations are equally likely

• calculate the $p$-value (the proportion as extreme or more so)

free to choose whatever test statistic, e.g. median, trimmed mean, t-statistic, etc.

readily extends to other situations, e.g. two-sample, ANOVA, multiple regression

Permutation CI

Find the set of $\mu_0$ such that $H_0: \mu = \mu_0$ is not rejected

Pros - widely applicable; produce exact result; no specific distribution assumption for population; no analytic distribution of test statistic.

Cons - each permutation must be equally likely (or prob is known); hard to program; can be prohibitively intensive for large sample size.
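A minimal Python sketch of the one-sample permutation test described above, enumerating all $2^n$ sign assignments (so only feasible for small $n$; `permutation_p_value` is an illustrative name):

```python
from itertools import product

def permutation_p_value(xs, mu0):
    """Exact one-sample permutation test. Under H0 the x_i are
    symmetric about mu0, so every pattern of signs on |x_i - mu0|
    is equally likely. p = proportion of sign patterns whose |mean|
    is at least as extreme as the observed one."""
    devs = [x - mu0 for x in xs]
    n = len(devs)
    observed = abs(sum(devs) / n)
    extreme = 0
    total = 0
    for signs in product([-1, 1], repeat=n):   # all 2^n sign patterns
        mean = sum(s * abs(d) for s, d in zip(signs, devs)) / n
        extreme += abs(mean) >= observed
        total += 1
    return extreme / total
```

For example, with data 1..6 and $\mu_0 = 0$ only the all-plus and all-minus patterns are as extreme, giving $p = 2/64$.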

##### Randomization Methods

Randomization is like permutation, except that only a random subset of all the permutations is used, e.g. randomly assigning signs to the $|x_i-\mu_0|$.

Compared with Permutation Methods

Pros - all the pros of permutation, plus: easier to code; the number of randomizations is fixed rather than $2^{n}$

Cons - each permutation must be equally likely (or its probability known); results vary from run to run.
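A randomization version of the same one-sample test, sampling a fixed number of sign patterns instead of enumerating all $2^n$ (an illustrative sketch; the observed sample is counted among the extremes, giving p out of n_rand + 1):

```python
import random

def randomization_p_value(xs, mu0, n_rand=999, seed=1):
    """Randomization test: like the exact permutation test, but
    using only a random sample of the 2^n sign patterns."""
    rng = random.Random(seed)
    devs = [abs(x - mu0) for x in xs]
    n = len(xs)
    observed = abs(sum(x - mu0 for x in xs) / n)
    extreme = 1                      # observed sample counts as extreme
    for _ in range(n_rand):
        mean = sum(rng.choice([-1, 1]) * d for d in devs) / n
        extreme += abs(mean) >= observed
    return extreme / (n_rand + 1)
```

Because the sign patterns are sampled, the reported p-value varies from run to run (unless the seed is fixed).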

##### Monte-Carlo Tests

The Monte Carlo test is a generalization of the randomization test to include parametric distributions. The algorithm:

1. Set population distribution $f()$ or $f(\theta)$ and $H_0$

2. Simulate a dataset under $H_0$ and calculate $T(X)$ (repeat 999 times)

3. Add $T(x)$ of sample data

4. quantile method - calculate the $p$-value (the proportion of the $T(X)$ values as extreme as, or more extreme than, that of the sample data, $T(x)$)

Pros - can be more flexible than randomization, parametric or non-parametric; no analytic distribution of the test statistic needed; can cope with non-iid data.

Cons (compared to permutation) - parameters must be estimated for the parametric version; results vary.
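The four steps above might be sketched as follows for a normal null hypothesis (an assumed $N(\mu_0, \sigma^2)$ population and the sample mean as $T(X)$; the names are illustrative):

```python
import random

def monte_carlo_p_value(data, mu0, sigma, n_sim=999, seed=42):
    """Monte Carlo test of H0: data ~ N(mu0, sigma^2), using the
    sample mean as test statistic T(X)."""
    rng = random.Random(seed)
    n = len(data)
    t_obs = abs(sum(data) / n - mu0)
    stats = [t_obs]                                   # step 3: add T(x)
    for _ in range(n_sim):                            # step 2: simulate under H0
        sim = [rng.gauss(mu0, sigma) for _ in range(n)]
        stats.append(abs(sum(sim) / n - mu0))
    extreme = sum(t >= t_obs for t in stats)
    return extreme / len(stats)                       # step 4: p-value
```

With 999 simulations plus the observed statistic, the smallest achievable p-value is 1/1000.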

##### Bootstrap

Bootstrapping simulations can be

• Parametric: simulate from a distribution whose properties are based on the sample data (no iid assumption)

• Non-parametric: simulate by resampling with replacement from the sample data (i.e. from the empirical DF)

Percentile Method: the 95% CI is given by the 2.5% and 97.5% percentiles of the set of bootstrap estimates.

Pros

• Simple, general and robust,

Cons

• poor performance when underlying distribution is very skewed or estimator is very biased

• generally, only asymptotically exact as $B$ and $n \to \infty$

The $f()$ assumption and the iid assumption are the respective trade-offs of the parametric and non-parametric bootstrap.
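A nonparametric bootstrap percentile CI for the mean could be sketched as follows (illustrative only; a balanced bootstrap would instead arrange for each observation to appear equally often across all resamples):

```python
import random

def bootstrap_percentile_ci(xs, n_boot=2000, level=0.95, seed=7):
    """Nonparametric bootstrap: resample with replacement from the
    data (the empirical DF), collect the resampled means, and take
    the outer percentiles as the CI."""
    rng = random.Random(seed)
    n = len(xs)
    means = sorted(
        sum(rng.choice(xs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]          # 2.5% percentile
    hi = means[int((1 + level) / 2 * n_boot) - 1]      # 97.5% percentile
    return lo, hi
```

A parametric bootstrap would replace the resampling line with draws from a fitted distribution (e.g. a normal with the sample mean and variance).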

#### General Monte Carlo Method

Monte Carlo Methods are a broad class of computational algorithms that rely on repeated random sampling to generate numerical results. Used for

• Inference, (computer-intensive stats, e.g. MC test, Bootstrap, Bayesian stats)

• Optimization, (later)

• Stochastic Numerical Integration (compared to Deterministic one!)
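Stochastic numerical integration in one dimension, as a hedged sketch: average $f$ at uniform random points and scale by the interval length; the error shrinks like $1/\sqrt{n}$.

```python
import random

def mc_integrate(f, a, b, n=100_000, seed=3):
    """Monte Carlo integration: estimate the integral of f over
    [a, b] as (b - a) times the mean of f at n uniform random points."""
    rng = random.Random(seed)
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# integral of x^2 on [0, 1] is 1/3
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
```

Deterministic quadrature is more accurate in low dimensions, but the Monte Carlo error rate does not degrade with dimension, which is why it is used for high-dimensional integrals.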

## Optimization methods (lect 9, 10, 11)

An understanding of how the methods work, ability to reproduce the algorithms and suggest which might work best in case studies, pros and cons, which to use when. An understanding of what the R functions nls, optimize and optim do.

• [x] Components of optimization

• [x] Outline of approaches: graphical, analytic, brute force, iterative: “basic”, calculus-based, stochastic

• [x] Systematic: Line and grid searches

• [x] For iterative approaches: convergence, convergence order, stopping rules, robustification

• [x] Bisection; golden section

• [x] Newton’s method – univariate, multivariate

• [x] Gauss-Newton

• [x] The EM algorithm.

• [x] Stochastic optimization: optimizing stochastic functions; stochastic search methods. Blind random search, localized random search, simulated annealing.

• [x] Not examinable: tabu methods and genetic algorithms.

#### Why optimization?

• non-linear Least-squares problems

• Maximum Likelihood

• find CI via test inversion

• Minimizing risk in Bayesian decision analysis

#### Components of Optimization

• Objective function: what we want to maximize/minimize, e.g. the log-likelihood for ML, $\sum(y-\hat{y})^2$ for least squares

• Criteria: Constraints on parameters

• Stopping Rule(s): desired accuracy/maximum iterations

#### Approaches to Optimization

1. Graphical - easy; but can't quantify the error, and hard for multivariate problems

2. Analytic - differentiate the function and set the derivative to zero; the solution is exact; but often there is no analytic solution!

3. Brute force - line/grid search (for univariate/multivariate problems); reliable, with known error; but inefficient: the number of calculations required depends on the accuracy ($\epsilon$), and it is hard to cover a multivariate space

'Brute force' can be used to find starting values for other methods to refine.

#### Iterative methods

• "basic": plotting, line or grid searches

• Calculus-based (lect 11), Newton, Gauss-Newton

• Stochastic (lect 12)

#### General Iterative Algorithm

1. Start iteration $t=0$, initial guess $x^0$

2. Next iteration $t=t+1$, improved guess $x^t = f(x^{t-1})$, using an updating equation

3. Evaluate whether new guess is sufficiently accurate, using a stopping rule
• If yes, stop and return $x^* = x^t$ and optimized objective function $g(x^*)$
• If no, go back to 2 or consider whether to give up (report no convergence)

Bisection Method

Take finding the root of $g'(x) = 0$ as an example.

Start with an initial interval $[a_0, b_0]$ such that $g'(a_0)g'(b_0)<0$, and systematically halve the interval at each iteration.

Golden Section Search

More efficient (converges more quickly): an extension of the bisection idea that takes advantage of the Golden Ratio.
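The bisection scheme might be implemented as follows (a sketch for root-finding on $g'$; note the numerically stable midpoint $a + (b-a)/2$):

```python
def bisection(g_prime, a, b, eps=1e-8, max_iter=200):
    """Bisection root-finder for g'(x) = 0. g_prime must change sign
    on [a, b]; the bracketing interval is halved each iteration, so
    the error after t iterations is a known (b - a) / 2^t."""
    if g_prime(a) * g_prime(b) >= 0:
        raise ValueError("root not bracketed: need g'(a) g'(b) < 0")
    for _ in range(max_iter):
        mid = a + (b - a) / 2          # more stable than (a + b) / 2
        if b - a < eps:
            return mid
        if g_prime(a) * g_prime(mid) <= 0:
            b = mid                    # root lies in the left half
        else:
            a = mid                    # root lies in the right half
    return a + (b - a) / 2
```

For example, minimising $g(x) = (x-2)^2$ means finding the root of $g'(x) = 2(x-2)$ on a bracketing interval such as $[0, 5]$.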

##### Convergence

Each iteration produces an answer closer to the optimum, so we gradually converge on the solution we want. The convergence order is an index of how fast the method converges.

Convergence criteria serve as stopping rules; there are two main types, absolute and relative.

Absolute: stop when $|x^t - x^{t-1}| < \epsilon$, the maximum tolerance.

Problem: it ignores the size of the numbers, e.g. an error of 0.1 usually matters more when $x^\ast$ is 0.5 than when it is 50,000.

Relative: stop when $\frac{|x^t - x^{t-1}|}{|x^{t-1}|} < \epsilon$, the maximum proportional tolerance.

##### Robustification
• use more numerically stable operations

e.g. use $a+\frac{b-a}{2}$ instead of $\frac{a+b}{2}$

• "good enough" convergence (helps when $x^t$ is close to zero):

$\frac{|x^t-x^{t-1}|}{|x^{t-1}|+\epsilon} < \epsilon$

• stop anyway after $N$ iterations, whether convergence is achieved or not
• if not converging over several iterations, stop and restart somewhere else

The global optimum may not be bracketed at the start! And even if it is, there may be multiple optima, and you may not end up with the best one.

#### Newton's Method (univariate)

Also called "Newton-Raphson Method"

Iterative, numerical technique for optimization, based on linear approximation with Taylor Series. Think of it as using Successive tangent lines!

##### Rationale

Approximate $g'(x^\ast)$ by a linear Taylor series expansion at $x^t$: $g'(x^\ast) \approx g'(x^t) + (x^\ast-x^t)g''(x^t) = 0$

Re-arranging gives $x^\ast \approx x^t - \frac{g'(x^t)}{g''(x^t)}$

This requires that $g'(x)$ is continuously differentiable and that $g''(x^t)\ne 0$.

So the updating equation is $x^{t+1} = x^t - \frac{g'(x^t)}{g''(x^t)}$

##### Algorithm
1. define objective function $g(x)$ and convergence criterion tolerance $\epsilon$

2. start iteration $t=0$, set $x^0$, calculate $g(x^0)$.

3. next iteration $t=t+1$; update $x^{t} = x^{t-1} - g'(x^{t-1})/g''(x^{t-1})$ and calculate $g(x^{t})$.

4. evaluate whether convergence criteria are met.

5. if met, return $x^\ast = x^t$ and exit; otherwise, return to step 3.

##### Summary of Newton's
• Can converge extremely rapidly - convergence order $\beta=2$

• but, may not converge at all, depending on shape of $g'(x)$

• Can easily get stuck in local optima
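A univariate Newton sketch using the robustified relative stopping rule from earlier (illustrative; it returns a convergence flag rather than failing silently):

```python
def newton(g_prime, g_double_prime, x0, eps=1e-10, max_iter=100):
    """Newton's method for g'(x) = 0. Returns (x_star, converged)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - g_prime(x) / g_double_prime(x)   # updating equation
        if abs(x_new - x) / (abs(x) + eps) < eps:    # robustified relative test
            return x_new, True
        x = x_new
    return x, False                                  # report no convergence

# minimise g(x) = x^4 - 3x^2: g' = 4x^3 - 6x, g'' = 12x^2 - 6
# from x0 = 2 this converges to sqrt(3/2) ~ 1.2247
x_star, ok = newton(lambda x: 4 * x**3 - 6 * x,
                    lambda x: 12 * x**2 - 6, x0=2.0)
```

Starting from a different $x_0$ can land on a different stationary point, which illustrates the local-optimum warning above.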

#### Newton's Method (multivariate)

• Objective function is $g(\theta)$ where $\theta$ is a vector of $n$ parameters

• $\nabla$ and $\nabla^2$ are the first and second derivatives (gradient and Hessian)

• $\Delta$ is the step vector (direction and size of one step), $\theta_{1} = \theta_0 + \Delta$

$\nabla^2 g(\theta) \, \Delta = - \nabla g(\theta)$

where $\nabla^2g(\theta)$ is an $n \times n$ matrix, and $\Delta$ and $\nabla g(\theta)$ are $n \times 1$ vectors.

##### Algorithm Pseudocode
1. set the objective function $g(\theta)$ and convergence tolerance $\epsilon$

2. choose a starting $\theta_0$

3. calculate
• $g(\theta_0)$
• the gradient $\nabla g(\theta_0)$
• the Hessian $\nabla^2 g(\theta_0)$
4. solve $\nabla^2 g(\theta_0)\,\Delta = - \nabla g(\theta_0)$ for $\Delta$

5. update $\theta_{1} = \theta_0 + \Delta$; repeat until convergence

#### Gauss-Newton Method (multivariate)

An iterative procedure for estimating the parameters of nonlinear least-squares problems.

• The objective function is always the sum of squared residuals.

• Only first-order derivatives are required (an advantage over Newton's).

##### Rationale
• Linearise the regression with a 1st order Taylor series evaluated at $\theta^0$

• re-arrange in the form of linear regression

$$\begin{aligned} f(x,\theta) - f(x,\theta^0) &= \sum_j (\theta_j - \theta^0_j)\frac{df}{d\theta_j}\bigg|_{\theta^0} \\ &\equiv \sum_j \gamma_j\omega_j \end{aligned}$$

where

• $\gamma_j = \theta_j-\theta_j^0$ corresponds to the unknown parameter

• $\omega_j = \frac{df}{d\theta_j}\Big|_{\theta^0}$ corresponds to the known covariate

##### Algorithm
1. start with $i=0$ and set initial parameters $\theta^0$

2. estimate $\gamma$: $\gamma^1 = (W^{0T}W^0)^{-1} W^{0T}z^0$

3. update $\theta^1 = \theta^0 + \gamma^1$

4. re-evaluate $W$ and $z$ at $\theta^1$

5. if the convergence criteria are met, return $\theta$; if not, go to step 2.

### Summary of Newton and Gauss-Newton

Newton's Method

• Any objective functions

• require 1st and 2nd derivatives

• Uni and Multivariate case

Gauss-Newton Method

• Objective function is always the sum of squared residuals

• require only 1st derivatives

• for Multivariate non-linear least-square problem

### Stochastic Methods

The previous methods are all deterministic: the objective function can be evaluated exactly, and the algorithm is deterministic - i.e. given the same data and start point, it takes the same path to the same solution.

#### Why Stochastic

• Necessary for dealing with stochastic functions

• Can be used for discrete problems

• Better at finding the global optimum rather than a local one!

Some cons: faster than brute force but slower than deterministic methods. Do you want a good solution, or the best one?

Stochastic optimization methods use random variables. The two varieties below can overlap:

• Optimizing stochastic functions (the objective function contains random noise); the step size decreases and the accuracy increases as the iteration number increases

• Optimization algorithms with random search (the search method contains a random element); these can converge faster and improve robustness when the global optimum is hard to find.

#### Stochastic Search Algorithms

1. Blind random search - generate each new candidate from some chosen PDF

2. Localized random search - generate each new candidate from some chosen PDF that depends on the current candidate; the step is neither fixed (brute force) nor fully random (blind)

3. Simulated annealing - many variants; works well but slow
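A localized random search sketch: each candidate is drawn from a normal distribution centred on the current best point (the step size and iteration count here are arbitrary illustrative choices):

```python
import random

def localized_random_search(g, x0, step=0.5, n_iter=5000, seed=11):
    """Localized random search (minimisation): propose a candidate
    from N(current best, step^2) and keep it only if it improves
    the objective."""
    rng = random.Random(seed)
    best_x, best_g = x0, g(x0)
    for _ in range(n_iter):
        cand = best_x + rng.gauss(0.0, step)   # centred on current best
        val = g(cand)
        if val < best_g:                       # accept only improvements
            best_x, best_g = cand, val
    return best_x, best_g

x_star, g_star = localized_random_search(lambda x: (x - 3.0) ** 2, x0=0.0)
```

Simulated annealing differs in that it sometimes accepts a *worse* candidate, with a probability that shrinks over time, which helps it escape local optima.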

##### Consequences of Stochasticity
• evaluation of the objective function (and its gradient) is no longer exact

• grid search methods may pick the wrong point; iterative methods may fail to converge

• accuracy can be improved by repeated evaluation; the error decreases as $1/\sqrt{N}$

## Reproducibility and graphics (lect 12, 13, 14)

#### Reproducible research

• [x] What is it and why is it a good idea

• [x] Ways to achieve reproducibility (or fail to achieve reproducibility)

• [x] Key components of reproducible research

• [x] Not examinable: specifics of R Markdown.

Replicable: independent researchers can follow the same methods and arrive at the same conclusions (the older standard).

Reproducible: at the least, the results should be reproducible - using the same data and code yields the same results.

##### Why Reproducible
• you can reproduce the project even after forgetting most of the details long afterwards

• easier to re-run with additional data

• planning for longevity makes research more impactful

Using a different tool for each phase of a project leads to poor reproducibility, because

• errors can occur in any/all of these phases

• when new data arise, the whole process must start over

##### Components of Reproducibility
• publication with linked, executable code and data

• never touch the raw data

• avoid manual data manipulation

• version control

• include random seed information

• use relative paths

Also, organize and name the project files well!

#### Graphical presentation of data

• [x] What makes a good graphic; Tufte’s principles; graphical integrity

• [x] Ability to recognize good and bad graphics and explain why they are good or bad

• [x] Not examinable: specifics of data visualization methods in R.

##### Tufte's Principles
• numbers match true proportions

• dimensions match the data

• labels clear and detailed

• use well-known units to represent money

• not vary for some ulterior motive and not imply unintended context

Lie factor - the ratio of the effect shown in the graphic to the effect shown in the data:

$$\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect shown in data}}$$

Chart junk - unnecessary complexity in a graphic.

#### Code performance

• [x] What a profiler is and how and when to use one

• [x] Ability to recognize what might benefit from optimization given profiler output

Profiler - software that records the instructions the computer is executing

When to use - after code is working as intended.

#### Tidyverse

• [x] Definition of tidy data

• [x] Ability to recognize tidy vs. untidy data

• [x] Not examinable: specific R tidyverse packages and functions.

Tidy data is easier to manipulate, model and visualize! It is defined as:

1. Each variable is saved in its own column

2. Each observation is saved in its own row