Revision syllabus for MT4113, organized by lecture/chapter.

## Fundamental programming principles

### Computer (lecture 1)

[x] Definition of a computer

[x] Hardware, software

[x] Programming languages: classification, generations,

[x] compiled vs interpreted, which should be used when

##### Some Definitions

*Computer* - a device that accepts information and manipulates it according to a sequence of instructions to produce output.

*Hardware* - the physical computer equipment, including the CPU, memory (registers, RAM), mass storage (ROM), and I/O devices (keyboard, screen, printer).

*Software* - computer programs: sets of instructions operating the hardware. System software (Linux, Windows), application software (SAS, Word), programming software (R).

##### Programming Languages

Classification

- By level
  - Low - close to the hardware; machine language
  - High - far from the hardware, close to natural language; runs on different OSs
- Generality - higher-level languages tend to be more specialized to an application area
- Generation - 1st to 5th
- Speed/efficiency - low-level is generally faster (but not always)

Generations

- 1st: machine language - relatively few different instructions
- 2nd: assembly language - human-readable machine language
- 3rd: CPU-independent languages - Fortran, C, Pascal, OO languages (C++, Java); most software we use is written in a 3GL
- 4th: designed for a specific application:
  - database query languages (SQL)
  - graphical user interfaces (GUI)
  - maths languages (Mathematica, Maple)
  - stats languages (R, SAS)
- 5th: solve problems from a problem specification; no need to explicitly write an algorithm
  - e.g. Prolog, used in artificial intelligence studies

##### Compiled vs Interpreted

Two approaches to translating languages into machine language (this must be done before code can be executed):

Interpreted: translated and run line by line as it is entered, using an 'interpreter'. e.g. Python, R.

Compiled: written and saved in a file, then translated *all* into machine code *in one go* using a 'compiler'.

##### Which and When

Simple, do-once tasks - GUI; no reproducible trail.

More complex, run a few times - a 4GL within an interpreter.

Production code where efficiency is important - a 3GL, compiled; or prototype in a 4GL and rewrite the slow bits in a 3GL.

### Algorithms (lecture 2)

[x] Definition; important features; components of an algorithm

[x] Examples, criticism of examples

[x] Code and pseudocode

[x] Control structures and vectorization in R

##### What's An Algorithm

Algorithm: an *ordered* sequence of *unambiguous* and well-deﬁned instructions, *performing some task* and *halting* in ﬁnite time!

Important features

an ordered sequence

unambiguous and well-defined (clear, do-able, can be carried out without difficulty)

performs the task, with nothing left out

halts in finite time, so the algorithm must terminate

Example

- boil an egg!

Elements

assigning values

computation

return values

Control Structures

Conditional execution - if, switch/case

- Iteration (looping) - repeat a set of instructions
  - a fixed number of times
  - a variable number of times (warning: potential infinite loop)

Recursion - algorithms that call themselves

##### Code and Pseudocode

Code: instructions in a computer language that implement an algorithm.

Pseudocode: generic, informal, language-independent instructions that specify an algorithm.

Always write out pseudocode before coding!

##### Vectorization

Much faster than loops

Some variants of `apply`: `sapply`, `lapply`, `tapply`, `mapply`, `%by%`
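As an illustrative sketch (not lecture code), a loop and its vectorized equivalent give the same answer, but the vectorized version is typically much faster:

```r
# Square the numbers 1 to 1e5, first with a loop, then vectorized.
x <- 1:1e5

# Loop version: pre-allocate the result rather than growing it.
squares.loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares.loop[i] <- x[i]^2
}

# Vectorized version: one call, typically much faster.
squares.vec <- x^2

identical(squares.loop, squares.vec)  # TRUE
```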

### Modularization and good programming practice (lect 2, 3, prac 2, 3)

[x] Modular programming, definition, advantages, interface, passing by reference and value, best practices, encapsulation

[x] Environments, scoping, dynamic lookup

[x] Strategies for code design

[x] Coding conventions and style

[x] Testing and debugging. Types of code errors; how to build tests; debugging strategies.

[x] Online collaborative programming tools: git repositories, clone, pull, push, commit.

##### Modularization

Modular programming: split a program into discrete, readable blocks of code so that each block does a small amount.

Function Advantages

avoids repeating code

easier to test

easier to update and propagate changes

divide and conquer problems

##### Module interfaces

- passing by value
  - makes a copy, stored in a separate location from the original variable
  - changes inside the function have no effect on the variable's value outside
  - inefficient but safer
- passing by reference
  - uses the original variable (location) directly
  - changes to the variable within the function affect its value after the function runs
  - efficient but dangerous

Most 3GLs allow both interfaces; one can specify interfaces for certain parameters in C; Base R does not allow passing by reference.

##### Best practice/concept of Encapsulation

pass everything required in as an argument

pass everything out using `return()`

no global side effects within functions
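A minimal sketch of these two points together (the `increment` function is a made-up example): R passes arguments by value, so changes inside a function do not leak out, and results should come back via `return()`.

```r
# R passes arguments by value: modifying x inside increment() leaves
# the caller's variable untouched.
increment <- function(x) {
  x <- x + 1   # changes only the local copy
  return(x)    # pass the result out explicitly with return()
}

y <- 10
z <- increment(y)
y  # still 10 -- no global side effect
z  # 11
```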

##### Environment (Workspace?!)

Base environment (contains basic packages)

Global environment (packages loaded, variables created); "reset" every time R is reopened

Current environment (environment inside of a function)

##### Scoping

Name masking: for undefined names, R looks up one level (to enclosing environments); predefined variables, e.g. `pi`, can be overwritten.

For functions with the same name, R defaults to the version in the most recently loaded package.

Each use of a function is independent of any previous use (its environment is wiped clean for each new call).

**What's Dynamic Lookup**

R looks up values for names when the code is executed, not when the function is created.
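A small illustrative example of dynamic lookup (names and values are invented): the free variable `a` is only resolved when `f()` runs, so the same function can return different results.

```r
# Dynamic lookup: R finds 'a' when f() is *run*, not when it is defined.
f <- function() a + 1

a <- 10
f()  # 11

a <- 100
f()  # 101 -- same function, different result
```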

##### Design Strategies

Top-down (rigid) and bottom-up (iterative) approaches

flowcharts for planning and documentation

outlines or pseudocode (breaking big task into manageable bits)

##### Coding Conventions

Basically readability and consistency

`<-` for assignment, `=` for arguments, `==` for Boolean comparison

function comments listing purpose, inputs, and outputs

indentation and spacing around operators and after commas

lines <80 characters long

meaningful names in consistent style

##### Testing and Debugging

Type of errors:

syntax errors

logic errors

numerical errors (overflow/underflow/floating point problems)

How to build tests:

what the code should produce

how the code behaves with special cases (0, negative, NA, gigantic values)

build tests prior to building the code (the code is then written to pass the tests)

Debugging strategies:

trial and error

set arguments globally and step through the function

`print()` objects as the function progresses
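A sketch of the test-first idea, using a hypothetical `safe.divide()` function (not from the lectures): the tests below were written first, and the code exists to pass them.

```r
# Hypothetical function: divide x by y, returning NA if any divisor is 0.
safe.divide <- function(x, y) {
  if (any(y == 0, na.rm = TRUE)) return(NA)
  return(x / y)
}

# Tests: expected behaviour first, then special cases (zero, negative).
stopifnot(safe.divide(6, 3) == 2)
stopifnot(is.na(safe.divide(1, 0)))
stopifnot(safe.divide(-6, 3) == -2)
```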

### Computer arithmetic (lecture 4)

[x] How numbers are stored on computers. Fixed point – addition, multiplication, overflow. Floating point – limitations, consequences of lack of accuracy.

[x] Data size on computer.

[x] Fixed and floating point numbers in R.

[x] Living with floating point limitations – strategies to diagnose and avoid problems.

[x] Character storage.

##### Fixed Point (32-bit integer in R)

addition: add directly, bit by bit with carries

multiplication: shift and add

overflow: the result exceeds the available number of bits!

multiplying a binary number by \(2^n\): move the point \(n\) places right, padding with zeros

dividing a binary number by \(2^n\): move the point \(n\) places left, dropping bits

\(n\times 12 = n \times 2^3 + n \times 2^2\)

i.e. \(n\) with three zeros appended, plus \(n\) with two zeros appended

Pro - results are exact, fast

Con - limited applicability and range

In R, the largest 32-bit integer is \(2^{31}-1\)
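This limit can be checked directly in R (a small illustrative snippet):

```r
# 32-bit integers in R: the largest value is 2^31 - 1.
.Machine$integer.max               # 2147483647
.Machine$integer.max == 2^31 - 1   # TRUE

# Adding 1L overflows: R returns NA with a warning, rather than wrapping.
.Machine$integer.max + 1L          # NA, with an overflow warning
```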

##### Floating Point (numeric in R)

Represented as \(S.F \times B^E\)

\(S\) - Sign

\(F\) - Fraction, unsigned integer

\(B\) - Base, 2 for computer

\(E\) - Exponent, signed integer, shift left/right

e.g. (the exponent \(1011\) is a signed binary integer: sign bit 1, magnitude \(011_2 = 3\), i.e. \(-3\))

\(-.110\times 2^{1011} = -.000110 = -0.09375\)

**Limitations**

bits for \(F\) (fraction) limits the accuracy

bits for \(E\) (exponent) limits the size

\(\epsilon = 2^{1-f}\), where \(f\) is the number of fraction bits: the smallest relative interval that can be distinguished (machine epsilon)

**consequences of lack of accuracy**

sums rarely add up exactly, due to rounding error

order of summation matters (e.g. avoid overflowing early on)

subtracting two nearly equal numbers eliminates the most significant digits (catastrophic cancellation)

Pro - wide applicability, wider range

Con - result not exact, slow

##### In R

default numeric type

64 bits in total: 1 for sign, 11 for exponent, 52 stored for the fraction (53 effective: 1 extra bit gained by a trick, the implicit leading 1)

largest \(+.111..._{53} \times 2^{+1023} \approx 8.98 \times 10^{+307}\)

accuracy \(\epsilon = 2^{1-53} = 2.22\times 10^{-16}\)

underflows are set to 0, with no warning

##### Living with the Limits of Floating Point

re-work expressions to avoid overflow

normalize: scale by dividing

change exact tests to 'good enough' tests: from `x == y` to `abs(x - y) < e`

be careful about the order of operations

double precision (a desperate measure)

Many of these show a trade-off between accuracy and speed
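A short illustrative snippet of the 'good enough' test (the tolerance choice here is one common convention, not lecture-prescribed):

```r
# Floating point sums rarely add up exactly.
0.1 + 0.2 == 0.3     # FALSE: rounding error in binary fractions
0.1 + 0.2 - 0.3      # a tiny nonzero remainder, not 0

# 'Good enough' test: compare against a tolerance instead.
eps <- .Machine$double.eps^0.5
abs((0.1 + 0.2) - 0.3) < eps        # TRUE

# Or use the built-in helper:
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```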

##### Character Storage

as binary, determined by character set

ASCII, 7-bit, 128 unique chars

Unicode, 16-bit, 65536 chars

##### Data Size

bit: binary digit, smallest unit of storage on a machine, hold 0/1

byte: 8 bits - hold a single character

word: natural data size of a computer, most 64 bits

kilobyte (KB) \(2^{10}\) = 1024 bytes

megabyte (MB) \(2^{20}\) = 1024 KB

gigabyte (GB) \(2^{30}\) = 1024 MB

## Computer-intensive statistics (lect 6, 7, 8)

An understanding of how these methods work, ability to reproduce the algorithms and apply to case studies, pros and cons – which to use when.

[x] Overview of parametric and nonparametric methods for constructing tests and confidence intervals. Contexts include one sample, two sample, ANOVA, multiple regression.

[x] Permutation methods – test and interval

[x] Randomization methods – test and interval

[x] Monte Carlo tests; Monte Carlo methods

[x] Bootstrap – nonparametric (including balanced bootstrap) and parametric. Percentile confidence intervals.

#### Traditional Par and Non-par Method

Traditional **parametric** methods specify:

the PDF for the population, e.g. \(f(\theta) \sim N(\mu, \sigma^2)\);

the distribution of the test statistic \(T(x)\).

Problems: results are not invariant to transformation; bias correction may be needed.

Traditional **non-parametric** methods avoid fully specifying \(f()\); e.g. the Wilcoxon signed-rank test:

\(f()\) is symmetric about some median \(m\);

\(T(X)\) is specified.

Cons: the Wilcoxon test, for example, requires formulating \(H_0\) in terms of the median; less powerful than a t-test when the parametric assumptions are met.

#### Computer-intensive Statistical Method

Avoid fully specifying \(f()\) and/or \(T()\), e.g.

- **Permutation**-based approaches
- **Randomization**-based approaches
- **Monte Carlo** versions of tests, e.g. of the t-test (avoid \(T()\), but not \(f(\theta)\)); parametric and non-parametric
- **Bootstrap** CIs (avoid \(T()\); whether \(f(\theta)\) is avoided depends on the variant)

##### Permutation Methods

Assume \(x_i\) is symmetric about \(\mu_0\). Given 12 observations, there are \(2^{12} = 4096\) ways to assign signs (permutations). The \(p\)-value is the proportion of permutations whose \(|\bar{X}|\) is greater than or equal to the observed \(|\bar{x}|\), i.e. how extreme the observed mean is. e.g. if 48 out of 4096 are as extreme, \(p = 48/4096 = 0.012\).

**Permutation tests**

pick \(H_0\) and \(T(X)\)

under \(H_0\) all permutations are equally likely

calculate the \(p\)-value (the proportion as extreme or more extreme)

free to choose any test statistic, e.g. median, trimmed mean, t-statistic, etc.

readily extends to other situations, e.g. two-sample, ANOVA, multiple regression

**Permutation CI**

Find the set of \(\mu_0\) such that \(H_0: \mu = \mu_0\) is not rejected

Pros - widely applicable; *produce exact result*; no specific distribution assumption for population; no analytic distribution of test statistic.

Cons - each permutation must be equally likely (or prob is known); *hard to program*; can be *prohibitively intensive for large sample size*.

##### Randomization Methods

Randomization is like permutation, except that only a random subset of all permutations is used; e.g. randomly assign a sign to each \(|x_i-\mu_0|\).

Compared with permutation methods:

Pros - all the pros of permutation, plus: easier to code; the number of randomizations is fixed rather than growing as \(2^{n}\)

Cons - each permutation must be equally likely (or its probability known); *results vary* from run to run.
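A minimal R sketch of the one-sample sign-flip randomization test described above (the function name and data are illustrative, not lecture code):

```r
# One-sample randomization test of H0: mu = mu0, assuming symmetry.
# Under H0 each deviation |x_i - mu0| is equally likely to be + or -,
# so we randomly flip signs and compare mean magnitudes.
randomization.test <- function(x, mu0, n.rand = 999) {
  dev <- x - mu0
  t.obs <- abs(mean(dev))
  t.rand <- replicate(n.rand, {
    signs <- sample(c(-1, 1), length(dev), replace = TRUE)
    abs(mean(signs * abs(dev)))
  })
  # p-value: proportion as or more extreme (observed result counts as one)
  (sum(t.rand >= t.obs) + 1) / (n.rand + 1)
}

set.seed(1)
x <- rnorm(12, mean = 0.5)
randomization.test(x, mu0 = 0)
```

Because only a random subset of sign assignments is used, the \(p\)-value varies slightly between runs, which is exactly the "results vary" con above.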

##### Monte-Carlo Tests

A Monte Carlo test is *a generalization of the randomization test* that allows parametric distributions. Algorithm:

Set population distribution \(f()\) or \(f(\theta)\) and \(H_0\)

Simulate a dataset under \(H_0\) and calculate \(T(X)\) (repeat, e.g., 999 times)

Add \(T(x)\) of sample data

quantile method - calculate the \(p\)-value (the proportion of \(T(X)\) values as extreme as or more extreme than that of the sample data, \(T(x)\))

Pros - Can be more *flexible than randomization*, par or non-par; no analytic distribution of test statistic; can *cope with non-iid data*.

Cons (compared to permutation) - need to estimate parameters for the parametric version; results vary.

##### Bootstrap

Bootstrapping simulations can be

Parametric: simulate from a distribution whose properties are based on the sample data (no iid assumption).

Non-parametric: simulate by resampling with replacement from the sample data (i.e. from the empirical DF).

**Percentile method**: the 95% CI is given by the 2.5% and 97.5% percentiles of the set of bootstrap estimates of \(\mu\).

**Pros**

- Simple, general and robust

**Cons**

poor performance when underlying distribution is very skewed or estimator is very biased

generally, only asymptotically exact as \(B\) and \(n \to \infty\)

The \(f()\) assumption and the iid assumption are the respective trade-offs (pro and con) of parametric and non-parametric bootstrapping.
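A minimal sketch of the non-parametric bootstrap with the percentile CI (function name, sample size, and number of replicates are illustrative choices):

```r
# Nonparametric bootstrap percentile CI for the mean: resample with
# replacement, compute the statistic, take the outer percentiles.
boot.ci.percentile <- function(x, B = 2000, level = 0.95) {
  means <- replicate(B, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(means, c(alpha / 2, 1 - alpha / 2))
}

set.seed(1)
x <- rexp(30)            # a skewed sample
boot.ci.percentile(x)    # 2.5% and 97.5% percentiles of the bootstrap means
```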

#### General Monte Carlo Method

Monte Carlo Methods are a broad class of computational algorithms that rely on repeated random sampling to generate numerical results. Used for

Inference, (computer-intensive stats, e.g. MC test, Bootstrap, Bayesian stats)

Optimization, (later)

Stochastic Numerical Integration (compared to Deterministic one!)

## Optimization methods (lect 9, 10, 11)

An understanding of how the methods work, ability to reproduce the algorithms and suggest which might work best in case studies, pros and cons, which to use when. An understanding of what the R functions nls, optimize and optim do.

[x] Components of optimization

[x] Outline of approaches: graphical, analytic, brute force, iterative: “basic”, calculus-based, stochastic

[x] Systematic: Line and grid searches

[x] For iterative approaches: convergence, convergence order, stopping rules, robustification

[x] Bisection; golden section

[x] Newton’s method – univariate, multivariate

[x] Gauss-Newton

[x] The EM algorithm.

[x] Stochastic optimization: optimizing stochastic functions; stochastic search methods. Blind random search, localized random search, simulated annealing.

[x] Not examinable: tabu methods and genetic algorithms.

#### Why optimization?

non-linear Least-squares problems

Maximum Likelihood

find CI via test inversion

Minimizing risk in Bayesian decision analysis

#### Components of Optimization

**Objective function**: what we want to maximize/minimize, e.g. the log-likelihood for ML, \(\sum(y-\hat{y})^2\) for least squares

**Criteria**: constraints on the parameters

**Stopping rule(s)**: desired accuracy / maximum number of iterations

#### Approaches to optimization

Graphical - easy; but can't quantify error, and hard for multivariate problems

Analytic - differentiate the function and set the derivative to zero; solution is exact; but there is often no analytic solution!

Brute force - line/grid search (for univariate/multivariate problems); reliable, with known error; but inefficient: the number of calculations depends on the accuracy (\(\epsilon\)), and multivariate spaces are hard

Brute force can be used to find starting values for other methods to refine.

#### Iterative methods

"basic": plotting, line or grid searches

Calculus-based (lect 11), Newton, Gauss-Newton

Stochastic (lect 12)

#### General Iterative Algorithm

Start iteration \(t=0\) with an initial guess \(x^0\)

Next iteration \(t=t+1\): form an improved guess \(x^t = f(x^{t-1})\) using an **updating equation**

Evaluate whether the new guess is sufficiently accurate, using a **stopping rule**
- if yes, stop and return \(x^\ast = x^t\) and the optimized objective function \(g(x^\ast)\)
- if no, go back to step 2, or consider whether to give up (report non-convergence)

##### Bisection Search

Take finding root of \(g'(x) = 0\) as an example.

Start with an initial interval \([a_0, b_0]\) such that \(g'(a_0)g'(b_0)<0\), and systematically halve the interval at each iteration.
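A minimal R sketch of bisection for a root of \(g'(x)\) (function names and the example objective are illustrative; note the numerically stable midpoint from the robustification section):

```r
# Bisection search for a root of gprime on [a, b], assuming
# gprime(a) * gprime(b) < 0. The interval halves each iteration.
bisection <- function(gprime, a, b, eps = 1e-8, max.iter = 100) {
  stopifnot(gprime(a) * gprime(b) < 0)
  for (i in seq_len(max.iter)) {
    mid <- a + (b - a) / 2           # more stable than (a + b) / 2
    if (abs(b - a) < eps) break
    if (gprime(a) * gprime(mid) <= 0) b <- mid else a <- mid
  }
  a + (b - a) / 2
}

# Minimize g(x) = (x - 2)^2 by finding the root of g'(x) = 2(x - 2).
bisection(function(x) 2 * (x - 2), 0, 5)  # approximately 2
```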

**Golden Section Search**

More efficient (converges more quickly): an extension of the bisection method that takes advantage of the golden ratio.

##### Convergence

Each iteration produces an answer closer to the optimum, so we gradually converge on the solution we want. The *convergence order* is an index of how fast a method converges.

Convergence criteria serve as stopping rules; there are two main types, absolute and relative.

**Absolute**: stop when \(|x^t - x^{t-1}| < \epsilon\), the maximum tolerance.

Problem: this ignores the size of the numbers; e.g. an error of 0.1 usually matters more when \(x^\ast\) is 0.5 than when it is 50,000.

**Relative**: stop when \(\frac{|x^t - x^{t-1}|}{|x^{t-1}|} < \epsilon\), the maximum proportional tolerance.

##### Robustification

Use more numerically stable operations

e.g. use \(a+\frac{b-a}{2}\) instead of \(\frac{a+b}{2}\)

"good enough" convergence (helps when \(x^t\) close to zero)

\[ \frac{|x^t-x^{t-1}|}{|x^{t-1}|+\epsilon} < \epsilon \]

- stop anyway after \(N\) iterations, whether or not convergence is achieved
- stop if not converging over several iterations, and restart somewhere else

The global optimum may not be bracketed at the beginning! And even if it is bracketed, there may be multiple optima, and you may not end up with the best one.

#### Newton's Method (univariate)

Also called "Newton-Raphson Method"

An iterative numerical technique for optimization, based on linear approximation via the Taylor series. Think of it as using successive tangent lines!

##### Rationale

Approximate \(g'(x^\ast)\) by linear Taylor series expansion at \(x^t\) \[ g'(x^\ast) \approx g'(x^t) + (x^\ast-x^t)g''(x^t) = 0 \]

Re-arranging gives \[ x^\ast \approx x^t - \frac{g'(x^t)}{g''(x^t)} \]

This requires that \(g'(x)\) is continuously differentiable and that \(g''(x^t)\ne 0\)

So, the **updating equation** is \[
x^{t+1} = x^t - \frac{g'(x^t)}{g''(x^t)}
\]

##### Algorithm

define

the **objective function** \(g(x)\) and the **convergence tolerance** \(\epsilon\)

start iteration \(t=0\): set \(x^0\), calculate \(g(x^0)\)

next iteration \(t=t+1\), update \(x^{t+1} = x^t - g'(x^t)/g''(x^t)\), calculate \(g(x^{t+1})\).

evaluate whether convergence criteria are met.

if met, return \(x^\ast = x^t\) and exit; otherwise, return to step 3.

##### Summary of Newton's

Can converge extremely rapidly - convergence order \(\beta=2\)

but, may not converge at all, depending on shape of \(g'(x)\)

Can easily get stuck in local optima
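A minimal R sketch of the univariate algorithm above, using the robustified relative stopping rule (the `newton` function and example objective are illustrative, not lecture code):

```r
# Newton's method for optimizing g(x) = x^2 - 2x, so g'(x) = 2x - 2 and
# g''(x) = 2; the optimum is at x = 1.
newton <- function(gprime, gdoubleprime, x0, eps = 1e-8, max.iter = 100) {
  x <- x0
  for (t in seq_len(max.iter)) {
    x.new <- x - gprime(x) / gdoubleprime(x)      # updating equation
    if (abs(x.new - x) / (abs(x) + eps) < eps) {  # relative stopping rule
      return(x.new)
    }
    x <- x.new
  }
  warning("no convergence")
  x
}

newton(function(x) 2 * x - 2, function(x) 2, x0 = 10)  # approximately 1
```

For a quadratic objective the first step lands exactly on the optimum, illustrating the order-2 convergence; for less friendly \(g'(x)\) it can diverge, as noted above.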

#### Newton's Method (multivariate)

Objective function is \(g(\theta)\) where \(\theta\) is a vector of \(n\) parameters

\(\nabla\) and \(\nabla^2\) are the first and second derivatives (gradient and Hessian)

\(\Delta\) is the vector of each direction (for one step), \(\theta_{1} = \theta_0 + \Delta\)

\[ \nabla^2 g(\theta) \times \Delta = - \nabla g(\theta) \]

where \(\nabla^2g(\theta)\) is an \(n \times n\) matrix; \(\Delta\) is an \(n \times 1\) vector;

\(\nabla g(\theta)\) is an \(n \times 1\) vector.

##### Algorithm Pseudocode

set objective function \(g(\theta)\) and convergence tolerance \(\epsilon\)

choose a starting \(\theta_0\)

- calculate
  - \(g(\theta_0)\)
  - the gradient \(\nabla g(\theta_0)\)
  - the Hessian \(\nabla^2 g(\theta_0)\)

solve \(\nabla^2 g(\theta)\Delta = - \nabla g(\theta)\) for \(\Delta\)

update \(\theta_{1} = \theta_0 + \Delta\)

#### Gauss-Newton Method (multivariate)

An iterative procedure for estimating the parameters of nonlinear least-squares problems.

The objective function is always the sum of squared residuals.

Only **first-order** derivatives are required (an advantage over Newton's method).

##### Rationale

Linearise the regression with a 1st order Taylor series evaluated at \(\theta^0\)

re-arrange in the form of linear regression

\[ \begin{aligned} f(x,\theta) - f(x,\theta^0) &= \sum_j (\theta_j - \theta^0_j)\frac{df} {d\theta_j}\bigg|_{\theta^0} \\ &\equiv \sum_j \gamma_j\omega_j \end{aligned} \]

where

\(\gamma_j = \theta_j-\theta_j^0\) corresponds to the unknown parameter

\(\omega_j = \frac{df}{d\theta_j}\Big|_{\theta^0}\) corresponds to the known covariate

##### Algorithm

start with \(i=0\), set initial parameters \(\theta^0\)

estimate \(\gamma\): \[ \gamma^1 = (W^{0T}W^0)^{-1} \bullet W^{0T}z^0 \]

update \(\theta^1 = \theta^0 + \gamma^1\)

re-evaluate \(W\) and \(z\) at \(\theta^1\)

evaluate convergence: if met, return \(\theta\); if not, go back to step 2.
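A minimal Gauss-Newton sketch for the one-parameter model \(f(x,\theta)=e^{\theta x}\) (the data, model, and tolerances are invented for illustration; the syllabus's `nls` does this properly):

```r
# Gauss-Newton for fitting f(x, theta) = exp(theta * x) by least squares.
# W holds df/dtheta at the current theta; z holds the residuals.
set.seed(1)
x <- seq(0, 1, length.out = 20)
y <- exp(0.5 * x) + rnorm(20, sd = 0.01)   # true theta = 0.5

theta <- 0.1                               # initial guess theta^0
for (i in 1:20) {
  W <- matrix(x * exp(theta * x))          # derivative df/dtheta at theta
  z <- y - exp(theta * x)                  # residuals at theta
  gamma <- solve(t(W) %*% W, t(W) %*% z)   # gamma = (W'W)^{-1} W'z
  theta <- theta + as.numeric(gamma)       # update step
  if (abs(gamma) < 1e-10) break            # convergence check
}
theta   # close to 0.5; compare nls(y ~ exp(a * x), start = list(a = 0.1))
```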

### Summary of Newton and Gauss-Newton

Newton's Method

Any objective functions

require 1st and 2nd derivatives

Uni and Multivariate case

Gauss-Newton Method

Objective function always sum of square of resid

require only 1st derivatives

for Multivariate non-linear least-square problem

### Stochastic Methods

The previous methods are all deterministic: the objective function can be evaluated exactly, and the algorithm is deterministic - i.e. given the same data and starting point, it takes the same path to the same solution.

#### Why Stochastic

necessary to deal with stochastic functions

can be used for discrete problems

better at finding the global optimum rather than a local one!

some cons: faster than brute force but slower than deterministic methods. Do you want a good solution or the best one?

**Stochastic optimization** methods use random variables. The two varieties below can overlap:

Optimizing stochastic functions (the objective function contains random noise); the step size decreases and the accuracy increases as the iteration number increases.

Optimization algorithms with random search (the search method contains a random element); converge faster, and are more robust when the global optimum is hard to find.

#### Stochastic search Algorithm

Blind random search - generate each new candidate from some chosen PDF

Localized random search - generate each new candidate from some chosen PDF that depends on the current candidate; the step is neither fixed (brute force) nor fully random (blind)

Simulated annealing - many variants; works well but is sloooow

##### Consequences of Stochasticity

Evaluation of the objective function (and its gradient) is no longer exact

grid search methods may pick the wrong point; iterative methods may fail to converge

Accuracy can be improved by repeated evaluation; the error decreases as \(1/\sqrt{N}\)
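A minimal sketch of localized random search (function name, proposal distribution, and iteration count are illustrative choices): propose a candidate near the current point and keep it only if the objective improves.

```r
# Localized random search: the proposal PDF is centred on the current
# candidate, so steps are neither fixed nor fully random.
local.search <- function(g, x0, sd = 0.5, n.iter = 5000) {
  x <- x0
  g.x <- g(x)
  for (i in seq_len(n.iter)) {
    cand <- x + rnorm(1, sd = sd)   # candidate drawn near current x
    g.cand <- g(cand)
    if (g.cand < g.x) {             # keep only improvements
      x <- cand
      g.x <- g.cand
    }
  }
  x
}

set.seed(1)
local.search(function(x) (x - 3)^2, x0 = 0)  # approximately 3
```

Simulated annealing differs by sometimes accepting *worse* candidates (with decreasing probability), which helps it escape local optima.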

## Reproducibility and graphics (lect 12, 13, 14)

#### Reproducible research

[x] What is it and why is it a good idea

[x] Ways to achieve reproducibility (or fail to achieve reproducibility)

[x] Key components of reproducible research

[x] Not examinable: specifics of R Markdown.

**Replicable** - independent researchers can follow the same methods and arrive at the same conclusions (the older standard).

**Reproducible** - at a minimum, the results should be reproducible: using the same data and code yields the same results.

##### Why Reproducible

you can reproduce the project even if you have forgotten most of the details after a long time

easier to re-run with additional data

planning for longevity makes research more impactful

*Using a different tool for each phase of a project leads to poor reproducibility, because*

errors can occur in any or all of these phases

when new data arise, the whole process must start over

##### Components of Reproducibility

Publication with linked and executable code and data

never touch the raw data

avoid manual data manipulation

version control

include random seed information

use relative paths

Also, file project components well!

#### Graphical presentation of data

[x] What makes a good graphic; Tufte’s principles; graphical integrity

[x] Ability to recognize good and bad graphics and explain why they are good or bad

[x] Not examinable: specifics of data visualization methods in R.

##### Tufte's Principles

numbers match true proportions

dimensions match the data

labels clear and detailed

use well-known units to represent money

do not vary the design for some ulterior motive, and do not imply unintended context

**Lie factor** - the ratio of the effect shown in the graphic to the effect shown in the data. \[
\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect shown in data}}
\]

*Chart junk* - unnecessary complexity in a graphic.

#### Code performance

[x] What a profiler is and how and when to use one

[x] Ability to recognize what might benefit from optimization given profiler output

**Profiler** - software that *records the instructions computer is executing*

**When to use** - after code is working as intended.
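A minimal sketch of R's built-in sampling profiler (the `slow.fn` example is invented; `Rprof` may be unavailable on some R builds):

```r
# Rprof() records which functions the interpreter is executing at
# regular sampling intervals while the code runs.
slow.fn <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i^2)  # growing a vector: a likely hotspot
  out
}

Rprof(tmp <- tempfile())     # start profiling to a temporary file
invisible(slow.fn(2e4))
Rprof(NULL)                  # stop profiling
summaryRprof(tmp)$by.self    # time spent in each function
```

Functions dominating `by.self` time (here, probably the vector-growing `c()` calls) are the candidates worth optimizing.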

#### Tidyverse

[x] Definition of tidy data

[x] Ability to recognize tidy vs. untidy data

[x] Not examinable: specific R tidyverse packages and functions.

Tidy data is easier to manipulate, model and visualize! Defined as:

Each variable is saved in its own column

Each observation is saved in its own row

## Comments