
# ENVX3002: Statistics in the Natural Sciences

This unit of study introduces students to the analysis of data they may face in their future careers, in particular data that are not well behaved. The data may be non-normal, there may be missing observations, and they may be correlated in space and time or too numerous to analyse with standard models. The unit is presented in an applied context, with an emphasis on correctly analysing authentic datasets and interpreting the output. It begins with the analysis and design of experiments based on the general linear model. In the second part, students learn how the general linear model is generalised to accommodate non-normal data, with a particular emphasis on the binomial and Poisson distributions. In the third part, linear mixed models are introduced; these provide the means to analyse datasets that do not meet the assumption of independent errors with equal variance, for example data that are correlated in space and time. The unit ends with an introduction to machine learning and predictive modelling. A key feature of the unit is the use of R to develop coding skills, which are becoming essential in science for processing and analysing datasets of ever-increasing size.

• Code: ENVX3002
• Academic unit: Life and Environmental Sciences Academic Operations
• Credit points: 6
• Prerequisites: ENVX2001 or STAT2X12 or BIOL2X22 or DATA2X02 or QBIO2001
• Corequisites: None
• Prohibitions: None

At the completion of this unit, you should be able to:

• LO1. explain the similarities and differences between ANOVA and regression data analysis in terms of the general linear model
• LO2. demonstrate proficiency in the use of R (and interpretation of the output) for modelling data with a general linear model
• LO3. demonstrate proficiency in the use of R (and interpretation of the output) for modelling datasets that have non-normal distributions (binomial, Poisson) using a generalised linear model
• LO4. demonstrate proficiency in the use of R (and interpretation of the output) for modelling univariate relationships using non-linear functions and splines
• LO5. demonstrate proficiency in the use of R (and interpretation of the output) for modelling datasets that fail to meet the assumption of i.i.d. errors (non-normality and/or correlation in space and time) by using residual maximum likelihood (REML)
• LO6. demonstrate proficiency in the use of R (and interpretation of the output) for making predictive models using tree-based models, and for assessing the quality of the models
• LO7. demonstrate proficiency in the use of R (and interpretation of the output) in analysing big data using modern computationally intensive techniques.

## Unit outlines

Unit outlines will be available 1 week before the first day of teaching for the relevant session.