# School of Mathematics

## PhD research topics in statistics

**Dynamic shape modelling**

Prof K.V. Mardia, Prof J.T. Kent & Prof B.S. Khambey

Objects are everywhere – natural and man-made. Advances in technology have led to the routine collection of geometrical information on objects and the study of their shape is more important than ever. Analytically, shape comprises the geometrical information that remains when location, scale and rotational effects are removed from an object. In many settings it is possible to define "landmarks" which can be consistently identified across a set of objects, and developments over the past 30 years have led to the new subject of statistical shape analysis, an extension of multivariate analysis.
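The definition of shape above can be made concrete with a small numerical sketch. The code below is only an illustration, assuming landmarks stored as rows of a NumPy array; the function names `preshape` and `procrustes_distance` are hypothetical, and the rotation step uses the standard orthogonal Procrustes solution via the singular value decomposition.

```python
import numpy as np

def preshape(landmarks):
    """Remove location and scale from a (k, m) array of k landmarks in m dimensions."""
    x = np.asarray(landmarks, dtype=float)
    x = x - x.mean(axis=0)           # remove location: centroid moved to the origin
    return x / np.linalg.norm(x)     # remove scale: unit centroid size

def procrustes_distance(a, b):
    """Distance between two configurations after removing location, scale and rotation."""
    za, zb = preshape(a), preshape(b)
    m = zb.T @ za
    u, _, vt = np.linalg.svd(m)
    # Flip the last axis if needed so that the fitted matrix is a proper rotation.
    signs = np.ones(m.shape[0])
    signs[-1] = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag(signs) @ vt
    return np.linalg.norm(za - zb @ rotation)

# Hypothetical example: a triangle and a rotated, rescaled, shifted copy of it.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = 3.0 * triangle @ rot.T + np.array([5.0, -2.0])
other = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

same_shape = procrustes_distance(triangle, moved)   # numerically zero
diff_shape = procrustes_distance(triangle, other)   # strictly positive
```

The two triangles that differ only by a similarity transformation have distance zero up to rounding, while the stretched triangle does not.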

In dynamic shape analysis the shape of an object changes through time. The time scale can range from years, e.g. the growth of a human face between childhood and adulthood, down to seconds, e.g. the formation of human facial expressions such as a smile. Such applications can be viewed as multivariate time series, sometimes with change points.

The main motivating application for this project comes from craniofacial surgery for facial deformities such as cleft lip. One measure of success for the surgery is the ability of the patient to make "normal-looking" facial expressions, such as smiling. At the moment this judgement is made subjectively. The aim of the project is to develop and assess more objective measures.

**Random permutations and integer partitions**

Dr. L.V. Bogachev

Permutations and integer partitions are the basic combinatorial structures that appear in numerous areas of mathematics and its applications - from number theory, algebra and topology to quantum physics, statistics, population genetics, IT & cryptology (e.g., A. Turing used the theory of permutations to break the Enigma code during World War II). This classic research topic dates back to Euler, Cauchy, Cayley, Lagrange, Hardy and Ramanujan. The modern statistical approach is to treat these structures as a random ensemble endowed with a suitable probability measure. The uniform (equiprobable) case is well understood but more interesting models (e.g., with certain weights on the components) are mathematically more challenging.

The main thrust of this PhD project is to tackle open and emerging problems about asymptotic properties of "typical" structures of big size. The focus will be on macroscopic features of the random structure, such as its limit shape. It is also important to study extreme values, in particular the possible emergence of a giant component which may shed light on the Bose-Einstein condensation of quantum gas, predicted in 1924 but observed only recently (Nobel Prize in Physics 2001).
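For the uniform case, the components of a random permutation are easy to inspect by simulation. The following is a minimal sketch, not part of the project description: the helper names are hypothetical, and it only covers the equiprobable model, not the weighted models mentioned above.

```python
import random

def random_permutation(n, rng):
    """Draw a permutation of 0..n-1 uniformly at random (Fisher-Yates shuffle)."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def cycle_lengths(perm):
    """Return the cycle lengths of perm (perm[i] is the image of i), largest first."""
    seen = [False] * len(perm)
    lengths = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        length, j = 0, start
        while not seen[j]:        # follow the cycle until it closes
            seen[j] = True
            j = perm[j]
            length += 1
        lengths.append(length)
    return sorted(lengths, reverse=True)

rng = random.Random(0)
n = 1000
lengths = cycle_lengths(random_permutation(n, rng))
largest_fraction = lengths[0] / n   # share of points in the largest cycle
```

Repeating the draw many times gives an empirical picture of how large the biggest component typically is.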

A related direction of research is the exploration of a deep connection with different quantum statistics; specifically, the ensemble of uniform integer partitions may be interpreted as the ideal gas of bosons (in two dimensions), whereas partitions with distinct parts correspond to fermions. In this context, an intriguing problem is to construct suitable partition classes to model the so-called anyons obeying fractional quantum statistics (also in 2D!). Furthermore, an adventurous idea may be to look for suitable partition models to mimic the unusual properties of graphene (Nobel Prize in Physics 2010), a newly discovered 2D quantum structure with certain hidden symmetries.

**Hirsch's citation index and limit shape of random partitions**

Dr. L.V. Bogachev & Dr. J. Voss

Integer partitions appear in numerous areas of mathematics and its applications --- from number theory, algebra and topology to quantum physics, statistics, population genetics, and IT. This classic research topic dates back to Euler, Cauchy, Cayley, Lagrange, Hardy and Ramanujan. The modern statistical approach is to treat partitions as a random ensemble endowed with a suitable probability measure. The uniform (equiprobable) case is well understood but more interesting models (e.g., with certain weights on the components) are mathematically more challenging.

Hirsch [3] introduced his h-index to measure the quality of a researcher's output, defined as the largest integer h such that the person has h papers with at least h citations each. The h-index has become quite popular (see, e.g., 'Google Scholar' or 'Web of Science'). Recently, Yong [6] proposed a statistical approach to estimate the h-index using a natural link with the theory of integer partitions [1]. Namely, identifying an integer partition with its Young diagram (with blocks representing parts), it is clear that the h-index is the side length of the largest h × h square that fits inside the diagram.
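The link between the h-index and partitions can be checked directly: listing citation counts in decreasing order gives the parts of a partition, and the h-index is the side of the largest square in its Young diagram (known as the Durfee square). A minimal sketch, with a hypothetical function name and made-up citation counts:

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    parts = sorted(citations, reverse=True)   # parts of the partition, largest first
    h = 0
    for rank, count in enumerate(parts, start=1):
        if count >= rank:     # row number `rank` of the diagram still reaches column `rank`
            h = rank
        else:
            break
    return h

# Hypothetical citation record with N = 30 citations in total.
record = [10, 8, 5, 4, 3]
h = h_index(record)
```

Here the fifth paper has only 3 citations, so the square cannot grow beyond side 4.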

If partitions of a given integer N are treated as random, with uniform distribution (i.e., all such partitions are assumed to be equally likely), then their Young diagrams have a "limit shape" (under a suitable scaling), first identified by Vershik [5]. Yong's idea is to use the limit shape to deduce certain statistical properties of the h-index. In particular, it follows that the "typical" value of Hirsch's index for someone with a large number N of citations should be close to 0.54 √N.
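The constant 0.54 can be traced back to the limit shape. As a short sketch, assuming Vershik's curve in its standard form for uniform partitions (both axes rescaled by the square root of N), the largest inscribed square touches the curve on the diagonal:

```latex
% Limit shape of uniform random partitions of N, axes rescaled by \sqrt{N}:
e^{-c x} + e^{-c y} = 1, \qquad c = \frac{\pi}{\sqrt{6}} .
% The largest square under the curve meets it where x = y:
2\, e^{-c x} = 1
\;\Longrightarrow\;
x = \frac{\ln 2}{c} = \frac{\sqrt{6}\,\ln 2}{\pi} \approx 0.54 .
% Undoing the scaling gives the typical value of the h-index:
h \approx 0.54\,\sqrt{N} .
```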

However, the assumption of uniform distribution on partitions is of course rather arbitrary, and needs to be tested statistically. This issue is important since the limit shape may strongly depend on the distribution of partitions [2], which would also affect the asymptotics of Hirsch's index. Thus, the idea of this project is to explore such an extension of Yong's approach. To this end, one might try to apply Markov chain Monte Carlo (MCMC) techniques [4], whereby the uniform distribution may serve as an "uninformed prior".

These and similar ideas have a potential to be extended beyond the citation topic, and may offer an interesting blend of theoretical and more applied issues, with a possible gateway to further applications of discrete probability and statistics in social sciences.

References:

[1] Andrews, G.E. and Eriksson, K. Integer Partitions. Cambridge Univ. Press, Cambridge, 2004.

[2] Bogachev, L.V. Unified derivation of the limit shape for multiplicative ensembles of random integer partitions with equiweighted parts. Random Struct. Algorithms, 47 (2015), 227-266. (doi:10.1002/rsa.20540)

[3] Hirsch, J.E. An index to quantify an individual's scientific research output. Proc. Natl. Acad. Sci. USA, 102 (2005), 16569-16572. (doi:10.1073/pnas.0507655102)

[4] Markov Chain Monte Carlo in Practice (W.R. Gilks, S. Richardson and D.J. Spiegelhalter, eds.). Chapman & Hall/CRC, London, 1996.

[5] Vershik, A.M. Asymptotic combinatorics and algebraic analysis. In: Proc. Intern. Congress Math. 1994, vol. 2. Birkhäuser, Basel, 1995, pp. 1384-1394. (www.mathunion.org/ICM/ICM1994.2/Main/icm1994.2.1384.1394.ocr.pdf)

[6] Yong, A. Critique of Hirsch's citation index: a combinatorial Fermi problem. Notices Amer. Math. Soc., 61 (2014), 1040-1050. (doi:10.1090/noti1164)

**The archetypal equation with rescaling and related topics**

Dr. L.V. Bogachev

The theory of functional equations is a growing branch of analysis with many deep results and abundant applications (see [1] for a general introduction). A simple functional-differential equation with rescaling is given by y'(x) + y(x) = p y(2x) + (1-p) y(x/2) (0<p<1), which describes e.g. the ruin probability for a gambler who spends his capital at a constant rate (starting with x pounds) but at random time instants decides to bet the entire current capital and either doubles it (with probability p) or loses a half (with probability 1-p). Clearly, y(x)=const is a solution, and the question is whether or not there are any other bounded, continuous solutions. It turns out that such solutions exist if and only if p>0.5, that is, when the mean logarithmic rescaling p ln 2 + (1-p) ln(1/2) is positive; remarkably enough, this analytic result is obtained using the martingale techniques of probability theory [2].

The equation above exemplifies the "pantograph equation" introduced by Ockendon & Tayler [6] as a mathematical model of the overhead current collection system on an electric locomotive. In fact, the pantograph equation and its various ramifications have emerged in a striking range of applications including number theory, astrophysics, queues & risk theory, stochastic games, quantum theory, population dynamics, imaging of tumours, etc.

A rich source of functional and functional-differential equations with rescaling is the "archetypal equation" y(x)=E[y(a(x-b))], where a, b are random coefficients and E denotes expectation [3]. Despite its simple appearance, this equation is related to many important topics, such as the Choquet-Deny theorem, Bernoulli convolutions, self-similar measures and fractals, subdivision schemes in approximation theory, chaotic structures in amorphous materials, and many more. The random recursion behind the archetypal equation, defining a Markov chain with jumps of the form x -> a(x-b), is known as the 'random difference equation', with numerous applications in control theory, evolution modelling, radioactive storage management, image generation, iterations of functions, investment models, mathematical finance, perpetuities (pensions), environmental modelling, etc. (see [4]).
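The random recursion x -> a(x-b) is straightforward to simulate. The sketch below is an illustration only: the function names are hypothetical, and the particular law of the coefficients (a equal to 2 or 1/2, b exponentially distributed) is an assumption chosen to loosely mirror the gambler example, not a model taken from the references.

```python
import random

def simulate_chain(x0, sample_ab, n_steps, rng):
    """Iterate x -> a * (x - b), drawing a fresh pair (a, b) at every step."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        a, b = sample_ab(rng)
        x = a * (x - b)
        path.append(x)
    return path

def doubling_or_halving(p):
    """Hypothetical coefficient law: a = 2 with probability p, else 1/2; b ~ Exp(1)."""
    def sample(rng):
        a = 2.0 if rng.random() < p else 0.5
        b = rng.expovariate(1.0)
        return a, b
    return sample

rng = random.Random(1)
random_path = simulate_chain(10.0, doubling_or_halving(0.6), 50, rng)

# Deterministic check of the recursion: a = 2, b = 1 gives 3 -> 4 -> 6 -> 10.
fixed_path = simulate_chain(3.0, lambda _rng: (2.0, 1.0), 3, rng)
```

Averaging a candidate function y over many one-step transitions from a fixed x gives a Monte Carlo check of whether y(x)=E[y(a(x-b))] holds approximately.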

In brief, the main objective of this PhD project is to continue a deep investigation of the archetypal equation and its generalizations. Research will naturally involve asymptotic analysis of the corresponding Markov chains, including characterization of their harmonic functions [7]. The project may also include applications to financial modelling based on random processes with multiplicative jumps (cf. [5]).

References:

[1] Aczél, J. and Dhombres, J. Functional Equations in Several Variables, with Applications to Mathematics, Information Theory and to the Natural and Social Sciences. Cambridge Univ. Press, Cambridge, 1989.

[2] Bogachev, L., Derfel, G., Molchanov, S. and Ockendon, J. On bounded solutions of the balanced generalized pantograph equation. In: Topics in Stochastic Analysis and Nonparametric Estimation (P.-L. Chow et al., eds.), pp. 29-49. Springer, New York, 2008. (doi:10.1007/978-0-387-75111-5_3)

[3] Bogachev, L.V., Derfel, G. and Molchanov, S.A. On bounded continuous solutions of the archetypal equation with rescaling. Proc. Roy. Soc. A, 471 (2015), 20150351, 1-19. (doi:10.1098/rspa.2015.0351)

[4] Diaconis, P. and Freedman, D. Iterated random functions. SIAM Review, 41 (1999), 45-76. (doi:10.1137/S0036144598338446)

[5] Kolesnik, A.D. and Ratanov, N. Telegraph Processes and Option Pricing. Springer, Berlin, 2013. (doi:10.1007/978-3-642-40526-6)

[6] Ockendon, J.R. and Tayler, A.B. The dynamics of a current collection system for an electric locomotive. Proc. Roy. Soc. London A, 322 (1971), 447-468. (doi:10.1098/rspa.1971.0078)

[7] Revuz, D. Markov Chains, 2nd edn. North-Holland, Amsterdam, 1984.

**Prior distributions for stochastic matrices**

Dr. J.P. Gosling

Right-stochastic matrices are used in the modelling of Markov processes (transition matrices) and of misclassification proportions (confusion matrices). The key property of these square matrices is that all the elements are non-negative and each row sums to one. If we consider the problem of estimating these probabilities from a Bayesian standpoint, we should model our prior beliefs about the matrices before we observe any data. We are interested in constructing sensible probability distributions that can be used to encapsulate beliefs about such structured matrices, and in comparing the properties of these distributions and their effects when exposed to data. This research strand draws heavily on the modelling approaches adopted for compositional data, and we are also interested in extending our findings to accommodate doubly-stochastic matrices (both rows and columns sum to one).

**Accounting for uncertainty in the building blocks of statistical inference**

Dr. J.P. Gosling

All mathematical representations of inference and estimation are approximations. It can be shown that the Bayesian approach to inference is perfect under some strong assumptions about rationality, consistency of judgements and the representativeness of the joint distribution of the data and model parameters. Concentrating on the final assumption, effort needs to be made by statisticians to make sure that their choices for likelihoods, summary statistics, prior distributions, etc. are adequately representative of reality. When making these choices, the statistician often has many options that could work well, but they will often pick the most computationally convenient. In this project, we aim to investigate these choices and the impact that accounting for this additional uncertainty may have. To achieve this, we will investigate accommodating uncertainty in probability functions and likelihood-free approaches such as Bayes linear and ABC methods.

**Modelling asymmetric decision problems**

Dr. P. Thwaites

The oldest graphical model for the representation & analysis of multistage decision problems under uncertainty is the decision tree. Unfortunately trees can get very complicated, and cannot be read explicitly for their conditional independence structure. Influence Diagrams (IDs - the decision-theoretic analogue of Bayesian Networks) have been used to combat the complexity issue, but many decision problems are asymmetric in the sense that different decisions can result in different choices in the future, and although decision trees handle this readily, IDs do not adapt easily to these sorts of problems.
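To illustrate how a decision tree handles asymmetry, the sketch below evaluates a small tree by backward induction ("rollback"). It is a generic textbook-style illustration rather than a method from the references; the node layout, the names and all utilities and probabilities are hypothetical. The tree is asymmetric because the second decision only arises on one branch.

```python
def rollback(node):
    """Return (expected utility, chosen actions) for a tree of nested dicts."""
    kind = node["kind"]
    if kind == "leaf":
        return node["utility"], {}
    if kind == "chance":
        total, plan = 0.0, {}
        for prob, child in node["branches"]:
            value, sub_plan = rollback(child)
            total += prob * value          # average over the chance outcomes
            plan.update(sub_plan)          # assumes decision names are unique
        return total, plan
    if kind == "decision":
        best = None
        for action, child in node["options"].items():
            value, sub_plan = rollback(child)
            if best is None or value > best[0]:   # keep the best action so far
                best = (value, action, sub_plan)
        value, action, sub_plan = best
        return value, {node["name"]: action, **sub_plan}
    raise ValueError(f"unknown node kind: {kind}")

def leaf(utility):
    return {"kind": "leaf", "utility": utility}

# Hypothetical problem: a second treatment is only considered if the first fails.
second = {"kind": "decision", "name": "second", "options": {
    "treat2": {"kind": "chance", "branches": [(0.5, leaf(8.0)), (0.5, leaf(0.0))]},
    "stop": leaf(2.0),
}}
tree = {"kind": "decision", "name": "first", "options": {
    "treat1": {"kind": "chance", "branches": [(0.6, leaf(10.0)), (0.4, second)]},
    "none": leaf(1.0),
}}

value, plan = rollback(tree)   # expected utility 0.6*10 + 0.4*4 = 7.6
```

The tree grows with every stage and outcome, which is the complexity issue that motivates more compact graphical models.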

There have been many attempts to adapt IDs for use with asymmetric decision problems, or to develop techniques which use both IDs and decision trees. There have also been several new structures suggested. A good overview of these developments is given in Bielza & Shenoy (1999), where it is noted that none of the methods available is consistently better than the others. The Chain Event Graph (CEG) of Smith & Thwaites (2006 onwards) has recently been used to model asymmetric decision problems, and appears to be an ideal graphical model for these types of problem. This project focuses on developing a formal semantics for decision CEGs, creating efficient algorithms for determining optimal decision strategies, and assessing CEGs against the currently available methods.

References:

C. Bielza & P.P. Shenoy: A comparison of graphical techniques for asymmetric decision problems, Management Science 45, 1999.

F.V. Jensen, T.D. Nielsen & P.P. Shenoy: Sequential influence diagrams: a unified asymmetry framework, International Journal of Approximate Reasoning 42, 2006.

P.A. Thwaites & J.Q. Smith (2015). A New Method for tackling Asymmetric Decision Problems, Proceedings of WUPES'15.

**Modelling asymmetric problems which evolve over time**

Dr. P. Thwaites

It is common practice in many disciplines to model statistical problems using directed graphs such as Bayesian Networks. In doing so, analysts make assumptions about symmetry which can lead to efficient analytical algorithms. In many circumstances however, such symmetry assumptions are inappropriate - if a patient makes a full recovery following Treatment 1, she is not given Treatment 2, for example. For these types of problem using an off-the-peg directed graph can be very inefficient.

A number of modifications of directed graphs, and alternative graphical structures have been proposed to deal with this situation, one of the most recent of which is the Chain Event Graph of Smith & Thwaites (2006 onwards). This project focuses on creating efficient algorithms for updating beliefs and learning structure, when modelling asymmetric problems where the underlying structure may be evolving over time.

References:

J. Tian: Identifying dynamic sequential plans, Uncertainty in Artificial Intelligence 24, 2008.

A.P. Dawid & V. Didelez: Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview, Statistics Surveys vol 4, 2010.

G. Freeman & J.Q. Smith: Bayesian MAP model selection of Chain Event Graphs, Journal of Multivariate Analysis 102, 2011.

L.M. Barclay, R.A. Collazo, J.Q. Smith, P.A. Thwaites & A.E. Nicholson: The Dynamic Chain Event Graph, Electronic Journal of Statistics 9(2), 2015.

**Diversity and collective intelligence**

Dr. R.P. Mann

Effective collective behaviour, either by animal groups or human societies, depends on some degree of conformity between individuals. However, within coherent groups, a diversity of information, personality and attributes can make the group perform better. In this project we will use data analytics, game theory and evolutionary theory to investigate how diversity is maintained in collectives by evolutionary and strategic imperatives, how diversity affects group behaviour and efficacy, and how optimal diversity can be achieved.

**Linking animal movement and cognition with data**

Dr. R.P. Mann

Most animals must move to feed and find mates. Animal motion is a complex mixture of cognitive and physical processes, optimised by evolution to help the animal survive and reproduce. A decade of technological development has made GPS, radio and video tracking of many animal species routine, but analysis of animal movements is lagging behind the technology. In this project we will use mathematical models to link animal movement data to cognitive processes and decision-making, to understand how and why animals make the choices they do.

**Optimal search, learning and foraging: alone and together**

Dr. R.P. Mann

Efficiently learning about and exploiting an environment is a complex computational or cognitive task, yet it is one that many animals effectively solve every day. Optimal search algorithms require a statistical model of how resources are distributed, a selection policy for where to look, and the foresight to see how choices made now will affect future knowledge and options. In this project we will look at how optimal search theory can be used to understand animal foragers, how agents acting in parallel can best combine their search efforts and how animal strategies can stimulate new advances in computational search algorithms.

**Analysing player behaviour in virtual worlds**

Dr. R.P. Mann

Virtual worlds such as Minecraft, World of Warcraft and Eve Online provide a stimulating and social entertainment experience for millions of players worldwide. They are also potentially a game-changing new resource for social scientists and psychologists. Thousands of new 'societies' can be observed at a time at little or no cost, players can adopt new identities that would be impossible in the real world, and the success or failure of these societies can be quantified easily. Using our HEAPCRAFT data collecting framework developed with Disney Research, Rutgers University and ETH Zurich, this project will collect and analyse data from Minecraft servers to investigate how players compete and collaborate, and how societies succeed or fail.

**Parallelisation and active machine learning for Big Data**

Dr. G. Aivaliotis & Dr. J. Palczewski

The immense scale of modern data calls for new methods and algorithms that are parallel in the instances and variable dimensions, and self-adjust according to characteristics of the data to reduce computational complexity. Techniques such as parallel boosting (see [3] and references therein; [2] for Google's use of the technique) or algorithmic leveraging [4] rest on randomly sampling small subsets. Boosting is a sequential procedure with further models improving on the previous ones to construct a strong model from weak learners.
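The sequential nature of boosting can be seen in a few lines of code. The sketch below implements classical AdaBoost with threshold "stumps" as weak learners. It is swapped in purely as an illustration of the general idea and is not the parallel boosting algorithm of [3]; the function names and the toy data are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak learner: +1 on one side of a threshold, -1 on the other."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        # Observed values plus one threshold above the maximum (a constant stump).
        candidates = np.append(np.unique(X[:, j]), X[:, j].max() + 1)
        for t in candidates:
            for polarity in (1, -1):
                err = w[stump_predict(X, j, t, polarity) != y].sum()
                if err < best[0]:
                    best = (err, j, t, polarity)
    return best

def adaboost(X, y, n_rounds):
    """Each round reweights the data so the next stump focuses on past mistakes."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)      # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)      # vote of this weak learner
        w = w * np.exp(-alpha * y * stump_predict(X, j, t, polarity))
        w = w / w.sum()
        ensemble.append((alpha, j, t, polarity))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data: positives sit in the middle, so no single stump can separate them.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([-1, -1, 1, 1, -1, -1])
acc_one = (predict(adaboost(X, y, 1), X) == y).mean()
acc_three = (predict(adaboost(X, y, 3), X) == y).mean()
```

On this toy set a single stump misclassifies two points, while three rounds of reweighting combine into a rule that fits all six.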

In this project, using techniques from the control and stopping of partially observable stochastic dynamical systems [1], you will extend ideas behind parallel versions of boosting. This can include the design of adaptive algorithms that self-adjust according to the complexity of the data and the performance of already trained model parts, thereby informing trade-offs between speed and accuracy. You will develop new mathematical techniques and new algorithms. Medical and consumer data sets available in Leeds Institute of Data Analytics will form your test bed.

This project will appeal to applicants with a strong statistical and/or mathematical background who enjoy programming.

References:

[1] Bensoussan, A. (2004). Stochastic control of partially observable systems. Cambridge University Press.

[2] Chandra, T. (2014) Sibyl: A system for large scale machine learning at Google, 2014.dsn.org/keynote.shtml

[3] Kamath U. et al. (2015) Theoretical and Empirical Analysis of a Parallel Boosting Algorithm, arXiv:1508.01549

[4] Ma, P., et al. (2015) A Statistical Perspective on Algorithmic Leveraging, Journal of Machine Learning Research, 16, 861-911

**Extracting knowledge from longitudinal data**

Dr. J. Palczewski & Dr. G. Aivaliotis

In recent years, electronic transactions and record keeping, together with cheap data storage, have resulted in the accumulation of massive amounts of data containing valuable information. Examples of such vast data sets are electronic patient records in the health sector or loyalty card data in the retail sector. The wealth of information in these data sets is at the moment under-explored, as any analysis and inference is hampered by the size of the data sets.

To address this issue, machine learning techniques in biomedical, consumer data and other fields have seen rapid advances of algorithms and their implementations [2]. However, very little progress has been made in the analysis of longitudinal data, i.e., data with events and observations spaced irregularly over time [1]. Existing statistical and machine learning methods cannot be directly applied since the data are heterogeneous, with a varying number of events per record. Available algorithms are computationally very intensive, not scalable and not applicable to large data sets (see, e.g., [3]). This project will attempt to change this by looking at the data from a new point of view: a probabilistic approach that represents longitudinal data as continuous-time stochastic processes and links to the large theory of stochastic processes. This modelling approach promises to be

- robust with respect to temporal inaccuracies often encountered in this type of data,
- scalable allowing for exploration of large data sets.
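One simple way to picture irregularly spaced records as continuous-time processes is through counting processes. The sketch below is only an illustration under that assumption, not the method to be developed in the project; the record names, event times and time grid are hypothetical.

```python
from bisect import bisect_right

def counting_process(event_times):
    """Return N(t), the number of events observed up to and including time t."""
    times = sorted(event_times)
    def n_at(t):
        return bisect_right(times, t)   # binary search, so lookups stay cheap
    return n_at

# Hypothetical records with different numbers of irregularly spaced events.
records = {
    "patient_a": [0.4, 2.5, 2.6, 9.0],
    "patient_b": [5.1],
}

# Evaluating every process on a common grid gives vectors of equal length,
# even though the raw records are heterogeneous.
grid = [0, 2, 4, 6, 8, 10]
features = {
    name: [counting_process(times)(t) for t in grid]
    for name, times in records.items()
}
```

Small shifts in the recorded event times change these vectors only near the affected grid points, which hints at the robustness to temporal inaccuracies mentioned above.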

The aim of this project is to realise this potential. You will develop new mathematical techniques and new algorithms and apply them to medical and consumer data sets available in Leeds Institute of Data Analytics.

This project is suitable for applicants with a strong statistical and/or mathematical background who enjoy programming.

References:

[1] Bellazzi, R., et al. (2011). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 416-430.

[2] Herland, M., et al. (2014). Journal of Big Data, 1(1), Article 2.

[3] Moskovitch, R. and Y. Shahar. (2014). Data Mining and Knowledge Discovery, 1-43.

**Statistical methods for analysis of electronic health record (EHR) data**

Prof. J.J. Houwing-Duistermaat

The falling cost of high-throughput techniques such as proteomics and whole genome sequencing, the widespread adoption of clinical e-health record systems and the rise in collection of diverse personal data imply that we now have 'big data' in the health sector. It is often hypothesized that these data sources can be used to improve health care and to reduce costs further. Improvement of health care can be obtained by better stratification of patients with regard to underlying biological mechanisms and/or risk, leading to personalized or 'precision' medicine. With regard to research, reduction in costs and acceleration in research outcomes can be obtained by using available data to improve study design. The availability of statistical methods for EHR data is however limited. In addition to dealing with 'big data', the statistical challenges are messy and noisy, heterogeneous and incomplete data. This interdisciplinary project aims to address a part of these challenges and to assess the usefulness of EHR data in precision medicine and study design.

The project is motivated by data from General Practices (ResearchOne) and a recent paper presenting data on two outcomes, namely diagnosis of depression-related outcomes and prescription of antidepressants (McLintock et al, BMJ 2014). The data are from 112 general practices over a time period of 12 years. We will use extensions of generalized linear models in which random components are included to model correlation between two different outcomes and between the same outcomes at two time points. A starting point is to use such a model for the joint distribution of diagnosis of depression and of prescription of antidepressants per practice. Such an analysis might provide insight into the underlying mechanisms and structure across the various practices.

A second topic is to combine data from relatively small, well-defined epidemiological cohorts with big noisy data from EHRs. By using penalized likelihood, information from the various cohorts can be borrowed while addressing heterogeneity. Here, the size of the penalty reflects the amount of similarity between the two cohorts (Tsonaka et al, Biostatistics 2013; Biometrics 2015). This method shows similarity with Bayesian approaches, since penalizing can be interpreted as using prior information.

Finally, we will consider designing a new study based on the EHR data. The idea is that the EHR database might provide information on relevant strata, and the gain in efficiency from oversampling certain strata can be assessed via simulations. Oversampling increases the information with regard to the relationship between the outcome and a risk factor, but the variance of the parameter estimates also increases because of the weighting needed to account for the oversampling.

**Phenotypic prediction based on multiscale copy number alteration profiles**

Dr. A. Gusnanto, Dr. S. Barber & Dr. H. Wood

A cancer phenotype is the specific form of cancer affecting a patient. Phenotypic prediction of cancer patients is a critical task in the context of stratified medicine. This is because the type of treatment, or combination of treatments, administered to cancer patients depends on the phenotypes, such as cancer histological subtypes or metastatic status. Accurate prediction means that we can minimise the risk of patients going through unnecessary treatment that can be extremely painful.

In this regard, the patients' genomic data or information have been increasingly seen as a valuable predictor. Specifically, genome-wide copy number alteration (CNA) profiles of the patients, generated using the next-generation sequencing (NGS) technology, contain valuable information which we can utilize to make phenotypic prediction.

The CNA profile of a patient comes in the form of gains ('jumps') and losses ('drops') from the normal two copies along the genome. The gains and losses are in segments, i.e. the estimates tend to be the same along neighbouring genomic regions, and the segments can be short or long. From a statistical point of view, the pattern of gains and losses along the genome can be considered as a series. The CNA profile from a patient can then be summarized using a mathematical transformation called a wavelet ('little wave') transform to produce wavelet coefficients. These coefficients represent the frequency of gains and losses at multiple scales, which can be considered to correspond to genome, chromosome, chromosomal arm, region, short region, and random error levels.

This project will take advantage of the wavelet transformation, and develop new statistical methodology for improved prediction of the phenotypes. The main characteristic of the wavelet transformation is that we can perform thresholding: 'switching off' some wavelet coefficients that are expected to correspond to either random error or variability that is not directly related to the phenotype of interest. After the thresholding, we are then left with CNA information that is more directly associated with the phenotypes. This project aims to investigate and develop optimal wavelet transformation and thresholding methods for an improved prediction of phenotypes.
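The idea of thresholding can be sketched with the simplest wavelet, the Haar wavelet. The code below is a minimal illustration, assuming a profile whose length is a power of two; the function names, the toy profile and the cutoff of 0.2 are hypothetical, and hard thresholding is just one standard choice among several.

```python
import numpy as np

def haar_forward(signal):
    """Orthonormal Haar transform; returns the coarsest average and the details."""
    approx = np.asarray(signal, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # local differences at this scale
        approx = (even + odd) / np.sqrt(2)          # smoothed, half-length series
    return approx, details                          # details run from fine to coarse

def haar_inverse(approx, details):
    """Invert haar_forward by undoing the scales from coarse to fine."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * len(d))
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

def hard_threshold(details, cutoff):
    """Switch off every detail coefficient whose magnitude is at most the cutoff."""
    return [np.where(np.abs(d) > cutoff, d, 0.0) for d in details]

# Toy profile: two copies, then a gain to three copies, plus small fluctuations.
noise = np.array([0.05, -0.05, 0.02, -0.02, -0.03, 0.03, 0.04, -0.04])
profile = np.array([2.0, 2, 2, 2, 3, 3, 3, 3]) + noise

approx, details = haar_forward(profile)
denoised = haar_inverse(approx, hard_threshold(details, cutoff=0.2))
roundtrip = haar_inverse(approx, details)
```

In this toy example the fine-scale coefficients fall below the cutoff and are removed, while the single large coarse-scale coefficient carrying the gain is kept, so the reconstruction returns the underlying segments.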

© Copyright Leeds 2011