I am a Canadian data scientist and software engineer currently living in Australia. I have extensive experience in applied statistics, machine learning, lean and agile methodologies, R, Bayesian analysis, and data pipelines.
For the last six years, I developed solutions for market research and business intelligence software that enabled customers to automate the cleaning, statistical analysis, visualization, and reporting of their survey data.
I am an experienced Scrum Master and Agile practitioner, certified as a SAFe Practice Consultant (SPC) and RTE. Please contact me if you are interested in training or coaching.
Previously, I was an academic researcher in applied statistics specializing in statistical computing, semiparametric and high-dimensional regression, time series analysis, and hierarchical Bayesian modelling. I am passionate about software development and love working with data from a diverse range of fields: market research, finance, nutrition, and many others.
I obtained a PhD in Operations Research with a concentration in Applied Probability and Statistics in 2013 from the School of ORIE at Cornell University under the supervision of David Ruppert and Giles Hooker. After Cornell, I completed postdocs with distinguished professors Ray Carroll at Texas A&M University and Matt Wand at University of Technology Sydney.
M. W. McLean, M. P. Wand. Variational Message Passing for Elaborate Response Regression Models. In: Bayesian Analysis 14.2 (Jun. 2019), pp. 371–398. [PDF]
K. Triff, M. W. McLean, E. Callaway, J. Goldsby, I. Ivanov, and R. S. Chapkin. Dietary Fat and Fiber Interact to Uniquely Modify Global Histone Post-Translational Epigenetic Programming in a Rat Colon Cancer Progression Model. In: International Journal of Cancer 143.6 (May 2018), pp. 1402–1415.
M. W. McLean (2017). RefManageR: Import and Manage BibTeX and BibLaTeX References in R. Journal of Open Source Software. R package also accepted to ROpenSci. [JOSS | GitHub | Reports]
K. Triff, M. W. McLean, K. Kranti, J. Pang, E. Callaway, B. Zhou, I. Ivanov, and R. S. Chapkin. Assessment of Histone Tail Modifications and Transcriptional Profiling During Colon Cancer Progression Reveals a Global Decrease in H3K4me3 Activity. In: BBA - Molecular Basis of Disease 1863.6 (Jun. 2017), pp. 1392–1402. [Elsevier]
M. W. McLean, G. Hooker, D. Ruppert. Restricted Likelihood Ratio Tests for Linearity in Scalar-on-Function Regression. In: Statistics and Computing 25.5 (Sep. 2015), pp. 997–1008. [arXiv | Springer]
M. W. McLean, G. Hooker, A.-M. Staicu, F. Scheipl, D. Ruppert. Functional Generalized Additive Models. In: Journal of Computational and Graphical Statistics 23.1 (Feb. 2014), pp. 249–269. [T & F Online | Supplementary Materials]
D. S. Matteson, M. W. McLean, D. B. Woodard, S. G. Henderson. Forecasting Emergency Medical Service Call Arrival Rates. In: Annals of Applied Statistics 5.2B (Jun. 2011), pp. 1379–1406. [ProjectEuclid | arXiv | Supplementary Materials]
M. W. McLean, C. J. Oates, M. P. Wand (2017). Real-Time Semiparametric Regression via Sequential Monte Carlo. [PDF]
M. W. McLean, F. Scheipl, G. Hooker, S. Greven, D. Ruppert (2017). Bayesian Functional Generalized Additive Models for Sparsely Observed Covariates. [arXiv]
M. W. McLean (2014). Straightforward Bibliography Management in R Using the RefManageR Package. [arXiv]
On Generalized Additive Models for Regression with Functional Data (2013). School of Operations Research and Information Engineering, Cornell University. [PDF]
Note: The HTML5 slides below may be printed in any decent browser by 'right-clicking' on any slide and selecting 'print'.
2014 WNAR/IMS Conference, Honolulu, HI (2014)
2014 ENAR Spring Meeting, Baltimore, MD (2014)
R
IAMCS Machine Learning and Applied Statistics Workshop Series, College Station, TX (2014) [Slides (HTML)] (These slides also contain a brief introduction to my R package RefManageR)
R Programming and Authoring R Packages, IAMCS Machine Learning and Applied Statistics Workshop Series, College Station, TX (2014) [Slides (HTML)]
Department of Statistics, University of Auckland, Auckland, New Zealand, August, 2017
Annual Meeting of the German and Austrian Statistical Associations, Vienna, Austria (2012) [Slides (PDF)]
ORIE PhD Colloquium, Ithaca, NY (2012)
[Slides (PDF)]
ORIE PhD Reunion / Jack Muckstadt Retirement Celebration, Ithaca, NY (2012)
[Poster (PDF)]
Statistical Methods for Very Large Datasets Conference, Baltimore, MD (2011)
SAMSI Analysis of Object Data Closing Workshop, Research Triangle Park, NC (2011)
Imaging, Communications and Finance: Stochastic Modeling of Real-world Problems, New York, NY (2011)
[Poster (PDF)]
My postdoctoral research focused on building models and algorithms for high-dimensional and streaming data applications, specifically approximations to fully Bayesian inference such as variational Bayesian methods. Additionally, I worked on developing semiparametric models to analyze longitudinal and functional data. I have a passion for statistical computing and reproducible research, and have contributed to a number of R packages on CRAN.
Previous areas of application have included diffusion tensor imaging, forecasting emergency medical service call arrival rates, differential expression analysis of RNA-Seq data, analyzing step counts from wearable activity trackers, measurement error modelling for dietary intake data, and financial time series. Below is an introduction to a couple of applied problems I have worked on.
We have data from a study comparing certain white matter tracts of multiple sclerosis (MS) patients with those of control subjects. MS is a central nervous system disorder that leads to lesions in the white matter of the brain, which disrupt the ability of cells in the brain to communicate with each other. Below is a human brain with two major white matter tracts, the corpus callosum and corticospinal tracts, in red and blue, respectively (thanks to Jeff Goldsmith).
Using a Magnetic Resonance Imaging technique called Diffusion Tensor Imaging, we obtain several measurements at each location in the white matter tracts. Here it makes sense to treat these measurements as coming from a continuous function X(t), where t is location along the tract, and to use tools from functional data analysis. We wish to use the functions X(t) to predict a scalar health outcome Y. For example, Y might be a score on a cognitive test or disease status (case or control). The typical approach in functional data analysis is to use the Functional Linear Model (FLM)
$$g[E(Y_i \mid X_i)] = \theta_0 + \int \beta(t)\, X_i(t)\, dt,$$
where Y_i and X_i(t) are the outcome and curve for subject i, θ_0 is an intercept, β(t) is an unknown coefficient function that needs to be estimated, and g(•) is a known monotonic function called the link function. Below is a plot of some of the curves X_i(t) as well as a corresponding estimated β(t).
However, this model is often not flexible enough to capture the true underlying relationship between the response and predictor. We propose the new model
$$g[E(Y_i \mid X_i)] = \theta_0 + \int F(X_i(t), t)\, dt,$$
where θ_0 is an intercept and F(x,t) is an unknown smooth bivariate function that we estimate using penalized regression splines. The model is much more general than the FLM, but retains the ease of interpretation. Below is an example estimated surface.
Here Y is 1 if the subject has MS and 0 if they do not, so E(Y|X) is the probability that the subject has MS and g[E(Y|X)] is that probability on the scale of the link function. The red plotted curve is X(t) for one subject with MS and the blue curve is for a subject without MS. Since the red curve traces out a higher path on the surface than the blue curve, the subject corresponding to the red curve is more likely to be classified as having MS.
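To make the comparison between the two models concrete, here is a minimal sketch of how the link-scale predictor could be evaluated numerically for a curve observed on a grid. It is not code from the refund package: the function names, the trapezoidal rule, the logit link, the constant coefficient function, and the made-up curves are all assumptions chosen for illustration. The FLM integrates β(t)X(t) along the tract, while the FGAM integrates the surface F(X(t), t) along the path traced out by the curve.

```python
import math

def trapezoid(values, grid):
    """Trapezoidal-rule approximation of an integral from samples on a grid."""
    total = 0.0
    for k in range(1, len(grid)):
        total += 0.5 * (values[k] + values[k - 1]) * (grid[k] - grid[k - 1])
    return total

def flm_predictor(x, grid, beta, intercept=0.0):
    """Link-scale FLM predictor: intercept + integral of beta(t) * X(t) dt."""
    return intercept + trapezoid([beta(t) * xt for t, xt in zip(grid, x)], grid)

def fgam_predictor(x, grid, surface, intercept=0.0):
    """Link-scale FGAM predictor: intercept + integral of F(X(t), t) dt."""
    return intercept + trapezoid([surface(xt, t) for t, xt in zip(grid, x)], grid)

def inverse_logit(eta):
    """Map a link-scale value to a probability (assumes a logit link)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical inputs: two flat curves on [0, 1] and a constant beta(t).
grid = [k / 100 for k in range(101)]
x_high = [0.8 for _ in grid]   # curve tracing a "higher path"
x_low = [0.2 for _ in grid]    # curve tracing a "lower path"
beta = lambda t: 2.0
linear_surface = lambda x, t: x * beta(t)  # with this F, the FGAM equals the FLM

p_high = inverse_logit(fgam_predictor(x_high, grid, linear_surface))
p_low = inverse_logit(fgam_predictor(x_low, grid, linear_surface))
print(p_high > p_low)
```

Choosing F(x,t) = xβ(t) recovers the FLM exactly, which is one way to see that the FGAM is the more general of the two models. Any other smooth surface can be passed in the same way.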
You may fit the FGAM and FLM yourself to several different data sets using a Shiny app I created. See the Software tab.
In this project we are interested in determining how to best deploy ambulances in a given city to minimize response times to emergency calls while keeping costs low. More specifically, we have data on every ambulance trip and every emergency call received by Toronto EMS between January 2007 and December 2008. The operations research models that address these deployment problems require accurate estimates of the number of calls that will be received during each time period. Below is a plot of the mean number of calls per hour for each day of the week.
Methods used in practice to obtain estimates are quite ad hoc. Instead, we build what is called a factor model with constraints. This provides dimension reduction, accounts for seasonal and intra-day patterns in the call arrival rate process, and lets us incorporate additional covariates. The factor model is estimated using penalized regression splines so that the factors and loadings in the factor model vary smoothly over time. Using the factor model we obtain smooth estimates for the arrival process for every day of 2008, which are plotted below. As is to be expected, there is a very noticeable difference between the estimates for weekdays and weekends.
Finally, an integer-valued GARCH model is fit to the residual process to account for any remaining dependence. Below is a plot of the estimated number of calls for each hour of weeks 8 and 9 of 2007 as well as a plot of the estimated integer-valued GARCH model fit to the residual process.
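For readers unfamiliar with integer-valued GARCH models, the sketch below shows the textbook Poisson INGARCH(1,1) recursion, in which the conditional mean of the current count depends on the previous count and the previous conditional mean. This is an illustration of the model class only. The exact specification used in the paper may differ, and the parameter names, starting value, and counts here are made up.

```python
import math

def ingarch_intensities(counts, omega, alpha, beta, initial_intensity):
    """Conditional means: lambda_t = omega + alpha * y_{t-1} + beta * lambda_{t-1}."""
    intensities = [initial_intensity]
    for y_prev in counts[:-1]:
        intensities.append(omega + alpha * y_prev + beta * intensities[-1])
    return intensities

def poisson_log_likelihood(counts, intensities):
    """Log-likelihood of the counts given their conditional Poisson means."""
    return sum(
        y * math.log(lam) - lam - math.lgamma(y + 1)
        for y, lam in zip(counts, intensities)
    )

# Made-up hourly counts and parameters, for illustration only.
counts = [3, 5, 2, 7, 4]
omega, alpha, beta = 1.0, 0.3, 0.4

# Start the recursion at the stationary mean, which exists when alpha + beta < 1.
stationary_mean = omega / (1 - alpha - beta)
intensities = ingarch_intensities(counts, omega, alpha, beta, stationary_mean)
log_lik = poisson_log_likelihood(counts, intensities)
print(intensities, log_lik)
```

Fitting such a model amounts to maximizing this log-likelihood over omega, alpha, and beta, which is omitted here.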
How do we check whether our model improves staffing levels and response times?
We simulate an M/M/s queueing system in which the servers represent ambulances, calls for emergency medical services in hour t occur at rate λ_t, and each ambulance services callers at rate ν. We use our model to obtain estimates of the number of callers for each period, and these determine the number of ambulances to staff, say s_t, in each period. After fitting the model to 2007 data, to initialize the queueing system we assume the number of arrivals each hour is the observed number of calls for the corresponding hour for 2008 and simulate inter-hour arrival times and service times for each caller.
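The evaluation idea above can be sketched in a few lines. The code below is a simplified illustration, not the simulation used in the study. It assumes a constant number of ambulances rather than a staffing level s_t that changes each hour, places each hour's calls uniformly at random within that hour, and draws exponential service times at rate ν. The function name, the counts, and the parameter values are hypothetical.

```python
import heapq
import random

def simulate_waits(hourly_counts, servers, service_rate, seed=0):
    """Simulate first-come-first-served waiting times (in hours) for all calls."""
    rng = random.Random(seed)
    # Given the number of calls in each hour, spread them uniformly over that hour.
    arrivals = sorted(
        hour + rng.random()
        for hour, n_calls in enumerate(hourly_counts)
        for _ in range(n_calls)
    )
    # Min-heap holding the time at which each ambulance next becomes free.
    free_at = [0.0] * servers
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)   # wait only if every ambulance is busy
        waits.append(start - arrival)
        heapq.heappush(free_at, start + rng.expovariate(service_rate))
    return waits

# Made-up call counts for three consecutive hours.
observed_counts = [5, 8, 3]
waits = simulate_waits(observed_counts, servers=2, service_rate=1.5)
mean_wait = sum(waits) / len(waits)
print(round(mean_wait, 3))
```

Rerunning the simulation with staffing levels implied by competing forecasts, and comparing the resulting waiting times, gives a direct way to judge which forecast leads to better deployment decisions.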
R Packages
I have authored and help maintain several R packages for Displayr. These include packages for data cleaning, regression, machine learning, and choice modeling. Most of the packages are publicly available on GitHub here.
RefManageR - a Reference Manager for R
RefManageR provides tools for importing and working with bibliographic references. It greatly enhances the bibentry class by providing a class BibEntry which stores BibTeX and BibLaTeX references, supports UTF-8 encoding, and can be easily searched by any field, by date ranges, and by various formats for name lists (author by last names, translator by full names, etc.). Entries can be updated, sorted, combined, printed in a number of styles, and exported. BibTeX and BibLaTeX .bib files can be read into R and converted to BibEntry objects. Interfaces to NCBI's Entrez, CrossRef and Zotero are provided for importing references, and references can be created from locally stored .bib files or PDFs using Poppler. The package can also be used in RMarkdown and RHTML documents for including citations and printing a bibliography of all cited entries. Both the bibliography and the citations can include automatically generated hyperlinks. The package is available now on CRAN and a vignette is available here. The package was peer-reviewed and accepted by ROpenSci.
refund - REgression with FUNctional Data
The following functions were contributed to the refund package available on CRAN and GitHub:
fgam - a wrapper for the gam function in package mgcv to fit functional generalized additive models.
predict.fgam - wrapper for the predict.gam function in package mgcv for prediction with fgam fits.
vis.fgam - for visualizing estimated surfaces produced by fgam.
af - internal function for building fgam terms specified in model formulas passed to fgam.
lf - internal function for building functional linear model terms specified in model formulas passed to fgam.
In the latest versions of refund, the function fgam is deprecated in favour of the function pfr, which can fit several different scalar-on-function regression models including FGAM.
curvHDR
Provided assistance getting the curvHDR R package accepted to CRAN, including setting up package namespace and including compiled code.
Shiny Apps
Below is a Shiny app I created that performs estimation, visualization, inference, and prediction for the functional generalized additive model (FGAM) and the functional linear model (FLM). More information on these models can be found in the Research tab and in my papers from the Papers tab. Begin by selecting a data set and choosing model parameters and an estimation method (if you are familiar with penalized splines). You may also specify options for displaying the fits, conducting a hypothesis test of FLM vs. FGAM, and performing out-of-sample prediction. When you have selected the settings you want, you may fit the models by clicking the "Fit Models" button. Source code for the app is available on GitHub.
A change to the splines package in R means this code no longer runs in current versions of R. Those interested in fitting an FGAM should see the pfr function in the package refund.
Below is a sampling of blog posts I wrote in a previous role. They aim to describe aspects of R, choice modelling, and using the software Q and Displayr for a general audience of non-programmers and data scientists.
A complete list of my posts with better styling can be found on the Displayr blog.