10.5061/DRYAD.737GK
Roberts, David R.
University of Freiburg
Bahn, Volker
Wright State University
Ciuti, Simone
University of Freiburg
Boyce, Mark S.
University of Alberta
Elith, Jane
University of Melbourne
Guillera-Arroita, Gurutzeta
University of Melbourne
Hauenstein, Severin
University of Freiburg
Lahoz-Monfort, José J.
University of Melbourne
Schröder, Boris
Technische Universität Braunschweig
Thuiller, Wilfried
Université Grenoble Alpes
Warton, David I.
UNSW Sydney
Wintle, Brendan A.
University of Melbourne
Hartig, Florian
University of Freiburg
University of Regensburg
Dormann, Carsten F.
University of Freiburg
Data from: Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure
Dryad
dataset
2016
Extrapolation
Overfitting
Autocorrelation
2016-12-08T15:55:33Z
2016-12-08T15:55:33Z
en
https://doi.org/10.1111/ecog.02881
10631282 bytes
1
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Ecological data often show temporal, spatial, hierarchical (random
effects), or phylogenetic structure. Modern statistical approaches are
increasingly accounting for such dependencies. However, when performing
cross-validation, these structures are regularly ignored, resulting in
serious underestimation of predictive error. One cause for the poor
performance of uncorrected (random) cross-validation, often noted by
modellers, is the presence of dependence structures in the data that
persist as dependence structures in the model residuals, violating the
assumption of independence. Even more concerning, because often overlooked, is that
structured data also provide ample opportunity for overfitting with
non-causal predictors. This problem can persist even if remedies such as
autoregressive models, generalized least squares, or mixed models are
used. Block cross-validation, where data are split strategically rather
than randomly, can address these issues. However, the blocking strategy
must be carefully considered. Blocking in space, time, random effects or
phylogenetic distance, while accounting for dependencies in the data, may
also unwittingly induce extrapolations by restricting the ranges or
combinations of predictor variables available for model training, thus
overestimating interpolation errors. On the other hand, deliberate
blocking in predictor space may also improve error estimates when
extrapolation is the modelling goal. Here, we review the ecological
literature on non-random and blocked cross-validation approaches. We also
provide a series of simulations and case studies, in which we show that,
in all instances tested, block cross-validation is more appropriate than
random cross-validation when the goal is predicting to new data or new
predictor space, or selecting causal predictors. We
recommend that block cross-validation be used wherever dependence
structures exist in a dataset, even if no correlation structure is visible
in the fitted model residuals, or if the fitted models account for such
correlations.
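The contrast the abstract draws between random and block cross-validation can be sketched as follows. This is a minimal, hypothetical illustration in Python, not the authors' R code from Appendix 6; the function names `random_folds` and `block_folds` are our own. The key point is that block folds keep structurally related observations (e.g. consecutive time steps or neighbouring sites) together, so each held-out fold is separated from the training data rather than interleaved with it.

```python
import random

def random_folds(n, k, seed=0):
    """Standard random K-fold assignment: indices are shuffled,
    then dealt round-robin into k folds. Dependent observations
    (e.g. adjacent time steps) end up split across train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]

def block_folds(n, k):
    """Block K-fold assignment: contiguous runs of indices (assuming
    the data are ordered by time, space, or group), so each held-out
    block is separated from the training data in that structure."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]
```

For grouped (hierarchical) or spatial data, the same idea applies with blocks defined by group membership or spatial distance instead of index order; the simulations in Boxes 1-4 of the manuscript explore these settings with the R code provided in this dataset.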
ECOG-02881.R1 Appendix 6 - R code and data for case studies and simulations
(as presented in Boxes 1-4 in the manuscript).