Validation leakage in microbiome machine learning
Francesc Catala-Moll
IrsiCaixa AIDS Research Institute
Source: vignettes/leakage-ml.Rmd
Why Leakage Matters
Microbiome machine-learning models often work with samples, but the scientific unit of generalization may be a subject, hospital, sequencing run, visit, or study site. If train and test folds split those units incorrectly, the model can learn repeated-subject structure or batch signatures instead of biology.
This vignette focuses on three common leakage patterns:
- repeated measurements from the same subject,
- outcome labels aligned with batch or center,
- time-ordered samples from the same subjects.
MOAT does not train the machine-learning model. It audits metadata before modeling and recommends validation schemes that make leakage harder to miss.
Row-Wise Cross-Validation Can Be Overoptimistic
Row-wise cross-validation treats each sample as independent. That assumption is fragile when several rows come from the same person. In the small simulated dataset below, every subject contributes two samples.
repeated_biome <- toy_moat
SummarizedExperiment::colData(repeated_biome)$subject <- rep(
  paste0("S", seq_len(ncol(repeated_biome) / 2)),
  each = 2
)
SummarizedExperiment::colData(repeated_biome)$timepoint <- rep(
  c("baseline", "followup"),
  times = ncol(repeated_biome) / 2
)
metadata <- as.data.frame(SummarizedExperiment::colData(repeated_biome))
head(metadata)
#> sample_id outcome batch subject timepoint
#> S01 S01 Control Batch_1 S1 baseline
#> S02 S02 Control Batch_1 S1 followup
#> S03 S03 Control Batch_1 S2 baseline
#> S04 S04 Disease Batch_1 S2 followup
#> S05 S05 Control Batch_1 S3 baseline
#> S06 S06 Disease Batch_1 S3 followup
If ordinary row-wise folds are used, one sample from a subject can be in the training fold while another sample from the same subject is in the test fold. The test fold is then no longer an independent subject-level evaluation.
repeated <- check_repeated_measures(metadata, subject = "subject")
repeated[c(
  "n_samples",
  "n_subjects",
  "n_repeated_subjects",
  "max_samples_per_subject",
  "risk",
  "recommended_cv"
)]
#> $n_samples
#> [1] 40
#>
#> $n_subjects
#> [1] 20
#>
#> $n_repeated_subjects
#> [1] 20
#>
#> $max_samples_per_subject
#> [1] 2
#>
#> $risk
#> [1] "high"
#>
#> $recommended_cv
#> [1] "grouped_cv_by_subject"The actionable recommendation is to keep samples from the same subject together inside each resampling split.
head(repeated$samples_per_subject)
#> subject n_samples repeated
#> 1 S1 2 TRUE
#> 2 S10 2 TRUE
#> 3 S11 2 TRUE
#> 4 S12 2 TRUE
#> 5 S13 2 TRUE
#> 6 S14 2 TRUE
Use Grouped Cross-Validation
check_leakage() combines repeated-measure, batch, and temporal diagnostics. For repeated measures without an explicit time variable, the recommended scheme is grouped cross-validation by subject.
leakage_grouped <- check_leakage(
  metadata,
  outcome = "outcome",
  subject = "subject"
)
leakage_grouped$recommended_cv
#> [1] "grouped_cv_by_subject"
leakage_grouped$recommendations
#> [1] "Overall leakage risk is high."
#> [2] "Multiple samples share subject IDs; use grouped cross-validation by subject."
#> [3] "No batch variable provided; batch-driven validation leakage was not evaluated."
#> [4] "No time variable provided; temporal leakage was not evaluated."In downstream modeling, that means fold assignment should happen at subject level first. All rows from a subject should move together into train or test.
Leave-One-Center-Out For Batch Leakage
Another failure mode appears when a center, site, or sequencing batch is aligned with the outcome. A model can then perform well by recognizing center-specific microbiome or technical signatures.
confounded_metadata <- data.frame(
  outcome = rep(c("Control", "Disease"), each = 10),
  batch = rep(c("Center_A", "Center_B"), each = 10)
)
table(confounded_metadata$batch, confounded_metadata$outcome)
#>
#> Control Disease
#> Center_A 10 0
#> Center_B 0 10
Here each batch level contains only one outcome group. A random row-wise split can put both centers into both train and test, making center recognition look like outcome prediction.
batch_leakage <- check_leakage(
  confounded_metadata,
  outcome = "outcome",
  batch = "batch"
)
batch_leakage$risk
#> [1] "high"
batch_leakage$recommended_cv
#> [1] "leave_one_batch_out_cv"
batch_leakage$batch_leakage$summary
#> batch n n_batch_levels n_outcome_levels empty_cells
#> 1 batch 20 2 2 2
#> batch_levels_single_outcome positivity_score effect_size effect_size_name
#> 1 2 0.5 1 cramers_v
#> risk
#> 1 high
For this scenario, a leave-one-batch-out strategy is a useful stress test. It asks whether the model still works when a whole center or batch is held out.
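A leave-one-batch-out split can also be written directly in base R. The sketch below iterates over the batch levels of the confounded_metadata frame from above; it is a generic illustration, not a MOAT helper.

```r
confounded_metadata <- data.frame(
  outcome = rep(c("Control", "Disease"), each = 10),
  batch   = rep(c("Center_A", "Center_B"), each = 10)
)

# One split per batch level: hold out the whole center,
# train on everything else.
splits <- lapply(unique(confounded_metadata$batch), function(held_out) {
  list(
    train = which(confounded_metadata$batch != held_out),
    test  = which(confounded_metadata$batch == held_out)
  )
})

length(splits)
#> [1] 2
```

With complete confounding as here, each training fold contains only one outcome class, so the stress test fails immediately; that failure is itself the signal that center identity, not biology, was driving row-wise performance.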
Time-Aware Validation
Longitudinal data add another constraint: train folds should not silently use future samples from the same subject to predict earlier or held-out samples. When repeated subjects span timepoints, MOAT recommends a grouped time-aware strategy.
leakage_time <- check_leakage(
  metadata,
  outcome = "outcome",
  subject = "subject",
  time = "timepoint"
)
leakage_time$risk
#> [1] "high"
leakage_time$recommended_cv
#> [1] "grouped_time_aware_cv_by_subject"
leakage_time$temporal_leakage[c(
  "n_timepoints",
  "subjects_with_multiple_timepoints",
  "risk"
)]
#> $n_timepoints
#> [1] 2
#>
#> $subjects_with_multiple_timepoints
#> [1] 20
#>
#> $risk
#> [1] "high"This does not replace a full temporal validation design. It flags that ordinary row-wise cross-validation is not enough for the metadata structure supplied.
Full Audit Integration
The same leakage recommendations flow through moat() and plan_analysis(). This lets the validation choice sit next to design, batch, and correction diagnostics instead of being handled as an afterthought.
audit <- moat(
  repeated_biome,
  outcome = "outcome",
  subject = "subject",
  time = "timepoint",
  distances = "bray",
  n_perm = 99,
  verbose = FALSE
)
summary(audit)
#>
#> ── MOAT audit ──────────────────────────────────────────────────────────────────
#> ℹ Overall risk: HIGH
#>
#> ── Main warnings ──
#>
#> • Overall leakage risk is high.
#> • Multiple samples share subject IDs; use grouped cross-validation by subject.
#> • No batch variable provided; batch-driven validation leakage was not
#> evaluated.
#> • Repeated subjects span multiple timepoint values; use grouped time-aware
#> validation by subject.
#>
#> ── Recommended next steps ──
#>
#> • Overall leakage risk is high.
#> • Multiple samples share subject IDs; use grouped cross-validation by subject.
#> • No batch variable provided; batch-driven validation leakage was not
#> evaluated.
#> • Repeated subjects span multiple timepoint values; use grouped time-aware
#> validation by subject.
plan <- plan_analysis(audit)
plan$ml_validation
#> $scheme
#> [1] "grouped_time_aware_cv_by_subject"
#>
#> $reason
#> [1] "Use grouped time-aware validation because repeated subjects span multiple timepoints."
plan$permutation
#> $scheme
#> [1] "restricted_by_subject_and_time"
#>
#> $strata
#> [1] "subject"
#>
#> $reason
#> [1] "Repeated subjects span multiple timepoints; preserve subject grouping and temporal order."Preprocessing Leakage Checklist
Validation leakage is not only about fold assignment. Any operation that learns from the full dataset before splitting can leak information into test folds.
audit$leakage$preprocessing_checklist
#> [1] "Fit transformations, feature filtering, scaling, imputation, and feature selection inside each training fold."
#> [2] "Do not use test-fold labels, batches, subjects, or timepoints when estimating preprocessing parameters."
#> [3] "Report the validation split variable used for grouped, batch-aware, or time-aware resampling."Before fitting a model, check that the following steps are estimated inside each training fold and then applied to the held-out fold:
- feature filtering,
- prevalence filtering,
- normalization or scaling parameters,
- imputation,
- feature selection,
- hyperparameter tuning.
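As one concrete example of fold-internal preprocessing, centering and scaling parameters can be estimated on the training rows only and then reused on the held-out rows. In the base-R sketch below, x, train, and test are hypothetical placeholders, not MOAT objects.

```r
set.seed(1)
x <- matrix(rnorm(40 * 5), nrow = 40)  # hypothetical feature matrix
train <- 1:30
test  <- 31:40

# Estimate scaling parameters on the training fold only.
mu  <- colMeans(x[train, ])
sdv <- apply(x[train, ], 2, sd)

# Apply the same parameters to both folds; the test fold
# never contributes to mu or sdv.
x_train <- sweep(sweep(x[train, ], 2, mu, "-"), 2, sdv, "/")
x_test  <- sweep(sweep(x[test,  ], 2, mu, "-"), 2, sdv, "/")
```

Calling scale() on the full matrix before splitting would instead let test rows influence mu and sdv, which is exactly the leakage the checklist warns about.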
Practical Checklist For ML Users
Use this checklist before interpreting microbiome ML performance:
- Are multiple samples from the same subject kept in the same fold?
- If outcome is aligned with center or batch, is leave-one-center-out or leave-one-batch-out performance reported?
- For longitudinal data, does validation respect subject grouping and time?
- Are preprocessing, filtering, scaling, and feature selection nested inside resampling?
- Are ordinary CV and leakage-aware CV results shown side by side when risk is moderate or high?
Session Information
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] moat_0.99.0 BiocStyle_2.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4
#> [3] SparseArray_1.10.10 lattice_0.22-9
#> [5] digest_0.6.39 magrittr_2.0.5
#> [7] evaluate_1.0.5 grid_4.5.2
#> [9] RColorBrewer_1.1-3 bookdown_0.46
#> [11] fastmap_1.2.0 Matrix_1.7-5
#> [13] jsonlite_2.0.0 BiocManager_1.30.27
#> [15] scales_1.4.0 textshaping_1.0.5
#> [17] jquerylib_0.1.4 abind_1.4-8
#> [19] cli_3.6.6 rlang_1.2.0
#> [21] XVector_0.50.0 Biobase_2.70.0
#> [23] DelayedArray_0.36.1 cachem_1.1.0
#> [25] yaml_2.3.12 otel_0.2.0
#> [27] S4Arrays_1.10.1 tools_4.5.2
#> [29] dplyr_1.2.1 ggplot2_4.0.3
#> [31] SummarizedExperiment_1.40.0 BiocGenerics_0.56.0
#> [33] vctrs_0.7.3 R6_2.6.1
#> [35] matrixStats_1.5.0 stats4_4.5.2
#> [37] lifecycle_1.0.5 Seqinfo_1.0.0
#> [39] S4Vectors_0.48.1 fs_2.1.0
#> [41] htmlwidgets_1.6.4 IRanges_2.44.0
#> [43] ragg_1.5.2 pkgconfig_2.0.3
#> [45] desc_1.4.3 pkgdown_2.2.0.9000
#> [47] pillar_1.11.1 bslib_0.10.0
#> [49] gtable_0.3.6 glue_1.8.1
#> [51] systemfonts_1.3.2 xfun_0.57
#> [53] tibble_3.3.1 GenomicRanges_1.62.1
#> [55] tidyselect_1.2.1 MatrixGenerics_1.22.0
#> [57] knitr_1.51 farver_2.1.2
#> [59] htmltools_0.5.9 rmarkdown_2.31
#> [61] compiler_4.5.2 S7_0.2.2