Why a pre-analysis audit changes interpretation

Why Interpretation Needs An Audit

A microbiome result can look biologically plausible while still being driven by study design, batch structure, or validation leakage. MOAT is meant to make those risks visible before the downstream model becomes the story.

This vignette shows three compact examples where a naive interpretation changes after checking the audit:

a beta-diversity signal dominated by batch,
outcome information already present in metadata,
validation leakage from repeated subjects or batch-outcome alignment.

library(moat)
data("toy_moat")

Batch-Dominated Beta Diversity

A common first pass is to ask whether microbiome composition is associated with the outcome. The code below fits an outcome-only PERMANOVA on Bray-Curtis distances.

metadata <- as.data.frame(SummarizedExperiment::colData(toy_moat))
distance <- compute_biome_distance(toy_moat, distance = "bray")

naive <- check_permanova(
  distance,
  metadata = metadata,
  outcome = "outcome",
  n_perm = 99
)

naive$terms[, c("term", "r2", "p_value")]
#>            term          r2 p_value
#> outcome outcome 0.003494571    0.91
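
For readers who want to see what the r2 and p_value columns measure, the following is a minimal base-R sketch of a one-way PERMANOVA. It is independent of moat and makes several assumptions: permanova_r2() and permanova_p() are hypothetical helper names, Euclidean distance from stats::dist() stands in for Bray-Curtis, and the data are simulated with a deliberately strong batch shift.

```r
# One-way PERMANOVA R2: share of the total sum of squared distances
# that lies between groups rather than within them.
permanova_r2 <- function(d, group) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  group <- as.factor(group)
  # Total sum of squares from all pairwise squared distances.
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # Within-group sum of squares, accumulated group by group.
  ss_within <- sum(vapply(levels(group), function(g) {
    idx <- which(group == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }, numeric(1)))
  (ss_total - ss_within) / ss_total
}

# Permutation p-value: how often shuffled labels explain at least as much.
permanova_p <- function(d, group, n_perm = 99) {
  observed <- permanova_r2(d, group)
  permuted <- replicate(n_perm, permanova_r2(d, sample(group)))
  (sum(permuted >= observed) + 1) / (n_perm + 1)
}

set.seed(1)
# Simulated counts (not toy_moat): two batches with very different depth,
# and an outcome that alternates within each batch.
x <- rbind(
  matrix(rpois(10 * 5, lambda = 5), nrow = 10),
  matrix(rpois(10 * 5, lambda = 50), nrow = 10)
)
batch <- rep(c("B1", "B2"), each = 10)
outcome <- rep(c("Control", "Disease"), times = 10)
d <- dist(x)

r2_outcome <- permanova_r2(d, outcome)
r2_batch <- permanova_r2(d, batch)
p_batch <- permanova_p(d, batch)
```

In this simulated case r2_batch is far larger than r2_outcome, which is the pattern an outcome-only fit cannot reveal because it never looks at batch.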

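Read alone, the outcome-only table keeps the interpretation centered on the outcome: it shows that outcome explains almost none of the variation (R2 of about 0.003, p = 0.91) but says nothing about what does. The audit adds the batch variable and reports the outcome and batch contributions together.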

audit <- moat(
  toy_moat,
  outcome = "outcome",
  batch = "batch",
  distances = "bray",
  n_perm = 99,
  verbose = FALSE
)

audit$batch$summary[, c(
  "distance",
  "outcome_r2",
  "batch_r2",
  "batch_dominance_score",
  "permanova_risk",
  "order_sensitivity_risk",
  "risk"
)]
#>   distance  outcome_r2  batch_r2 batch_dominance_score permanova_risk
#> 1     bray 0.003494571 0.9477919              271.2184           high
#>   order_sensitivity_risk risk
#> 1                    low high
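
The printed batch_dominance_score appears to match the ratio of batch_r2 to outcome_r2. That reading is inferred from the numbers shown here, not from moat's documentation, so treat the sketch below as a sanity check on the table rather than a description of how the score is computed.

```r
# Values copied from the audit summary printed above.
outcome_r2 <- 0.003494571
batch_r2 <- 0.9477919

# Assumption: the dominance score behaves like a plain ratio of the two R2 values.
dominance_ratio <- batch_r2 / outcome_r2
round(dominance_ratio, 1)
```

The ratio comes out at about 271, in line with the score in the table.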

Here batch explains about 95% of the variation in the Bray-Curtis distances, while outcome explains less than 1%. When batch explains as much variation as the outcome or more, the result should be reported as a batch-sensitive finding rather than a clean biological separation. The practical next step is to carry the audit into the analysis plan.

plan <- plan_analysis(audit)

plan$permanova
#> $formula
#> [1] "distance ~ `outcome` + `batch`"
#> 
#> $display
#> [1] "distance ~ outcome + batch"
#> 
#> $term_order
#> [1] "outcome" "batch"  
#> 
#> $reason
#> [1] "Use the same term order as the batch audit: outcome, then batch, then covariates."
plan$batch_strategy
#> $strategy
#> [1] "sensitivity_required"
#> 
#> $reason
#> [1] "Batch explains substantial microbiome variation; report analyses with explicit batch sensitivity checks."

Metadata-Only Outcome Predictability

Microbiome models can also inherit information from metadata. In this small example, center is strongly aligned with outcome before any microbiome feature is used.

metadata_confounded <- data.frame(
  outcome = rep(c("Control", "Disease"), each = 20),
  center = rep(c("Center_A", "Center_B"), each = 20),
  age_group = rep(c("young", "old"), times = 20)
)

table(metadata_confounded$center, metadata_confounded$outcome)
#>           
#>            Control Disease
#>   Center_A      20       0
#>   Center_B       0      20

check_design() stores a metadata-only predictability diagnostic as an attribute of its result. When metadata alone predicts the outcome well, downstream microbiome signatures may partly reflect design or validation structure rather than biology.

design <- check_design(
  metadata_confounded,
  outcome = "outcome",
  batch = "center",
  covariates = "age_group"
)

design[, c("variable", "role", "effect_size_name", "effect_size", "risk")]
#>    variable      role effect_size_name effect_size     risk
#> 1    center     batch        cramers_v           1 critical
#> 2 age_group covariate        cramers_v           0      low

metadata_predictability <- attr(design, "metadata_predictability")
metadata_predictability[c("status", "risk", "cv_balanced_accuracy", "recommendations")]
#> $status
#> [1] "evaluated"
#> 
#> $risk
#> [1] "high"
#> 
#> $cv_balanced_accuracy
#> [1] 1
#> 
#> $recommendations
#> [1] "Metadata alone predicts the outcome strongly; treat downstream microbiome models as high risk for design confounding."
#> [2] "Use validation splits that block or stratify by the predictive metadata variables when possible."

The interpretation changes from “metadata are harmless annotations” to “metadata already carry outcome information”. Any microbiome model should report this risk and use validation that does not let center or batch structure stand in for biology.
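
The two diagnostics in this section can be reproduced by hand on the same toy metadata. The sketch below uses base R only and rests on assumptions: cramers_v() is a hypothetical helper implementing the standard uncorrected Cramér's V, and the leave-one-out lookup rule is a stand-in for whatever model and cross-validation scheme check_design() uses internally, which this vignette does not show.

```r
# Same toy metadata as above: center is perfectly aligned with outcome.
md <- data.frame(
  outcome = rep(c("Control", "Disease"), each = 20),
  center = rep(c("Center_A", "Center_B"), each = 20),
  age_group = rep(c("young", "old"), times = 20)
)

# Uncorrected Cramér's V from the chi-squared statistic of a contingency table.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (sum(tab) * k)))
}

v_center <- cramers_v(md$center, md$outcome)
v_age <- cramers_v(md$age_group, md$outcome)

# Metadata-only prediction: for each sample, predict the most common outcome
# among the other samples from the same center (leave-one-out).
loo_pred <- vapply(seq_len(nrow(md)), function(i) {
  same_center <- md$center[-i] == md$center[i]
  names(which.max(table(md$outcome[-i][same_center])))
}, character(1))

# Balanced accuracy: mean of the per-class recall values.
recall <- tapply(loo_pred == md$outcome, md$outcome, mean)
balanced_accuracy <- mean(recall)
```

On this data v_center is 1, v_age is 0, and the center-only rule reaches a balanced accuracy of 1 without using a single microbiome feature.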

Validation Leakage

Leakage appears when the validation split lets the model reuse structure that will not generalize. Repeated subjects are a simple example: if samples from the same subject appear in both train and test folds, row-wise cross-validation can report better performance than the model would achieve on new subjects.

repeated_biome <- toy_moat
SummarizedExperiment::colData(repeated_biome)$subject <- rep(
  paste0("S", seq_len(ncol(repeated_biome) / 2)),
  each = 2
)
SummarizedExperiment::colData(repeated_biome)$timepoint <- rep(
  c("baseline", "followup"),
  times = ncol(repeated_biome) / 2
)

repeated_metadata <- as.data.frame(SummarizedExperiment::colData(repeated_biome))
head(repeated_metadata)
#>     sample_id outcome   batch subject timepoint
#> S01       S01 Control Batch_1      S1  baseline
#> S02       S02 Control Batch_1      S1  followup
#> S03       S03 Control Batch_1      S2  baseline
#> S04       S04 Disease Batch_1      S2  followup
#> S05       S05 Control Batch_1      S3  baseline
#> S06       S06 Disease Batch_1      S3  followup

check_leakage() turns that metadata structure into validation guidance.

leakage <- check_leakage(
  repeated_metadata,
  outcome = "outcome",
  subject = "subject",
  batch = "batch",
  time = "timepoint"
)

leakage[c("risk", "recommended_cv", "recommendations")]
#> $risk
#> [1] "high"
#> 
#> $recommended_cv
#> [1] "grouped_time_aware_cv_by_subject"
#> 
#> $recommendations
#> [1] "Overall leakage risk is high."                                                                  
#> [2] "Multiple samples share subject IDs; use grouped cross-validation by subject."                   
#> [3] "Batch variables appear balanced enough for standard validation."                                
#> [4] "Repeated subjects span multiple timepoint values; use grouped time-aware validation by subject."

The same recommendation is carried by the full audit and the analysis plan.

leakage_audit <- moat(
  repeated_biome,
  outcome = "outcome",
  batch = "batch",
  subject = "subject",
  time = "timepoint",
  distances = "bray",
  n_perm = 99,
  verbose = FALSE
)

leakage_plan <- plan_analysis(leakage_audit)

leakage_plan$ml_validation
#> $scheme
#> [1] "grouped_time_aware_cv_by_subject"
#> 
#> $reason
#> [1] "Use grouped time-aware validation because repeated subjects span multiple timepoints."
leakage_plan$permutation
#> $scheme
#> [1] "restricted_by_subject_and_time"
#> 
#> $strata
#> [1] "subject"
#> 
#> $reason
#> [1] "Repeated subjects span multiple timepoints; preserve subject grouping and temporal order."
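
The grouping part of that recommendation can be sketched in base R. This is an illustration under assumptions, not moat's implementation: the subject IDs are made up, the variable names are hypothetical, the row-wise comparison uses a deterministic round-robin assignment so the result is reproducible, and the time-aware part of the scheme is not covered.

```r
set.seed(1)

# Hypothetical design: 10 subjects with 2 samples each.
subject <- rep(paste0("S", 1:10), each = 2)
k <- 5

# Grouped folds: draw one fold per subject, then map it back to the samples,
# so both samples of a subject always land in the same fold.
subjects <- unique(subject)
fold_of_subject <- setNames(sample(rep_len(seq_len(k), length(subjects))), subjects)
grouped_folds <- unname(fold_of_subject[subject])

# Row-wise folds for comparison: samples are assigned without regard to subject.
rowwise_folds <- rep_len(seq_len(k), length(subject))

# Count subjects whose samples are spread over more than one fold.
folds_per_subject <- function(folds) {
  tapply(folds, subject, function(f) length(unique(f)))
}
split_grouped <- sum(folds_per_subject(grouped_folds) > 1)
split_rowwise <- sum(folds_per_subject(rowwise_folds) > 1)
```

Here split_grouped is 0 while split_rowwise is 10: under the row-wise assignment every subject contributes one sample to training and one to testing in some fold, which is exactly the leakage the audit flags.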

Practical Reporting Pattern

For each downstream analysis, report the naive target next to the audit-guided qualification:

report outcome R2 together with batch R2 before interpreting beta diversity,
report metadata-only predictability before claiming microbiome prediction,
report leakage-aware validation when repeated subjects, time, or batch structure are present,
use plan_analysis() to turn the audit into formulas, validation schemes, and sensitivity checks.

Session Information