Test 10 - Calibration Test
Purpose
Run calibration checks for the model on the evaluation cohort. Produces calibration pivot tables and bootstrap calibration graphs showing model calibration across score bins, time windows, and cohorts.
Required Inputs
From configs/env.sh and the test invocation environment:
- WORK_DIR: Working folder where outputs will be written
- FIRST_WORK_DIR: Reference run (used to obtain baseline predictions)
- BT_COHORT_CALIBRATION: Cohorts file for calibration analysis
- BT_JSON_CALIBRATION: Bootstrap JSON for calibration-specific metrics
- Optional: ALT_PREDS_PATH for comparator predictions
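Before launching the test it can help to confirm that the required variables are actually set. The following is a minimal sketch, assuming the variables are exported into the process environment (for example after sourcing configs/env.sh); the helper name missing_inputs is hypothetical and not part of the TestKit.

```python
import os

# Variable names are taken from the list above; ALT_PREDS_PATH is optional.
REQUIRED = ("WORK_DIR", "FIRST_WORK_DIR", "BT_COHORT_CALIBRATION", "BT_JSON_CALIBRATION")
OPTIONAL = ("ALT_PREDS_PATH",)


def missing_inputs(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing required inputs:", ", ".join(missing))
    else:
        print("All required inputs are set.")
```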
How to Run
From your TestKit folder, execute:
Or include as part of the full suite:
Check ${WORK_DIR}/calibration for calibration outputs and graphs.
What This Test Does
- Joins samples with predictions, similar to Test 07, to create ${WORK_DIR}/bootstrap/all.preds and filters to ${WORK_DIR}/bootstrap/eligible_only.preds.
- Validates that BT_COHORT_CALIBRATION contains yearly breakdowns and warns if they are not present.
- Invokes TestCalibration with the provided cohorts and JSON model to produce calibration outputs under ${WORK_DIR}/calibration/test_calibration and calibration graphs under ${WORK_DIR}/calibration/graphs.
- Prints a brief numeric summary of mean score and incidence.
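The final step reports mean score and incidence. As a rough illustration of what such a summary compares, here is a minimal sketch that assumes scores and binary outcomes are already loaded into Python lists. The function name and data layout are hypothetical; the actual script works from the preds files and may compute the summary differently.

```python
def summarize(scores, outcomes):
    """Return sample count, mean predicted score, and observed incidence.

    scores:   predicted probabilities, one per sample
    outcomes: 0/1 labels aligned with scores
    """
    if not scores or len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and the same length")
    n = len(scores)
    return {
        "n": n,
        "mean_score": sum(scores) / n,
        "incidence": sum(outcomes) / n,
    }


if __name__ == "__main__":
    summary = summarize([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 0])
    print(summary)
```

For a well-calibrated model on a representative cohort, the mean score and the incidence should be close to each other.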
Output Location
- Calibration outputs: ${WORK_DIR}/calibration/test_calibration (tables and pivot files)
- Calibration graphs: ${WORK_DIR}/calibration/graphs
How to Interpret Results
- Open pivot tables and graphs to inspect calibration curves across bins, cohorts, and time windows.
- Confirm expected calibration behaviour (e.g., predicted probabilities aligned with observed incidence per bin).
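To make "predicted probabilities aligned with observed incidence per bin" concrete, the sketch below builds a small calibration table from scores and outcomes. It assumes scores lie in [0, 1] and uses simple equal-width bins, which is a simplification and not the binning the test itself applies; the function name is hypothetical.

```python
def calibration_table(scores, outcomes, n_bins=10):
    """Group samples into equal-width score bins and compare mean
    predicted score with observed incidence in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for score, outcome in zip(scores, outcomes):
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((score, outcome))

    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        n = len(members)
        rows.append({
            "bin": idx,
            "n": n,
            "mean_pred": sum(s for s, _ in members) / n,
            "observed": sum(y for _, y in members) / n,
        })
    return rows


if __name__ == "__main__":
    for row in calibration_table([0.05, 0.15, 0.15, 0.95], [0, 0, 1, 1]):
        print(row)
```

Bins where mean_pred and observed differ by a wide margin are the ones to look at first in the pivot tables and graphs.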
Common failure modes and suggestions
- Missing TestCalibration utility or incompatible flags:
  - Ensure TestCalibration exists and the JSON/params supplied match the utility's expected schema.
- Insufficient sample size for fine-grained binning:
  - Adjust pred_binning_arg or bt_params to use fewer bins or fewer bootstrap loops.
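A quick way to rule out the first failure mode is to check whether the utility can be found at all. This is a minimal sketch, assuming TestCalibration is invoked by name and therefore needs to be on PATH; if your setup calls it by an absolute path, check that path instead.

```python
import shutil


def find_tool(name="TestCalibration"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("TestCalibration not found on PATH; check your environment setup.")
    else:
        print("Using TestCalibration at", path)
```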
Example output snippets
Notes and Implementation Details
- The script uses pred_binning_arg with iterative_merge and parameters tuned to create up to 100 bins with minimum counts; it also runs loopCnt=500 bootstrap samples by default.
- Calibration requires a good spread of predicted scores; if all scores are near a single value, binning will collapse and calibration plots will be uninformative.
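The two notes above can be illustrated with a small sketch. It shows one plausible reading of merging bins to satisfy a minimum count, plus a percentile bootstrap of observed incidence. Both functions are hypothetical illustrations: the actual iterative_merge logic and bootstrap procedure inside the tool are not documented here and may differ.

```python
import random


def merge_small_bins(counts, min_count):
    """Greedily merge adjacent bins, left to right, until each merged bin
    holds at least min_count samples. Any leftover tail is folded into the
    last merged bin. Returns (first_bin, last_bin, count) tuples."""
    merged = []
    start, running = 0, 0
    for idx, count in enumerate(counts):
        running += count
        if running >= min_count:
            merged.append((start, idx, running))
            start, running = idx + 1, 0
    if start < len(counts):
        if merged:
            first, _, total = merged.pop()
            merged.append((first, len(counts) - 1, total + running))
        else:
            merged.append((0, len(counts) - 1, running))
    return merged


def bootstrap_incidence_ci(outcomes, loop_cnt=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for observed incidence in one bin."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    estimates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(loop_cnt))
    low = estimates[int((alpha / 2) * loop_cnt)]
    high = estimates[min(loop_cnt - 1, int((1 - alpha / 2) * loop_cnt))]
    return low, high


if __name__ == "__main__":
    # Scores spread over several bins: three merged bins survive.
    print(merge_small_bins([5, 1, 2, 10, 1], min_count=3))
    # All scores near one value: everything collapses into a single bin.
    print(merge_small_bins([0, 0, 100, 0], min_count=10))
    print(bootstrap_incidence_ci([0, 0, 1, 0, 1, 0, 0, 0, 1, 0]))
```

The second call shows the collapse described above: with every sample in one narrow score range, only one bin remains and a calibration curve cannot be drawn from it.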
Test Results Review
Primary files to inspect after running this test:
- ${WORK_DIR}/calibration/test_calibration (pivot and summary tables)
- ${WORK_DIR}/calibration/graphs/*
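To see at a glance what a run produced, a short listing helper can be used. This sketch assumes that both test_calibration and graphs are directories under ${WORK_DIR}/calibration, as the paths above suggest; the function name is hypothetical.

```python
import os
from pathlib import Path


def list_calibration_outputs(work_dir):
    """Return the file names found in each calibration output folder.
    A missing folder is reported as an empty list."""
    base = Path(work_dir) / "calibration"
    result = {}
    for sub in ("test_calibration", "graphs"):
        folder = base / sub
        result[sub] = sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []
    return result


if __name__ == "__main__":
    outputs = list_calibration_outputs(os.environ.get("WORK_DIR", "."))
    for folder, names in outputs.items():
        print(f"{folder}: {len(names)} file(s)")
        for name in names:
            print("  ", name)
```

An empty list for either folder usually means the test did not complete, so check the run log before interpreting any results.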