Test 06 - Matrix Feature Statistics & KLD Analysis
Purpose
Compute per-feature statistics comparing the silent/external run matrix against a reference matrix and produce plots and Kullback–Leibler divergence (KLD) metrics for the top features derived from a Shap/ButWhy report. The test writes per-feature HTML graphs and a features_stats.tsv summary.
This is similar to Silent Run Test (no labels) - 10. Now we loaded the repository and have outcome. The stratification and comparasing can also stratify by the outcome.
Required Inputs
From configs/env.sh and the test invocation environment:
WORK_DIR: Working folder where outputs will be writtenFIRST_WORK_DIR: Path to the reference run (used to obtain the original model and shapley report)MODEL_PATH: Path to the model to be used on the repository (by default taken from${FIRST_WORK_DIR}/model/model.medmdl)REFERENCE_MATRIX: CSV path to the reference matrixCONFIG_DIR: Directory containing config files such asfeat_resolution.tsvCMP_FEATURE_RES: Comma-separated list used to determine top features and resolutions
Script-level required params (from the script header):
REQ_PARAMETERS=['WORK_DIR', 'FIRST_WORK_DIR', 'MODEL_PATH', 'REFERENCE_MATRIX', 'CONFIG_DIR']
How to Run
From your TestKit folder, execute:
Or include as part of the full suite:
Check ${WORK_DIR}/outputs/graphs and ${WORK_DIR}/outputs/features_stats.tsv for results.
What This Test Does
- Validates that a Shap/ButWhy report exists under
${FIRST_WORK_DIR}/ButWhy/shapley.reportand fails if not present. - Optionally reads a resolution file
${CONFIG_DIR}/feat_resolution.tsvto apply trimming and resolution rules for plotting. - Reads the top features from the Shapley report (top 25 positive importance features plus
pred_0). - Loads the repository matrix for
${WORK_DIR}/Samples/3.test_cohort.samplesusingget_matrix()and loads the reference CSV. - For each selected feature:
- Applies trimming/resolution rules if provided.
- Computes mean, std, and missing rates overall, for cases, and for controls.
- Writes per-feature HTML bar plots (via
plot_graph) to${WORK_DIR}/outputs/graphs. - Computes KLD between cases and controls and prints the metric.
- Summarizes per-feature stats in
${WORK_DIR}/outputs/features_stats.tsv.
Output Location
- Feature graphs:
${WORK_DIR}/outputs/graphs(one HTML per feature) - Stats TSV:
${WORK_DIR}/outputs/features_stats.tsv
How to Interpret Results
- Open the generated HTML files to visually inspect the distribution of feature values between cases, controls and reference.
- Use
features_stats.tsvto get a tabular summary of mean, std, and missingness across groups. - KLD values printed to stdout indicate how separable feature distributions are between cases and controls; higher KLD suggests more divergence.
Common failure modes and suggestions
- Missing
shapley.reportunder${FIRST_WORK_DIR}/ButWhy:- Run ButWhy analysis (feature importance) before running this test in Silent Run
- Feature name mismatches between the Shap report and repository matrix columns:
- Confirm that feature naming conventions match (the script searches by substring match and expects exactly one match per selected feature).
- Large reference matrices or memory issues when loading matrices:
- Use a machine with sufficient memory or sample/reduce the dataset before running.
Example output snippets
This is similar to Silent Run Test (no labels) - 10, but with stratification by outcome.
Test Results Review
Primary files to inspect after running this test:
${WORK_DIR}/outputs/graphs/*.html${WORK_DIR}/outputs/features_stats.tsv