Test 15 - Features and Flag
Purpose
Compare distributions of important features between flagged and non-flagged samples in both the test and reference datasets. Produces per-feature statistics, per-gender histograms, and ratio plots highlighting where flagged populations differ.
Required Inputs
WORK_DIR: working directory containing compare matrices and prediction filesCMP_FEATURE_RES: comma-separated list of feature identifiers with resolution (used to select and bucket features)TOP_EXPLAIN_PERCENTAGE: used to compute the flagged cutoff (script computes CUTOFF = 100 - TOP_EXPLAIN_PERCENTAGE)- Files expected:
${WORK_DIR}/compare/rep_propensity.matrix${WORK_DIR}/compare/rep_propensity_non_norm.matrix${WORK_DIR}/compare/3.test_cohort.preds${WORK_DIR}/compare/reference.preds
How to Run
What This Test Does
- Reads normalized and non-normalized propensity matrices and prediction files.
- Parses
CMP_FEATURE_RESto extract features and their binning resolution. Rounds feature values to the requested resolution. - Splits samples into flagged vs not-flagged using the percentile cutoff derived from test predictions (
CUTOFF = 100 - TOP_EXPLAIN_PERCENTAGE). - Computes per-feature stats (means, std, missing%) for flagged and not-flagged groups in both test and reference and writes
features_and_flag.csv. - Produces HTML bar charts per-feature showing percentage distributions for Test_Flagged, Test_Not_Flagged, Ref_Flagged, Ref_Not_Flagged and additional ratio charts (files under
${WORK_DIR}/features_and_flag/*.html).
Output Location
${WORK_DIR}/features_and_flag/features_and_flag.csv${WORK_DIR}/features_and_flag/<feature>.htmland<feature>_ratio.htmlper-feature visualization files
How to Interpret Results
- Use
features_and_flag.csvto compare means/stds and missing rates between flagged and not-flagged groups across test and reference. - Use per-feature HTMLs to visually inspect whether flagged subjects have different distributions and whether those patterns are consistent with the reference dataset.
Troubleshooting
- Feature name matching errors: the script searches columns by substring and assumes a single exact match; mismatches will raise errors—confirm
CMP_FEATURE_RESaligns with column names in the matrices. - Missing matrices or predictions: run Tests 03 and 05 first to ensure
${WORK_DIR}/compare/*artifacts exist. - Plot rendering: HTMLs reference
../js/plotly.js; ensure the asset exists or change the path.
Files to inspect
${WORK_DIR}/features_and_flag/features_and_flag.csv${WORK_DIR}/features_and_flag/*.html