The goal of this test is to get a feel for the data and for what the model is doing.
We want to see reports for real high-risk patients and to analyze the most common reasons for being flagged.
The analysis is run on the top 1000 patients.
Input
WORK_DIR - output work directory
EXPLAINABLE_MODEL - path for model with explainability
REPOSITORY_PATH - repository path
TEST_SAMPLES - test samples
EXPLAIN_JSON - json for bootstrap filtering
EXPLAIN_COHORT - optional filter that limits the explainability analysis to a subset of the samples
Output
$WORK_DIR/ButWhy/explainer_examples
group_stats*.tsv - Summary table of the most common reasons. For example:
In LungFlag, the most important recurring risk factor is Smoking, which appears among the top 3 reasons 99.7% of the time. The leading feature within this group is Smoking.Smoking_Years.
Next comes the COPD diagnosis, which appears among the top 3 reasons 53.8% of the time, then BMI (40.5%), and then WBC (28.2%).
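To make these percentages concrete, here is a minimal sketch of how a "share of patients with this group among their top 3 reasons" figure could be computed. The input layout (a mapping from patient id to a ranked list of group names), the function name `top_k_frequency`, and the toy data are assumptions for illustration only; they are not the tool's actual file format or implementation.

```python
from collections import Counter

def top_k_frequency(ranked_reasons, k=3):
    """Return {group: percent of patients that have the group among their top-k reasons}."""
    n = len(ranked_reasons)
    if n == 0:
        return {}
    counts = Counter()
    for reasons in ranked_reasons.values():
        # set() so that a group is counted at most once per patient
        counts.update(set(reasons[:k]))
    return {group: 100.0 * c / n for group, c in counts.items()}

# Hypothetical toy data: patient id -> groups ordered from most to least important
toy = {
    1: ["Smoking", "COPD", "BMI", "WBC"],
    2: ["Smoking", "WBC", "COPD"],
    3: ["BMI", "Smoking", "Age"],
    4: ["Smoking", "COPD", "Age", "BMI"],
}

freq = top_k_frequency(toy)
for group, pct in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{group}\t{pct:.1f}%")
```

In this toy data Smoking is in the top 3 for all four patients, while WBC reaches the top 3 for only one of them, which mirrors how the summary table should be read.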
test_report.*.tsv - example report of high-risk patients. Each group of several rows describes the same patient, with a different risk factor in each, ordered from most important to least important. Example:
We can see a single patient, 100192, who received a score of 0.445575 at time 20100913 and is indeed a case (outcome is 1). The main reason is Smoking, with a SHAP value of 1.51 (27.38% of the sum of absolute SHAP values). The main feature is Smoking_Years, which is 40.13, and Never_smoker is 0, so the patient is a current or past smoker.
Next is WBC, with a SHAP value of 0.708 (12.81%). The minimum WBC was 12.6, which is quite high, and the last value was 17.5.
We can also see, for example, that "ICD9_Diagnosis.ICD9_CODE:786" is negative, i.e. protective, with a small value of -0.11. The feature ICD9_Diagnosis.category_dep_set_ICD9_CODE:7866.win_0_10950 has a value of "0", so the patient has not had this diagnosis within the feature's lookback window (win_0_10950, which is 30 years if the window is counted in days).
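The percentage shown next to each SHAP value is that reason's share of the patient's sum of absolute SHAP values. A minimal sketch of that arithmetic follows. The function name `abs_shap_share` and the toy numbers are hypothetical and are not taken from the report. Sorting by absolute value is also an assumption about how "most important" is defined.

```python
def abs_shap_share(shap_by_group):
    """Return {group: percent of the sum of absolute SHAP values}."""
    total = sum(abs(v) for v in shap_by_group.values())
    if total == 0:
        # No contribution at all: avoid dividing by zero
        return {group: 0.0 for group in shap_by_group}
    return {group: 100.0 * abs(v) / total for group, v in shap_by_group.items()}

# Hypothetical toy values for one patient (positive raises risk, negative is protective)
toy = {"Smoking": 1.5, "WBC": 0.75, "BMI": 0.5, "ICD9_CODE:786": -0.25}

shares = abs_shap_share(toy)
# Assumed ordering: largest absolute contribution first
ranked = sorted(toy, key=lambda g: -abs(toy[g]))
for group in ranked:
    direction = "risk" if toy[group] > 0 else "protective"
    print(f"{group}\t{toy[group]:+.2f}\t{shares[group]:.2f}%\t{direction}")
```

Read the other way around, a SHAP value of 1.51 that accounts for 27.38% implies that the absolute SHAP values for that patient sum to roughly 5.5.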