
ButWhy experiment results

Experiment models

NWP_Flu

The model has 22 features, most of them binary (the Drugs and Diagnosis category_set features). The non-categorical features are: Age, Smoking, SpO2, Resp_Rate, Flu.nsamples, Complications.nsamples, Membership. In this run a Shapley Gibbs explainer was added (22 features is manageable for Shapley Gibbs; see the appendix for details). Scores range from 1 (worst) to 5 (best). Below is the score histogram over 18 flu examples. An average of the square roots of the 1-5 scores was also added, to give more weight to improvements on low scores than on high ones; it does not change the ranking much here.

Explainer_name 1 2 3 4 5 Mean_Score Mean_of_Sqrt_Score
Tree_with_cov 0 2 3 8 5 3.888889 1.955828857
Tree 0 2 3 10 3 3.777778 1.929599082
SHAP_Gibbs_LightGBM 0 1 7 8 2 3.611111 1.889483621
missing_shap 0 1 9 4 4 3.611111 1.885941263
LIME_GAN 1 4 7 4 2 3.111111 1.736296992
SHAP_GAN 2 4 6 6 0 2.888889 1.669397727
knn 0 7 6 5 0 2.888889 1.682877766
knn_with_th 8 2 5 2 1 2.222222 1.42915273
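The two aggregate columns can be reproduced directly from a histogram row. A minimal sketch, using the Tree_with_cov row (`mean_scores` is an illustrative helper, not part of the pipeline):

```python
import math

def mean_scores(hist):
    """hist[i] = number of samples rated (i + 1), for ratings 1..5."""
    n = sum(hist)
    mean = sum(c * (i + 1) for i, c in enumerate(hist)) / n
    mean_sqrt = sum(c * math.sqrt(i + 1) for i, c in enumerate(hist)) / n
    return mean, mean_sqrt

# Tree_with_cov flu row: counts 0, 2, 3, 8, 5 over 18 samples
m, ms = mean_scores([0, 2, 3, 8, 5])
# m ≈ 3.888889, ms ≈ 1.955829 -- matching the table
```

Taking the square root before averaging compresses the top of the scale, so moving a sample from score 1 to 2 raises the aggregate more than moving one from 4 to 5.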

Summary - in the simple 22-feature case, Tree_with_cov performs best, followed by the regular Tree. SHAP_Gibbs_LightGBM and missing_shap are not far behind and perform similarly. Reference to experiment results:

  • compare_blinded.tsv - the blinded experiment; for each sample the explainer outputs are randomly shuffled. Also in xlsx format: compare_blinded.xlsx
  • map.ids.tsv - the shuffle order of the explainers for each sample
  • summary.tsv - per-sample results with explainers aligned (not blinded), produced by joining map.ids.tsv with compare_blinded.tsv
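De-blinding is a plain join of the shuffle map with the blinded scores. A sketch with in-memory data and hypothetical column names (the real TSV headers may differ):

```python
import csv, io

# Hypothetical headers and toy rows standing in for the real TSV files.
blinded_tsv = "sample_id\tslot\tscore\n1\t0\t4\n1\t1\t2\n"
map_tsv = "sample_id\tslot\texplainer_name\n1\t0\tTree\n1\t1\tknn\n"

# (sample, shuffled slot) -> explainer name
slot_to_name = {(r["sample_id"], r["slot"]): r["explainer_name"]
                for r in csv.DictReader(io.StringIO(map_tsv), delimiter="\t")}

# Attach the explainer name to every blinded score row.
summary = [{**r, "explainer_name": slot_to_name[(r["sample_id"], r["slot"])]}
           for r in csv.DictReader(io.StringIO(blinded_tsv), delimiter="\t")]
# summary[0] -> {'sample_id': '1', 'slot': '0', 'score': '4', 'explainer_name': 'Tree'}
```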

CRC

Explainer_name 1 2 3 4 5 No_Score Mean_Score Mean_of_Sqrt_Score
Tree_with_cov 0 1 13 21 3 1 3.684211 1.911555
Tree 0 3 16 17 2 1 3.473684 1.853358
LIME_GAN 0 15 11 12 0 1 2.921053 1.691204
SHAP_GAN 0 14 16 6 2 1 2.894737 1.683788
missing_shap 4 18 7 8 1 1 2.578947 1.574112
knn 6 17 9 6 0 1 2.394737 1.516581
knn_with_th 6 18 10 4 0 1 2.315789 1.494115

Reference to experiment results:

Pre2D

Explainer_name 1 2 3 4 5 No_Score Mean_Score Mean_of_Sqrt_Score
SHAP_GAN 0 4 43 86 5 2 3.666667 1.908082456
LIME_GAN 0 3 49 78 7 3 3.649635 1.903398585
Tree 0 5 48 82 3 2 3.601449 1.890708047
Tree_with_cov 0 10 63 63 2 2 3.413043 1.838648351
missing_shap 1 26 76 34 0 3 3.043796 1.732886234
knn 3 43 61 29 2 2 2.884058 1.680713177
knn_with_th 22 36 56 22 2 2 2.608696 1.582454126

Reference to experiment results:

  • compare_blinded.tsv - the blinded experiment; for each sample the explainer outputs are randomly shuffled. Also in xlsx format: 
  • map.ids.tsv - the shuffle order of the explainers for each sample
  • summary.tsv - per-sample results with explainers aligned (not blinded), produced by joining map.ids.tsv with compare_blinded.tsv

Conclusions

Summary table of all experiments (Mean_Score and Mean_of_Sqrt_Score per experiment; the diabetes columns come from the Pre2D run, and the rightmost pair averages over the experiments each explainer ran in):

Method Flu_Mean Flu_SqrtMean CRC_Mean CRC_SqrtMean Diabetes_Mean Diabetes_SqrtMean Overall_Mean Overall_SqrtMean
Tree_with_cov 3.888889 1.955828857 3.684211 1.912 3.413043 1.8386484 3.662048 1.902011
Tree 3.777778 1.929599082 3.473684 1.853 3.601449 1.890708 3.617637 1.891222
SHAP_Gibbs_LightGBM 3.611111 1.889483621         3.611111 1.889484
LIME_GAN 3.111111 1.736296992 2.921053 1.691 3.649635 1.9033986 3.227266 1.776967
SHAP_GAN 2.888889 1.669397727 2.894737 1.684 3.666667 1.9080825 3.150098 1.753756
missing_shap 3.611111 1.885941263 2.578947 1.574 3.043796 1.7328862 3.077951 1.73098
knn 2.888889 1.682877766 2.394737 1.517 2.884058 1.6807132 2.722561 1.626724
knn_with_th 2.222222 1.42915273 2.315789 1.494 2.608696 1.5824541 2.382236 1.501907
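The overall columns (rightmost pair) appear to be the plain average of the per-experiment values, over the experiments an explainer actually ran in. For example, for Tree_with_cov:

```python
# Per-experiment mean scores for Tree_with_cov, copied from the table.
flu, crc, pre2d = 3.888889, 3.684211, 3.413043
overall = (flu + crc + pre2d) / 3
# overall ≈ 3.662048 -- matching the table
```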
  • The tree algorithm works best in general when the predictor is tree-based; the covariance fix improves it slightly.
  • LIME and SHAP perform similarly. LIME is slightly better and faster, so it is preferable to SHAP. Both are model-agnostic, but harder to train. Gibbs might improve the results (at the cost of being much slower) and could be useful when there are not too many features or feature groups.
  • missing_shap - a very simple and fast model (also model-agnostic). It performs well on some problems, but has training parameters that are important to tune. In a previous Pre2D experiment it was much better (this run used different training parameters that made it worse). After rerunning with better parameters it is even better than the Shapley/LIME methods. Two bugs were found when using groups: a parameter bug that disabled grouping (grouping cannot be combined with group_by_sum=1; missing_shap's own grouping mechanism should be used instead), and training with wrong weights. The experiment needs to be rerun.
  • KNN - should be used without a threshold. So far it has not proven itself enough to be used.
  • If we use tree predictors without groups, the Shapley values should do the job without the covariance fix. It is a unique solution that preserves fairness.

Appendix - Gibbs in Flu NWP

The Gibbs sampler shows a separation of 0.6 between generated and real samples when using a random mask with probability 0.5 per feature, and a separation of 0.719 when generating all features. This is high-quality matrix generation; for comparison, the GAN shows a separation of 0.99 when generating all features and 0.74 with random masks. Test Gibbs script:

$MR_ROOT/Projects/Shared/But_Why/Linux/Release/TestGibbs --rep /home/Repositories/KPNW/kpnw_jun19/kpnw.repository --train_samples /server/Work/Users/Alon/But_Why/outputs/explainers_samples/flu_nwp/train.samples --test_samples /server/Work/Users/Alon/But_Why/outputs/explainers_samples/flu_nwp/validation_full.samples --model_path /server/Work/Users/Alon/But_Why/outputs/explainers/flu_nwp/base_model.bin --run_feat_processors 1 --save_gibbs /server/Work/Users/Alon/But_Why/outputs/explainers/flu_nwp/gibbs_tests/test_gibbs.bin --save_graphs_dir /server/Work/Users/Alon/But_Why/outputs/explainers/flu_nwp/gibbs_tests/gibbs_graphs --gibbs_params "kmeans=0;select_with_repeats=0;max_iters=0;predictor_type=lightgbm;predictor_args={objective=multiclass;metric=multi_logloss;verbose=0;num_threads=0;num_trees=80;learning_rate=0.05;lambda_l2=0;metric_freq=50;is_training_metric=false;max_bin=255;min_data_in_leaf=30;feature_fraction=0.8;bagging_fraction=0.25;bagging_freq=4;is_unbalance=true;num_leaves=80};num_class_setup=num_class;calibration_string={calibration_type=isotonic_regression;verbose=0};calibration_save_ratio=0.2;bin_settings={split_method=iterative_merge;min_bin_count=200;binCnt=150};selection_ratio=1.0" --predictor_type xgb --predictor_args "tree_method=auto;booster=gbtree;objective=binary:logistic;eta=0.1;alpha=0;lambda=0.1;gamma=0.1;max_depth=4;colsample_bytree=1;colsample_bylevel=0.8;min_child_weight=10;num_round=100;subsample=0.7" --gibbs_random_range 1 --gibbs_sampling_params "burn_in_count=500;jump_between_samples=20;samples_count=50000;find_real_value_bin=1"   --test_random_masks 0 
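The separation number is presumably a discriminator score: train a classifier to distinguish real rows from generated rows and measure how well it separates them (0.5 = indistinguishable, 1.0 = trivially separable). A minimal rank-based ROC-AUC sketch, assuming per-row discriminator scores are already available:

```python
def auc(real_scores, gen_scores):
    """Probability that a generated row scores higher than a real one
    (rank-based ROC AUC; ties count as half a win)."""
    wins = sum((g > r) + 0.5 * (g == r)
               for g in gen_scores for r in real_scores)
    return wins / (len(gen_scores) * len(real_scores))

# Identical score distributions -> AUC 0.5: real and generated indistinguishable.
print(auc([0.4, 0.6], [0.4, 0.6]))  # 0.5
```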
  Some feature graph examples, Gibbs-generated vs. real. Binary features: Gender - male rate 47.88% in real vs. 47.68% in generated; Diagnosis.Asthma rate 17.17% in real vs. 17.08%; Admission hospital_observation rate 2.98% in real vs. 3.15%. For Age, there appears to be a binning issue in Gibbs for odd age values. For Smoking, there are many unique values and Gibbs returns only 1 out of 150 bins (the feature should be binned before running the separation test; not done yet) - this may account for some of the power to separate real from generated. All the other features look very good. In the graphs, real data is in blue and Gibbs-generated data in orange.
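One way to address the Smoking issue noted above is to pre-bin high-cardinality features into quantile bins before the separation test, so real and generated values are compared on the same coarse grid. A hypothetical sketch (`quantile_bins` is illustrative, not part of the pipeline):

```python
import statistics

def quantile_bins(values, n_bins):
    """Map each value to its quantile bin index (0 .. n_bins - 1)."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [sum(v > c for c in cuts) for v in values]

# Values 0..99 fall into 4 roughly equal-sized bins 0..3.
bins = quantile_bins(list(range(100)), 4)
```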