Experiments - Stage C (Freeze Version 1)
After some discussions, we came up with ideas to improve the covariance fix so that it handles similar and dependent features better. As you probably remember, Shapley values split the contribution equally between identical features, so if an important feature has many similar features, the ButWhy report may be wrong. The covariance fix is used together with the "iterative" method. The current method calculates the covariance matrix of the features and multiplies it by the contributions.

The problem arises when you have groups. The following definitions are common to all options. Let F be the feature covariance matrix, of size NxN (N is the number of features), with entries F(i,j). Build a "covariance matrix" for the groups and call it C; with G groups, its size is GxG. C(i,i)=1 and C(i,j)=C(j,i), so the matrix is symmetric.

When using groups, these are the options for C(i,j) (the first option is what we had in the last experiment):
- C(i,j) := max{ F(k,l) | k is feature in group i, l is feature in group j }
- Let T be the vector of feature contributions to the prediction, of size N (the contributions of the features without iterations). Then C(i,j) := Sigma( k is feature in group i, l is feature in group j ) { F(k,l)·T(k)·T(l) } / Sigma( k is feature in group i, l is feature in group j ) { T(k)·T(l) }, that is, the same sum with F(k,l) replaced by 1 in the denominator.

The advantage of equation #2 is that it may work better, since it does not take the "max" and it uses the specific feature contributions.

We also added an option to calculate the "covariance" matrix of the features differently, using mutual information between features instead of correlation. The idea is to better catch non-linear behaviors, such as BMI, and other more complicated feature dependencies that a linear model can miss. A normalization factor keeps the values between 0 and 1: 0 means the features are independent, and 1 means the features determine each other (for example, duplicate features).

The mutual information is the KLD between the joint distribution of the two features and the joint probability calculated under the assumption that they are independent (it measures the information gained by moving from the independence assumption to what you observe in the data). Normalization is done by dividing by the entropy of the joint distribution. The normalization brings all values into the range 0-1, as we want, and duplicate features get 1.

The following file has 4 methods to compare (covariance or mutual information, and equation 1 or equation 2): W:\Users\Alon\But_Why\outputs\Stage_B\explainers\crc\reports\compare_new_cov_fix\compare_all.xlsx. I ran it on CRC, which is the most challenging problem since it has many features, many of them similar.
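
The two group-matrix options and the normalized mutual information described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the ExplainProcessings implementation: the function names are hypothetical, features are assumed to be discrete (or already binned) for the mutual information, and the fallback to 0 when a denominator is zero is a choice made for this sketch.

```python
import numpy as np


def group_matrix_max(F, groups):
    """Option 1: C(i,j) = max F(k,l) over k in group i, l in group j."""
    G = len(groups)
    C = np.eye(G)  # C(i,i) = 1
    for i in range(G):
        for j in range(i + 1, G):
            block = F[np.ix_(groups[i], groups[j])]
            C[i, j] = C[j, i] = block.max()
    return C


def group_matrix_weighted(F, T, groups):
    """Option 2: average of F(k,l) weighted by T(k)*T(l)."""
    G = len(groups)
    C = np.eye(G)  # C(i,i) = 1
    for i in range(G):
        for j in range(i + 1, G):
            block = F[np.ix_(groups[i], groups[j])]
            weights = np.outer(T[groups[i]], T[groups[j]])
            denom = weights.sum()
            # Assumption of this sketch: use 0 when the weights cancel out.
            value = (block * weights).sum() / denom if denom != 0 else 0.0
            C[i, j] = C[j, i] = value
    return C


def normalized_mutual_information(x, y):
    """I(X;Y) / H(X,Y) for two discrete (or pre-binned) feature vectors."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)  # contingency table of value pairs
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    # KLD between the observed joint and the product of the marginals.
    mi = (joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum()
    h_joint = -(joint[nz] * np.log(joint[nz])).sum()
    # Assumption of this sketch: two constant features are scored as 0.
    return mi / h_joint if h_joint > 0 else 0.0


# Toy inputs: 3 features, group 0 = {0, 1}, group 1 = {2}.
F = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
T = np.array([0.5, 0.1, 0.3])
groups = [[0, 1], [2]]

C_max = group_matrix_max(F, groups)
C_weighted = group_matrix_weighted(F, T, groups)
print(C_max)
print(C_weighted)
```

On these toy inputs, option 1 gives C(0,1) = 0.5 (the larger of the two cross-group entries), while option 2 gives about 0.25, because the feature with the larger contribution (feature 0) has the weaker link to feature 2. The normalized mutual information of a feature with its own copy is 1, matching the statement that duplicate features get 1.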
Results
Alon's results (compare_all.Alon.xlsx):
Method/score | 1 | 2 | 3 | 4 | 5 | Average | Average_0.5 |
---|---|---|---|---|---|---|---|
Tree_iterative_covariance(New equation) | 0 | 0 | 0 | 2 | 20 | 4.909091 | 2.214607252 |
Tree_iterative_mutual_information(New equation) | 0 | 0 | 0 | 3 | 19 | 4.863636 | 2.20387689 |
Tree_iterative_mutual_information(MAX) | 0 | 0 | 0 | 10 | 12 | 4.545455 | 2.128764351 |
Tree_iterative_covariance(MAX) | 0 | 0 | 7 | 9 | 6 | 3.954545 | 1.979125614 |
Coby's results (compare_all - Coby.xlsx):
Method/score | 1 | 2 | 3 | 4 | 5 | Average | Average_0.5 |
---|---|---|---|---|---|---|---|
Tree_iterative_covariance(New equation) | 0 | 0 | 3 | 11 | 7 | 4.19047619 | 2.04041087 |
Tree_iterative_mutual_information(New equation) | 0 | 0 | 8 | 10 | 3 | 3.761904762 | 1.931648114 |
Tree_iterative_mutual_information(MAX) | 0 | 1 | 5 | 5 | 3 | 3.714285714 | 1.913047967 |
Tree_iterative_covariance(MAX) | 0 | 6 | 10 | 4 | 0 | 2.9 | 1.690289472 |
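
The Average_0.5 column is not defined in the text. The values in both tables are consistent with it being the mean of the square root of each grade (grade raised to the power 0.5), computed from the per-grade counts in columns 1 to 5. This reading is inferred from the numbers and is not confirmed by the source. A minimal sketch, with a hypothetical helper name:

```python
import math


def summarize_grades(counts):
    """counts[s - 1] is the number of examples that received grade s (1..5)."""
    n = sum(counts)
    average = sum(s * c for s, c in enumerate(counts, start=1)) / n
    # Assumed definition of Average_0.5: mean of grade ** 0.5.
    average_half = sum(math.sqrt(s) * c for s, c in enumerate(counts, start=1)) / n
    return average, average_half


# Counts taken from the first row of Alon's table.
avg, avg_half = summarize_grades([0, 0, 0, 2, 20])
print(avg, avg_half)
```

With the counts from the first row of Alon's table, this reproduces the reported 4.909091 and 2.214607252 up to rounding.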
Conclusions
- The new equation seems to improve the results. Use it instead of max; it is the default in ExplainProcessings (use_max_cov=0).
- The mutual information might improve the results. I can't see it when using the new equation, since the average score is saturated (almost all received 5 out of 5). Coby and I both noticed a large improvement for mutual information compared to covariance when using MAX instead of the new equation. Yet we both found that the best results for CRC come from the new equation with covariance. Waiting for Avi Shoshan and Yaron to review the file and grade the results themselves if they want (maybe they will see different things). Currently my recommendation is to use the new equation with covariance (it also learns faster).

To create a nice ButWhy report, you might use:
- adjust model app to add a post_processor with an explainer to the model. Later you can change some parameters if needed using change_model (without relearning)
- CreateExplainnReport app to generate a nice report