Feature Importance is provided through a virtual function of MedPredictor called calc_feature_importance.
If you want to support Feature Importance for your predictor, you need to implement the virtual function calc_feature_importance.
Some notes:
This function must be called only after the learn method! Otherwise an exception will be raised.
The function returns the feature importance scores in a vector, in the same order as the learned features matrix (sorted alphabetically, because the features are stored in a map object).
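For a predictor class of your own, the override could look roughly like the sketch below. This is a minimal sketch only: the signature is inferred from the XGB example further down this page, and MyPredictor, m_learned, and m_importance are hypothetical names, not library members.

#include <stdexcept>
#include <string>
#include <vector>
// plus the library header that declares MedPredictor
using namespace std;

class MyPredictor : public MedPredictor {
public:
    //signature assumed from the example call below; the real declaration may differ
    void calc_feature_importance(vector<float> &scores, const string &params) {
        //params (e.g. "importance_type=gain") is ignored in this sketch
        if (!m_learned) //must be called after learn(), per the note above
            throw runtime_error("calc_feature_importance called before learn");
        //return one score per feature, in the same (alphabetical) order as the learned features matrix
        scores = m_importance;
    }
private:
    bool m_learned = false;      //hypothetical flag: set by learn()
    vector<float> m_importance;  //hypothetical member: per-feature scores filled during learn()
};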
Currently the method is implemented by the following predictors (a short call sketch follows the list):
QRF - no additional parameters are required to run
LightGBM - has a parameter "importance_type" which has 2 options:
"gain" - the average gain of the feature when it is used in trees (each tree-node split has a loss gain; this averages it across splits)
"frequency" - the number of times a feature is used to split the data across all trees
XGB - has a parameter "importance_type" which has 4 options:
"gain" - the average gain of the feature when it is used in trees (each tree-node split has a loss gain; this averages it across splits)
"gain_total" - the total gain of the feature when it is used in trees (not normalized by the number of appearances)
"weight" - the number of times a feature is used to split the data across all trees
"cover" - the average coverage of the feature when it is used in trees: it sums the number of samples in the leaves whose path includes a split on this feature (so this feature was used to decide on those observations), then averages that count
Example run for XGB:

#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>
// plus the library headers that declare MedPredictor and MedFeatures
using namespace std;

int main() {
    vector<float> features_scores;
    MedPredictor *predictor = MedPredictor::make_predictor("xgb"); //or load a trained model
    MedFeatures data;
    //data.read_from_file(matrix_file); //load the dataMatrix
    //predictor->learn(data); //learn, or load an already trained model - otherwise calc_feature_importance will throw an exception
    predictor->calc_feature_importance(features_scores, "importance_type=gain"); //the second argument holds additional parameters for feature importance
    //sort and print features
    map<string, vector<float>>::iterator it = data.data.begin();
    vector<pair<string, float>> ranked(features_scores.size());
    for (size_t i = 0; i < features_scores.size(); ++i) {
        ranked[i] = pair<string, float>(it->first, features_scores[i]);
        ++it;
    }
    sort(ranked.begin(), ranked.end(),
         [](const pair<string, float> &c1, const pair<string, float> &c2) { return c1.second > c2.second; });
    for (size_t i = 0; i < ranked.size(); ++i)
        printf("FEATURE %s : %2.3f\n", ranked[i].first.c_str(), ranked[i].second);
    return 0;
}

//QRF - has no additional options - pass an empty string
//XGB - has parameter "importance_type" with these options: "gain,gain_total,weight,cover"
//LightGBM - has parameter "importance_type" with these options: "frequency,gain". gain is like in xgboost and frequency is like weight in xgboost
Using FeatureImportance as Selector:
cat /server/Work/Users/Alon/UnitTesting/examples/"general config files"/importance_example.json
"process":{"process_set":"3","fp_type":"importance_selector","duplicate":"no","numToSelect":"10","predictor":"qrf","predictor_params":"{type=categorical_entropy;ntrees=100;min_node=20;n_categ=2;get_only_this_categ=1;learn_nthreads=40;predict_nthreads=40;ntry=100;maxq=5000;spread=0.1}","importance_params":"{}"//you may pass other parameters for feature importance in here. for example importance_type=gain}