FeatureProcessor practical guide
| Constant | Description |
| --- | --- |
| FTR_PROCESS_NORMALIZER | "normalizer" to create FeatureNormalizer |
| FTR_PROCESS_IMPUTER | "imputer" to create FeatureImputer |
| FTR_PROCESS_DO_CALC | "do_calc" to create DoCalcFeatProcessor |
| FTR_PROCESS_UNIVARIATE_SELECTOR | "univariate_selector" to create UnivariateFeatureSelector |
| FTR_PROCESSOR_MRMR_SELECTOR | "mrmr" or "mrmr_selector" to create MRMRFeatureSelector |
| FTR_PROCESSOR_LASSO_SELECTOR | "lasso" to create LassoSelector |
| FTR_PROCESSOR_TAGS_SELECTOR | "tags_selector" to create TagFeatureSelector |
| FTR_PROCESSOR_IMPORTANCE_SELECTOR | "importance_selector" to create ImportanceFeatureSelector |
| FTR_PROCESSOR_ITERATIVE_SELECTOR | "iterative_selector" to create IterativeFeatureSelector - applies bottom-up or top-down iteration for feature selection |
| FTR_PROCESS_REMOVE_DGNRT_FTRS | "remove_deg" to create DgnrtFeatureRemvoer |
| FTR_PROCESS_ITERATIVE_IMPUTER | "iterative_imputer" to create IterativeImputer |
| FTR_PROCESS_ENCODER_PCA | "pca" to create FeaturePCA |
| FTR_PROCESS_ONE_HOT | "one_hot" to create OneHotFeatProcessor - makes one-hot features from a given feature |
| FTR_PROCESS_GET_PROB | "get_prob" to create GetProbFeatProcessor - replaces a categorical feature with the probability of the outcome in the training set (see the sketch after the table) |
| FTR_PROCESS_PREDICTOR_IMPUTER | "predcitor_imputer" to create PredictorImputer |
| FTR_PROCESS_MULTIPLIER | "multiplier" to create MultiplierProcessor - multiplies a feature by another feature |
| FTR_PROCESS_RESAMPLE_WITH_MISSING | "resample_with_missing" to create ResampleMissingProcessor - adds missing values to the learn matrix |
| FTR_PROCESS_DUPLICATE | "duplicate" to create DuplicateProcessor - duplicates samples in order to do multiple imputations |
| FTR_PROCESS_MISSING_INDICATOR | "missing_indicator" to create MissingIndicatorProcessor - creates a feature that indicates whether a feature is missing |
| FTR_PROCESS_BINNING | "binning" to create BinningFeatProcessor - binning with one-hot on the bins |
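For intuition, here is a minimal sketch of the "get_prob" idea; this is not the GetProbFeatProcessor API, and the column names are invented:

```python
import pandas as pd

# Sketch of "get_prob": replace a categorical feature with the
# training-set probability of the outcome within each category.
# (Illustration only; column names are hypothetical.)
train = pd.DataFrame({
    "smoking_status": ["never", "current", "former", "current", "never"],
    "outcome": [0, 1, 0, 1, 0],
})

# Probability of the outcome per category, learned on the training set.
prob_by_category = train.groupby("smoking_status")["outcome"].mean()

# At apply time, map each categorical value to its learned probability.
train["smoking_status_prob"] = train["smoking_status"].map(prob_by_category)
print(train)
```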
- "remove_deg" - removes features that most of the time are same value (or missing)
- importance_selector - selection of features based on most important features in a model that is trained on the data.
- iterative_selector - please use the tool to do it, it takes forever!! Iterative Feature Selector
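A minimal sketch of the "remove_deg" idea, assuming a pandas DataFrame; the threshold name and default are assumptions, not the DgnrtFeatureRemvoer API:

```python
import pandas as pd

def remove_degenerate_features(df: pd.DataFrame,
                               max_dominant_ratio: float = 0.99) -> pd.DataFrame:
    """Drop features where a single value (or missing) dominates the column.

    Illustration of the "remove_deg" idea only; max_dominant_ratio is a
    hypothetical knob, not a documented parameter.
    """
    keep = []
    for col in df.columns:
        # Share of the most frequent value, counting NaN as a value.
        top_ratio = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if top_ratio <= max_dominant_ratio:
            keep.append(col)
    return df[keep]
```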
- "resample_with_missing" - used in training to "generate" more samples with missing values, increasing the data size (it does not do imputation; that is another feature processor's job), similar to data augmentation in imaging. There is no need for "duplicate" here: it scans the features on its own and operates on "selected_tags". Arguments: add_new_data - how many new data points to add; grouping - generates masks of missing values in groups rather than feature by feature. See the sketch below.
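A minimal sketch of the "resample_with_missing" idea; missing_rate is an assumed knob, and feature grouping is omitted for brevity:

```python
import numpy as np

def resample_with_missing(X: np.ndarray, add_new_data: int,
                          missing_rate: float = 0.2,
                          rng: np.random.Generator | None = None) -> np.ndarray:
    """Copy existing rows and mask random entries as missing to augment
    the learn matrix. Sketch only; not the ResampleMissingProcessor API."""
    rng = rng or np.random.default_rng(0)
    rows = rng.integers(0, len(X), size=add_new_data)   # resample existing rows
    new = X[rows].astype(float)
    mask = rng.random(new.shape) < missing_rate         # per-entry missing mask
    new[mask] = np.nan                                  # no imputation here
    return np.vstack([X, new])
```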
- "binning" - bins the feature values; the cutoffs can be specified directly or derived with some binning_method (equal width, minimal observations in each bin, etc.). See the sketch below.
- "predcitor_imputer" - a much more complicated/smart imputer based on a model: Gibbs sampling, masked GAN, univariate sampling from feature distributions, etc.
The method can be one of the following (instead of GIBBS or GAN, a simpler option can be selected):
- RANDOM_DIST - a random value from a normal distribution around 0.5 (not related to the feature's distribution)
- UNIVARIATE_DIST - stratify by some features and store the distribution within each stratum; at apply time, find the stratum and randomly select a value from its distribution
- MISSING - put a missing value
- GAN - generator_args is the path to a trained model; refer to TrainingMaskedGAN to train one
- GIBBS - takes the arguments below; a sketch of the sampling loop follows the argument list:
  - sampling_args:
    - burn_in_count - how many rounds to run at the start and ignore, until the chain stabilizes on a reasonable vector
    - jump_between_samples - how many rounds to run before generating a new sample; as the rounds continue, after several more loops we end up with a different sample
    - find_real_value_bin - if true, generated values are rounded to values that actually occur in the feature. When testing whether a model can discriminate between real and generated data, the resolution of the feature values matters: a tree can detect the difference between 3 and 3.000001, so with this on, 3.000001 becomes 3. There is no good reason to turn it off.
    - samples_count - how many samples to extract
  - generator_args:
    - calibration_save_ratio - what fraction of the data to keep for calibration to probabilities; 0.2 (20%) is a good number
    - bin_settings - how to split the feature values into bins. The prediction problem becomes multi-category over those binned values, e.g. last hemoglobin: 13.1-13.3, 13.4-13.6, etc. The target is the probability of the hemoglobin falling in each value range; once we have that distribution, we can sample a bin value for the feature from it.
    - calibration_string - how to calibrate; keep it as isotonic_regression, it works well
    - predictor_type - the predictor type for the multi-class prediction
    - predictor_args - arguments for the predictor. Pay attention: this is multi-category prediction! For example, the objective for LightGBM is "objective=multiclass"
    - num_class_setup - since this is multi-class, some predictors require setting how many classes there are; this argument controls the name of that parameter (in LightGBM, for example, it is called "num_class")
    - selection_count - down-samples the data used to train these models, to speed things up
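A minimal sketch of the Gibbs loop these sampling_args control, assuming a hypothetical sample_conditional callback (e.g. backed by the binned multi-class predictor described under generator_args):

```python
import numpy as np

def gibbs_impute(x: np.ndarray, sample_conditional,
                 burn_in_count: int = 50, jump_between_samples: int = 10,
                 samples_count: int = 5,
                 rng: np.random.Generator | None = None) -> list[np.ndarray]:
    """Resample each missing entry from its conditional given the rest of
    the vector; discard burn-in rounds, then keep one sample every
    `jump_between_samples` rounds. Sketch only: `sample_conditional(vec, j, rng)`
    is a hypothetical callback returning a draw for feature j given the others.
    """
    rng = rng or np.random.default_rng(0)
    vec = x.copy()
    missing = np.where(np.isnan(vec))[0]
    vec[missing] = 0.5                       # arbitrary starting values
    samples: list[np.ndarray] = []
    total_rounds = burn_in_count + samples_count * jump_between_samples
    for rnd in range(total_rounds):
        for j in missing:                    # one Gibbs sweep over missing entries
            vec[j] = sample_conditional(vec, j, rng)
        past_burn_in = rnd >= burn_in_count
        if past_burn_in and (rnd - burn_in_count + 1) % jump_between_samples == 0:
            samples.append(vec.copy())       # thinned sample
    return samples
```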