Guide for Common Actions
Overview
This guide summarizes frequent tasks and workflows for working with MedSamples, models, and repositories. Each section provides a brief description and links to detailed instructions or related tools.
1. Match MedSamples by Year or Other Criteria
Subsample your data by matching medical samples based on year or other criteria. This helps remove temporal bias, ensuring your model does not learn from the sample collection time, but instead relies on independent features.
This approach is also useful for evaluating model performance when removing the information gain of a specific signal. For example, matching by age allows you to test model performance when age cannot be directly exploited as a predictor. The model still sees age, but conditioning on its value equalizes the probability of being a case, so age cannot be used for performance gain.
See: Using Flow To Prepare Samples and Get Incidences
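As an illustration only (not the Flow matching implementation), the sketch below subsamples controls so that their distribution over a matching column such as year mirrors that of the cases; the column names outcome and year are assumptions.

```python
# Minimal sketch of matching (not the Flow implementation): subsample controls so
# that their distribution over a matching column (here "year") mirrors the cases'.
# The column names "outcome" and "year" are illustrative assumptions.
import pandas as pd

def match_by_column(samples: pd.DataFrame, col: str = "year",
                    label_col: str = "outcome", seed: int = 0) -> pd.DataFrame:
    cases = samples[samples[label_col] == 1]
    controls = samples[samples[label_col] == 0]
    matched = [cases]
    for value, n_cases in cases[col].value_counts().items():
        pool = controls[controls[col] == value]
        # Keep at most as many controls per stratum as there are cases in it.
        matched.append(pool.sample(min(len(pool), n_cases), random_state=seed))
    return pd.concat(matched).sort_index()

# matched_samples = match_by_column(samples_df, col="year")
```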
2. Train a Model from JSON
See: Flow
3. Calculate Model Score on Samples
See: Flow
4. Create Feature Matrix for Samples
See: Flow
5. Adjust Model
Add or retrain rep_processor or post_processor components for calibration, explainability, or to modify an existing model.
See: adjust_model
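For intuition, here is a generic sketch of what a calibration post-processing step does, written with plain scikit-learn rather than the MedModel post_processor API: learn a monotone mapping from raw scores to calibrated probabilities on held-out data.

```python
# Generic illustration of score calibration as a post-processing step (plain
# scikit-learn, not the MedModel post_processor API): learn a monotone mapping
# from raw scores to calibrated probabilities on a held-out set.
from sklearn.isotonic import IsotonicRegression

def fit_score_calibrator(raw_scores, labels):
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(raw_scores, labels)
    return calibrator

# calibrator = fit_score_calibrator(validation_scores, validation_labels)
# calibrated_test_scores = calibrator.predict(raw_test_scores)
```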
6. Change Model
Remove or modify model components (e.g., enable debug logs, limit memory usage by setting smaller batch sizes). See: change_model
7. Simplify Model / Remove Signals
Iteratively add or remove signals to simplify the model. See: Iterative Feature Selector
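As a rough illustration of the idea (not the Iterative Feature Selector tool itself), the sketch below performs greedy backward elimination: it drops a signal whenever removing it keeps validation AUC within a small tolerance of the current best. The train_fn callback is an assumed helper that fits a model on the given columns.

```python
# Greedy backward elimination over signals (a sketch, not the Iterative Feature
# Selector tool): drop a signal whenever removing it keeps validation AUC within
# a tolerance of the current best. train_fn(X, y) is an assumed helper that
# returns a fitted model exposing predict_proba.
from sklearn.metrics import roc_auc_score

def backward_select(train_fn, X_train, y_train, X_val, y_val, signals, tol=0.002):
    def auc_for(cols):
        model = train_fn(X_train[cols], y_train)
        return roc_auc_score(y_val, model.predict_proba(X_val[cols])[:, 1])

    kept = list(signals)
    best_auc = auc_for(kept)
    dropped = True
    while dropped and len(kept) > 1:
        dropped = False
        for sig in list(kept):
            trial = [s for s in kept if s != sig]
            auc = auc_for(trial)
            if auc >= best_auc - tol:  # removing this signal is (almost) free
                kept, best_auc, dropped = trial, auc, True
                break
    return kept
```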
8. Analyze Feature Importance & Model Behavior
Analyze global feature importance, feature interactions, and the effect of each important feature or signal on model output. See: Feature Importance
Automated tests for feature importance are available: Feature Importance Test
You can also use model_signal_importance. This tool keeps the model fixed (no retraining or signal changes), but evaluates the effect of providing or removing specific signals from the input. This is useful for frozen models to assess the impact of signal availability (e.g., if a client can or cannot provide certain inputs).
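For intuition, here is a generic signal-ablation sketch in the spirit of model_signal_importance, using plain pandas/scikit-learn rather than the tool itself: blank out the feature columns derived from one signal at prediction time and measure the AUC drop. The signal_to_columns mapping and the fill value are assumptions; the model must tolerate the fill value.

```python
# Signal-availability sketch with a frozen model (in the spirit of
# model_signal_importance, but plain pandas/scikit-learn): blank out the feature
# columns derived from one signal at prediction time and measure the AUC drop.
# signal_to_columns and fill_value are assumptions; the model must tolerate the
# fill value (e.g., NaN for gradient-boosting models that support missing data).
import numpy as np
from sklearn.metrics import roc_auc_score

def signal_ablation_report(model, X, y, signal_to_columns, fill_value=np.nan):
    base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    deltas = {}
    for signal, cols in signal_to_columns.items():
        X_masked = X.copy()
        X_masked[cols] = fill_value            # simulate the signal being unavailable
        auc = roc_auc_score(y, model.predict_proba(X_masked)[:, 1])
        deltas[signal] = base_auc - auc        # positive delta = signal helps
    return base_auc, deltas
```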
9. Bootstrap Performance Analysis
See: bootstrap_app
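As a minimal illustration of bootstrap performance analysis (not the bootstrap_app tool), the sketch below resamples the test set with replacement and reports a percentile confidence interval for AUC.

```python
# Minimal bootstrap sketch (not the bootstrap_app tool): resample the test set
# with replacement and report a percentile confidence interval for AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:    # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 50, 97.5])   # lower, median, upper
```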
10. Compare or Estimate Model Performance on a Different Repository / Compare Samples
TestModelExternal is a tool designed to compare differences between repositories or sample sets when applying a model. It builds a propensity model to distinguish between repositories or samples, revealing differences and enabling straightforward comparison of feature matrices. The main goal is to identify complex patterns when comparing data. See: TestModelExternal
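For intuition, the following generic sketch mimics the propensity idea with scikit-learn (it is not TestModelExternal): train a classifier to tell two feature matrices apart; an AUC near 0.5 means they look alike, while a high AUC flags systematic differences worth inspecting.

```python
# Illustrative propensity check (not TestModelExternal): train a classifier to
# distinguish the feature matrices of two repositories. AUC near 0.5 means the
# matrices look alike; a high AUC flags systematic differences worth inspecting.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def repository_propensity_auc(X_repo_a, X_repo_b):
    X = np.vstack([X_repo_a, X_repo_b])
    y = np.concatenate([np.zeros(len(X_repo_a)), np.ones(len(X_repo_b))])
    clf = GradientBoostingClassifier()
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```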
11. Create and Load Repository from Files
See: Load new repository
12. Create Random Splits for Train/Test/All Patients
See: Using Splits
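As a minimal sketch of the underlying idea (the Using Splits tooling handles this for you), splits should be assigned per patient rather than per sample, so that no patient appears in more than one of train/test/validation.

```python
# Sketch of patient-level random splits (the Using Splits tooling does this for
# you): assign each patient id, not each sample, to a split so that no patient
# appears in more than one of train/test/validation.
import numpy as np

def assign_patient_splits(patient_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(patient_ids)
    rng.shuffle(unique_ids)
    bounds = (np.cumsum(fractions) * len(unique_ids)).astype(int)
    split_of = {}
    for split, ids in enumerate(np.split(unique_ids, bounds[:-1]), start=1):
        split_of.update({pid: split for pid in ids})
    return split_of   # {patient_id: 1, 2 or 3}, matching the TRAIN values below
```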
13. Filter Train/Test by TRAIN Signal
- TRAIN == 1: Training set (70%)
- TRAIN == 2: Test set (20%)
- TRAIN == 3: Validation set (10%)
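A trivial filtering sketch, assuming the per-patient TRAIN value has already been joined onto the samples table as a train column (an assumed column name):

```python
# Trivial filtering sketch, assuming the per-patient TRAIN value has been joined
# onto the samples table as a "train" column (an assumed column name).
import pandas as pd

def split_by_train_signal(samples: pd.DataFrame, train_col: str = "train"):
    train = samples[samples[train_col] == 1]   # 70% training set
    test = samples[samples[train_col] == 2]    # 20% test set
    valid = samples[samples[train_col] == 3]   # 10% validation set
    return train, test, valid
```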
14. Print Model Info
See: Flow Model Info
15. Filter Samples by BT Cohort
Include json_mat even if it is not required by the definition. Use the --filter_by_bt_cohort syntax.
16. Check Model Compatibility with Repository / Suggest Adjustments
When applying a model to a different repository, some adjustments may be needed.
For example, MedModel strictly checks for required signals. If a signal is missing but not critical, you can mark it as acceptable by adding a rep processor for an "empty" signal. For more information, see Flow fit_model_to_rep.
Fixing Missing Dictionary Definitions
When training on one repository and testing on another, you may encounter missing diagnoses.
To find missing codes:
To resolve:
- Add missing codes to the target dictionary, matching SECTION codes as needed. For example:
  - Add ICD9_CODE:786.09 if it is missing.
  - Add ICD9_CODE:420-429.99 for range codes.
  - For named codes (e.g., MALIGNANT_NEOPLASM_OF_LIP_ORAL_CAVITY_AND_PHARYNX), find the equivalent numeric code (e.g., 140-149) and add it.
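As a generic illustration only (the real repository dictionary format is richer than assumed here), missing codes can be found by a simple set difference between the source and target dictionaries:

```python
# Generic set-difference sketch for finding missing codes; assumes each dictionary
# can be read as a plain list of codes, one per line (a simplification of the real
# repository dictionary format).
def missing_codes(source_dict_path: str, target_dict_path: str):
    with open(source_dict_path) as f:
        source_codes = {line.strip() for line in f if line.strip()}
    with open(target_dict_path) as f:
        target_codes = {line.strip() for line in f if line.strip()}
    return sorted(source_codes - target_codes)

# for code in missing_codes("train_repo.dict", "target_repo.dict"):
#     print(code)   # e.g., ICD9_CODE:786.09
```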
All of these tools can be compiled in AllTools.