MedModel JSON Format
This guide explains the structure and usage of the MedModel JSON format for defining machine learning pipelines within the Medial infrastructure. The JSON file orchestrates all model steps, from raw data processing to prediction and post-processing, making your workflow modular, reproducible, and easy to configure.
What is a MedModel JSON?
A MedModel JSON file describes:
- The pipeline of data processing and machine learning components (like data cleaning, feature generation, modeling, etc.)
- The order and configuration of each processing step
- The parameters for each component, given as key-value pairs. These pairs are passed to the component's "init" function to initialize it, which makes it simple to add components and pass arguments to them from the JSON.
- How to reference additional configuration files or value lists
This enables flexible, versioned, and shareable model definitions, ideal for both research and production.
How to Write a MedModel JSON: Step-by-Step
Let’s walk through building a JSON model file, explaining each section.
1. General Fields
These fields configure the overall behavior of the pipeline. Most are optional and have sensible defaults.
Tip: Only model_json_version is required for most users. The rest can typically be left out.
Tip: Use the `$schema` field for autocomplete and validation, even though the schema is incomplete.
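A minimal header sketch, assuming a schema file sits next to the model; the `$schema` path and the version value here are placeholders, not documented defaults:

```json
{
  "$schema": "./medmodel_schema.json",
  "model_json_version": 1,
  "model_actions": []
}
```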
2. The Pipeline: model_actions
This is the heart of the model definition: a list of components executed in order. Each component is an object specifying:
- action_type: What kind of step this is (data cleaning, feature generation, etc.)
- Other keys: Parameters specific to the step
Component types:
- rep_processor or rp_set: Processes raw signals. For the list of available rep processors, see the Rep Processors Practical Guide. Select a type and specify its name in the `rp_type` field.
- feat_generator: Creates features from cleaned signals. For the list of available feature generators, see the Feature Generator Practical Guide. Specify the type name in the `fg_type` field.
- fp_set: Post-processes the feature matrix (imputation, selection, normalization). For the list of available feature processors, see the FeatureProcessor Practical Guide. Specify the type name in the `fp_type` field.
- predictor: The machine learning algorithm. For the list of available predictors, see the MedPredictor Practical Guide. Specify the selected predictor as `predictor` and its parameters as `predictor_params`.
- post_processor: Final calibration or adjustment. For the list of available post processors, see the PostProcessors Practical Guide. Specify the type in `post_processor`.
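The component types above can be sketched as `model_actions` entries. The concrete type names (`basic_cln`, `age`, `imputer`, `xgb`, `calibrator`) and the `predictor_params` format are illustrative assumptions, not values taken from the practical guides:

```json
{
  "model_actions": [
    { "action_type": "rp_set",         "rp_type": "basic_cln" },
    { "action_type": "feat_generator", "fg_type": "age" },
    { "action_type": "fp_set",         "fp_type": "imputer" },
    { "action_type": "predictor",      "predictor": "xgb",
      "predictor_params": "num_round=200;learning_rate=0.05" },
    { "action_type": "post_processor", "post_processor": "calibrator" }
  ]
}
```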
Example Walkthrough
Let’s walk through an example and explain each major step:
How it works:
- The pipeline loads additional processors from a separate JSON file (for modularity).
- It generates demographic and behavioral features.
- It creates diagnosis-based categorical features.
- It computes statistical features over defined time windows.
- It removes features with little variation.
- It imputes missing values using well-defined rules.
- It normalizes certain features.
- It trains the model using XGBoost, with custom parameters.
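Since the example JSON itself is not reproduced here, the steps above can be sketched roughly as follows. Every type name and parameter in this sketch is an illustrative assumption; only the `action_type` values, the field-name conventions, and the `full_rep_processors.json` file name come from this guide:

```json
{
  "model_actions": [
    { "action_type": "rp_set",         "rp_set": "json:full_rep_processors.json" },
    { "action_type": "feat_generator", "fg_type": "age" },
    { "action_type": "feat_generator", "fg_type": "category_set", "signal": "Diagnosis" },
    { "action_type": "feat_generator", "fg_type": "basic",
      "signal": ["Hemoglobin"], "win_from": [0], "win_to": [365] },
    { "action_type": "fp_set",         "fp_type": "remove_degenerated" },
    { "action_type": "fp_set",         "fp_type": "imputer" },
    { "action_type": "fp_set",         "fp_type": "normalizer" },
    { "action_type": "predictor",      "predictor": "xgb",
      "predictor_params": "num_round=500;eta=0.05" }
  ]
}
```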
This example uses an additional file, "full_rep_processors.json", placed next to it. Here is what `full_rep_processors.json` contains:
- It uses `conf_cln` for cleaning simple, fixed outliers by valid range, with the configuration file `cleanDictionary.csv` holding those ranges.
- It uses `all_rules_sigs.list` to list all the available signals and create a cleaner for each of those signals. See List Expansion for more details.
- It uses `sim_val` to remove inputs on the same date with contradicting values, and to remove duplicate rows if the values are the same.
- It uses `rule_cln` to clear outliers based on equations and relations between signals. For example: BMI = Weight/Height^2; if a difference of more than the tolerance (default is 10%) is observed, the values will be dropped. Using `ruls2Signals.tsv` we can configure what happens when a contradiction is observed: whether to drop all signals in the relation, or just one specific signal, for example only the BMI.
- It uses `complete` to complete missing values in panels from other, related signals. For example, BMI is missing but we have Weight and Height. It uses `completion_metadata` to control the resolution of the resulting signals.
- It uses `calc_signals` to generate virtual signals such as eGFR.
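The bullets above can be collected into a sketch of what `full_rep_processors.json` might look like. The parameter key names (`conf_file`, `rules_file`, `metadata_file`, `tolerance`, `signal`) are assumptions; the `rp_type` names and file names come from the description above:

```json
[
  { "rp_type": "conf_cln",     "conf_file": "cleanDictionary.csv",
    "signal": "file_rel:all_rules_sigs.list" },
  { "rp_type": "sim_val" },
  { "rp_type": "rule_cln",     "rules_file": "ruls2Signals.tsv", "tolerance": 0.1 },
  { "rp_type": "complete",     "metadata_file": "completion_metadata" },
  { "rp_type": "calc_signals", "signal": "eGFR" }
]
```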
3. Referencing Other Files
To keep your pipeline modular and maintainable, you can reference external files directly in your JSON configuration. Here are the supported reference types:
"json:somefile.json": Imports another JSON file containing additional pipeline components."file_rel:signals.list": Loads a list of values from a file and expands them as a JSON array. Useful for features or signals lists. For details on how lists are expanded, see List Expansion."path_rel:config.csv": Uses a relative path to point to configuration files, resolved relative to the current JSON file's location."comma_rel:somefile.txt": Reads a file line by line and produces a single comma-separated string of values ("line1,line2,..."). Unlike"file_rel", this will not create a JSON list, but rather a flat, comma-delimited string.
These options allow you to keep configuration modular, re-use existing resources, and simplify large or complex pipelines.
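A sketch showing each reference type in use; the surrounding field names (`extra_actions`, `signals`, `config_path`, `names`) are hypothetical, and only the reference prefixes themselves come from the list above:

```json
{
  "extra_actions": "json:somefile.json",
  "signals": "file_rel:signals.list",
  "config_path": "path_rel:config.csv",
  "names": "comma_rel:somefile.txt"
}
```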
4. Advanced: List Expansion
If you use lists for fields (e.g., multiple signals or time windows), the pipeline automatically expands to cover all combinations (Cartesian product).
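For instance, a single hypothetical step with two list-valued fields (the field names other than `action_type` and `fg_type` are assumptions):

```json
{
  "action_type": "feat_generator",
  "fg_type": "basic",
  "signal": ["Hemoglobin", "Glucose"],
  "win_to": [365, 730]
}
```

This one entry expands into 2 × 2 = 4 feature-generator steps, one per signal/window pair.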
This generates steps for each type × window combination.
5. Reference Lists
At the end of your JSON, you can define reusable value lists, such as drug codes or signals, and reference them elsewhere in the pipeline with the "ref:" prefix, for example "ref:diabetes_drugs".
Ready to Write Your Own?
By following this walkthrough, you can confidently define new model JSON files:
- Start with the general fields
- List your pipeline steps in `model_actions`
- Modularize and reuse with references
- Expand lists for coverage
- Define your predictor and parameters
Tip: For a new project, copy and adapt the example above to fit your own signals, features, and model goals.