MedModel JSON Format
This guide explains the structure and usage of the MedModel JSON format for defining machine learning pipelines within the Medial infrastructure. The JSON file orchestrates all model steps, from raw data processing to prediction and post-processing, making your workflow modular, reproducible, and easy to configure.
What is a MedModel JSON?
A MedModel JSON file describes:
- The pipeline of data processing and machine learning components (like data cleaning, feature generation, modeling, etc.)
- The order and configuration of each processing step
- The parameters for each component (as key-value pairs). These pairs are passed to the component's "init" function to initialize it, which makes it simple to add components and pass arguments to them from the JSON.
- How to reference additional configuration files or value lists
This enables flexible, versioned, and shareable model definitions—ideal for both research and production.
How to Write a MedModel JSON: Step-by-Step
Let’s walk through building a JSON model file, explaining each section.
1. General Fields
These fields configure the overall behavior of the pipeline. Most are optional and have sensible defaults.
Tip: Only model_json_version is required for most users. The rest can typically be left out.
2. The Pipeline: model_actions
This is the heart of the model definition—a list of components executed in order. Each component is an object specifying:
- action_type: What kind of step this is (data cleaning, feature generation, etc.)
- Other keys: Parameters specific to the step
Component types:
- rep_processor or rp_set: Cleans or derives raw signals
- feat_generator: Creates features from cleaned signals
- fp_set: Post-processes the feature matrix (imputation, selection, normalization)
- predictor: The machine learning algorithm
- post_processor: Final calibration or adjustment
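To make the overall shape concrete, here is a minimal Python sketch that parses a hypothetical model definition and counts its pipeline entries by action_type. Only model_json_version, model_actions, action_type, and the component type names come from this guide. The version value, the example_param keys, and the placement of a "json:" reference as a plain string inside model_actions are assumptions made for illustration, not the confirmed format.

```python
import json
from collections import Counter

# Hypothetical model definition. Only "model_json_version", "model_actions",
# "action_type" and the component type names come from this guide; the version
# value and every "example_param" entry are placeholders.
MODEL_TEXT = """
{
  "model_json_version": "2",
  "model_actions": [
    "json:full_rep_processors.json",
    {"action_type": "feat_generator", "example_param": "placeholder"},
    {"action_type": "fp_set", "example_param": "placeholder"},
    {"action_type": "predictor", "example_param": "placeholder"}
  ]
}
"""


def summarize_actions(model: dict) -> Counter:
    """Count pipeline entries by action_type.

    String entries are assumed to be file references such as "json:...".
    """
    counts = Counter()
    for entry in model["model_actions"]:
        if isinstance(entry, str):
            counts["reference"] += 1
        else:
            counts[entry["action_type"]] += 1
    return counts


summary = summarize_actions(json.loads(MODEL_TEXT))
for name, count in summary.items():
    print(f"{name}: {count}")
```

A quick summary like this can help confirm that a file parses as valid JSON and that the steps appear in the intended order before handing it to the real infrastructure.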
Example Walkthrough
Let’s walk through an example and explain each major step:
How it works:
- The pipeline loads additional processors from a separate JSON file (for modularity).
- It generates demographic and behavioral features.
- It creates diagnosis-based categorical features.
- It computes statistical features over defined time windows.
- It removes features with little variation.
- It imputes missing values using well-defined rules.
- It normalizes certain features.
- It trains the model using XGBoost, with custom parameters.
This example uses an additional file, "full_rep_processors.json", located next to it. Here is what full_rep_processors.json contains:
- It uses conf_cln to remove simple, fixed outliers by valid range, with the ranges taken from the configuration file cleanDictionary.csv.
- It uses all_rules_sigs.list to list all the available signals and create a cleaner for each of them. See List Expansion for more details.
- It uses sim_val to remove inputs on the same date with contradicting values, and to remove duplicate rows when the values are the same.
- It uses rule_cln to clear outliers based on equations and relations between signals. For example, BMI=Weight/Height^2: if a difference of more than the tolerance (default is 10%) is observed, the values are dropped. ruls2Signals.tsv configures what happens when a contradiction is observed: whether to drop all signals in the relation or just one specific signal, for example only the BMI.
- It uses complete to fill in missing values in panels from other related signals, for example when BMI is missing and Weight and Height are available. It uses completion_metadata to control the resolution of the resulting signals.
- It uses calc_signals to generate virtual signals for eGFR.
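The BMI relation used by rule_cln and complete can be illustrated with a small Python sketch. This is not the Medial implementation: the function names are hypothetical, the units (kilograms and meters) are an assumption, and measuring the 10% tolerance relative to the computed value is one plausible reading of the rule described above.

```python
from typing import Optional


def bmi_from(weight_kg: float, height_m: float) -> float:
    """Apply the relation BMI = Weight / Height^2 (units assumed: kg and m)."""
    return weight_kg / height_m ** 2


def bmi_is_consistent(bmi: float, weight_kg: float, height_m: float,
                      tolerance: float = 0.10) -> bool:
    """Return True when the recorded BMI is within `tolerance` of the computed one.

    The difference is taken relative to the computed value; this choice is an
    assumption made for the sketch.
    """
    expected = bmi_from(weight_kg, height_m)
    return abs(bmi - expected) <= tolerance * expected


def complete_bmi(bmi: Optional[float], weight_kg: float, height_m: float) -> float:
    """Keep a recorded BMI, or derive it from Weight and Height when it is missing."""
    return bmi if bmi is not None else bmi_from(weight_kg, height_m)


print(bmi_is_consistent(24.5, 75.0, 1.75))   # recorded value close to the computed one
print(bmi_is_consistent(31.0, 75.0, 1.75))   # recorded value far from the computed one
print(round(complete_bmi(None, 75.0, 1.75), 2))
```

In the real pipeline, the decision about which signals to drop after a failed check is driven by the configuration file rather than by code like this.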
3. Referencing Other Files
To keep your pipeline modular and maintainable, you can reference external files directly in your JSON configuration. Here are the supported reference types:
- "json:somefile.json": Imports another JSON file containing additional pipeline components.
- "file_rel:signals.list": Loads a list of values from a file and expands them as a JSON array. Useful for features or signals lists. For details on how lists are expanded, see List Expansion.
- "path_rel:config.csv": Uses a relative path to point to configuration files, resolved relative to the current JSON file's location.
- "comma_rel:somefile.txt": Reads a file line by line and produces a single comma-separated string of values ("line1,line2,..."). Unlike "file_rel", this will not create a JSON list, but rather a flat, comma-delimited string.
These options allow you to keep configuration modular, re-use existing resources, and simplify large or complex pipelines.
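The difference between "file_rel" and "comma_rel" is easiest to see side by side. The Python sketch below mimics the two behaviors described above on a temporary file. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that both prefixes resolve the file relative to the JSON file's directory, and that blank lines are skipped.

```python
import tempfile
from pathlib import Path


def read_values(json_path: Path, relative_name: str) -> list:
    """Read non-empty lines from a file located next to the JSON file.

    Resolving relative to the JSON file's directory and skipping blank lines
    are assumptions made for this sketch.
    """
    target = json_path.parent / relative_name
    return [line.strip() for line in target.read_text().splitlines() if line.strip()]


def expand_file_rel(json_path: Path, relative_name: str) -> list:
    """Mimic "file_rel:...": the values become a list (a JSON array)."""
    return read_values(json_path, relative_name)


def expand_comma_rel(json_path: Path, relative_name: str) -> str:
    """Mimic "comma_rel:...": the values become one comma-delimited string."""
    return ",".join(read_values(json_path, relative_name))


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "model.json"          # hypothetical model file location
    (Path(tmp) / "signals.list").write_text("Glucose\nHemoglobin\nCreatinine\n")
    as_list = expand_file_rel(json_path, "signals.list")
    as_string = expand_comma_rel(json_path, "signals.list")

print(as_list)
print(as_string)
```

The list form is the one that takes part in list expansion, while the string form is passed to a component as a single parameter value.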
4. Advanced: List Expansion
If you use lists for fields (e.g., multiple signals or time windows), the pipeline automatically expands to cover all combinations (Cartesian product).
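As a rough illustration of the Cartesian product, the Python sketch below expands a step whose fields hold lists into one step per combination. The field names type and window and their values are placeholders chosen for this example, and the helper is a hypothetical stand-in for whatever the infrastructure does internally.

```python
from itertools import product


def expand_lists(step: dict) -> list:
    """Expand every list-valued field into one step per combination."""
    keys = list(step)
    # Wrap scalar values in a one-element list so product() treats all fields alike.
    choices = [value if isinstance(value, list) else [value] for value in step.values()]
    return [dict(zip(keys, combo)) for combo in product(*choices)]


# Hypothetical step: the field names and values are placeholders.
step = {
    "action_type": "feat_generator",
    "type": ["min", "max", "avg"],
    "window": ["short", "long"],
}

expanded = expand_lists(step)
for single_step in expanded:
    print(single_step)
```

Three values for type and two for window yield six concrete steps.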
This generates steps for each type × window combination.
5. Reference Lists
At the end of your JSON, you can define reusable value lists, such as drug codes or signals, and reference them by name elsewhere in the pipeline, for example "ref:diabetes_drugs".
Ready to Write Your Own?
By following this walkthrough, you can confidently define new model JSON files:
- Start with the general fields
- List your pipeline steps in model_actions
- Modularize and reuse with references
- Expand lists for coverage
- Define your predictor and parameters
Tip: For a new project, copy and adapt the example above to fit your own signals, features, and model goals.