Infrastructure Home Page
Overview of Medial Infrastructure
Medial Infrastructure is designed to turn the Electronic Medical Record (EMR), a complex, semi-structured time-series dataset, into a machine-learning-ready resource. Unlike images or free text, EMR data can be stored in countless formats, and its "labels" (the outcomes or targets you want to predict) aren't always obvious. We address this by standardizing both the storage and the processing of time-series signals. You can think of this infrastructure as the "TensorFlow" of machine learning on medical data.
How to Use This
Suppose you are a potential user or client interested in using a specific model. You are not concerned with how the model was built or which tools were used; you simply want to deploy and use it. In that case, please refer to this page: Howto use AlgoMarker
Main contributors in recent years:
Challenges
- Variety of Questions: Risk prediction (e.g., cancer, CKD), compliance, diagnostics, treatment recommendations
- Medical Data Complexity: Temporal irregularity, high dimensionality (>100k categories), sparse signals, multiple data types
- Retrospective Data Issues: Noise, bias, spurious patterns, policy sensitivity
Goals
- Avoid reinventing common methodologies for each project, which often means rewriting and debugging complicated code and logic
- Maintain shareable, versioned, regulatory‑compliant pipelines
- Facilitate reproducible transfer from research to product
- Provide end-to-end support: data import → analysis → productization
Platform Requirements
- Performance: Highly efficient in memory and time (in some cases an improvement of more than 100x compared to native Python pandas, mainly in preprocessing)
- Extensibility: Rich APIs, configurable pipelines, support for new data types
- Minimal Rewriting & Ease of Use: JSON‑driven configs, unified codebase, Python API to the C library
- Comprehensive: From "raw" data to model deployment
- Reproducible & Versioned: Track data, code, models, and parameters
Infrastructure Components
- MedRepository: a high-performance EMR time-series store
    - Fast retrieval of any patient's full record or a specific signal across all patients.
    - Unified representation: each signal consists of zero or more time channels plus zero or more value channels, all tied to a patient ID.
        - Static example: "Birth year" → no time channels, one value channel.
        - Single-time example: "Hemoglobin" → one time channel (test date), one value channel (numeric result).
        - Interval example: "Hospitalization" → two time channels (admission and discharge dates).
    - Hierarchical support for categorical medical ontologies
        - Enables seamless integration and translation between different systems when working with a frozen model or algorithm.
        - Example: A query for ICD-10 codes starting with "J" (respiratory diseases) also maps automatically to the corresponding categories in systems like Epic. Once a dictionary mapping between ICD and Epic is added, the model does not need to change.
        - Ontology mappings are managed by MedDictionary, which supports many-to-many hierarchical relationships across coding systems.
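The unified signal representation and the hierarchical code lookup described above can be illustrated with a minimal sketch. This is not the MedRepository or MedDictionary API: `SignalRecord`, `expand_codes`, the integer date encoding, and the `LOCAL:` codes are all hypothetical names chosen for illustration, and the interval signal is assumed here to carry no value channel.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class SignalRecord:
    """One entry of a signal: a patient ID plus time and value channels."""
    pid: int
    times: Tuple[int, ...] = ()      # zero or more time channels (YYYYMMDD, an assumption)
    values: Tuple[float, ...] = ()   # zero or more value channels


# Static signal: no time channel, one value channel.
birth_year = SignalRecord(pid=1, values=(1961.0,))
# Single-time signal: one time channel (test date), one value channel (result).
hemoglobin = SignalRecord(pid=1, times=(20230204,), values=(13.2,))
# Interval signal: two time channels (admission, discharge).
hospitalization = SignalRecord(pid=1, times=(20230110, 20230115))


def expand_codes(roots: Set[str], children: Dict[str, Set[str]]) -> Set[str]:
    """Return the roots plus every code reachable through parent -> child links."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        code = stack.pop()
        if code in seen:
            continue
        seen.add(code)
        stack.extend(children.get(code, ()))
    return seen


# Hypothetical dictionary: ICD-10 parents mapped to children, including codes
# from a made-up local coding system. LOCAL:1001 has two parents (many-to-many).
children = {
    "ICD10:J44": {"ICD10:J44.9", "LOCAL:1001"},
    "ICD10:J45": {"LOCAL:1001", "LOCAL:1002"},
    "ICD10:I10": {"LOCAL:2001"},
}

# Query "ICD-10 codes starting with J", then follow the mappings.
roots = {code for code in children if code.startswith("ICD10:J")}
matched = expand_codes(roots, children)
```

In this sketch, adding a new mapping only changes the `children` dictionary; the query itself (the "J" prefix) stays the same, which mirrors the idea that a frozen model does not need to change when a new coding system is mapped in.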
- Modular processing pipeline (sklearn-style)
    - Rep Processors: Clean "raw" signals or derive virtual signals, while preventing leakage of future data
        - Example: Outlier cleaner that removes a value only once the reading that exposes it as abnormal is available (e.g., a hemoglobin value from 2023-Feb-04 that is flagged only by a 2023-May-21 test stays in the record for any prediction dated before May 21 and is removed only afterward).
        - Example: Virtual BMI signal computed from weight/height, or the missing one of BMI, weight, and height imputed when only two of the three exist
    - Feature Generators: Convert cleaned signals into predictive features.
        - Examples:
            - "Last hemoglobin in past 365 days"
            - "Hemoglobin slope over three years"
            - "COPD diagnosis code during any emergency admission in last three years"
    - Feature Processors: Operate on the feature matrix: imputation, selection, PCA, etc.
    - Predictors/Classifiers: LightGBM, XGBoost, or custom algorithms.
    - Post-processing: Score calibration, explainability layers, fairness adjustments, etc.
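As a rough illustration of the feature generators listed above, the sketch below computes "last hemoglobin in past 365 days" and "hemoglobin slope over three years" relative to a prediction date, using only readings dated before that date. It is plain Python, not the infrastructure's FeatureGenerator API; the function names, the `Reading` type, and the choice to exclude readings on the prediction date itself are assumptions.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

Reading = Tuple[date, float]  # (test date, value)


def window(readings: List[Reading], pred_date: date, days: int) -> List[Reading]:
    """Readings within `days` before pred_date, sorted by date, never after it."""
    start = pred_date - timedelta(days=days)
    return sorted(r for r in readings if start <= r[0] < pred_date)


def last_value(readings: List[Reading], pred_date: date, days: int = 365) -> Optional[float]:
    """Most recent value in the window, or None if the window is empty."""
    in_win = window(readings, pred_date, days)
    return in_win[-1][1] if in_win else None


def slope_per_year(readings: List[Reading], pred_date: date, days: int = 3 * 365) -> Optional[float]:
    """Least-squares slope of value over time, in value units per year."""
    in_win = window(readings, pred_date, days)
    if len(in_win) < 2:
        return None
    xs = [(d - in_win[0][0]).days / 365.0 for d, _ in in_win]
    ys = [v for _, v in in_win]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all readings on the same day: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


hemoglobin = [
    (date(2021, 1, 10), 14.0),
    (date(2022, 1, 10), 13.0),
    (date(2023, 2, 4), 12.0),
    (date(2023, 5, 21), 9.0),   # in the future relative to the prediction below
]
pred_date = date(2023, 3, 1)
last_hgb = last_value(hemoglobin, pred_date)        # 12.0: the May reading is not visible
hgb_slope = slope_per_year(hemoglobin, pred_date)   # negative: values are declining
```

Because every feature is computed through `window`, a reading dated after the prediction date can never influence the feature, which is the same leakage constraint the Rep Processors enforce on cleaning.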
- JSON-driven pipeline configuration
    - Define every processor, feature generator, and model step in a single JSON file. See Json Format.
    - Example JSON for training a model: [collapsed 154-line block; content not available]
- Comprehensive evaluation toolkit
    - Bootstrap-based cohort analysis allows batch testing across thousands of user-defined subgroups (e.g., age 50–80, males only, prediction window of 365 days, COPD patients).
    - Automatically extracts AUC, ROC points at each 1% FPR increment, odds ratios, PPV/NPV, and applies incidence-rate adjustments or KPI weights.
    - Includes explainability and fairness audits.
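The bootstrap-based cohort analysis above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the toolkit's implementation: the `auc` and `bootstrap_auc` functions, the row layout, and the synthetic data are hypothetical, and rows are resampled individually here (a real analysis may need to resample per patient).

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]  # hypothetical row: cohort attributes, "score", "label"


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(samples: List[Sample], cohort: Callable[[Sample], bool],
                  n_boot: int = 200, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean AUC, 2.5th percentile, 97.5th percentile) within one cohort."""
    rng = random.Random(seed)
    rows = [s for s in samples if cohort(s)]
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]  # sample with replacement
        labels = [int(r["label"]) for r in resample]
        if len(set(labels)) < 2:
            continue  # skip degenerate resamples containing a single class
        estimates.append(auc([r["score"] for r in resample], labels))
    if not estimates:
        raise ValueError("cohort too small or single-class")
    estimates.sort()
    lo = estimates[int(0.025 * (len(estimates) - 1))]
    hi = estimates[int(0.975 * (len(estimates) - 1))]
    return sum(estimates) / len(estimates), lo, hi


# Synthetic data: positives get scores shifted upward by one standard deviation.
data_rng = random.Random(1)
samples: List[Sample] = []
for _ in range(300):
    label = int(data_rng.random() < 0.3)
    samples.append({
        "age": data_rng.randint(40, 90),
        "male": data_rng.randint(0, 1),
        "label": label,
        "score": data_rng.gauss(1.0 if label else 0.0, 1.0),
    })

# User-defined subgroups, evaluated in one batch.
cohorts = {
    "age 50-80": lambda s: 50 <= s["age"] <= 80,
    "males only": lambda s: s["male"] == 1,
}
results = {name: bootstrap_auc(samples, f) for name, f in cohorts.items()}
```

Defining cohorts as predicates keeps the evaluation loop unchanged as subgroups are added, which is what makes batch testing across many subgroups practical.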
- Unified API wrapper for production deployment
    - Ready for productization out of the box: no need to reinvent integration or design a new interface each time. See AlgoMarker.
    - Packages the entire end-to-end pipeline (raw time-series ingestion through inference) into a single, stable SDK.
    - Core infrastructure implemented in C++ for performance and portability, with a lightweight Python wrapper for seamless integration.
    - Although powered by C++, the team mainly uses and maintains workflows via the Python SDK, ensuring rapid development and minimal friction. Experienced users may use the C++ API more often, since the Python interface is more limited.
Basic Pages
- MedModel learn and apply
- RepProcessors:
- FeatureGenerators:
- FeatureProcessors:
- MedPredictors
- PostProcessors:
Other links
Home page for in-depth pages explaining different aspects of the infrastructure. Some interesting pages:
- Setup Environment
- How to Serialize: learn the secrets of the SerializableObject library.
- PidDynamicRecs and versions
- Virtual Signals