Training a Model

This guide covers two primary methods for training a model: using the Python API for a code-driven approach or using command-line tools.

Step 1: Define the Model Architecture

Before training, you must define the model's architecture in a JSON file. This file specifies the entire pipeline, including feature generators, processors, and the algorithm to be used. This is conceptually similar to defining layers and stacking them in a deep learning framework.

For detailed instructions on the file format, see the MedModel JSON format documentation.

Method 1: Training with the Python API

This method provides control over the training process within a Python script.

Prerequisites

Ensure you have completed the following setup steps:

Install Python API: Python API for MES Infrastructure.
Load Data Repository: Follow the ETL Tutorial.
Create Samples: Prepare your training data as described in the Create Samples Tutorial.

Python Training Example

The following script demonstrates the end-to-end process of training a model.

import med

# --- 1. Configuration ---
rep_path = '/path/to/your/repository_file'
model_json_path = '/path/to/your/model_architecture.json'
samples_path = '/path/to/your/training_samples.tsv'
output_model_path = 'trained_model.bin'

# --- 2. Initialize Model and Fit to Repository ---
print("Initializing model and fitting to the repository...")

# Load the model architecture from the JSON file
model = med.Model()
model.init_from_json_file(model_json_path)

# Before loading all data, initialize a repository to analyze its structure
# This helps the model determine which signals are "virtual" (i.e., can be
# generated from other signals, like BMI from Height and Weight) versus
# which need to be fetched directly.
rep = med.PidRepository()
rep.init(rep_path)
model.fit_for_repository(rep)

# Get the list of signals that must be fetched from the repository
required_signals = model.get_required_signal_names()

# --- 3. Load Data ---
print("Loading training samples and repository data...")

# Load the training samples (patient IDs, dates, and labels)
samples = med.Samples()
samples.read_from_file(samples_path)
# Alternatively, load from a pandas DataFrame:
# samples.from_df(your_dataframe)

# Get the unique patient IDs required for training
patient_ids = samples.get_ids()

# Load the actual data for the required signals and patients
rep = med.PidRepository()
rep.read_all(rep_path, patient_ids, required_signals)

# --- 4. Train the Model ---
print("Starting model training...")
model.learn(rep, samples)
print("Training complete.")

# --- 5. Save the Trained Model ---
model.write_to_file(output_model_path)
print(f"Model saved to {output_model_path}")

# --- Optional: Post-Training Steps ---

# Access the generated feature matrix as a pandas DataFrame
feature_matrix_df = model.features.to_df()

# Apply the trained model to a new set of samples (e.g., a test set)
# test_samples = med.Samples()
# test_samples.read_from_file('path/to/test_samples.tsv')
# predictions = model.apply(rep, test_samples)

Loading a Trained Model

You can load a previously trained binary model file directly without needing the original JSON or data.

import med

model = med.Model()
model.read_from_file('trained_model.bin')

Method 2: Training with Command-Line Tools

For users who prefer command-line operations, MES provides tools to train models without writing Python code.

Prerequisites

First, ensure you have set up the MES Tools for Training and Testing Models.

Training Tools

Flow App: For a straightforward training run without hyperparameter tuning.
- Guide: Training with Flow
Optimizer: For grid-search style hyperparameter tuning with built-in regularization to improve generalization. For more advanced tuning, consider using external libraries like Optuna.
- Guide: Optimizer Tool