Using Flow to Prepare Samples and Calculate Incidences
The Flow app provides powerful tools for selecting, filtering, and matching samples based on a cohort. These tools are essential for preparing sample files for training/validation or for matching populations on specific parameters. Flow also supports incidence estimation and generating incidence files for later analysis (e.g., bootstrap). The instructions below are primarily for date-based cases, but can be adapted for other variables.
Cohort Files
A cohort file defines the patients to use for training/testing, including their entry/exit times and outcome information. Format:
- Tab-delimited
- Lines starting with # are comments
- Each line contains 5 fields:
  - pid: Patient ID (one line per pid)
  - Entry date to cohort
  - End date in cohort
  - Outcome date: For controls (outcome=0), same as end date; for cases (outcome=1), the event date (must be ≤ end date)
  - Outcome: Typically 0/1 (control/case), but can be regression or multicategory (currently only a single value is supported)
Example cohort file:
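As an illustration of the format, the sketch below parses a small cohort file in Python and checks the rules listed above. The rows are invented, and the YYYYMMDD date encoding and the read_cohort helper are assumptions made for this example; they are not part of Flow.

```python
# Invented rows for illustration only. Dates as YYYYMMDD integers are an assumption.
EXAMPLE_COHORT = (
    "# pid\tentry\tend\toutcome_date\toutcome\n"
    "1001\t20100101\t20151231\t20151231\t0\n"
    "1002\t20090615\t20140320\t20130704\t1\n"
)


def read_cohort(text):
    """Parse cohort text into a list of dicts (hypothetical helper, not a Flow API)."""
    rows = []
    seen_pids = set()
    for line in text.splitlines():
        # Skip blank lines and comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
        pid, entry, end, outcome_date = (int(f) for f in fields[:4])
        outcome = float(fields[4])
        # One line per pid.
        if pid in seen_pids:
            raise ValueError(f"pid {pid} appears more than once")
        seen_pids.add(pid)
        # The outcome date must not be later than the end date.
        if outcome_date > end:
            raise ValueError(f"pid {pid}: outcome date is after end date")
        # For controls, the outcome date equals the end date.
        if outcome == 0 and outcome_date != end:
            raise ValueError(f"pid {pid}: control outcome date differs from end date")
        rows.append({"pid": pid, "entry": entry, "end": end,
                     "outcome_date": outcome_date, "outcome": outcome})
    return rows


cohort = read_cohort(EXAMPLE_COHORT)
```

Comparing YYYYMMDD values as integers preserves chronological order, which is why the date checks need no date parsing here.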
Creating a Sample File from a Cohort File
Use Flow to generate a sample file:
All parameters are self-explanatory except cohort_sampling, which controls how samples are created (for both controls and cases). Key options (with defaults):
- min_control_years (0): Minimum years before outcome for controls
- max_control_years (10): Maximum years before outcome for controls
- min_case_years (0): Minimum years before outcome for cases
- max_case_years (1): Maximum years before outcome for cases
- is_continous (1): Continuous sampling (1) or on-test (0)
- min_days_from_outcome (30): Minimum days before outcome to sample
- jump_days (180): Days between sampling periods
- min_year (1900), max_year (2100): Year range for sampling
- gender_mask (3): 1=male, 2=female, 3=both
- train_mask (7): Mask for TRAIN value (bits for TRAIN=1,2,3)
- min_age (0), max_age (200): Age range for sampling
- stick_to_sigs: Comma-separated list of signals; only use time points with at least one of these signals
- take_closest (0): Take sample with stick signal closest to each target date
- take_all (0): Take all samples with stick signal in each period
- max_samples_per_id (2^31-1): Max samples per ID
- max_samples_per_id_method ('last'): 'last' or 'rand' (choose last or random samples)
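To make the interaction between the year window, min_days_from_outcome, and jump_days concrete, here is a minimal Python sketch of continuous sampling for one patient. It is an interpretation of the option descriptions above, not Flow's implementation. The candidate_dates function, the forward stepping from the earliest allowed date, and the 365-day year are all assumptions.

```python
from datetime import date, timedelta


def candidate_dates(entry, end, outcome_date, min_years=0, max_years=1,
                    min_days_from_outcome=30, jump_days=180):
    """Return candidate sample dates for one patient (hypothetical helper).

    For a case, pass min_case_years/max_case_years as min_years/max_years;
    for a control, pass min_control_years/max_control_years.
    Years are approximated as 365 days, which is an assumption of this sketch.
    """
    # Earliest date: no earlier than cohort entry or max_years before the outcome.
    earliest = max(entry, outcome_date - timedelta(days=round(max_years * 365)))
    # Latest date: at least min_years and min_days_from_outcome before the outcome,
    # and never after the end of follow-up.
    gap_days = max(min_days_from_outcome, round(min_years * 365))
    latest = min(end, outcome_date - timedelta(days=gap_days))

    dates = []
    current = earliest
    while current <= latest:
        dates.append(current)
        current += timedelta(days=jump_days)  # one sample per sampling period
    return dates


# A case with the default case window (0 to 1 years before the outcome).
case_dates = candidate_dates(
    entry=date(2010, 1, 1),
    end=date(2014, 3, 20),
    outcome_date=date(2013, 7, 4),
)
```

With the defaults, a case contributes at most a few samples, all inside the final year before the event, while a control with max_control_years=10 can contribute many more. That imbalance is one reason the matching step described later is useful.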
Sample usage:
Filtering and Matching Samples
You can filter and/or match samples from an existing samples file (typically created as above):
Filter Parameters
Filtering allows you to select samples within a date range or based on signal values in a window before the sample time (e.g., only samples with Creatinine < 1.1 in the last 2 years).
- min_sample_time (0): Minimum allowed time (in sample's time unit, usually date)
- max_sample_time ((1<<30)): Maximum allowed time
- win_time_unit ("Days"): Time unit for bfilter windows
- bfilter: Filter on a signal; multiple filters allowed. Parameters (comma-separated):
  - sig_name: Signal name
  - win_from (0): Window start (relative to sample time, backwards)
  - win_to ((1<<30)): Window end
  - min_val (-1e10): Minimum allowed value
  - max_val (1e10): Maximum allowed value
  - min_Nvals (1): Minimum number of signal instances in window
  - time_channel (0), val_channel (0): Channels to consider
- min_bfilter: How many bfilters must pass (default: all)
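The following Python sketch shows one way to read the bfilter semantics. It is not Flow's code. It assumes that a signal instance counts toward min_Nvals only when it lies inside the window and its value is within [min_val, max_val], and that window bounds are inclusive. The function names and the (date, value) representation of a signal are hypothetical.

```python
from datetime import date


def passes_bfilter(sample_date, readings, win_from=0, win_to=1 << 30,
                   min_val=-1e10, max_val=1e10, min_nvals=1):
    """Check one bfilter against a list of (date, value) readings (hypothetical helper)."""
    count = 0
    for reading_date, value in readings:
        # The window is measured backwards from the sample time, in days.
        days_back = (sample_date - reading_date).days
        if win_from <= days_back <= win_to and min_val <= value <= max_val:
            count += 1
    return count >= min_nvals


def passes_all(results, min_bfilter=None):
    """Combine per-filter results; by default every bfilter must pass."""
    required = len(results) if min_bfilter is None else min_bfilter
    return sum(results) >= required


# Invented readings: keep the sample only if Creatinine < 1.1 was seen in the last 2 years.
creatinine = [(date(2015, 5, 1), 1.4), (date(2019, 1, 10), 0.9)]
keep = passes_bfilter(date(2020, 1, 1), creatinine, win_to=730, max_val=1.1)
```

Note that max_val is an inclusive bound in this sketch, so a strict "less than 1.1" condition would need a slightly smaller value.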
Examples:
Matching Parameters
Matching balances the ratio of cases to controls within defined strata (e.g., by year, gender, age, or signal value). The algorithm tries to maximize the number of samples kept, with a weight parameter to prioritize keeping cases when they are rare. The goal is to remove information that the model could otherwise pick up directly from those variables. A common use is matching by year, which removes temporal information the model might gain from differences in the case-to-control ratio across years.
- priceRatio (100.0): How many controls to lose per case (suggested: n_controls/n_cases)
- maxRatio (10.0): If optimal ratio > maxRatio, sample less (enrich cases)
- verbose (0): More output
- match_to_prior: Specify target prior directly
- strata: Define stratification (':'-delimited for multiple strata, ','-delimited for parameters):
  - type: time, age, gender, or signal
  - signalName: For time: year/month/days; for signal: name; for age/gender: none
  - resolution: Bin size
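The role of priceRatio and maxRatio can be illustrated with a simplified Python sketch. It is a guess at the general idea, not the algorithm Flow uses. It assumes a single controls-per-case ratio is chosen for all strata by minimizing a cost in which each dropped case counts as priceRatio dropped controls, and that the chosen ratio is then capped at maxRatio. Both function names are hypothetical.

```python
def choose_ratio(strata, price_ratio=100.0, max_ratio=10.0):
    """Pick a common controls-per-case ratio (hypothetical helper).

    strata maps a stratum label to (n_cases, n_controls).
    """
    # Candidate ratios are the observed ratios of the individual strata.
    candidates = sorted({controls / cases
                         for cases, controls in strata.values()
                         if cases > 0 and controls > 0})
    if not candidates:
        raise ValueError("no stratum has both cases and controls")

    best_ratio, best_cost = None, None
    for ratio in candidates:
        cost = 0.0
        for cases, controls in strata.values():
            if controls >= ratio * cases:
                cost += controls - ratio * cases                    # surplus controls dropped
            else:
                cost += price_ratio * (cases - controls / ratio)    # surplus cases dropped
        if best_cost is None or cost < best_cost:
            best_ratio, best_cost = ratio, cost
    # Cap the ratio, which keeps fewer controls and so enriches cases.
    return min(best_ratio, max_ratio)


def apply_ratio(strata, ratio):
    """Return the (cases, controls) counts kept in each stratum at the given ratio."""
    kept = {}
    for label, (cases, controls) in strata.items():
        # The small epsilon guards against floating-point results just below an integer.
        kept_cases = min(cases, int(controls / ratio + 1e-9))
        kept_controls = min(controls, int(ratio * kept_cases + 1e-9))
        kept[label] = (kept_cases, kept_controls)
    return kept


# Invented counts per year: the case-to-control ratio differs between years.
by_year = {"2010": (10, 100), "2011": (10, 300)}
ratio = choose_ratio(by_year)
matched = apply_ratio(by_year, ratio)
```

After matching, every stratum has the same ratio, so the year alone no longer carries information about the outcome. A high price_ratio makes the sketch prefer discarding controls over discarding cases.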
Examples:
Calculating Incidence for a Cohort
To generate an incidence file for a cohort, use:
cohort_fname and censor_reg are MedRegistry objects. You can convert a MedCohort to MedRegistry using:
Incidence Parameters (in cohort_incidence argument, separated by ;)
- age_bin: Size of age bins (e.g., 5)
- min_samples_in_bin: Small bins are merged with neighbors
- from_year, to_year: Year range
- start_date: Date in year to test (mmdd, e.g., 508=May 8, 1201=Dec 1)
- gender_mask, train_mask: As above
- from_age, to_age: Age range
- incidence_years_window: Years ahead to calculate incidence
- incidence_days_win: Days ahead to calculate incidence (overrides years if set)
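As a rough illustration of what these parameters describe, the Python sketch below counts, for one calendar year, how many patients are in the cohort on the test date and how many of them have an outcome within the following window, grouped by age bin. It is a simplified reading, not Flow's computation: it ignores censoring, gender and train masks, and the merging of small bins. The function names, the birth_year field, and the YYYYMMDD integer dates are assumptions.

```python
from collections import defaultdict


def decode_start_date(start_date):
    """Split an integer start_date such as 508 into (month, day)."""
    return divmod(start_date, 100)


def incidence_by_age_bin(people, year, start_date=101, age_bin=5, years_window=1):
    """Return {bin_start_age: (n_at_risk, n_events, rate)} for one year (hypothetical helper)."""
    month, day = decode_start_date(start_date)
    window_start = year * 10000 + month * 100 + day
    window_end = (year + years_window) * 10000 + month * 100 + day

    counts = defaultdict(lambda: [0, 0])  # bin -> [n_at_risk, n_events]
    for person in people:
        # Only patients inside the cohort on the test date are at risk.
        if not (person["entry"] <= window_start <= person["end"]):
            continue
        # Patients whose event already happened are no longer at risk.
        if person["outcome"] == 1 and person["outcome_date"] <= window_start:
            continue
        age = year - person["birth_year"]
        bin_start = (age // age_bin) * age_bin
        counts[bin_start][0] += 1
        if person["outcome"] == 1 and window_start < person["outcome_date"] <= window_end:
            counts[bin_start][1] += 1
    return {b: (n, e, e / n) for b, (n, e) in sorted(counts.items())}


# Invented patients for illustration only.
people = [
    {"birth_year": 1950, "entry": 20050101, "end": 20151231, "outcome_date": 20151231, "outcome": 0},
    {"birth_year": 1952, "entry": 20050101, "end": 20101001, "outcome_date": 20100901, "outcome": 1},
    {"birth_year": 1970, "entry": 20050101, "end": 20151231, "outcome_date": 20151231, "outcome": 0},
]
rates = incidence_by_age_bin(people, year=2010, start_date=508)
```

Repeating this for each year between from_year and to_year and pooling the counts per bin would give one rate per age bin.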
Example:
You can also control sampling directly with --sampler_params:
- start_year, end_year or start_time, end_time: Full time/date
- prediction_month_day: Prediction date
- time_jump / day_jump: Interval between prediction dates
- time_jump_unit: Jump unit (e.g., Day, Year)
- time_range_unit: Time range unit (e.g., Date, Minutes)