Using Flow to Prepare Samples and Calculate Incidences

The Flow app provides powerful tools for selecting, filtering, and matching samples based on a cohort. These tools are essential for preparing sample files for training/validation or for matching populations on specific parameters. Flow also supports incidence estimation and generating incidence files for later analysis (e.g., bootstrap). The instructions below are primarily for date-based cases, but can be adapted for other variables.

Cohort Files

A cohort file defines the patients to use for training/testing, including their entry/exit times and outcome information. Format:

  • Tab-delimited
  • Lines starting with # are comments
  • Each line contains 5 fields:
    1. pid: Patient ID (one line per pid)
    2. Entry date to cohort
    3. End date in cohort
    4. Outcome date: For controls (outcome=0), same as end date; for cases (outcome=1), the event date (must be ≤ end date)
    5. Outcome: Typically 0/1 (control/case), but can be regression or multicategory (currently only a single value is supported)

Example cohort file:

# pid 5000009 entered on 20060505, left on 20091023, and had a case outcome on 20060605
5000009 20060505        20091023        20060605        1.000000
# pid 5000014 is a control (0.0 outcome), entered 20150107, left 20160929
5000014 20150107        20160929        20160929        0.000000
5000017 20110826        20160819        20160819        0.000000
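
The format rules above can be checked mechanically before a cohort file is handed to Flow. The following is a minimal sketch and not part of Flow: the function name validate_cohort is hypothetical, and it assumes whitespace-separated fields, YYYYMMDD dates, and a binary 0/1 outcome as in the example.

```shell
# Hypothetical helper (not part of Flow): report cohort lines that break
# the format rules described above. Prints one message per problem found.
validate_cohort() {
    awk '
        /^#/ || NF == 0     { next }   # skip comments and blank lines
        NF != 5             { print "line " NR ": expected 5 fields, got " NF; next }
        seen[$1]++          { print "line " NR ": duplicate pid " $1 }
        $2 > $3             { print "line " NR ": entry date after end date" }
        $4 > $3             { print "line " NR ": outcome date after end date" }
        $5 == 0 && $4 != $3 { print "line " NR ": control outcome date differs from end date" }
    ' "$1"
}
```

Running `validate_cohort ./pre2d.cohort` prints nothing when every line passes these checks.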

Creating a Sample File from a Cohort File

Use Flow to generate a sample file:

Flow --rep <repository> --seed <random seed> --cohort_fname <cohort file> --cohort_sampling <sampling parameters> --out_samples <output samples file>

All parameters are self-explanatory except cohort_sampling, which controls how samples are created (for both controls and cases). Key options (with defaults):

  • min_control_years (0): Minimum years before outcome for controls
  • max_control_years (10): Maximum years before outcome for controls
  • min_case_years (0): Minimum years before outcome for cases
  • max_case_years (1): Maximum years before outcome for cases
  • is_continous (1): Continuous sampling (1) or on-test (0)
  • min_days_from_outcome (30): Minimum days before outcome to sample
  • jump_days (180): Days between sampling periods
  • min_year (1900), max_year (2100): Year range for sampling
  • gender_mask (3): 1=male, 2=female, 3=both
  • train_mask (7): Mask for TRAIN value (bits for TRAIN=1,2,3)
  • min_age (0), max_age (200): Age range for sampling
  • stick_to_sigs: Comma-separated list of signals; only use time points with at least one of these signals
  • take_closest (0): Take sample with stick signal closest to each target date
  • take_all (0): Take all samples with stick signal in each period
  • max_samples_per_id (2^31-1): Max samples per ID
  • max_samples_per_id_method ('last'): 'last' or 'rand' (choose last or random samples)
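
Both train_mask and gender_mask are bit masks. As a rough sketch of how a train_mask value could be composed, the helper below assumes that TRAIN=k corresponds to bit 1 << (k-1). This agrees with the default of 7 covering TRAIN=1,2,3, but the exact bit assignment is an assumption, and the function name train_mask_for is hypothetical.

```shell
# Hypothetical helper: build a train_mask from a list of TRAIN values,
# ASSUMING TRAIN=k corresponds to bit (1 << (k-1)).
train_mask_for() {
    mask=0
    for k in "$@"; do
        mask=$(( mask | (1 << (k - 1)) ))
    done
    echo "$mask"
}

train_mask_for 1 2 3   # prints 7 (all three TRAIN groups)
train_mask_for 1       # prints 1 (TRAIN=1 only)
```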

Sample usage:

# Continuous sampling, random sample every 180 days
# Controls: 1–10 years before end; Cases: up to 2 years before outcome
# Only TRAIN=1, ages 35–90
SAMPLING_PARAMS1="min_control=1;max_control=10;min_case=0;max_case=2;jump_days=180;train_mask=1;min_age=35;max_age=90"

# As above, but on-test and only at Glucose or HbA1C test dates
SAMPLING_PARAMS2="min_control=1;max_control=10;min_case=0;max_case=2;jump_days=180;train_mask=1;min_age=35;max_age=90;is_continous=0;stick_to_sigs=Glucose,HbA1C"

# Run Flow (substitute SAMPLING_PARAMS2 as needed); the value is quoted so it reaches Flow as a single argument
Flow --rep /home/Repositories/THIN/thin_mar2017/thin.repository --seed 123 --cohort_fname ./pre2d.cohort --cohort_sampling "${SAMPLING_PARAMS1}" --out_samples ./temp.samples

Filtering and Matching Samples

You can filter and/or match samples from an existing samples file (typically created as above):

Flow --rep <rep> --seed <random_seed> --filter_and_match --in_samples <input samples file> --out_samples <output samples file> --filter_params <filter params> --match_params <match params>
Filtering is applied first (if specified), then matching (if specified).

Filter Parameters

Filtering allows you to select samples within a date range or based on signal values in a window before the sample time (e.g., only samples with Creatinine < 1.1 in the last 2 years).

  • min_sample_time (0): Minimum allowed time (in sample's time unit, usually date)
  • max_sample_time ((1<<30)): Maximum allowed time
  • win_time_unit ("Days"): Time unit for bfilter windows
  • bfilter: Filter on a signal; multiple filters allowed. Parameters (comma-separated):
    • sig_name: Signal name
    • win_from (0): Window start (relative to sample time, backwards)
    • win_to ((1<<30)): Window end
    • min_val (-1e10): Minimum allowed value
    • max_val (1e10): Maximum allowed value
    • min_Nvals (1): Minimum number of signal instances in window
    • time_channel (0), val_channel (0): Channels to consider
  • min_bfilter: How many bfilters must pass (default: all)

Examples:

# Only samples between 20070101 and 20150101
FILTER1="min_sample_time=20070101;max_sample_time=20150101"

# At least 1 Creatinine test in last 2 years
FILTER2="bfilter=sig,Creatinine,win_from,0,win_to,730,min_Nvals,1"

# As above, but all Creatinine < 0.9
FILTER3="bfilter=sig,Creatinine,win_from,0,win_to,730,min_Nvals,1,min_val,0,max_val,0.9"

# Combined: date range, at least one Glucose in last 2Y, all Glucose < 100 in last 5Y, all HbA1C < 5.7 in last 5Y
FILTER4="min_sample_time=20070101;max_sample_time=20101201;bfilter=sig,Glucose,win_from,0,win_to,730,min_Nvals,1;bfilter=sig,Glucose,win_from,0,win_to,1825,min_val,0,max_val,100;bfilter=sig,HbA1C,win_from,0,win_to,1825,min_val,0,max_val,5.7"
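
Long filter strings are easy to mistype, so it can help to assemble each bfilter clause programmatically. The helper below is a sketch only: the name bfilter and its argument order are hypothetical, and it converts years to days as 365 × years, matching the 730 and 1825 values used above with the default win_time_unit of Days.

```shell
# Hypothetical helper: assemble one bfilter clause in the comma-separated
# key,value syntax used by the examples above.
# Usage: bfilter SIGNAL YEARS_BACK [key value]...
bfilter() {
    sig=$1
    years=$2
    shift 2
    spec="bfilter=sig,${sig},win_from,0,win_to,$(( years * 365 ))"
    while [ "$#" -ge 2 ]; do
        spec="${spec},$1,$2"
        shift 2
    done
    echo "$spec"
}

# Date range plus two signal filters, joined with ';'
FILTER_EXAMPLE="min_sample_time=20070101;max_sample_time=20150101;$(bfilter Creatinine 2 min_Nvals 1);$(bfilter HbA1C 5 max_val 5.7)"
echo "$FILTER_EXAMPLE"
```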

Matching Parameters

Matching balances the ratio of cases to controls within defined strata (e.g., by year, gender, age, or signal value). The algorithm tries to maximize the number of samples kept, with a weight parameter that prioritizes keeping cases when they are rare. The goal is to remove information the model could derive directly from the matching variables. A common case is matching by year, which removes temporal information the model might gain from differences in the case-to-control ratio between years.

  • priceRatio (100.0): How many controls the algorithm is willing to lose to keep one case (suggested: n_controls/n_cases)
  • maxRatio (10.0): If the optimal control-to-case ratio is above maxRatio, sample fewer controls (enrich cases)
  • verbose (0): More output
  • match_to_prior: Specify target prior directly
  • strata: Define stratification (':'-delimited for multiple strata, ','-delimited for parameters):
    • type: time, age, gender, or signal
    • signalName: For time: year/month/days; for signal: name; for age/gender: none
    • resolution: Bin size

Examples:

# Match and stratify by year
MATCH1="priceRatio=10;maxRatio=4.5;verbose=1;strata=time,year,1"
# Match by year to a prior of 0.1 (10%)
MATCH1B="match_to_prior=0.1;maxRatio=4.5;verbose=1;strata=time,year,1"
# Match by gender
MATCH2="priceRatio=10;maxRatio=4.5;verbose=1;strata=gender"
# Match by age (5-year bins)
MATCH3="priceRatio=10;maxRatio=4.5;verbose=1;strata=age,5"
# Match by age, year, and gender together
MATCH4="priceRatio=10;maxRatio=4.5;verbose=1;strata=age,5:time,year,1:gender"
# Match by Glucose (bin=10) and HbA1C (bin=1.0)
MATCH5="priceRatio=10;maxRatio=4.5;verbose=1;strata=signal,Glucose,10:signal,HbA1C,1.0"
For more details, see: MatchingSampleFilter
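
Filtering and matching can be combined in one call. The wrapper below is a sketch that uses only the flags documented above; the function name, the repository path, the sample file names, and the DRY_RUN switch are placeholders introduced for illustration. By default it prints the command instead of running Flow.

```shell
# Sketch: filter by date range, then match by year, in one Flow call.
REP=${REP:-/path/to/your.repository}   # placeholder repository path
DRY_RUN=${DRY_RUN:-1}                  # 1 = print the command only
FILTER="min_sample_time=20070101;max_sample_time=20150101"
MATCH="priceRatio=10;maxRatio=4.5;verbose=1;strata=time,year,1"

# Usage: filter_and_match INPUT_SAMPLES OUTPUT_SAMPLES
filter_and_match() {
    set -- Flow --rep "$REP" --seed 123 --filter_and_match \
        --in_samples "$1" --out_samples "$2" \
        --filter_params "$FILTER" --match_params "$MATCH"
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"   # show the command without running Flow
    else
        "$@"                 # each parameter string is passed as one argument
    fi
}

filter_and_match ./temp.samples ./matched.samples
```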

Calculating Incidence for a Cohort

To generate an incidence file for a cohort, use:

Flow --rep <repository> --cohort_fname <cohort file> --cohort_incidence <incidence parameters> --out_incidence <incidence file> --censor_reg <censor registry> --use_kaplan_meir 1

For example, <incidence parameters> can be "from_year=2007;to_year=2014;from_age=40;to_age=80;age_bin=40;incidence_days_win=1825". Both cohort_fname and censor_reg are MedRegistry objects. You can convert a MedCohort to MedRegistry using:

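# Create MedRegistry from MedCohort (comment lines starting with # are skipped)
# Cases yield two rows: entry..outcome with value 0, then outcome..end with value 1
cat <cohort file> | awk '!/^#/ { if ($NF > 0) { print $1 "\t" $2 "\t" $4 "\t" "0";  print $1 "\t" $4 "\t" $3 "\t" "1" } else { print $1 "\t" $2 "\t" $3  "\t" "0" } }'
# Create Censor registry from MedCohort: cases are followed until the outcome date, controls until the end date
cat <cohort file> | awk '!/^#/ { if ($NF > 0) { print $1 "\t" $2 "\t" $4 "\t" "1" } else { print $1 "\t" $2 "\t" $3  "\t" "1" } }'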
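
To see what the conversion produces, the sketch below wraps the same awk logic in two functions and applies them to a case and a control taken from the example cohort. The function names are hypothetical, and the wrappers additionally skip comment and blank lines.

```shell
# Hypothetical wrappers around the awk conversions above.
# Comment lines (starting with #) and blank lines are skipped.
cohort_to_registry() {
    awk -v OFS='\t' '/^#/ || NF == 0 { next }
        $NF > 0 { print $1, $2, $4, 0; print $1, $4, $3, 1; next }
                { print $1, $2, $3, 0 }' "$1"
}
cohort_to_censor() {
    awk -v OFS='\t' '/^#/ || NF == 0 { next }
        { print $1, $2, ($NF > 0 ? $4 : $3), 1 }' "$1"
}

# Two lines from the example cohort: one case, one control
cohort=$(mktemp)
printf '5000009\t20060505\t20091023\t20060605\t1.000000\n' >  "$cohort"
printf '5000014\t20150107\t20160929\t20160929\t0.000000\n' >> "$cohort"
cohort_to_registry "$cohort"   # 3 rows: two for the case, one for the control
cohort_to_censor "$cohort"     # 2 rows: one per patient
rm -f "$cohort"
```

The case 5000009 becomes a 0-valued period from 20060505 to 20060605 followed by a 1-valued period from 20060605 to 20091023, while the control 5000014 stays a single 0-valued period.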

Incidence Parameters (in cohort_incidence argument, separated by ;)

  • age_bin: Size of age bins (e.g., 5)
  • min_samples_in_bin: Small bins are merged with neighbors
  • from_year, to_year: Year range
  • start_date: Date in year to test (mmdd, e.g., 508=May 8, 1201=Dec 1)
  • gender_mask, train_mask: As above
  • from_age, to_age: Age range
  • incidence_years_window: Years ahead to calculate incidence
  • incidence_days_win: Days ahead to calculate incidence (overrides years if set)

Example:

# Incidence for 2007–2010 (annual), TRAIN=1, age bins of 5, test date June 2nd
INC_PARAMS="train_mask=1;age_bin=5;start_date=602;incidence_years_window=1;from_year=2007;to_year=2010"
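
The start_date value packs month and day into one integer, which is easy to misread. The helper below is a small sketch (the function name is hypothetical) that splits such a value following the examples in the parameter list, where 508 is May 8 and 1201 is Dec 1.

```shell
# Hypothetical helper: split a start_date value into month and day,
# following the documented examples (508 = May 8, 1201 = Dec 1).
decode_start_date() {
    printf 'month=%d day=%d\n' "$(( $1 / 100 ))" "$(( $1 % 100 ))"
}

decode_start_date 602    # prints month=6 day=2
decode_start_date 1201   # prints month=12 day=1
```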

You can also control sampling directly with --sampler_params:

  • start_year, end_year or start_time, end_time: Full time/date
  • prediction_month_day: Prediction date
  • time_jump/day_jump: Interval between prediction dates
  • time_jump_unit: Jump unit (e.g., Day, Year)
  • time_range_unit: Time range unit (e.g., Date, Minutes)