Test 12 - Lab Frequency
Purpose
Analyze lab frequency for signals that back the model's important features. The test produces counts of how many patients had N observations of a given signal (per-signal histogram of patient-level lab counts).
Required Inputs
From configs/env.sh and the test invocation environment:
- WORK_DIR: working directory where the repository and output folders live
- CMP_FEATURE_RES: comma-separated list of important features (used to derive the list of relevant signals)
- A prepared repository under ${WORK_DIR}/rep, containing test.signals and test.repository
How to Run
From your TestKit folder, execute:
After the run, check ${WORK_DIR}/signals_cnt/ for the per-signal TSV results.
What This Test Does
- Parses CMP_FEATURE_RES to extract signal names. It strips category prefixes like ICD9_CODE: / ICD10_CODE: / ATC_CODE: and excludes some control features (e.g., FTR_, Age, category_, Smoking). It also ensures the DIAGNOSIS and Smoking_Status signals are included by default.
- For each signal, it verifies that the signal exists in ${WORK_DIR}/rep/test.signals and skips signals that are not present.
- For present signals it runs:
  - Flow --rep ${WORK_DIR}/rep/test.repository --pids_sigs --sigs <signal> to retrieve patient-signal rows (id, date, ...)
  - An awk pipeline that deduplicates per-patient-date entries and counts how many distinct dates each patient has for the signal, then aggregates across patients to produce counts: how many patients had exactly 1 sample, 2 samples, ...
- Writes per-signal files: ${WORK_DIR}/signals_cnt/<signal>.tsv with rows: signal, count, num_patients
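The deduplicate-then-count step can be sketched in Python to make the logic concrete. This is not the script's awk code, only an illustration of the same idea, and it assumes the patient-signal rows have already been reduced to (patient id, date) pairs. The function name and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def lab_count_histogram(rows):
    """Return sorted (num_labs, num_patients) pairs for one signal.

    rows: iterable of (patient_id, date) pairs; same-day repeats are allowed.
    """
    dates_per_patient = defaultdict(set)
    for pid, date in rows:
        # A set keeps each date once, so same-day duplicates collapse.
        dates_per_patient[pid].add(date)
    # Count how many patients share each "number of distinct dates".
    hist = Counter(len(dates) for dates in dates_per_patient.values())
    return sorted(hist.items())


# Made-up rows for illustration only (not real repository data).
rows = [
    (1, "20200101"), (1, "20200101"), (1, "20200305"),
    (2, "20200101"),
    (3, "20210610"), (3, "20210611"),
]
for num_labs, num_patients in lab_count_histogram(rows):
    print(f"Hemoglobin\t{num_labs}\t{num_patients}")
```

With these made-up rows, patient 1 has two distinct dates (the repeated day counts once), patient 2 has one, and patient 3 has two, so one patient falls in the 1-lab bucket and two patients in the 2-lab bucket.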
Output Location
${WORK_DIR}/signals_cnt/ - one TSV per signal, named <signal>.tsv, containing columns: signal, num_labs, num_patients
How to Interpret Results
- Each per-signal TSV shows how many patients had N lab entries for that signal during the observation window. Compare these distributions across signals or against a reference dataset to find differences in monitoring intensity.
- Without a reference, raw counts indicate whether certain signals are rarely or frequently measured in this dataset (useful for data quality and expected feature availability).
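One way to put a number on the comparison against a reference dataset is to normalize both histograms to proportions and compute a distance between them. The sketch below uses total variation distance, which is a choice made here for illustration and not something the test itself computes. It assumes each histogram has already been loaded as a mapping from num_labs to num_patients; the function names and the sample counts are hypothetical.

```python
def to_proportions(hist):
    """Convert {num_labs: num_patients} into {num_labs: share of patients}."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram has no patients")
    return {n: count / total for n, count in hist.items()}


def total_variation_distance(hist_a, hist_b):
    """Return a value in [0, 1]: 0 for identical shapes, 1 for disjoint ones."""
    pa, pb = to_proportions(hist_a), to_proportions(hist_b)
    buckets = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(b, 0.0) - pb.get(b, 0.0)) for b in buckets)


# Made-up histograms for illustration only.
current = {1: 50, 2: 30, 3: 20}
reference = {1: 20, 2: 30, 3: 50}
print(total_variation_distance(current, reference))
```

Because the histograms are normalized first, datasets with different patient counts can be compared directly. Any threshold for what counts as a meaningful difference would need to be chosen per project.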
Troubleshooting
- Missing test.signals or test.repository: the script checks for the signal list in ${WORK_DIR}/rep/test.signals and will skip signals not found. Ensure the repo was created by Test 03 and contains expected files.
- Flow not found or failing: ensure Flow is on PATH.
- Empty output files: if ${WORK_DIR}/rep/test.signals lists the signal but Flow returns no rows, inspect the repository content to ensure the signal has records (search for the exact signal token in test.signals).
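The first two checks above can be run before the test with a small pre-flight helper. The sketch below is not part of the TestKit; the function name is hypothetical, and it only checks that the two repository files exist and that an executable named Flow can be found on PATH.

```python
import os
import shutil


def preflight_problems(work_dir, flow_binary="Flow"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    rep_dir = os.path.join(work_dir, "rep")
    for name in ("test.signals", "test.repository"):
        path = os.path.join(rep_dir, name)
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(flow_binary) is None:
        problems.append(f"{flow_binary} not found on PATH")
    return problems


if __name__ == "__main__":
    # Falls back to the current directory if WORK_DIR is not exported.
    for problem in preflight_problems(os.environ.get("WORK_DIR", ".")):
        print(problem)
```

The helper does not inspect the contents of test.signals, so it cannot tell whether a specific signal has records.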
Example output
A sample ${WORK_DIR}/signals_cnt/Hemoglobin.tsv might look like:
Notes and Implementation Details
- The script deduplicates by patient-date so multiple entries on the same day count as a single lab for that day.
- The signal extraction logic excludes some features (e.g., Age, features starting with FTR_ or category_) because they are not time-series signals.
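The extraction rules can be illustrated with a short Python sketch. It assumes each entry in CMP_FEATURE_RES is a plain name, optionally carrying one of the category prefixes, and that exclusion is a simple prefix or exact-name match. The real feature-name format and the script's exact matching rules may differ, so the function name, the constants, and the sample input are all hypothetical.

```python
CATEGORY_PREFIXES = ("ICD9_CODE:", "ICD10_CODE:", "ATC_CODE:")
EXCLUDED_PREFIXES = ("FTR_", "category_", "Smoking")
EXCLUDED_NAMES = {"Age"}
DEFAULT_SIGNALS = ("DIAGNOSIS", "Smoking_Status")


def signals_from_features(cmp_feature_res):
    """Derive an ordered, de-duplicated signal list from a comma-separated string."""
    signals = []
    for entry in cmp_feature_res.split(","):
        name = entry.strip()
        # Strip a leading category prefix, if any (assumed to be at the start).
        for prefix in CATEGORY_PREFIXES:
            if name.startswith(prefix):
                name = name[len(prefix):]
                break
        # Drop empty entries and control features that are not signals.
        if not name or name in EXCLUDED_NAMES or name.startswith(EXCLUDED_PREFIXES):
            continue
        if name not in signals:
            signals.append(name)
    # These two signals are always included.
    for default in DEFAULT_SIGNALS:
        if default not in signals:
            signals.append(default)
    return signals


# Hypothetical input; "Example" is a placeholder, not a real signal name.
sample = "Hemoglobin,Age,FTR_000001,Smoking_Years,Hemoglobin,ATC_CODE:Example"
print(signals_from_features(sample))
```

In this sketch the Smoking prefix filter removes Smoking_Years, while Smoking_Status is still added afterwards as a default. Names that survive extraction but are absent from test.signals would still be skipped by the later existence check.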
Test Results Review
Primary files to inspect after running this test:
${WORK_DIR}/signals_cnt/ (per-signal TSVs)