Test 04 - Relabel & Create Samples
Purpose
Relabel an input samples file with outcome labels derived from the repository's diagnosis registry, produce a cleaned cohort file, and filter the samples to the requested comparison cohort. This prepares evaluation-ready sample files for the downstream bootstrap and comparison tests.
Note: This test is distinct from Test 03 - Create Samples from External Silent Runkit, which generates a samples cohort from raw inputs or an external file. Test 04 assumes a sample cohort has already been created (from Test 03) and focuses on relabeling and filtering by outcome codes.
Required Inputs
From configs/env.sh and the test invocation environment:
- WORK_DIR: Working folder where outputs will be written
- FIRST_WORK_DIR: Path to the reference run (contains Silent Run Samples and model outputs)
- BT_JSON: Path to the bootstrap features JSON; the default is ${FIRST_WORK_DIR}/json/bootstrap/bt_features.json
- COMPARE_COHORT: Cohort identifier used by FilterSamples to select a subset of samples
- CODE_LIST_FILE: File listing diagnosis codes to use (e.g., ICD lists)
- CODE_DIR: Directory containing code lists
- SUB_CODES: Comma-separated list of sub-cohort identifiers
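A quick check that these inputs are set can save a failed run. The following is a minimal sketch, not part of the TestKit: the helper names `missing_inputs` and `resolve_bt_json` are hypothetical, and it assumes the variables are exported into the environment after sourcing configs/env.sh. BT_JSON is treated as optional because the list above gives it a default.

```python
import os
from typing import List, Mapping

# Variables listed above that have no documented default.
REQUIRED_VARS = (
    "WORK_DIR",
    "FIRST_WORK_DIR",
    "COMPARE_COHORT",
    "CODE_LIST_FILE",
    "CODE_DIR",
    "SUB_CODES",
)


def missing_inputs(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def resolve_bt_json(env: Mapping[str, str]) -> str:
    """Return BT_JSON if set, else the documented default under FIRST_WORK_DIR."""
    explicit = env.get("BT_JSON")
    if explicit:
        return explicit
    return f"{env['FIRST_WORK_DIR']}/json/bootstrap/bt_features.json"
```

Calling `missing_inputs(os.environ)` before the test starts returns an empty list when everything is configured.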
How to Run
From your TestKit folder, execute:
Or include as part of the full suite:
Primary output files will be placed under ${WORK_DIR}/Samples and ${WORK_DIR}/outputs.
What This Test Does
- Creates an outcome registry with create_registry.py (if missing or OVERRIDE > 0):
  python ${CURR_PT}/resources/lib/create_registry.py --rep $REP_PATH --signal DIAGNOSIS --output ${WORK_DIR}/Samples/outcome.reg --end_of_data 20230101 --codes_dir ${CODE_DIR} --codes_list ${CODE_LIST_FILE} --sub_codes ${SUB_CODES}
- Relabels samples using relabel.py (writes dropped samples too):
  python ${CURR_PT}/resources/lib/relabel.py --registry ${WORK_DIR}/Samples/outcome.reg --samples ${FIRST_WORK_DIR}/Samples/3.test_cohort.samples --output ${OUTPUT} --output_dropout ${WORK_DIR}/Samples/dropped.samples --follow_up_controls 730 --time_window_case_maximal_before 730 --time_window_case_minimal_before 0 --future_cases_as_control 0 --sub_codes ${SUB_CODES}
- Runs samples_by_year.sh to show distribution by year and month.
- Computes sample statistics with samples_stats.py.
- Produces a cleaned sample file and runs FilterSamples to generate ${WORK_DIR}/Samples/3.test_cohort.samples filtered by COMPARE_COHORT and the bootstrap JSON (BT_JSON).
- Runs samples_by_year.sh again on the final cohort file.
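The window flags passed to relabel.py above (730-day case window, 730-day control follow-up, end of data 20230101) suggest a time-window labeling rule. The sketch below shows one plausible reading of those flags; it is an assumption for illustration, not the logic of relabel.py itself. The function name `label_sample`, the reason strings, and the YYYYMMDD integer date convention are all hypothetical.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def parse_yyyymmdd(value: int) -> date:
    """Convert an integer such as 20230101 into a date."""
    return date(value // 10000, (value // 100) % 100, value % 100)


def label_sample(
    sample_time: int,
    outcome_time: Optional[int],
    end_of_data: int = 20230101,
    follow_up_controls: int = 730,
    case_max_before: int = 730,
    case_min_before: int = 0,
    future_cases_as_control: bool = False,
) -> Tuple[Optional[int], str]:
    """Return (label, reason): 1 = case, 0 = control, None = dropped.

    ASSUMED semantics only; the real relabel.py may behave differently.
    """
    sample_day = parse_yyyymmdd(sample_time)
    if outcome_time is not None:
        # Days from the sample to the registry outcome.
        gap = (parse_yyyymmdd(outcome_time) - sample_day).days
        if gap < case_min_before:
            return None, "outcome_before_window"
        if gap <= case_max_before:
            return 1, "case_in_window"
        if not future_cases_as_control:
            return None, "outcome_after_window"
        # Otherwise fall through and treat the sample as a control candidate.
    # Controls need a full follow-up period before the end of the data.
    if sample_day + timedelta(days=follow_up_controls) > parse_yyyymmdd(end_of_data):
        return None, "insufficient_follow_up"
    return 0, "control"
```

Under this reading, a sample dated 20200101 with an outcome on 20200601 is a case, while a sample dated 20220601 with no outcome is dropped because 730 days of follow-up would extend past 20230101.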
Output Location
- Relabeled samples: ${WORK_DIR}/Samples/relabeled.samples
- Filtered test cohort: ${WORK_DIR}/Samples/3.test_cohort.samples
- Dropped samples: ${WORK_DIR}/Samples/dropped.samples, with exclusion reason
- Clean intermediate file: ${WORK_DIR}/Samples/clean.samples
- Statistics: ${WORK_DIR}/samples_stats (path passed to samples_stats.py)
How to Interpret Results
- Inspect ${WORK_DIR}/Samples/relabeled.samples to confirm that samples are labeled with outcome columns and have the expected counts.
- Check ${WORK_DIR}/Samples/dropped.samples to see why samples were excluded.
- Verify that ${WORK_DIR}/Samples/3.test_cohort.samples exists and matches the expected cohort selection.
- Review ${WORK_DIR}/model/04.create_samples.log for additional information and statistics.
Common failure modes and suggestions
- Missing or incorrect code lists (CODE_LIST_FILE / CODE_DIR):
  - Ensure code files (ICD lists, etc.) are present and the create_registry.py arguments point to the correct directory.
- Input sample file missing or malformed:
  - The test expects ${FIRST_WORK_DIR}/Samples/3.test_cohort.samples or the configured SILENCE_RUN_OUTPUT_FILES_PATH outputs.
- FilterSamples utility missing or not on PATH:
  - Ensure the FilterSamples executable is available in PATH.
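The failure modes above can be checked before launching the test. This is a minimal preflight sketch using only the Python standard library; the function name `preflight_problems` and its message wording are hypothetical, and the caller is assumed to pass the already-expanded paths (for example the values of CODE_DIR, CODE_LIST_FILE, and ${FIRST_WORK_DIR}/Samples/3.test_cohort.samples).

```python
import os
import shutil
from typing import List


def preflight_problems(
    code_dir: str,
    code_list_file: str,
    input_samples: str,
    filter_tool: str = "FilterSamples",
) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not os.path.isdir(code_dir):
        problems.append(f"code directory not found: {code_dir}")
    if not os.path.isfile(code_list_file):
        problems.append(f"code list file not found: {code_list_file}")
    # A missing or zero-byte samples file is reported the same way.
    if not os.path.isfile(input_samples) or os.path.getsize(input_samples) == 0:
        problems.append(f"input samples missing or empty: {input_samples}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(filter_tool) is None:
        problems.append(f"executable not on PATH: {filter_tool}")
    return problems
```

This only detects missing or empty files; it does not validate the contents of the code lists or the samples file.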
Test Results Review
Primary files to inspect after running this test:
- ${WORK_DIR}/Samples/relabeled.samples
- ${WORK_DIR}/Samples/3.test_cohort.samples
- ${WORK_DIR}/Samples/dropped.samples
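As a first pass over these files, the following sketch reports whether each one exists and how many lines it holds. The helper name `summarize_outputs` is hypothetical, and the line count includes any header line, since the samples file layout is not specified here.

```python
import os
from typing import Dict, Optional

# Primary output files listed above, relative to ${WORK_DIR}/Samples.
PRIMARY_FILES = (
    "relabeled.samples",
    "3.test_cohort.samples",
    "dropped.samples",
)


def summarize_outputs(work_dir: str) -> Dict[str, Optional[int]]:
    """Map each primary file name to its line count, or None if it is missing."""
    summary: Dict[str, Optional[int]] = {}
    samples_dir = os.path.join(work_dir, "Samples")
    for name in PRIMARY_FILES:
        path = os.path.join(samples_dir, name)
        if not os.path.isfile(path):
            summary[name] = None
            continue
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            summary[name] = sum(1 for _ in handle)
    return summary
```

A `None` entry points to a step that did not complete, and a count of zero for 3.test_cohort.samples suggests the COMPARE_COHORT filter matched nothing.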