Test 03 - Create Samples
Purpose
Generate sample cohorts from the prepared repository or from an external samples file. These samples are used by subsequent tests for scoring and evaluation.
Required Inputs
From configs/env.sh and the test invocation environment:
- WORK_DIR: Working folder where the repository and Samples outputs will be written
- SILENCE_RUN_OUTPUT_FILES_PATH: Either the special value GENERATE (to generate samples from the repository) or a path to an input TSV/CSV file containing sample definitions
- TAKE_JUST_LAST: Applies when SILENCE_RUN_OUTPUT_FILES_PATH is GENERATE. If set to 1, only the most recent Hemoglobin lab test date of each patient is taken as a candidate for the analysis; otherwise all Hemoglobin dates are used.
- FILTER_LAST_DATE: The reference matrix contains multiple dates for each patient. If set to 1, only the most recent date for each patient is kept. This can give a more appropriate analysis when the client dataset is also limited to the most recent date per patient.
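TAKE_JUST_LAST and FILTER_LAST_DATE apply to different data (Hemoglobin records and the reference matrix), but both come down to keeping only the latest date per patient. The following is a minimal Python sketch of that filter, not the TestKit implementation. It assumes records are (patient_id, date) pairs with dates as integer yyyymmdd values; the function name and record layout are illustrative only.

```python
def keep_last_per_patient(records):
    """Keep only the most recent (patient_id, date) record per patient.

    Assumes dates are integers in yyyymmdd form, so numeric order
    equals chronological order. Hypothetical helper for illustration.
    """
    latest = {}
    for patient_id, date in records:
        if patient_id not in latest or date > latest[patient_id]:
            latest[patient_id] = date
    # Return in patient-id order for a stable, readable result
    return sorted(latest.items())


records = [(1, 20190104), (1, 20200315), (2, 20180620), (2, 20170101)]
print(keep_last_per_patient(records))  # [(1, 20200315), (2, 20180620)]
```

With the flag unset, the equivalent behavior is simply to pass all records through unchanged.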
How to Run
From your TestKit folder, execute:
Check ${WORK_DIR}/Samples for produced sample files.
What This Test Does
Behavior depends on SILENCE_RUN_OUTPUT_FILES_PATH:
- If SILENCE_RUN_OUTPUT_FILES_PATH is GENERATE:
  - Reads the repository located under ${WORK_DIR}/rep (looks for *.repository)
  - Generates a samples file at ${WORK_DIR}/Samples/3.test_cohort.samples
  - Two generation modes exist (controlled by TAKE_JUST_LAST in the script):
    - If TAKE_JUST_LAST > 0: for each patient take only the last Hemoglobin record
    - Otherwise: take all Hemoglobin records for each patient
  - Copies the generated file to ${WORK_DIR}/Samples/1.all_potential.samples
- If SILENCE_RUN_OUTPUT_FILES_PATH is a path to an existing file:
  - The script converts the input file into the internal SAMPLE format, transforming dates from DD-MMM-YYYY-style strings into integer yyyymmdd timestamps and sorting samples.
  - Produces ${WORK_DIR}/Samples/test.bf.samples and ${WORK_DIR}/Samples/test.bf.orig.preds (original predictions preserved)
  - If ${WORK_DIR}/ETL/FinalSignals/ID2NR exists, the script remaps identifiers and writes ${WORK_DIR}/Samples/3.test_cohort.samples and ${WORK_DIR}/Samples/test.orig.preds. Otherwise it symlinks the generated files.
- The script also prepares ${WORK_DIR}/ref_matrix either by symlinking REFERENCE_MATRIX or by filtering it to the last date when FILTER_LAST_DATE > 0.
- Finally it runs samples_by_year.sh on the generated cohort file to produce year-based summaries.
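The date transformation in external-file mode can be pictured with a short Python sketch. This is an illustration under assumptions, not the script's awk code: it assumes the input dates look exactly like 04-Jan-2019 with English month abbreviations, and that samples are sorted by patient ID and then by time. The function name and the example rows are hypothetical.

```python
from datetime import datetime


def to_yyyymmdd(date_str):
    """Convert a DD-MMM-YYYY string (e.g. 05-Mar-2021) to the integer 20210305.

    Assumes English month abbreviations (default C locale).
    """
    parsed = datetime.strptime(date_str, "%d-%b-%Y")
    return parsed.year * 10000 + parsed.month * 100 + parsed.day


# Hypothetical (patient_id, date) rows as they might appear in an input file
rows = [("7", "15-Mar-2020"), ("3", "04-Jan-2019"), ("3", "01-Dec-2018")]

# Convert, then sort by patient ID and time (assumed sort order)
converted = sorted((int(pid), to_yyyymmdd(d)) for pid, d in rows)
print(converted)  # [(3, 20181201), (3, 20190104), (7, 20200315)]
```

Storing dates as yyyymmdd integers means a plain numeric sort is also a chronological sort.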
Output Location
- Main cohort samples: ${WORK_DIR}/Samples/3.test_cohort.samples
- All potential samples (copy): ${WORK_DIR}/Samples/1.all_potential.samples
- Intermediate samples: ${WORK_DIR}/Samples/test.bf.samples
- Original predictions file: ${WORK_DIR}/Samples/test.bf.orig.preds and/or ${WORK_DIR}/Samples/test.orig.preds
- Reference matrix (symlink or filtered): ${WORK_DIR}/ref_matrix
How to Interpret Results
- Verify that every generated file is non-empty and contains data rows, not just a header line.
- Verify there are no errors and the execution finished successfully.
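The file checks above can be automated with a small Python sketch. It assumes each output file starts with a single header line and that WORK_DIR is exported in the environment. The helper names are hypothetical, and the default file list covers only the main cohort file; extend it to whichever outputs apply to your mode.

```python
import os


def has_data_rows(path):
    """Return True if the file exists and has a non-blank line after the first line."""
    if not os.path.isfile(path):  # follows symlinks, so linked outputs are accepted
        return False
    with open(path) as fh:
        next(fh, None)  # skip the assumed header line
        return any(line.strip() for line in fh)


def check_outputs(work_dir, names=("3.test_cohort.samples",)):
    """Map each expected file name under <work_dir>/Samples to True or False."""
    samples_dir = os.path.join(work_dir, "Samples")
    return {name: has_data_rows(os.path.join(samples_dir, name)) for name in names}


if __name__ == "__main__":
    work_dir = os.environ.get("WORK_DIR", ".")
    for name, ok in check_outputs(work_dir).items():
        print(("OK      " if ok else "PROBLEM ") + name)
```

A file reported as PROBLEM is either missing or holds nothing beyond its first line.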
Common failure modes and suggestions
- Input file formatting errors:
  - If the external file isn't in the expected format (columns in a different order or a different date format), the awk parsing and date conversion will produce incorrect times. Confirm column positions and pre-normalize the file if needed.
- Missing repository or incorrect *.repository file:
  - If generation mode is selected but the repository isn't present or has an unexpected schema, the process will fail.
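Before running in external-file mode, a quick pre-check can flag rows whose date column does not parse. The Python sketch below assumes a tab-separated file with one header row. The column index, the %d-%b-%Y format, and the example columns (id, date, pred) are placeholders to adjust to the actual input file, not a description of the format the script expects.

```python
import csv
import io
from datetime import datetime


def find_bad_dates(lines, date_col, fmt="%d-%b-%Y", delimiter="\t"):
    """Return (line_number, value) for each data row whose date column fails to parse.

    Line numbers are 1-based and count the header as line 1.
    """
    bad = []
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the assumed header row
    for line_no, row in enumerate(reader, start=2):
        value = row[date_col] if date_col < len(row) else ""
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((line_no, value))
    return bad


# Hypothetical input: the second data row uses a different date format
sample = io.StringIO("id\tdate\tpred\n1\t04-Jan-2019\t0.3\n2\t2019-01-04\t0.7\n")
print(find_bad_dates(sample, date_col=1))  # [(3, '2019-01-04')]
```

For a real file, pass an open file handle instead of the StringIO object, for example one opened with open(path, newline="").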
Example output snippets
1) Generated sample header and a sample line:
2) When ID2NR mapping is used, final lines preserve the remapped ID as the second column.
Notes and Implementation Details
- The script uses several small utilities and conventions from the TestKit (Flow, paste.pl, samples_by_year.sh). Ensure these helper scripts are available on PATH or configured in configs/env.sh.
Test Results Review
Primary files to inspect after running this test:
- ${WORK_DIR}/Samples/3.test_cohort.samples
- ${WORK_DIR}/Samples/test.bf.samples
- ${WORK_DIR}/Samples/test.orig.preds (when present)
- ${WORK_DIR}/ref_matrix