Categorical Signals & Custom Dictionaries
This page explains how categorical signals are handled in the ETL process, with examples of when to use known ontologies, how to deal with client-provided values, and how to integrate custom mapping dictionaries.
Use Case 1 – Known Signals with Standard Ontologies
When using a known categorical signal from ETL_INFRA_DIR/rep_signals/general.signals (e.g., DIAGNOSIS, Drug, PROCEDURE), the ETL automatically applies existing ontologies and mappings between codes.
Example:
For the Drug signal, if you create values with the RX_CODE prefix, the ETL will detect this and automatically pull:
- The
RX_CODEdictionary, - The
ATCdictionary, and - The mapping between
RX_CODEandATC.
You only need to set the correct prefixes in prepare_final_signals processings.
The call to finish_prepare_load takes care of the rest. No need to do anything special.
Known Ontologies and Prefixes
| Coding system prefix | description |
|---|---|
| ICD10_CODE: | Diagnosis or procedure with ICD10 codes. For PROCEDURE signal, uses procedure ontology |
| ICD9_CODE: | Diagnosis or procedure with ICD9 codes. For PROCEDURE signal, uses procedure ontology |
| ATC_CODE: | Medication prescriptions in ATC codes |
| RX_CODE: | Medications prescriptions in RX norm |
| NDC_CODE: | Medications in NDC codes |
Notes:
- Please strip "." from ICD10/ICD9 codes
Use Case 2 – New Signals from Client (List of Values)
Sometimes we receive a signal that is not part of a known ontology and comes only as a list of values from the client.
Example:
A signal like Cancer_Type with values such as:
AdenocarcinomaSmall_Cells- etc. (extracted from cancer patients)
What to do:
- Define the new categorical signal in CODE_DIR/configs/rep.signals
➡️ No manual mapping is needed. It will processed in the end as part of
finish_prepare_loadcall later
Use Case 3 – New or Known Signals with Additional Client Dictionaries
Sometimes the signal is known (e.g., DIAGNOSIS) or new, but the client provides extra mapping dictionaries.
Example:
The client uses an internal coding system (EDG_CODE) and provides:
- Translation dictionary - maps internal codes to descriptions.
- Example:
`EDG_CODE:1234→ Diabetes type II
- Example:
- Mapping dictionary - maps internal codes to another known ontology.
- Example:
EDG_CODE:1234(Diabetes type II) →ICD10_CODE:E11
- Example:
Notes:
- Sometimes only #1 (translation) is available → still valid.
- Sometimes only #2 (mapping) is available → also valid.
- If the ontology is common and reusable, we may store the mapping dictionary in ETL for future use. We will need to change the code in
create_dicts.py, currently it is not very easily extended.
How to use
Use the function prepare_dicts with up to two optional dataframes:
- Translation dictionary:
| Column | Meaning |
|---|---|
code |
Internal code |
description |
Human-readable description |
- Mapping dictionary:
| Column | Meaning |
|---|---|
client_value |
Value from client |
ontology_code |
Code from our known ontology |
How-To: Reading the Output of prepare_dicts / finish_prepare_load
During processing, the ETL produces log messages with statistics about how the dictionaries were handled.
What to Expect
- Known codes detected - how many values already exist in our mappings.
- New codes detected - how many values were introduced for the first time (e.g., new ICD10 codes for new diseases).
- Automatic mapping attempts - in some cases, new codes are mapped by truncating strings to a higher-level category (e.g., grouping a specific disease into a broader disease family).
Why It Matters
- Helps identify if client data aligns well with existing ontologies.
- Flags new codes that may need review or long-term integration.
- Provides confidence that signal values were normalized as expected.