Unified Smoking Feature Generator

Background

The purpose of the Unified Smoking Feature Generator is to generate smoking-related features from the different types of smoking information that are available. It was developed based on the THIN and KPSC databases.

Input Signals

The generator currently requires the following signals (it no longer depends on the THIN smoking_quantity signal):

Signal	THIN	KPSC	KPNW
Smoking_Status	v	v	v
Smoking_Quit_Date	v	x	v
Pack_Years	x	v	x
Smoking_Intensity [Cigs/Day]	v	x	v
Smoking_Duration [Years]	x	v	v

Note: Every repository should have these signals, even if they are not available (in that case they should be empty signals).

The Smoking_Status signal is a categorical signal with the following values: Never, Passive, Former, Current, Never_or_Former. Extraction of the status in THIN is described in the Appendix.

Output Features

Boolean features:
1. Current_Smoker
2. Ex_Smoker
3. Never_Smoker
4. Passive_Smoker
5. Unknown_Smoker
6. NLST_Criterion - 1 if age is between 55 and 74, pack years > 30, and time since quitting < 15 years.

Numeric features:
1. Smok_Days_Since_Quitting - 0 for current smokers, time since quitting for former smokers, time since birth for never smokers.
2. Smok_Years_Since_Quitting - same as the previous feature, but in years.
3. Smok_Pack_Years_Max - maximal reported pack years if a pack-years report is available; if not, it is estimated (and can be corrected using the intensity).
4. Smok_Pack_Years - the same as Smok_Pack_Years_Max.
5. Smok_Pack_Years_Last - last pack-years report (without estimation).
6. Smoking_Intensity - number of packs per day.
7. Smoking_Years - smoking duration.
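As an illustration of the NLST_Criterion rule listed above, here is a minimal Python sketch. The function name and argument names are hypothetical, and treating both age bounds as inclusive is an assumption; the generator's actual boundary handling is not documented here.

```python
def nlst_criterion(age_years: float, pack_years: float, years_since_quitting: float) -> int:
    """Return 1 if the thresholds listed for NLST_Criterion are met, else 0.

    Hypothetical helper for illustration; not the generator's own code.
    """
    age_ok = 55 <= age_years <= 74            # assumption: both bounds inclusive
    exposure_ok = pack_years > 30             # strict inequality, as written above
    recent_ok = years_since_quitting < 15     # current smokers have 0 years since quitting
    return int(age_ok and exposure_ok and recent_ok)


# A 60-year-old current smoker with 35 pack years meets the rule;
# the same person with exactly 30 pack years does not.
print(nlst_criterion(60, 35, 0))
print(nlst_criterion(60, 30, 0))
```

Note that never smokers fail the rule under this reading: their pack years are not above 30, and their "time since quitting" is defined as time since birth.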

Config Example

"model_actions": [
    {
      "action_type": "feat_generator",
      "fg_type": "unified_smoking",
      "smoking_features": "Current_Smoker,Ex_Smoker,Never_Smoker, Unknown_Smoker,Smoking_Years,Smok_Years_Since_Quitting,Smok_Pack_Years,Smoking_Intensity"
    }
  ]
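Before running a model, it can be useful to check that the names in smoking_features match the documented output features. The sketch below is a hypothetical stand-alone check in Python, not part of the generator; it strips whitespace around names (the example above has a space before Unknown_Smoker), and whether the generator itself tolerates such whitespace is not stated here.

```python
import json
from typing import List

# Feature names documented in the "Output Features" section.
KNOWN_FEATURES = {
    "Current_Smoker", "Ex_Smoker", "Never_Smoker", "Passive_Smoker",
    "Unknown_Smoker", "NLST_Criterion", "Smok_Days_Since_Quitting",
    "Smok_Years_Since_Quitting", "Smok_Pack_Years_Max", "Smok_Pack_Years",
    "Smok_Pack_Years_Last", "Smoking_Intensity", "Smoking_Years",
}


def parse_smoking_features(action: dict) -> List[str]:
    """Split the comma-separated feature list and reject undocumented names."""
    names = [n.strip() for n in action["smoking_features"].split(",") if n.strip()]
    unknown = [n for n in names if n not in KNOWN_FEATURES]
    if unknown:
        raise ValueError("Unknown smoking features: " + ", ".join(unknown))
    return names


action = json.loads("""
{
  "action_type": "feat_generator",
  "fg_type": "unified_smoking",
  "smoking_features": "Current_Smoker,Ex_Smoker,Never_Smoker, Unknown_Smoker,Smoking_Years,Smok_Years_Since_Quitting,Smok_Pack_Years,Smoking_Intensity"
}
""")
features = parse_smoking_features(action)
print(features)
```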

Logic Explanation

The most basic information we need to extract is the smoking status at different time points. The logic is based on the paper "Development of an algorithm for determining smoking status and behaviour over the life course from UK electronic primary care records": https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/pdf/12911_2016_Article_400.pdf

The workflow is built from the following methods:
1. genFirstLastSmokingDates - for each status, find the first and last time it appears. This is the input for setting the status at each smoking status report.
2. genSmokingStatus - generate a corrected smoking status for each point in the smoking status vector. See Figure 1.
3. genSmokingRanges - build the smoking status ranges.
4. genLastStatus - set the Boolean smoking status features (take the last status according to the previous method's output).
5. calcQuitTime - generate Smok_Days_Since_Quitting/Smok_Years_Since_Quitting. Check the last status in the ranges vector: if former smoker, take the delta between the sample time and the beginning of the "former smoking" period; if current smoker, take 0; if never smoker, return the time since birth date.
6. calcSmokingIntensity - return the smoking intensity (averages the smoking intensity vector).
7. calcPackYears - set pack years according to the pack years vector.
8. calcSmokingDuration - return the smoking duration. Runs over the ranges vector and integrates the periods in which the status is "Current smoker".
9. fixPackYearsSmokingIntensity - fix pack years using smoking intensity and duration. If intensity is unknown and pack years is known, calculate the intensity.

Example: taken from THIN, birth date: July 1959, sample date: 05/08/2011. Marked in grey: input (raw) data.

Smoking Status		19900315 Current	19970227 Never	19970227 Never_or_Former	20060824 Never
Smoking Intensity		19900315 15.000000	19970227 0.000000	20060824 0.000000	20060824 0.000000
Quit time
Pack years
Smoking Status Processed	19590700 UNKNOWN_SMOKER	19900315 CURRENT_SMOKER	19970227 EX_SMOKER	19970227 EX_SMOKER	20060824 EX_SMOKER
Smoking Status Ranges	19590700-19781231 UNKNOWN_SMOKER	19790101-19930904 CURRENT_SMOKER	19930905-20110805 EX_SMOKER
Intensity Out:	15
Duration Out:	14.684932
Quit time:	17.926027
Pack years:	11.013699
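
The output values in the example above can be reproduced from the "Smoking Status Ranges" row with simple date arithmetic. The Python sketch below is an illustration, not the generator's code: the function names are hypothetical, and the constants (365 days per year, 20 cigarettes per pack) are assumptions chosen because they match the numbers in the table.

```python
from datetime import date

DAYS_PER_YEAR = 365.0   # assumption: matches the example outputs
CIGS_PER_PACK = 20.0    # assumption: matches the example outputs

# (start, end, status) copied from the "Smoking Status Ranges" row.
# The birth day is given as 00 in the table; day 1 is used here and
# does not affect any of the results.
ranges = [
    (date(1959, 7, 1), date(1978, 12, 31), "UNKNOWN_SMOKER"),
    (date(1979, 1, 1), date(1993, 9, 4), "CURRENT_SMOKER"),
    (date(1993, 9, 5), date(2011, 8, 5), "EX_SMOKER"),
]
birth_date = date(1959, 7, 1)
sample_date = date(2011, 8, 5)
intensity_cigs_per_day = 15.0  # "Intensity Out" in the table


def smoking_duration_years(ranges):
    """Sum the length of all CURRENT_SMOKER ranges, in years."""
    days = sum((end - start).days
               for start, end, status in ranges
               if status == "CURRENT_SMOKER")
    return days / DAYS_PER_YEAR


def quit_time_years(ranges, sample_date, birth_date):
    """Years since quitting, decided by the last range's status."""
    start, _end, status = ranges[-1]
    if status == "CURRENT_SMOKER":
        return 0.0
    if status == "EX_SMOKER":
        return (sample_date - start).days / DAYS_PER_YEAR
    if status == "NEVER_SMOKER":
        return (sample_date - birth_date).days / DAYS_PER_YEAR
    raise ValueError("quit time is not defined for status " + status)


duration = smoking_duration_years(ranges)
quit_time = quit_time_years(ranges, sample_date, birth_date)
pack_years = intensity_cigs_per_day / CIGS_PER_PACK * duration

print(f"{duration:.6f}")    # 14.684932
print(f"{quit_time:.6f}")   # 17.926027
print(f"{pack_years:.6f}")  # 11.013699
```

Since no pack-years report exists for this patient, the pack-years value is the estimate from intensity and duration: 15 / 20 packs per day over 14.684932 years.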

Appendix - Extracting Smoking Status in THIN

In the THIN database, smoking status is extracted from Read codes. The mapping from codes to status is taken from "Development of an algorithm for determining smoking status and behaviour over the life course from UK electronic primary care records": https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/pdf/12911_2016_Article_400.pdf

I have noticed that there are many "collisions" in the smoking status vector when using this mapping (meaning two different statuses on the same date): ~10%. After removing non-conclusive Read codes, this was reduced to ~0.5%. When the old THIN smoking feature generator was used in a simple LR model for lung cancer, the AUC improved by 1 point. See the original and modified mapping in the table below.

smoking_readcodes_combined.csv

Figure 1 - Logic for setting the smoking status.

The code that generates the smoking vectors in THIN: http://bitbucket:7990/projects/MED/repos/gensmoking/browse
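
A "collision" as defined above (two different statuses on the same date) can be counted with a few lines of Python. This is a hypothetical sketch, not the code from the repository linked above; in particular, the denominator behind the ~10% and ~0.5% figures is not stated, so using the number of distinct report dates is an assumption.

```python
from collections import defaultdict


def collision_rate(reports):
    """Fraction of report dates that carry more than one distinct status.

    reports: iterable of (date_as_int, status) pairs for one patient
    or for a whole cohort (then use (patient_id, date) as the key).
    """
    statuses_by_date = defaultdict(set)
    for report_date, status in reports:
        statuses_by_date[report_date].add(status)
    if not statuses_by_date:
        return 0.0
    colliding = sum(1 for s in statuses_by_date.values() if len(s) > 1)
    return colliding / len(statuses_by_date)


# Raw Smoking Status row from the example in "Logic Explanation":
# 19970227 carries both Never and Never_or_Former.
reports = [
    (19900315, "Current"),
    (19970227, "Never"),
    (19970227, "Never_or_Former"),
    (20060824, "Never"),
]
rate = collision_rate(reports)
print(rate)  # one colliding date out of three
```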