Skip to content

Unified Smoking Feature Generator

Background

The purpose of the Unified Smoking Feature Generator is to generate smoking related features based on different types of available smoking information. It was based on THIN and KPSC databases.

Input Signals

The generator currently require the following signals (it doesn't depend anymore on the THIN smoking_quantity signal):

Signal THIN KPSC KPNW
Smoking_Status v v v
Smoking_Quit_Date v x v
Pack_Years x v x
Smoking_Intensity [Cigs/Day] v x v

Smoking_Duration [Years]

x v v

Note:  Every repository should have those signals, even if they are not  (in that case they should be empty signals) The Smoking_Status signal is a categorical signal, with the following values: Never, Passive, Former, Current, Never_or_Former.    Extraction of the status in THIN is described in the Appendix

Output Features

Boolean features: 1. Current_Smoker 2. Ex_Smoker 3. Never_Smoker 4. Passive_Smoker 5. Unknown_Smoker 6. NLST_Criterion - 1 if age between 55 to 74, pack years > 30, time since quitting < 15 years. ** features:** 1. Smok_Days_Since_Quitting - For current smokers - 0, For Former smokers, time since quitting, for Never Smokers - time since birth 2. Smok_Years_Since_Quitting - same as previous, but in years 3. Smok_Pack_Years_Max - Maximal report of pack years (pack years if available) if not, it is estimated (and can be corrected with intensity 4. Smok_Pack_Years - the same as Smok_Pack_Years_Max 5. Smok_Pack_Years_Last - Last pack years report (without estimation) 6. Smoking_Intensity - Number of pack per day 7. Smoking_Years - Smoking duration.

Config Example

1
2
3
4
5
6
7
"model_actions": [
    {
      "action_type": "feat_generator",
      "fg_type": "unified_smoking",
      "smoking_features": "Current_Smoker,Ex_Smoker,Never_Smoker, Unknown_Smoker,Smoking_Years,Smok_Years_Since_Quitting,Smok_Pack_Years,Smoking_Intensity"
    }
  ]

Logic Explanation

The most basic information we need to extract is smoking status on different time points The logic is based on the paper: Development of an algorithm for determining smoking status and behaviour over the life course from UK electronic primary care records  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/pdf/12911_2016_Article_400.pdf The workflow is built  from the following methods: 1. genFirstLastSmokingDates - For each status find the first time and last time it appears. This is the input for setting the status at each smoking status report. 2. genSmokingStatus - generate for each point in smoking status vector a corrected smoking status. See Figure 1 3. genSmokingRanges - Build Smoking status ranges  4. genLastStatus - Set the Boolean  smoking status features (take the last status according the the previous method output) 5. calcQuitTime - generates Smok_Days_Since_Quitting/Smok_Years_Since_Quitting. Check that last status in the ranges vector - If former smoker, take the delta between sample time to beginning of the "former smoking" period, if Current smoker, take 0. if never smoker return time since birth date. 6. calcSmokingIntensity - returns smoking intensity (averages the smoking intensity vector). 7. calcPackYears - Set pack years according to the pack years vector. 8. calcSmokingDuration - Return duration. runs over the ranges vector and integrates the period in which the status is "Current smoker" 9. fixPackYearsSmokingIntensity - Fix pack years using smoking intensity  and duration. If Intensity is unknown and pack years is known calculate intensity.   Example: Taken from THIN, birth date : July 1959, sample date 05/08/2011. Marked in Grey - Input (Raw) Data

Smoking Status   19900315 Current 19970227 Never 19970227 Never_or_Former 20060824 Never
Smoking Intensity

 

19900315 15.000000

19970227 0.000000 20060824 0.000000 20060824 0.000000
Quit time          
Pack years          
Smoking Status Processed 19590700 UNKNOWN_SMOKER 19900315 CURRENT_SMOKER 19970227 EX_SMOKER 19970227 EX_SMOKER 20060824 EX_SMOKER
Smoking Status Ranges 19590700-19781231 UNKNOWN_SMOKER 19790101-19930904 CURRENT_SMOKER 19930905-20110805 EX_SMOKER    
Intensity Out: 15        
Duration Out: 14.684932        
Quit time: 17.926027        
Pack years: 11.013699        

Appendix - Extracting Smoking Status in THIN

In THIN database, smoking status is extracted from Read codes.  The mapping from codes to status is taken from  "Development of an algorithm for determining smoking status and behaviour over the life course from UK electronic primary care records:  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/pdf/12911_2016_Article_400.pdf I have noticed that there are a lot of "collisions" in the smoking status vector when using  this mapping (meaning two different status in the same date) - ~10%.  After removing non-conclusive  Read codes - this was reduced to ~0.5%. When the old THIN smoking feature generator was used in a simple LR model for lung cancer AUC was improved in 1 point. See original and modified mapping in the table below.  smoking_readcodes_combined.csv   Figure 1 - Logic for setting the smoking status. The code that generates the smoking vectors in THIN: http://bitbucket:7990/projects/MED/repos/gensmoking/browse