Finalizing the Load Process

This step completes the ETL pipeline and prepares the repository for use.

Recap of earlier steps

Prepare signals: Run prepare_final_signals for each data type (see previous step)
Handle client dictionaries (if needed): Use prepare_dicts for categorical signals

A third function, finish_prepare_load, finalizes the preparation and generates all configuration files needed for loading.

`finish_prepare_load`

Finalizes preparation and loads your data into the repository.

finish_prepare_load(WORK_DIR, '/nas1/Work/CancerData/THIN/thin_20XX', 'thin')

Parameters:

WORK_DIR - path to the working directory (string)
REPOSITORY_OUTPUT_DIR - destination folder for the repository (string)
REPO_NAME - name of the repository (string)

Full Workflow Example

Here’s a complete example combining all steps:

import pandas as pd
from ETL_Infra.etl_process import *
from parser import generic_file_fetcher, generic_big_files_fetcher

WORK_DIR = '/nas1/Work/demo_ETL'

# Step 1: Prepare signals
prepare_final_signals(
    generic_file_fetcher("^demo.*"),  # Fetch files starting with "demo"
    WORK_DIR,
    "demographic",  # Name of this processing pipeline
    batch_size=0,   # Process all files in a single batch
    override="n"    # Skip if already successfully completed
)
prepare_final_signals(
    generic_big_files_fetcher("^labs.*"),  # Fetch files starting with "labs"
    WORK_DIR,
    "labs",  
    batch_size=1_000_000,  # Process 1M lines per batch
    override="n"    # Skip if already successfully completed
)

# Step 2 (optional): Handle custom dictionaries
# Provide client dicts as DataFrames: def_dict, set_dict (or None)
prepare_dicts(WORK_DIR, 'DIAGNOSIS', def_dict, set_dict)

# Step 3: Finalize and load
finish_prepare_load(WORK_DIR, '/nas1/Work/CancerData/THIN/thin_20XX', 'thin')
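
The fetcher arguments in the example above ("^demo.*", "^labs.*") are regular expressions. The following is a minimal sketch, using only Python's standard `re` module, of which file names such patterns would select. It assumes the fetcher matches the pattern against the file name from its start; `select_files` and the file names are hypothetical and are not part of ETL_Infra, whose actual matching logic may differ.

```python
import re

def select_files(file_names, pattern):
    """Return the names that match `pattern` from the start of the name.

    Hypothetical helper for illustration; not an ETL_Infra function.
    """
    regex = re.compile(pattern)
    return [name for name in file_names if regex.match(name)]

# Hypothetical directory listing, for illustration only
files = ["demo_2020.csv", "demographics.txt", "labs_part1.csv", "old_labs.csv"]

demo_files = select_files(files, "^demo.*")
labs_files = select_files(files, "^labs.*")

print(demo_files)
print(labs_files)
```

Under this assumption, "^demo.*" also picks up "demographics.txt", so a tighter pattern may be worth using when unrelated files share a prefix.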

Function Reference

1. prepare_final_signals
Processes and tests each data type. Handles batching if needed.

Arguments:
data_fetcher or DataFrame: Source of your data
workdir: Working directory for outputs
signal_type: Name/type of the signal (used for classification)
batch_size: Batch size (0 = no batching)
override: 'y' to overwrite, 'n' to skip completed signals
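
To get a feel for the batch_size argument, the sketch below estimates how many batches a given input would produce. It assumes that batch_size counts input lines, as the workflow example's comment suggests, and that 0 means a single batch; `count_batches` is a hypothetical helper, not an ETL_Infra function, and the real splitting logic may differ.

```python
import math

def count_batches(total_lines, batch_size):
    """Estimate the number of batches for `total_lines` input lines.

    Hypothetical helper: batch_size == 0 is treated as "no batching",
    meaning all lines are processed in one batch.
    """
    if total_lines <= 0:
        return 0
    if batch_size == 0:
        return 1
    return math.ceil(total_lines / batch_size)

# 2.5M lines with no batching, then with 1M lines per batch
print(count_batches(2_500_000, 0))
print(count_batches(2_500_000, 1_000_000))
```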

2. prepare_dicts
Creates mapping dictionaries for categorical signals.

Arguments:
workdir: Working directory
signal: Signal name
def_dict: DataFrame with internal codes and descriptions (optional)
set_dict: DataFrame mapping client codes to known ontology (optional)

3. finish_prepare_load
Finalizes preparation, generates signals, and loads the repository.

Arguments:
workdir: Working directory
dest_folder: Destination for the repository
dest_rep: Repository name (prefix)
to_remove (optional): List of signals to skip
load_only (optional): List of signals to load; if given, only these signals are loaded
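
One way to reason about to_remove and load_only is as filters over the set of prepared signals. The sketch below models that reading only: it assumes load_only restricts loading to the listed signals and to_remove excludes the listed signals. `select_signals` and the signal names are hypothetical; how finish_prepare_load combines the two arguments internally is not documented here.

```python
def select_signals(prepared, to_remove=None, load_only=None):
    """Model which signals would be loaded under the assumed semantics.

    Hypothetical helper, not an ETL_Infra function: keep only the
    `load_only` signals if that list is given, then drop `to_remove`.
    """
    selected = list(prepared)
    if load_only:
        wanted = set(load_only)
        selected = [s for s in selected if s in wanted]
    if to_remove:
        skipped = set(to_remove)
        selected = [s for s in selected if s not in skipped]
    return selected

# Hypothetical signal names, for illustration only
prepared = ["BDATE", "GENDER", "Hemoglobin", "DIAGNOSIS"]

print(select_signals(prepared, to_remove=["Hemoglobin"]))
print(select_signals(prepared, load_only=["DIAGNOSIS"]))
```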

Extending and Testing

For guidance on extending the process and adding automated tests, see Test Extension.