CAPE Replication Project - Script Reference Guide

Table of Contents

  1. Processing Scripts
  2. Development Scripts
  3. Forecasting Scripts
  4. Analysis Scripts
  5. Utility Scripts

Processing Scripts

cape_preprocessing.py

File Location: scripts/processing/cape_preprocessing.py

Purpose: Main orchestrator for the preprocessing workflow with step-by-step execution and progress tracking.

Key Functions:

  • main(start_from): Main execution function
  • mark(name): Creates a completion marker file
  • done(name): Checks whether a step has completed

Command Line Usage:

# Run from step 1 (data streaming)
python cape_preprocessing.py --start-from 1

# Run from step 2 (data preprocessing)
python cape_preprocessing.py --start-from 2

# Run from step 3 (data aggregation)
python cape_preprocessing.py --start-from 3

Parameters:

  • --start-from or -s: Starting step (1, 2, or 3)
      • 1: Data streaming
      • 2: Data preprocessing
      • 3: Data aggregation

Output Files:

  • 1_data_stream.done: Completion marker for data streaming
  • 2_data_preprocessing.done: Completion marker for preprocessing
  • 3_data_aggregation.done: Completion marker for aggregation
  • logs/log_preprocessing.txt: Log file
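The marker-file mechanism described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the marker names come from the output files listed in this guide, while the `actions` mapping, the `marker_dir` argument, the default starting step, and the check that earlier steps already have markers are all assumptions.

```python
import argparse
from pathlib import Path

# Marker names follow the output files listed in this guide.
STEPS = {1: "1_data_stream", 2: "2_data_preprocessing", 3: "3_data_aggregation"}


def mark(name, marker_dir="."):
    """Create an empty <name>.done file to record that a step finished."""
    path = Path(marker_dir) / f"{name}.done"
    path.touch()
    return path


def done(name, marker_dir="."):
    """Return True if the <name>.done marker file exists."""
    return (Path(marker_dir) / f"{name}.done").exists()


def main(start_from, actions, marker_dir="."):
    """Run every step numbered start_from or higher, marking each on success.

    `actions` (hypothetical) maps a step number to the callable for that step.
    Assumed behaviour: earlier steps must already have a marker file.
    """
    for step in sorted(STEPS):
        if step < start_from and not done(STEPS[step], marker_dir):
            raise RuntimeError(f"step {step} has not completed; start earlier")
    executed = []
    for step in sorted(STEPS):
        if step >= start_from:
            actions[step]()
            mark(STEPS[step], marker_dir)
            executed.append(step)
    return executed


def parse_args(argv=None):
    """Parse --start-from / -s, restricted to the three documented steps."""
    parser = argparse.ArgumentParser(description="CAPE preprocessing (sketch)")
    # The default of 1 is an assumption; the guide does not state a default.
    parser.add_argument("-s", "--start-from", type=int, choices=[1, 2, 3], default=1)
    return parser.parse_args(argv)
```

Because a marker is written only after its step returns, a failed step leaves no marker, and a later run can resume from that step with --start-from.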


data_stream.py

File Location: scripts/processing/data_stream.py

Purpose: Handles data streaming from external sources and manages data pipelines.

Key Functions:

  • data_stream(): Main streaming function
  • Downloads and processes raw data from various sources
  • Saves streamed data to the ./data_stream/ directory

Dependencies:

  • External data sources (specified in configuration)
  • Network connectivity for data downloads

Output: Streamed data files in ./data_stream/ directory


data_preprocessing.py

File Location: scripts/processing/data_preprocessing.py

Purpose: Cleans and transforms raw data for analysis.

Key Functions:

  • data_preprocessing(): Main preprocessing function
  • Handles missing values and outliers
  • Applies data transformations and normalization
  • Performs spatial/temporal adjustments

Processing Steps:

  1. Load raw data from streamed sources
  2. Clean missing values and outliers
  3. Apply transformations (normalization, scaling)
  4. Perform spatial/temporal adjustments
  5. Save processed data to the ./data_processed/ directory

Output: Processed data files in ./data_processed/ directory


data_aggregation.py

File Location: scripts/processing/data_aggregation.py

Purpose: Aggregates processed data into unified formats for analysis.

Key Functions:

  • data_aggregation(): Main aggregation function
  • Computes spatial and temporal averages
  • Combines multiple datasets
  • Creates unified output files

Aggregation Operations:

  • Spatial averaging across regions
  • Temporal aggregation (daily to monthly)
  • Multi-dataset combination
  • Format standardization (CSV, HDF)

Output: Aggregated data files in ./data_processed/output/ directory
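
The daily-to-monthly temporal aggregation listed above can be illustrated with a small standard-library sketch. The function name monthly_means and the (date, value) input shape are assumptions made for illustration; the actual script may work on different data structures.

```python
from collections import defaultdict
from datetime import date


def monthly_means(daily):
    """Average daily (date, value) pairs into one mean per (year, month)."""
    buckets = defaultdict(list)
    for day, value in daily:
        buckets[(day.year, day.month)].append(value)
    # Sort keys so the result is ordered chronologically.
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


if __name__ == "__main__":
    sample = [
        (date(2020, 1, 1), 1.0),
        (date(2020, 1, 2), 3.0),
        (date(2020, 2, 1), 5.0),
    ]
    print(monthly_means(sample))
```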


tools.py (Processing)

File Location: scripts/processing/tools.py

Purpose: Utility functions for data processing operations.

Key Functions:

  • Data validation utilities
  • File handling helpers
  • Common data transformation functions
  • Error handling utilities


Development Scripts

cape_development.py

File Location: scripts/development/cape_development.py

Purpose: Main orchestrator for model development and training experiments.

Key Functions:

  • main(force=False): Main execution function
  • month_range(): Computes forecast lead months
  • find_exist_case(): Checks for existing experiment outputs

Command Line Usage:

# Run normal development workflow
python cape_development.py

# Force re-run all experiments
python cape_development.py --force

Parameters:

  • -f or --force: Re-run all experiments, ignoring existing outputs

Configuration Requirements:

  • config/config.json: Path configurations
  • cape_setting.csv: Experiment settings
  • fnids_info.hdf: FNID information

Workflow:

  1. Load experiment settings from CSV
  2. Compute lead time matrices
  3. Filter FNIDs (require >12 records)
  4. Execute forecasting experiments via subprocess calls
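
Steps 3 and 4 of the workflow above can be sketched roughly as follows. The helper names filter_fnids and build_command, and the dictionary shapes they accept, are hypothetical; only the >12 record threshold and the cape_development_sim.py arguments come from this guide.

```python
import sys

MIN_RECORDS = 12  # the guide keeps FNIDs with more than 12 records


def filter_fnids(record_counts):
    """Return the FNIDs whose record count exceeds MIN_RECORDS."""
    return [fnid for fnid, count in record_counts.items() if count > MIN_RECORDS]


def build_command(case, script="cape_development_sim.py"):
    """Build the argument list for one experiment run.

    `case` (hypothetical) is a dict holding one row of experiment settings.
    """
    keys = ("fnid", "product_name", "season_name", "model_name", "exp_name", "window")
    return [sys.executable, script] + [f"--{key}={case[key]}" for key in keys]
```

Each resulting list could then be passed to subprocess.run(cmd, check=True) so that a failing experiment raises an error instead of passing silently.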


cape_development_sim.py

File Location: scripts/development/cape_development_sim.py

Purpose: Individual experiment execution script for model training and prediction.

Key Functions:

  • main(): Main execution function
  • Parses command-line arguments
  • Configures experiment settings
  • Executes simulation via cape_sim_build

Command Line Usage:

python cape_development_sim.py \
  --fnid=AO2008A101 \
  --product_name=Maize \
  --season_name=Main \
  --model_name=XGB \
  --exp_name=YNN_ACUM_ALL \
  --window=1004

Parameters:

  • --fnid: FNID identifier (administrative region)
  • --product_name: Product name (e.g., Maize, Sorghum)
  • --season_name: Season name (e.g., Main, Summer)
  • --model_name: Model type (XGB, LR)
  • --exp_name: Experiment name (e.g., YNN_ACUM_ALL)
  • --window: Forecast window identifier (e.g., 1004)

Output Files:

  • Model files: cape_sim_<exp_string>_L<lead_time>.{json,pkl}
  • Results: cape_sim_<exp_string>.npz
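
The command-line interface documented above could be parsed along these lines. This is a sketch with argparse, not the script's own parser: marking every argument as required and keeping --window as a string are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the arguments documented in this guide."""
    parser = argparse.ArgumentParser(description="Run one CAPE experiment (sketch)")
    parser.add_argument("--fnid", required=True, help="FNID identifier")
    parser.add_argument("--product_name", required=True, help="e.g. Maize, Sorghum")
    parser.add_argument("--season_name", required=True, help="e.g. Main, Summer")
    parser.add_argument("--model_name", required=True, choices=["XGB", "LR"])
    parser.add_argument("--exp_name", required=True, help="e.g. YNN_ACUM_ALL")
    # Kept as a string (assumption) so an identifier is passed through unchanged.
    parser.add_argument("--window", required=True, help="e.g. 1004")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```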


cape_tools_sim.py (Development)

File Location: scripts/development/cape_tools_sim.py

Purpose: Core simulation tools and utilities for model development.

Key Functions:

Main Simulation Functions

  • CAPE_SIM_Reforecast(rp, dir_data_in, dir_data_out): Reforecast predictions
  • cape_sim_build(bp, dir_data_in): Main simulation workflow builder
  • load_input_data(ubp, dir_data_in): Load Earth observation and crop data

Configuration Functions

  • ExperimentSettingPocket(exp_name, window): Configure experiment settings
  • month_range(start_month, end_month): Create month ranges with year wrap-around
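
A month range with year wrap-around can be sketched as below. The signature matches the one listed above, but the body is an assumption about the behaviour (inclusive endpoints, months numbered 1 to 12), not the project's implementation.

```python
def month_range(start_month, end_month):
    """Return months from start_month to end_month inclusive, wrapping past December."""
    if not (1 <= start_month <= 12 and 1 <= end_month <= 12):
        raise ValueError("months must be in 1..12")
    # Number of steps forward from start to end, modulo one year.
    span = (end_month - start_month) % 12
    return [(start_month - 1 + offset) % 12 + 1 for offset in range(span + 1)]


if __name__ == "__main__":
    print(month_range(10, 4))  # October through April
```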

Data Control Functions

  • CropDataControl(): Crop data processing and control
  • EODataControl(): Earth observation data control

Forecasting Functions

  • GenerateSeriesLeadPredTable(): Generate lead-time predictor tables
  • cape_sim_build_prediction(): Execute forecasting process

Utility Functions

  • CheckLeadPred(): Validate lead predictors
  • CombSerialLead(): Generate serial lead combinations
  • find_exist_case(df_all, dir_data_out): Check for existing experiment outputs

Experiment Name Format: <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>

Examples:

  • YNN_ACUM_ALL: Yield prediction, No transformation, No trend, Accumulated predictors
  • PQA_ACUM_ALL: Production prediction, Quantile transformation, Automatic trend, Accumulated predictors
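
The experiment name format can be decoded with a short sketch such as the one below. The lookup tables contain only the codes that appear in the two examples above; any other code is passed through unchanged, and the function name parse_exp_name is hypothetical.

```python
# Codes taken from the two documented examples; other codes may exist.
CROP_INDICATORS = {"Y": "Yield", "P": "Production"}
TRANSFORMS = {"N": "None", "Q": "Quantile"}
TRENDS = {"N": "None", "A": "Automatic"}


def parse_exp_name(exp_name):
    """Split <Crop><Transform><Trend>_<PredictorConfiguration> into its parts."""
    prefix, _, predictors = exp_name.partition("_")
    if len(prefix) != 3 or not predictors:
        raise ValueError(f"unexpected experiment name: {exp_name!r}")
    crop, transform, trend = prefix
    return {
        "crop_indicator": CROP_INDICATORS.get(crop, crop),
        "transformation": TRANSFORMS.get(transform, transform),
        "trend": TRENDS.get(trend, trend),
        "predictor_configuration": predictors,
    }


if __name__ == "__main__":
    print(parse_exp_name("PQA_ACUM_ALL"))
```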


cape_dev.py

File Location: scripts/development/cape_dev.py

Purpose: Development utilities and testing scripts.

Key Features:

  • Development environment setup
  • Testing utilities
  • Debugging tools
  • Development workflow helpers


Forecasting Scripts

cape_forecasting.py

File Location: scripts/forecasting/cape_forecasting.py

Purpose: Main orchestrator for forecasting and visualization workflow.

Key Functions:

  • main(): Main execution function
  • safe_function_call(func, *args, **kwargs): Safe function execution with error handling
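
A wrapper with the safe_function_call signature listed above might look like the following sketch. Returning None on failure and the logger name are assumptions; the real function may re-raise or return a different sentinel.

```python
import logging

logger = logging.getLogger("cape_forecasting")  # logger name is an assumption


def safe_function_call(func, *args, **kwargs):
    """Call func and log any exception instead of letting it propagate."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Call to %s failed", getattr(func, "__name__", repr(func)))
        return None
```

With this shape, one failing step is recorded in the log while the orchestrator decides whether to continue.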

Configuration Integration:

  • Loads paths from config/config.json
  • Uses centralized configuration management
  • Implements comprehensive logging

Workflow Steps:

  1. Load configuration and CAPE settings
  2. Compute lead matrices and adjustments
  3. Filter for specific countries/products (Somalia & Maize/Sorghum)
  4. Create input data using create_input_data
  5. Wait for prediction files to be available
  6. Generate viewer CSV files using generate_viewer_sim
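
Step 5, waiting for prediction files, can be approximated by a polling loop. The function name wait_for_files, the timeout, and the polling interval are illustrative assumptions; the guide does not say how the script waits.

```python
import time
from pathlib import Path


def wait_for_files(paths, timeout=3600.0, poll_interval=30.0):
    """Poll until every path exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    pending = [Path(p) for p in paths]
    while True:
        pending = [p for p in pending if not p.exists()]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```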

Logging: Outputs to scripts/logs/log_forecasting.txt


create_input_data.py

File Location: scripts/forecasting/create_input_data.py

Purpose: Prepares input data for forecasting operations.

Key Functions:

  • create_input_data(table, dir_data_in, fn_data_processed, fn_cropdata, fn_shapefile): Main data preparation function

Data Sources:

  • Processed data from HDF files
  • Crop data from CSV files
  • Geographic shapefile data

Operations:

  • Data loading and validation
  • Format standardization
  • Geographic integration
  • Temporal alignment


generate_viewer_sim.py

File Location: scripts/forecasting/generate_viewer_sim.py

Purpose: Generates simulation data for viewer applications.

Key Functions:

  • so_generate_viewer_sim(table, country, products): Main viewer generation function

Output Generation:

  • Viewer-compatible CSV files
  • Simulation validation data
  • Formatted results for visualization

Parameters:

  • table: CAPE settings table
  • country: Target country (e.g., 'Somalia')
  • products: List of products (e.g., ['Maize', 'Sorghum'])


cape_common.py

File Location: scripts/forecasting/cape_common.py

Purpose: Common utilities and functions for forecasting operations.

Key Functions:

  • Shared data processing functions
  • Common validation utilities
  • Reusable forecasting components
  • Error handling utilities


cape_tools_sim.py (Forecasting)

File Location: scripts/forecasting/cape_tools_sim.py

Purpose: Forecasting-specific simulation tools.

Key Functions:

  • Forecasting-specific data handling
  • Model prediction utilities
  • Result processing functions
  • Time series analysis tools


tools.py (Forecasting)

File Location: scripts/forecasting/tools.py

Purpose: General utility functions for forecasting operations.

Key Functions:

  • Data manipulation utilities
  • Statistical functions
  • File handling helpers
  • Validation functions


requirements.txt (Forecasting)

File Location: scripts/forecasting/requirements.txt

Purpose: Python package requirements for forecasting module.

Key Dependencies:

  • numpy
  • pandas
  • scikit-learn
  • xgboost
  • Additional forecasting-specific packages


Analysis Scripts

Cape Results Analysis

forecast_comparison.py

File Location: scripts/analysis_and_viz/cape_results_analysis/forecast_comparison.py

Purpose: Compares different forecasting models and methods.

Key Functions:

  • Model performance comparison
  • Statistical analysis of forecasts
  • Visualization of comparison results

Analysis Types:

  • Model accuracy comparison
  • Error analysis
  • Performance metrics calculation
  • Statistical significance testing


model_forecast_analysis.py

File Location: scripts/analysis_and_viz/cape_results_analysis/model_forecast_analysis.py

Purpose: Analyzes model forecasting performance.

Key Functions:

  • Model accuracy assessment
  • Error analysis
  • Performance metrics calculation
  • Model validation

Analysis Metrics:

  • RMSE (Root Mean Square Error)
  • MAE (Mean Absolute Error)
  • Correlation coefficients
  • Bias analysis
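
The metrics listed above have standard definitions, sketched here with the standard library only. The function name forecast_metrics is hypothetical, bias is taken as the mean of (predicted - observed), and the correlation is Pearson's r; the script itself may use different conventions or library routines.

```python
import math


def forecast_metrics(observed, predicted):
    """Return RMSE, MAE, bias, and Pearson correlation for paired values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Correlation is undefined when either series is constant.
    corr = cov / math.sqrt(var_o * var_p) if var_o > 0 and var_p > 0 else float("nan")
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "bias": sum(errors) / n,
        "correlation": corr,
    }
```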


HVSTAT CAPE Comparison Analysis

hvstat_cape_analysis_README.md

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/hvstat_cape_analysis_README.md

Purpose: Documentation for HVSTAT-CAPE comparison analysis.

Key Content:

  • Analysis methodology
  • Data sources and preparation
  • Expected outputs and formats
  • Usage instructions and examples


data_processing.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/data_processing.py

Purpose: Data processing utilities for comparison analysis.

Key Functions:

  • Data loading and validation
  • Format standardization
  • Quality control checks
  • Preprocessing for analysis


plot_yield_comparison.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/plot_yield_comparison.py

Purpose: Generates yield comparison plots and visualizations.

Key Features:

  • Interactive plotting with Plotly
  • Statistical visualization
  • Comparison charts and graphs
  • Geographic mapping

Output Types:

  • Static plots (PNG format)
  • Interactive plots (HTML format)
  • Geographic maps
  • Time series visualizations


yield_diff_analysis.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_analysis.py

Purpose: Analyzes yield differences between different methods or models.

Key Functions:

  • Difference calculation
  • Statistical analysis
  • Significance testing
  • Trend analysis


yield_diff_map_animation.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_map_animation.py

Purpose: Creates animated maps of yield differences over time.

Key Features:

  • Geographic visualization
  • Temporal animation
  • Interactive maps
  • Difference highlighting

Output Formats:

  • Animated GIF files
  • Interactive HTML maps
  • Frame sequences for video creation


year_alignment.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_alignment.py

Purpose: Aligns data by year for temporal comparison.

Key Functions:

  • Temporal data alignment
  • Year-based filtering
  • Consistency checks
  • Gap filling


year_fnid_alignment.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_fnid_alignment.py

Purpose: Aligns data by year and FNID for spatial-temporal analysis.

Key Functions:

  • Spatial-temporal alignment
  • FNID-based grouping
  • Year-based filtering
  • Consistency validation
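
Aligning two datasets by year and FNID amounts to keeping only the keys that both share. The sketch below assumes each dataset is a dict keyed by (fnid, year); the function name and this data shape are illustrative, and the script may well operate on tabular data instead.

```python
def align_by_year_fnid(left, right):
    """Return (fnid, year, left_value, right_value) for keys present in both inputs."""
    shared = sorted(left.keys() & right.keys())
    return [(fnid, year, left[(fnid, year)], right[(fnid, year)]) for fnid, year in shared]


if __name__ == "__main__":
    a = {("AO2008A101", 2010): 1.2, ("AO2008A101", 2011): 1.4}
    b = {("AO2008A101", 2011): 1.5, ("AO2008A101", 2012): 1.1}
    print(align_by_year_fnid(a, b))
```

Rows missing from either side are dropped, which keeps later difference calculations free of unmatched pairs.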


year_prod_alignment.py

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_prod_alignment.py

Purpose: Aligns production data by year for analysis.

Key Functions:

  • Production data alignment
  • Year-based aggregation
  • Quality control
  • Format standardization


requirements.txt (Analysis)

File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/requirements.txt

Purpose: Python package requirements for analysis module.

Key Dependencies:

  • pandas
  • numpy
  • matplotlib
  • plotly
  • Additional analysis-specific packages


Utility Scripts

__init__.py Files

Purpose: Python package initialization files.

Locations:

  • scripts/__init__.py
  • scripts/processing/__init__.py
  • scripts/forecasting/__init__.py
  • scripts/analysis_and_viz/__init__.py


Completion Marker Files

3_data_aggregation.done

File Location: scripts/processing/3_data_aggregation.done

Purpose: Completion marker file for data aggregation step.

Usage: Used by cape_preprocessing.py to track workflow progress.


Configuration Files

config.json

File Location: config/config.json

Purpose: Central configuration file for all project paths and settings.

Key Parameters:

{
  "dir_data_in": "/path/to/input/data/",
  "dir_data_out": "/path/to/output/data/",
  "dir_viewer": "/path/to/viewer/output/",
  "fn_data_processed": "/path/to/processed/data.hdf",
  "fn_cropdata": "/path/to/crop/data.csv",
  "fn_shapefile": "/path/to/shapefile.gpkg",
  "cape_setting_file": "/path/to/cape_settings.csv",
  "fnids_info": "/path/to/fnids_info.hdf",
  "fn_viewer_csv": "/path/to/viewer_data.csv"
}

Usage: All scripts reference this file for consistent path management.
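
A script could load and sanity-check this file roughly as follows. The key list mirrors the parameters shown above; the function name load_config and the decision to fail on missing keys are assumptions, not documented behaviour.

```python
import json

# Keys taken from the config.json example in this guide.
REQUIRED_KEYS = (
    "dir_data_in", "dir_data_out", "dir_viewer",
    "fn_data_processed", "fn_cropdata", "fn_shapefile",
    "cape_setting_file", "fnids_info", "fn_viewer_csv",
)


def load_config(path="config/config.json"):
    """Read the JSON configuration and fail early if a documented key is missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```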


environment.yml

File Location: scripts/environment.yml

Purpose: Conda environment specification with all required dependencies.

Environment Name: chafs_b

Key Dependencies:

  • Core scientific computing packages
  • Geospatial libraries
  • Machine learning frameworks
  • Visualization tools
  • Development utilities

Installation:

conda env create -f scripts/environment.yml -n chafs_b
conda activate chafs_b


Usage Examples

Complete Workflow Execution

# 1. Set up environment
conda env create -f scripts/environment.yml -n chafs_b
conda activate chafs_b

# 2. Configure paths
# Edit config/config.json with appropriate paths

# 3. Run preprocessing
cd scripts/processing
python cape_preprocessing.py --start-from 1

# 4. Run development
cd ../development
python cape_development.py

# 5. Run forecasting
cd ../forecasting
python cape_forecasting.py

# 6. Run analysis
cd ../analysis_and_viz/hvstat_cape_comp_analysis
python plot_yield_comparison.py

Individual Experiment Execution

# Run specific experiment
cd scripts/development
python cape_development_sim.py \
  --fnid=AO2008A101 \
  --product_name=Maize \
  --season_name=Main \
  --model_name=XGB \
  --exp_name=YNN_ACUM_ALL \
  --window=1004

Analysis and Visualization

# Generate comparison plots
cd scripts/analysis_and_viz/hvstat_cape_comp_analysis
python plot_yield_comparison.py

# Create animated maps
python yield_diff_map_animation.py

# Analyze yield differences
python yield_diff_analysis.py

This script reference guide provides detailed information about each script's purpose, functions, parameters, and usage examples. For additional information, refer to the main project documentation and individual script comments.