CAPE Replication Project - Script Reference Guide
Processing Scripts
cape_preprocessing.py
File Location: scripts/processing/cape_preprocessing.py
Purpose: Main orchestrator for the preprocessing workflow with step-by-step execution and progress tracking.
Key Functions:
- main(start_from): Main execution function
- mark(name): Creates completion marker files
- done(name): Checks if a step is completed
Command Line Usage:
# Run from step 1 (data streaming)
python cape_preprocessing.py --start-from 1
# Run from step 2 (data preprocessing)
python cape_preprocessing.py --start-from 2
# Run from step 3 (data aggregation)
python cape_preprocessing.py --start-from 3
Parameters:
- --start-from or -s: Starting step (1, 2, or 3)
  - 1: Data streaming
  - 2: Data preprocessing
  - 3: Data aggregation
Output Files:
- 1_data_stream.done: Completion marker for data streaming
- 2_data_preprocessing.done: Completion marker for preprocessing
- 3_data_aggregation.done: Completion marker for aggregation
- logs/log_preprocessing.txt: Log file
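The marker mechanism described above can be sketched as follows. This is a minimal illustration and not the project's implementation: the run_pipeline helper, the runners mapping, and the root argument are hypothetical, and only the step names, the .done suffix, and the --start-from/-s flag come from this guide.

```python
import argparse
from pathlib import Path

# Step numbers and marker names follow the Output Files list in this guide.
STEPS = {1: "1_data_stream", 2: "2_data_preprocessing", 3: "3_data_aggregation"}


def mark(name, root=Path(".")):
    """Create an empty '<name>.done' marker file."""
    (Path(root) / f"{name}.done").touch()


def done(name, root=Path(".")):
    """Return True if the marker for this step already exists."""
    return (Path(root) / f"{name}.done").exists()


def run_pipeline(start_from, runners, root=Path(".")):
    """Run every step numbered >= start_from, in order, marking each on success.

    `runners` maps step number -> zero-argument callable. These are hypothetical
    stand-ins for data_stream(), data_preprocessing(), and data_aggregation().
    """
    executed = []
    for number, name in sorted(STEPS.items()):
        if number < start_from:
            continue  # earlier steps are assumed to be complete already
        runners[number]()
        mark(name, root)
        executed.append(name)
    return executed


def parse_args(argv=None):
    """Parse the --start-from / -s option documented above."""
    parser = argparse.ArgumentParser(description="Preprocessing orchestrator (sketch)")
    parser.add_argument("--start-from", "-s", type=int, choices=[1, 2, 3], default=1)
    return parser.parse_args(argv)
```

A marker is written only after its step returns without raising, so a failed step leaves no .done file behind and can be retried with the matching --start-from value.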
data_stream.py
File Location: scripts/processing/data_stream.py
Purpose: Handles data streaming from external sources and manages data pipelines.
Key Functions:
- data_stream(): Main streaming function
- Downloads and processes raw data from various sources
- Saves streamed data to the ./data_stream/ directory
Dependencies:
- External data sources (specified in configuration)
- Network connectivity for data downloads
Output: Streamed data files in ./data_stream/ directory
data_preprocessing.py
File Location: scripts/processing/data_preprocessing.py
Purpose: Cleans and transforms raw data for analysis.
Key Functions:
- data_preprocessing(): Main preprocessing function
- Handles missing values and outliers
- Applies data transformations and normalization
- Performs spatial/temporal adjustments
Processing Steps:
1. Load raw data from streamed sources
2. Clean missing values and outliers
3. Apply transformations (normalization, scaling)
4. Perform spatial/temporal adjustments
5. Save processed data to the ./data_processed/ directory
Output: Processed data files in ./data_processed/ directory
data_aggregation.py
File Location: scripts/processing/data_aggregation.py
Purpose: Aggregates processed data into unified formats for analysis.
Key Functions:
- data_aggregation(): Main aggregation function
- Computes spatial and temporal averages
- Combines multiple datasets
- Creates unified output files
Aggregation Operations:
- Spatial averaging across regions
- Temporal aggregation (daily to monthly)
- Multi-dataset combination
- Format standardization (CSV, HDF)
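As an illustration of the temporal and spatial averaging listed above, the sketch below uses only the Python standard library. The function names monthly_mean and spatial_mean are hypothetical, the input layout is an assumption, and the spatial average is unweighted; the actual script may weight regions or use a different data structure.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean(daily):
    """Average daily values into (year, month) buckets.

    `daily` is an iterable of (datetime.date, float) pairs (assumed layout).
    """
    buckets = defaultdict(list)
    for day, value in daily:
        buckets[(day.year, day.month)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


def spatial_mean(values_by_region):
    """Unweighted average across regions; area weighting is not modelled here."""
    return mean(values_by_region.values())


if __name__ == "__main__":
    series = [(date(2020, 1, 1), 1.0), (date(2020, 1, 2), 3.0), (date(2020, 2, 1), 5.0)]
    print(monthly_mean(series))
    print(spatial_mean({"region_a": 2.0, "region_b": 4.0}))
```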
Output: Aggregated data files in ./data_processed/output/ directory
tools.py (Processing)
File Location: scripts/processing/tools.py
Purpose: Utility functions for data processing operations.
Key Functions:
- Data validation utilities
- File handling helpers
- Common data transformation functions
- Error handling utilities
Development Scripts
cape_development.py
File Location: scripts/development/cape_development.py
Purpose: Main orchestrator for model development and training experiments.
Key Functions:
- main(force=False): Main execution function
- month_range(): Computes forecast lead months
- find_exist_case(): Checks for existing experiment outputs
Command Line Usage:
# Run normal development workflow
python cape_development.py
# Force re-run all experiments
python cape_development.py --force
Parameters:
- -f or --force: Re-run all experiments, ignoring existing outputs
Configuration Requirements:
- config/config.json: Path configurations
- cape_setting.csv: Experiment settings
- fnids_info.hdf: FNID information
Workflow:
1. Load experiment settings from CSV
2. Compute lead time matrices
3. Filter FNIDs (require >12 records)
4. Execute forecasting experiments via subprocess calls
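Steps 3 and 4 of the workflow above might look roughly like the sketch below. It is an assumption-laden illustration: filter_fnids and build_command are hypothetical helpers, "records" is taken to mean one entry per FNID occurrence, and the command-line flags are the ones documented for cape_development_sim.py in this guide.

```python
import sys
from collections import Counter

MIN_RECORDS = 12  # the workflow keeps FNIDs with more than 12 records


def filter_fnids(record_fnids, min_records=MIN_RECORDS):
    """Return the FNIDs that appear in more than `min_records` records."""
    counts = Counter(record_fnids)
    return sorted(fnid for fnid, n in counts.items() if n > min_records)


def build_command(fnid, product_name, season_name, model_name, exp_name, window):
    """Build the argument list for one cape_development_sim.py run."""
    return [
        sys.executable,
        "cape_development_sim.py",
        f"--fnid={fnid}",
        f"--product_name={product_name}",
        f"--season_name={season_name}",
        f"--model_name={model_name}",
        f"--exp_name={exp_name}",
        f"--window={window}",
    ]


if __name__ == "__main__":
    cmd = build_command("AO2008A101", "Maize", "Main", "XGB", "YNN_ACUM_ALL", 1004)
    print(" ".join(cmd))
    # To actually launch the experiment, one option is:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues, and check=True makes a failed experiment raise instead of passing silently.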
cape_development_sim.py
File Location: scripts/development/cape_development_sim.py
Purpose: Individual experiment execution script for model training and prediction.
Key Functions:
- main(): Main execution function
- Parses command-line arguments
- Configures experiment settings
- Executes simulation via cape_sim_build
Command Line Usage:
python cape_development_sim.py \
--fnid=AO2008A101 \
--product_name=Maize \
--season_name=Main \
--model_name=XGB \
--exp_name=YNN_ACUM_ALL \
--window=1004
Parameters:
- --fnid: FNID identifier (administrative region)
- --product_name: Product name (e.g., Maize, Sorghum)
- --season_name: Season name (e.g., Main, Summer)
- --model_name: Model type (XGB, LR)
- --exp_name: Experiment name (e.g., YNN_ACUM_ALL)
- --window: Forecast window identifier (e.g., 1004)
Output Files:
- Model files: cape_sim_<exp_string>_L<lead_time>.{json,pkl}
- Results: cape_sim_<exp_string>.npz
cape_tools_sim.py (Development)
File Location: scripts/development/cape_tools_sim.py
Purpose: Core simulation tools and utilities for model development.
Key Functions:
Main Simulation Functions
- CAPE_SIM_Reforecast(rp, dir_data_in, dir_data_out): Reforecast predictions
- cape_sim_build(bp, dir_data_in): Main simulation workflow builder
- load_input_data(ubp, dir_data_in): Load Earth observation and crop data
Configuration Functions
- ExperimentSettingPocket(exp_name, window): Configure experiment settings
- month_range(start_month, end_month): Create month ranges with year wrap-around
Data Control Functions
- CropDataControl(): Crop data processing and control
- EODataControl(): Earth observation data control
Forecasting Functions
- GenerateSeriesLeadPredTable(): Generate lead-time predictor tables
- cape_sim_build_prediction(): Execute forecasting process
Utility Functions
- CheckLeadPred(): Validate lead predictors
- CombSerialLead(): Generate serial lead combinations
- find_exist_case(df_all, dir_data_out): Check for existing experiment outputs
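The year wrap-around behaviour of month_range(start_month, end_month) can be illustrated with the sketch below. It assumes 1-based month numbers and an inclusive end month; the project's own function may differ in return type or bounds handling.

```python
def month_range(start_month, end_month):
    """Return months from start_month to end_month inclusive, wrapping past December.

    Assumes months are integers in 1..12 (illustrative convention).
    """
    if not (1 <= start_month <= 12 and 1 <= end_month <= 12):
        raise ValueError("months must be in 1..12")
    # Number of months covered, counting forward and wrapping at the year boundary.
    length = (end_month - start_month) % 12 + 1
    return [(start_month - 1 + i) % 12 + 1 for i in range(length)]


if __name__ == "__main__":
    print(month_range(10, 2))  # October through February
    print(month_range(3, 6))   # March through June
```

Under these assumptions a season that starts in October and ends in February yields 10, 11, 12, 1, 2 instead of an empty range.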
Experiment Name Format: <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>
Examples:
- YNN_ACUM_ALL: Yield prediction, No transformation, No trend, Accumulated predictors
- PQA_ACUM_ALL: Production prediction, Quantile transformation, Automatic trend, Accumulated predictors
cape_dev.py
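A small parser makes the naming scheme concrete. The sketch below is hypothetical (parse_exp_name is not a project function) and maps only the letter codes that can be read off the two examples above; any other code is reported as unknown instead of being guessed, and the predictor configuration is returned verbatim.

```python
# Letter codes inferred from the two documented examples only.
INDICATOR = {"Y": "Yield", "P": "Production"}
TRANSFORM = {"N": "No transformation", "Q": "Quantile transformation"}
TREND = {"N": "No trend", "A": "Automatic trend"}


def parse_exp_name(exp_name):
    """Split '<Crop><Transform><Trend>_<PredictorConfiguration>' into its parts."""
    prefix, _, predictors = exp_name.partition("_")
    if len(prefix) != 3 or not predictors:
        raise ValueError(f"unexpected experiment name: {exp_name!r}")
    indicator, transform, trend = prefix
    return {
        "indicator": INDICATOR.get(indicator, f"unknown ({indicator})"),
        "transformation": TRANSFORM.get(transform, f"unknown ({transform})"),
        "trend": TREND.get(trend, f"unknown ({trend})"),
        "predictors": predictors,
    }


if __name__ == "__main__":
    print(parse_exp_name("YNN_ACUM_ALL"))
    print(parse_exp_name("PQA_ACUM_ALL"))
```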
File Location: scripts/development/cape_dev.py
Purpose: Development utilities and testing scripts.
Key Features:
- Development environment setup
- Testing utilities
- Debugging tools
- Development workflow helpers
Forecasting Scripts
cape_forecasting.py
File Location: scripts/forecasting/cape_forecasting.py
Purpose: Main orchestrator for forecasting and visualization workflow.
Key Functions:
- main(): Main execution function
- safe_function_call(func, *args, **kwargs): Safe function execution with error handling
Configuration Integration:
- Loads paths from config/config.json
- Uses centralized configuration management
- Implements comprehensive logging
Workflow Steps:
1. Load configuration and CAPE settings
2. Compute lead matrices and adjustments
3. Filter for specific countries/products (Somalia & Maize/Sorghum)
4. Create input data using create_input_data
5. Wait for prediction files to be available
6. Generate viewer CSV files using generate_viewer_sim
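Step 5 (waiting for prediction files) could be implemented as a simple polling loop. The sketch below is one possible approach under stated assumptions: wait_for_files, its timeout and interval parameters, and its return convention are all hypothetical and are not taken from cape_forecasting.py.

```python
import time
from pathlib import Path


def wait_for_files(paths, timeout=60.0, interval=1.0):
    """Poll until every path exists.

    Returns the list of paths still missing when the timeout expires
    (an empty list means all files were found).
    """
    deadline = time.monotonic() + timeout
    missing = [Path(p) for p in paths]
    while True:
        missing = [p for p in missing if not p.exists()]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(interval)
```

Returning the missing paths, instead of raising, lets the caller decide whether to continue with partial results or stop.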
Logging: Outputs to scripts/logs/log_forecasting.txt
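A wrapper with the signature safe_function_call(func, *args, **kwargs) might be written as below. This is a minimal sketch: the (ok, result) return convention and the logger name are assumptions, and the project's version may return or log differently.

```python
import logging

logger = logging.getLogger("cape_forecasting")  # logger name is an assumption


def safe_function_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (ok, result).

    Any exception is logged with its traceback and reported as (False, None),
    so one failing step does not abort the whole workflow.
    """
    try:
        return True, func(*args, **kwargs)
    except Exception:
        logger.exception("Call to %s failed", getattr(func, "__name__", repr(func)))
        return False, None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(safe_function_call(int, "7"))
    print(safe_function_call(int, "not a number"))
```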
create_input_data.py
File Location: scripts/forecasting/create_input_data.py
Purpose: Prepares input data for forecasting operations.
Key Functions:
- create_input_data(table, dir_data_in, fn_data_processed, fn_cropdata, fn_shapefile): Main data preparation function
Data Sources:
- Processed data from HDF files
- Crop data from CSV files
- Geographic shapefile data
Operations:
- Data loading and validation
- Format standardization
- Geographic integration
- Temporal alignment
generate_viewer_sim.py
File Location: scripts/forecasting/generate_viewer_sim.py
Purpose: Generates simulation data for viewer applications.
Key Functions:
- so_generate_viewer_sim(table, country, products): Main viewer generation function
Output Generation:
- Viewer-compatible CSV files
- Simulation validation data
- Formatted results for visualization
Parameters:
- table: CAPE settings table
- country: Target country (e.g., 'Somalia')
- products: List of products (e.g., ['Maize', 'Sorghum'])
cape_common.py
File Location: scripts/forecasting/cape_common.py
Purpose: Common utilities and functions for forecasting operations.
Key Functions:
- Shared data processing functions
- Common validation utilities
- Reusable forecasting components
- Error handling utilities
cape_tools_sim.py (Forecasting)
File Location: scripts/forecasting/cape_tools_sim.py
Purpose: Forecasting-specific simulation tools.
Key Functions:
- Forecasting-specific data handling
- Model prediction utilities
- Result processing functions
- Time series analysis tools
tools.py (Forecasting)
File Location: scripts/forecasting/tools.py
Purpose: General utility functions for forecasting operations.
Key Functions:
- Data manipulation utilities
- Statistical functions
- File handling helpers
- Validation functions
requirements.txt (Forecasting)
File Location: scripts/forecasting/requirements.txt
Purpose: Python package requirements for forecasting module.
Key Dependencies:
- numpy
- pandas
- scikit-learn
- xgboost
- Additional forecasting-specific packages
Analysis Scripts
Cape Results Analysis
forecast_comparison.py
File Location: scripts/analysis_and_viz/cape_results_analysis/forecast_comparison.py
Purpose: Compares different forecasting models and methods.
Key Functions:
- Model performance comparison
- Statistical analysis of forecasts
- Visualization of comparison results
Analysis Types:
- Model accuracy comparison
- Error analysis
- Performance metrics calculation
- Statistical significance testing
model_forecast_analysis.py
File Location: scripts/analysis_and_viz/cape_results_analysis/model_forecast_analysis.py
Purpose: Analyzes model forecasting performance.
Key Functions:
- Model accuracy assessment
- Error analysis
- Performance metrics calculation
- Model validation
Analysis Metrics:
- RMSE (Root Mean Square Error)
- MAE (Mean Absolute Error)
- Correlation coefficients
- Bias analysis
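For reference, the metrics listed above can be computed as in the sketch below, using the standard definitions with error taken as predicted minus observed. The function name forecast_metrics is hypothetical, the correlation shown is Pearson's r, and bias is the mean error; the analysis script may use different conventions or library routines.

```python
import math


def forecast_metrics(observed, predicted):
    """Return RMSE, MAE, bias (mean error), and Pearson correlation."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    bias = sum(errors) / n
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Correlation is undefined when either series is constant.
    corr = cov / math.sqrt(var_o * var_p) if var_o > 0 and var_p > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "bias": bias, "corr": corr}


if __name__ == "__main__":
    print(forecast_metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```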
HVSTAT CAPE Comparison Analysis
hvstat_cape_analysis_README.md
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/hvstat_cape_analysis_README.md
Purpose: Documentation for HVSTAT-CAPE comparison analysis.
Key Content:
- Analysis methodology
- Data sources and preparation
- Expected outputs and formats
- Usage instructions and examples
data_processing.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/data_processing.py
Purpose: Data processing utilities for comparison analysis.
Key Functions:
- Data loading and validation
- Format standardization
- Quality control checks
- Preprocessing for analysis
plot_yield_comparison.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/plot_yield_comparison.py
Purpose: Generates yield comparison plots and visualizations.
Key Features:
- Interactive plotting with Plotly
- Statistical visualization
- Comparison charts and graphs
- Geographic mapping
Output Types:
- Static plots (PNG format)
- Interactive plots (HTML format)
- Geographic maps
- Time series visualizations
yield_diff_analysis.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_analysis.py
Purpose: Analyzes yield differences between different methods or models.
Key Functions:
- Difference calculation
- Statistical analysis
- Significance testing
- Trend analysis
yield_diff_map_animation.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_map_animation.py
Purpose: Creates animated maps of yield differences over time.
Key Features:
- Geographic visualization
- Temporal animation
- Interactive maps
- Difference highlighting
Output Formats:
- Animated GIF files
- Interactive HTML maps
- Frame sequences for video creation
year_alignment.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_alignment.py
Purpose: Aligns data by year for temporal comparison.
Key Functions:
- Temporal data alignment
- Year-based filtering
- Consistency checks
- Gap filling
year_fnid_alignment.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_fnid_alignment.py
Purpose: Aligns data by year and FNID for spatial-temporal analysis.
Key Functions:
- Spatial-temporal alignment
- FNID-based grouping
- Year-based filtering
- Consistency validation
year_prod_alignment.py
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_prod_alignment.py
Purpose: Aligns production data by year for analysis.
Key Functions:
- Production data alignment
- Year-based aggregation
- Quality control
- Format standardization
requirements.txt (Analysis)
File Location: scripts/analysis_and_viz/hvstat_cape_comp_analysis/requirements.txt
Purpose: Python package requirements for analysis module.
Key Dependencies:
- pandas
- numpy
- matplotlib
- plotly
- Additional analysis-specific packages
Utility Scripts
__init__.py Files
Purpose: Python package initialization files.
Locations:
- scripts/__init__.py
- scripts/processing/__init__.py
- scripts/forecasting/__init__.py
- scripts/analysis_and_viz/__init__.py
Completion Marker Files
3_data_aggregation.done
File Location: scripts/processing/3_data_aggregation.done
Purpose: Completion marker file for data aggregation step.
Usage: Used by cape_preprocessing.py to track workflow progress.
Configuration Files
config.json
File Location: config/config.json
Purpose: Central configuration file for all project paths and settings.
Key Parameters:
{
"dir_data_in": "/path/to/input/data/",
"dir_data_out": "/path/to/output/data/",
"dir_viewer": "/path/to/viewer/output/",
"fn_data_processed": "/path/to/processed/data.hdf",
"fn_cropdata": "/path/to/crop/data.csv",
"fn_shapefile": "/path/to/shapefile.gpkg",
"cape_setting_file": "/path/to/cape_settings.csv",
"fnids_info": "/path/to/fnids_info.hdf",
"fn_viewer_csv": "/path/to/viewer_data.csv"
}
Usage: All scripts reference this file for consistent path management.
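A script might read this file along the lines of the sketch below. The load_config helper is hypothetical, and treating every key shown above as required is an assumption made for the illustration; the project's scripts may tolerate missing keys or resolve the path differently.

```python
import json

# Keys taken from the example config.json shown above.
REQUIRED_KEYS = (
    "dir_data_in",
    "dir_data_out",
    "dir_viewer",
    "fn_data_processed",
    "fn_cropdata",
    "fn_shapefile",
    "cape_setting_file",
    "fnids_info",
    "fn_viewer_csv",
)


def load_config(path="config/config.json"):
    """Load the JSON config and fail early if an expected key is absent."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```

Checking the keys at load time surfaces a misconfigured path file before any long-running step starts.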
environment.yml
File Location: scripts/environment.yml
Purpose: Conda environment specification with all required dependencies.
Environment Name: chafs_b
Key Dependencies:
- Core scientific computing packages
- Geospatial libraries
- Machine learning frameworks
- Visualization tools
- Development utilities
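Installation: Create the environment with conda env create -f scripts/environment.yml -n chafs_b, then activate it with conda activate chafs_b.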
Usage Examples
Complete Workflow Execution
# 1. Set up environment
conda env create -f scripts/environment.yml -n chafs_b
conda activate chafs_b
# 2. Configure paths
# Edit config/config.json with appropriate paths
# 3. Run preprocessing
cd scripts/processing
python cape_preprocessing.py --start-from 1
# 4. Run development
cd ../development
python cape_development.py
# 5. Run forecasting
cd ../forecasting
python cape_forecasting.py
# 6. Run analysis
cd ../analysis_and_viz/hvstat_cape_comp_analysis
python plot_yield_comparison.py
Individual Experiment Execution
# Run specific experiment
cd scripts/development
python cape_development_sim.py \
--fnid=AO2008A101 \
--product_name=Maize \
--season_name=Main \
--model_name=XGB \
--exp_name=YNN_ACUM_ALL \
--window=1004
Analysis and Visualization
# Generate comparison plots
cd scripts/analysis_and_viz/hvstat_cape_comp_analysis
python plot_yield_comparison.py
# Create animated maps
python yield_diff_map_animation.py
# Analyze yield differences
python yield_diff_analysis.py
This script reference guide provides detailed information about each script's purpose, functions, parameters, and usage examples. For additional information, refer to the main project documentation and individual script comments.