CAPE Replication Project - Complete File Documentation

Table of Contents

  1. Project Overview
  2. Directory Structure
  3. Configuration Files
  4. Processing Scripts
  5. Development Scripts
  6. Forecasting Scripts
  7. Analysis and Visualization Scripts
  8. Environment and Dependencies
  9. Usage Instructions
  10. File Naming Conventions
  11. Troubleshooting
  12. Contributing

Project Overview

The CAPE (Crop Assessment and Production Estimation) Replication Project is a comprehensive framework for crop yield forecasting and analysis. The project implements a complete workflow from data preprocessing to forecasting and visualization, with a focus on reproducibility and modularity.

Key Components:

  • Data Processing: Raw data streaming, preprocessing, and aggregation
  • Model Development: Training and validation of forecasting models
  • Forecasting: Generation of crop yield predictions
  • Analysis: Comparison and validation of results
  • Visualization: Interactive plots and maps

Directory Structure

cape_replication_project/
├── config/                     # Configuration files
├── docs/                       # Documentation
├── scripts/                    # Main script directories
│   ├── processing/            # Data processing scripts
│   ├── development/           # Model development scripts
│   ├── forecasting/           # Forecasting scripts
│   └── analysis_and_viz/      # Analysis and visualization
├── capevenv/                  # Virtual environment
└── environment.yml            # Conda environment specification

Configuration Files

config/config.json

Purpose: Central configuration file containing all file paths and directory locations.

Key Configuration Parameters:

  • dir_data_in: Input data directory path
  • dir_data_out: Output data directory path
  • dir_viewer: Viewer output directory path
  • fn_data_processed: Processed data file path
  • fn_cropdata: Crop data CSV file path
  • fn_shapefile: Geographic shapefile path
  • cape_setting_file: CAPE settings CSV file path
  • fnids_info: FNID information HDF file path
  • fn_viewer_csv: Viewer CSV output file path

Usage: All scripts reference this file for consistent path management across the project.
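
As a minimal sketch of how a script might read this file, the snippet below loads the JSON and reports any expected keys that are absent. The function names load_config and missing_keys are hypothetical and are not taken from the project; only the key names come from the list above.

```python
import json


def load_config(path="config/config.json"):
    """Read the central configuration file and return it as a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def missing_keys(config, required):
    """Return the required keys that are absent from the config, in order."""
    return [key for key in required if key not in config]
```

A script could call missing_keys(load_config(), ["dir_data_in", "dir_data_out"]) at startup and stop early if the result is not empty.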

config/.config.json.swp

Purpose: Vim swap file (temporary file created during editing).


Processing Scripts

Main Processing Scripts

scripts/processing/cape_preprocessing.py

Purpose: Main orchestrator for the preprocessing workflow.

Key Features:

  • Step-by-step execution: Runs preprocessing in 3 sequential steps
  • Progress tracking: Uses .done files to track completion
  • Resumable workflow: Can restart from any step using the --start-from argument
  • Comprehensive logging: Logs all operations to file and console

Command Line Arguments:

  • --start-from or -s: Choose starting step (1=stream, 2=prep, 3=agg)

Workflow Steps:

  1. Data Streaming (data_stream.py): Fetches and streams raw data
  2. Data Preprocessing (data_preprocessing.py): Cleans and transforms data
  3. Data Aggregation (data_aggregation.py): Combines and aggregates data
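
The resumable pattern described here can be sketched roughly as follows. This is an illustration, not the project's code: the marker naming (modelled on 3_data_aggregation.done), the helper names, and the choice to re-run every step from the starting step onward are all assumptions.

```python
import argparse
from pathlib import Path

# Step order from the workflow above. Marker file names are an assumption
# modelled on the existing 3_data_aggregation.done file.
STEPS = [
    (1, "data_stream"),
    (2, "data_preprocessing"),
    (3, "data_aggregation"),
]


def marker_path(step_no, name, marker_dir):
    """Path of the completion marker for one step."""
    return Path(marker_dir) / f"{step_no}_{name}.done"


def first_incomplete_step(marker_dir):
    """Return the number of the first step without a marker, or None if all are done."""
    for step_no, name in STEPS:
        if not marker_path(step_no, name, marker_dir).exists():
            return step_no
    return None


def run_pipeline(start_from, marker_dir, runners):
    """Run each step numbered >= start_from and write its marker on success."""
    executed = []
    for step_no, name in STEPS:
        if step_no < start_from:
            continue  # earlier steps are skipped entirely
        runners[name]()  # an exception here leaves the marker unwritten
        marker_path(step_no, name, marker_dir).touch()
        executed.append(name)
    return executed


def parse_args(argv=None):
    """Parse the --start-from / -s option described above."""
    parser = argparse.ArgumentParser(description="Preprocessing orchestrator (sketch)")
    parser.add_argument("--start-from", "-s", type=int, choices=[1, 2, 3], default=1)
    return parser.parse_args(argv)
```

Writing the marker only after the step returns means an interrupted step is treated as incomplete on the next run.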

scripts/processing/data_stream.py

Purpose: Handles data streaming from external sources.

Key Functions:

  • Downloads raw data from various sources
  • Manages data streaming pipelines
  • Saves streamed data to ./data_stream directory
  • Handles different data formats and sources

scripts/processing/data_preprocessing.py

Purpose: Cleans and transforms raw data.

Key Functions:

  • Handles missing values and outliers
  • Applies data transformations and normalization
  • Performs spatial/temporal adjustments
  • Saves processed data to ./data_processed directory

scripts/processing/data_aggregation.py

Purpose: Aggregates processed data into unified formats.

Key Functions:

  • Computes spatial and temporal averages
  • Combines multiple datasets
  • Creates unified output files (CSV, HDF formats)
  • Saves aggregated data to ./data_processed/output directory

Supporting Processing Files

scripts/processing/tools.py

Purpose: Utility functions for data processing.

Key Functions:

  • Data validation utilities
  • File handling helpers
  • Common data transformation functions

scripts/processing/3_data_aggregation.done

Purpose: Completion marker file for data aggregation step.

scripts/processing/__init__.py

Purpose: Python package initialization file.

Preprocessing Subdirectory

scripts/processing/preprocessing/

Purpose: Contains specialized preprocessing scripts for different data types.

Key Files:

  • atmp_fldas.py: FLDAS atmospheric data preprocessing
  • eta_ssebop_v6.py: SSEBop v6 evapotranspiration data preprocessing
  • Additional specialized preprocessing modules


Development Scripts

Main Development Scripts

scripts/development/cape_development.py

Purpose: Main orchestrator for model development and training.

Key Features:

  • Configuration-driven: Reads settings from cape_setting.csv
  • Experiment management: Handles multiple forecasting experiments
  • Lead time computation: Calculates forecast lead months and adjustments
  • Duplicate prevention: Avoids redundant computation using find_exist_case
  • Sequential execution: Runs experiments one after another via subprocess calls

Command Line Arguments:

  • -f or --force: Re-run all experiments, ignoring existing outputs

Key Workflow:

  1. Loads experiment settings from CSV
  2. Computes lead time matrices
  3. Filters FNIDs based on record requirements (>12 records)
  4. Executes forecasting experiments via cape_development_sim.py
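
The record filter in step 3 might look something like the sketch below. It assumes that a "record" is one row with a non-missing yield value and that rows are plain dicts with fnid and yield keys; both are assumptions made for illustration, as are the function name and the placeholder FNIDs in the usage note.

```python
from collections import Counter

MIN_RECORDS = 12  # threshold from the workflow above: strictly more than 12


def fnids_with_enough_records(rows, min_records=MIN_RECORDS):
    """Return the sorted FNIDs that have more than `min_records` usable rows."""
    # Count only rows that actually carry a yield value (assumed definition).
    counts = Counter(row["fnid"] for row in rows if row.get("yield") is not None)
    return sorted(fnid for fnid, n in counts.items() if n > min_records)
```

With this definition, an FNID with exactly 12 records is dropped and one with 13 is kept.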

scripts/development/cape_development_sim.py

Purpose: Individual experiment execution script.

Key Features:

  • Parameter-driven: Accepts command-line arguments for experiment parameters
  • Model training: Trains forecasting models (XGBoost, Linear Regression)
  • Cross-validation: Implements time series cross-validation
  • Hyperparameter optimization: Uses Bayesian optimization
  • Output generation: Saves models and predictions

Command Line Arguments:

  • --fnid: FNID identifier
  • --product_name: Product name (e.g., Maize)
  • --season_name: Season name (e.g., Main)
  • --model_name: Model type (XGB, LR)
  • --exp_name: Experiment name (e.g., YNN_ACUM_ALL)
  • --window: Forecast window identifier
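
For illustration, the argument list above could be declared with argparse as in the following sketch. Whether each argument is required, and whether --model_name is restricted to XGB and LR, are assumptions; build_command is a hypothetical helper showing how an orchestrator might assemble the subprocess call.

```python
import argparse


def build_parser():
    """Declare the experiment arguments listed above (sketch)."""
    parser = argparse.ArgumentParser(description="Run one CAPE experiment (sketch)")
    parser.add_argument("--fnid", required=True, help="FNID identifier")
    parser.add_argument("--product_name", required=True, help="e.g. Maize")
    parser.add_argument("--season_name", required=True, help="e.g. Main")
    parser.add_argument("--model_name", required=True, choices=["XGB", "LR"])
    parser.add_argument("--exp_name", required=True, help="e.g. YNN_ACUM_ALL")
    parser.add_argument("--window", required=True, help="Forecast window identifier")
    return parser


def build_command(params, script="cape_development_sim.py"):
    """Turn a dict of parameters into an argv list suitable for subprocess.run."""
    cmd = ["python", script]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd
```

Passing the command as a list rather than a single shell string avoids quoting problems if a parameter value contains spaces.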

scripts/development/cape_tools_sim.py

Purpose: Core simulation tools and utilities.

Key Functions:

  • CAPE_SIM_Reforecast: Reforecast predictions for given setup
  • cape_sim_build: Main simulation workflow builder
  • load_input_data: Load Earth observation and crop data
  • ExperimentSettingPocket: Configure experiment settings
  • CropDataControl: Crop data processing and control
  • EODataControl: Earth observation data control
  • GenerateSeriesLeadPredTable: Generate lead-time predictor tables
  • cape_sim_build_prediction: Execute forecasting process

Supporting Functions:

  • month_range: Create month ranges with year wrap-around
  • CheckLeadPred: Validate lead predictors
  • CombSerialLead: Generate serial lead combinations
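
To show what a month range with year wrap-around means, here is one possible implementation. The signature (inclusive start and end months as integers 1 to 12) is an assumption; the project's month_range may take different arguments or return a different type.

```python
def month_range(start, end):
    """Return the months from start to end inclusive, wrapping past December."""
    if not (1 <= start <= 12 and 1 <= end <= 12):
        raise ValueError("months must be in 1..12")
    months = [start]
    while months[-1] != end:
        # 12 % 12 + 1 == 1, so December rolls over to January.
        months.append(months[-1] % 12 + 1)
    return months
```

Under these assumptions, month_range(11, 2) gives [11, 12, 1, 2], which is the shape needed for a season that starts late in one year and ends early in the next.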

scripts/development/cape_dev.py

Purpose: Development utilities and testing scripts.

Key Features:

  • Development environment setup
  • Testing utilities
  • Debugging tools


Forecasting Scripts

Main Forecasting Scripts

scripts/forecasting/cape_forecasting.py

Purpose: Main orchestrator for forecasting and visualization workflow.

Key Features:

  • Configuration integration: Uses centralized config.json
  • Safe execution: Implements error handling with safe_function_call
  • Data preparation: Calls create_input_data for data setup
  • Viewer generation: Calls generate_viewer_sim for output creation
  • Comprehensive logging: Detailed logging of all operations

Workflow Steps:

  1. Load configuration and CAPE settings
  2. Compute lead matrices and adjustments
  3. Filter for specific countries/products (currently Somalia & Maize/Sorghum)
  4. Create input data using create_input_data
  5. Wait for prediction files to be available
  6. Generate viewer CSV files using generate_viewer_sim
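
Two of these ideas, the safe call wrapper and the wait in step 5, can be sketched as below. The real safe_function_call may have a different signature and return value; the (ok, result) tuple, the wait_for_files helper, and the timeout and polling defaults are assumptions for illustration.

```python
import logging
import time
from pathlib import Path

logger = logging.getLogger("cape_forecasting_sketch")


def safe_function_call(func, *args, **kwargs):
    """Call func; on any exception, log the traceback and return (False, None)."""
    try:
        return True, func(*args, **kwargs)
    except Exception:
        logger.exception("Call to %s failed", getattr(func, "__name__", repr(func)))
        return False, None


def wait_for_files(paths, timeout_s=60.0, poll_s=1.0):
    """Poll until every path exists or the timeout elapses.

    Returns the list of paths that are still missing (empty on success).
    """
    deadline = time.monotonic() + timeout_s
    missing = [Path(p) for p in paths]
    while True:
        missing = [p for p in missing if not p.exists()]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll_s)
```

Returning the missing paths instead of raising lets the caller decide whether to skip viewer generation or abort.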

scripts/forecasting/create_input_data.py

Purpose: Prepares input data for forecasting.

Key Functions:

  • Loads processed data from HDF files
  • Prepares crop data from CSV files
  • Integrates geographic shapefile data
  • Creates unified input datasets for forecasting

scripts/forecasting/generate_viewer_sim.py

Purpose: Generates simulation data for viewer applications.

Key Functions:

  • Processes forecasting results
  • Creates viewer-compatible CSV files
  • Handles data formatting and validation
  • Generates simulation validation data

Supporting Forecasting Files

scripts/forecasting/cape_common.py

Purpose: Common utilities and functions for forecasting.

Key Functions:

  • Shared data processing functions
  • Common validation utilities
  • Reusable forecasting components

scripts/forecasting/cape_tools_sim.py

Purpose: Forecasting-specific simulation tools.

Key Functions:

  • Forecasting-specific data handling
  • Model prediction utilities
  • Result processing functions

scripts/forecasting/tools.py

Purpose: General utility functions for forecasting.

scripts/forecasting/requirements.txt

Purpose: Python package requirements for forecasting module.


Analysis and Visualization Scripts

Cape Results Analysis

scripts/analysis_and_viz/cape_results_analysis/forecast_comparison.py

Purpose: Compares different forecasting models and methods.

Key Features:

  • Model performance comparison
  • Statistical analysis of forecasts
  • Visualization of comparison results

scripts/analysis_and_viz/cape_results_analysis/model_forecast_analysis.py

Purpose: Analyzes model forecasting performance.

Key Features:

  • Model accuracy assessment
  • Error analysis
  • Performance metrics calculation

HVSTAT CAPE Comparison Analysis

scripts/analysis_and_viz/hvstat_cape_comp_analysis/hvstat_cape_analysis_README.md

Purpose: Documentation for HVSTAT-CAPE comparison analysis.

Key Content:

  • Analysis methodology
  • Data sources
  • Expected outputs
  • Usage instructions

scripts/analysis_and_viz/hvstat_cape_comp_analysis/data_processing.py

Purpose: Data processing utilities for comparison analysis.

scripts/analysis_and_viz/hvstat_cape_comp_analysis/plot_yield_comparison.py

Purpose: Generates yield comparison plots.

Key Features:

  • Interactive plotting
  • Statistical visualization
  • Comparison charts

scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_analysis.py

Purpose: Analyzes yield differences between methods.

scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_map_animation.py

Purpose: Creates animated maps of yield differences.

Key Features:

  • Geographic visualization
  • Temporal animation
  • Interactive maps

scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_alignment.py

Purpose: Aligns data by year for comparison.

scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_fnid_alignment.py

Purpose: Aligns data by year and FNID.

scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_prod_alignment.py

Purpose: Aligns production data by year.

scripts/analysis_and_viz/hvstat_cape_comp_analysis/requirements.txt

Purpose: Python package requirements for analysis module.

Analysis Outputs

scripts/analysis_and_viz/hvstat_cape_comp_analysis/results/

Purpose: Directory containing analysis results.

Key Files:

  • year_prod_table.csv: Yearly production comparison table
  • year_table.csv: Yearly summary table

scripts/analysis_and_viz/hvstat_cape_comp_analysis/plots/

Purpose: Directory containing generated plots and visualizations.

Key Subdirectories:

  • individual_plots/: Individual country/product plots
  • map_frames/: Geographic map frames
  • plotly_product_plots/: Interactive Plotly visualizations
  • product_plots/: Static product comparison plots


Environment and Dependencies

scripts/environment.yml

Purpose: Conda environment specification file.

Key Dependencies:

  • Core Scientific Computing: numpy, pandas, scipy, scikit-learn
  • Geospatial: geopandas, gdal, cartopy, folium
  • Machine Learning: xgboost, scikit-optimize
  • Visualization: matplotlib, plotly, bokeh
  • Data Handling: h5py, netcdf4, fiona
  • Development: jupyter, ipython

Environment Name: chafs_b

capevenv/

Purpose: Virtual environment directory.

Contents:

  • bin/: Executable files and activation scripts
  • lib64/: Python libraries
  • pyvenv.cfg: Virtual environment configuration
  • share/: Shared resources


Usage Instructions

Setting Up the Environment

  1. Create Conda Environment:

    conda env create -f scripts/environment.yml -n chafs_b
    conda activate chafs_b
    

  2. Configure Paths: Edit config/config.json to set appropriate paths for your system.

  3. Prepare Data: Ensure required data files are in the specified input directories.

Running the Workflow

  1. Preprocessing:

    cd scripts/processing
    python cape_preprocessing.py --start-from 1
    

  2. Development:

    cd scripts/development
    python cape_development.py
    

  3. Forecasting:

    cd scripts/forecasting
    python cape_forecasting.py
    

  4. Analysis:

    cd scripts/analysis_and_viz/hvstat_cape_comp_analysis
    python plot_yield_comparison.py
    

Configuration Management

  • Central Configuration: All paths are managed in config/config.json
  • Experiment Settings: Use cape_setting.csv for experiment configuration
  • Environment: Use environment.yml for dependency management

Output Management

  • Processing Outputs: Stored in data_processed/ directories
  • Development Outputs: Model files and predictions in data_out/
  • Forecasting Outputs: Viewer files and visualizations in viewer/
  • Analysis Outputs: Plots and results in analysis subdirectories

File Naming Conventions

Data Files

  • Crop Data: gscd_data_YYYYMMDD.csv (with timestamp)
  • Shapefiles: gscd_shape.gpkg
  • Processed Data: data_product_day_all.hdf
  • Model Outputs: cape_sim_<fnid>_<crop>_<season>_<model>_<exp>_<window>_L<lead>.{pkl,json,npz}
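
The model output pattern can be expressed as a small helper, sketched here under the assumption that the lead is written as a plain integer with no zero padding. The function name and the example values in the usage note are hypothetical.

```python
ALLOWED_EXTENSIONS = ("pkl", "json", "npz")  # extensions named in the pattern above


def model_output_name(fnid, crop, season, model, exp, window, lead, ext="pkl"):
    """Build a file name following the cape_sim_..._L<lead>.<ext> pattern."""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unexpected extension: {ext}")
    return f"cape_sim_{fnid}_{crop}_{season}_{model}_{exp}_{window}_L{lead}.{ext}"
```

For example, model_output_name("FNID_A", "Maize", "Main", "XGB", "YNN_ACUM_ALL", "1", 3, "json") yields cape_sim_FNID_A_Maize_Main_XGB_YNN_ACUM_ALL_1_L3.json. Because experiment names such as YNN_ACUM_ALL contain underscores themselves, splitting a file name on "_" does not recover the fields unambiguously; building names from known parts is safer than parsing them back.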

Script Files

  • Main Scripts: cape_<phase>.py (e.g., cape_preprocessing.py)
  • Utility Scripts: tools.py, cape_common.py
  • Specialized Scripts: Descriptive names (e.g., data_stream.py)

Configuration Files

  • Main Config: config.json
  • Environment: environment.yml
  • Settings: cape_setting.csv

Troubleshooting

Common Issues

  1. Path Configuration: Ensure all paths in config/config.json are correct
  2. Dependencies: Verify conda environment is properly activated
  3. Data Availability: Check that required input files exist
  4. Permissions: Ensure write permissions for output directories
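
Issues 1, 3, and 4 can be checked before a run with a short script along these lines. This is only a sketch: check_paths is a hypothetical helper, and which config keys count as inputs or output directories is left to the caller.

```python
import os
from pathlib import Path


def check_paths(config, input_keys, output_dir_keys):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    for key in input_keys:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set in config")
        elif not Path(value).exists():
            problems.append(f"{key}: path does not exist: {value}")
    for key in output_dir_keys:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set in config")
        elif not Path(value).is_dir():
            problems.append(f"{key}: not a directory: {value}")
        elif not os.access(value, os.W_OK):
            problems.append(f"{key}: directory is not writable: {value}")
    return problems
```

A call such as check_paths(config, ["fn_cropdata", "fn_shapefile"], ["dir_data_out", "dir_viewer"]) would report every problem at once instead of failing on the first one.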

Log Files

  • Processing: scripts/processing/logs/log_preprocessing.txt
  • Forecasting: scripts/logs/log_forecasting.txt
  • Development: Check individual experiment outputs
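
Logging to both a file and the console, as the processing and forecasting scripts are described as doing, can be set up with the standard logging module. The sketch below is not the project's configuration; the logger name, message format, and setup_logging helper are assumptions.

```python
import logging
from pathlib import Path


def setup_logging(log_file, name="cape"):
    """Return a logger that writes INFO and above to log_file and to the console."""
    log_path = Path(log_file)
    log_path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if needed
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Calling setup_logging("logs/log_preprocessing.txt") once at the start of a script is enough; later calls with the same name return the already configured logger.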

Error Handling

  • Safe Execution: Most scripts include error handling and logging
  • Resumable Workflows: Use --start-from and .done files for recovery
  • Validation: Scripts include input validation and error checking

Contributing

Code Organization

  • Modular Design: Each phase has its own directory and scripts
  • Configuration-Driven: Centralized configuration management
  • Documentation: Comprehensive documentation for all components

Best Practices

  • Logging: All scripts include comprehensive logging
  • Error Handling: Robust error handling and recovery mechanisms
  • Configuration: Use configuration files for path and parameter management
  • Testing: Include validation and testing utilities

This documentation provides a comprehensive overview of all files in the CAPE Replication Project. Each component is designed to be modular, configurable, and well-documented for ease of use and reproducibility.