
Cape Workflow Documentation


Background

The Cape Workflow is designed to support yield data acquisition, pre-processing, development and forecasting, and visualization. The overall process can be summarized as follows:

Yield Data Acquisition ⇒ Pre-processing ⇒ Development and Forecasting ⇒ Visualization

Yield Data Acquisition

Yield data is obtained from FEWS NET's Data Warehouse (FDW), which regularly updates crop data using agricultural reports submitted by individual countries. The FDW collects district-level harvested area and production data. However, because administrative boundaries may change over time, historical crop reporting units can be inconsistent with current boundaries. To address these issues, we:

  • Correct Spatial Inconsistencies:
    Aggregate or disaggregate time-series crop data to account for changes in administrative boundaries.

  • Calculate Yield:
    Use the corrected harvested areas and production data to derive yield estimates.

Because Earth Observation (EO) data are available only from 1981 onward, yield data are used starting in 1982. The dataset includes districts with records spanning at least 14 years, totaling 150 districts with an average record of 31 years. Examples include:

  • Kenya: 46 administrative level-1 districts (mean: 34 years)
  • Somalia: 32 administrative level-2 districts (mean: 21 years)
  • Malawi: 27 administrative level-2 districts (mean: 32 years)
  • Burkina Faso: 45 administrative level-2 districts (mean: 34 years)
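As a concrete illustration of the yield calculation step, the sketch below derives yield from corrected harvested-area and production figures. The column names and values are hypothetical and do not reflect the actual gscd_data.csv schema:

```python
import pandas as pd

# Hypothetical district records; real data comes from the FDW crop data files.
df = pd.DataFrame({
    "district": ["A", "B"],
    "harvested_area_ha": [1200.0, 800.0],
    "production_mt": [1800.0, 1000.0],
})

# Yield is derived from the corrected area and production figures.
df["yield_mt_per_ha"] = df["production_mt"] / df["harvested_area_ha"]
print(df["yield_mt_per_ha"].tolist())  # [1.5, 1.25]
```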

For more details, please refer to the source here.

Crop-Growing Season

The growing season period is derived from the Joint Research Centre’s Anomaly Hotspots of Agricultural Production (ASAP) database (Rembold et al., 2019). Key points include:

  • Data Source:
    Available at JRC ASAP.

  • Definition of Growing Season:
    The mean growing season period is determined by satellite-derived phenology from the long-term mean of 10-day MODIS NDVI data.

  • Start of Season (SOS):
    SOS is defined as the time when the NDVI reaches 25% of the seasonal ascending amplitude, verified against FAO data. The SOS is adjusted two dekads earlier to mark the beginning of crop-development processes (e.g., sowing and germination).

  • Vegetative and Reproductive Phases:

  • Vegetative Growth Period (VP): Begins at SOS and lasts for 6 dekads (60 days).
  • Reproductive Development Period (RP): Follows the VP.
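The dekad arithmetic above can be sketched as follows, assuming a 36-dekad year with 1-based indices; the function name and index convention are illustrative, not taken from the ASAP source:

```python
def season_windows(sos_dekad: int) -> dict:
    """Given a satellite-derived SOS dekad (1-36), return the adjusted season
    start, the end of the vegetative period, and the start of the
    reproductive period (all wrapping at the year boundary)."""
    # SOS is shifted two dekads earlier to capture sowing and germination.
    start = (sos_dekad - 1 - 2) % 36 + 1
    # Vegetative period: 6 dekads (60 days) beginning at the adjusted start.
    vp_end = (start - 1 + 5) % 36 + 1
    # Reproductive period begins the following dekad.
    rp_start = vp_end % 36 + 1
    return {"start": start, "vp_end": vp_end, "rp_start": rp_start}

print(season_windows(10))  # {'start': 8, 'vp_end': 13, 'rp_start': 14}
```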

Workflow Description

The Cape workflow consists of the following main stages:

  1. Pre-processing:
    Raw yield data is cleaned, transformed, and corrected for spatial inconsistencies before further analysis.

  2. Development:
    Processed data is used to develop models and generate forecasts.

  3. Forecasting and Visualization:
    Results are visualized to support decision-making and provide insights into crop performance.

This document also provides an overview of the directory structure, file descriptions, and location details for the Cape workflow. The following sections cover the specifics of the environments and file organization.


1. Data Directory Structure

Data Directory Location:

pith:/home/cape/

├── cape
│   ├── CAPE
│   │   ├── data_marty/
│   │   ├── data_nmathlouthi/
│   ├── data
│   │   ├── cropdata
│   │   │   ├── gscd_data_240213.csv
│   │   │   ├── gscd_data.csv
│   │   ├── cropmask
│   │   │   ├── geoglam_bacs_v1/
│   │   │   ├── ifpri/
│   │   ├── data_processed
│   │   ├── data_stream
│   │   ├── logs
│   │   ├── shapefile
│   │   │   ├── gscd_shape.gpkg
│   │   │   ├── gscd_shape.gpkg-shm
│   │   │   ├── gscd_shape.gpkg-wal
│   │   ├── tmp
│   ├── data_in
│   │   ├── adm_cropland_area.hdf
│   │   ├── <fnid>_crop_<crop_name>_<season>.hdf
│   │   ├── <fnid>_pred.hdf
│   ├── data_out
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>_<L0*>.pkl
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>_<L0*>.pkl
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>.npz
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>.npz
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>_<L0*>.json
│   │   ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>_<L0*>.json
│   ├── data2
│   │   ├── cropdata
│   │   │   ├── gscd_data_240213.csv
│   │   │   ├── gscd_data.csv
│   │   ├── cropmask
│   │   │   ├── geoglam_bacs_v1/
│   │   │   ├── ifpri/
│   │   ├── data_processed
│   │   ├── data_stream
│   │   ├── logs
│   │   ├── shapefile
│   │   │   ├── gscd_shape.gpkg
│   │   │   ├── gscd_shape.gpkg-shm
│   │   │   ├── gscd_shape.gpkg-wal
│   │   ├── tmp
│   ├── viewer
│   │   ├── viewer_data_sim_validation.csv
│   │   ├── viewer_data_sim.csv

Outline of Data Directory and Location

Note: As of 01/25, please use cape/data2/ for up-to-date data.

| Directory / File | Description | Notes |
| --- | --- | --- |
| cape/CAPE/data_marty/ | Reproducibility directory containing data from Marty. | |
| cape/CAPE/data_nmathlouthi/ | Reproducibility directory containing data from N. Mathlouthi. | |
| cape/data/cropdata/gscd_data_240213.csv | Crop data file with timestamp 240213. | Primary crop data file |
| cape/data/cropdata/gscd_data.csv | Crop data file without timestamp. | Secondary or merged crop data file |
| cape/data/cropmask/geoglam_bacs_v1/ | Directory containing crop mask data from Geoglam BACS v1. | |
| cape/data/cropmask/ifpri/ | Directory containing crop mask data from IFPRI. | |
| cape/data/data_processed/ | Directory for processed data outputs from the pre-processing phase. | |
| cape/data/data_stream/ | Directory for streaming data during the pre-processing phase. | |
| cape/data/logs/ | Directory for log files. | |
| cape/data/shapefile/gscd_shape.gpkg | GeoPackage file containing shapefile data needed for the Cape workflow. | |
| cape/data/shapefile/gscd_shape.gpkg-shm | Shared memory file for the GeoPackage shapefile. | |
| cape/data/shapefile/gscd_shape.gpkg-wal | Write-ahead log (WAL) file for the GeoPackage shapefile. | |
| cape/data/tmp/ | Temporary data storage directory. | |
| cape/data_in/adm_cropland_area.hdf | HDF file for administrative cropland area data. | Used for simulation during the development phase. |
| cape/data_in/<fnid>_crop_<crop_name>_<season>.hdf | HDF file for country- and admin-specific crop data. | Created during simulation in the development phase. Replace <fnid>, <crop_name>, <season>. |
| cape/data_in/<fnid>_pred.hdf | HDF file for prediction data. | Created during simulation in the development phase. Replace <fnid> as needed. |
| cape/data_out/ | Directory for simulation outputs (pkl, npz, json files). | |
| cape/data2/ | Mirror of cape/data. | Latest data directory (as of 01/25); use for reproducibility of the Cape workflow. |
| cape/viewer/viewer_data_sim_validation.csv | CSV file containing simulation validation data. | |
| cape/viewer/viewer_data_sim.csv | CSV file containing simulation data. | |

2. Production Environment

Location:
pith:/home/chc-prod/

├──cape
│   ├──code_transfer.sh  
│   ├──environment.yml  
│   ├──miniconda3  
│   ├──subx
│   ├──cape-eric  
│   ├──data          
│   ├──migration.sh     
│   ├──ryu678     
│   ├──ug
| Directory / File | Description | Notes |
| --- | --- | --- |
| cape/ | Root directory for the production environment. | |
| code_transfer.sh | Shell script for transferring code files. | |
| environment.yml | Conda environment configuration file. | Useful for reproducing the Cape workflow environment. |
| miniconda3/ | Directory containing the Miniconda installation. | |
| subx/ | Subdirectory for miscellaneous scripts/files. | |
| cape-eric/ | Reproducibility user-specific directory. | |
| data/ | Data directory for production. | No longer in use; please use cape/data2/ for any data needs. |
| migration.sh | Shell script for migration tasks. | |
| ryu678/ | User/project-specific directory. | |
| ug/ | Directory for UG-specific files. | |

3. Main Cape Directory in Production

Location:
pith:/home/chc-prod/cape

Main Directory Contents

Scripts and Configuration Files

  • cape_development.py: Python script for development processes.
  • cape_forecasting.py: Python script for forecasting processes.
  • cape_preprocessing.py: Python script for preprocessing operations.
  • cape_setting.csv: Configuration file for CAPE settings.
  • code_transfer.sh: Shell script for transferring code files.
  • run_development.sh: Shell script for running development processes.
  • run_forecasting.sh: Shell script for running forecasting workflows.
  • run_preprocessing.sh: Shell script for running preprocessing workflows.

  • Logs:

  • log_forecasting.txt: Log file for forecasting operations.
  • log_preprocessing.txt: Log file for preprocessing operations.

  • Data Directories:

  • data_in: Directory for input data used in various workflows.
  • data_out: Directory for output data generated from workflows.

  • Viewer:

  • viewer: Contains visualizations and figures.
    • figures: Stores graphical outputs.
    • Example: Angola-Maize-Main/GB/feature_importance/importance_heatmap_GB.png.
    • Forecast and hindcast visualizations are also stored.
Subdirectory: cape:
  • Code Files:
  • cape_development_sim.py: Python script for simulated development.
  • cape_graphics.py: Python script for generating graphical outputs.
  • cape_tools_sim.py: Tools for simulation operations.
    • Includes versions: .mod, .save (these may have been created during reproducibility tasks).
  • create_input_data.py: Script to create input datasets.
    • Versions: .marty, .prod (these may have been created during reproducibility tasks).
  • data_aggregation.py: Script for aggregating data.
  • data_preprocessing.py: Script for preprocessing data.
  • data_stream.py: Script for managing data streams.
  • generate_graphics.py: Generates graphics for analysis.
  • generate_viewer_sim.py: Generates viewer simulation.
Additional Files:
  • viewer_data_sim.csv: CSV file for simulated viewer data.
  • viewer_data_sim_validation.csv: CSV file for validating simulated viewer data.
Subdirectory: preprocessing:
  • Contains various scripts for preprocessing EO files/data such as:
    • atmp_fldas.py: Preprocessing script for FLDAS atmospheric data.
    • eta_ssebop_v6.py: Preprocessing script for SSEBop v6 data.
    • etos_noaa_merra.py: Preprocessing script for NOAA-MERRA evapotranspiration data.

4. Pre-processing Phase Overview

The cape_preprocessing.py script is primarily a wrapper script that logs and initiates the stages of the preprocessing workflow. Meanwhile, the cape_preprocessing.sh shell script handles execution of the pipeline, ensuring the appropriate environment is activated and the Python script is run.

  • Primary Script: cape_preprocessing.py
    • Acts as a wrapper that logs and orchestrates the workflow.
    • Ensures proper environment activation via cape_preprocessing.sh (optional).
  • Workflow Steps:
    • Data Streaming (data_stream.py):
      • Fetches raw data from external sources.
      • Streams data and saves it to the ./data_stream directory.
    • Data Preprocessing (data_preprocessing.py):
      • Cleans raw data (e.g., handling missing values, filtering outliers).
      • Transforms and normalizes data.
      • Applies spatial/temporal adjustments (e.g., interpolation, alignment with shapefiles).
      • Saves processed data to the ./data_processed directory.
    • Data Aggregation (data_aggregation.py):
      • Aggregates data (e.g., computing spatial averages, temporal means).
      • Combines multiple datasets into a unified file (e.g., .csv format).
      • Saves the aggregated output to the ./data_processed/output directory.
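The aggregation step can be sketched roughly as follows; the column names and values are illustrative and not the actual data_aggregation.py schema:

```python
import io

import pandas as pd

# Hypothetical preprocessed records: one NDVI value per district per dekad.
raw = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "dekad": [1, 2, 1, 2],
    "ndvi": [0.30, 0.40, 0.50, 0.60],
})

# Temporal mean per district, then a unified CSV output.
agg = raw.groupby("district", as_index=False)["ndvi"].mean()
buf = io.StringIO()
agg.to_csv(buf, index=False)  # in the real workflow this is a file on disk
print(buf.getvalue())
```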

5. Development/Forecasting Phase Overview

Before running cape_development.py, we need to adjust cape_development_sim.py and cape_tools_sim.py.

The development phase of the CAPE framework involves the following scripts:

  • cape_development.py [main script]
  • cape_development_sim.py [called in cape_development.py]
  • cape_tools_sim.py [called in cape_development_sim.py]

It includes both model training and prediction within a unified workflow. The main components are outlined below:

a. Experiment Configuration and Setup

  • Initial Settings:

    • Adjusting directories: in the main() function of cape_development.py [main script], adjust the following directories:

      • dir_data_in: contains the input HDF files.
      • dir_data_out: set up for user reproducibility.
      • Adjust the CAPE settings file (./cape_setting.csv) as needed.
  • Experiment settings

    • cape_setting.csv, which includes details such as country, product, season, forecast start/end months, harvest month, and experiment name.
    • fnids_info.hdf
    • Forecasting windows determine the lead times for predictions.
  • Computing Forecast Lead Times:

  • A custom function (month_range) calculates the forecast lead months and derives additional values such as countdown lists and relative lead month adjustments.
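A minimal sketch of the wrap-around behavior described for month_range; the actual implementation in cape_tools_sim.py may differ:

```python
def month_range(start: int, end: int) -> list:
    """List calendar months from start to end (inclusive), wrapping across
    the year boundary when the season spans December-January."""
    if start <= end:
        return list(range(start, end + 1))
    return list(range(start, 13)) + list(range(1, end + 1))

# A forecast window running from October through March wraps the year.
print(month_range(10, 3))  # [10, 11, 12, 1, 2, 3]
```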

b. Data Filtering and Preparation

  • FNID Selection:
    Country identifiers (FNIDs) are retrieved from fnids_info.hdf, which contains metadata (e.g., record_yield).
  • Filtering:
    FNIDs with insufficient historical records (record_yield ≤ 12) are excluded.
  • Avoiding Redundant Computation:
    The find_exist_case function filters out already processed experiments.
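The filtering step can be illustrated with made-up records; the real metadata comes from fnids_info.hdf, and the threshold matches the `record_yield > 12` condition used in cape_development.py:

```python
import pandas as pd

# Hypothetical FNID metadata (real values come from fnids_info.hdf).
sub = pd.DataFrame({
    "fnid": ["AO2008A101", "AO2008A102", "AO2008A103"],
    "record_yield": [31, 10, 14],
})

# Keep only FNIDs with more than 12 years of yield records.
sub = sub[sub["record_yield"] > 12].reset_index(drop=True)
print(sub["fnid"].tolist())  # ['AO2008A101', 'AO2008A103']
```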

c. Model Training and Forecast Execution

  • Training Phase:
    • Forecasting models (e.g., XGBoost, linear regression) are trained on historical data.
    • Techniques include Bayesian hyperparameter optimization, cross-validation, and variance threshold filtering.
    • Trained models are saved as multiple .json files and one .npz file per fnid (adm1).
  • Prediction Phase:
    • Hindcasts and forecasts are generated using trained models.
    • Both predictions and models are post-processed to restore the original data scale (using inverse transformations and trend adjustments).

d. Output Generation

  • Saving Results:
    • Forecast results and lead-time model files are saved as compressed .npz files in the output directory.
    • Users need to update the output directory (dir_data_out) to match their workspace.

Cape Development Main Script (cape_development.py)

This script organizes the forecasting experiments based on CAPE settings.

Key Tasks:

  1. Read Experiment Settings:
    Extract parameters from cape_setting.csv.

  2. Compute Lead Times:
    Use the month_range function to compute forecast lead months.

  3. Set Up Experiment Cases:
    Retrieve FNIDs from fnids_info.hdf and filter out cases with 12 or fewer records.

  4. Avoid Redundant Computation:
    Use find_exist_case to skip processed experiments.

  5. Execute Forecasting:
    Iterate over each experiment and call cape_development_sim.py with the necessary parameters.

Prerequisites

  • Python Version: Python 3.7 or higher.
  • Required Packages: pandas, numpy, sys, os, json, subprocess, itertools, time.
  • Custom Modules: cape_tools_sim (contains month_range and find_exist_case functions).

Directory Structure

├── data_in/
│   ├── fnids_info.hdf        # Input HDF file
├── data_out/                 # Output directory
├── cape_setting.csv          # Experiment settings file
├── cape/
│   ├── cape_development_sim.py  # Secondary script for forecasting
│   └── cape_tools_sim.py        # Helper functions module

Cape Development Simulation Script (cape_development_sim.py)

This script runs simulations and saves forecasting results based on CAPE model settings.

Usage:

python cape_development_sim.py --fnid=<FNID> --product_name=<product> --season_name=<season> --model_name=<model> --exp_name=<experiment> --window=<time_window>

Arguments:

| Argument | Type | Description | Required |
| --- | --- | --- | --- |
| --fnid | str | FNID identifier (adm1). | Yes |
| --product_name | str | Name of the product (e.g., Maize). | Yes |
| --season_name | str | Season name (e.g., Main, Summer). | Yes |
| --model_name | str | Model type (e.g., XGB, LR). | Yes |
| --exp_name | str | <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>. | Yes |
| --window | str | Forecast window identifier. | Yes |

What is exp_name? Example: YNN_ACUM_ALL in cape_setting.csv

  • Yield Prediction (Y).
  • No Transformation (N).
  • No Trend Removal (N).
  • Accumulated Predictors (ACUM_ALL).

  • <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>
  • Crop Indicator

    • Y: Refers to yield prediction.
    • P: Refers to production prediction.
  • Transformation Method: specifies how the data is transformed:

    • Q: Quantile transformation.
    • N: No transformation (raw data).
    • 5: Five-year moving average transformation.
  • Trend Method: Defines how trends are analyzed or removed:

    • A: Automatic trend detection (chooses the best trend model).
    • L: Linear trend removal.
    • N: No trend adjustment.
  • Predictor Configuration: Specifies which predictors are used and how they are combined:
    • ACUM_ALL: Uses accumulated predictors over the lead time. Includes variables like prcp, pdry, etos, tavg, gdd, kdd, ndvi, and year.
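A hypothetical parser for this naming convention, shown only to make the format concrete; the actual parsing inside ExperimentSettingPocket may differ:

```python
def parse_exp_name(exp_name: str) -> dict:
    """Split an exp_name such as 'YNN_ACUM_ALL' into its documented parts:
    <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>."""
    head, _, predictors = exp_name.partition("_")
    return {
        "crop_indicator": head[0],   # Y (yield) or P (production)
        "transformation": head[1],   # Q, N, or 5
        "trend_method": head[2],     # A, L, or N
        "predictors": predictors,    # e.g., ACUM_ALL
    }

print(parse_exp_name("YNN_ACUM_ALL"))
```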

Directory Paths

  • Input Directory:

    dir_data_in = './data_in'
    

  • Output Directory:

    dir_data_out = './data_out'
    

Workflow Summary

  1. Read and Parse Arguments:
    Parse command-line arguments using argparse.

  2. Output Filename Generation:
    Generate the output file name based on the parameters.

  3. Experiment Setting:
    Use ExperimentSettingPocket to configure simulation parameters.

  4. Forecast Execution:
    Run the simulation using cape_sim_build.

  5. Save Output:
    Save models and results to the output directory.

Cape Simulation Tools Script (cape_tools_sim.py)

This module includes helper functions for forecasting and data management.

Key Functions:

  1. CAPE_SIM_Reforecast:
    Reforecasts predictions for a given setup.

  2. cape_sim_build:
    Builds and executes the CAPE simulation workflow.

  3. load_input_data:
    Loads Earth observation and crop data.

  4. ExperimentSettingPocket:
    Configures experiment settings.

Supporting Functions:

  • Data Preparation: CropDataControl, EODataControl
  • Forecasting: GenerateSeriesLeadPredTable, cape_sim_build_prediction
  • Utility: month_range, CheckLeadPred, CombSerialLead

Configuration Adjustments

  1. Directory Paths:

    dir_data_in = '/home/cape/CAPE/narjes/cape_workflow/data_in/'
    dir_data_out = '/home/cape/CAPE/data_nmathlouthi/data_out/'
    

  2. Dependencies:

    pip install numpy pandas geopandas scipy statsmodels scikit-learn xgboost scikit-optimize
    

  3. File Validation:

  4. Ensure necessary files exist in data_in.

Cape Development Script cape_development.py

cape_development.py is a Python script that orchestrates a series of forecasting experiments based on CAPE settings. The script reads experiment configurations from a CSV file, computes forecast lead times, filters experiments based on available records, and then executes each experiment via a secondary script.

cape_development.py performs the following tasks:

  1. Reads Experiment Settings:
    Reads the cape_setting.csv file to extract experiment parameters such as country, product, season, forecast start/end months, harvest month, and experiment name.

  2. Computes Lead Times:
    Uses the custom function month_range to compute a list of forecast lead months and creates additional derived values (such as a countdown list and relative lead month adjustments).

  3. Sets Up Experiment Cases:
    Reads additional information from an HDF file (fnids_info.hdf) to retrieve experiment identifiers (FNIDs). It then filters out cases that do not meet the minimum record threshold (cases are kept only if record_yield > 12).

  4. Avoids Redundant Computation:
    Uses a custom function find_exist_case to check which experiments have already been processed and filters them out.

  5. Executes Forecasting Experiments:
    Iterates over each remaining experiment and calls a secondary script (cape/cape_development_sim.py) using command-line arguments that pass the necessary parameters.

Prerequisites

  • Python Version: Python 3.7 or higher is recommended.
  • Required Python Packages:
    • pandas
    • numpy
    • Standard libraries: sys, os, json, subprocess, itertools, and time.
  • Custom Modules:
    • cape.cape_tools_sim — This module should include the functions month_range and find_exist_case.

Tip: Use the project virtual environment (venv) to manage dependencies.

Input Files and Settings

├── data_in/
│   ├── `fnids_info.hdf`        # Input HDF file with FNID information
│   └── (other input files)
├── data_out/                 # Output directory where results will be stored
├── cape_setting.csv          # CSV file containing experiment configurations
├── cape/
│   ├── cape_development_sim.py  # Script to run individual forecasting experiments
│   └── cape_tools_sim.py        # Contains custom helper functions (e.g., month_range, find_exist_case)

cape_setting.csv

  • Contains key configuration columns such as:
    • country
    • product
    • season
    • model_name
    • forecast_start_month
    • forecast_end_month
    • harvest_month
    • exp_name

Notes Based on Cape Notes

Manual Configuration of Forecasting Windows: The documentation states that the forecasting window (i.e., the start and end months for the forecast) can be "added or modified in the CAPE settings file (cape_setting.csv)."

  • Question: do users manually adjust these values?

The planting_month and harvest_month values in cape_setting.csv are copied from the crop data file.

  • Question: do users manually adjust these values?

fnids_info.hdf

  • An HDF file in /home/cape/data_in/fnids_info.hdf that contains FNIDs along with metadata such as country_code, product, season_name, and record_yield.
  • Make sure the file is in /home/cape/data_in/fnids_info.hdf directory and that it contains the necessary fields for filtering (e.g., record_yield > 12).
  • Description: This HDF file contains FNIDs along with relevant metadata fields such as: country_code, country, fnid, name, product, season_name, planting_month, harvest_month, record_area, record_production, and record_yield.
  • Dataset Structure: The file structure is organized under the df group and includes several sub-datasets:
    • axis0: Lists column names (fields).
    • block0_items: Contains metadata fields (e.g., country_code, product).
    • block1_items: Contains numerical fields (record_area, record_production, record_yield).
    • block1_values: Stores the corresponding values for the fields in block1_items.

Output Directory (data_out/)

  • The results of the experiments are saved here.
  • Adjustment:
    New users need to change this path to their output directory.

Configuration Adjustments

When replicating or modifying the experiment, new users need to adjust the following parts of the code:

  1. Directory Paths:

    • Input Directory:
      dir_data_in = './data_in/'
    • Output Directory:
      dir_data_out = './data_out/'  # Adjust as needed (e.g., an absolute path)

  2. CSV Experiment Settings:

    • Ensure that the values in cape_setting.csv (like forecast_start_month, forecast_end_month, and harvest_month) correctly reflect your experiment design.
    • Verify that no duplicates exist for the key columns: country, product, season, and model_name.

  3. Filtering Criteria in FNID Selection:

    • The code filters out FNIDs with 12 or fewer yield records:
      sub = sub[sub['record_yield'] > 12].reset_index(drop=True)
    • Adjustment Needed: Modify the threshold if your data requires a different minimum record count.

  4. Forecasting Command Construction:

    • The script constructs a command to run cape_development_sim.py:
      command = "python cape/cape_development_sim.py --fnid=%s --product_name=%s --season_name=%s --model_name=%s --exp_name=%s --window=%s" % (
          fnid, product_name, season_name, model_name, exp_name, window)
    • Adjustment Needed: If you change the parameters or the name/location of the forecasting script, update this command accordingly.

  5. Custom Helper Functions:

    • The functions month_range and find_exist_case are used to process dates and check for existing results.
    • Adjustment Needed: If you wish to change how lead months or duplicate experiments are handled, edit these functions in the cape/cape_tools_sim.py module.
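The command-construction step can also be sketched with an argument list, which avoids shell-quoting issues; the parameter values below are illustrative:

```python
import subprocess

# Example experiment parameters (illustrative values).
params = {
    "fnid": "AO2008A101", "product_name": "Maize", "season_name": "Main",
    "model_name": "XGB", "exp_name": "YNN_ACUM_ALL", "window": "1004",
}

# Build the command as a list rather than a formatted string.
command = ["python", "cape/cape_development_sim.py"] + [
    "--%s=%s" % (k, v) for k, v in params.items()
]
print(command)
# subprocess.run(command, check=True)  # executed once per experiment in the loop
```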

Running the Script

  1. Activate Your Virtual Environment (if applicable):
    source .python_venv/bin/activate
    

Cape Development Simulation Script cape_development_sim.py

This script (cape_development_sim.py) is designed to run simulations and save forecasting results based on the CAPE model settings.

Usage

Run the script with the required arguments:

python cape_development_sim.py --fnid=<FNID> --product_name=<product> --season_name=<season> --model_name=<model> --exp_name=<experiment> --window=<time_window>

Example Command:

python cape_development_sim.py --fnid=AO2008A101 --product_name=Maize --season_name=Main --model_name=XGB --exp_name=YNN_ACUM_ALL --window=1004

Arguments

| Argument | Type | Description | Required |
| --- | --- | --- | --- |
| --fnid | str | The FNID identifier. | Yes |
| --product_name | str | Name of the product (e.g., Maize). | Yes |
| --season_name | str | Season name (e.g., Main). | Yes |
| --model_name | str | Model to be used (e.g., XGB or LR). | Yes |
| --exp_name | str | Experiment name. | Yes |
| --window | str | Forecast window identifier. | Yes |
| --note | str | Optional note for the run. | No |

Directory Paths

  • Input Directory:

    dir_data_in = './data_in'
    
    Ensure that the input data files are present in this directory or adjust the path if necessary (e.g., /home/cape/CAPE/data_in/).

  • Output Directory:

    dir_data_out = '/home/cape/CAPE/data_nmathlouthi/data_out/'
    
    This path should be updated for each user to point to their dedicated output directory.


Workflow Overview

  1. Read and Parse Arguments:
    The script parses command-line arguments using the argparse library.

  2. Output Filename Generation:
    An output file name is generated based on the parameters:

    exp_string = '%s_%s_%s_%s_%s_%s' % (fnid, product_name, season_name, model_name, exp_name, window)
    

  3. Experiment Setting:
    The ExperimentSettingPocket class is used to configure the simulation parameters.

  4. Forecasting Execution:
    The script runs the simulation through the cape_sim_build function, which generates forecasting results.

  5. Save Output:
    The prediction models and results are saved to the output directory:

    • Models (.json or .pkl) for each lead time.
    • Compressed results (.npz file).
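Steps 1-2 of the workflow can be sketched as follows; the argument values are illustrative:

```python
import argparse

# Parse the required command-line arguments (step 1).
parser = argparse.ArgumentParser()
for name in ["fnid", "product_name", "season_name", "model_name",
             "exp_name", "window"]:
    parser.add_argument("--%s" % name, required=True)
args = parser.parse_args([
    "--fnid=AO2008A101", "--product_name=Maize", "--season_name=Main",
    "--model_name=XGB", "--exp_name=YNN_ACUM_ALL", "--window=1004",
])

# Build the output file stem from the parameters (step 2).
exp_string = '%s_%s_%s_%s_%s_%s' % (
    args.fnid, args.product_name, args.season_name,
    args.model_name, args.exp_name, args.window)
print(exp_string)  # AO2008A101_Maize_Main_XGB_YNN_ACUM_ALL_1004
```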

Output Files

  • Model Files:
    Depending on the model type, the script saves either JSON or PKL files:

    'cape_sim_<exp_string>_L<lead_time>.json'
    'cape_sim_<exp_string>_L<lead_time>.pkl'
    

  • Compressed Results:

    'cape_sim_<exp_string>.npz'
    


What Needs Adjustment

  1. Directory Paths:
    Adjust dir_data_in and dir_data_out based on your file locations.

  2. Dependencies:
    Ensure the required Python libraries are installed:

    pip install numpy pandas joblib
    

  3. File Existence:
    Verify that necessary files and directories exist (e.g., the data_in directory and data files).

  4. Error Handling:
    You may want to add error handling to manage missing files or invalid arguments.


Adjustments

Modify paths like so:

dir_data_in = ' '
dir_data_out = '/home/cape/CAPE/data_nmathlouthi/data_out/'

Dependencies

  • Python 3.x
  • Libraries: numpy, pandas, joblib, argparse

Cape Development Simulation Tools Script cape_tools_sim.py

The CAPE simulation tools module includes functions for forecasting crop yields, managing data, and generating predictions. Below is an explanation of key functions, workflows, and required adjustments.


Key Functions

1. CAPE_SIM_Reforecast

  • Purpose: Reforecast predictions for a given experiment setup.
  • Parameters:
    • rp: Dictionary containing input parameters (e.g., fnid, product_name).
    • dir_data_in: Path to the input data directory.
    • dir_data_out: Path to the output data directory.
  • Output: Returns reforecasted and merged prediction results.

2. cape_sim_build

  • Purpose: Build and execute the CAPE simulation.
  • Parameters:
    • bp: Dictionary containing simulation settings.
    • dir_data_in: Input data directory.
  • Workflow:
    • Load input data using load_input_data.
    • Generate forecast time tables and run predictions.
    • Restore data to its original scale.

3. load_input_data

  • Purpose: Load Earth observation and crop data for simulations.
  • Parameters:
    • ubp: Dictionary containing crop and indicator data settings.
    • dir_data_in: Input directory path.
  • Output: Returns loaded crop data (box), time series data (y), and formatted data (df).

4. ExperimentSettingPocket

  • Purpose: Generate settings for an experiment based on the exp_name and window.
  • Parameters:
    • exp_name: Experiment name (e.g., YNN_ACUM_ALL).
    • window: Forecast window (e.g., 1004).
  • Output: Dictionary containing the indicator name, transformation method, trend method, and lead predictors.

Important Functions Overview

Data Preparation Functions

  • CropDataControl: Handles crop data processing, including outlier removal, trend analysis, and data transformation.
  • EODataControl: Controls the aggregation and resampling of Earth observation data.

Forecasting Functions

  • GenerateSeriesLeadPredTable: Generates a table of lead-time predictors for forecasting.
  • cape_sim_build_prediction: Executes the forecasting process using regression models.

Supporting Functions

  • month_range: Creates a list of months within a given range, supporting year wrap-around.
  • CheckLeadPred: Validates that the lead predictors meet expected format and constraints.
  • CombSerialLead: Generates serial lead combinations for time series forecasting.

Workflow Summary

  1. Initial Setup and Parameter Validation
  2. Read and validate input parameters from the experiment settings (ExperimentSettingPocket).

  3. Data Loading and Preparation

  4. Load crop and Earth observation data using load_input_data.
  5. Process data for lead-time predictors and time windows.

  6. Forecasting and Prediction

  7. Use regression models (e.g., XGBRegressor, LinearRegression) to forecast crop yields.
  8. Perform cross-validation (LeaveOneGroupOut, TimeSeriesSplit) and hyperparameter tuning (HyperparameterTuning).

  9. Output Generation

  10. Save prediction models and forecast results to the output directory (dir_data_out).

Dependencies

  • Python 3.x
  • Libraries:
  • numpy
  • pandas
  • geopandas
  • scipy
  • statsmodels
  • sklearn
  • xgboost
  • skopt

Directory Paths

  • Input Directory: Ensure the following files are in the input directory (dir_data_in):
  • Crop data files (e.g., fnid_crop_<product>_<season>.hdf).
  • Earth observation data (e.g., fnid_pred.hdf).

  • Output Directory: Update the dir_data_out path to reflect the user's dedicated output directory.


Adjustments

  1. Directory Paths:
    Modify dir_data_in and dir_data_out as needed:

    dir_data_in = '/home/cape/CAPE/narjes/cape_workflow/data_in/'
    dir_data_out = '/home/cape/CAPE/data_nmathlouthi/data_out/'
    

  2. Dependencies:
    Ensure the required Python libraries are installed:

    pip install numpy pandas geopandas scipy statsmodels scikit-learn xgboost scikit-optimize
    

  3. File Validation:
    Add checks to ensure necessary files and directories exist:

    import os

    if not os.path.exists(os.path.join(dir_data_in, 'fnid_crop_<product>_<season>.hdf')):
        raise FileNotFoundError("Input crop data file not found.")
    


Sample Run

Run the script with appropriate parameters:

python cape_development_sim.py --fnid=AO2008A101 --product_name=Maize --season_name=Main --model_name=XGB --exp_name=YNN_ACUM_ALL --window=1004

This will initiate the simulation and save results in the specified output directory.

6. Forecasting & Visualization Phase

Relies on five scripts:

  • cape_forecasting.py [main script]
  • create_input_data.py [called within cape_forecasting.py]
  • generate_viewer_sim.py [called within cape_forecasting.py]
  • cape_tools_sim.py [called within cape_forecasting.py]
  • tools.py [called within cape_forecasting.py]

cape_forecasting.py

Configuration and Setup:

  • Reads the experiment settings from a CSV file (cape_setting.csv)
  • Checks for duplicate entries in key columns to ensure data integrity.
  • Computes the lead months based on the forecast start and end months and adjusts them relative to the harvest month (including handling year wrap-around).
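The duplicate check and lead-month computation described above can be sketched with pandas. The column names below are illustrative placeholders, not the actual cape_setting.csv schema; the modulo trick is one common way to handle the year wrap-around.

```python
import pandas as pd

# Hypothetical settings table; column names are illustrative only.
settings = pd.DataFrame({
    "fnid": ["KE2019A1", "KE2019A2"],
    "harvest_month": [8, 2],     # August, February
    "forecast_month": [3, 11],   # March, November
})

# Duplicate check on a key column to ensure data integrity
assert not settings["fnid"].duplicated().any(), "Duplicate entries in key column"

# Lead months from forecast month to harvest month; the modulo handles
# harvests that fall in the next calendar year (year wrap-around)
settings["lead_months"] = (settings["harvest_month"] - settings["forecast_month"]) % 12
print(settings["lead_months"].tolist())  # [5, 3]
```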

Directory and File Setup:

  • Specifies directories for input data, processed output data, and viewer outputs.
  • Defines file paths for processed data, crop data, and the required shapefile.

Forecasting Workflow:

  • Uses a safe function wrapper to call:
    • create_input_data: Prepares the necessary input data from the specified files and settings.
    • generate_viewer_sim: Executes the forecasting process and generates simulation data for the viewer.
  • (Note: The visualization step using generate_graphics is currently deprecated.)

Logging and Timing:

  • Logs the start and end times, along with key process steps, to both a log file and the console.
  • Provides a runtime summary when the process completes.
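A "safe function wrapper" with dual file-and-console logging could look like the sketch below. This is an assumed pattern, not the actual implementation in cape_forecasting.py; the log filename and handler setup are placeholders.

```python
import logging
import sys
import time

# Log to both a file and the console (placeholder filename)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.FileHandler("cape_forecasting.log"),
              logging.StreamHandler(sys.stdout)],
)

def safe_call(func, *args, **kwargs):
    """Run one workflow step; log any failure instead of crashing the run."""
    try:
        return func(*args, **kwargs)
    except Exception:
        logging.exception("Step %s failed", getattr(func, "__name__", "step"))
        return None

start = time.time()
result = safe_call(lambda: 1 + 1)  # stand-in for create_input_data / generate_viewer_sim
logging.info("Runtime summary: %.2f s", time.time() - start)
```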

8. Reproducibility Overview

Environment Setup

  1. Access the Host Location:
    Ensure you have access to the following location:
    pith:/home/chc-prod/
    This is required to use the CAPE environment.

  2. Setting Up the Python Environment:

    • Using Conda:
      Use the provided environment.yml file to create a dedicated Conda environment:
      conda env create -f environment.yml -n cape_env
      conda activate cape_env

    • Alternative Virtual Environment:
      If you prefer to work with a virtual environment, generate a requirements.txt file based on the environment.yml file and install the packages:
      # Export the environment (if needed)
      conda env export -n cape_env > environment.yml
      # Create a requirements.txt file from environment.yml manually, then:
      pip install -r requirements.txt

Data Files and Directories

  • Shapefile and Crop Data:
    The shapefile (gscd_shape.gpkg) and crop data (gscd_data_240213.csv) are obtained from FDW. These files are used in all phases of the CAPE workflow (from preprocessing to forecasting).
  • If you are starting from the development phase, locate these files in:

    • ./data2/cropdata/gscd_data_240213.csv
    • ./data2/cropdata/gscd_shape.gpkg
      These files are based on the latest acquisition data from February 13th, 2024.
  • Preprocessing Data:
    If you are using the existing files in this directory, you do not need to create new directories for preprocessing data. If you are reprocessing data for a new country, replicate the data_in directory to store the adm_cropland_area.hdf file generated during preprocessing. Otherwise, this file is available in the cape/data_in directory (within the data_processed and data_stream directories) and is re-aggregated when reproducibility starts from the preprocessing phase.
    Important: If you are replicating the preprocessing step, replicate these directories so as not to overwrite the actual data.

  • Development Phase:
    During the development phase, you will rely on these files to initiate development. Here you will need to create a data_out directory that will contain the model outputs for the newly added country.

  • The process takes approximately 4-5 hours.

  • Forecasting Phase:
    For the forecasting phase, you will need to duplicate the viewer directory.

  • Directory Settings in the Code:
    Update directories and filenames as follows (adjust paths as needed):

# Set directories and filenames
dir_data_in = "./cape/data_in/"  # simulation output; change to your own path
dir_data_out = "./cape/CAPE/data_nmathlouthi/data_out/"
dir_viewer   = "./cape/CAPE/data_nmathlouthi/viewer"
fn_data_processed = "./cape/data2/data_processed/output/data_product_day_all.hdf"  # only if starting from preprocessing stage, otherwise use the existing file in data2/data_processed/output/data_product_day_all.hdf
fn_cropdata = "./cape/data2/cropdata/gscd_data_240213.csv"  # you may use the existing file
fn_shapefile = "./cape/data2/cropdata/gscd_shape.gpkg"  # you may use the existing file

Updating Scripts for a New Country

For all scripts, directory adjustments are required. Additionally, to add a new country to CAPE, you must update the main() function in the following scripts:

  • cape_processing.py
  • cape_development.py
  • cape_forecasting.py

Example Update for Country Filtering:

Inside the main() function, after loading cape_settings.csv, add:

import pandas as pd

table = pd.read_csv('path/to/cape_settings.csv')
# Filter to process data only for the chosen country (e.g., South Africa)
table = table[table['country'] == "South Africa"]

Replace "South Africa" with the desired country name. This ensures that the scripts run only for that one country.