Cape Workflow Documentation¶
Table of Contents¶
- Background
- Yield Data Acquisition
- Crop-Growing Season
- Workflow Description
- 1. Data Directory Structure
- 2. Production Environment
- 3. Main Cape Directory in Production
  - Main Directory Contents
  - Subdirectory: cape
  - Additional Files
  - Subdirectory: preprocessing
- 4. Pre-processing Phase Overview
- 5. Development/Forecasting Phase Overview
- 6. Forecasting and Visualization Phase Overview
- 7. Reproducibility Overview
Background¶
The Cape Workflow is designed to support yield data acquisition, pre-processing, development and forecasting, and visualization. The overall process can be summarized as follows:
Yield Data Acquisition ⇒ Pre-processing ⇒ Development and Forecasting ⇒ Visualization
Yield Data Acquisition¶
Yield data is obtained from the FEWS NET’s Data Warehouse (FDW), which regularly updates crop data using agricultural reports submitted by individual countries. The FDW collects district-level harvested area and production data. However, because administrative boundaries may change over time, historical crop reporting units can be inconsistent with current boundaries. To address these issues, we:
- Correct Spatial Inconsistencies: Aggregate or disaggregate time-series crop data to account for changes in administrative boundaries.
- Calculate Yield: Use the corrected harvested areas and production data to derive yield estimates.
Since all Earth Observation (EO) data are available beginning in 1981, yield data is used starting from 1982. The dataset includes districts with records spanning at least 14 years, totaling 150 districts with an average record of 31 years. Examples include:
- Kenya: 46 administrative level-1 districts (mean: 34 years)
- Somalia: 32 administrative level-2 districts (mean: 21 years)
- Malawi: 27 administrative level-2 districts (mean: 32 years)
- Burkina Faso: 45 administrative level-2 districts (mean: 34 years)
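As a rough illustration, the yield derivation and record-length filter described above can be sketched with pandas (the column names and the toy threshold below are assumptions for illustration, not the actual FDW schema):

```python
import pandas as pd

# Hypothetical district-level records; real FDW extracts use different columns.
df = pd.DataFrame({
    "fnid": ["A", "A", "B", "B"],
    "year": [1982, 1983, 1982, 1983],
    "harvested_area": [100.0, 120.0, 80.0, 0.0],
    "production": [150.0, 200.0, 90.0, 0.0],
})

# Yield = production / harvested area (guard against zero-area records).
df["yield"] = df["production"].where(df["harvested_area"] > 0) / df["harvested_area"]

# Keep districts with a sufficiently long record (the workflow uses >= 14 years;
# a smaller threshold is used here for the toy dataset).
min_years = 2
record_len = df.dropna(subset=["yield"]).groupby("fnid")["year"].nunique()
keep = record_len[record_len >= min_years].index
df_kept = df[df["fnid"].isin(keep)]
```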
For more details, please refer to the source here.
Crop-Growing Season¶
The growing season period is derived from the Joint Research Centre’s Anomaly Hotspots of Agricultural Production (ASAP) database (Rembold et al., 2019). Key points include:
- Data Source: Available at JRC ASAP.
- Definition of Growing Season: The mean growing season period is determined by satellite-derived phenology from the long-term mean of 10-day MODIS NDVI data.
- Start of Season (SOS): SOS is defined as the time when the NDVI reaches 25% of the seasonal ascending amplitude, verified against FAO data. The SOS is adjusted two dekads earlier to mark the beginning of crop-development processes (e.g., sowing and germination).
- Vegetative and Reproductive Phases:
  - Vegetative Growth Period (VP): Begins at SOS and lasts for 6 dekads (60 days).
  - Reproductive Development Period (RP): Follows the VP.
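The dekad arithmetic described above can be sketched as follows (an illustrative helper, not the ASAP phenology code; dekads are numbered 1-36 within a year):

```python
DEKADS_PER_YEAR = 36  # 36 ten-day periods (dekads) per year

def shift_dekad(dekad, offset):
    """Shift a 1-based dekad index by `offset`, wrapping around the year."""
    return (dekad - 1 + offset) % DEKADS_PER_YEAR + 1

def season_windows(sos):
    """Return (adjusted SOS, vegetative-period dekads, start of reproductive period)."""
    sos_adj = shift_dekad(sos, -2)                     # SOS moved two dekads earlier
    vp = [shift_dekad(sos_adj, i) for i in range(6)]   # VP lasts 6 dekads (60 days)
    rp_start = shift_dekad(sos_adj, 6)                 # RP follows the VP
    return sos_adj, vp, rp_start

# e.g. an SOS in the 10th dekad (early April):
sos_adj, vp, rp_start = season_windows(10)
```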
Workflow Description¶
The Cape workflow consists of the following main stages:
- Pre-processing: Raw yield data is cleaned, transformed, and corrected for spatial inconsistencies before further analysis.
- Development: Processed data is used to develop models and generate forecasts.
- Forecasting and Visualization: Results are visualized to support decision-making and provide insights into crop performance.
This document also provides an overview of the directory structure, file descriptions, and location details for the Cape workflow. The following sections cover the specifics of the environments and file organization.
1. Data Directory Structure¶
Data Directory Location:¶
pith:/home/cape/
├── cape
│ ├── CAPE
│ │ ├── data_marty/
│ │ ├── data_nmathlouthi/
│ ├── data
│ │ ├── cropdata
│ │ │ ├── gscd_data_240213.csv
│ │ │ ├── gscd_data.csv
│ │ ├── cropmask
│ │ │ ├── geoglam_bacs_v1/
│ │ │ ├── ifpri/
│ │ ├── data_processed
│ │ ├── data_stream
│ │ ├── logs
│ │ ├── shapefile
│ │ │ ├── gscd_shape.gpkg
│ │ │ ├── gscd_shape.gpkg-shm
│ │ │ ├── gscd_shape.gpkg-wal
│ │ ├── tmp
│ ├── data_in
│ │ ├── adm_cropland_area.hdf
│ │ ├── <fnid>_crop_<crop_name>_<season>.hdf
│ │ ├── <fnid>_pred.hdf
│ ├── data_out
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>_<L0*>.pkl
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>_<L0*>.pkl
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>.npz
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>.npz
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_LR_<exp_name>_<window>_<L0*>.json
│ │ ├── cape_sim_<fnid>_<crop_name>_<season>_XGB_<exp_name>_<window>_<L0*>.json
│ ├── data2
│ │ ├── cropdata
│ │ │ ├── gscd_data_240213.csv
│ │ │ ├── gscd_data.csv
│ │ ├── cropmask
│ │ │ ├── geoglam_bacs_v1/
│ │ │ ├── ifpri/
│ │ ├── data_processed
│ │ ├── data_stream
│ │ ├── logs
│ │ ├── shapefile
│ │ │ ├── gscd_shape.gpkg
│ │ │ ├── gscd_shape.gpkg-shm
│ │ │ ├── gscd_shape.gpkg-wal
│ │ ├── tmp
│ ├── viewer
│ │ ├── viewer_data_sim_validation.csv
│ │ ├── viewer_data_sim.csv
Outline of Data Directory and Location¶
Note: As of 01/25, please use cape/data2/ for up-to-date data.
| Directory / File | Description | Notes |
|---|---|---|
| cape/CAPE/data_marty/ | Reproducibility directory containing data from Marty. | |
| cape/CAPE/data_nmathlouthi/ | Reproducibility directory containing data from N. Mathlouthi. | |
| cape/data/cropdata/gscd_data_240213.csv | Crop data file with timestamp 240213. | Primary crop data file |
| cape/data/cropdata/gscd_data.csv | Crop data file without timestamp. | Secondary or merged crop data file |
| cape/data/cropmask/geoglam_bacs_v1/ | Directory containing crop mask data from Geoglam BACS v1. | |
| cape/data/cropmask/ifpri/ | Directory containing crop mask data from IFPRI. | |
| cape/data/data_processed/ | Directory for processed data outputs from the pre-processing phase. | |
| cape/data/data_stream/ | Directory for streaming data during the pre-processing phase. | |
| cape/data/logs/ | Directory for log files. | |
| cape/data/shapefile/gscd_shape.gpkg | GeoPackage file containing shapefile data needed for the Cape workflow. | |
| cape/data/shapefile/gscd_shape.gpkg-shm | Shared memory file for the GeoPackage shapefile. | |
| cape/data/shapefile/gscd_shape.gpkg-wal | Write-ahead log (WAL) file for the GeoPackage shapefile. | |
| cape/data/tmp/ | Temporary data storage directory. | |
| cape/data_in/adm_cropland_area.hdf | HDF file for administrative cropland area data. | Used for simulation during the development phase. |
| cape/data_in/<fnid>_crop_<crop_name>_<season>.hdf | HDF file for country- and admin-specific crop data. | Files created during simulation in the development phase. Replace <fnid>, <crop_name>, <season>. |
| cape/data_in/<fnid>_pred.hdf | HDF file for prediction data. | Files created during simulation in the development phase. Replace <fnid> as needed. |
| cape/data_out/ | Directory for simulation outputs (pkl, npz, json files). | |
| cape/data2/ | Mirror of cape/data; latest data directory (as of 01/25). | Use this directory for reproducibility of the Cape workflow. |
| cape/viewer/viewer_data_sim_validation.csv | CSV file containing simulation validation data. | |
| cape/viewer/viewer_data_sim.csv | CSV file containing simulation data. | |
2. Production Environment¶
Location:
pith:/home/chc-prod/
├──cape
│ ├──code_transfer.sh
│ ├──environment.yml
│ ├──miniconda3
│ ├──subx
│ ├──cape-eric
│ ├──data
│ ├──migration.sh
│ ├──ryu678
│ ├──ug
| Directory / File | Description | Notes |
|---|---|---|
| cape/ | Root directory for the production environment. | |
| code_transfer.sh | Shell script for transferring code files. | |
| environment.yml | Conda environment configuration file. | Useful for reproducing the Cape workflow environment. |
| miniconda3/ | Directory containing the Miniconda installation. | |
| subx/ | Subdirectory for miscellaneous scripts/files. | |
| cape-eric/ | Reproducibility user-specific directory. | |
| data/ | Data directory for production. | No longer in use. Please use cape/data2/ for any data needs. |
| migration.sh | Shell script for migration tasks. | |
| ryu678/ | User/project-specific directory. | |
| ug/ | Directory for UG-specific files. | |
3. Main Cape Directory in Production¶
Location:
pith:/home/chc-prod/cape
Main Directory Contents¶
Scripts and Configuration Files¶
- cape_development.py: Python script for development processes.
- cape_forecasting.py: Python script for forecasting processes.
- cape_preprocessing.py: Python script for preprocessing operations.
- cape_setting.csv: Configuration file for CAPE settings.
- code_transfer.sh: Shell script for transferring code files.
- run_development.sh: Shell script for running development processes.
- run_forecasting.sh: Shell script for running forecasting workflows.
- run_preprocessing.sh: Shell script for running preprocessing workflows.
- Logs:
  - log_forecasting.txt: Log file for forecasting operations.
  - log_preprocessing.txt: Log file for preprocessing operations.
- Data Directories:
  - data_in: Directory for input data used in various workflows.
  - data_out: Directory for output data generated from workflows.
- Viewer:
  - viewer: Contains visualizations and figures.
  - figures: Stores graphical outputs.
    - Example: Angola-Maize-Main/GB/feature_importance/importance_heatmap_GB.png.
    - Forecast and hindcast visualizations are also stored.
Subdirectory: cape¶
- Code Files:
  - cape_development_sim.py: Python script for simulated development.
  - cape_graphics.py: Python script for generating graphical outputs.
  - cape_tools_sim.py: Tools for simulation operations.
    - Includes versions .mod and .save (these may have been created during reproducibility tasks).
  - create_input_data.py: Script to create input datasets.
    - Includes versions .marty and .prod (these may have been created during reproducibility tasks).
  - data_aggregation.py: Script for aggregating data.
  - data_preprocessing.py: Script for preprocessing data.
  - data_stream.py: Script for managing data streams.
  - generate_graphics.py: Generates graphics for analysis.
  - generate_viewer_sim.py: Generates viewer simulation data.
Additional Files:¶
- viewer_data_sim.csv: CSV file for simulated viewer data.
- viewer_data_sim_validation.csv: CSV file for validating simulated viewer data.
Subdirectory: preprocessing¶
- Contains various scripts for preprocessing EO files/data, such as:
  - atmp_fldas.py: Preprocessing script for FLDAS atmospheric data.
  - eta_ssebop_v6.py: Preprocessing script for SSEBop v6 data.
  - etos_noaa_merra.py: Preprocessing script for NOAA-MERRA evapotranspiration data.
4. Pre-processing Phase Overview¶
The cape_preprocessing.py script is primarily a wrapper that logs and initiates the stages of the preprocessing workflow, while the run_preprocessing.sh shell script handles execution of the pipeline, ensuring the appropriate environment is activated and the Python script is run.
- Primary Script: cape_preprocessing.py
  - Acts as a wrapper that logs and orchestrates the workflow.
  - Ensures proper environment activation via run_preprocessing.sh (optional).
- Workflow Steps:
  - Data Streaming (data_stream.py):
    - Fetches raw data from external sources.
    - Streams data and saves it to the ./data_stream directory.
  - Data Preprocessing (data_preprocessing.py):
    - Cleans raw data (e.g., handling missing values, filtering outliers).
    - Transforms and normalizes data.
    - Applies spatial/temporal adjustments (e.g., interpolation, alignment with shapefiles).
    - Saves processed data to the ./data_processed directory.
  - Data Aggregation (data_aggregation.py):
    - Aggregates data (e.g., computing spatial averages, temporal means).
    - Combines multiple datasets into a unified file (e.g., .csv format).
    - Saves the aggregated output to the ./data_processed/output directory.
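The wrapper pattern described above can be sketched as follows (a hedged sketch: the stage functions here are placeholders standing in for the real data_stream, data_preprocessing, and data_aggregation modules, which are not reproduced in this document):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("cape_preprocessing")

def run_stage(name, func):
    """Log a stage's start, runtime, and completion; exceptions propagate."""
    log.info("starting %s", name)
    t0 = time.time()
    result = func()
    log.info("finished %s in %.1fs", name, time.time() - t0)
    return result

def data_stream():         # placeholder: fetch raw data into ./data_stream
    return "streamed"

def data_preprocessing():  # placeholder: clean/transform into ./data_processed
    return "processed"

def data_aggregation():    # placeholder: aggregate into ./data_processed/output
    return "aggregated"

results = [run_stage(n, f) for n, f in [
    ("data_stream", data_stream),
    ("data_preprocessing", data_preprocessing),
    ("data_aggregation", data_aggregation),
]]
```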
5. Development/Forecasting Phase Overview¶
Before running cape_development.py, adjust cape_development_sim.py and cape_tools_sim.py.
The development phase of the CAPE framework involves the following scripts:
- cape_development.py [main script]
- cape_development_sim.py [called in cape_development.py]
- cape_tools_sim.py [called in cape_development_sim.py]
It covers both model training and forecast generation within a unified workflow. The main components are outlined below:
a. Experiment Configuration and Setup¶
- Initial Settings:
  - Adjusting directories: in the main() function of cape_development.py (the main script), adjust the following directories:
    - dir_data_in: contains the input HDF files.
    - dir_data_out: set up for user reproducibility.
    - Adjust the path to the CAPE settings file (./cape_setting.csv) as needed.
  - Experiment settings:
    - cape_setting.csv: includes details such as country, product, season, forecast start/end months, harvest month, and experiment name.
    - fnids_info.hdf
    - Forecasting windows determine the lead times for predictions.
- Computing Forecast Lead Times:
  - A custom function (month_range) calculates the forecast lead months and derives additional values such as countdown lists and relative lead month adjustments.
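The actual month_range implementation lives in cape_tools_sim and is not reproduced in this document; a minimal sketch of the wrap-around behavior it supports might look like:

```python
# A month_range-style helper (illustrative; the real function may differ):
# list calendar months from start to end inclusive, wrapping across the
# year boundary when the season spans December into the next year.
def month_range_sketch(start, end):
    if start <= end:
        return list(range(start, end + 1))
    return list(range(start, 13)) + list(range(1, end + 1))

lead_months = month_range_sketch(10, 3)  # e.g. October through March
```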
b. Data Filtering and Preparation¶
- FNID Selection:
  - Country identifiers (FNIDs) are retrieved from fnids_info.hdf, which contains metadata (e.g., record_yield).
- Filtering:
  - FNIDs with insufficient historical records are excluded (only those satisfying record_yield > 12 are kept).
- Avoiding Redundant Computation:
  - The find_exist_case function filters out already processed experiments.
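A hedged sketch of the kind of check find_exist_case performs (assumed behavior: an experiment is skipped when its output file already exists; the real function in cape_tools_sim may use different naming or logic):

```python
import os

def filter_unprocessed(cases, dir_data_out):
    """Return only the cases whose expected output file is not yet present."""
    pending = []
    for case in cases:
        # Filename follows the data_out naming pattern shown earlier.
        fname = "cape_sim_{fnid}_{crop}_{season}_{model}_{exp}_{window}.npz".format(**case)
        if not os.path.exists(os.path.join(dir_data_out, fname)):
            pending.append(case)
    return pending

cases = [{"fnid": "AO2008A101", "crop": "Maize", "season": "Main",
          "model": "XGB", "exp": "YNN_ACUM_ALL", "window": "1004"}]
# With an empty/nonexistent output directory, every case is still pending.
pending = filter_unprocessed(cases, "/nonexistent_output_dir")
```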
c. Model Training and Forecast Execution¶
- Training Phase:
  - Forecasting models (e.g., XGBoost, linear regression) are trained on historical data.
  - Techniques include Bayesian hyperparameter optimization, cross-validation, and variance threshold filtering.
  - Trained models are saved as multiple .json files and one .npz file per FNID (adm1).
- Prediction Phase:
  - Hindcasts and forecasts are generated using trained models.
  - Both predictions and models are post-processed to restore the original data scale (using inverse transformations and trend adjustments).
d. Output Generation¶
- Saving Results:
  - Forecast results and lead-time model files are saved as compressed .npz files in the output directory.
  - Users need to update the output directory (dir_data_out) to match their workspace.
Cape Development Main Script (cape_development.py)¶
This script organizes the forecasting experiments based on CAPE settings.
Key Tasks:¶
- Read Experiment Settings:
  - Extract parameters from cape_setting.csv.
- Compute Lead Times:
  - Use the month_range function to compute forecast lead months.
- Set Up Experiment Cases:
  - Retrieve FNIDs from fnids_info.hdf.
  - Filter out cases that do not satisfy record_yield > 12.
- Avoid Redundant Computation:
  - Use find_exist_case to skip processed experiments.
- Execute Forecasting:
  - Iterate over each experiment and call cape_development_sim.py with the necessary parameters.
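The iteration step can be sketched with subprocess (the command layout is an assumption; a `python -c` stand-in replaces cape/cape_development_sim.py here so the example is self-contained and runnable):

```python
import subprocess
import sys

# Parameters for one experiment case, mirroring the documented CLI arguments.
params = {"fnid": "AO2008A101", "product_name": "Maize", "season_name": "Main",
          "model_name": "XGB", "exp_name": "YNN_ACUM_ALL", "window": "1004"}
args = ["--%s=%s" % (k, v) for k, v in params.items()]

# In the real workflow the command would be roughly:
#   cmd = [sys.executable, "cape/cape_development_sim.py", *args]
# Stand-in script: print how many CLI arguments were received.
cmd = [sys.executable, "-c", "import sys; print(len(sys.argv) - 1)", *args]
out = subprocess.run(cmd, capture_output=True, text=True)
```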
Prerequisites¶
- Python Version: Python 3.7 or higher.
- Required Packages: pandas and numpy (sys, os, json, subprocess, itertools, and time are standard-library modules).
- Custom Modules: cape_tools_sim (contains the month_range and find_exist_case functions).
Directory Structure¶
├── data_in/
│ ├── fnids_info.hdf # Input HDF file
├── data_out/ # Output directory
├── cape_setting.csv # Experiment settings file
├── cape/
│ ├── cape_development_sim.py # Secondary script for forecasting
│ └── cape_tools_sim.py # Helper functions module
Cape Development Simulation Script (cape_development_sim.py)¶
This script runs simulations and saves forecasting results based on CAPE model settings.
Usage:¶
python cape_development_sim.py --fnid=<FNID> --product_name=<product> --season_name=<season> --model_name=<model> --exp_name=<experiment> --window=<time_window>
Arguments:¶
| Argument | Type | Description | Required |
|---|---|---|---|
| --fnid | str | FNID identifier (adm1). | Yes |
| --product_name | str | Name of the product (e.g., Maize). | Yes |
| --season_name | str | Season name (e.g., Main, Summer). | Yes |
| --model_name | str | Model type (e.g., XGB, LR). | Yes |
| --exp_name | str | <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>. | Yes |
| --window | str | Forecast window identifier. | Yes |
What is exp_name? It follows the format <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>. For example, YNN_ACUM_ALL in cape_setting.csv means Yield prediction (Y), No transformation (N), No trend removal (N), and Accumulated predictors (ACUM_ALL).
- Crop Indicator:
  - Y: Refers to yield prediction.
  - P: Refers to production prediction.
- Transformation Method: specifies how the data is transformed:
  - Q: Quantile transformation.
  - N: No transformation (raw data).
  - 5: Five-year moving average transformation.
- Trend Method: defines how trends are analyzed or removed:
  - A: Automatic trend detection (chooses the best trend model).
  - L: Linear trend removal.
  - N: No trend adjustment.
- Predictor Configuration: specifies which predictors are used and how they are combined:
  - ACUM_ALL: Uses accumulated predictors over the lead time. Includes variables like prcp, pdry, etos, tavg, gdd, kdd, ndvi, and year.
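Decoding an exp_name into these components can be sketched as follows (an illustrative parser, not the CAPE code):

```python
# Parse <CropIndicator><TransformationMethod><TrendMethod>_<PredictorConfiguration>,
# e.g. "YNN_ACUM_ALL" -> yield, no transformation, no trend removal, ACUM_ALL.
def parse_exp_name(exp_name):
    head, _, predictors = exp_name.partition("_")
    return {
        "crop_indicator": head[0],       # Y (yield) or P (production)
        "transformation": head[1],       # Q, N, or 5
        "trend_method": head[2],         # A, L, or N
        "predictor_config": predictors,  # e.g. ACUM_ALL
    }

parsed = parse_exp_name("YNN_ACUM_ALL")
```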
Directory Paths¶
- Input Directory:
- Output Directory:
Workflow Summary¶
- Read and Parse Arguments:
  - Parse command-line arguments using argparse.
- Output Filename Generation:
  - Generate the output file name based on the parameters.
- Experiment Setting:
  - Use ExperimentSettingPocket to configure simulation parameters.
- Forecast Execution:
  - Run the simulation using cape_sim_build.
- Save Output:
  - Save models and results to the output directory.
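The first two steps can be sketched as follows; the filename template mirrors the data_out naming pattern listed in the directory structure, though the exact code in cape_development_sim.py may differ:

```python
import argparse

# Build a parser matching the documented CLI arguments.
parser = argparse.ArgumentParser()
for name in ["fnid", "product_name", "season_name", "model_name", "exp_name", "window"]:
    parser.add_argument("--" + name, type=str, required=True)

# Parse an example command line (normally taken from sys.argv).
args = parser.parse_args([
    "--fnid=AO2008A101", "--product_name=Maize", "--season_name=Main",
    "--model_name=XGB", "--exp_name=YNN_ACUM_ALL", "--window=1004",
])

# cape_sim_<fnid>_<crop_name>_<season>_<model>_<exp_name>_<window>.npz
out_name = "cape_sim_%s_%s_%s_%s_%s_%s.npz" % (
    args.fnid, args.product_name, args.season_name,
    args.model_name, args.exp_name, args.window,
)
```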
Cape Simulation Tools Script (cape_tools_sim.py)¶
This module includes helper functions for forecasting and data management.
Key Functions:¶
- CAPE_SIM_Reforecast: Reforecasts predictions for a given setup.
- cape_sim_build: Builds and executes the CAPE simulation workflow.
- load_input_data: Loads Earth observation and crop data.
- ExperimentSettingPocket: Configures experiment settings.
Supporting Functions:¶
- Data Preparation: CropDataControl, EODataControl
- Forecasting: GenerateSeriesLeadPredTable, cape_sim_build_prediction
- Utility: month_range, CheckLeadPred, CombSerialLead
Configuration Adjustments¶
- Directory Paths:
- Dependencies:
- File Validation:
  - Ensure necessary files exist in data_in.
Cape Development Script cape_development.py¶
cape_development.py is a Python script that orchestrates a series of forecasting experiments based on CAPE settings. It reads experiment configurations from a CSV file, computes forecast lead times, filters experiments based on available records, and then executes each experiment via a secondary script.
cape_development.py performs the following tasks:
- Reads Experiment Settings: reads the cape_setting.csv file to extract experiment parameters such as country, product, season, forecast start/end months, harvest month, and experiment name.
- Computes Lead Times: uses the custom function month_range to compute a list of forecast lead months and creates additional derived values (such as a countdown list and relative lead month adjustments).
- Sets Up Experiment Cases: reads additional information from an HDF file (fnids_info.hdf) to retrieve experiment identifiers (FNIDs), then filters out cases that do not meet the minimum record threshold (i.e., record_yield > 12).
- Avoids Redundant Computation: uses the custom function find_exist_case to check which experiments have already been processed and filters them out.
- Executes Forecasting Experiments: iterates over each remaining experiment and calls a secondary script (cape/cape_development_sim.py) using command-line arguments that pass the necessary parameters.
Prerequisites¶
- Python Version: Python 3.7 or higher is recommended.
- Required Python Packages: pandas and numpy, plus the standard libraries sys, os, json, subprocess, itertools, and time.
- Custom Modules: cape.cape_tools_sim — this module should include the functions month_range and find_exist_case.

Tip: Use the project virtual environment (venv) to manage dependencies.
Input Files and Settings¶
├── data_in/
│ ├── `fnids_info.hdf` # Input HDF file with FNID information
│ └── (other input files)
├── data_out/ # Output directory where results will be stored
├── cape_setting.csv # CSV file containing experiment configurations
├── cape/
│ ├── cape_development_sim.py # Script to run individual forecasting experiments
│ └── cape_tools_sim.py # Contains custom helper functions (e.g., month_range, find_exist_case)
cape_setting.csv¶
- Contains key configuration columns such as:
  - country
  - product
  - season
  - model_name
  - forecast_start_month
  - forecast_end_month
  - harvest_month
  - exp_name
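Loading these settings and guarding against duplicate experiment rows (a check the main script performs) can be sketched as follows (inline CSV text keeps the example self-contained):

```python
import io
import pandas as pd

# A two-row stand-in for cape_setting.csv with the documented columns.
csv_text = """country,product,season,model_name,forecast_start_month,forecast_end_month,harvest_month,exp_name
Angola,Maize,Main,XGB,10,4,5,YNN_ACUM_ALL
Kenya,Maize,Main,LR,3,8,9,YNN_ACUM_ALL
"""
settings = pd.read_csv(io.StringIO(csv_text))

# No duplicates are allowed on the key columns.
key_cols = ["country", "product", "season", "model_name"]
dups = settings.duplicated(subset=key_cols)
assert not dups.any(), "duplicate experiment rows in cape_setting.csv"
```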
Notes Based on Cape Notes¶
- Manual Configuration of Forecasting Windows: the documentation states that the forecasting window (i.e., the start and end months for the forecast) can be "added or modified in the CAPE settings file (cape_setting.csv)."
  - Open question: do users manually adjust these values?
- The planting_month and harvest_month values in cape_setting.csv are copied from the crop data file.
  - Open question: do users manually adjust these values?
fnids_info.hdf¶
- An HDF file located at /home/cape/data_in/fnids_info.hdf that contains FNIDs along with metadata such as country_code, product, season_name, and record_yield.
- Make sure the file is in the /home/cape/data_in/ directory and that it contains the necessary fields for filtering (e.g., record_yield > 12).
- Description: this HDF file contains FNIDs along with relevant metadata fields such as:
  - country_code, country, fnid, name, product, season_name, planting_month, harvest_month, record_area, record_production, record_yield
- Dataset Structure: the file is organized under the df group and includes several sub-datasets:
  - axis0: lists column names (fields).
  - block0_items: contains metadata fields (e.g., country_code, product).
  - block1_items: contains numerical fields (record_area, record_production, record_yield).
  - block1_values: stores the corresponding values for the fields in block1_items.
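The record_yield filter applied to this file can be sketched as follows; in practice the frame would come from something like pd.read_hdf("data_in/fnids_info.hdf", "df"), but an in-memory frame with the same fields is used here so the example is self-contained:

```python
import pandas as pd

# Stand-in for the contents of fnids_info.hdf (values are illustrative).
fnids = pd.DataFrame({
    "fnid": ["AO2008A101", "KE2013A201", "SO2014A202"],
    "country": ["Angola", "Kenya", "Somalia"],
    "product": ["Maize", "Maize", "Sorghum"],
    "season_name": ["Main", "Main", "Deyr"],
    "record_yield": [31, 10, 21],
})

# Keep FNIDs with more than 12 years of yield records, as in cape_development.py.
eligible = fnids[fnids["record_yield"] > 12]
```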
Output Directory (data_out/)¶
- The results of the experiments are saved here.
- Adjustment: new users need to change this path to their output directory.
Configuration Adjustments¶
When replicating or modifying the experiment, new users need to adjust the following parts of the code:
- Directory Paths:
  - Input Directory:
  - Output Directory:
- CSV Experiment Settings:
  - Ensure that the values in cape_setting.csv (like forecast_start_month, forecast_end_month, and harvest_month) correctly reflect your experiment design.
  - Verify that no duplicates exist for the key columns: country, product, season, and model_name.
- Filtering Criteria in FNID Selection:
  - The code filters out FNIDs with insufficient records (record_yield > 12 is required).
  - Adjustment Needed: modify the threshold if your data requires a different minimum record count.
- Forecasting Command Construction:
  - The script constructs a command to run cape_development_sim.py.
  - Adjustment Needed: if you change the parameters or the name/location of the forecasting script, update this command accordingly.
- Custom Helper Functions:
  - The functions month_range and find_exist_case are used to process dates and check for existing results.
  - Adjustment Needed: if you wish to change how lead months or duplicate experiments are handled, edit these functions in the cape/cape_tools_sim.py module.
Running the Script¶
- Activate Your Virtual Environment (if applicable):
Cape Development Simulation Script cape_development_sim.py¶
This script (cape_development_sim.py) is designed to run simulations and save forecasting results based on the CAPE model settings.
Usage¶
Run the script with the required arguments:
python cape_development_sim.py --fnid=<FNID> --product_name=<product> --season_name=<season> --model_name=<model> --exp_name=<experiment> --window=<time_window>
Example Command:¶
python cape_development_sim.py --fnid=AO2008A101 --product_name=Maize --season_name=Main --model_name=XGB --exp_name=YNN_ACUM_ALL --window=1004
Arguments¶
| Argument | Type | Description | Required |
|---|---|---|---|
| --fnid | str | The FNID identifier. | Yes |
| --product_name | str | Name of the product (e.g., Maize). | Yes |
| --season_name | str | Season name (e.g., Main). | Yes |
| --model_name | str | Model to be used (e.g., XGB or LR). | Yes |
| --exp_name | str | Experiment name. | Yes |
| --window | str | Forecast window identifier. | Yes |
| --note | str | Optional note for the run. | No |
Directory Paths¶
- Input Directory: ensure that the input data files are present in this directory or adjust the path if necessary (e.g., /home/cape/CAPE/data_in/).
- Output Directory: this path should be updated for each user to point to their dedicated output directory.
Workflow Overview¶
- Read and Parse Arguments: the script parses command-line arguments using the argparse library.
- Output Filename Generation: an output file name is generated based on the parameters.
- Experiment Setting: the ExperimentSettingPocket class is used to configure the simulation parameters.
- Forecasting Execution: the script runs the simulation through the cape_sim_build function, which generates forecasting results.
- Save Output: the prediction models and results are saved to the output directory:
  - Models (.json or .pkl) for each lead time.
  - Compressed results (.npz file).
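The save step can be sketched as follows (filenames follow the data_out pattern shown in the directory structure; the real script's serialization, e.g. XGBoost's own save_model for .json or joblib for .pkl, may differ):

```python
import json
import os
import tempfile
import numpy as np

# Write to a throwaway directory; the real script uses dir_data_out.
out_dir = tempfile.mkdtemp()
stem = "cape_sim_AO2008A101_Maize_Main_XGB_YNN_ACUM_ALL_1004"

# One model file per lead time (lead-time labels here are hypothetical).
for lead in ["L05", "L04", "L03"]:
    with open(os.path.join(out_dir, "%s_%s.json" % (stem, lead)), "w") as f:
        json.dump({"lead": lead, "params": {"n_estimators": 100}}, f)

# One compressed .npz of hindcast/forecast results (placeholder arrays).
np.savez_compressed(os.path.join(out_dir, stem + ".npz"),
                    hindcast=np.zeros(3), forecast=np.zeros(1))

saved = sorted(os.listdir(out_dir))
```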
Output Files¶
- Model Files: depending on the model type, the script saves either JSON or PKL files.
- Compressed Results: forecast results are saved as a compressed .npz file.
What Needs Adjustment¶
- Directory Paths:
  - Adjust dir_data_in and dir_data_out based on your file locations.
- Dependencies:
  - Ensure the required Python libraries are installed.
- File Existence:
  - Verify that necessary files and directories exist (e.g., the data_in directory and data files).
- Error Handling:
  - You may want to add error handling to manage missing files or invalid arguments.
Adjustments¶
Modify paths like so:
Dependencies¶
- Python 3.x
- Libraries: numpy, pandas, joblib, argparse
Cape Development Simulation Tools Script cape_tools_sim.py¶
This module implements the CAPE simulation tools workflow and includes functions for forecasting crop yields, managing data, and generating predictions. Below is an explanation of key functions, workflows, and required adjustments.
Key Functions¶
1. CAPE_SIM_Reforecast¶
- Purpose: Reforecasts predictions for a given experiment setup.
- Parameters:
  - rp: Dictionary containing input parameters (e.g., fnid, product_name).
  - dir_data_in: Path to the input data directory.
  - dir_data_out: Path to the output data directory.
- Output:
  - Returns reforecasted and merged prediction results.
2. cape_sim_build¶
- Purpose: Builds and executes the CAPE simulation.
- Parameters:
  - bp: Dictionary containing simulation settings.
  - dir_data_in: Input data directory.
- Workflow:
  - Load input data using load_input_data.
  - Generate forecast time tables and run predictions.
  - Restore data to its original scale.
3. load_input_data¶
- Purpose: Loads Earth observation and crop data for simulations.
- Parameters:
  - ubp: Dictionary containing crop and indicator data settings.
  - dir_data_in: Input directory path.
- Output:
  - Returns loaded crop data (box), time series data (y), and formatted data (df).
4. ExperimentSettingPocket¶
- Purpose: Generates settings for an experiment based on the exp_name and window.
- Parameters:
  - exp_name: Experiment name (e.g., YNN_ACUM_ALL).
  - window: Forecast window (e.g., 1004).
- Output:
  - Dictionary containing the indicator name, transformation method, trending method, and lead predictors.
Important Functions Overview¶
Data Preparation Functions¶
- CropDataControl: Handles crop data processing, including outlier removal, trend analysis, and data transformation.
- EODataControl: Controls the aggregation and resampling of Earth observation data.
Forecasting Functions¶
- GenerateSeriesLeadPredTable: Generates a table of lead-time predictors for forecasting.
- cape_sim_build_prediction: Executes the forecasting process using regression models.
Supporting Functions¶
- month_range: Creates a list of months within a given range, supporting year wrap-around.
- CheckLeadPred: Validates that the lead predictors meet the expected format and constraints.
- CombSerialLead: Generates serial lead combinations for time series forecasting.
Workflow Summary¶
- Initial Setup and Parameter Validation:
  - Read and validate input parameters from the experiment settings (ExperimentSettingPocket).
- Data Loading and Preparation:
  - Load crop and Earth observation data using load_input_data.
  - Process data for lead-time predictors and time windows.
- Forecasting and Prediction:
  - Use regression models (e.g., XGBRegressor, LinearRegression) to forecast crop yields.
  - Perform cross-validation (LeaveOneGroupOut, TimeSeriesSplit) and hyperparameter tuning (HyperparameterTuning).
- Output Generation:
  - Save prediction models and forecast results to the output directory (dir_data_out).
Dependencies¶
- Python 3.x
- Libraries: numpy, pandas, geopandas, scipy, statsmodels, sklearn, xgboost, skopt
Directory Paths¶
- Input Directory: ensure the following files are in the input directory (dir_data_in):
  - Crop data files (e.g., <fnid>_crop_<product>_<season>.hdf).
  - Earth observation data (e.g., <fnid>_pred.hdf).
- Output Directory: update the dir_data_out path to reflect the user's dedicated output directory.
Adjustments¶
- Directory Paths: modify dir_data_in and dir_data_out as needed.
- Dependencies: ensure the required Python libraries are installed.
- File Validation: add checks to ensure necessary files and directories exist.
Sample Run¶
Run the script with appropriate parameters:
python cape_development_sim.py --fnid=AO2008A101 --product_name=Maize --season_name=Main --model_name=XGB --exp_name=YNN_ACUM_ALL --window=1004
This will initiate the simulation and save results in the specified output directory.
6. Forecasting and Visualization Phase¶
This phase relies on five scripts:
- cape_forecasting.py [main script]
- create_input_data.py [called within cape_forecasting.py]
- generate_viewer_sim.py [called within cape_forecasting.py]
- cape_tools_sim.py [called within cape_forecasting.py]
- tools.py [called within cape_forecasting.py]
cape_forecasting.py
Configuration and Setup:
- Reads the experiment settings from a CSV file (cape_setting.csv).
- Checks for duplicate entries in key columns to ensure data integrity.
- Computes the lead months based on the forecast start and end months and adjusts them relative to the harvest month (including handling year wrap-around).
Directory and File Setup:
- Specifies directories for input data, processed output data, and viewer outputs.
- Defines file paths for processed data, crop data, and the required shapefile.
Forecasting Workflow:
- Uses a safe function wrapper to call:
  - create_input_data: Prepares the necessary input data from the specified files and settings.
  - generate_viewer_sim: Executes the forecasting process and generates simulation data for the viewer.
  - (Note: the visualization step using generate_graphics is currently deprecated.)
Logging and Timing:
- Logs the start and end times, along with key process steps, to both a log file and the console.
- Provides a runtime summary when the process completes.
7. Reproducibility Overview¶
Environment Setup¶
- Access the Host Location: ensure you have access to pith:/home/chc-prod/. This is required to use the CAPE environment.
- Setting Up the Python Environment:
  - Using Conda: use the provided environment.yml file to create a dedicated Conda environment.
  - Alternative Virtual Environment: if you prefer to work with a virtual environment, generate a requirements.txt file based on the environment.yml file and install the packages.
Data Files and Directories¶
- Shapefile and Crop Data: the shapefile (gscd_shape.gpkg) and crop data (gscd_data_240213.csv) are obtained from FDW. These files are used in all phases of the CAPE workflow (from preprocessing to forecasting).
  - If you are starting from the development phase, locate these files in:
    - ./data2/cropdata/gscd_data_240213.csv
    - ./data2/cropdata/gscd_shape.gpkg
  - These files are based on the latest acquisition data from February 13th, 2024.
- Preprocessing Data: you do not need to create directories to store preprocessing data if you are using the existing files in this directory. However, if you are reprocessing data for a new country, you may replicate the data_in directory to store the adm_cropland_area.hdf generated during preprocessing. Otherwise, this file can be found in the cape/data_in directory as part of the data_processed and data_stream directories and is aggregated again if reproducibility starts from the preprocessing phase.
  - Important: if you are replicating preprocessing, please replicate these directories so as not to overwrite actual data.
- Development Phase: during the development phase, you will rely on these files to initiate development. You will need to create a data_out directory that will contain the model outputs for the newly added country.
  - The process takes approximately 4-5 hours.
- Forecasting Phase: for the forecasting phase, you need to duplicate the viewer directory.
- Directory Settings in the Code: update directories and filenames as follows (adjust paths as needed):
# Set directories and filenames
dir_data_in = "./cape/data_in/" # simulation output need to change
dir_data_out = "./cape/CAPE/data_nmathlouthi/data_out/"
dir_viewer = "./cape/CAPE/data_nmathlouthi/viewer"
fn_data_processed = "./cape/data2/data_processed/output/data_product_day_all.hdf" # only if starting from preprocessing stage, otherwise use the existing file in data2/data_processed/output/data_product_day_all.hdf
fn_cropdata = "./cape/data2/cropdata/gscd_data_240213.csv" # you may use the existing file
fn_shapefile = "./cape/data2/cropdata/gscd_shape.gpkg" # you may use the existing file
Updating Scripts for a New Country¶
For all scripts, directory adjustments are required. Additionally, to add a new country to CAPE, you must update the main() function in the following scripts:
- cape_preprocessing.py
- cape_development.py
- cape_forecasting.py
Example Update for Country Filtering:
Inside the main() function, after loading cape_setting.csv, add:
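The original snippet is not reproduced in this document; a plausible filter, using the column names from cape_setting.csv, might look like the following (inline CSV text keeps the example self-contained):

```python
import io
import pandas as pd

# Hypothetical example of restricting the experiment table to one new country;
# variable and column names follow cape_setting.csv, not the original snippet.
csv_text = """country,product,season,model_name
Angola,Maize,Main,XGB
Malawi,Maize,Main,XGB
"""
df_setting = pd.read_csv(io.StringIO(csv_text))

new_country = "Malawi"
df_setting = df_setting[df_setting["country"] == new_country].reset_index(drop=True)
```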