CAPE Replication Project - Complete File Documentation¶
Table of Contents¶
- Project Overview
- Directory Structure
- Configuration Files
- Processing Scripts
- Development Scripts
- Forecasting Scripts
- Analysis and Visualization Scripts
- Environment and Dependencies
- Usage Instructions
Project Overview¶
The CAPE (Crop Assessment and Production Estimation) Replication Project is a comprehensive framework for crop yield forecasting and analysis. The project implements a complete workflow from data preprocessing to forecasting and visualization, with a focus on reproducibility and modularity.
Key Components:¶
- Data Processing: Raw data streaming, preprocessing, and aggregation
- Model Development: Training and validation of forecasting models
- Forecasting: Generation of crop yield predictions
- Analysis: Comparison and validation of results
- Visualization: Interactive plots and maps
Directory Structure¶
cape_replication_project/
├── config/                  # Configuration files
├── docs/                    # Documentation
├── scripts/                 # Main script directories
│   ├── processing/          # Data processing scripts
│   ├── development/         # Model development scripts
│   ├── forecasting/         # Forecasting scripts
│   └── analysis_and_viz/    # Analysis and visualization
├── capevenv/                # Virtual environment
└── environment.yml          # Conda environment specification
Configuration Files¶
config/config.json¶
Purpose: Central configuration file containing all file paths and directory locations.
Key Configuration Parameters:
- dir_data_in: Input data directory path
- dir_data_out: Output data directory path
- dir_viewer: Viewer output directory path
- fn_data_processed: Processed data file path
- fn_cropdata: Crop data CSV file path
- fn_shapefile: Geographic shapefile path
- cape_setting_file: CAPE settings CSV file path
- fnids_info: FNID information HDF file path
- fn_viewer_csv: Viewer CSV output file path
Usage: All scripts reference this file for consistent path management across the project.
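As a minimal sketch of how a script might read this file, the helper below loads the JSON and fails early when a documented key is absent. The function name load_config and the choice to raise KeyError are assumptions for illustration; the project's own loading code is not shown in this documentation.

```python
import json

# Keys listed above for config/config.json.
EXPECTED_KEYS = [
    "dir_data_in", "dir_data_out", "dir_viewer", "fn_data_processed",
    "fn_cropdata", "fn_shapefile", "cape_setting_file", "fnids_info",
    "fn_viewer_csv",
]


def load_config(path="config/config.json"):
    """Load the JSON config and report any documented key that is missing.

    load_config is a hypothetical helper, not a function from the project.
    """
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in EXPECTED_KEYS if key not in config]
    if missing:
        raise KeyError("config is missing keys: " + ", ".join(missing))
    return config
```

Checking keys at load time means a misnamed entry surfaces immediately rather than deep inside a later processing step.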
config/.config.json.swp¶
Purpose: Vim swap file (temporary file created during editing).
Processing Scripts¶
Main Processing Scripts¶
scripts/processing/cape_preprocessing.py¶
Purpose: Main orchestrator for the preprocessing workflow.
Key Features:
- Step-by-step execution: Runs preprocessing in 3 sequential steps
- Progress tracking: Uses .done files to track completion
- Resumable workflow: Can restart from any step using the --start-from argument
- Comprehensive logging: Logs all operations to file and console
Command Line Arguments:
- --start-from or -s: Choose the starting step (1=stream, 2=prep, 3=agg)
Workflow Steps:
1. Data Streaming (data_stream.py): Fetches and streams raw data
2. Data Preprocessing (data_preprocessing.py): Cleans and transforms data
3. Data Aggregation (data_aggregation.py): Combines and aggregates data
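The combination of .done markers and --start-from can be sketched as follows. This is an illustration under assumptions, not the orchestrator's actual code: the marker naming copies the 3_data_aggregation.done file found in scripts/processing/, run_step is a caller-supplied stand-in for launching a step, and skipping a step whose marker already exists is a guess at the intended behavior.

```python
import argparse
from pathlib import Path

# Step order as documented above.
STEPS = ["data_stream", "data_preprocessing", "data_aggregation"]


def marker_path(workdir, number, name):
    """Path of the completion marker, e.g. 3_data_aggregation.done."""
    return Path(workdir) / f"{number}_{name}.done"


def run_steps(workdir, start_from, run_step):
    """Run steps numbered start_from and later that have no marker yet.

    run_step is a hypothetical callable taking the step name.
    Returns the names of the steps that were actually executed.
    """
    executed = []
    for number, name in enumerate(STEPS, start=1):
        marker = marker_path(workdir, number, name)
        if number < start_from or marker.exists():
            continue
        run_step(name)
        marker.touch()  # written only after the step returns without error
        executed.append(name)
    return executed


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Resumable preprocessing sketch")
    parser.add_argument("--start-from", "-s", type=int, choices=[1, 2, 3],
                        default=1, help="1=stream, 2=prep, 3=agg")
    return parser.parse_args(argv)
```

Because the marker is created only after a step finishes, an interrupted run leaves no marker behind and the step is repeated on the next invocation.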
scripts/processing/data_stream.py¶
Purpose: Handles data streaming from external sources.
Key Functions:
- Downloads raw data from various sources
- Manages data streaming pipelines
- Saves streamed data to the ./data_stream directory
- Handles different data formats and sources
scripts/processing/data_preprocessing.py¶
Purpose: Cleans and transforms raw data.
Key Functions:
- Handles missing values and outliers
- Applies data transformations and normalization
- Performs spatial/temporal adjustments
- Saves processed data to the ./data_processed directory
scripts/processing/data_aggregation.py¶
Purpose: Aggregates processed data into unified formats.
Key Functions:
- Computes spatial and temporal averages
- Combines multiple datasets
- Creates unified output files (CSV, HDF formats)
- Saves aggregated data to the ./data_processed/output directory
Supporting Processing Files¶
scripts/processing/tools.py¶
Purpose: Utility functions for data processing.
Key Functions:
- Data validation utilities
- File handling helpers
- Common data transformation functions
scripts/processing/3_data_aggregation.done¶
Purpose: Completion marker file for data aggregation step.
scripts/processing/__init__.py¶
Purpose: Python package initialization file.
Preprocessing Subdirectory¶
scripts/processing/preprocessing/¶
Purpose: Contains specialized preprocessing scripts for different data types.
Key Files:
- atmp_fldas.py: FLDAS atmospheric data preprocessing
- eta_ssebop_v6.py: SSEBop v6 evapotranspiration data preprocessing
- Additional specialized preprocessing modules
Development Scripts¶
Main Development Scripts¶
scripts/development/cape_development.py¶
Purpose: Main orchestrator for model development and training.
Key Features:
- Configuration-driven: Reads settings from cape_setting.csv
- Experiment management: Handles multiple forecasting experiments
- Lead time computation: Calculates forecast lead months and adjustments
- Duplicate prevention: Avoids redundant computation using find_exist_case
- Subprocess execution: Runs experiments sequentially, each through a subprocess call
Command Line Arguments:
- -f or --force: Re-run all experiments, ignoring existing outputs
Key Workflow:
1. Loads experiment settings from CSV
2. Computes lead time matrices
3. Filters FNIDs based on record requirements (more than 12 records)
4. Executes forecasting experiments via cape_development_sim.py
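The record-count filter in the workflow above can be illustrated with the standard library alone. The sketch assumes each crop record is a dict with an "fnid" field; that field name and the function name are hypothetical, and the project itself may well use a different data structure.

```python
from collections import Counter


def fnids_with_enough_records(records, min_records=12):
    """Return the FNIDs that have strictly more than min_records rows, sorted.

    records is an iterable of dicts carrying an "fnid" key (an assumed layout).
    """
    counts = Counter(row["fnid"] for row in records)
    return sorted(fnid for fnid, n in counts.items() if n > min_records)
```

With 13 rows for one unit and 12 for another, only the first unit passes, which matches the ">12 records" wording.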
scripts/development/cape_development_sim.py¶
Purpose: Individual experiment execution script.
Key Features:
- Parameter-driven: Accepts command-line arguments for experiment parameters
- Model training: Trains forecasting models (XGBoost, Linear Regression)
- Cross-validation: Implements time series cross-validation
- Hyperparameter optimization: Uses Bayesian optimization
- Output generation: Saves models and predictions
Command Line Arguments:
- --fnid: FNID identifier
- --product_name: Product name (e.g., Maize)
- --season_name: Season name (e.g., Main)
- --model_name: Model type (XGB, LR)
- --exp_name: Experiment name (e.g., YNN_ACUM_ALL)
- --window: Forecast window identifier
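A possible shape for this command-line interface, together with the way an orchestrator could assemble the matching subprocess command, is sketched below. The flag names come from the list above; marking every flag as required, restricting --model_name to XGB and LR, and the build_command helper are assumptions made for the example.

```python
import argparse
import sys


def build_parser():
    """Argument parser mirroring the documented flags (types are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the experiment CLI")
    parser.add_argument("--fnid", required=True, help="FNID identifier")
    parser.add_argument("--product_name", required=True, help="e.g. Maize")
    parser.add_argument("--season_name", required=True, help="e.g. Main")
    parser.add_argument("--model_name", required=True, choices=["XGB", "LR"])
    parser.add_argument("--exp_name", required=True, help="e.g. YNN_ACUM_ALL")
    parser.add_argument("--window", required=True, help="forecast window identifier")
    return parser


def build_command(script, **params):
    """Build an argv list that could be passed to subprocess.run.

    build_command is a hypothetical helper, not part of the project.
    """
    command = [sys.executable, script]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command
```

Passing the command as a list rather than a single shell string avoids quoting problems when a parameter value contains spaces.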
scripts/development/cape_tools_sim.py¶
Purpose: Core simulation tools and utilities.
Key Functions:
- CAPE_SIM_Reforecast: Reforecast predictions for given setup
- cape_sim_build: Main simulation workflow builder
- load_input_data: Load Earth observation and crop data
- ExperimentSettingPocket: Configure experiment settings
- CropDataControl: Crop data processing and control
- EODataControl: Earth observation data control
- GenerateSeriesLeadPredTable: Generate lead-time predictor tables
- cape_sim_build_prediction: Execute forecasting process
Supporting Functions:
- month_range: Create month ranges with year wrap-around
- CheckLeadPred: Validate lead predictors
- CombSerialLead: Generate serial lead combinations
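To make "month ranges with year wrap-around" concrete, here is one way such a range could be computed. The signature of the project's month_range is not documented, so the function below uses a different name and an assumed inclusive start/end interface.

```python
def wrapped_month_range(start_month, end_month):
    """Months from start_month to end_month inclusive, wrapping past December.

    A stand-in illustration; not the project's month_range implementation.
    """
    if not (1 <= start_month <= 12 and 1 <= end_month <= 12):
        raise ValueError("months must be in 1..12")
    length = (end_month - start_month) % 12 + 1
    return [(start_month - 1 + offset) % 12 + 1 for offset in range(length)]


print(wrapped_month_range(10, 2))  # [10, 11, 12, 1, 2]
print(wrapped_month_range(3, 5))   # [3, 4, 5]
```

The modulo arithmetic handles a season that starts in October and ends in February without any special casing for the year boundary.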
scripts/development/cape_dev.py¶
Purpose: Development utilities and testing scripts.
Key Features:
- Development environment setup
- Testing utilities
- Debugging tools
Forecasting Scripts¶
Main Forecasting Scripts¶
scripts/forecasting/cape_forecasting.py¶
Purpose: Main orchestrator for forecasting and visualization workflow.
Key Features:
- Configuration integration: Uses centralized config.json
- Safe execution: Implements error handling with safe_function_call
- Data preparation: Calls create_input_data for data setup
- Viewer generation: Calls generate_viewer_sim for output creation
- Comprehensive logging: Detailed logging of all operations
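The safe-execution idea can be pictured as a thin wrapper that logs a failure and returns a fallback instead of stopping the whole run. The signature of the project's safe_function_call is not documented here, so the sketch uses a separate name, safe_call, and an assumed default parameter.

```python
import logging

logger = logging.getLogger("cape_forecasting_sketch")


def safe_call(func, *args, default=None, **kwargs):
    """Call func; on any exception, log the traceback and return default.

    safe_call is a hypothetical stand-in for safe_function_call.
    """
    try:
        return func(*args, **kwargs)
    except Exception:
        logger.exception("call to %s failed", getattr(func, "__name__", repr(func)))
        return default
```

For instance, safe_call(int, "12") returns 12, while safe_call(int, "abc", default=-1) logs the ValueError and returns -1.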
Workflow Steps:
1. Load configuration and CAPE settings
2. Compute lead matrices and adjustments
3. Filter for specific countries/products (currently Somalia and Maize/Sorghum)
4. Create input data using create_input_data
5. Wait for prediction files to be available
6. Generate viewer CSV files using generate_viewer_sim
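Step 5, waiting for prediction files, is commonly implemented as a polling loop with a timeout. The sketch below shows that pattern; the function name, the timeout and polling defaults, and the decision to return the still-missing paths are all assumptions rather than the script's documented behavior.

```python
import time
from pathlib import Path


def wait_for_files(paths, timeout_s=600.0, poll_s=5.0):
    """Poll until every path exists or the timeout passes.

    Returns the list of paths still missing (empty when all were found).
    """
    deadline = time.monotonic() + timeout_s
    missing = [Path(p) for p in paths]
    while True:
        missing = [p for p in missing if not p.exists()]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll_s)
```

Returning the missing paths lets the caller decide whether to abort or to generate viewer files only for the units whose predictions are ready.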
scripts/forecasting/create_input_data.py¶
Purpose: Prepares input data for forecasting.
Key Functions:
- Loads processed data from HDF files
- Prepares crop data from CSV files
- Integrates geographic shapefile data
- Creates unified input datasets for forecasting
scripts/forecasting/generate_viewer_sim.py¶
Purpose: Generates simulation data for viewer applications.
Key Functions:
- Processes forecasting results
- Creates viewer-compatible CSV files
- Handles data formatting and validation
- Generates simulation validation data
Supporting Forecasting Files¶
scripts/forecasting/cape_common.py¶
Purpose: Common utilities and functions for forecasting.
Key Functions:
- Shared data processing functions
- Common validation utilities
- Reusable forecasting components
scripts/forecasting/cape_tools_sim.py¶
Purpose: Forecasting-specific simulation tools.
Key Functions:
- Forecasting-specific data handling
- Model prediction utilities
- Result processing functions
scripts/forecasting/tools.py¶
Purpose: General utility functions for forecasting.
scripts/forecasting/requirements.txt¶
Purpose: Python package requirements for forecasting module.
Analysis and Visualization Scripts¶
Cape Results Analysis¶
scripts/analysis_and_viz/cape_results_analysis/forecast_comparison.py¶
Purpose: Compares different forecasting models and methods.
Key Features:
- Model performance comparison
- Statistical analysis of forecasts
- Visualization of comparison results
scripts/analysis_and_viz/cape_results_analysis/model_forecast_analysis.py¶
Purpose: Analyzes model forecasting performance.
Key Features:
- Model accuracy assessment
- Error analysis
- Performance metrics calculation
HVSTAT CAPE Comparison Analysis¶
scripts/analysis_and_viz/hvstat_cape_comp_analysis/hvstat_cape_analysis_README.md¶
Purpose: Documentation for HVSTAT-CAPE comparison analysis.
Key Content:
- Analysis methodology
- Data sources
- Expected outputs
- Usage instructions
scripts/analysis_and_viz/hvstat_cape_comp_analysis/data_processing.py¶
Purpose: Data processing utilities for comparison analysis.
scripts/analysis_and_viz/hvstat_cape_comp_analysis/plot_yield_comparison.py¶
Purpose: Generates yield comparison plots.
Key Features:
- Interactive plotting
- Statistical visualization
- Comparison charts
scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_analysis.py¶
Purpose: Analyzes yield differences between methods.
scripts/analysis_and_viz/hvstat_cape_comp_analysis/yield_diff_map_animation.py¶
Purpose: Creates animated maps of yield differences.
Key Features:
- Geographic visualization
- Temporal animation
- Interactive maps
scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_alignment.py¶
Purpose: Aligns data by year for comparison.
scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_fnid_alignment.py¶
Purpose: Aligns data by year and FNID.
scripts/analysis_and_viz/hvstat_cape_comp_analysis/year_prod_alignment.py¶
Purpose: Aligns production data by year.
scripts/analysis_and_viz/hvstat_cape_comp_analysis/requirements.txt¶
Purpose: Python package requirements for analysis module.
Analysis Outputs¶
scripts/analysis_and_viz/hvstat_cape_comp_analysis/results/¶
Purpose: Directory containing analysis results.
Key Files:
- year_prod_table.csv: Yearly production comparison table
- year_table.csv: Yearly summary table
scripts/analysis_and_viz/hvstat_cape_comp_analysis/plots/¶
Purpose: Directory containing generated plots and visualizations.
Key Subdirectories:
- individual_plots/: Individual country/product plots
- map_frames/: Geographic map frames
- plotly_product_plots/: Interactive Plotly visualizations
- product_plots/: Static product comparison plots
Environment and Dependencies¶
scripts/environment.yml¶
Purpose: Conda environment specification file.
Key Dependencies:
- Core Scientific Computing: numpy, pandas, scipy, scikit-learn
- Geospatial: geopandas, gdal, cartopy, folium
- Machine Learning: xgboost, scikit-optimize
- Visualization: matplotlib, plotly, bokeh
- Data Handling: h5py, netcdf4, fiona
- Development: jupyter, ipython
Environment Name: chafs_b
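A quick way to confirm that the activated environment provides these packages is to look up each import name without importing it. The sketch below is a convenience check, not part of the project. Several conda package names differ from their Python import names, and the mapping covers the ones in the dependency list above.

```python
import importlib.util

# Conda package name -> Python import name, where the two differ.
IMPORT_NAMES = {
    "scikit-learn": "sklearn",
    "scikit-optimize": "skopt",
    "gdal": "osgeo",
    "netcdf4": "netCDF4",
}


def missing_packages(packages):
    """Return the packages whose import name cannot be found."""
    missing = []
    for package in packages:
        module = IMPORT_NAMES.get(package, package)
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing
```

Calling missing_packages(["numpy", "pandas", "gdal", "xgboost"]) inside the environment should return an empty list when everything is installed.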
capevenv/¶
Purpose: Virtual environment directory.
Contents:
- bin/: Executable files and activation scripts
- lib64/: Python libraries
- pyvenv.cfg: Virtual environment configuration
- share/: Shared resources
Usage Instructions¶
Setting Up the Environment¶
- Create Conda Environment: Create the environment from environment.yml (environment name: chafs_b).
- Configure Paths: Edit config/config.json to set appropriate paths for your system.
- Prepare Data: Ensure required data files are in the specified input directories.
Running the Workflow¶
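- Preprocessing: Run scripts/processing/cape_preprocessing.py.
- Development: Run scripts/development/cape_development.py.
- Forecasting: Run scripts/forecasting/cape_forecasting.py.
- Analysis: Run the scripts under scripts/analysis_and_viz/.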
Configuration Management¶
- Central Configuration: All paths are managed in config/config.json
- Experiment Settings: Use cape_setting.csv for experiment configuration
- Environment: Use environment.yml for dependency management
Output Management¶
- Processing Outputs: Stored in data_processed/ directories
- Development Outputs: Model files and predictions in data_out/
- Forecasting Outputs: Viewer files and visualizations in viewer/
- Analysis Outputs: Plots and results in analysis subdirectories
File Naming Conventions¶
Data Files¶
- Crop Data: gscd_data_YYYYMMDD.csv (with timestamp)
- Shapefiles: gscd_shape.gpkg
- Processed Data: data_product_day_all.hdf
- Model Outputs: cape_sim_<fnid>_<crop>_<season>_<model>_<exp>_<window>_L<lead>.{pkl,json,npz}
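The naming patterns above can be expressed as two small helpers: one that builds a model output name from its parts, and one that picks the most recent crop data file by its YYYYMMDD stamp. Both helpers are hypothetical, and the lead value is inserted as given because the documentation does not say whether it is zero-padded.

```python
import re


def model_output_name(fnid, crop, season, model, exp, window, lead, ext):
    """Build a name following the documented cape_sim_... pattern."""
    if ext not in {"pkl", "json", "npz"}:
        raise ValueError(f"unexpected extension: {ext}")
    return f"cape_sim_{fnid}_{crop}_{season}_{model}_{exp}_{window}_L{lead}.{ext}"


def latest_cropdata(filenames):
    """Return the gscd_data_YYYYMMDD.csv name with the latest stamp, or None."""
    pattern = re.compile(r"gscd_data_(\d{8})\.csv")
    dated = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            # YYYYMMDD strings sort chronologically as plain text.
            dated.append((match.group(1), name))
    return max(dated)[1] if dated else None
```

Only a builder is shown for model outputs, not a parser: experiment names such as YNN_ACUM_ALL contain underscores themselves, so splitting a file name on "_" would not recover the fields unambiguously.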
Script Files¶
- Main Scripts: cape_<phase>.py (e.g., cape_preprocessing.py)
- Utility Scripts: tools.py, common.py
- Specialized Scripts: Descriptive names (e.g., data_stream.py)
Configuration Files¶
- Main Config: config.json
- Environment: environment.yml
- Settings: cape_setting.csv
Troubleshooting¶
Common Issues¶
- Path Configuration: Ensure all paths in config/config.json are correct
- Dependencies: Verify conda environment is properly activated
- Data Availability: Check that required input files exist
- Permissions: Ensure write permissions for output directories
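Several of these checks can be automated before a long run starts. The sketch below takes the input paths and output directories explicitly, because which config.json entries count as inputs and which as outputs is not spelled out in this documentation; the function name and message wording are illustrative.

```python
import os
from pathlib import Path


def check_paths(input_paths, output_dirs):
    """Return a list of problems: missing inputs, missing or read-only outputs."""
    problems = []
    for path in input_paths:
        if not Path(path).exists():
            problems.append(f"missing input: {path}")
    for directory in output_dirs:
        if not Path(directory).is_dir():
            problems.append(f"missing output directory: {directory}")
        elif not os.access(directory, os.W_OK):
            problems.append(f"output directory not writable: {directory}")
    return problems
```

An empty list means the basic path and permission checks passed; anything else can be printed or logged before the workflow begins.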
Log Files¶
- Processing: scripts/processing/logs/log_preprocessing.txt
- Forecasting: scripts/logs/log_forecasting.txt
- Development: Check individual experiment outputs
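Logging to both a file and the console, as the scripts are described as doing, can be set up with the standard logging module. This is a generic sketch; the logger name, message format, and helper name are assumptions, and the project's own logging configuration may differ.

```python
import logging
import sys


def make_logger(name, logfile):
    """Return a logger that writes INFO and above to logfile and to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        handlers = (
            logging.FileHandler(logfile, encoding="utf-8"),
            logging.StreamHandler(sys.stdout),
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

The handler guard matters when an orchestrator calls the setup more than once; without it every message would be written several times.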
Error Handling¶
- Safe Execution: Most scripts include error handling and logging
- Resumable Workflows: Use --start-from and .done files for recovery
- Validation: Scripts include input validation and error checking
Contributing¶
Code Organization¶
- Modular Design: Each phase has its own directory and scripts
- Configuration-Driven: Centralized configuration management
- Documentation: Comprehensive documentation for all components
Best Practices¶
- Logging: All scripts include comprehensive logging
- Error Handling: Robust error handling and recovery mechanisms
- Configuration: Use configuration files for path and parameter management
- Testing: Include validation and testing utilities
This documentation provides a comprehensive overview of all files in the CAPE Replication Project. Each component is designed to be modular, configurable, and well-documented for ease of use and reproducibility.