C6 Carbon Model v2 — Summary & High-Level Design
Date: 2026-05-07
Pairs with: methodology-inputs-comparison.md, c6-carbon-model.md (gap analysis), c6-carbon-model-v2-plan.md (initial farm-grain plan, superseded by this doc on the H3 grain question).
This is the consolidated takeaway from a deep design conversation. Read this if you only have time for one.
1. The big shift
From: vendor PDF → parser → FarmMetric. Vendor was the source of truth; provenance was a string.
To: raster sources → zonal-stats runner → RasterAggregate → ProjectH3Metric → FarmMetric. Source of truth is the public raster + your code; provenance is an FK chain auditor can click through.
What changes:
- No more Canopy / C2050 / ClearBlue PDF ingestion
- All values come from per-pixel satellite rasters that C6 itself aggregates via zonal statistics
- Every farm metric traces to a reproducible computation: same inputs + same code → same number, every time
- The audit story becomes “click through to see the raster and the code”
2. Four-layer data model
| Layer | Lives in | Cardinality | What it stores |
|---|---|---|---|
| L0 raster pixels | GCS / AWS / Earth Engine (COGs) | trillions | the actual datasets — never enter Postgres |
| L0 source registry | raster_data_sources (exists) | ~30 rows | catalog of MapBiomas, HGB, GEDI, SoilGrids, PRODES, Sentinel, etc., with version + storage tier |
| L1 raster aggregate | raster_aggregates (NEW) | medium | one row per (geometry × source × vintage × method) — the audit anchor |
| L2 per-cell metric | project_h3_metric (exists, extend) | high | canonical-unit values per H3 cell, FK to L1 |
| L2 per-farm metric | farm_metrics (exists, extend) | medium | derived from H3 cells via spatial join + weighted aggregation |
| L3 methodology output | vintage_ledger + project_computation (exist) | low | engine outputs: gross/net/credits/revenue per project per vintage |
Pixels live in GCS forever. Postgres only stores aggregates of pixels — the slices we’ve computed and want to keep.
3. H3 is the primary spatial grain
H3 cells are immutable hexagonal grid cells with deterministic indexing. They’re the unit of carbon accounting.
Resolution 8 (~74 ha per cell) is the default, because:
- Matches HGB biomass native resolution (~74 pixels per cell — statistically stable mean)
- ~822 MapBiomas pixels per cell — generous for class composition
- Most cells are single-stratum (~70%); mixed cells get composition stored explicitly
- VM0048 risk maps (~100 m native) average cleanly to res 8
- Auditable at human scale — a 74 ha hexagon is something a person can see
Two exceptions:
- Smallholder farms (<500 ha): promote to res 9 (~10.5 ha cells) so individual fields are visible
- SOC (SoilGrids 250 m native) and GEDI L4B (1 km native): store at res 7, inherit down to res 8 via H3 parent/child relationships
H3 has the nested property: every res-9 cell sits inside a res-8 parent. Mixing resolutions is well-defined.
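A minimal sketch of the res-7 → res-8 inheritance described above. The `parent_of` mapping and cell ids here are hypothetical stand-ins for what the `h3` package's parent/child functions would return in real code:

```python
# Sketch: res-7 → res-8 metric inheritance via H3 parent/child links.
# `parent_of` stands in for the h3 library's cell_to_parent lookup;
# cell ids below are illustrative, not real H3 indexes.

def inherit_from_parent(res8_cells, parent_of, parent_values):
    """Each res-8 cell takes its res-7 parent's value (e.g. SoilGrids SOC)."""
    return {cell: parent_values[parent_of[cell]] for cell in res8_cells}

# Two res-8 children of one res-7 parent share the parent's SOC density.
parent_of = {"88a", "88b"} and {"88a": "87x", "88b": "87x"} or {}
parent_of = {"88a": "87x", "88b": "87x"}
soc_res7 = {"87x": 73.4}  # tC/ha, from the res-7 RasterAggregate
soc_res8 = inherit_from_parent(["88a", "88b"], parent_of, soc_res7)
```

Because every res-8 cell has exactly one res-7 parent, the inherited layer is well-defined with no overlap or gaps.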
4. Why H3 over farm-grain rollup
H3 cells are better than farm polygons for the redesign:
- Eternal: cell `882ac4f9ffffff` always covers the same hexagon. No `geometry_version` invalidation needed. Farm polygons get edited; H3 cells don't.
- Per-stratum natural: at 74 ha, ~70% of cells are single-stratum. Stratum composition is stored per cell as a normal column instead of needing JSONB filter slices.
- Spatially queryable: “show me forest cells with AGB > 200” is a simple SQL filter.
- Audit-aligned: UI rows = data rows. Each row a user sees on screen is one cell, with one provenance chain.
FarmMetric becomes a derived rollup of H3 cells inside the farm polygon. Engine reads farm-level summary numbers; per-cell detail backs every claim.
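The derived rollup can be sketched as an area-weighted aggregation. The input shape here is an assumption: each entry is one cell from the spatial join, carrying its metric value and hectares of overlap with the farm polygon:

```python
# Sketch: FarmMetric as an area-weighted rollup of the H3 cells inside a
# farm polygon. `cells` is the assumed output of the spatial join; field
# names are illustrative.

def farm_rollup(cells):
    """Area-weighted mean of a per-cell density (e.g. AGB Mg/ha)."""
    total_ha = sum(c["overlap_ha"] for c in cells)
    if total_ha == 0:
        return None
    weighted = sum(c["agb_density_mg_ha"] * c["overlap_ha"] for c in cells)
    return weighted / total_ha

cells = [
    {"agb_density_mg_ha": 185.0, "overlap_ha": 73.6},  # cell fully inside
    {"agb_density_mg_ha": 120.0, "overlap_ha": 36.8},  # cell half inside
]
farm_agb = farm_rollup(cells)  # weighted toward the larger overlap
```

Extensive metrics (hectares deforested, tCO₂e) would sum instead of averaging; only densities need the weighting.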
5. Stratification — composition, not polygons
Strata are not stored as polygons. Each H3 cell has:
- `stratum_dominant` (enum): the majority pixel class — drives UI color and filters
- `stratum_composition` (JSONB): normalized class histogram — `{primary_forest: 0.78, pasture: 0.20, water: 0.02}`
- `forest_cover_pct` (numeric): derived from composition
- `agb_density_mg_ha`: AGB masked to forest pixels only within the cell
Engine math per cell respects composition:
```
cell_baseline_tco2e =
    agb_forest_only        (185 Mg/ha)
  × forest_cover_pct/100   (0.78)
  × cell_area_ha           (73.6)
  × carbon_fraction        (0.46)
  × (44/12)                (CO₂ stoichiometry)
  × baseline_defor_rate    (varies — see §7)
```
Mixed cells get credited only for their forest fraction. No double-counting of pasture as forest.
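The per-cell math above as a runnable sketch. The constants are the worked example's; the 0.0082 rate is the C6 internal baseline from §7 expressed as a fraction:

```python
# Sketch of the per-cell baseline computation. The forest-fraction factor is
# what prevents pasture pixels from being credited as forest.

def cell_baseline_tco2e(agb_mg_ha, forest_cover_pct, cell_area_ha,
                        baseline_defor_rate, carbon_fraction=0.46):
    """Annual baseline emissions for one H3 cell, forest fraction only."""
    return (agb_mg_ha
            * forest_cover_pct / 100
            * cell_area_ha
            * carbon_fraction
            * (44 / 12)              # C → CO₂ stoichiometry
            * baseline_defor_rate)

t = cell_baseline_tco2e(185.0, 78.0, 73.6, 0.0082)  # ≈ 146.9 tCO₂e/yr
```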
6. The raster_aggregates table — the audit anchor
One row per scientific computation. Carries:
- What dataset: `raster_source` FK + version + asset URLs + asset hash
- What geometry: kind (`h3_cell`/`farm`/`project`/`leakage_belt`) + id + `geometry_version`
- What slice: `stratum_filter` JSONB (rarely used at H3 grain — composition columns cover most needs)
- What time: vintage + period_start/end for windowed metrics
- What number: aggregate_method (mean/sum/histogram/p95/std), value, unit, pixel_count
- Uncertainty: `uncertainty_pct_ci95` from the raster's own uncertainty layer (HGB SE, ESA CCI Biomass uncertainty band, SoilGrids SD)
- Reproducibility: `code_version` (git sha), `runner_key`, `inputs_fingerprint` (SHA of geometry + asset_hash + code_sha + params)
Same inputs_fingerprint = same number. Different fingerprint = re-run was triggered. Auditor can verify by hash.
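The fingerprint idea can be sketched as a SHA-256 over a canonical JSON encoding of everything that determines the output. Field names here are illustrative, not the production schema:

```python
# Sketch of inputs_fingerprint: hash a canonical (sorted-keys, no-whitespace)
# JSON encoding so the same logical inputs always produce the same digest.
import hashlib
import json

def inputs_fingerprint(geometry_id, asset_hash, code_sha, params):
    canonical = json.dumps(
        {"geometry": geometry_id, "asset": asset_hash,
         "code": code_sha, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

a = inputs_fingerprint("882ac4f9ffffff", "sha256:ab12", "deadbeef", {"method": "mean"})
b = inputs_fingerprint("882ac4f9ffffff", "sha256:ab12", "deadbeef", {"method": "mean"})
c = inputs_fingerprint("882ac4f9ffffff", "sha256:ab12", "deadbeef", {"method": "p95"})
# a == b: same inputs, same number. a != c: param change triggers a re-run.
```

The canonical encoding matters: without `sort_keys` and fixed separators, two semantically identical param dicts could hash differently.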
7. Multi-baseline switching — baselines are per-cell layers
Old: “the baseline rate is 0.82%/yr” — a project-wide scalar.
New: each baseline method produces a per-cell rate layer, with a baseline_source discriminator. Switching baselines means querying a different layer slice.
| Method | What it produces | Storage |
|---|---|---|
| C6 internal historical | scalar from PRODES inside project polygon | per-cell, uniform |
| PRODES national reference | scalar from state/biome trend | per-cell, uniform |
| Verra VM0048 jurisdictional | per-cell rate from Verra risk raster + JAD allocation | per-cell, varies spatially |
| ART TREES national NFMS | scalar from national forest monitoring system | per-cell, uniform per jurisdiction |
| VM0047 dynamic control | matched-control ΔSI, no rate concept | per-cell, layer code differs |
Engine math becomes uniform across methods:
```
SUM(agb × forest_pct/100 × cell_area × cf × 44/12 × baseline_rate)
WHERE baseline_rate.baseline_source = :active
```
Switching = changing one parameter. Same SQL, different number.
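An in-memory sketch of the same idea: each cell carries one rate per `baseline_source`, and the engine sums under whichever source is active. Cell values are the worked example's; the structure is illustrative:

```python
# Sketch: baseline switching as a single parameter change. Each cell stores
# a rate per baseline_source; the engine math is identical across sources.

CF, CO2 = 0.46, 44 / 12

def project_baseline_tco2e(cells, active_source):
    return sum(
        c["agb"] * c["forest_pct"] / 100 * c["area_ha"] * CF * CO2
        * c["baseline_rates"][active_source]
        for c in cells
    )

cells = [{
    "agb": 185.0, "forest_pct": 78.0, "area_ha": 73.6,
    "baseline_rates": {"c6_internal_historical": 0.0082,
                       "verra_vm0048_mt": 0.0241},
}]
low = project_baseline_tco2e(cells, "c6_internal_historical")
high = project_baseline_tco2e(cells, "verra_vm0048_mt")
# Same math, different number — only the source argument changed.
```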
UI gain: side-by-side scenario table (“under C6 internal: $2.66M; under PRODES national: $3.30M; under Verra VM0048: $2.26M”) + per-cell baseline-rate heatmap + per-cell delta map between methods.
8. Per-pool uncertainty — wired through
Uncertainty is captured at zonal-stats time from each raster’s own uncertainty layer:
- HGB ships AGB SE (band 2) → `uncertainty_pct_ci95` on the AGB RasterAggregate
- ESA CCI Biomass ships per-pixel uncertainty → same
- SoilGrids ships per-pixel SD → same
- MapBiomas / PRODES ship no uncertainty → `uncertainty_method = 'none'`, engine applies a conservative default deduction
Engine helper `combined_uncertainty_pct(metrics)` propagates per-pool CIs into a project total per VMD0017 (Verra) or TREES guidance.
New vintage_ledger row inserted between buffer and issuable: “Uncertainty deduction (X% CI → Y% deduction)”. Today’s ledger is gross → leakage → buffer → issuable. Becomes gross → leakage → buffer → uncertainty deduction → issuable.
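One common form of that propagation — root-sum-square for independent pools — sketched below. This is an assumption about the helper's shape, not its actual implementation; the deduction table applied to the combined CI afterwards is methodology-specific:

```python
# Sketch: propagate per-pool 95% CIs into a combined project-level CI,
# treating pools as independent (root-sum-square of absolute errors).
import math

def combined_uncertainty_pct(pools):
    """pools: [(value_tco2e, uncertainty_pct_ci95), ...] → combined % CI."""
    total = sum(v for v, _ in pools)
    if total == 0:
        return 0.0
    abs_err = math.sqrt(sum((v * u / 100) ** 2 for v, u in pools))
    return abs_err / total * 100

# Two equal, independent pools at ±20% combine to ±20%/√2 ≈ ±14.1%:
# uncertainty shrinks when independent pools are summed.
ci = combined_uncertainty_pct([(1000.0, 20.0), (1000.0, 20.0)])
```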
9. Pool & methodology configuration
Pools (AGB, BGB, DW, L, SOC, HWP) are not separate types in storage. They’re metric_codes like everything else. “Pool-ness” is a methodology-config role, not a data-model concept.
Methodology defines its own input list:
```python
# app/methodologies/vm0048.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="forest_cover_pct", ...),
    InputSpec(code="baseline_rate_pct_yr", source="active_baseline", ...),
    # DW, L, HWP not in this list — VM0048 lets us exclude as de minimis
)
```
A logging project methodology adds DW + HWP to its list:
```python
# app/methodologies/vm0007_ifm.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="dw_density", ...),
    InputSpec(code="hwp_carbon", ...),
    ...
)
```
Engine reads only what the methodology asks for. Adding a new pool = config + new runner, no schema change.
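A sketch of how the engine side of this pattern might look. `InputSpec` as a plain dataclass and a resolver that pulls only the declared codes — the names and shape here are illustrative, not the production API:

```python
# Sketch: methodology inputs as declarative config. The engine resolves only
# the codes a methodology declares and fails loudly on anything missing.
from dataclasses import dataclass

@dataclass(frozen=True)
class InputSpec:
    code: str
    source: str = "h3_metric"  # where the engine should look this code up

def resolve_inputs(required, cell_metrics):
    """Pull only the declared metric codes from a cell's metric dict."""
    missing = [s.code for s in required if s.code not in cell_metrics]
    if missing:
        raise KeyError(f"methodology inputs missing: {missing}")
    return {s.code: cell_metrics[s.code] for s in required}

REQUIRED_INPUTS = (InputSpec("agb_forest_only"), InputSpec("forest_cover_pct"))
inputs = resolve_inputs(
    REQUIRED_INPUTS,
    {"agb_forest_only": 185.0, "forest_cover_pct": 78.0,
     "soc_density_tc_ha": 73.4},  # extra metrics are simply ignored
)
```

Adding a pool then really is config: a new `InputSpec` in one methodology's tuple, plus a runner that writes the metric code.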
10. Storage shape — example of one cell
For H3 cell 882ac4f9ffffff in project C6-BR-2406, vintage 2026:
Five raster_aggregates rows (one per source × layer):
- HGB AGB mean (forest-only mask) → 185 Mg/ha, CI ±18.6%
- MapBiomas LULC histogram → `{primary_forest: 0.78, pasture: 0.20, ...}`
- MapBiomas Alerta defor → 0.60 ha
- SoilGrids SOC (inherited from res-7 parent) → 73.4 tC/ha, CI ±41%
- MapBiomas Fogo cumulative burned → 4.2% of cell
~10 project_h3_metric rows (canonical-unit, FK to aggregates):
- agb_density_mg_ha = 185.0 (FK → HGB aggregate)
- agb_uncertainty_pct_ci95 = 22.4 (FK → HGB aggregate)
- forest_cover_pct = 78.0 (FK → MapBiomas histogram)
- stratum_dominant = “primary_forest” (derived)
- stratum_composition = JSONB
- defor_ha_yr = 0.60 (FK → MapBiomas Alerta aggregate)
- soc_density_tc_ha = 73.4 (FK → SoilGrids aggregate)
- baseline_rate_pct_yr [c6_internal_historical] = 0.82 (uniform scalar)
- baseline_rate_pct_yr [verra_vm0048_mt] = 2.41 (FK → Verra risk map aggregate)
- flag = “ok” (derived)
Two L3 outputs (engine math per cell, also project_h3_metric rows):
- baseline_tco2e [under active baseline] = 499 t
- issuable_tco2e [post all deductions] = 389 t
The UI you’ve already built (per-cell table with H3 INDEX / AGB / DEFOR / BASELINE / ISSUABLE / FLAG columns) renders these rows directly.
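The two "(derived)" fields in the example above come straight from the stored composition histogram. A minimal sketch, assuming `primary_forest` is the only forest class in this cell:

```python
# Sketch: derive stratum_dominant and forest_cover_pct from the stored
# stratum_composition histogram. The forest_classes set is an assumption;
# the real mapping would come from the stratum class taxonomy (§14).

def derive_stratum_fields(composition, forest_classes=("primary_forest",)):
    dominant = max(composition, key=composition.get)
    forest_cover_pct = 100 * sum(composition.get(c, 0.0) for c in forest_classes)
    return dominant, forest_cover_pct

dominant, cover = derive_stratum_fields(
    {"primary_forest": 0.78, "pasture": 0.20, "water": 0.02})
# → ("primary_forest", 78.0)
```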
11. What changes from today
New tables
- `raster_aggregates` — the audit anchor. Per-(geometry × source × vintage × method) row.
- `project_baselines` — per-project metadata about each baseline scenario (source, label, is_active, computed_at, project_total summary, FKs to L1 inputs).
- `farm_metric_h3_source` (optional) — join table linking FarmMetric rollups to constituent H3 cells.
Modified tables
- `project_h3_metric` — gains `raster_aggregate_id` FK + `baseline_source` discriminator.
- `farm_metrics` — gains `raster_aggregate_id` FK; eventually drops `parser_version`, `source_document_id`, `carbon_stock_source_id` (vendor-era columns).
Deprecated
- Vendor parsers: `app/services/vendor_report/parsers/canopy.py`, `c2050.py`, `clearblue.py`
- `carbon_measurements` table (already marked deprecated)
Engine changes
- Engine reads CI from L1 aggregate via FK; applies methodology uncertainty deduction
- Engine takes `(project, vintage, active_baseline_id)`; routes to per-cell baseline_rate layer for that source
- Engine math becomes mostly methodology-agnostic — methodology choice = which layer codes + which baseline_source to read
- New `vintage_ledger` row for uncertainty deduction
UI gains
- Click any cell row → audit chain to RasterAggregate → RasterDataSource → reproducible computation
- Side-by-side baseline scenario comparison
- Per-cell baseline rate heatmap (especially valuable for VM0048 risk visualization)
- Per-cell delta map between two baselines
- Project-type-from-satellite classifier (“this farm looks like REDD-AUD; this one looks like ARR”)
12. Migration phases
Each phase shippable independently.
| Phase | Scope | Estimate |
|---|---|---|
| 0 | raster_aggregates table; runners write to it (nothing reads yet) | 1 wk |
| 1 | project_h3_metric.raster_aggregate_id + farm_metrics.raster_aggregate_id FKs; runners populate going forward | 1 wk |
| 2 | Engine reads CI from L1; new vintage_ledger uncertainty-deduction row | 1–2 wk |
| 3 | project_baselines table + scenario UI; baseline_source discriminator on per-cell layer; engine takes baseline arg | 2 wk |
| 4 | Per-stratum composition columns on cells; engine multiplies by forest_cover_pct for accurate mixed-cell math | 1 wk |
| 5 | Smallholder res-9 promotion; SOC + GEDI res-7 inheritance | 1 wk |
| 6 | Retire vendor parsers + drop legacy columns; finalize carbon_measurements removal | 1 wk |
Total: ~7–9 weeks with testing.
Recommended first move: don’t start with the migration. Pick the HGB biomass runner and do a vertical slice — write one RasterAggregate row + one ProjectH3Metric row, wire UI click-through, verify reproducibility — before migrating other runners.
13. Capabilities this unlocks
For audit conversations:
Auditor: “Where did this 78.4 tC/ha for farm 553 come from?” C6: clicks farm metric → H3 cells → RasterAggregate → HGB raster URL + hash + git sha + computed timestamp + carbon fraction source. Reproducible to 4 decimals.
For methodology migrations (VM0007 → VM0048):
User: “What’s our credit forecast under VM0048 vs VM0007 vs PRODES national?” UI: three project_baselines rows side-by-side. Toggle active. vintage_ledger recomputes. Both stay in audit trail.
For new project types (logging, ARR, soil carbon):
Marc: “We just signed an IFM project, need DW + HWP.” Dev: writes new runner using GEDI L4A + national logging records. Adds `dw_density` to lookup_metric_type, units registry, IFM methodology's REQUIRED_INPUTS. No schema migration. Done in 2 days.
For project-type discovery:
Onboarding wizard: shows new farm uploaded as KMZ. Satellite signals parse. UI suggests “Based on 89% forest cover + 0.6%/yr historical defor, this farm looks like a REDD-AUD candidate. VM0048 eligible: yes (Mato Grosso). Confirm?”
14. Open questions
- Stratum class taxonomy — fixed C6 enum + mapping tables from MapBiomas / PRODES / Hansen class IDs. Who owns the decision?
- Smallholder threshold — exact farm-area cutoff for res-9 promotion. 200 ha? 500 ha? Per project, per registry?
- Vendor retirement timeline — active projects relying on vendor numbers need parallel-run + numerical comparison before cutover.
- VM0048 ingestion path — direct from Verra (data licensing) or via partner (Everland, Wildlife Works, etc.)?
- ARR & Plan Vivo on roadmap? — affects whether to invest in V47 control-plot matching, livelihood/ecosystem indicators, etc.
- Plot data path — if C6 ever collects field plots (validation campaigns), separate `field_observation` table vs. shoehorn into RasterAggregate with `runner_key='field_plot'`?
- Histogram column — collapse the 8 separate `lulc_*_ha` per-class FarmMetric rows into a single histogram aggregate, or keep as-is for backward compat?
15. The principle
Pixels live in GCS forever. Postgres stores the aggregates we’ve asked for, with full provenance back to source rasters and code. Methodologies are configurations on top — pool lists, baseline sources, deduction rules — not separate data paths. Adding new methodologies, baseline methods, or carbon pools is a config + runner change, not a schema migration.
That’s the v2 model in one paragraph. Everything else is implementation detail.