C6 Carbon Model V2 Summary

C6 Carbon Model v2 — Summary & High-Level Design

Date: 2026-05-07

Pairs with: methodology-inputs-comparison.md, c6-carbon-model.md (gap analysis), c6-carbon-model-v2-plan.md (initial farm-grain plan, superseded by this doc on the H3 grain question).

This is the consolidated takeaway from a deep design conversation. Read this if you only have time for one.


1. The big shift

From: vendor PDF → parser → FarmMetric. Vendor was the source of truth; provenance was a string.

To: raster sources → zonal-stats runner → RasterAggregate → ProjectH3Metric → FarmMetric. The source of truth is the public raster plus your code; provenance is an FK chain an auditor can click through.

What changes:

  • No more Canopy / C2050 / ClearBlue PDF ingestion
  • All values come from per-pixel satellite rasters on which C6 runs zonal statistics itself
  • Every farm metric traces to a reproducible computation: same inputs + same code → same number, every time
  • The audit story becomes “click through to see the raster and the code”

2. Four-layer data model

| Layer | Lives in | Cardinality | What it stores |
| --- | --- | --- | --- |
| L0 raster pixels | GCS / AWS / Earth Engine (COGs) | trillions | the actual datasets; never enter Postgres |
| L0 source registry | raster_data_sources (exists) | ~30 rows | catalog of MapBiomas, HGB, GEDI, SoilGrids, PRODES, Sentinel, etc., with version + storage tier |
| L1 raster aggregate | raster_aggregates (NEW) | medium | one row per (geometry × source × vintage × method); the audit anchor |
| L2 per-cell metric | project_h3_metric (exists, extend) | high | canonical-unit values per H3 cell, FK to L1 |
| L2 per-farm metric | farm_metrics (exists, extend) | medium | derived from H3 cells via spatial join + weighted aggregation |
| L3 methodology output | vintage_ledger + project_computation (exist) | low | engine outputs: gross/net/credits/revenue per project per vintage |

Pixels live in GCS forever. Postgres only stores aggregates of pixels — the slices we’ve computed and want to keep.


3. H3 is the primary spatial grain

H3 cells are immutable hexagonal grid cells with deterministic indexing. They’re the unit of carbon accounting.

Resolution 8 (~74 ha per cell) is the default, because:

  • Matches HGB biomass native resolution (~74 pixels per cell — statistically stable mean)
  • ~822 MapBiomas pixels per cell — generous for class composition
  • Most cells are single-stratum (~70%); mixed cells get composition stored explicitly
  • VM0048 risk maps (~100 m native) average cleanly to res 8
  • Auditable at human scale — a 74 ha hexagon is something a person can see

Two exceptions:

  • Smallholder farms (<500 ha; exact cutoff still open, see §14): promote to res 9 (~10.5 ha cells) so individual fields are visible
  • SOC (SoilGrids 250 m native) and GEDI L4B (1 km native): store at res 7, inherit down to res 8 via H3 parent/child relationships

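H3 indexing is hierarchical: every res-9 cell has exactly one res-8 parent, so mixing resolutions is well-defined. Geometric containment is approximate (child hexagons do not subdivide the parent exactly), but the parent/child mapping itself is deterministic.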
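The resolution rules above can be sketched as a small lookup. This is a minimal illustration, not the C6 implementation: the function names, the source keys, and the 500 ha default are assumptions, and the child-to-parent mapping is passed in as plain data rather than computed with an H3 library.

```python
# Hypothetical helpers; names, source keys and the cutoff default are assumptions.
DEFAULT_RES = 8
SMALLHOLDER_RES = 9
COARSE_RES = 7
COARSE_SOURCES = frozenset({"soilgrids_soc", "gedi_l4b"})  # assumed source keys


def storage_resolution(source_key: str, farm_area_ha: float,
                       smallholder_cutoff_ha: float = 500.0) -> int:
    """Pick the H3 resolution at which a source is stored for a given farm."""
    if source_key in COARSE_SOURCES:
        # Coarse rasters are stored at res 7 and inherited downward.
        return COARSE_RES
    if farm_area_ha < smallholder_cutoff_ha:
        return SMALLHOLDER_RES
    return DEFAULT_RES


def inherit_down(parent_values: dict, child_to_parent: dict) -> dict:
    """Copy each parent's value onto its children (res 7 -> res 8)."""
    return {
        child: parent_values[parent]
        for child, parent in child_to_parent.items()
        if parent in parent_values
    }


soc_res8 = inherit_down(
    {"parent_a": 73.4},
    {"child_1": "parent_a", "child_2": "parent_a", "child_3": "parent_b"},
)
```

Children whose parent has no stored value are simply omitted here; whether the real pipeline should raise instead is a design choice left open.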


4. Why H3 over farm-grain rollup

H3 cells are better than farm polygons for the redesign:

  • Eternal: cell 882ac4f9ffffff always covers the same hexagon. No geometry_version invalidation needed. Farm polygons get edited; H3 cells don’t.
  • Per-stratum natural: at 74 ha, ~70% of cells are single-stratum. Stratum composition is stored per cell as a normal column instead of needing JSONB filter slices.
  • Spatially queryable: “show me forest cells with AGB > 200” is a simple SQL filter.
  • Audit-aligned: UI rows = data rows. Each row a user sees on screen is one cell, with one provenance chain.

FarmMetric becomes a derived rollup of H3 cells inside the farm polygon. Engine reads farm-level summary numbers; per-cell detail backs every claim.
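As a rough sketch of that rollup, a farm-level density can be computed as an area-weighted mean of the cells that intersect the farm polygon. The `CellOverlap` type and `farm_weighted_mean` function are hypothetical names, and the overlap areas are assumed to come from the spatial join.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CellOverlap:
    """One H3 cell's contribution to a farm (hypothetical structure)."""
    h3_index: str
    value: float       # per-cell metric, e.g. agb_density_mg_ha
    overlap_ha: float  # area of the cell that falls inside the farm polygon


def farm_weighted_mean(cells: Iterable[CellOverlap]) -> float:
    """Area-weighted mean of a per-cell density over the farm."""
    cells = list(cells)
    total_ha = sum(c.overlap_ha for c in cells)
    if total_ha <= 0:
        raise ValueError("farm has no overlapping cell area")
    return sum(c.value * c.overlap_ha for c in cells) / total_ha


farm_agb = farm_weighted_mean([
    CellOverlap("cell_a", 180.0, 73.6),  # cell fully inside the farm
    CellOverlap("cell_b", 200.0, 36.8),  # cell half inside the farm
])
```

The weights here are overlap area only. For a forest-only density such as agb_density_mg_ha, weighting by forest area within the overlap may be the better choice.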


5. Stratification — composition, not polygons

Strata are not stored as polygons. Each H3 cell has:

  • stratum_dominant (enum): the majority pixel class — drives UI color and filters
  • stratum_composition (JSONB): normalized class histogram — {primary_forest: 0.78, pasture: 0.20, water: 0.02}
  • forest_cover_pct (numeric): derived from composition
  • agb_density_mg_ha: AGB masked to forest pixels only within the cell

Engine math per cell respects composition:

cell_baseline_tco2e =
    agb_forest_only       (185 Mg/ha)
  × forest_cover_pct/100  (0.78)
  × cell_area_ha          (73.6)
  × carbon_fraction       (0.46)
  × (44/12)               (CO₂ stoichiometry)
  × baseline_defor_rate   (annual fraction; varies, see §7)

Mixed cells get credited only for their forest fraction. No double-counting of pasture as forest.
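The per-cell formula translates directly into a function. This sketch uses the illustrative values from the formula and assumes a baseline rate of 0.82 %/yr (the project-wide figure quoted in §7); the function and argument names are hypothetical.

```python
CO2_PER_C = 44.0 / 12.0  # molecular weight ratio CO2 : C


def cell_baseline_tco2e(agb_forest_only_mg_ha: float,
                        forest_cover_pct: float,
                        cell_area_ha: float,
                        carbon_fraction: float,
                        baseline_rate_pct_yr: float) -> float:
    """Annual baseline emissions for one cell, credited on its forest fraction only."""
    forest_area_ha = cell_area_ha * forest_cover_pct / 100.0
    carbon_stock_tc = agb_forest_only_mg_ha * forest_area_ha * carbon_fraction
    # The rate is stored as a percentage, so convert to a fraction here.
    return carbon_stock_tc * CO2_PER_C * baseline_rate_pct_yr / 100.0


example = cell_baseline_tco2e(
    agb_forest_only_mg_ha=185.0,
    forest_cover_pct=78.0,
    cell_area_ha=73.6,
    carbon_fraction=0.46,
    baseline_rate_pct_yr=0.82,  # assumed rate for illustration
)
```

Setting forest_cover_pct to 0 yields 0 for the cell, which is the "no pasture counted as forest" property in code form.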


6. The raster_aggregates table — the audit anchor

One row per scientific computation. Carries:

  • What dataset: raster_source FK + version + asset URLs + asset hash
  • What geometry: kind (h3_cell / farm / project / leakage_belt) + id + geometry_version
  • What slice: stratum_filter JSONB (rarely used at H3 grain — composition columns cover most needs)
  • What time: vintage + period_start/end for windowed metrics
  • What number: aggregate_method (mean/sum/histogram/p95/std), value, unit, pixel_count
  • Uncertainty: uncertainty_pct_ci95 from the raster’s own uncertainty layer (HGB SE, ESA CCI Biomass uncertainty band, SoilGrids SD)
  • Reproducibility: code_version (git sha), runner_key, inputs_fingerprint (SHA of geometry + asset_hash + code_sha + params)

Same inputs_fingerprint = same number. A different fingerprint means an input changed and a re-run was triggered. An auditor can verify by recomputing the hash.
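One way to build such a fingerprint is to hash a canonical JSON serialization of the inputs. This is a sketch under assumptions: the source only says "SHA", so SHA-256, the JSON canonicalization, and the function signature are illustrative choices.

```python
import hashlib
import json


def inputs_fingerprint(geometry_wkt: str, asset_hash: str,
                       code_sha: str, params: dict) -> str:
    """Deterministic hash over everything that influences the aggregate."""
    payload = json.dumps(
        {
            "geometry": geometry_wkt,
            "asset_hash": asset_hash,
            "code_sha": code_sha,
            "params": params,
        },
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


geom = "POLYGON((0 0,1 0,1 1,0 0))"  # placeholder geometry
fp_a = inputs_fingerprint(geom, "asset123", "abc1234", {"method": "mean", "mask": "forest"})
fp_b = inputs_fingerprint(geom, "asset123", "abc1234", {"mask": "forest", "method": "mean"})
fp_c = inputs_fingerprint(geom, "asset123", "def5678", {"method": "mean", "mask": "forest"})
```

Here fp_a and fp_b match because only the parameter ordering differs, while fp_c differs because the code version changed.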


7. Multi-baseline switching — baselines are per-cell layers

Old: “the baseline rate is 0.82%/yr” — a project-wide scalar.

New: each baseline method produces a per-cell rate layer, with a baseline_source discriminator. Switching baselines means querying a different layer slice.

| Method | What it produces | Storage |
| --- | --- | --- |
| C6 internal historical | scalar from PRODES inside project polygon | per-cell, uniform |
| PRODES national reference | scalar from state/biome trend | per-cell, uniform |
| Verra VM0048 jurisdictional | per-cell rate from Verra risk raster + JAD allocation | per-cell, varies spatially |
| ART TREES national NFMS | scalar from national forest monitoring system | per-cell, uniform per jurisdiction |
| VM0047 dynamic control | matched-control ΔSI, no rate concept | per-cell, layer code differs |

Engine math becomes uniform across methods:

SUM(agb × forest_pct/100 × cell_area × cf × 44/12 × baseline_rate)
WHERE baseline_rate.baseline_source = :active

Switching = changing one parameter. Same SQL, different number.
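The same idea in Python, as a sketch: each cell carries one rate per baseline_source, and the project total is the same sum with a different key. The `Cell` structure and function name are hypothetical; the source keys are the ones used in the §10 example, and rates are assumed to be stored as %/yr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Per-cell inputs (hypothetical structure for illustration)."""
    h3_index: str
    agb_mg_ha: float
    forest_pct: float
    area_ha: float
    baseline_rate_pct_yr: dict  # keyed by baseline_source


def project_baseline_tco2e(cells, active: str, carbon_fraction: float = 0.46) -> float:
    """Sum per-cell baseline emissions under the chosen baseline_source."""
    return sum(
        c.agb_mg_ha * c.forest_pct / 100.0 * c.area_ha
        * carbon_fraction * 44.0 / 12.0
        * c.baseline_rate_pct_yr[active] / 100.0
        for c in cells
    )


cells = [Cell("cell_a", 185.0, 78.0, 73.6,
              {"c6_internal_historical": 0.82, "verra_vm0048_mt": 2.41})]
total_c6 = project_baseline_tco2e(cells, "c6_internal_historical")
total_vm0048 = project_baseline_tco2e(cells, "verra_vm0048_mt")
```

Only the `active` argument changes between the two calls, which is what makes a side-by-side scenario table cheap to produce.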

UI gain: side-by-side scenario table (“under C6 internal: $2.66M; under PRODES national: $3.30M; under Verra VM0048: $2.26M”) + per-cell baseline-rate heatmap + per-cell delta map between methods.


8. Per-pool uncertainty — wired through

Uncertainty is captured at zonal-stats time from each raster’s own uncertainty layer:

  • HGB ships AGB SE (band 2) → uncertainty_pct_ci95 on the AGB RasterAggregate
  • ESA CCI Biomass ships per-pixel uncertainty → same
  • SoilGrids ships per-pixel SD → same
  • MapBiomas / PRODES ship no uncertainty → uncertainty_method = 'none', engine applies a conservative default deduction

Engine helper combined_uncertainty_pct(metrics) propagates per-pool CIs into a project total per VMD0017 (Verra) or TREES guidance.
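A sketch of what such a helper might compute, using the root-sum-of-squares rule commonly applied to summed quantities (the IPCC Approach 1 form). Whether VMD0017 or TREES prescribes exactly this rule should be checked against the guidance, and the signature below (a list of stock/uncertainty pairs) is an assumption rather than the real helper's interface.

```python
import math


def combined_uncertainty_pct(pools) -> float:
    """Combine per-pool CI95 half-widths (%) into one percentage of the total.

    pools: iterable of (stock_tco2e, uncertainty_pct_ci95) tuples.
    Assumes pool errors are independent.
    """
    pools = list(pools)
    total = sum(stock for stock, _ in pools)
    if total == 0:
        raise ValueError("total stock is zero; relative uncertainty undefined")
    # Convert each relative uncertainty to absolute terms, then add in quadrature.
    sum_sq = sum((stock * pct / 100.0) ** 2 for stock, pct in pools)
    return 100.0 * math.sqrt(sum_sq) / abs(total)


agb_and_soc = combined_uncertainty_pct([(100.0, 10.0), (100.0, 10.0)])
```

Two equal pools at ±10% combine to roughly ±7.1% of the total, because independent errors partially cancel.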

A new vintage_ledger row is inserted between buffer and issuable: “Uncertainty deduction (X% CI → Y% deduction)”. Today’s ledger runs gross → leakage → buffer → issuable; it becomes gross → leakage → buffer → uncertainty deduction → issuable.


9. Pool & methodology configuration

Pools (AGB, BGB, DW, L, SOC, HWP) are not separate types in storage. They’re metric_codes like everything else. “Pool-ness” is a methodology-config role, not a data-model concept.

Methodology defines its own input list:

# app/methodologies/vm0048.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="forest_cover_pct", ...),
    InputSpec(code="baseline_rate_pct_yr", source="active_baseline", ...),
    # DW, L, HWP not in this list — VM0048 lets us exclude as de minimis
)

A logging project methodology adds DW + HWP to its list:

# app/methodologies/vm0007_ifm.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="dw_density", ...),
    InputSpec(code="hwp_carbon", ...),
    ...
)

Engine reads only what the methodology asks for. Adding a new pool = config + new runner, no schema change.
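To make "engine reads only what the methodology asks for" concrete, here is a minimal sketch. The `InputSpec` fields beyond `code` and `source` are elided in the snippets above, so this dataclass is a reduced stand-in, and `select_inputs` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InputSpec:
    """Reduced stand-in for the real InputSpec (other fields unknown)."""
    code: str
    source: Optional[str] = None


VM0048_REQUIRED = (
    InputSpec(code="agb_forest_only"),
    InputSpec(code="forest_cover_pct"),
    InputSpec(code="baseline_rate_pct_yr", source="active_baseline"),
)


def select_inputs(required, available: dict) -> dict:
    """Return only the metrics the methodology declares; fail on gaps."""
    missing = [spec.code for spec in required if spec.code not in available]
    if missing:
        raise KeyError(f"missing required inputs: {missing}")
    return {spec.code: available[spec.code] for spec in required}


cell_metrics = {
    "agb_forest_only": 185.0,
    "forest_cover_pct": 78.0,
    "baseline_rate_pct_yr": 0.82,
    "dw_density": 12.0,  # present in storage, ignored by VM0048
}
engine_inputs = select_inputs(VM0048_REQUIRED, cell_metrics)
```

Extra metrics in storage are ignored, and a missing required metric fails loudly instead of silently defaulting.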


10. Storage shape — example of one cell

For H3 cell 882ac4f9ffffff in project C6-BR-2406, vintage 2026:

Five raster_aggregates rows (one per source × layer):

  • HGB AGB mean (forest-only mask) → 185 Mg/ha, CI ±18.6%
  • MapBiomas LULC histogram → {primary_forest: 0.78, pasture: 0.20, ...}
  • MapBiomas Alerta defor → 0.60 ha
  • SoilGrids SOC (inherited from res-7 parent) → 73.4 tC/ha, CI ±41%
  • MapBiomas Fogo cumulative burned → 4.2% of cell

~10 project_h3_metric rows (canonical-unit, FK to aggregates):

  • agb_density_mg_ha = 185.0 (FK → HGB aggregate)
  • agb_uncertainty_pct_ci95 = 22.4 (FK → HGB aggregate)
  • forest_cover_pct = 78.0 (FK → MapBiomas histogram)
  • stratum_dominant = “primary_forest” (derived)
  • stratum_composition = JSONB
  • defor_ha_yr = 0.60 (FK → MapBiomas Alerta aggregate)
  • soc_density_tc_ha = 73.4 (FK → SoilGrids aggregate)
  • baseline_rate_pct_yr [c6_internal_historical] = 0.82 (uniform scalar)
  • baseline_rate_pct_yr [verra_vm0048_mt] = 2.41 (FK → Verra risk map aggregate)
  • flag = “ok” (derived)

Two L3 outputs (engine math per cell, also project_h3_metric rows):

  • baseline_tco2e [under active baseline] = 499 t
  • issuable_tco2e [post all deductions] = 389 t

The UI you’ve already built (per-cell table with H3 INDEX / AGB / DEFOR / BASELINE / ISSUABLE / FLAG columns) renders these rows directly.


11. What changes from today

New tables

  • raster_aggregates — the audit anchor. Per-(geometry × source × vintage × method) row.
  • project_baselines — per-project metadata about each baseline scenario (source, label, is_active, computed_at, project_total summary, FKs to L1 inputs).
  • farm_metric_h3_source (optional) — join table linking FarmMetric rollups to constituent H3 cells.

Modified tables

  • project_h3_metric — gains raster_aggregate_id FK + baseline_source discriminator.
  • farm_metrics — gains raster_aggregate_id FK; eventually drops parser_version, source_document_id, carbon_stock_source_id (vendor-era columns).

Deprecated

  • Vendor parsers: app/services/vendor_report/parsers/canopy.py, c2050.py, clearblue.py
  • carbon_measurements table (already marked deprecated)

Engine changes

  • Engine reads CI from L1 aggregate via FK; applies methodology uncertainty deduction
  • Engine takes (project, vintage, active_baseline_id); routes to per-cell baseline_rate layer for that source
  • Engine math becomes mostly methodology-agnostic — methodology choice = which layer codes + which baseline_source to read
  • New vintage_ledger row for uncertainty deduction

UI gains

  • Click any cell row → audit chain to RasterAggregate → RasterDataSource → reproducible
  • Side-by-side baseline scenario comparison
  • Per-cell baseline rate heatmap (especially valuable for VM0048 risk visualization)
  • Per-cell delta map between two baselines
  • Project-type-from-satellite classifier (“this farm looks like REDD-AUD; this one looks like ARR”)

12. Migration phases

Each phase shippable independently.

| Phase | Scope | Estimate |
| --- | --- | --- |
| 0 | raster_aggregates table; runners write to it (nothing reads yet) | 1 wk |
| 1 | project_h3_metric.raster_aggregate_id + farm_metrics.raster_aggregate_id FKs; runners populate going forward | 1 wk |
| 2 | Engine reads CI from L1; new vintage_ledger uncertainty-deduction row | 1–2 wk |
| 3 | project_baselines table + scenario UI; baseline_source discriminator on per-cell layer; engine takes baseline arg | 2 wk |
| 4 | Per-stratum composition columns on cells; engine multiplies by forest_cover_pct for accurate mixed-cell math | 1 wk |
| 5 | Smallholder res-9 promotion; SOC + GEDI res-7 inheritance | 1 wk |
| 6 | Retire vendor parsers + drop legacy columns; finalize carbon_measurements removal | 1 wk |

Total: ~8–9 weeks with testing (the sum of the phase estimates above).

Recommended first move: don’t start with the migration. Pick the HGB biomass runner and do a vertical slice — write one RasterAggregate row + one ProjectH3Metric row, wire UI click-through, verify reproducibility — before migrating other runners.


13. Capabilities this unlocks

For audit conversations:

Auditor: “Where did this 78.4 tC/ha for farm 553 come from?”

C6: clicks farm metric → H3 cells → RasterAggregate → HGB raster URL + hash + git sha + computed timestamp + carbon fraction source. Reproducible to 4 decimals.

For methodology migrations (VM0007 → VM0048):

User: “What’s our credit forecast under VM0048 vs VM0007 vs PRODES national?”

UI: three project_baselines rows side by side. Toggle which one is active and vintage_ledger recomputes; every scenario stays in the audit trail.

For new project types (logging, ARR, soil carbon):

Marc: “We just signed an IFM project, need DW + HWP.”

Dev: writes a new runner using GEDI L4A + national logging records. Adds dw_density to lookup_metric_type, the units registry, and the IFM methodology’s REQUIRED_INPUTS. No schema migration. Done in 2 days.

For project-type discovery:

Onboarding wizard: a new farm is uploaded as KMZ and the satellite signals are parsed. The UI suggests: “Based on 89% forest cover + 0.6%/yr historical defor, this farm looks like a REDD-AUD candidate. VM0048 eligible: yes (Mato Grosso). Confirm?”


14. Open questions

  1. Stratum class taxonomy — fixed C6 enum + mapping tables from MapBiomas / PRODES / Hansen class IDs. Who owns the decision?
  2. Smallholder threshold — exact farm-area cutoff for res-9 promotion. 200 ha? 500 ha? Per project, per registry?
  3. Vendor retirement timeline — active projects relying on vendor numbers need parallel-run + numerical comparison before cutover.
  4. VM0048 ingestion path — direct from Verra (data licensing) or via partner (Everland, Wildlife Works, etc.)?
  5. ARR & Plan Vivo on roadmap? This affects whether to invest in VM0047 control-plot matching, livelihood/ecosystem indicators, etc.
  6. Plot data path — if C6 ever collects field plots (validation campaigns), separate field_observation table vs. shoehorn into RasterAggregate with runner_key='field_plot'?
  7. Histogram column — collapse the 8 separate lulc_*_ha per-class FarmMetric rows into a single histogram aggregate, or keep as-is for backward compat?

15. The principle

Pixels live in GCS forever. Postgres stores the aggregates we’ve asked for, with full provenance back to source rasters and code. Methodologies are configurations on top — pool lists, baseline sources, deduction rules — not separate data paths. Adding new methodologies, baseline methods, or carbon pools is a config + runner change, not a schema migration.

That’s the v2 model in one paragraph. Everything else is implementation detail.