C6 Carbon Model v2 — Summary & High-Level Design

Date: 2026-05-07

Pairs with: methodology-inputs-comparison.md, c6-carbon-model.md (gap analysis), c6-carbon-model-v2-plan.md (initial farm-grain plan, superseded by this doc on the H3 grain question).

This is the consolidated takeaway from a deep design conversation. Read this if you only have time for one.

1. The big shift

From: vendor PDF → parser → FarmMetric. Vendor was the source of truth; provenance was a string.

To: raster sources → zonal-stats runner → RasterAggregate → ProjectH3Metric → FarmMetric. Source of truth is the public raster + your code; provenance is an FK chain an auditor can click through.

What changes:

No more Canopy / C2050 / ClearBlue PDF ingestion
All values come from per-pixel satellite rasters that C6 runs zonal statistics on itself
Every farm metric traces to a reproducible computation: same inputs + same code → same number, every time
The audit story becomes “click through to see the raster and the code”

2. Four-layer data model

Layer	Lives in	Cardinality	What it stores
L0 raster pixels	GCS / AWS / Earth Engine (COGs)	trillions	the actual datasets — never enter Postgres
L0 source registry	`raster_data_sources` (exists)	~30 rows	catalog of MapBiomas, HGB, GEDI, SoilGrids, PRODES, Sentinel, etc., with version + storage tier
L1 raster aggregate	`raster_aggregates` (NEW)	medium	one row per (geometry × source × vintage × method) — the audit anchor
L2 per-cell metric	`project_h3_metric` (exists, extend)	high	canonical-unit values per H3 cell, FK to L1
L2 per-farm metric	`farm_metrics` (exists, extend)	medium	derived from H3 cells via spatial join + weighted aggregation
L3 methodology output	`vintage_ledger` + `project_computation` (exist)	low	engine outputs: gross/net/credits/revenue per project per vintage

Pixels live in GCS forever. Postgres only stores aggregates of pixels — the slices we’ve computed and want to keep.

3. H3 is the primary spatial grain

H3 cells are immutable hexagonal grid cells with deterministic indexing. They’re the unit of carbon accounting.

Resolution 8 (~74 ha per cell) is the default, because:

Matches HGB biomass native resolution (~74 pixels per cell — statistically stable mean)
~822 MapBiomas pixels per cell — generous for class composition
Most cells are single-stratum (~70%); mixed cells get composition stored explicitly
VM0048 risk maps (~100 m native) average cleanly to res 8
Auditable at human scale — a 74 ha hexagon is something a person can see
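The pixel counts quoted above can be sanity-checked with simple area arithmetic. This is a rough sketch that assumes square pixels, a 30 m MapBiomas grid, and the ~74 ha cell area from the text; the 100 m case is included only because it is the grid size that yields ~74 pixels per cell.

```python
# Rough pixel-count check for an H3 res-8 cell (~74 ha, figure taken from the text).
CELL_AREA_HA = 74.0
M2_PER_HA = 10_000


def pixels_per_cell(pixel_size_m: float, cell_area_ha: float = CELL_AREA_HA) -> float:
    """Approximate number of square pixels of the given size covering one cell."""
    return cell_area_ha * M2_PER_HA / (pixel_size_m ** 2)


mapbiomas_pixels = pixels_per_cell(30.0)   # assumed 30 m grid
hectare_pixels = pixels_per_cell(100.0)    # 100 m grid, i.e. 1 ha pixels
```

Real cells vary slightly in area with latitude, so these counts are averages rather than exact values.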

Two exceptions:

Smallholder farms (<500 ha, provisional; the exact cutoff is an open question in §14): promote to res 9 (~10.5 ha cells) so individual fields are visible
SOC (SoilGrids 250 m native) and GEDI L4B (1 km native): store at res 7, inherit down to res 8 via H3 parent/child relationships

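H3 is hierarchical: every res-9 cell has exactly one res-8 parent in the index, so mixing resolutions is well-defined. Geometric containment is only approximate (child hexagons can extend slightly past the parent's edges), so inheritance should follow the index relationship rather than polygon intersection.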
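The parent/child inheritance can be illustrated without any dependency, because a parent index is derived from the child index by bit manipulation alone. The sketch below follows the published H3 index layout (4-bit resolution field at bit 52, 3 bits per resolution digit); the helper names and the `res7_values` lookup are hypothetical, and production code would more likely call the `h3` library's own parent function.

```python
RES_OFFSET = 52                  # bit offset of the 4-bit resolution field
RES_MASK = 0xF << RES_OFFSET
MAX_RES = 15
DIGIT_BITS = 3                   # each resolution digit occupies 3 bits
UNUSED_DIGIT = 0b111             # value stored in digits finer than the cell's resolution


def get_resolution(cell: str) -> int:
    """Read the resolution from a hex-encoded H3 cell index."""
    return (int(cell, 16) & RES_MASK) >> RES_OFFSET


def cell_to_parent(cell: str, parent_res: int) -> str:
    """Return the ancestor of `cell` at `parent_res` (pure index arithmetic)."""
    h = int(cell, 16)
    child_res = get_resolution(cell)
    if not 0 <= parent_res <= child_res:
        raise ValueError("parent_res must be between 0 and the cell's resolution")
    # Overwrite the resolution field, then blank out the digits the parent does not use.
    h = (h & ~RES_MASK) | (parent_res << RES_OFFSET)
    for digit in range(parent_res + 1, child_res + 1):
        h |= UNUSED_DIGIT << ((MAX_RES - digit) * DIGIT_BITS)
    return format(h, "x")


def inherit_from_res7(cell_res8: str, res7_values: dict[str, float]) -> float:
    """Look up a coarse-layer value (e.g. SOC) stored on the res-7 parent."""
    return res7_values[cell_to_parent(cell_res8, 7)]
```

A res-8 row for SOC would then carry the parent's value plus a pointer to the res-7 aggregate it was inherited from, so the audit chain stays intact.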

4. Why H3 over farm-grain rollup

H3 cells are better than farm polygons for the redesign:

Eternal: cell 882ac4f9ffffff always covers the same hexagon. No geometry_version invalidation needed. Farm polygons get edited; H3 cells don’t.
Per-stratum natural: at 74 ha, ~70% of cells are single-stratum. Stratum composition is stored per cell as a normal column instead of needing JSONB filter slices.
Spatially queryable: “show me forest cells with AGB > 200” is a simple SQL filter.
Audit-aligned: UI rows = data rows. Each row a user sees on screen is one cell, with one provenance chain.

FarmMetric becomes a derived rollup of H3 cells inside the farm polygon. Engine reads farm-level summary numbers; per-cell detail backs every claim.
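As an illustration of the rollup, here is a minimal sketch of a forest-area-weighted mean AGB density for one farm. The `CellOverlap` record and the weighting rule are assumptions: the doc only says "spatial join + weighted aggregation", and this version further assumes forest cover is uniform across each cell, so the overlap area can be scaled by `forest_cover_pct`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellOverlap:
    """One H3 cell intersecting the farm polygon (hypothetical record shape)."""
    h3_index: str
    overlap_ha: float          # area of the cell that falls inside the farm polygon
    forest_cover_pct: float
    agb_density_mg_ha: float   # forest-only AGB density for the cell


def farm_agb_density(cells: list[CellOverlap]) -> float:
    """Forest-area-weighted mean AGB density (Mg/ha) across a farm's cells."""
    weights = [c.overlap_ha * c.forest_cover_pct / 100.0 for c in cells]
    total = sum(weights)
    if total == 0:
        raise ValueError("farm has no forest area in any overlapping cell")
    return sum(w * c.agb_density_mg_ha for w, c in zip(weights, cells)) / total
```

Keeping the per-cell weights around (for example in the optional `farm_metric_h3_source` join table) would let an auditor recompute the farm number from its cells.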

5. Stratification — composition, not polygons

Strata are not stored as polygons. Each H3 cell has:

stratum_dominant (enum): the majority pixel class — drives UI color and filters
stratum_composition (JSONB): normalized class histogram — {primary_forest: 0.78, pasture: 0.20, water: 0.02}
forest_cover_pct (numeric): derived from composition
agb_density_mg_ha: AGB masked to forest pixels only within the cell
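The first three columns can all be derived from one class histogram. The sketch below assumes the histogram arrives as per-class pixel counts and that the forest classes are `primary_forest` and `secondary_forest`; the real class taxonomy is still an open question (§14), so treat both the class names and the function name as placeholders.

```python
FOREST_CLASSES = frozenset({"primary_forest", "secondary_forest"})  # assumed taxonomy


def summarize_composition(pixel_counts: dict[str, int]) -> dict:
    """Turn per-class pixel counts into the per-cell stratum columns."""
    total = sum(pixel_counts.values())
    if total == 0:
        raise ValueError("cell has no valid pixels")
    # Normalized histogram: class -> fraction of the cell's pixels.
    composition = {cls: n / total for cls, n in pixel_counts.items()}
    dominant = max(composition, key=composition.get)
    forest_pct = 100.0 * sum(
        frac for cls, frac in composition.items() if cls in FOREST_CLASSES
    )
    return {
        "stratum_dominant": dominant,
        "stratum_composition": composition,
        "forest_cover_pct": forest_pct,
    }
```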

Engine math per cell respects composition:

cell_baseline_tco2e =
    agb_forest_only       (185 Mg/ha)
  × forest_cover_pct/100  (0.78)
  × cell_area_ha          (73.6)
  × carbon_fraction       (0.46)
  × (44/12)               (CO₂ stoichiometry)
  × baseline_defor_rate   (fraction per year, i.e. baseline_rate_pct_yr/100; varies, see §7)

Mixed cells get credited only for their forest fraction. No double-counting of pasture as forest.
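The per-cell formula can be written as two small functions, one for the forest carbon stock and one for the share of it lost per year under a baseline. This is a sketch under two assumptions: the baseline rate is stored in percent per year (as `baseline_rate_pct_yr` suggests) and is applied to the standing stock. The function names are illustrative, not the engine's.

```python
CO2_PER_C = 44.0 / 12.0  # molecular-weight ratio converting tonnes C to tonnes CO2


def cell_carbon_stock_tco2e(
    agb_forest_only_mg_ha: float,
    forest_cover_pct: float,
    cell_area_ha: float,
    carbon_fraction: float = 0.46,
) -> float:
    """Forest carbon stock of one cell in tCO2e, counting only its forest fraction."""
    forest_area_ha = cell_area_ha * forest_cover_pct / 100.0
    return agb_forest_only_mg_ha * forest_area_ha * carbon_fraction * CO2_PER_C


def cell_baseline_tco2e(stock_tco2e: float, baseline_rate_pct_yr: float) -> float:
    """Baseline emissions per year: stock times the rate expressed as a fraction."""
    return stock_tco2e * baseline_rate_pct_yr / 100.0


# Inputs from the formula above; the rate is whichever baseline layer is active.
stock = cell_carbon_stock_tco2e(185.0, 78.0, 73.6)
baseline = cell_baseline_tco2e(stock, baseline_rate_pct_yr=2.41)
```

Splitting the stock from the rate keeps the stock reusable when the active baseline changes.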

6. The `raster_aggregates` table — the audit anchor

One row per scientific computation. Carries:

What dataset: raster_source FK + version + asset URLs + asset hash
What geometry: kind (h3_cell / farm / project / leakage_belt) + id + geometry_version
What slice: stratum_filter JSONB (rarely used at H3 grain — composition columns cover most needs)
What time: vintage + period_start/end for windowed metrics
What number: aggregate_method (mean/sum/histogram/p95/std), value, unit, pixel_count
Uncertainty: uncertainty_pct_ci95 from the raster’s own uncertainty layer (HGB SE, ESA CCI Biomass uncertainty band, SoilGrids SD)
Reproducibility: code_version (git sha), runner_key, inputs_fingerprint (SHA of geometry + asset_hash + code_sha + params)

Same inputs_fingerprint = same number. Different fingerprint = re-run was triggered. Auditor can verify by hash.
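One way to get a stable fingerprint is to hash a canonical JSON serialization of the four inputs. The sketch below assumes SHA-256, a WKT string for the geometry, and JSON-serializable params; the doc only specifies "SHA of geometry + asset_hash + code_sha + params", so the exact encoding is an assumption.

```python
import hashlib
import json


def inputs_fingerprint(
    geometry_wkt: str, asset_hash: str, code_sha: str, params: dict
) -> str:
    """Deterministic hash of everything that determines an aggregate's value."""
    payload = json.dumps(
        {
            "geometry": geometry_wkt,
            "asset_hash": asset_hash,
            "code_sha": code_sha,
            "params": params,
        },
        sort_keys=True,             # key order must not change the hash
        separators=(",", ":"),      # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys and fixing separators matters: without them, two runs with identical inputs could serialize differently and look like a re-run.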

7. Multi-baseline switching — baselines are per-cell layers

Old: “the baseline rate is 0.82%/yr” — a project-wide scalar.

New: each baseline method produces a per-cell rate layer, with a baseline_source discriminator. Switching baselines means querying a different layer slice.

Method	What it produces	Storage
C6 internal historical	scalar from PRODES inside project polygon	per-cell, uniform
PRODES national reference	scalar from state/biome trend	per-cell, uniform
Verra VM0048 jurisdictional	per-cell rate from Verra risk raster + JAD allocation	per-cell, varies spatially
ART TREES national NFMS	scalar from national forest monitoring system	per-cell, uniform per jurisdiction
VM0047 dynamic control	matched-control ΔSI, no rate concept	per-cell, layer code differs

Engine math becomes uniform across methods:

SUM(agb × forest_pct/100 × cell_area × cf × 44/12 × baseline_rate_pct_yr/100)
WHERE baseline_rate.baseline_source = :active

Switching = changing one parameter. Same SQL, different number.
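The "same SQL, different number" idea can be tried end to end with an in-memory SQLite database. Table names, column names, and the second cell's values are invented for illustration (the production store is Postgres, and the real schema may differ); only the shape of the query follows the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cell_metrics (h3 TEXT PRIMARY KEY, agb REAL, forest_pct REAL, cell_area REAL);
    CREATE TABLE baseline_rate (h3 TEXT, baseline_source TEXT, rate_pct_yr REAL);
    """
)
conn.executemany(
    "INSERT INTO cell_metrics VALUES (?, ?, ?, ?)",
    [("cell_a", 185.0, 78.0, 73.6), ("cell_b", 210.0, 100.0, 73.6)],
)
conn.executemany(
    "INSERT INTO baseline_rate VALUES (?, ?, ?)",
    [
        ("cell_a", "c6_internal_historical", 0.82),  # uniform scalar copied to each cell
        ("cell_b", "c6_internal_historical", 0.82),
        ("cell_a", "verra_vm0048_mt", 2.41),         # spatially varying rate
        ("cell_b", "verra_vm0048_mt", 1.10),
    ],
)

BASELINE_SQL = """
SELECT SUM(m.agb * m.forest_pct / 100.0 * m.cell_area * :cf * 44.0 / 12.0
           * r.rate_pct_yr / 100.0)
FROM cell_metrics AS m
JOIN baseline_rate AS r ON r.h3 = m.h3
WHERE r.baseline_source = :active
"""


def project_baseline_tco2e(active: str, cf: float = 0.46) -> float:
    """Run the one baseline query; only the :active parameter changes per scenario."""
    (total,) = conn.execute(BASELINE_SQL, {"active": active, "cf": cf}).fetchone()
    return total


internal = project_baseline_tco2e("c6_internal_historical")
verra = project_baseline_tco2e("verra_vm0048_mt")
```

Storing the uniform scalars as per-cell rows costs some redundancy, but it is what lets every method share one query.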

UI gain: side-by-side scenario table (“under C6 internal: $2.66M; under PRODES national: $3.30M; under Verra VM0048: $2.26M”) + per-cell baseline-rate heatmap + per-cell delta map between methods.

8. Per-pool uncertainty — wired through

Uncertainty is captured at zonal-stats time from each raster’s own uncertainty layer:

HGB ships AGB SE (band 2) → uncertainty_pct_ci95 on the AGB RasterAggregate
ESA CCI Biomass ships per-pixel uncertainty → same
SoilGrids ships per-pixel SD → same
MapBiomas / PRODES ship no uncertainty → uncertainty_method = 'none', engine applies a conservative default deduction

Engine helper combined_uncertainty_pct(metrics) propagates per-pool CIs into a project total per VMD0017 (Verra) or TREES guidance.

New vintage_ledger row inserted between buffer and issuable: “Uncertainty deduction (X% CI → Y% deduction)”. Today’s ledger is gross → leakage → buffer → issuable. Becomes gross → leakage → buffer → uncertainty deduction → issuable.
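A possible shape for `combined_uncertainty_pct` is sketched below. It uses the standard error-propagation rule for a sum of independent quantities (quadrature of absolute uncertainties, divided by the total). That rule, the independence assumption, and the input shape are all assumptions here; VMD0017 or TREES may prescribe a different formula, and the mapping from combined CI to a deduction percentage is left out because the doc does not specify it.

```python
import math


def combined_uncertainty_pct(metrics: dict[str, tuple[float, float]]) -> float:
    """Combine per-pool 95% CI half-widths (in percent) into a project-level percent.

    `metrics` maps a pool code to (estimate_tco2e, uncertainty_pct_ci95).
    Pools are assumed to be statistically independent.
    """
    total = sum(estimate for estimate, _ in metrics.values())
    if total == 0:
        raise ValueError("total estimate is zero; relative uncertainty is undefined")
    # Each pool's absolute uncertainty is estimate * pct; add them in quadrature.
    quadrature = math.sqrt(sum((est * pct) ** 2 for est, pct in metrics.values()))
    return quadrature / abs(total)
```

A pool with `uncertainty_method = 'none'` would need a conservative default percentage substituted before calling this helper.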

9. Pool & methodology configuration

Pools (AGB, BGB, DW, L, SOC, HWP) are not separate types in storage. They’re metric_codes like everything else. “Pool-ness” is a methodology-config role, not a data-model concept.

Methodology defines its own input list:

# app/methodologies/vm0048.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="forest_cover_pct", ...),
    InputSpec(code="baseline_rate_pct_yr", source="active_baseline", ...),
    # DW, L, HWP not in this list — VM0048 lets us exclude as de minimis
)

A logging project methodology adds DW + HWP to its list:

# app/methodologies/vm0007_ifm.py
REQUIRED_INPUTS = (
    InputSpec(code="agb_forest_only", ...),
    InputSpec(code="dw_density", ...),
    InputSpec(code="hwp_carbon", ...),
    ...
)

Engine reads only what the methodology asks for. Adding a new pool = config + new runner, no schema change.
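To make "engine reads only what the methodology asks for" concrete, here is a minimal sketch with a stand-in `InputSpec`. The real `InputSpec` has fields the doc elides, so this two-field version, the `select_inputs` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSpec:
    """Stand-in for the real InputSpec; only the two fields shown in the doc."""
    code: str
    source: str = "raster"  # hypothetical default


VM0048_INPUTS = (
    InputSpec(code="agb_forest_only"),
    InputSpec(code="forest_cover_pct"),
    InputSpec(code="baseline_rate_pct_yr", source="active_baseline"),
)


def select_inputs(required: tuple[InputSpec, ...], available: dict[str, float]) -> dict[str, float]:
    """Return only the metrics a methodology declares, failing loudly on gaps."""
    missing = [spec.code for spec in required if spec.code not in available]
    if missing:
        raise KeyError(f"missing required inputs: {missing}")
    return {spec.code: available[spec.code] for spec in required}


# dw_density exists in storage but is ignored because VM0048 does not list it.
available = {
    "agb_forest_only": 185.0,
    "forest_cover_pct": 78.0,
    "baseline_rate_pct_yr": 2.41,
    "dw_density": 12.0,
}
engine_inputs = select_inputs(VM0048_INPUTS, available)
```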

10. Storage shape — example of one cell

For H3 cell 882ac4f9ffffff in project C6-BR-2406, vintage 2026:

Five raster_aggregates rows (one per source × layer):

HGB AGB mean (forest-only mask) → 185 Mg/ha, CI ±18.6%
MapBiomas LULC histogram → {primary_forest: 0.78, pasture: 0.20, ...}
MapBiomas Alerta defor → 0.60 ha
SoilGrids SOC (inherited from res-7 parent) → 73.4 tC/ha, CI ±41%
MapBiomas Fogo cumulative burned → 4.2% of cell

~10 project_h3_metric rows (canonical-unit, FK to aggregates):

agb_density_mg_ha = 185.0 (FK → HGB aggregate)
agb_uncertainty_pct_ci95 = 18.6 (FK → HGB aggregate)
forest_cover_pct = 78.0 (FK → MapBiomas histogram)
stratum_dominant = “primary_forest” (derived)
stratum_composition = JSONB
defor_ha_yr = 0.60 (FK → MapBiomas Alerta aggregate)
soc_density_tc_ha = 73.4 (FK → SoilGrids aggregate)
baseline_rate_pct_yr [c6_internal_historical] = 0.82 (uniform scalar)
baseline_rate_pct_yr [verra_vm0048_mt] = 2.41 (FK → Verra risk map aggregate)
flag = “ok” (derived)

Two L3 outputs (engine math per cell, also project_h3_metric rows):

baseline_tco2e [under active baseline] = 499 t
issuable_tco2e [post all deductions] = 389 t

The UI you’ve already built (per-cell table with H3 INDEX / AGB / DEFOR / BASELINE / ISSUABLE / FLAG columns) renders these rows directly.

11. What changes from today

New tables

raster_aggregates — the audit anchor. Per-(geometry × source × vintage × method) row.
project_baselines — per-project metadata about each baseline scenario (source, label, is_active, computed_at, project_total summary, FKs to L1 inputs).
farm_metric_h3_source (optional) — join table linking FarmMetric rollups to constituent H3 cells.

Modified tables

project_h3_metric — gains raster_aggregate_id FK + baseline_source discriminator.
farm_metrics — gains raster_aggregate_id FK; eventually drops parser_version, source_document_id, carbon_stock_source_id (vendor-era columns).

Deprecated

Vendor parsers: app/services/vendor_report/parsers/canopy.py, c2050.py, clearblue.py
carbon_measurements table (already marked deprecated)

Engine changes

Engine reads CI from L1 aggregate via FK; applies methodology uncertainty deduction
Engine takes (project, vintage, active_baseline_id); routes to per-cell baseline_rate layer for that source
Engine math becomes mostly methodology-agnostic — methodology choice = which layer codes + which baseline_source to read
New vintage_ledger row for uncertainty deduction

UI gains

Click any cell row → audit chain to RasterAggregate → RasterDataSource → reproducible
Side-by-side baseline scenario comparison
Per-cell baseline rate heatmap (especially valuable for VM0048 risk visualization)
Per-cell delta map between two baselines
Project-type-from-satellite classifier (“this farm looks like REDD-AUD; this one looks like ARR”)

12. Migration phases

Each phase shippable independently.

Phase	Scope	Estimate
0	`raster_aggregates` table; runners write to it (nothing reads yet)	1 wk
1	`project_h3_metric.raster_aggregate_id` + `farm_metrics.raster_aggregate_id` FKs; runners populate going forward	1 wk
2	Engine reads CI from L1; new `vintage_ledger` uncertainty-deduction row	1–2 wk
3	`project_baselines` table + scenario UI; `baseline_source` discriminator on per-cell layer; engine takes baseline arg	2 wk
4	Per-stratum composition columns on cells; engine multiplies by `forest_cover_pct` for accurate mixed-cell math	1 wk
5	Smallholder res-9 promotion; SOC + GEDI res-7 inheritance	1 wk
6	Retire vendor parsers + drop legacy columns; finalize `carbon_measurements` removal	1 wk

Total: ~8–9 weeks with testing.

Recommended first move: don’t start with the migration. Pick the HGB biomass runner and do a vertical slice — write one RasterAggregate row + one ProjectH3Metric row, wire UI click-through, verify reproducibility — before migrating other runners.

13. Capabilities this unlocks

For audit conversations:

Auditor: “Where did this 78.4 tC/ha for farm 553 come from?” C6: clicks farm metric → H3 cells → RasterAggregate → HGB raster URL + hash + git sha + computed timestamp + carbon fraction source. Reproducible to 4 decimals.

For methodology migrations (VM0007 → VM0048):

User: “What’s our credit forecast under VM0048 vs VM0007 vs PRODES national?” UI: three project_baselines rows side-by-side. Toggle active. vintage_ledger recomputes. All three stay in the audit trail.

For new project types (logging, ARR, soil carbon):

Marc: “We just signed an IFM project, need DW + HWP.” Dev: writes new runner using GEDI L4A + national logging records. Adds dw_density to lookup_metric_type, units registry, IFM methodology’s REQUIRED_INPUTS. No schema migration. Done in 2 days.

For project-type discovery:

Onboarding wizard: a new farm is uploaded as KMZ and its satellite signals are parsed. UI suggests “Based on 89% forest cover + 0.6%/yr historical defor, this farm looks like a REDD-AUD candidate. VM0048 eligible: yes (Mato Grosso). Confirm?”

14. Open questions

Stratum class taxonomy — fixed C6 enum + mapping tables from MapBiomas / PRODES / Hansen class IDs. Who owns the decision?
Smallholder threshold — exact farm-area cutoff for res-9 promotion. 200 ha? 500 ha? Per project, per registry?
Vendor retirement timeline — active projects relying on vendor numbers need parallel-run + numerical comparison before cutover.
VM0048 ingestion path — direct from Verra (data licensing) or via partner (Everland, Wildlife Works, etc.)?
ARR & Plan Vivo on roadmap? This affects whether to invest in VM0047 control-plot matching, livelihood/ecosystem indicators, etc.
Plot data path — if C6 ever collects field plots (validation campaigns), separate field_observation table vs. shoehorn into RasterAggregate with runner_key='field_plot'?
Histogram column — collapse the 8 separate lulc_*_ha per-class FarmMetric rows into a single histogram aggregate, or keep as-is for backward compat?

15. The principle

Pixels live in GCS forever. Postgres stores the aggregates we’ve asked for, with full provenance back to source rasters and code. Methodologies are configurations on top — pool lists, baseline sources, deduction rules — not separate data paths. Adding new methodologies, baseline methods, or carbon pools is a config + runner change, not a schema migration.

That’s the v2 model in one paragraph. Everything else is implementation detail.