# C6 Carbon Model v2 — Satellite-Native Redesign Plan
Date: 2026-05-07
Pairs with: c6-carbon-model.md (the gap analysis), methodology-inputs-comparison.md (the methodology survey).
Status: Plan for discussion. No code yet.
## 1. Decisions feeding this design
- No more vendor PDFs. The Canopy, ClearBlue, and C2050 ingestion paths get retired. All values come from per-pixel satellite rasters that C6 itself aggregates via zonal statistics.
- Audit-first. Every farm metric must trace back to: which raster source, which version, which farm geometry, which code path, on what date. Auditor must be able to reproduce the number from inputs.
- Coverage > completeness. Cover the metrics actively-used methodologies need. Don’t build for hypothetical future projects. But the schema must let new metrics drop in without redesign — if a logging project shows up, dead wood + HWP + per-tree allometric data must be addable in a migration, not a rewrite.
- Carry forward design constraints from the gap analysis:
  - `baseline_source` discriminator (project can switch between C6-internal / Verra-jurisdictional / national PRODES baselines and compare)
  - Per-pool uncertainty stored and used by the engine
  - Stratification (per-stratum EFs, not farm-wide averages)
  - Pool list is methodology config, not engine-hardcoded
## 2. Architectural shift

**Today**

```
Vendor PDF → parser → FarmMetric.value_canonical
                      source_tag = "Canopy v4"
```

The vendor is the source of truth. Provenance is a string label.

**Tomorrow**

```
Raster sources  →  zonal-stats runner  →  RasterAggregate row    →  FarmMetric row
(MapBiomas,        (deterministic from    (per farm × source        (canonical farm-level
 HGB, GEDI,         inputs + code)         × vintage; the            value, FK to the
 SoilGrids,                                "scientific receipt")     aggregate(s) it
 SRTM, …)                                                            derived from)
```
Provenance is a chain of FKs, not a string. Auditor can click through FarmMetric → RasterAggregate → RasterDataSource → external dataset URL + version + hash.
The big win: reproducibility. Given the input rasters and the code, the same aggregate produces the same number. Vendor PDFs were opaque — your number was whatever the vendor said. Now your number is whatever your code computes from public rasters, and anyone can re-run it.
## 3. Four-layer data model
| Layer | Table(s) | What it stores | Cardinality |
|---|---|---|---|
| L0 — Raster source registry | raster_data_sources (exists) | One row per external dataset family + version (MapBiomas Col 10, HGB 2010, SoilGrids v2, …). Key, version, storage tier, URI pattern. | low (~30 rows) |
| L1 — Raster aggregate (NEW) | raster_aggregates | One row per (geometry, raster_source, vintage, stratum_filter, aggregate_method). The output of running zonal-stats. The audit anchor. | medium (~10–50 per farm-year) |
| L2 — Farm metric (modify) | farm_metrics (exists) | Canonical-unit farm-level values that the engine reads. Now linked to L1 via FK. | medium (~30 per farm-year) |
| L3 — Methodology output | vintage_ledger, project_computation (exist) | Engine output: gross/net/credits/revenue per vintage, per project. | low (~5 per project-year) |
L0 + L3 are mostly there. L1 is new and is the main lift. L2 changes lightly.
## 4. The new raster_aggregates table
This is the heart of the redesign. One table to anchor everything.
```python
from __future__ import annotations

from datetime import date, datetime
from decimal import Decimal

from sqlalchemy.orm import Mapped

class RasterAggregate(IdMixin, TimestampedMixin, Base):
    """Per-(geometry × raster_source × vintage × stratum × method) aggregate value.

    The reproducible scientific receipt. Every FarmMetric points to one or
    more RasterAggregate rows; each RasterAggregate points to a
    RasterDataSource and carries enough metadata to re-run.
    """

    __tablename__ = "raster_aggregates"

    # What dataset
    raster_source_id: Mapped[int]                 # FK → raster_data_sources
    raster_version: Mapped[str]                   # e.g. "col-10", "2010", "v5"
    raster_asset_hrefs: Mapped[list[str]]         # JSONB: URIs read
    raster_asset_hash: Mapped[str | None]         # SHA of the asset(s) at read time

    # What geometry
    geometry_kind: Mapped[str]                    # 'farm' | 'project' | 'leakage_belt' | 'stratum'
    geometry_id: Mapped[int]                      # FK target depends on kind
    geometry_version: Mapped[str | None]          # geometry hash to detect re-runs after edits

    # What slice of the geometry (NULL = whole geometry)
    stratum_filter: Mapped[dict | None]           # JSONB: {"forest_type": "primary_amazon", "disturbance": "intact"}

    # What time
    vintage_year: Mapped[int]
    period_start: Mapped[date | None]             # for windowed metrics (cumulative defor, fire history)
    period_end: Mapped[date | None]

    # What number
    aggregate_method: Mapped[str]                 # 'mean' | 'sum' | 'class_histogram' | 'p95' | 'std'
    metric_code: Mapped[str]                      # FK-by-string → lookup_metric_type.code (e.g. "agb_density_mg_ha")
    value: Mapped[Decimal | None]                 # primary scalar
    value_unit: Mapped[str]                       # raw unit before canonical conversion
    histogram: Mapped[dict | None]                # JSONB: {class_id: pixel_count} for class_histogram
    pixel_count: Mapped[int]
    pixel_nodata_count: Mapped[int]

    # Uncertainty (from raster's own uncertainty layer if present)
    uncertainty_pct_ci95: Mapped[Decimal | None]  # 95% CI half-width as %
    uncertainty_method: Mapped[str | None]        # 'raster_layer' | 'pixel_std' | 'none'

    # Reproducibility
    code_version: Mapped[str]                     # git sha of zonal-stats runner
    runner_key: Mapped[str]                       # 'biomass.hgb' | 'mapbiomas.lulc' | …
    computed_at: Mapped[datetime]
    inputs_fingerprint: Mapped[str]               # SHA(geometry_hash + asset_hash + code_sha + params)

    # Optional context (carbon fraction used, biome, depth band, etc.)
    context: Mapped[dict | None]                  # JSONB: {"carbon_fraction": 0.46, "biome": "cerrado", "depth_cm": 30}
```
Why this shape:
- One row per scientific computation, not per metric. A single zonal-stats run can produce a histogram (LULC class areas) or a scalar (mean AGB) — both fit. Multiple metric_codes for one geometry+source can share an `inputs_fingerprint` so the UI groups them.
- `stratum_filter` JSONB lets you pre-compute per-stratum aggregates without a polygon-per-stratum table. You run zonal-stats with a mask {forest_type=primary, disturbance=intact} and store one row per stratum slice.
- `inputs_fingerprint` is what makes audit easy: same fingerprint = same number, no re-run needed. Different fingerprint = a re-run was triggered by an input change (geometry edit, raster version bump, code change). The UI surfaces the diff.
- `uncertainty_pct_ci95` lives here, not on `farm_metrics`. The engine reads uncertainty from the underlying L1 row. If a raster ships uncertainty (HGB does, ESA CCI Biomass does), we capture it. If not, we set `uncertainty_method = 'none'` and the methodology engine knows to apply a conservative default deduction.
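The fingerprint logic can be sketched in a few lines. This is illustrative only: the field order, separator, and sorted-key JSON canonicalization are assumptions, not a settled scheme.

```python
import hashlib
import json

def inputs_fingerprint(geometry_hash: str, asset_hash: str,
                       code_sha: str, params: dict) -> str:
    """SHA-256 over everything that can change the output: geometry,
    raster asset, code version, and run parameters. Params are dumped
    with sorted keys so dict ordering cannot produce a spurious
    'changed inputs' signal."""
    payload = "|".join([
        geometry_hash,
        asset_hash,
        code_sha,
        json.dumps(params, sort_keys=True, separators=(",", ":")),
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same inputs reproduce the same fingerprint; any input change breaks it.
a = inputs_fingerprint("geom-xyz", "sha256:abc", "def123", {"stratum_filter": None})
b = inputs_fingerprint("geom-xyz", "sha256:abc", "def123", {"stratum_filter": None})
c = inputs_fingerprint("geom-NEW", "sha256:abc", "def123", {"stratum_filter": None})
assert a == b and a != c
```

Stored on the row, this gives the writer a cheap skip check: if the freshly computed fingerprint matches the stored one, no re-run is needed.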
## 5. farm_metrics changes (small)
Keep most of the existing schema. Two changes:
- Add a nullable `raster_aggregate_id: int | None` FK. Nullable because not every farm_metric is satellite-derived (e.g. project-level config metrics are still hand-set).
- Drop vendor-specific columns once vendor ingestion is gone: `parser_version`, `source_document_id`, `carbon_stock_source_id`. Move to a deprecation phase first so nothing breaks.

`source_tag` stays as a human-readable label, but the FK is now the source of truth. The UI shows the tag; the auditor follows the FK.
```python
class FarmMetric(...):
    # existing columns kept

    raster_aggregate_id: Mapped[int | None] = mapped_column(
        ForeignKey("raster_aggregates.id", ondelete="SET NULL"),
        nullable=True, index=True,
    )

    # parser_version, source_document_id, carbon_stock_source_id → drop in phase 5
```
For metrics derived from multiple raster aggregates (e.g. forest_cover_pct = forest_cover_ha / total_area_ha — two aggregates), use a join table:
```python
class FarmMetricSource(Base):
    farm_metric_id: Mapped[int]        # FK → farm_metrics
    raster_aggregate_id: Mapped[int]   # FK → raster_aggregates
    role: Mapped[str]                  # 'numerator' | 'denominator' | 'primary'
```
Most metrics have one source — the simple FK on farm_metrics covers them. Composite metrics use the join table.
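To make the split concrete, here is a sketch of how a writer might resolve a composite metric from role-tagged sources. The dict shape stands in for FarmMetricSource rows joined to their aggregates, and the function name is hypothetical.

```python
from decimal import Decimal

def resolve_composite(sources: list[dict]) -> Decimal:
    """Compute a metric from role-tagged RasterAggregate values.
    `sources` mimics FarmMetricSource rows joined to their aggregates."""
    by_role = {s["role"]: Decimal(str(s["value"])) for s in sources}
    if "numerator" in by_role and "denominator" in by_role:
        return by_role["numerator"] / by_role["denominator"] * 100  # ratio as percent
    return by_role["primary"]  # simple single-source metric

pct = resolve_composite([
    {"role": "numerator", "value": "412.5"},    # forest_cover_ha aggregate
    {"role": "denominator", "value": "550.0"},  # total_area_ha aggregate
])
```

Either way, the FarmMetric row ends up traceable to every aggregate that fed it.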
## 6. Stratification — recommend doing it without a stratum table
Two designs were on the table. After thinking about it: don’t store strata as polygons. Store per-stratum aggregates as L1 rows.
Why
Strata in LATAM are entirely pixel-derived. MapBiomas Col 10 already classifies every pixel by forest type + secondary-vs-primary + age class. PRODES adds disturbance vintage. There’s no human input to a stratum boundary — it’s a deterministic function of the raster stack.
If you materialize strata as polygons, you’ve duplicated information that already exists in the rasters and now have to keep them in sync when MapBiomas releases a new collection. That’s a maintenance trap.
Instead
Run zonal-stats with a class mask. For a farm:
- Aggregate 1: AGB mean, geometry=farm, stratum_filter=NULL → farm-wide AGB
- Aggregate 2: AGB mean, geometry=farm, stratum_filter={forest_type: "primary_amazon"} → primary-forest-only AGB
- Aggregate 3: AGB mean, geometry=farm, stratum_filter={forest_type: "secondary"} → secondary-only AGB
- Aggregate 4: pixel_count by stratum → area per stratum (the stratum-level activity weights)
Engine consumes per-stratum aggregates and weights by per-stratum area when computing emissions. The “stratum” is virtual — it’s just a mask.
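A minimal numpy sketch of the virtual-stratum idea: the same AGB and LULC pixel arrays yield a per-stratum mean and area just by masking. The pixel size and class codes here are illustrative assumptions.

```python
import numpy as np

PIXEL_HA = 0.09  # ~30 m pixel (illustrative; the real value comes from the grid)

def stratum_aggregate(agb: np.ndarray, lulc: np.ndarray, class_ids: set[int]) -> dict:
    """Mean AGB and area over pixels whose LULC class is in class_ids.
    The 'stratum' never exists as a polygon: it is only this boolean mask."""
    mask = np.isin(lulc, list(class_ids))
    pixels = int(mask.sum())
    return {
        "agb_mean": float(agb[mask].mean()) if pixels else None,
        "area_ha": pixels * PIXEL_HA,
        "pixel_count": pixels,
    }

agb = np.array([[120.0, 80.0], [40.0, 30.0]])  # AGB density per pixel
lulc = np.array([[3, 3], [21, 21]])            # 3 = primary forest, 21 = pasture (made-up codes)
primary = stratum_aggregate(agb, lulc, {3})    # one RasterAggregate row per stratum slice
```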
When you’d materialize strata anyway
If a methodology requires a defended stratum map (signed off by the auditor), serialize it: take the latest pixel-class output and dump to a polygon table at verification time, snapshotted with the verification record. That’s an artifact, not the primary store. farm_stratum_snapshot if/when needed; not v2-day-one.
## 7. Extensibility — adding a new metric or pool
Adding a metric (e.g. dead-wood density for a future logging project):

1. Add a row to `lookup_metric_type` (`dw_density`, `tC/ha`).
2. Add a converter to `app/units/registry.py` (`dw_density: Measurement(canonical="tc_per_ha", accepts=…, convert=biomass_convert)`).
3. Find a raster source that produces it. If none exists, register it in `raster_data_sources`.
4. Write a runner, `app/services/satellite/{provider}/runner.py`, that fetches the raster, runs zonal-stats over the farm geometry, writes a `RasterAggregate` row, then writes/updates a `FarmMetric` row pointing at it.
5. Add `dw_density` to the relevant methodology's `REQUIRED_INPUTS` in `app/methodologies/{vmXXXX}.py`.
Five steps. None require schema changes (assuming we’re inside the registry-and-runner pattern). That’s the goal.
If you ever do need a non-satellite input (e.g. survey-collected livelihood indicator for Plan Vivo), add a runner kind 'survey' that writes RasterAggregate rows with raster_source_id = NULL and runner_key = 'survey.<form>'. The audit semantics still work — the receipt points to the survey instrument and date.
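That survey path could look like the following sketch, where a plain dict stands in for the RasterAggregate insert; the form key, metric, and field values are hypothetical.

```python
def survey_aggregate_row(farm_id: int, metric_code: str, value: float,
                         form_key: str, collected_on: str) -> dict:
    """Build a RasterAggregate-shaped row for a non-satellite input.
    No raster behind the value: the 'receipt' is the survey instrument
    plus collection date, carried in runner_key and context."""
    return {
        "raster_source_id": None,
        "runner_key": f"survey.{form_key}",
        "geometry_kind": "farm",
        "geometry_id": farm_id,
        "metric_code": metric_code,
        "value": value,
        "aggregate_method": "survey_response",
        "context": {"collected_on": collected_on},
    }

# Hypothetical Plan Vivo-style livelihood indicator for farm 553.
row = survey_aggregate_row(553, "household_income_index", 0.62,
                           "livelihood_v1", "2026-03-01")
```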
## 8. Per-pool uncertainty — wired through the new model
Today: `value_min_canonical` / `value_max_canonical` exist on `farm_metrics`, but no engine path uses them.

Tomorrow:
- `RasterAggregate.uncertainty_pct_ci95` is the source of truth. Captured at zonal-stats time from the raster's own uncertainty layer (HGB ships AGB SE per pixel; ESA CCI Biomass ships per-pixel uncertainty; SoilGrids ships SOC SD). `FarmMetric` reads it via the FK. No need to copy it.
- New methodology engine helper `combined_uncertainty_pct(metrics)` propagates per-pool uncertainties into a total per VMD0017 (Verra) or TREES guidance.
- New `vintage_ledger` row inserted between buffer and issuable: "Uncertainty deduction (X% combined CI → Y% deduction per VM0048 Table 14)".
This is engine work, not schema work. Schema gives the engine the data to compute against.
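The `combined_uncertainty_pct` helper could combine per-pool 95% CI half-widths in quadrature, in the spirit of VMD0017-style error propagation. The deduction mapping belongs to methodology config; this sketch only shows the propagation step, under the assumption that pool errors are independent.

```python
import math

def combined_uncertainty_pct(pools: list[tuple[float, float]]) -> float:
    """Combine per-pool uncertainties into a project-level figure:
    quadrature sum of absolute 95% CI half-widths, expressed as a
    percentage of the summed estimate. `pools` is a list of
    (estimate_tco2e, ci95_half_width_pct) pairs, assumed independent."""
    total = sum(est for est, _ in pools)
    half_width = math.sqrt(sum((est * pct / 100.0) ** 2 for est, pct in pools))
    return half_width / total * 100.0

# Two independent 18%-uncertain pools of equal size combine to ~12.7%,
# not 18%: independent errors partially cancel.
u = combined_uncertainty_pct([(1000.0, 18.0), (1000.0, 18.0)])
```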
## 9. Baseline switching — the project_baselines table
From earlier conversation. Make baselines first-class.
```python
class ProjectBaseline(IdMixin, TimestampedMixin, Base):
    __tablename__ = "project_baselines"

    project_id: Mapped[int]
    methodology_protocol_id: Mapped[int]  # which methodology this baseline is intended for
    source: Mapped[str]                   # enum: 'historical_projected' | 'jurisdictionally_allocated' | 'dynamic_control_matched' | 'community_PRA' | 'national_NFMS'
    label: Mapped[str]                    # "PRODES 10-yr trend (2014-2023)" / "Verra MT risk map v1.2 (2024)" / "C6 internal initial estimate"

    # Core baseline numbers (sparse — only some are non-null per source kind)
    rate_pct_per_yr: Mapped[Decimal | None]              # historical_projected
    jad_tco2e_yr: Mapped[Decimal | None]                 # jurisdictionally_allocated
    allocated_baseline_tco2e_yr: Mapped[Decimal | None]  # jurisdictionally_allocated post-allocation
    control_match_id: Mapped[int | None]                 # FK to dynamic match record (V47)

    # Provenance (raster-aggregate-anchored where applicable)
    raster_aggregate_ids: Mapped[list[int]]   # JSONB array
    external_data_ref: Mapped[dict | None]    # JSONB: {"verra_jad_file_url": "...", "version": "...", "hash": "..."}

    is_active: Mapped[bool]                   # only one active per (project, methodology) at a time
    notes: Mapped[str | None]
    computed_at: Mapped[datetime]
```
Engine takes (project, baseline) instead of project. UI shows side-by-side: “C6 initial 9,200 tCO₂e/yr | Verra jurisdictional 7,800 | PRODES national 11,400”. User toggles active baseline; vintage_ledger re-computes.
This is the single biggest UX upgrade in v2. It turns “what’s our credit forecast” from a one-shot number into a defensible scenario comparison.
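The "only one active per (project, methodology)" invariant can be enforced with a tiny toggle helper at the app layer (a Postgres partial unique index on (project_id, methodology_protocol_id) WHERE is_active would back the same invariant at the DB layer). A sketch, with dicts standing in for ORM rows:

```python
def set_active_baseline(baselines: list[dict], baseline_id: int) -> None:
    """App-layer enforcement of the invariant: activating one baseline
    deactivates its siblings for the same (project, methodology)."""
    for b in baselines:
        b["is_active"] = (b["id"] == baseline_id)

scenarios = [
    {"id": 1, "label": "C6 initial", "tco2e_yr": 9200, "is_active": True},
    {"id": 2, "label": "Verra jurisdictional", "tco2e_yr": 7800, "is_active": False},
    {"id": 3, "label": "PRODES national", "tco2e_yr": 11400, "is_active": False},
]
set_active_baseline(scenarios, 2)  # user toggles; vintage_ledger would re-compute
```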
## 10. What changes in the satellite-runner code

Today each runner (`mapbiomas/runner.py`, `biomass/runner.py`, `overlay_runner.py`, `mapbiomas/fire_runner.py`) reads zonal stats, builds a `ZonalStatsResult`, and the writer turns it into `FarmMetric` rows. `source_tag` is a string.

Tomorrow:
- Each runner first writes one `RasterAggregate` row per (geometry × raster_source × vintage × stratum_filter × metric_code).
- Then it writes/updates the `FarmMetric` row with `raster_aggregate_id` pointing to it. `source_tag` stays as a display string; the FK is the auditable link.
The ZonalStatsResult TypedDict in app/services/satellite/base.py needs an extension to carry per-metric uncertainty + stratum_filter. Or, cleaner: rename it RasterAggregatePayload and have the writer split the payload into a RasterAggregate + FarmMetric pair.
Roll out one runner at a time. Start with biomass (highest-value, simplest provenance story): it already has `cf=0.47:ipcc_default`-style tagging, so it is the easiest to upgrade.
## 11. Migration path

Six phases (0–5), each shippable independently.
| Phase | Scope | Why |
|---|---|---|
| 0 | raster_aggregates table + write-side only (runners populate it; nothing reads from it yet) | De-risk the new table by getting data flowing without depending on it |
| 1 | farm_metrics.raster_aggregate_id column added; runners populate FK on new writes | Backfill provenance going forward without touching historical rows |
| 2 | Engine reads uncertainty from raster_aggregates; new vintage_ledger row for uncertainty deduction | Methodology accuracy lift; observable in the UI |
| 3 | project_baselines table + scenario UI; engine takes baseline arg | Baseline switching — biggest UX win |
| 4 | Stratification: runners emit per-stratum aggregates for biomass + activity data; engine weights by stratum area | Per-stratum EFs, defensible math |
| 5 | Retire vendor parsers + parser_version / source_document_id columns; deprecate carbon_measurements table fully | Cleanup once nothing reads vendor paths |
Phase 0 is one migration + one writer change and could be done in a week. Each subsequent phase takes 1–2 weeks. Total: ~6–8 weeks for the full v2 with proper testing.
## 12. What this gets you (concretely)
For an audit conversation:
Auditor: “Where did this 78.4 tC/ha for farm X come from?”
C6: “Click here. RasterAggregate #12947. Source = HGB 2010 v2 (raster_data_source key `hgb-global`, asset href s3://…, hash sha256:abc…). Geometry = farm 553 polygon hash xyz. Method = pixel mean over 1,247 pixels, 0 nodata. Carbon fraction = 0.46 (Cerrado biome, lookup_biome row 4). Uncertainty = 18% CI95 from the HGB SE layer. Code = git sha def123, runner `biomass.hgb`. Computed 2026-04-12. Re-running with the same inputs reproduces the value to 4 decimals.”
For a methodology migration conversation (VM0007 → VM0048):
User: “What’s our credit forecast under VM0048 vs VM0007?” C6 UI: shows two ProjectBaseline rows side by side. Toggle to make either active. vintage_ledger re-computes against the active baseline. Both stay in the audit trail.
For a logging project showing up:
Marc: “We just signed an IFM project, need dead wood + HWP.”
C6 dev: writes a new runner using GEDI L4A + national logging records. Adds `dw_density` to `lookup_metric_type`, the units registry, and the VM0007 IFM module's `REQUIRED_INPUTS`. No schema migration. Done in 2 days.
## 13. Open questions to resolve before coding
- Stratum class taxonomy. Need a fixed C6-internal taxonomy: a `{forest_type, disturbance, age_class}` enum. MapBiomas has its own classes; we need a mapping table from MapBiomas class IDs → C6 classes. Same for PRODES, Hansen GFC, RAISG. Who owns the taxonomy decision?
- Geometry versioning. Today farm geometry can be edited. An edit invalidates every downstream RasterAggregate. Should an edit auto-schedule re-runs, or surface a “this farm has stale aggregates” badge and let the user trigger them? (Recommend: badge + manual trigger; auto-rerun is expensive and silent.)
- Storage cost. RasterAggregate could grow to ~50 rows per farm per year × N farms × N vintages × N stratum filters × N raster sources. For 200 farms × 10 vintages × 5 strata × 8 sources = 80k rows. Manageable. But per-month aggregates would blow that up. Stay annual.
- Histogram column. LULC class-area aggregation today is 8 separate FarmMetric rows. Could collapse to a single RasterAggregate with a `histogram` JSONB. Cleaner, but breaks the existing pattern. Decide: keep 1-row-per-class for backward compat, or histogram-first?
- Plot-level data path. If we ever do collect field plots (challenge campaigns, validation), do they go in RasterAggregate with `runner_key='field_plot'`? Or a separate `field_observation` table? Leaning toward: a separate table, because the cardinality is per-tree, not per-aggregate, but rolled up into a RasterAggregate-shaped row for engine consumption.
- Retirement timeline for vendor code. When exactly does Canopy/C2050 ingestion stop? If active projects still rely on those numbers, retiring the parsers strands them. Probably need to: freeze writes to vendor-derived rows, run satellite pipelines in parallel, compare numbers, then cut over per-farm.
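On the histogram question, note that the collapse is cheap to undo: a histogram row can be expanded back into per-class areas on read, so no information is lost either way. A sketch, assuming a 30 m grid and made-up class counts:

```python
PIXEL_AREA_HA = 0.09  # 30 m × 30 m pixel (assumption; real value depends on the grid)

def histogram_to_areas(histogram: dict[int, int]) -> dict[int, float]:
    """Expand a class_histogram RasterAggregate into the per-class area
    values that today live as separate FarmMetric rows."""
    return {cls: count * PIXEL_AREA_HA for cls, count in histogram.items()}

areas = histogram_to_areas({3: 4000, 21: 1000})  # class_id → pixel_count
```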
## 14. Recommended first move
Don’t start with the migration. Start with one runner end-to-end as a vertical slice:
- Take the HGB biomass runner (smallest, cleanest).
- Spike the `raster_aggregates` table with the schema in §4.
- Have the runner write a RasterAggregate row + a FarmMetric row that FKs to it.
- Wire the FarmMetric detail UI to show “Source: HGB 2010 v2 → click to see RasterAggregate”.
- Confirm: numbers reproducible, audit trail visible, no engine changes.
Once that vertical works on one runner, the rest is mechanical translation. Doing the full schema migration up-front and then refitting all runners is harder than slicing one runner top-to-bottom and growing.
This is also where you discover the schema mistakes cheaply. If stratum_filter JSONB doesn’t pan out, you find out on one runner, not eight.