
C6 Carbon Model v2 — Satellite-Native Redesign Plan

Date: 2026-05-07
Pairs with: c6-carbon-model.md (the gap analysis), methodology-inputs-comparison.md (the methodology survey)
Status: Plan for discussion. No code yet.


1. Decisions feeding this design

  1. No more vendor PDFs. The Canopy, ClearBlue, and C2050 ingestion paths get retired. All values come from per-pixel satellite rasters that C6 aggregates itself via zonal statistics.
  2. Audit-first. Every farm metric must trace back to: which raster source, which version, which farm geometry, which code path, on what date. Auditor must be able to reproduce the number from inputs.
  3. Coverage > completeness. Cover the metrics actively-used methodologies need. Don’t build for hypothetical future projects. But the schema must let new metrics drop in without redesign — if a logging project shows up, dead wood + HWP + per-tree allometric data must be addable in a migration, not a rewrite.
  4. Carry forward design constraints from the gap analysis:
    • baseline_source discriminator (project can switch between C6-internal / Verra-jurisdictional / national PRODES baselines and compare)
    • Per-pool uncertainty stored and used by the engine
    • Stratification (per-stratum EFs, not farm-wide averages)
    • Pool list is methodology-config, not engine-hardcoded

2. Architectural shift

Today

Vendor PDF  →  parser  →  FarmMetric.value_canonical
                          source_tag = "Canopy v4"

The vendor is the source of truth. Provenance is a string label.

Tomorrow

Raster sources  →  zonal-stats runner  →  RasterAggregate row  →  FarmMetric row
(MapBiomas,         (deterministic from    (per farm × source     (canonical farm-level
 HGB, GEDI,          inputs + code)         × vintage; the        value, FK to the
 SoilGrids,                                 "scientific receipt") aggregate(s) it derived from)
 SRTM, …)

Provenance is a chain of FKs, not a string. Auditor can click through FarmMetric → RasterAggregate → RasterDataSource → external dataset URL + version + hash.

The big win: reproducibility. Given the input rasters and the code, the same aggregate produces the same number. Vendor PDFs were opaque — your number was whatever the vendor said. Now your number is whatever your code computes from public rasters, and anyone can re-run it.


3. Four-layer data model

| Layer | Table(s) | What it stores | Cardinality |
|---|---|---|---|
| L0 — Raster source registry | raster_data_sources (exists) | One row per external dataset family + version (MapBiomas Col 10, HGB 2010, SoilGrids v2, …). Key, version, storage tier, URI pattern. | low (~30 rows) |
| L1 — Raster aggregate (NEW) | raster_aggregates | One row per (geometry, raster_source, vintage, stratum_filter, aggregate_method). The output of running zonal-stats. The audit anchor. | medium (~10–50 per farm-year) |
| L2 — Farm metric (modify) | farm_metrics (exists) | Canonical-unit farm-level values that the engine reads. Now linked to L1 via FK. | medium (~30 per farm-year) |
| L3 — Methodology output | vintage_ledger, project_computation (exist) | Engine output: gross/net/credits/revenue per vintage, per project. | low (~5 per project-year) |

L0 + L3 are mostly there. L1 is new and is the main lift. L2 changes lightly.


4. The new raster_aggregates table

This is the heart of the redesign. One table to anchor everything.

class RasterAggregate(IdMixin, TimestampedMixin, Base):
    """Per-(geometry × raster_source × vintage × stratum × method) aggregate value.

    The reproducible scientific receipt. Every FarmMetric points to one or
    more RasterAggregate rows; each RasterAggregate points to a
    RasterDataSource and carries enough metadata to re-run.
    """
    __tablename__ = "raster_aggregates"

    # What dataset
    raster_source_id: Mapped[int]                                   # FK → raster_data_sources
    raster_version: Mapped[str]                                     # e.g. "col-10", "2010", "v5"
    raster_asset_hrefs: Mapped[list[str]]                           # JSONB: URIs read
    raster_asset_hash: Mapped[str | None]                           # SHA of the asset(s) at read time

    # What geometry
    geometry_kind: Mapped[str]                                      # 'farm' | 'project' | 'leakage_belt' | 'stratum'
    geometry_id: Mapped[int]                                        # FK target depends on kind
    geometry_version: Mapped[str | None]                            # geometry hash to detect re-runs after edits

    # What slice of the geometry (NULL = whole geometry)
    stratum_filter: Mapped[dict | None]                             # JSONB: {"forest_type": "primary_amazon", "disturbance": "intact"}

    # What time
    vintage_year: Mapped[int]
    period_start: Mapped[date | None]                               # for windowed metrics (cumulative defor, fire history)
    period_end: Mapped[date | None]

    # What number
    aggregate_method: Mapped[str]                                   # 'mean' | 'sum' | 'class_histogram' | 'p95' | 'std'
    metric_code: Mapped[str]                                        # FK-by-string → lookup_metric_type.code (e.g. "agb_density_mg_ha")
    value: Mapped[Decimal | None]                                   # primary scalar
    value_unit: Mapped[str]                                         # raw unit before canonical conversion
    histogram: Mapped[dict | None]                                  # JSONB: {class_id: pixel_count} for class_histogram
    pixel_count: Mapped[int]
    pixel_nodata_count: Mapped[int]

    # Uncertainty (from raster's own uncertainty layer if present)
    uncertainty_pct_ci95: Mapped[Decimal | None]                    # 95% CI half-width as %
    uncertainty_method: Mapped[str | None]                          # 'raster_layer' | 'pixel_std' | 'none'

    # Reproducibility
    code_version: Mapped[str]                                       # git sha of zonal-stats runner
    runner_key: Mapped[str]                                         # 'biomass.hgb' | 'mapbiomas.lulc' | …
    computed_at: Mapped[datetime]
    inputs_fingerprint: Mapped[str]                                 # SHA(geometry_hash + asset_hash + code_sha + params)

    # Optional context (carbon fraction used, biome, depth band, etc.)
    context: Mapped[dict | None]                                    # JSONB: {"carbon_fraction": 0.46, "biome": "cerrado", "depth_cm": 30}

Why this shape

  • One row per scientific computation, not per metric. A single zonal-stats run can produce a histogram (LULC class areas) or a scalar (mean AGB) — both fit. Multiple metric_codes for one geometry+source can share inputs_fingerprint so the UI groups them.
  • stratum_filter JSONB lets you pre-compute per-stratum aggregates without a polygon-per-stratum table. You run zonal-stats with a mask {forest_type=primary, disturbance=intact} and store one row per stratum slice.
  • inputs_fingerprint is what makes audit easy: same fingerprint = same number, no re-run needed. Different fingerprint = re-run was triggered by an input change (geometry edit, raster version bump, code change). UI surfaces the diff.
  • uncertainty_pct_ci95 lives here, not on farm_metrics. Engine reads uncertainty from the underlying L1 row. If a raster ships uncertainty (HGB does, ESA CCI Biomass does), we capture it. If not, we set uncertainty_method = 'none' and the methodology engine knows to apply a conservative default deduction.
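The fingerprint scheme above can be sketched in a few lines. `inputs_fingerprint` is the column name from the schema; the helper itself and its argument names are illustrative, not existing code:

```python
import hashlib
import json


def inputs_fingerprint(geometry_hash: str, asset_hash: str,
                       code_sha: str, params: dict) -> str:
    """Deterministic receipt ID: identical inputs always hash identically.

    Params are serialized with sorted keys so that Python dict ordering
    can never change the fingerprint.
    """
    payload = "|".join([
        geometry_hash,
        asset_hash,
        code_sha,
        json.dumps(params, sort_keys=True, separators=(",", ":")),
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two rows sharing a fingerprint are guaranteed to come from the same geometry, asset, code, and parameters; any input change flips the hash and signals a re-run.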

5. farm_metrics changes (small)

Keep most of the existing schema. Three changes:

  1. Add raster_aggregate_id: int | None FK. Nullable because not every farm_metric is satellite-derived (e.g. project-level config metrics still hand-set).
  2. Drop vendor-specific columns once vendor ingestion is gone: parser_version, source_document_id, carbon_stock_source_id. Move to a deprecation phase first so nothing breaks.
  3. source_tag stays as a human-readable label, but the FK is now the source of truth. UI shows the tag; auditor follows the FK.
class FarmMetric(...):
    # existing columns kept
    raster_aggregate_id: Mapped[int | None] = mapped_column(
        ForeignKey("raster_aggregates.id", ondelete="SET NULL"),
        nullable=True, index=True,
    )
    # parser_version, source_document_id, carbon_stock_source_id → drop in phase 3

For metrics derived from multiple raster aggregates (e.g. forest_cover_pct = forest_cover_ha / total_area_ha — two aggregates), use a join table:

class FarmMetricSource(Base):
    farm_metric_id: Mapped[int]
    raster_aggregate_id: Mapped[int]
    role: Mapped[str]   # 'numerator', 'denominator', 'primary'

Most metrics have one source — the simple FK on farm_metrics covers them. Composite metrics use the join table.
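A hypothetical writer-side helper for the forest_cover_pct example above; the function name and the two-decimal rounding are assumptions, not existing code:

```python
from decimal import Decimal


def derive_forest_cover_pct(forest_cover_ha: Decimal,
                            total_area_ha: Decimal) -> Decimal:
    """Combine the numerator and denominator aggregates into one metric.

    Each input comes from its own RasterAggregate row; the writer records
    both via FarmMetricSource with roles 'numerator' / 'denominator'.
    """
    if total_area_ha <= 0:
        raise ValueError("total_area_ha must be positive")
    return (forest_cover_ha / total_area_ha * 100).quantize(Decimal("0.01"))
```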


6. Stratification — recommend doing it without a stratum table

Two designs were on the table. After thinking about it: don’t store strata as polygons. Store per-stratum aggregates as L1 rows.

Why

Strata in LATAM are entirely pixel-derived. MapBiomas Col 10 already classifies every pixel by forest type + secondary-vs-primary + age class. PRODES adds disturbance vintage. There’s no human input to a stratum boundary — it’s a deterministic function of the raster stack.

If you materialize strata as polygons, you’ve duplicated information that already exists in the rasters and now have to keep them in sync when MapBiomas releases a new collection. That’s a maintenance trap.

Instead

Run zonal-stats with a class mask. For a farm:

  • Aggregate 1: AGB mean, geometry=farm, stratum_filter=NULL → farm-wide AGB
  • Aggregate 2: AGB mean, geometry=farm, stratum_filter={"forest_type": "primary_amazon"} → primary-forest-only AGB
  • Aggregate 3: AGB mean, geometry=farm, stratum_filter={"forest_type": "secondary"} → secondary-only AGB
  • Aggregate 4: pixel_count by stratum → area per stratum (the stratum-level activity weights)

Engine consumes per-stratum aggregates and weights by per-stratum area when computing emissions. The “stratum” is virtual — it’s just a mask.
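The weighting step reduces to a pure function; the engine-side name and tuple shape here are assumptions for illustration:

```python
def area_weighted_mean(strata: list[tuple[float, float]]) -> float:
    """strata: (value_per_ha, area_ha) pairs, one per stratum slice.

    The area weights come from the per-stratum pixel-count aggregates
    (Aggregate 4 above); the values come from Aggregates 2 and 3.
    """
    total_area = sum(area for _, area in strata)
    if total_area == 0:
        raise ValueError("no area in any stratum")
    return sum(value * area for value, area in strata) / total_area
```

For example, 30 ha of primary forest at 100 tC/ha plus 70 ha of secondary at 40 tC/ha gives 58 tC/ha, not the unweighted 70.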

When you’d materialize strata anyway

If a methodology requires a defended stratum map (signed off by the auditor), serialize it: take the latest pixel-class output and dump to a polygon table at verification time, snapshotted with the verification record. That’s an artifact, not the primary store. farm_stratum_snapshot if/when needed; not v2-day-one.


7. Extensibility — adding a new metric or pool

Adding a metric (e.g. dead-wood density for a future logging project):

  1. Add row to lookup_metric_type (dw_density, tC/ha)
  2. Add converter to app/units/registry.py (dw_density: Measurement(canonical="tc_per_ha", accepts=…, convert=biomass_convert))
  3. Find a raster source that produces it. If none exists, register it in raster_data_sources.
  4. Write a runner: app/services/satellite/{provider}/runner.py that fetches the raster, zonal-stats over farm geometry, writes a RasterAggregate row, then writes/updates a FarmMetric row pointing at it.
  5. Add dw_density to the relevant methodology’s REQUIRED_INPUTS in app/methodologies/{vmXXXX}.py.

Five steps. None require schema changes (assuming we’re inside the registry-and-runner pattern). That’s the goal.
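The registry-and-runner pattern these steps lean on could look like the following sketch; only the runner_key format comes from this doc, everything else is illustrative:

```python
from typing import Protocol


class AggregateRunner(Protocol):
    """Minimal runner contract: fetch raster, zonal-stats, write rows."""

    runner_key: str  # e.g. 'biomass.hgb', 'mapbiomas.lulc'

    def run(self, geometry_id: int, vintage_year: int) -> None: ...


RUNNERS: dict[str, AggregateRunner] = {}


def register(runner: AggregateRunner) -> AggregateRunner:
    """Register a runner under its key; new metrics plug in as new
    registry entries, never as schema changes."""
    RUNNERS[runner.runner_key] = runner
    return runner
```

A new dead-wood runner would just be another `register(...)` call next to its module.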

If you ever do need a non-satellite input (e.g. survey-collected livelihood indicator for Plan Vivo), add a runner kind 'survey' that writes RasterAggregate rows with raster_source_id = NULL and runner_key = 'survey.<form>'. The audit semantics still work — the receipt points to the survey instrument and date.


8. Per-pool uncertainty — wired through the new model

Today: value_min_canonical / value_max_canonical exist on farm_metrics but no engine path uses them.

Tomorrow:

  • RasterAggregate.uncertainty_pct_ci95 is the source of truth. Captured at zonal-stats time from the raster’s own uncertainty layer (HGB ships AGB SE per pixel; ESA CCI Biomass ships per-pixel uncertainty; SoilGrids ships SOC SD).
  • FarmMetric reads it via the FK. No need to copy it.
  • New methodology engine helper combined_uncertainty_pct(metrics) propagates per-pool uncertainties into a total per VMD0017 (Verra) or TREES guidance.
  • New vintage_ledger row inserted between buffer and issuable: “Uncertainty deduction (X% combined CI → Y% deduction per VM0048 Table 14)”.
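A sketch of what combined_uncertainty_pct could compute, using standard IPCC-style error propagation for a sum of pools; whether this matches the exact VMD0017 / TREES formula must be verified against the module text before implementation:

```python
import math


def combined_uncertainty_pct(pools: list[tuple[float, float]]) -> float:
    """pools: (estimate_tco2e, uncertainty_pct_ci95) per carbon pool.

    Root-sum-of-squares of the absolute CI half-widths, expressed as a
    percentage of the total estimate.
    """
    total = sum(estimate for estimate, _ in pools)
    if total == 0:
        raise ValueError("total estimate is zero")
    half_widths_sq = sum(
        (estimate * pct / 100.0) ** 2 for estimate, pct in pools
    )
    return math.sqrt(half_widths_sq) / total * 100.0
```

Note the useful property: two independent pools at 10% each combine to ~7.1%, not 10%, so stratifying and measuring pools separately directly lowers the deduction.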

This is engine work, not schema work. Schema gives the engine the data to compute against.


9. Baseline switching — the project_baseline table

From earlier conversation. Make baselines first-class.

class ProjectBaseline(IdMixin, TimestampedMixin, Base):
    __tablename__ = "project_baselines"

    project_id: Mapped[int]
    methodology_protocol_id: Mapped[int]                   # which methodology this baseline is intended for
    source: Mapped[str]                                    # enum: 'historical_projected' | 'jurisdictionally_allocated' | 'dynamic_control_matched' | 'community_PRA' | 'national_NFMS'
    label: Mapped[str]                                     # "PRODES 10-yr trend (2014-2023)" / "Verra MT risk map v1.2 (2024)" / "C6 internal initial estimate"

    # Core baseline numbers (sparse — only some are non-null per source kind)
    rate_pct_per_yr: Mapped[Decimal | None]                # historical_projected
    jad_tco2e_yr: Mapped[Decimal | None]                   # jurisdictionally_allocated
    allocated_baseline_tco2e_yr: Mapped[Decimal | None]    # jurisdictionally_allocated post-allocation
    control_match_id: Mapped[int | None]                   # FK to dynamic match record (V47)

    # Provenance (raster-aggregate-anchored where applicable)
    raster_aggregate_ids: Mapped[list[int]]                # JSONB array
    external_data_ref: Mapped[dict | None]                 # JSONB: {"verra_jad_file_url": "...", "version": "...", "hash": "..."}

    is_active: Mapped[bool]                                # only one active per (project, methodology) at a time
    notes: Mapped[str | None]
    computed_at: Mapped[datetime]

Engine takes (project, baseline) instead of project. UI shows side-by-side: “C6 initial 9,200 tCO₂e/yr | Verra jurisdictional 7,800 | PRODES national 11,400”. User toggles active baseline; vintage_ledger re-computes.
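A minimal sketch of baseline resolution under the one-active-per-(project, methodology) invariant; the function is hypothetical, and in the real schema the invariant should also be enforced in the database (e.g. a partial unique index on is_active):

```python
def active_baseline(baselines: list[dict], project_id: int,
                    methodology_protocol_id: int) -> dict:
    """Return the single active ProjectBaseline row for the pair.

    Raises if zero or multiple rows are active, so a broken invariant
    surfaces loudly instead of silently picking one.
    """
    matches = [
        b for b in baselines
        if b["project_id"] == project_id
        and b["methodology_protocol_id"] == methodology_protocol_id
        and b["is_active"]
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one active baseline, found {len(matches)}"
        )
    return matches[0]
```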

This is the single biggest UX upgrade in v2. It turns “what’s our credit forecast” from a one-shot number into a defensible scenario comparison.


10. What changes in the satellite-runner code

Today each runner (mapbiomas/runner.py, biomass/runner.py, overlay_runner.py, mapbiomas/fire_runner.py) reads zonal stats, builds a ZonalStatsResult, and the writer turns it into FarmMetric rows. source_tag is a string.

Tomorrow:

  1. Each runner first writes one RasterAggregate row per (geometry × raster_source × vintage × stratum_filter × metric_code).
  2. Then writes/updates the FarmMetric row with raster_aggregate_id pointing to it.
  3. source_tag stays as a display string; FK is the auditable link.

The ZonalStatsResult TypedDict in app/services/satellite/base.py needs an extension to carry per-metric uncertainty + stratum_filter. Or, cleaner: rename it RasterAggregatePayload and have the writer split the payload into a RasterAggregate + FarmMetric pair.

Roll out one runner at a time. Start with biomass (highest-value, simplest provenance story) — already has cf=0.47:ipcc_default style tagging, easy to upgrade.


11. Migration path

Six phases (0–5), each shippable independently.

| Phase | Scope | Why |
|---|---|---|
| 0 | raster_aggregates table + write-side only (runners populate it; nothing reads from it yet) | De-risk the new table by getting data flowing without depending on it |
| 1 | farm_metrics.raster_aggregate_id column added; runners populate FK on new writes | Backfill provenance going forward without touching historical rows |
| 2 | Engine reads uncertainty from raster_aggregates; new vintage_ledger row for uncertainty deduction | Methodology accuracy lift; observable in the UI |
| 3 | project_baselines table + scenario UI; engine takes baseline arg | Baseline switching — biggest UX win |
| 4 | Stratification: runners emit per-stratum aggregates for biomass + activity data; engine weights by stratum area | Per-stratum EFs, defensible math |
| 5 | Retire vendor parsers + parser_version / source_document_id columns; deprecate carbon_measurements table fully | Cleanup once nothing reads vendor paths |

Phase 0 is one migration + one writer change. Could be done in a week. Each subsequent phase 1–2 weeks. Total: ~6–8 weeks for the full v2 with proper testing.


12. What this gets you (concretely)

For an audit conversation:

Auditor: “Where did this 78.4 tC/ha for farm X come from?” C6: “Click here. RasterAggregate #12947. Source = HGB 2010 v2 (raster_data_source key hgb-global, asset href s3://…, hash sha256:abc…). Geometry = farm 553 polygon hash xyz. Method = pixel mean over 1,247 pixels, 0 nodata. Carbon fraction = 0.46 (Cerrado biome, lookup_biome row 4). Uncertainty = 18% CI95 from HGB SE layer. Code = git sha def123, runner biomass.hgb. Computed 2026-04-12. Re-running with the same inputs reproduces the value to 4 decimals.”

For a methodology migration conversation (VM0007 → VM0048):

User: “What’s our credit forecast under VM0048 vs VM0007?” C6 UI: shows two ProjectBaseline rows side by side. Toggle to make either active. vintage_ledger re-computes against the active baseline. Both stay in the audit trail.

For a logging project showing up:

Marc: “We just signed an IFM project, need dead wood + HWP.” C6 dev: writes new runner using GEDI L4A + national logging records. Adds dw_density to lookup_metric_type, units registry, VM0007 IFM module’s REQUIRED_INPUTS. No schema migration. Done in 2 days.


13. Open questions to resolve before coding

  1. Stratum class taxonomy. Need a fixed C6-internal taxonomy: {forest_type, disturbance, age_class} enum. MapBiomas has its own classes; we need a mapping table from MapBiomas class IDs → C6 classes. Same for PRODES, Hansen GFC, RAISG. Who owns the taxonomy decision?
  2. Geometry versioning. Today farm geometry can be edited. Edit invalidates every downstream RasterAggregate. Should an edit auto-reschedule re-runs, or surface a “this farm has stale aggregates” badge and let user trigger? (Recommend: badge + manual trigger; auto-rerun is expensive and silent.)
  3. Storage cost. RasterAggregate could grow to ~50 rows per farm per year × N farms × N vintages × N stratum filters × N raster sources. For 200 farms × 10 vintages × 5 strata × 8 sources = 80k rows. Manageable. But if we go per-month aggregates, blows up. Stay annual.
  4. Histogram column. LULC class-area aggregation today is 8 separate FarmMetric rows. Could collapse to a single RasterAggregate with a histogram JSONB. Cleaner but breaks the existing pattern. Decide: keep 1-row-per-class for backward compat, or histogram-first?
  5. Plot-level data path. If we ever do collect field plots (challenge campaigns, validation), do they go in RasterAggregate with runner_key='field_plot'? Or a separate field_observation table? Leaning toward: separate table because the cardinality is per-tree, not per-aggregate, but they roll up into a RasterAggregate-shaped row for engine consumption.
  6. Retirement timeline for vendor code. When exactly does Canopy/C2050 ingestion stop? If active projects still rely on those numbers, retiring the parsers strands them. Probably need: freeze writes to vendor-derived rows, run satellite pipelines in parallel, compare numbers, then cut over per-farm.

14. Recommended first step: one vertical slice

Don’t start with the migration. Start with one runner end-to-end as a vertical slice:

  • Take the HGB biomass runner (smallest, cleanest)
  • Spike raster_aggregates table with the schema in §4
  • Have the runner write a RasterAggregate row + a FarmMetric row that FKs to it
  • Wire the FarmMetric detail UI to show “Source: HGB 2010 v2 → click to see RasterAggregate”
  • Confirm: numbers reproducible, audit trail visible, no engine changes

Once that vertical works on one runner, the rest is mechanical translation. Doing the full schema migration up-front and then refitting all runners is harder than slicing one runner top-to-bottom and growing.

This is also where you discover the schema mistakes cheaply. If stratum_filter JSONB doesn’t pan out, you find out on one runner, not eight.