Re-cut the system

Moving live IaC to the target shape.

Repair starts before the first PR. This migration moved a Terraform-managed platform into Pulumi, but the part that mattered happened before most Pulumi components existed: the old boundary had to be named, and the new one had to stop being negotiable.

The previous parts in this series set the target: ownership, contracts, lifecycle-based membership, resolver gates. This part is the order for reaching it in a live system: name the inherited cut, freeze the target, publish the new contract beside the old path, move consumers, add the gate, then delete. Change that order and repair becomes outage: a deleted output still in use breaks the next apply, and a gate built before contracts exist has nothing to check.

Name and freeze the cut

The starting layout followed the provider tree. That looked tidy from the outside: one place for GCP, one for DNS, one for GitHub, one for monitoring, one for CI/CD. It said very little about lifecycle.

The old Terraform repo had about twenty root modules and roughly 22k lines of HCL. The app module held nearly two thousand lines: network, service accounts, IAM bindings, DNS records, load balancer configuration, secrets, and workload exceptions, each accumulated there because a related resource lived nearby.

You have likely seen this before. The inherited provider-tree cut:

  cloud/              Google Cloud
  ├─ org/             folders, audit logging
  ├─ app/             ~1,900 lines: network, service accounts,
  │                   IAM bindings, DNS records, load balancer,
  │                   secrets, workload exceptions
  ├─ monitoring/      observability hub, alerting credentials
  └─ cicd/            CI/CD hub, deploy identities, registry access
  cloudflare/         zones, rulesets, email auth
  github/             orgs, repos, teams
  dnsimple/           registrar, zone records

  ~20 root modules · ~22,000 lines of HCL total

Provider tree hides lifecycle

A workload release pulled an environment review along with it because the workload's service account and DNS records lived in the environment module. One bad route or record could block shared-network changes for every service. Removing that workload meant an environment migration too: service account, IAM bindings, records, and any release exceptions had to be untangled from the shared module.

Ownership and tiers had collapsed into a folder layout, and size only made the boundary visible: a tier was carrying workload behaviour that changed with workload releases.

The failure crossed four boundaries at once:

ownership: workload-owned members lived in an environment module
dependency direction: lower-tier changes had to account for workload details
release gating: a workload release could pull the shared tier into review
deletion order: removing a service meant cutting resources out of a producer that other services still depended on

The target was declared early and then used as a freeze line. New work went to the target shape immediately, even while old resources still sat in the old place.

target shape - tiers, apply order, and the workload release path:

  shared-infra pipeline, applied in wave order:
    (1) organisation   service catalogue · DNS roots · identity & KMS roots
    (2) environment    workload projects · network · deploy access · OIDC · rules
    (3) edge           load balancers · routing · certificates · routing records

  workload release path, deploys separately:
        workload       services · app identity · workload-specific bindings ·
                       delegated records

  contracts (consumer reads the producer):
    environment ──▶ workload    workload reads the env contract
    workload ──▶ edge           service outputs, read after they exist

Tiers, apply order, and the workload release path

Edge stayed separate because routing, certificates, and shared load balancers sat between organisation roots and workload-owned services: a routing-only change should not preview the whole environment, and one backend change should not make a shared edge stack belong to that application.

It also qualified as a tier in its own right. A per-environment load balancer consumed backends with routing configuration from several workloads, plus DNS zone output from organisation. It could not live inside any single workload, and placing it in organisation would give per-environment routing a low-cadence root owner.

The shared-infra deploy pipeline matched the cut. The TypeScript planner expanded the requested manifest stacks into wave matrices: organisation, each environment, then edge. Every stack previewed first. A failed preview blocked the default deploy path. Staging and production required named approval. At this stage the freeze was still mostly a review rule. The code gate comes later because enforcement needs a contract to check.
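The planner itself is not shown in this article. As a minimal sketch of the expansion, assume a manifest that tags each stack with its tier; the `StackEntry` shape, the stack names, and `planWaves` are hypothetical stand-ins, not the real planner's API.

```typescript
type Tier = 'organisation' | 'environment' | 'edge'

// Hypothetical manifest entry: one stack and the tier it belongs to.
interface StackEntry {
  name: string
  tier: Tier
}

// Apply order from the target shape: lower tiers first.
const WAVE_ORDER: Tier[] = ['organisation', 'environment', 'edge']

// Group the requested stacks into ordered waves. Empty waves are dropped,
// and a stack that is not in the manifest is refused by name.
function planWaves(manifest: StackEntry[], requested: string[]): string[][] {
  const known = new Map<string, StackEntry>()
  for (const stack of manifest) known.set(stack.name, stack)

  for (const name of requested) {
    if (!known.has(name)) throw new Error(`'${name}' is not in the manifest`)
  }

  return WAVE_ORDER
    .map((tier) => requested.filter((name) => known.get(name)!.tier === tier))
    .filter((wave) => wave.length > 0)
}

const manifest: StackEntry[] = [
  { name: 'org', tier: 'organisation' },
  { name: 'env-staging', tier: 'environment' },
  { name: 'env-production', tier: 'environment' },
  { name: 'edge-staging', tier: 'edge' },
]

// Request order does not matter; tier order decides the waves.
const waves = planWaves(manifest, ['edge-staging', 'env-staging', 'org'])
// waves: [['org'], ['env-staging'], ['edge-staging']]
```

Each inner array can run as one CI matrix; the next wave starts only after every preview in the current one has passed.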

Publish the contract beside the old path

The old workload boundary had no contract to replace. Projects, service accounts, DNS records, load balancer rules, runtime settings, and CI/CD identities were all inside the same environment-shaped blob. A few projects existed, but they were embedded resources, not assigned references.

The first explicit contract had to appear while the embedded old shape still existed. New consumers used it. Existing consumers moved next. The old resources stayed until there were no live consumers left on that path.

The new system built four main producer/consumer surfaces:

serviceCatalog: which services exist and what shared capabilities they have
serviceProjects: which project, KMS key, and deploy identity a service is assigned
services / staticBackends: which workload backends edge routes to
domainTopology: which zone owns a name in a given environment

The assigned-project surface was the easiest to see. The environment tier published serviceProjects through buildServiceProjectsOutput(): a map from service name to its assigned project and attached resources. Instead of inheriting local module wiring or copied literals, a workload called getServiceProject(ref, name) and asked for its assigned project by name.

assigned project, read two ways:

  embedded (old):
    app module creates the project
    wires deploy identity, KMS, DNS, runtime
    workload values are locals or copied literals
    - no reference to check

  contracted (new):
    environment publishes serviceProjects
    workload calls getServiceProject(ref, name)
    - one reference, checked at preview

Assigned project, read two ways

The infrastructure I inherited did not fail through a bad reference, because there was no reference to check. The values were local to the blob or copied into the consumer, so the dependency appeared later as review scope, deletion risk, and hard-to-audit literals.

The contracted path makes that reference explicit through a helper that resolves the entry or refuses with a named error:

references/service-project.ts

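import * as pulumi from '@pulumi/pulumi'

// ServiceProjectRef is the per-service entry published by the
// environment tier; its definition is not shown here.
type ServiceMap = Record<string, ServiceProjectRef>

// Resolve one service's assigned project, or refuse with a named error.
function getServiceProject(
  ref: pulumi.StackReference,
  name: string,
): pulumi.Output<ServiceProjectRef> {
  return ref
    .getOutput('serviceProjects')
    .apply((projects: ServiceMap | undefined) => {
      // The aggregate output itself may be absent on the producer.
      if (!projects) {
        throw new Error("'serviceProjects' missing")
      }

      const project = projects[name]
      if (project) return project

      // Name the miss and list what the producer does publish.
      const keys = Object.keys(projects).join(', ')
      throw new Error(
        `'${name}' not found. Available: ${keys}`,
      )
    })
}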

The helper is a named capability. Type and release-context checks belong to the resolver gate in Fail before apply.

Both paths existed during the move. Where transitional raw outputs existed, the producer kept them while consumers moved to the helper, then removed the raw path only after the raw-read inventory reached zero. Without that overlap, a contract migration becomes another breaking output rename.

For assigned projects, context came from the environment reference chosen by the release path: a staging workload received the staging producer, and a production workload received the production producer. Domain topology had to make context explicit: getDomainTopology(orgRef, parent, app, env) required the caller to pass the environment instead of guessing from stack naming. The publisher also refused duplicate zone ownership in the same environment, stopping and naming the duplicate rather than choosing one.
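The duplicate refusal and the explicit environment argument can be sketched without Pulumi. Everything here is an assumption made for illustration: the `ZoneEntry` shape, `buildDomainTopology`, and `zoneFor` are hypothetical and simpler than the real `getDomainTopology(orgRef, parent, app, env)`.

```typescript
// Hypothetical publisher input: one zone claiming one domain in one env.
interface ZoneEntry {
  domain: string
  env: string
  zoneId: string
}

// env -> domain -> owning zone
type Topology = Map<string, Map<string, string>>

// Build the published topology. Two entries that claim the same domain
// in the same environment stop the build and are both named.
function buildDomainTopology(entries: ZoneEntry[]): Topology {
  const topology: Topology = new Map()
  for (const entry of entries) {
    const zones = topology.get(entry.env) ?? new Map<string, string>()
    const existing = zones.get(entry.domain)
    if (existing !== undefined) {
      throw new Error(
        `'${entry.domain}' in '${entry.env}' is claimed by both ` +
          `${existing} and ${entry.zoneId}`,
      )
    }
    zones.set(entry.domain, entry.zoneId)
    topology.set(entry.env, zones)
  }
  return topology
}

// The caller passes env explicitly; nothing is inferred from stack names.
function zoneFor(topology: Topology, domain: string, env: string): string {
  const zone = topology.get(env)?.get(domain)
  if (zone === undefined) {
    throw new Error(`no zone owns '${domain}' in '${env}'`)
  }
  return zone
}

const topology = buildDomainTopology([
  { domain: 'example.com', env: 'staging', zoneId: 'zone-a' },
  { domain: 'example.com', env: 'production', zoneId: 'zone-b' },
])
const stagingZone = zoneFor(topology, 'example.com', 'staging')
```

The same domain in two environments is fine; the refusal applies only within one environment, which is why the key includes `env`.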

The service catalogue added another pressure point. Adding a registry helper meant changing the service's catalogue entry, so the shared surface stayed visible at the producer boundary instead of appearing later inside workload code.

Shortly after the worst of the re-cut was done, I moved the state backend off Pulumi Cloud to a self-managed GCS backend to cut cost. The contract seam held: consumer call-sites did not change, because helper callers never saw the storage.

The workaround was rough: a fake StackReference wrapped around a JSON file on disk. CI fetched organisation outputs into /tmp/org-outputs.json so existing helpers kept working. A missing file failed loudly; a missing key could still become undefined. The stable part was the helper call; the fake reference was a transitional transport. It exposed the seam the resolver should occupy: producer data materialised at the apply boundary, then consumed through contract addresses.
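A sketch of that transport, under stated assumptions: `OutputSource` is a hypothetical stand-in for the single `getOutput` call the helpers rely on, and the JSON text is passed in rather than read from disk so the example stays self-contained. Unlike the shim described above, this version also refuses a missing key instead of letting it become undefined.

```typescript
// Hypothetical stand-in for the one method the helpers use.
interface OutputSource {
  getOutput(key: string): unknown
}

// Wrap already-fetched JSON text (for example, the contents of
// /tmp/org-outputs.json) as an OutputSource. Pass undefined when the
// file could not be read, so the failure is loud and named.
function jsonOutputSource(label: string, text: string | undefined): OutputSource {
  if (text === undefined) {
    throw new Error(`${label}: outputs file missing`)
  }

  const parsed: unknown = JSON.parse(text)
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error(`${label}: outputs file is not a JSON object`)
  }
  const outputs = parsed as Record<string, unknown>

  return {
    getOutput(key: string): unknown {
      // Tightened relative to the original shim: a missing key is an
      // error here, not an undefined value.
      if (!(key in outputs)) {
        throw new Error(`${label}: output '${key}' missing`)
      }
      return outputs[key]
    },
  }
}

const orgOutputs = jsonOutputSource('org', '{"domainTopology": {"staging": {}}}')
const publishedTopology = orgOutputs.getOutput('domainTopology')
```

Helpers written against the narrow interface do not care whether the data came from a backend or a file, which is the property that let the call-sites stay unchanged.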

Move membership by access profile

The lower tier publishes the boundary, and the higher tier stays inside it. That membership line is drawn in Define tier membership.

The environment tier created the workload boundary as a reusable component. For services that had moved into the pattern, it provisioned the project, enabled APIs through the bootstrap path, created the deploy identity, attached OIDC trust, granted Shared VPC access, created the KMS surface, and published the assigned reference. The workload tier consumed that reference and created its Cloud Run services, application service accounts, workload-specific bindings, and record declarations inside the assigned project.

Those records need more than a declaration. Workload-owned DNS records are safe only when the binding rule exists too: admitted name pattern, record type, operation, and principal. Until that boundary is present, certificate and routing records stay with edge, and domain ownership stays with organisation.
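The four parts of a binding rule can be written down as data and checked in one place. This is a sketch only: `RecordBindingRule`, `admits`, the domain, and the principal string are hypothetical, and the real rule format is not shown in this article.

```typescript
type RecordType = 'A' | 'AAAA' | 'CNAME' | 'TXT'
type Operation = 'create' | 'update' | 'delete'

// Hypothetical rule shape: all four parts must be present for a
// workload to own records on the shared surface.
interface RecordBindingRule {
  namePattern: RegExp        // admitted record names (no global flag)
  recordTypes: RecordType[]  // admitted record types
  operations: Operation[]    // admitted operations
  principal: string          // identity allowed to act
}

interface RecordRequest {
  name: string
  type: RecordType
  operation: Operation
  principal: string
}

// A request is admitted only when every part of the rule matches.
function admits(rule: RecordBindingRule, request: RecordRequest): boolean {
  return (
    rule.principal === request.principal &&
    rule.operations.includes(request.operation) &&
    rule.recordTypes.includes(request.type) &&
    rule.namePattern.test(request.name)
  )
}

const checkoutRule: RecordBindingRule = {
  namePattern: /^[a-z0-9-]+\.checkout\.staging\.example\.com$/,
  recordTypes: ['CNAME'],
  operations: ['create', 'update'],
  principal: 'serviceAccount:checkout-deploy@example.iam.gserviceaccount.com',
}

const request: RecordRequest = {
  name: 'api.checkout.staging.example.com',
  type: 'CNAME',
  operation: 'create',
  principal: 'serviceAccount:checkout-deploy@example.iam.gserviceaccount.com',
}

const allowed = admits(checkoutRule, request) // true
```

A record declaration with no matching rule is the case the paragraph above warns about: the workload can state the record, but nothing bounds what it may claim.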

The forbidden reversal stayed explicit. If a workload creates its own project, its deploy identity needs authority over the environment or organisation layer. Environment discovering those projects after the fact reverses the same graph: it would then depend on workload-published identity.

That changed the old failure mode. Workload stacks stayed out of project creation, and workload deploy identities stayed below environment- or organisation-level authority. That was blast-radius reduction, not automatic least privilege: broad project-local deploy roles still needed their own review and tightening. The shared CI service account with broad rights across a shared project was retired as a written anti-pattern.

The lower tier still owned the shared surface; workloads joined it through the identity that matched the rule. Network access stayed on the host project: compute.networkUser was granted to deploy identities allowed to attach services to Shared VPC, instead of adding a subnet grant for every consumer.

This did not remove every hand-authored entry. Service membership still moved through lower-tier catalogues and project lists during the migration. Moving membership by access profile stopped new bespoke IAM, subnet, and secret exceptions on the surfaces that had rules, while old entries moved on their own schedule.

Secrets by prefix

The shared secrets here were platform-owned and used by workloads and GitHub Environments. Secrets a workload provisions for itself stay with the workload, defined by its stack.

Two access shapes used the rule. CI identities held the write shape: create, update, and add versions, each gated to its own name prefix, preview under preview-* and deploy under deploy-*. The application held the read shape, under sidecar-*, at deploy and at runtime.

shared secrets, two access shapes:

  CI · distribute
    preview   create, update, version under preview-*
    deploy    create, update, version under deploy-*

  runtime · read
    sidecar   access only under sidecar-*

Shared-secret access, two shapes that do not mix

Each binding was concrete: IAM member, allowed operations from the role or custom role, and admitted name prefix in the IAM condition. A request outside the admitted prefix was denied by the condition; a disallowed operation or wrong principal was denied by the binding. Google Cloud IAM enforced this binding after IaC wrote the policy: provider-side, after the call.
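The prefix part of each binding is an IAM condition on the secret's resource name. A minimal sketch of how those conditions could be generated, assuming the prefixes from the diagram above; `prefixCondition`, `matchesPrefix`, and the project number are hypothetical, and the role attached to each binding is left out.

```typescript
interface IamCondition {
  title: string
  expression: string
}

// Build a CEL expression that admits only secrets under one name prefix.
// Secret Manager resource names use the project number:
//   projects/<number>/secrets/<secret-id>[/versions/<version>]
function prefixCondition(projectNumber: string, prefix: string): IamCondition {
  return {
    title: `only ${prefix}*`,
    expression:
      `resource.name.startsWith("projects/${projectNumber}/secrets/${prefix}")`,
  }
}

// Local mirror of the condition, useful for unit tests of the rule
// table. The cloud API, not this function, is what enforces access.
function matchesPrefix(
  resourceName: string,
  projectNumber: string,
  prefix: string,
): boolean {
  return resourceName.startsWith(`projects/${projectNumber}/secrets/${prefix}`)
}

// Hypothetical project number; the prefixes come from the two access shapes.
const projectNumber = '123456789012'
const previewWrite = prefixCondition(projectNumber, 'preview-')
const deployWrite = prefixCondition(projectNumber, 'deploy-')
const sidecarRead = prefixCondition(projectNumber, 'sidecar-')
```

Keeping the trailing hyphen inside the prefix matters: without it, a condition for `deploy` would also admit a secret named `deployment-keys`.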

Gate, then delete

The release gate belongs after the contract exists. Before that, it mostly checks naming conventions. The migration ended up with several refusal points across two layers:

release-path refusals:

  validated helpers
    missing aggregate, missing service entry,
    unreadable aggregate or absent required fields
    - caught during preview and named in the error

  domain topology
    two zone entries claimed the same domain name
    in the same environment
    - refused while building the published topology

  wave planner
    default workflow previews stacks
    in dependency order
    - failed preview blocks the deploy

shared-surface enforcement:

  secret IAM bindings
    wrong principal, disallowed operation, or
    prefix-violating Secret Manager request
    is unauthorised at the cloud API boundary

Release-path refusals across two layers

What landed was narrower than the full resolver, and removal never depended on it being complete. The helpers, topology check, and wave ordering refused before any provider call; the secret bindings were enforced by the cloud after it. The gaps stayed on the resolver backlog: raw StackReference keys that could still read undefined, required-field checks short of full type validation, consumer-supplied context not yet matched to the release context, and catalogue cycles between the output fetch and preview.

Deletion came after consumers moved and these gates were active. The safe order stayed small enough to review:

1. Publish the new contract beside the old path.
2. Publish binding rules for each shared surface touched by the removal, where mutation is delegated across tiers.
3. Send new consumers to the contract only.
4. Move existing consumers from raw reads to contract resolution.
5. Move workload-owned members into the workload tier.
6. Block new reads of the old path.
7. Remove legacy outputs and resources after consumers are gone.

Deletion followed a simple rule: vacate first, then destroy the empty property. A path came out only when its inventory showed zero consumers and the landed checks covered the failures that could keep it alive - missing helper entries or ambiguous topology. After consumers moved to the new contract, most old resources had no reason to survive. The exceptions were stateful data and DNS, where recovery or propagation outlived the apply.
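The rule reads as a small predicate over the raw-read inventory. The sketch below is illustrative only: the `LegacyPath` shape, the check names, and `removalBlockers` are hypothetical, and the real inventory was not necessarily kept in this form.

```typescript
// Hypothetical inventory entry for one legacy output or resource path.
interface LegacyPath {
  name: string
  consumers: string[]     // stacks still reading the raw path
  landedChecks: string[]  // gates already active for this path
}

// Return every reason the path must stay. An empty list means the
// property is vacated and the landed checks cover its failure modes.
function removalBlockers(path: LegacyPath, requiredChecks: string[]): string[] {
  const blockers: string[] = []

  if (path.consumers.length > 0) {
    blockers.push(`still read by: ${path.consumers.join(', ')}`)
  }

  const missing = requiredChecks.filter(
    (check) => !path.landedChecks.includes(check),
  )
  if (missing.length > 0) {
    blockers.push(`checks not landed: ${missing.join(', ')}`)
  }

  return blockers
}

const required = ['helper-entry', 'topology-duplicate']

const rawProjectOutput: LegacyPath = {
  name: 'appProjectId',
  consumers: ['checkout-staging'],
  landedChecks: ['helper-entry'],
}

const blockers = removalBlockers(rawProjectOutput, required)
// blockers:
//   ['still read by: checkout-staging',
//    'checks not landed: topology-duplicate']
```

Stateful data and DNS sit outside this predicate: an empty blocker list says the path is unused, not that destroying it is cheap to undo.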

DNS was the migration I ran by hand. Its records moved through numbered manual stages, each applied by impersonating the CI service account instead of going through the normal deploy. That work existed only because I placed DNS in the wrong boundary at the start.

A complete repair leaves boundaries matching ownership and dependencies resolving by contract. This migration still carried shims and raw reads, but they were tracked work, and the old boundary could no longer grow around them.