Enterprise Rollout

Rollout playbook for DNS teams and application teams adopting AID in production.

Enterprise Rollout Playbook

Use this playbook when DNS ownership and application ownership are split across teams.

AID rollout usually fails on coordination, not syntax. Treat the _agent.<domain> record, DNSSEC, TLS, and PKA as one change set.

Ownership Model

Split responsibilities clearly before the first rollout.

DNS team

Own the _agent.<domain> TXT record.
Own DNSSEC enablement and DS record publication.
Own TTL changes and rollback windows.
Own delegated subdomain records when the application team does not control the parent zone.

Application team

Own the agent endpoint, TLS certificate, and protocol behavior.
Own pka and kid generation, storage, and rotation.
Own .well-known only when that fallback is intentionally allowed.
Own post-change verification with aid-doctor and SDK smoke tests.

Shared handoff

Agree on these values before rollout:

queried hostname
published _agent name
target uri
proto
TTL during change window
DNSSEC expectation
PKA requirement
rollback owner

Deployment Patterns

AID discovery is exact-host only. Do not rely on parent-domain walking.

Pattern A: apex or standard host deployment

Use this when the domain itself should resolve directly.

_agent.example.com. 300 IN TXT "v=aid1;u=https://api.example.com/mcp;p=mcp;i=g1;k=z6Mk..."

Use this for:

example.com
api.example.com
agent.example.com

Pattern B: delegated subdomain deployment

Use this when the application team needs isolated control.

Parent zone:

_agent.team.example.com. 300 IN NS ns1.team-dns.example.net.
_agent.team.example.com. 300 IN NS ns2.team-dns.example.net.

Delegated zone:

_agent.app.team.example.com. 300 IN TXT "v=aid1;u=https://app.team.example.com/mcp;p=mcp;i=g1;k=z6Mk..."

Use this when:

the DNS team owns example.com
the app team owns team.example.com or a delegated _agent subtree
you need team-level isolation without changing client discovery behavior

Do not publish multiple valid AID TXT records at the same queried name. Use distinct hostnames or route behind one endpoint.

Rollout Sequence

Use a controlled change window.

1. Prepare

Lower TTL to 60-120 seconds if you expect a near-term cutover.
Confirm the endpoint is live before DNS changes.
Generate pka and kid if the environment will use balanced or strict with identity proof.
Decide whether .well-known fallback is allowed. In strict, it is not.

2. Validate before publish

Run:

aid-doctor check example.com --security-mode balanced --check-downgrade

For stricter environments, run:

aid-doctor check example.com --security-mode strict --check-downgrade

Confirm:

exactly one valid record exists
remote endpoints use https:// or wss://
DNSSEC status matches policy
pka verification passes when required

3. Publish

Publish the TXT record at the exact queried host.
Publish DS records and enable DNSSEC before requiring it in client policy.
If rotating keys, deploy the new private key before updating DNS.

4. Verify after publish

Run:

aid-doctor check example.com --security-mode balanced --check-downgrade

Then run one SDK smoke test from the client environment that will actually consume the record.

5. Restore TTL

After caches settle and validation stays clean, restore the normal TTL.

Security Mode Adoption Ladder

Do not start with strict unless DNSSEC and PKA are already operational.

Stage 1: baseline

Use default discovery behavior while the DNS and app teams prove basic correctness.

Exit criteria:

exact-host record is stable
TLS is valid
no duplicate valid TXT records exist

Stage 2: `balanced`

Use balanced when you want warnings instead of hard failures.

pka: if-present
dnssec: prefer
well-known: auto
downgrade: warn

Exit criteria:

PKA is deployed and verifiable where expected
DNSSEC is live or the remaining gap is explicitly accepted
downgrade warnings are monitored

Stage 3: `strict`

Use strict only after the organization can support hard failures.

pka: require
dnssec: require
well-known: disable
downgrade: fail

Exit criteria:

DNSSEC validation works from real client networks
PKA rotation runbook has been exercised
teams agree on rollback ownership and escalation path

Rollback Guidance

Treat rollback as part of the rollout plan.

DNS rollback

restore the last known good TXT record
keep the same queried hostname
keep TTL low until validation recovers
rerun aid-doctor check <domain>

PKA rollback

restore the previous private key only if it is still trusted and available
otherwise publish a new pka with a new kid
do not remove pka silently if clients may have downgrade memory enabled

DNSSEC rollback

avoid disabling DNSSEC during an incident unless the DNS team is sure DS and signing state are the root cause
if strict clients exist, disabling DNSSEC is a breaking change

Incident Runbooks

Downgrade alert: `pka` missing or `kid` changed

Confirm whether a planned rotation or rollback happened.
Compare the current DNS answer with the last known good value.
If the change was expected, document it and update caches or rollout notes.
If the change was not expected, treat it as a security incident and revert to a known good state.

Multiple valid TXT records detected

Inspect authoritative DNS, not only recursive resolver output.
Remove duplicate valid AID payloads at the queried name.
Keep one valid record only.
If multiple agents are needed, move them to distinct hostnames.

Exact-host lookup failed after delegation

Confirm the client is querying the intended hostname.
Confirm the delegated _agent zone exists and serves the target name.
Confirm the parent zone does not rely on implicit inheritance.
Verify with dig and aid-doctor from outside the authoritative network.

Enterprise Checklist

Before rollout

DNS team and app team owners named
exact queried hostname agreed
rollback owner named
TTL lowered for cutover if needed
endpoint deployed and TLS valid
PKA key pair generated and stored securely if used
DNSSEC plan confirmed

During rollout

exactly one valid TXT record published
aid-doctor check <domain> passes in the target security mode
one real SDK/client smoke test passes
logs and alerting are monitored during propagation

After rollout

TTL restored
rollback notes updated
downgrade baseline stored if applicable
key rotation procedure documented and assigned