Case Study | Remediation | Guardrails

Auto remediation playbooks with safety checks built in.

Built remediation logic for cloud misconfigurations where automation can reduce exposure quickly, while still respecting ownership, exceptions, and operational risk.

Auto Remediation Architecture

The remediation workflow starts in Wiz. A custom control detects a misconfigured resource, then a Wiz automation rule sends the finding payload into an AWS event pipeline. The platform account validates and routes the event before a remediation Lambda assumes a controlled role in the customer account and applies the approved fix.

Detection Plane: Wiz
01 Wiz Control

Detects a policy violation or cloud resource misconfiguration.

02 Automation Rule

Triggers on the control result and sends a structured finding payload.

Platform Account: AWS Event Pipeline
03 SNS Topic

Receives the event and fans out messages for downstream processing.

04 SQS Queue

Buffers events, supports retry behavior, and protects Lambda from spikes.

05 Lambda Orchestrator

Validates eligibility, loads config, and selects a remediation path.

Trust Boundary: Cross-Account Access
06 AssumeRole

Uses an external ID, scoped permissions, and a traceable session context.

07 Customer Account

Applies the approved resource update in the affected account.

Evidence Plane: Audit + Feedback
08 Audit Log

Records decision, action, status, failure reason, and evidence.

09 Notify / Review

Routes skipped, failed, or sensitive findings for human review.

Event durability

SQS absorbs bursts, supports retry, and gives failed events a replay path.

Least privilege

Customer account remediation roles expose only the APIs needed for approved fixes.

Decision evidence

Every automated, skipped, and failed action is tied back to the Wiz finding.

Challenge

Some cloud security findings are repetitive and time-sensitive, but remediation can create risk if automation changes resources without understanding context. The goal was to automate only the cases where conditions were clear and defensible.

Approach

  • Created Wiz controls that identify specific misconfiguration patterns and provide a consistent finding payload.
  • Configured Wiz automation rules to trigger only when the control detects a qualifying resource.
  • Used SNS and SQS as a durable handoff between Wiz and Lambda so remediation events can be buffered and retried.
  • Used Lambda to validate payload shape, severity, affected resource, account metadata, and remediation eligibility.
  • Assumed a scoped remediation role in the customer account before applying any cloud resource change.
  • Defined remediation eligibility checks for severity, resource type, environment, ownership, and exception state.
  • Separated detection, decision, action, and audit logging so each step could be tested independently.
  • Routed ambiguous findings, protected environments, and missing metadata to notification or manual review instead of automatic action.
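
The handoff and validation steps above can be sketched as an SQS-triggered Lambda handler. This is a minimal sketch, not the production code: the field names in REQUIRED_FIELDS and the make_handler helper are hypothetical, the real Wiz payload may be shaped differently, and the partial batch response only takes effect if ReportBatchItemFailures is enabled on the event source mapping.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("remediation")

# Hypothetical field names; the real Wiz automation payload may differ.
REQUIRED_FIELDS = ("finding_id", "severity", "resource_type", "resource_id", "account_id")


def parse_finding(sqs_body: str) -> dict[str, Any]:
    """Unwrap the SNS envelope in an SQS message body and check the payload shape."""
    envelope = json.loads(sqs_body)
    # Without raw message delivery, SNS puts the original payload in a "Message" string.
    if isinstance(envelope, dict) and "Message" in envelope:
        finding = json.loads(envelope["Message"])
    else:
        finding = envelope
    if not isinstance(finding, dict):
        raise ValueError("payload is not a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in finding]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return finding


def make_handler(process: Callable[[dict[str, Any]], None]):
    """Build an SQS-triggered Lambda handler around a remediation callable."""

    def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
        failures = []
        for record in event.get("Records", []):
            try:
                finding = parse_finding(record["body"])
            except (ValueError, TypeError) as exc:
                # Malformed events are logged and rejected, never retried.
                logger.warning("rejected message %s: %s", record["messageId"], exc)
                continue
            try:
                process(finding)
            except Exception:
                # Returning the message ID asks SQS to redeliver only this message.
                logger.exception("processing failed for %s", finding["finding_id"])
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Separating the two try blocks keeps malformed payloads out of the retry path while still letting transient processing errors return to the queue.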

Implementation Details

The playbook separates the remediation decision from the remediation action. A finding must pass eligibility checks before any change is attempted, and ambiguous cases move to notification or review rather than automatic execution.

# All checks must pass before any change is attempted.
eligible_for_remediation = (
    severity in ("HIGH", "CRITICAL")
    and environment != "restricted-production"
    and owner is not None
    and exception_status != "approved"
    and resource_type in approved_resource_types
    and proposed_change in allowed_actions
)
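
To show how the same checks could also produce evidence for skipped findings, here is a hedged sketch that returns the names of the failed checks alongside the decision. The Finding dataclass, the evaluate function, and the example resource types and actions are illustrative assumptions, not the playbook's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlists; real values would come from remediation config.
APPROVED_RESOURCE_TYPES = frozenset({"s3_bucket", "security_group"})
ALLOWED_ACTIONS = frozenset({"block_public_access", "revoke_open_ingress"})


@dataclass(frozen=True)
class Finding:
    severity: str
    environment: str
    owner: Optional[str]
    exception_status: str
    resource_type: str
    proposed_change: str


def evaluate(finding: Finding) -> tuple[bool, list[str]]:
    """Return (eligible, failed_checks) so a skipped finding records why it was skipped."""
    checks = {
        "severity": finding.severity in ("HIGH", "CRITICAL"),
        "environment": finding.environment != "restricted-production",
        "owner": finding.owner is not None,
        "exception_status": finding.exception_status != "approved",
        "resource_type": finding.resource_type in APPROVED_RESOURCE_TYPES,
        "proposed_change": finding.proposed_change in ALLOWED_ACTIONS,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)
```

A finding with no owner in a restricted production environment would come back as ineligible with both "environment" and "owner" listed, which gives the review queue a concrete reason rather than a bare rejection.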

Cross-account remediation uses a dedicated execution role in the platform account and a customer account role that grants only the permissions needed for the approved fix. The assumed-role session should use a traceable session name that carries a correlation ID from the Wiz payload, so each action can be traced back to the original finding.
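
A minimal sketch of the role assumption, assuming a boto3 STS client is passed in. The function names, the "wiz-remediation" prefix, and the use of the finding ID as the correlation ID are illustrative assumptions; assume_role and its RoleArn, RoleSessionName, ExternalId, and DurationSeconds parameters are the standard STS call.

```python
import re
from typing import Any


def session_name(finding_id: str, prefix: str = "wiz-remediation") -> str:
    """Build an STS session name that carries the finding ID."""
    # STS session names allow only [\w+=,.@-] (ASCII) and at most 64 characters.
    cleaned = re.sub(r"[^\w+=,.@-]", "-", f"{prefix}-{finding_id}", flags=re.ASCII)
    return cleaned[:64]


def assume_remediation_role(
    sts_client: Any,
    role_arn: str,
    external_id: str,
    finding_id: str,
    duration_seconds: int = 900,
) -> dict[str, Any]:
    """Assume the customer account role and return short-lived credentials."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name(finding_id),
        ExternalId=external_id,
        # 900 seconds is the minimum STS allows; keep sessions short for one-off fixes.
        DurationSeconds=duration_seconds,
    )
    return response["Credentials"]
```

In practice the client would be created with boto3.client("sts") in the platform account, and the returned credentials would be used to build a session scoped to the single approved change.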

Trust Boundaries and Failure Handling

  • Wiz to AWS: Only accepted automation payloads are processed; malformed events are logged and rejected.
  • SNS to SQS: Queueing provides backpressure protection and gives failed events a retry path.
  • Lambda to customer account: Remediation requires explicit role trust, scoped permissions, and account allowlisting.
  • Failure path: Failed role assumptions, missing resource context, and denied API calls are recorded with the finding ID and routed for review.
  • Audit path: Every skipped, failed, and successful decision writes evidence for security operations and customer reporting.
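
The failure and audit paths above could write one structured record per decision. The sketch below is an assumption about shape only: the AuditRecord fields, the decision labels, and the record_outcome helper are hypothetical names chosen to mirror the fields listed in this case study.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

DECISIONS = ("remediated", "skipped", "failed")  # hypothetical labels


@dataclass
class AuditRecord:
    finding_id: str
    resource_id: str
    decision: str
    action: Optional[str] = None
    failure_reason: Optional[str] = None
    needs_review: bool = False
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_outcome(
    finding_id: str,
    resource_id: str,
    decision: str,
    action: Optional[str] = None,
    failure_reason: Optional[str] = None,
) -> str:
    """Serialize one decision as a JSON line tied to the finding ID."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    record = AuditRecord(
        finding_id=finding_id,
        resource_id=resource_id,
        decision=decision,
        action=action,
        failure_reason=failure_reason,
        # Skipped and failed findings are flagged for human review.
        needs_review=decision != "remediated",
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Emitting the record as a single JSON line keeps it easy to ship to whatever log store backs security operations and customer reporting.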

Outcomes

Guardrailed response

Remediation proceeds only when the severity, environment, owner, exception, resource type, and action checks all pass.

Event-driven scale

SNS and SQS decouple Wiz from Lambda so findings can be processed reliably without losing events during spikes.

Human review where needed

Findings with missing metadata or sensitive production context are routed for review instead of changed blindly.

Auditable changes

Each decision can be logged with the finding, resource, eligibility result, action, and final status.

Security Controls Demonstrated

  • Reduced response time for recurring high-confidence misconfigurations.
  • Improved trust in automation by making the decision path explicit.
  • Kept human review in the loop for cases where automated action was not safe enough.
  • Produced audit records that explain what was changed and why.

Production Considerations

  • Policy-as-code rules for remediation eligibility and exception handling.
  • Change approval integration for sensitive production resources.
  • Rollback metadata for actions that can be safely reversed.
  • Dead-letter queue processing and replay tooling for failed remediation events.
  • Per-customer role permission reviews to keep remediation access at least privilege over time.
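
As one possible starting point for the dead-letter replay item above, the sketch below wraps the SQS StartMessageMoveTask API. It assumes a boto3 SQS client is passed in; the replay_dead_letters name and the default rate are illustrative, and a real tool would likely add filtering and approval before replaying anything.

```python
from typing import Any, Optional


def replay_dead_letters(
    sqs_client: Any,
    dlq_arn: str,
    destination_arn: Optional[str] = None,
    max_per_second: int = 5,
) -> str:
    """Start moving messages out of the dead-letter queue and return the task handle."""
    params: dict[str, Any] = {
        "SourceArn": dlq_arn,
        # Throttle the replay so the orchestrator is not hit with a spike.
        "MaxNumberOfMessagesPerSecond": max_per_second,
    }
    if destination_arn is not None:
        # When omitted, SQS redrives messages to their original source queue.
        params["DestinationArn"] = destination_arn
    response = sqs_client.start_message_move_task(**params)
    return response["TaskHandle"]
```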