Marico's space

CloudFormation Production Pitfalls: 14 Common Failures and How to Fix Them

Server Technology 2026-04-27 21:28:21

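# CloudFormation in Production: What Breaks and How to Fix It

Beyond YAML templates: failure handling, security, and real-world trade-offs

Before You Start

This is the follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.

That article covered templates, stacks, nested stacks, CI/CD, and production best practices.

This one covers the cases where those best practices are not enough: when things break in ways the documentation never warned you about, and when you are reading CloudFormation error messages at midnight and need answers.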

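Part 1: Stack Deployment Failures

Failure 1: Lambda/EC2 fails right after the IAM role is created ("Role does not exist")

Symptoms:

The IAM role is created successfully (status: CREATE_COMPLETE)
The Lambda or EC2 resource that follows it fails
Error: The role named 'xxx' does not exist or is not authorized

Root cause:

IAM is eventually consistent. CloudFormation marks the role complete as soon as the API call returns, but the role can take 5-10 seconds to become visible everywhere in AWS.

Fix: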

LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaExecutionRole
  Properties:
    Role: !GetAtt LambdaExecutionRole.Arn

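DependsOn makes the ordering explicit: CloudFormation creates the Lambda function only after the role resource reaches CREATE_COMPLETE. Note that !GetAtt LambdaExecutionRole.Arn already implies this dependency, and neither mechanism adds a wait for IAM propagation. DependsOn matters most when the function relies on something it never references, such as a separately declared AWS::IAM::Policy attached to the role.

Prevention:

When a resource uses an IAM role created in the same stack, add DependsOn for the role and for any separately declared policies the resource needs.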
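Where a deployment script or custom resource (rather than the template itself) uses a freshly created role, eventual consistency is usually absorbed by retrying with backoff. The sketch below shows one way to do that. `call_with_retry` and `looks_like_iam_propagation` are hypothetical helper names, and the message matching is a heuristic based on the error text quoted above, not an official error code.

```python
import time


def call_with_retry(operation, is_retryable, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a retryable error, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(error):
                raise
            sleep(base_delay * (2 ** attempt))


def looks_like_iam_propagation(error):
    """Heuristic (assumption): match message fragments typical of a role not yet visible."""
    message = str(error)
    return "does not exist" in message or "cannot be assumed" in message
```

In a real script `operation` would wrap the API call that consumes the role, for example a function that creates the Lambda function; the `sleep` parameter exists only so the delay can be replaced in tests.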

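Failure 2: Stack times out for no obvious reason

Symptoms:

Stack creation or update times out after the configured time limit
The event log shows no obvious error
Some resources sit in CREATE_IN_PROGRESS for hours

Root cause:

A resource with a CreationPolicy or WaitCondition is waiting for a signal that never arrives. Common causes:

The EC2 instance user data script fails silently
The custom resource Lambda times out
The application code never calls cfn-signal

Diagnosis:

# List resources still stuck in CREATE_IN_PROGRESS (likely waiting on a signal)
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"

# For EC2, check user data logs on the instance
cat /var/log/cloud-init-output.log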

Fix:

For EC2 with user data:

#!/bin/bash
# Do your setup here

# Signal success or failure
/opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
  --resource WebServerInstance --region ${AWS::Region}

For custom resources, handle failures so that a response is always sent before the function times out:

def handler(event, context):
    try:
        # Do work
        send_response(event, context, "SUCCESS")
    except Exception as e:
        # CRITICAL: Always send a response
        send_response(event, context, "FAILED", reason=str(e))

Prevention:

Always test CreationPolicy paths with --disable-rollback first, so you can inspect the failed resource before it is cleaned up automatically.
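The custom resource handler above calls send_response without showing it. A minimal sketch of such a helper follows, assuming the standard custom resource contract: a JSON body sent with HTTP PUT to the pre-signed URL in event["ResponseURL"]. The split into `build_response` and `send_response` is an illustrative choice, not the original author's code.

```python
import json
import urllib.request


def build_response(event, context, status, reason=""):
    """Build the response body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream: {context.log_stream_name}",
        # Reuse the existing physical ID on Update/Delete; fall back to the log stream name.
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": {},
    }


def send_response(event, context, status, reason=""):
    """PUT the response to the pre-signed URL supplied in the request event."""
    body = json.dumps(build_response(event, context, status, reason)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        headers={"Content-Type": "", "Content-Length": str(len(body))},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A try/except cannot catch a Lambda timeout, because the runtime stops the function before the except branch runs. One option is to check context.get_remaining_time_in_millis() during long work and send FAILED while there is still time left.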

Failure 3: Nested stack update fails and the root cause is hidden

Symptoms:

Parent stack update fails
Error message: Nested stack failed to update
No details about why the nested stack failed

Root cause:

CloudFormation does not bubble nested stack failure details up to the parent stack. You have to inspect each nested stack individually.

Diagnosis:

# List nested stacks from the parent
aws cloudformation list-stack-resources --stack-name parent-stack \
  --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].[PhysicalResourceId]"

# Check each nested stack's events
aws cloudformation describe-stack-events --stack-name nested-stack-1

Fix:

Add explicit validation before updating the parent stack:

# Validate nested template before updating parent
aws cloudformation validate-template --template-body file://nested.yaml

# Check nested stack for drift
aws cloudformation detect-stack-drift --stack-name nested-stack-1

Prevention:

Keep nested stack depth to a minimum (two levels at most). For complex dependencies, use StackSets or split the template into separate parent stacks.
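Checking each nested stack by hand gets tedious, so the walk can be scripted. The sketch below follows failed AWS::CloudFormation::Stack events down into the nested stacks and returns the deepest failures first. It assumes `cfn` is a boto3 CloudFormation client (or any object with the same describe_stack_events method), reads only the first page of events, and `find_failed_events` is a hypothetical name.

```python
def find_failed_events(cfn, stack_name, _seen=None):
    """Collect *_FAILED events from a stack and its nested stacks, deepest first."""
    seen = _seen if _seen is not None else set()
    if stack_name in seen:
        return []
    seen.add(stack_name)

    failures = []
    # Only the first page of events is read; older events would need pagination.
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    for event in events:
        if not event.get("ResourceStatus", "").endswith("_FAILED"):
            continue
        physical_id = event.get("PhysicalResourceId", "")
        # A stack's own events carry its StackId as the physical ID; a nested
        # stack appears in the parent with a different physical ID (its ARN).
        is_nested_stack = (
            event.get("ResourceType") == "AWS::CloudFormation::Stack"
            and physical_id
            and physical_id != event.get("StackId")
        )
        if is_nested_stack:
            failures.extend(find_failed_events(cfn, physical_id, seen))
        failures.append({
            "Stack": stack_name,
            "LogicalResourceId": event.get("LogicalResourceId"),
            "Status": event.get("ResourceStatus"),
            "Reason": event.get("ResourceStatusReason", ""),
        })
    return failures
```

With boto3 installed this would be called as find_failed_events(boto3.client("cloudformation"), "parent-stack"); the first entry is usually the resource that actually broke.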

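Part 2: Drift and Configuration Mismatch

Failure 4: Production resources modified outside CloudFormation

Symptoms:

Security group rules allow unexpected traffic
An S3 bucket becomes public
RDS backup retention period changes
No corresponding change in Git history

Root cause:

Someone modified the resource directly in the AWS console or through the CLI, bypassing CloudFormation.

Diagnosis:

# Detect drift on a stack (returns a StackDriftDetectionId)
aws cloudformation detect-stack-drift --stack-name prod-web

# Get drift detection status; DETECTION_ID holds the ID returned above
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DETECTION_ID"

# List drifted resources
aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
  --stack-resource-drift-status-filters MODIFIED DELETED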

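Fix (manual):

Either revert the out-of-band change on the resource itself, or update the template to match the new configuration and redeploy. If the resource has dropped out of the stack and needs to be adopted again, use an IMPORT change set:

# Import an existing resource into the stack (my-data-bucket is a placeholder name)
aws cloudformation create-change-set --stack-name prod-web \
  --change-set-name import-data-bucket --change-set-type IMPORT \
  --template-body file://template.yaml \
  --resources-to-import '[{"ResourceType":"AWS::S3::Bucket","LogicalResourceId":"DataBucket","ResourceIdentifier":{"BucketName":"my-data-bucket"}}]'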

Fix (automated):

# EventBridge (CloudWatch Events) rule to detect drift weekly
DriftDetectionRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 12 ? * MON *)"  # Every Monday at noon UTC
    Targets:
      - Id: DriftLambdaTarget
        Arn: !GetAtt DriftLambda.Arn
        Input: '{"stackName": "prod-web"}'

Prevention:

Enforce IAM policies that prevent resources from being modified outside CloudFormation
Enable drift detection on all production stacks
Review drift reports weekly
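The scheduled rule in the automated fix targets a DriftLambda that the article does not show. A minimal sketch of what such a function might do follows: start drift detection, poll until it finishes, then list the drifted resources. The function names, polling defaults, and return shape are assumptions; the client methods are the boto3 CloudFormation drift APIs.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Run drift detection on a stack and return its modified or deleted resources."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection on {stack_name} did not finish in time")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    # Only the first page of results is read here.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return [
        {"resource": d["LogicalResourceId"], "status": d["StackResourceDriftStatus"]}
        for d in drifts
    ]


def handler(event, context):
    """Lambda entry point; expects {"stackName": "..."} as sent by the rule above."""
    import boto3  # assumption: boto3 is available, as it is in the AWS Lambda Python runtime

    drifted = detect_drift(boto3.client("cloudformation"), event["stackName"])
    return {"stackName": event["stackName"], "drifted": drifted}
```

To turn the result into an alert, the handler could publish a non-empty `drifted` list to a notification channel of your choice; that part is left out here.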

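Failure 5: Stack drift leaves deletion protection blocking cleanup

Symptoms:

You try to delete the stack
Error: Cannot delete stack because resource X has deletion protection
The resource was never supposed to have deletion protection

Root cause:

Someone enabled deletion protection directly on a resource such as an RDS database or S3 bucket. CloudFormation does not know about it.

Diagnosis:

# Find which resource is blocking deletion
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='DELETE_FAILED']"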

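Fix:

# Remove deletion protection from the resource directly
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# S3 has no deletion-protection flag; suspending versioning stops new versions,
# but the bucket must still be emptied before it can be deleted
aws s3api put-bucket-versioning --bucket mybucket \
  --versioning-configuration Status=Suspended

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack

Prevention:

For stateful resources, put DeletionPolicy: Retain in the template instead of relying on deletion protection. CloudFormation acts on DeletionPolicy, but it has no knowledge of deletion protection that was enabled outside the template.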

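Part 3: Rollback Failures

Failure 6: Rollback fails because a resource cannot be deleted

Symptoms:

Stack update fails
Rollback starts
Rollback fails
Stack is stuck in ROLLBACK_FAILED (UPDATE_ROLLBACK_FAILED when an update was being rolled back)

Root cause:

A resource created during the failed update cannot be deleted. Common causes:

An S3 bucket has versioning enabled and contains objects
RDS has deletion protection enabled
A network interface is still attached
A custom resource performed an external action

Diagnosis:

# Find which resource caused rollback failure
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?ResourceStatus=='DELETE_FAILED']"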

Fix (for S3):

# Empty the bucket first (on a versioned bucket this only adds delete markers;
# object versions must be deleted separately)
aws s3 rm s3://bucket-name --recursive

# Suspend versioning
aws s3api put-bucket-versioning --bucket bucket-name \
  --versioning-configuration Status=Suspended

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack
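On a versioned bucket, old object versions and delete markers have to be removed explicitly before the bucket can be deleted. The sketch below is one way to do that, assuming `s3` is a boto3 S3 client (or an object with the same list_object_versions and delete_objects methods); `empty_versioned_bucket` is a hypothetical name, and per-object delete errors are not inspected.

```python
def empty_versioned_bucket(s3, bucket):
    """Delete every object version and delete marker; return how many were removed."""
    deleted = 0
    kwargs = {"Bucket": bucket}
    while True:
        page = s3.list_object_versions(**kwargs)
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        objects = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        if objects:
            # A listing page holds at most 1000 entries, which is also the
            # per-request limit of delete_objects.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects, "Quiet": True})
            deleted += len(objects)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["VersionIdMarker"] = page["NextVersionIdMarker"]
```

This permanently destroys data, so it belongs in a cleanup path for buckets you have already decided to discard, not in a routine rollback.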

Fix (for RDS):

# Disable deletion protection
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Skip final snapshot if you want fast cleanup
aws rds delete-db-instance --db-instance-identifier mydb \
  --skip-final-snapshot

Prevention:

Use DeletionPolicy: Retain when designing stateful resources for production, and accept that you will have to clean them up by hand. Do not let stateful resources block automatic rollback.

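Failure 7: Rollback takes too long and extends downtime

Symptoms:

Stack update fails at minute 15
Rollback takes another 20 minutes
Total downtime: 35+ minutes

Root cause:

Resources with DeletionPolicy: Snapshot need time to create a snapshot during rollback. An RDS snapshot can take 10-20 minutes. EBS snapshots add several minutes per volume.

Diagnosis:

# Check which resource is taking time during rollback
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?contains(ResourceStatus, 'DELETE')]"

Fix during the incident:

Once rollback has started, your options are limited. The fastest path is usually to let it finish, even if it is slow.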

Prevention:

Separate stateful resources (databases, buckets) into their own stack. That stack changes rarely. Application stacks change often but contain no stateful resources.

# Stack 1: Data (deploys monthly, rollback takes time but happens rarely)
DatabaseStack:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot

# Stack 2: Application (deploys daily, rollback is fast)
AppStack:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Delete  # No snapshot, instant deletion

When AppStack fails, rollback takes a fraction of the time because no snapshots are involved, and the database is untouched.

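Part 4: IAM and Permission Failures

Failure 8: CI/CD pipeline reports missing CloudFormation permissions

Symptoms:

CI/CD pipeline fails
The error message reports a missing permission
The same permissions worked yesterday

Root cause:

The IAM policy changed: a condition was added or a permission was removed, and the role used by CI/CD no longer has the access it needs.

Diagnosis:

# create-stack has no dry-run mode; a CREATE change set exercises the
# permissions and previews the deployment without creating resources
aws cloudformation create-change-set --stack-name test-stack \
  --change-set-name permission-check --change-set-type CREATE \
  --template-body file://test.yaml

# Check effective permissions for the role
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
  --action-names cloudformation:CreateStack \
  --resource-arns "arn:aws:cloudformation:us-east-1:123456789012:stack/*"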

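Fix:

Add the missing permission to the CI/CD role:

{
  "Effect": "Allow",
  "Action": "cloudformation:CreateStack",
  "Resource": "arn:aws:cloudformation:region:account:stack/*"
}

Prevention:

Use IAM permissions boundaries and permission guardrails. Test the CI/CD role's permissions in a staging account before deploying to production.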
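The permission test in staging can be automated with the same policy simulation shown in the diagnosis. The sketch below returns the actions a role would be denied; it assumes `iam` is a boto3 IAM client (or an object with the same simulate_principal_policy method), reads only the first page of results, and `find_denied_actions` is a hypothetical name.

```python
def find_denied_actions(iam, role_arn, actions, resource_arns):
    """Return the sorted action names that the role is not allowed to perform."""
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
        ResourceArns=list(resource_arns),
    )
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
    denied = {
        result["EvalActionName"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }
    return sorted(denied)
```

A pipeline step could fail the build when the returned list is non-empty, which surfaces a removed permission before the production deployment does.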

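Failure 9: Cross-account stack operations fail

Symptoms:

A stack in account A tries to create resources in account B
Error: Access denied or Role does not exist

Root cause:

CloudFormation does not natively support creating resources across accounts. You need IAM roles with a trust relationship set up in both accounts.

Fix (set up the cross-account role in the target account):

# In Account B (target)
CrossAccountRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::AccountA:root
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AdministratorAccess  # Scope down in production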

Fix (assume the role from the source account):

# In Account A (source)
CustomResource:
  Type: Custom::CrossAccount
  Properties:
    ServiceToken: !GetAtt CrossAccountLambda.Arn
    TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole

Prevention:

Design stacks to be account-specific. For multi-account deployments, use AWS Organizations and StackSets rather than cross-account resource references.

Part 5: Template Validation Failures That Only Appear at Deploy Time

Failure 10: Template validates but deployment fails

Symptoms:

aws cloudformation validate-template --template-body file://template.yaml
# Returns: Template is valid

But the deployment fails with: Encountered unsupported property or Resource handler returned invalid request

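Root cause:

validate-template checks syntax and basic schema. It does not check:

Invalid combinations of resource properties (for example, certain combinations of SourceSecurityGroupId and CidrIp)
Region-specific restrictions (some resources are not available in every region)
Service limits (for example, requesting 2000 IOPS when the limit is 1000)

Diagnosis:

Deploy with --disable-rollback to keep the failed resources around for inspection:

aws cloudformation create-stack --stack-name test-stack \
  --template-body file://template.yaml \
  --disable-rollback

Then check the status reason of the failed resource: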

aws cloudformation describe-stack-resources --stack-name test-stack \
  --query "StackResources[?ResourceStatus=='CREATE_FAILED']"

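Fix:

Correct the specific property combination. Check regional availability. Request service limit increases before deploying.

Prevention:

Test in a staging region first. Run cfn-lint in CI/CD; it catches property combination errors that validate-template misses.

# Install cfn-lint
pip install cfn-lint

# Run locally before commit
cfn-lint template.yaml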

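Part 6: Change Set Failures

Failure 11: Change set shows Replacement where you expected Modify

Symptoms:

The change set marks a production resource as Replacement
You expected an in-place modification
Replacement means downtime

Root cause:

Some property changes force replacement. For RDS, changing Engine or DBSubnetGroupName replaces the instance, while changes such as EngineVersion or DBInstanceClass are normally applied in place with an interruption. The "Update requires" entry for each property in the resource reference states which behavior applies.

Diagnosis:

Check which resources are marked for replacement, then look at their Details for the property that triggers it:

aws cloudformation describe-change-set --stack-name prod-stack \
  --change-set-name my-change-set \
  --query "Changes[?ResourceChange.Replacement=='True']"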

Properties that commonly force replacement:

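| Resource | Properties that force replacement |
| --- | --- |
| AWS::RDS::DBInstance | Engine, DBSubnetGroupName, DBInstanceIdentifier |
| AWS::EC2::Instance | ImageId, SubnetId, InstanceType (instance store-backed instances only) |
| AWS::S3::Bucket | BucketName |
| AWS::Lambda::Function | FunctionName (a Code change updates in place) |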

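Fix:

Accept the replacement and plan for downtime
Use a blue/green deployment for a zero-downtime replacement
Modify the resource directly in the AWS console (not recommended for IaC)

Prevention:

Always review change sets in staging before production. Know which properties of your critical resources cause replacement.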
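Change set review can also be enforced by a script that refuses to proceed when a replacement is pending. The sketch below takes the parsed output of describe-change-set (a dict in the shape the CloudFormation API returns) and lists resources marked Replacement "True" or "Conditional", with the properties behind each. `find_replacements` is a hypothetical helper name.

```python
def find_replacements(change_set):
    """Return resources that a change set will, or may, replace."""
    risky = []
    for change in change_set.get("Changes", []):
        resource = change.get("ResourceChange", {})
        if resource.get("Replacement") not in ("True", "Conditional"):
            continue
        # Collect the property names whose change contributes to this entry.
        properties = sorted({
            detail["Target"]["Name"]
            for detail in resource.get("Details", [])
            if detail.get("Target", {}).get("Attribute") == "Properties"
            and detail["Target"].get("Name")
        })
        risky.append({
            "LogicalResourceId": resource.get("LogicalResourceId"),
            "ResourceType": resource.get("ResourceType"),
            "Replacement": resource["Replacement"],
            "Properties": properties,
        })
    return risky
```

A CI step might load the JSON printed by the CLI command above and stop the pipeline when this list is non-empty for a production stack, leaving the decision to a human.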

Failure 12: Change set execution fails because of an update conflict

Symptoms:

Change set is created successfully
execute-change-set fails
Error: Cannot update stack because another update is in progress

Root cause:

Another process (a CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting to be executed.

Diagnosis:

# Check current stack status
aws cloudformation describe-stacks --stack-name prod-stack \
  --query "Stacks[0].StackStatus"
# Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked

Fix:

Wait for the other update to finish, then create a new change set against the latest stack state. Do not execute the old change set; it is now stale.

# Delete old change set
aws cloudformation delete-change-set --stack-name prod-stack \
  --change-set-name old-change-set

# Create new change set against current stack
aws cloudformation create-change-set --stack-name prod-stack \
  --change-set-name new-change-set --template-body file://template.yaml

# Execute fresh change set
aws cloudformation execute-change-set --stack-name prod-stack \
  --change-set-name new-change-set

Prevention:

Implement stack-level locking through S3 condition keys or a custom resource
Coordinate CI/CD pipelines so they never deploy to the same stack at the same time
Use separate stacks for separate environments

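Part 7: Performance and Quota Failures

Failure 13: Stack deployment times out because of API rate limiting

Symptoms:

Stack deployment slows down noticeably after a few hundred resources
Errors from various AWS APIs: Rate exceeded
Some resources need 5-10 retries to succeed

Root cause:

CloudFormation makes many API calls to create resources, and AWS APIs are rate limited. Large stacks hit those limits.

Diagnosis:

# Check CloudTrail for throttle errors. ThrottlingException is an error code,
# not an event name, so filter on the event body instead of EventName
aws cloudtrail lookup-events --max-items 500 \
  --query "Events[?contains(CloudTrailEvent, 'ThrottlingException')].[EventTime,EventSource,EventName]" \
  --output table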

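Immediate fix:

Split the stack. The hard limit is 500 resources per stack; keeping each stack to roughly 200 resources or fewer is a reasonable target for deployment speed.

# List resources by type to see distribution
aws cloudformation list-stack-resources --stack-name large-stack \
  --query "StackResourceSummaries[*].[ResourceType]" --output text | sort | uniq -c

Long-term fix:

Design modular stacks: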

network-stack.yaml     (VPC, subnets, route tables)
data-stack.yaml        (RDS, ElastiCache, S3)
compute-stack.yaml     (ASG, launch templates)
app-stack.yaml         (Lambda, API Gateway)

Prevention:

Monitor stack creation time. If a stack of stateless resources takes more than 15 minutes to create, split it.

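Failure 14: Service quota exceeded during deployment

Symptoms:

Deployment fails
Error: You have reached your limit of X resources

Root cause:

AWS accounts have default service limits, and you tried to create more resources than allowed.

Common default quotas (they change over time, so confirm the current values in Service Quotas):

VPCs per region: 5
VPC security groups per region: 2,500
RDS instances per region: 40
Lambda concurrent executions: 1000

Diagnosis:

# Look up the applied quota value and compare it with your current usage
# (L-12345678 is a placeholder quota code)
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-12345678

# List all quotas for a service
aws service-quotas list-service-quotas --service-code rds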

Immediate fix:

Request a quota increase through AWS Support or the Service Quotas API:

aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-12345678 \
  --desired-value 100

Tactical fix:

Reduce the number of resources in the current deployment. Use smaller instance sizes. Share resources across stacks.

Prevention:

Include a pre-deployment quota check in the CI/CD pipeline:

# Script to check quotas before deploying
python scripts/check_quotas.py --template template.yaml
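The article does not show scripts/check_quotas.py. As a rough sketch of the core of such a check, the functions below count the resources a parsed template would create per type and compare them with the remaining headroom. The `headroom` mapping (resource type to how many more the account can create) is an assumption; in practice it would be derived from Service Quotas values minus current usage.

```python
from collections import Counter


def count_resources_by_type(template):
    """Count resources per CloudFormation type in a parsed template dict."""
    return Counter(
        resource["Type"] for resource in template.get("Resources", {}).values()
    )


def find_quota_violations(template, headroom):
    """Return {type: (requested, available)} for types that exceed the headroom."""
    counts = count_resources_by_type(template)
    return {
        resource_type: (requested, headroom[resource_type])
        for resource_type, requested in counts.items()
        if resource_type in headroom and requested > headroom[resource_type]
    }
```

This only counts resources declared directly in one template; nested stacks, Auto Scaling capacity, and quotas measured in units other than resource counts would need their own handling.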

Part 8: Troubleshooting Workflow: Where to Start

When a CloudFormation deployment fails, follow this workflow:

Step 1: Get the original error

aws cloudformation describe-stack-events --stack-name prod-stack \
  --max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"

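Look for the ResourceStatusReason field. It is your main clue.

Step 2: Identify the failed resource

The error message tells you which logical resource failed. Find its type and properties in the template.

Step 3: Check whether it is a known failure pattern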

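| Error message pattern | Likely cause | Fix section |
| --- | --- | --- |
| Role does not exist | IAM eventual consistency | Part 1, Failure 1 |
| Rate exceeded | API throttling | Part 7, Failure 13 |
| Limit exceeded | Service quota | Part 7, Failure 14 |
| Deletion protection | Rollback blocked | Part 3, Failure 6 |
| Another update in progress | Concurrent update | Part 6, Failure 12 |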
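The lookup in this table can be encoded as a small classifier for runbooks or pipeline output. The sketch below maps a ResourceStatusReason string to the likely cause and section; the regular expressions are assumptions chosen to match the example messages quoted in this article, and `classify_failure` is a hypothetical name.

```python
import re

# (pattern, likely cause, section) in the same order as the table above.
KNOWN_PATTERNS = [
    (r"role\b.*does not exist", "IAM eventual consistency", "Part 1, Failure 1"),
    (r"rate exceeded", "API throttling", "Part 7, Failure 13"),
    (r"limit exceeded|reached your limit", "Service quota", "Part 7, Failure 14"),
    (r"deletion protection", "Rollback blocked", "Part 3, Failure 6"),
    (r"another update is in progress", "Concurrent update", "Part 6, Failure 12"),
]


def classify_failure(status_reason):
    """Return (likely cause, section) for a known pattern, or None if nothing matches."""
    for pattern, cause, section in KNOWN_PATTERNS:
        if re.search(pattern, status_reason, flags=re.IGNORECASE):
            return cause, section
    return None
```

An unmatched message returns None, which is the cue to continue with the remaining steps rather than guess.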

Step 4: Deploy with --disable-rollback to debug

aws cloudformation create-stack --stack-name debug-stack \
  --template-body file://template.yaml \
  --disable-rollback

Failed resources are left in place so you can inspect them directly.

Step 5: Inspect the failed resource directly

For EC2:

aws ec2 describe-instances --instance-ids i-12345
ssh ec2-user@instance-ip  # Check logs

For Lambda:

aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
# Read the most recently active log stream
aws logs get-log-events --log-group-name /aws/lambda/my-function \
  --log-stream-name "$(aws logs describe-log-streams --log-group-name /aws/lambda/my-function \
    --order-by LastEventTime --descending --limit 1 \
    --query "logStreams[0].logStreamName" --output text)"

For RDS:

aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance

Step 6: Fix, then continue

If the stack is in ROLLBACK_FAILED or UPDATE_ROLLBACK_FAILED, you have two options:

Option A: Delete the failed stack and recreate it

aws cloudformation delete-stack --stack-name prod-stack
# Wait for deletion
aws cloudformation wait stack-delete-complete --stack-name prod-stack
aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml

Option B: Fix the blocker, then continue the rollback

# Fix the blocking resource (empty S3 bucket, disable deletion protection)
# Then retry the rollback (applies to stacks in UPDATE_ROLLBACK_FAILED)
aws cloudformation continue-update-rollback --stack-name prod-stack

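Production CloudFormation Checklist

Before deploying to production, verify the following:

Drift detection

Enabled on all production stacks
Weekly automated drift checks configured
Alerts configured for drift findings

Rollback strategy

Stateful resources have DeletionPolicy: Retain or Snapshot
Stateless resources have DeletionPolicy: Delete
Stateful and stateless resources live in separate stacks

IAM and security

No Action: "*" in policies
Secrets use {{resolve:secretsmanager:...}} rather than parameters
CI/CD role has the minimum required permissions
cfn-guard or cfn-lint runs in CI

Failure handling

CreationPolicy includes a timeout and signal handling
Custom resources always send a SUCCESS or FAILED response
Nested stack depth is two levels or fewer

Performance

No stack exceeds 200 resources
No stack consistently takes more than 15 minutes to deploy
Service quotas are checked before deployment

Troubleshooting readiness

describe-stack-events commands documented in the runbook
Access to logs for failed resources (EC2, Lambda, RDS) is available
--disable-rollback is used in staging deployments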

Original article: CloudFormation in Production: What Breaks and How to Fix It

https://dev.to/leonardkachi/cloudformation-in-production-what-breaks-and-how-to-fix-it-4if1