Chaos Engineering in Practice: Building a Production-Grade Fault Injection and Resilience Testing System with Chaos Mesh
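Why Chaos Engineering
"Assume it will fail" is a basic tenet of SRE (Site Reliability Engineering). The core principle of chaos engineering follows from it: rather than wait for a real failure, inject failures deliberately and verify that the system copes.
Netflix's Chaos Monkey project is the usual reference point: systems hardened through chaos engineering reportedly see MTTR (mean time to recovery) drop by 67% when real failures occur.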
1. Installing and Configuring Chaos Mesh
1.1 Installation
# Install Chaos Mesh v2.7 with Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-mesh
# dnsServer.create=true enables DNS fault injection
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-mesh \
  --version 2.7.0 \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.service.type=NodePort \
  --set dnsServer.create=true
# Verify the installation
kubectl get pods -n chaos-mesh
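Reading the Pod list by eye works for a one-off install, but an automated pipeline needs a yes/no answer. The following is a minimal sketch, assuming the JSON printed by kubectl get pods -n chaos-mesh -o json is passed in as a string; the function name unready_pods is hypothetical and not part of Chaos Mesh.

```python
import json


def unready_pods(pod_list_json: str) -> list[str]:
    """Return the names of Pods that are not Running with every container ready."""
    pods = json.loads(pod_list_json)["items"]
    bad = []
    for pod in pods:
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A Pod with no container statuses yet has not started, so it counts as unready.
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            bad.append(pod["metadata"]["name"])
    return bad


# Hand-written sample in the shape of `kubectl get pods -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "chaos-daemon-x"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "chaos-dashboard-y"},
         "status": {"phase": "Pending"}},
    ]
})
print(unready_pods(SAMPLE))
```

An empty list means every Pod in the namespace is up; anything else names the Pods to inspect before running experiments.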
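1.2 RBAC Configuration
# Dedicated ServiceAccount for the chaos engineering team
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-operator
  namespace: staging   # scoped to the staging namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-operator-role
  namespace: staging
rules:
  # Allow managing chaos experiments
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Pods are read-only. A namespaced Role grants nothing outside staging,
  # so this account has no access to the production namespace.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# Bind the Role to the ServiceAccount; without a binding the Role has no effect
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-operator-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: chaos-operator
    namespace: staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-operator-role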
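RBAC is what actually enforces the namespace boundary, but a pre-flight check in review or CI can catch a manifest that targets the wrong namespace before it is ever applied. The sketch below assumes the manifest has already been parsed into a Python dict (for example by a YAML library); ALLOWED_NAMESPACES, target_namespaces and violations are hypothetical names chosen for illustration.

```python
ALLOWED_NAMESPACES = {"staging"}  # assumption: only staging is open for experiments


def target_namespaces(manifest: dict) -> set[str]:
    """Collect every namespace a chaos manifest lives in or selects Pods from."""
    found: set[str] = set()
    own = manifest.get("metadata", {}).get("namespace")
    if own:
        found.add(own)

    def walk(node) -> None:
        # Selectors can be nested (spec.selector, spec.target.selector,
        # workflow templates), so walk the whole spec.
        if isinstance(node, dict):
            selector = node.get("selector")
            if isinstance(selector, dict):
                found.update(selector.get("namespaces") or [])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(manifest.get("spec", {}))
    return found


def violations(manifest: dict, allowed=ALLOWED_NAMESPACES) -> set[str]:
    """Return the namespaces the manifest touches that are not allowed."""
    return target_namespaces(manifest) - set(allowed)


EXAMPLE = {
    "kind": "NetworkChaos",
    "metadata": {"name": "demo", "namespace": "staging"},
    "spec": {
        "selector": {"namespaces": ["staging"]},
        "target": {"selector": {"namespaces": ["production"]}},
    },
}
print(violations(EXAMPLE))
```

A non-empty result is a reason to reject the manifest; an empty set only means the namespaces look right, not that the experiment is safe.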
2. Common Fault Scenarios
2.1 Pod Faults
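# Randomly kill Pods (simulates a crashed or evicted Pod)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
  namespace: staging
spec:
  action: pod-kill
  mode: random-max-percent   # pick a random share of the matching Pods
  value: "30"                # at most 30% of them
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: api-service
  gracePeriod: 10            # allow 10 seconds for graceful shutdown
---
# Optional recurring run. Chaos Mesh 2.x replaced the 1.x `scheduler` field
# with a separate Schedule resource that embeds the experiment spec.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-hourly
  namespace: staging
spec:
  schedule: "0 * * * *"      # every hour: kill up to 30% of the Pods
  type: PodChaos
  concurrencyPolicy: Forbid
  historyLimit: 3
  podChaos:
    action: pod-kill
    mode: random-max-percent
    value: "30"
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: api-service
    gracePeriod: 10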
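With mode random-max-percent and a value of 30, the practical question is how many replicas can disappear at once and whether the service still has enough capacity afterwards. The sketch below works out that upper bound. It assumes the percentage is rounded down to whole Pods, which is an assumption about Chaos Mesh's behaviour and not something the configuration above states; both function names are hypothetical.

```python
def max_pods_affected(total_pods: int, max_percent: int) -> int:
    """Upper bound on Pods killed in one round (assumes rounding down)."""
    if total_pods < 0:
        raise ValueError("total_pods must not be negative")
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    return total_pods * max_percent // 100


def survives_worst_case(total_pods: int, max_percent: int, min_healthy: int) -> bool:
    """True if at least min_healthy Pods remain after the worst-case round."""
    return total_pods - max_pods_affected(total_pods, max_percent) >= min_healthy


for replicas in (3, 7, 10):
    print(replicas, max_pods_affected(replicas, 30))
```

Under this assumption a 3-replica Deployment loses no Pods at 30%, so small services may need a fixed count instead of a percentage to be exercised at all.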
2.2 Network Faults
# Network delay injection (simulates cross-region calls)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces: [staging]
    labelSelectors:
      app: frontend
  delay:
    latency: "200ms"
    correlation: "50"   # 50% correlation with the previous delay, for more realistic jitter
    jitter: "50ms"      # ±50ms jitter
  direction: to         # only traffic from frontend to the target
  target:
    selector:
      namespaces: [staging]
      labelSelectors:
        app: api-service
    mode: all
  duration: "5m"        # lasts 5 minutes
---
# Packet loss (simulates an unstable network)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-test
  namespace: staging
spec:
  action: loss
  mode: all
  selector:
    namespaces: [staging]
    labelSelectors:
      app: payment-service
  loss:
    loss: "10"          # 10% packet loss
    correlation: "25"
  duration: "3m"
2.3 Disk I/O Faults (simulating degraded disk performance)
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-stress-test
  namespace: staging
spec:
  action: latency             # add latency to I/O calls
  mode: one                   # affect a single Pod only
  selector:
    namespaces: [staging]
    labelSelectors:
      app: mysql
  volumePath: /var/lib/mysql
  delay: "10ms"               # 10ms extra latency per affected I/O operation
  percent: 50                 # 50% of I/O requests are affected
  duration: "10m"
2.4 Memory Pressure (simulating a memory leak)
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-test
  namespace: staging
spec:
  mode: one
  selector:
    namespaces: [staging]
    labelSelectors:
      app: java-service
  stressors:
    memory:
      workers: 4
      size: "512MB"   # consume 512MB of memory
  duration: "5m"
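3. Orchestrating Chaos Workflows
# Workflow: a complete "database failover" scenario
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: db-failover-scenario
  namespace: staging
spec:
  entry: entry
  templates:
    # Entry node: run the children in order.
    # Add a cleanup Task here if the scenario leaves state behind.
    - name: entry
      templateType: Serial
      children:
        - pre-check
        - inject-db-failure
        - wait-for-recovery
        - verify-data-integrity
    # Pre-check
    - name: pre-check
      templateType: Task
      task:
        container:
          name: check
          image: bitnami/kubectl:latest
          command:
            - sh
            - -c
            - |
              set -e  # fail the step if the replica is lagging
              # Confirm the replica has caught up
              kubectl exec -n staging mysql-replica-0 -- mysql -e "SHOW SLAVE STATUS\G" | \
                grep -q "Seconds_Behind_Master: 0"
              echo "Pre-check passed: replica lag is 0"
    # Inject the primary failure and watch the failover at the same time.
    # Parallel is required here; a Suspend node cannot have children.
    - name: inject-db-failure
      templateType: Parallel
      deadline: 10m
      children:
        - db-pod-kill
        - monitor-failover
    # Start in staging; the Role from section 1.2 grants no access to production
    - name: db-pod-kill
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces: [staging]
          labelSelectors:
            app: mysql
            role: primary
    # Wait for automatic failover
    - name: monitor-failover
      templateType: Task
      deadline: 5m
      task:
        container:
          name: monitor
          image: bitnami/kubectl:latest
          command:
            - sh
            - -c
            - |
              # Poll until the promoted replica answers queries
              # (a stricter check would also confirm it accepts writes)
              until kubectl exec -n staging mysql-replica-0 -- mysql -e \
                "SELECT 1" > /dev/null 2>&1; do
                sleep 5
                echo "Waiting for failover..."
              done
              echo "Failover complete!"
    # Pause before verification; the 2m value is an assumed placeholder
    - name: wait-for-recovery
      templateType: Suspend
      deadline: 2m
    # Data integrity verification
    - name: verify-data-integrity
      templateType: Task
      task:
        container:
          name: verify
          image: python:3.11
          command:
            - python
            - /scripts/verify_data.py   # custom consistency check; must be baked into the image or mounted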
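The workflow calls /scripts/verify_data.py but the script itself is not shown. One possible shape for its core logic is sketched below: compare a snapshot taken before the fault with one taken after failover. The snapshot format (table name mapped to row count and checksum) and all function names are assumptions for illustration, and the comparison is only meaningful if writes are paused or accounted for between the two snapshots.

```python
# Hypothetical snapshot format: table name -> (row count, checksum string).
Snapshot = dict[str, tuple[int, str]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> list[str]:
    """Describe every table whose state differs between the two snapshots."""
    problems = []
    for table in sorted(before.keys() | after.keys()):
        if table not in after:
            problems.append(f"{table}: missing after failover")
        elif table not in before:
            problems.append(f"{table}: unexpected new table")
        elif before[table][0] != after[table][0]:
            problems.append(
                f"{table}: row count {before[table][0]} -> {after[table][0]}"
            )
        elif before[table][1] != after[table][1]:
            problems.append(f"{table}: checksum changed")
    return problems


def exit_code(before: Snapshot, after: Snapshot) -> int:
    """Print the differences and return a process exit code (0 = consistent)."""
    problems = diff_snapshots(before, after)
    for problem in problems:
        print(problem)
    return 1 if problems else 0


BEFORE = {"orders": (100, "abc"), "users": (10, "x1")}
AFTER = {"orders": (99, "abc"), "users": (10, "x1")}
print(exit_code(BEFORE, AFTER))
```

A real script would fill the snapshots from the database and pass the result to sys.exit so that the workflow Task fails when data was lost.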
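4. CI/CD Integration
# .github/workflows/chaos-test.yml
name: Chaos Engineering Tests
on:
  schedule:
    - cron: '0 2 * * 1'   # every Monday at 02:00 (GitHub Actions cron runs in UTC)
  workflow_dispatch:        # allow manual runs
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v4
      - name: Run network chaos test
        run: |
          # Inject the network fault
          kubectl apply -f chaos/network-delay-test.yaml
          # Run the load test at the same time
          kubectl apply -f chaos/load-test-job.yaml
          # Wait 5 minutes
          sleep 300
          # Check service health over the 5-minute window. sum() is required:
          # without it the division only matches the status="200" series and always yields 1.
          QUERY='sum(rate(http_requests_total{status="200"}[5m])) / sum(rate(http_requests_total[5m]))'
          SUCCESS_RATE=$(curl -s http://prometheus/api/v1/query \
            --data-urlencode "query=${QUERY}" | \
            jq -r '.data.result[0].value[1] // empty')
          if [ -z "$SUCCESS_RATE" ]; then
            echo "❌ Prometheus returned no success-rate data"
            exit 1
          fi
          echo "Current success rate: ${SUCCESS_RATE}"
          if awk -v r="$SUCCESS_RATE" 'BEGIN { exit !(r < 0.95) }'; then
            echo "❌ Success rate below 95%, chaos test failed"
            exit 1
          fi
          echo "✅ Chaos test passed: success rate stayed at or above 95% under network faults"
      - name: Cleanup
        if: always()
        run: |
          kubectl delete -f chaos/ --ignore-not-found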
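The pass/fail decision in the step above depends on parsing a Prometheus response inside a shell pipeline, where an empty result or a NaN can slip through unnoticed. As an alternative, the same decision can be made in a small script. This is a minimal sketch, assuming the body of a Prometheus instant-query response is passed in as a string; success_rate and passes are hypothetical helper names.

```python
import json
import math


def success_rate(response_body: str) -> float:
    """Extract the first sample value from a Prometheus instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    result = payload.get("data", {}).get("result", [])
    if not result:
        raise ValueError("query returned no series")
    # Each sample is [timestamp, "value"], with the value encoded as a string.
    value = float(result[0]["value"][1])
    if math.isnan(value):
        raise ValueError("success rate is NaN (no traffic in the window?)")
    return value


def passes(response_body: str, threshold: float = 0.95) -> bool:
    """True if the measured success rate is at or above the threshold."""
    return success_rate(response_body) >= threshold


# Hand-written sample in the shape of a Prometheus API response.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "0.97"]}],
    },
})
print(passes(SAMPLE_RESPONSE))
```

Treating missing data as a failure is a deliberate choice here: a chaos test that cannot measure its own outcome should not be reported as passed.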
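5. Safety Rules for Chaos Engineering
- Start small: run in staging first, then against 1% of production traffic, and expand step by step
- Define the blast radius: before each experiment, state the worst case explicitly and confirm it is acceptable
- Have an emergency stop ready: Chaos Mesh can pause a running experiment from the Dashboard or with the experiment.chaos-mesh.org/pause=true annotation
- Automate monitoring: watch the SLOs throughout the experiment and abort as soon as one is violated
- Record everything: write down the hypothesis, the results, and the follow-up improvements for every experiment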
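The emergency-stop and automated-monitoring rules can be combined into a small watchdog that polls an SLO and pauses the experiment on the first breach. The sketch below is one way to structure it under stated assumptions: the metric reader and the abort action are injected as callables so the loop can be tested without a cluster, and the function names, threshold, and polling interval are illustrative. The pause annotation is the one Chaos Mesh uses to pause an experiment.

```python
import time
from typing import Callable


def pause_command(kind: str, name: str, namespace: str) -> list[str]:
    """Build the kubectl command that pauses one Chaos Mesh experiment."""
    return [
        "kubectl", "annotate", kind, name,
        "-n", namespace,
        "experiment.chaos-mesh.org/pause=true",
        "--overwrite",
    ]


def watch_slo(
    read_success_rate: Callable[[], float],
    abort: Callable[[], None],
    threshold: float = 0.95,
    checks: int = 30,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the SLO; call abort() and return False on the first breach."""
    for i in range(checks):
        if read_success_rate() < threshold:
            abort()
            return False
        if i < checks - 1:
            sleep(interval_s)
    return True


# Dry run with canned readings instead of a live metrics backend.
readings = iter([0.99, 0.98, 0.90, 0.99])
aborted = []
held = watch_slo(
    read_success_rate=lambda: next(readings),
    abort=lambda: aborted.append(pause_command("networkchaos", "network-delay-test", "staging")),
    sleep=lambda _seconds: None,
)
print(held, aborted)
```

In real use, abort would run the returned command (for example with subprocess.run) and read_success_rate would query the monitoring system.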
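Chaos engineering is not about breaking the system. It is about finding weak points ahead of time under controlled conditions, so that the team is prepared when a real failure arrives.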