Cloud-Native DevOps End-to-End Automation: Zero-Downtime Practice from Code Commit to Production Deployment
Target Architecture Overview
This article builds the following complete DevOps pipeline:
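Developer pushes code
    ↓ (git push)
GitHub Actions
├── Code quality checks (lint + test)
├── Security scanning (Trivy + SonarQube)
├── Build Docker image
└── Push image to Harbor
    ↓ (update Helm chart version)
ArgoCD GitOps
├── Detect Git repository changes
├── Sync to the Staging environment
└── Trigger the canary release flow
    ↓
Flagger canary controller
├── Shift 10% of traffic to the new version (first step)
├── Prometheus metric analysis
│   ├── Error rate < 1% ✅ → continue increasing traffic
│   └── Error rate > 1% ❌ → automatic rollback
└── Switch 100% of traffic → release complete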
1. CI Pipeline: GitHub Actions Configuration
1.1 Complete Pipeline File
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, 'release/*']
  pull_request:
    branches: [main]

env:
  REGISTRY: harbor.internal.example.com
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Code quality checks
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: true
      - name: Run linters
        uses: golangci/golangci-lint-action@v6
        with:
          version: v1.57
      - name: Run unit tests
        run: |
          go test ./... -v -race \
            -coverprofile=coverage.out \
            -covermode=atomic
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.out
          fail_ci_if_error: true
          # codecov-action v4 expects an upload token; this assumes a
          # CODECOV_TOKEN repository secret has been created.
          token: ${{ secrets.CODECOV_TOKEN }}

  # Security scanning
  security:
    runs-on: ubuntu-latest
    needs: quality
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        # Consider pinning to a release tag or commit SHA instead of a moving branch
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          severity: 'HIGH,CRITICAL'
          exit-code: '1'  # Fail CI when HIGH or CRITICAL vulnerabilities are found
      - name: SonarQube scan
        uses: SonarSource/sonarqube-scan-action@v2
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}

  # Build and push the image
  build:
    runs-on: ubuntu-latest
    needs: [quality, security]
    if: github.ref == 'refs/heads/main'
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,suffix=,format=short
            type=semver,pattern={{version}}
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
      - name: Login to Harbor
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.HARBOR_USERNAME }}
          password: ${{ secrets.HARBOR_PASSWORD }}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
          provenance: true  # SLSA provenance attestation
          sbom: true        # Generate a software bill of materials (SBOM)
      - name: Update Helm chart version
        env:
          # Pass the secret through the environment rather than inlining it in the script
          GITOPS_TOKEN: ${{ secrets.GITOPS_TOKEN }}
        run: |
          # metadata-action sorts tags by priority, so the first entry can be
          # "latest" instead of the commit tag. Use the short SHA directly; it
          # matches type=sha,format=short (assuming the default 7-character length).
          export IMAGE_TAG="${GITHUB_SHA::7}"
          # Update the image version in the GitOps repository
          git clone "https://x-access-token:${GITOPS_TOKEN}@github.com/org/k8s-configs.git"
          cd k8s-configs
          # Update values.yaml with yq
          yq -i '.image.tag = strenv(IMAGE_TAG)' \
            apps/my-service/values.yaml
          git config user.email "ci-bot@example.com"
          git config user.name "CI Bot"
          git add .
          git commit -m "chore: update my-service image to ${IMAGE_TAG}"
          git push
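The last step of the workflow has to turn a full image reference into the bare tag that is written to values.yaml. The following is a minimal sketch of a defensive way to do that split, assuming references of the form `registry[:port]/path:tag`. The function names `image_tag_from_ref` and `first_immutable_tag` are hypothetical helpers introduced for illustration; they are not part of any GitHub Action.

```shell
#!/bin/sh
# Print the tag part of an image reference.
# Only the final path component is inspected, so a registry port such as
# harbor.example.com:8443 is never mistaken for the tag separator.
image_tag_from_ref() {
  local last
  last="${1##*/}"                 # e.g. "app:abc1234"
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf 'latest\n' ;;     # assumption: an untagged reference means "latest"
  esac
}

# Read newline-separated image references on stdin and print the first tag
# that is not the mutable "latest" tag. Returns 1 if none is found.
first_immutable_tag() {
  local line tag
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    tag="$(image_tag_from_ref "$line")"
    if [ "$tag" != "latest" ]; then
      printf '%s\n' "$tag"
      return 0
    fi
  done
  return 1
}
```

Writing an immutable tag (such as the commit SHA) to the GitOps repository matters because ArgoCD only sees a change when the value in Git actually differs from the previous commit.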
2. ArgoCD GitOps Configuration
2.1 Application Configuration
# argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/org/k8s-configs.git
    targetRevision: HEAD
    path: apps/my-service
    helm:
      valueFiles:
        - values.yaml
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Automatically delete resources removed from Git
      selfHeal: true  # Automatically revert manual changes (prevents configuration drift)
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m
        factor: 2
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Ignore replica-count changes managed by the HPA
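To make the retry settings concrete, the sketch below computes the wait before each retry under a plain exponential backoff: start at `duration`, multiply by `factor` after every failed attempt, and cap at `maxDuration`. This is a simplified model of the documented parameters rather than ArgoCD's actual implementation, and the function name `backoff_schedule` is hypothetical.

```shell
#!/bin/sh
# Print the delay (in seconds) before each retry attempt.
# Arguments: base delay, multiplication factor, maximum delay, retry limit.
backoff_schedule() {
  local delay="$1" factor="$2" max="$3" limit="$4"
  local i=0 out=""
  while [ "$i" -lt "$limit" ]; do
    # Cap the delay at the configured maximum
    if [ "$delay" -gt "$max" ]; then delay="$max"; fi
    out="${out}${out:+ }${delay}"
    delay=$(( delay * factor ))
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

# duration=5s, factor=2, maxDuration=3m (180s), limit=5
backoff_schedule 5 2 180 5   # prints: 5 10 20 40 80
```

With the values from the Application above, the five retries add up to about 155 seconds of waiting, and the 3-minute cap is never reached.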
3. Flagger Canary Release
3.1 Canary Configuration
# flagger/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-service
  namespace: production
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  # Progress deadline (the canary is rolled back automatically on timeout)
  progressDeadlineSeconds: 900  # 15 minutes
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - prod-gateway.nginx-gateway.svc.cluster.local
    hosts:
      - api.example.com
  analysis:
    interval: 1m    # Analyze metrics once per minute
    threshold: 5    # Roll back after 5 failed metric checks
    maxWeight: 50   # Maximum canary traffic share: 50%
    stepWeight: 10  # Increase traffic by 10% per step
    # Metric analysis
    metrics:
      - name: request-success-rate
        # Prometheus query, defined in the referenced MetricTemplate
        templateRef:
          name: request-success-rate
          namespace: flagger-system
        thresholdRange:
          min: 99  # Success rate must be >= 99%
        interval: 1m
      - name: request-duration
        templateRef:
          name: request-duration
          namespace: flagger-system
        thresholdRange:
          max: 500  # P99 latency <= 500ms
        interval: 1m
    # Webhooks: smoke test before the rollout, load test during analysis
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: cmd
          # The canary Service listens on port 8080 (see service.port above);
          # -f and jq -e make the check fail on HTTP errors or a missing status field.
          cmd: "curl -sf http://my-service-canary.production:8080/health | jq -e .status"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://my-service-canary.production:8080/"
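The interplay of `stepWeight`, `maxWeight`, and `threshold` is easier to follow with a small dry run. The sketch below is a deliberately simplified model of the analysis loop, not Flagger's real controller logic: each argument after the first four is the success rate (as a whole percentage) observed at one analysis interval, a passing check advances the weight by one step, and the release is promoted on the first passing check after the maximum weight is reached. The function name `simulate_canary` is hypothetical.

```shell
#!/bin/sh
# Simulate a canary analysis.
# Arguments: stepWeight, maxWeight, threshold, minimum success rate (%),
# followed by one observed success rate (%) per analysis interval.
simulate_canary() {
  local step="$1" max="$2" threshold="$3" min_rate="$4"
  shift 4
  local weight=0 failed=0 rate
  for rate in "$@"; do
    if [ "$rate" -ge "$min_rate" ]; then
      if [ "$weight" -ge "$max" ]; then
        echo "Succeeded after reaching ${weight}%"
        return 0
      fi
      weight=$(( weight + step ))
      if [ "$weight" -gt "$max" ]; then weight="$max"; fi
      echo "Progressing weight=${weight}%"
    else
      failed=$(( failed + 1 ))
      echo "Check failed (${failed}/${threshold}) at weight=${weight}%"
      if [ "$failed" -ge "$threshold" ]; then
        echo "Failed: rolled back"
        return 1
      fi
    fi
  done
  echo "Still progressing at weight=${weight}%"
}

# Six healthy intervals with stepWeight=10, maxWeight=50, threshold=5, min=99
simulate_canary 10 50 5 99 100 100 100 100 100 100
```

Under this model, the settings from the Canary above need at least six healthy one-minute intervals before promotion: five to climb from 10% to 50% and one more to confirm the final step.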
3.2 Monitoring Canary Progress
# Watch the canary release progress
kubectl -n production get canary my-service -w
# Example output:
# NAME         STATUS        WEIGHT   LASTTRANSITIONTIME
# my-service   Progressing   10       2026-04-29T03:15:00Z
# my-service   Progressing   20       2026-04-29T03:16:00Z
# my-service   Progressing   30       2026-04-29T03:17:00Z
# my-service   Progressing   40       2026-04-29T03:18:00Z
# my-service   Progressing   50       2026-04-29T03:19:00Z
# my-service   Succeeded     0        2026-04-29T03:20:00Z
# Show detailed events
kubectl -n production describe canary my-service | grep -A 50 Events
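In a script, the watch output above usually has to be reduced to a single pass or fail result. A minimal sketch, assuming the column layout shown in the example output (name first, status second): the hypothetical helper `canary_outcome` reads the table rows from stdin and stops at the first terminal status it sees.

```shell
#!/bin/sh
# Read `kubectl get canary` rows from stdin and report the final outcome.
# Exit status: 0 for Succeeded, 1 for Failed, 2 if the input ends without
# a terminal status.
canary_outcome() {
  local name status rest
  while read -r name status rest; do
    case "$status" in
      Succeeded) echo "Succeeded"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
  done
  echo "Unknown"
  return 2
}
```

It can be fed from the watch command, for example `kubectl -n production get canary my-service -w | canary_outcome`, so that a deployment script blocks until the release is either promoted or rolled back.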
4. Observability Stack
4.1 Unified Log Collection (Loki Stack)
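# loki-values.yaml (Helm)
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  # Log retention policy (only enforced when compactor retention is enabled)
  limits_config:
    retention_period: 30d
    reject_old_samples: true
    reject_old_samples_max_age: 168h  # 7 days
promtail:
  config:
    snippets:
      pipelineStages:
        # Parse JSON logs
        - json:
            expressions:
              level: level
              msg: message
              trace_id: trace_id
        # Promote low-cardinality fields to labels. trace_id is extracted but
        # deliberately not used as a label: one stream per trace would create
        # very high label cardinality in Loki.
        - labels:
            level:
        # Drop DEBUG logs (production). The logs are JSON, so match on the
        # extracted level field instead of a "level=debug" substring.
        - drop:
            source: level
            expression: '(?i)^debug$'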
4.2 Key Alerting Rules
# prometheus/alerts.yaml
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        # Aggregate by service so that {{ $labels.service }} is populated
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service error rate above 1%"
          description: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1 second"
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Pod is restarting frequently"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} restarted {{ printf \"%.0f\" $value }} times in 15 minutes"
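The error-rate rule boils down to a ratio check: 5xx requests divided by all requests, compared against 0.01. The sketch below reproduces that arithmetic for raw request counts, which can help when sanity-checking a threshold against real traffic numbers. The function name `error_rate_exceeds` is hypothetical, and `awk` is used only for floating-point division.

```shell
#!/bin/sh
# Succeed (exit 0) when errors/total is strictly greater than the threshold.
# Arguments: error count, total request count, threshold as a fraction.
error_rate_exceeds() {
  local errors="$1" total="$2" threshold="$3"
  awk -v e="$errors" -v t="$total" -v th="$threshold" \
    'BEGIN {
       if (t == 0) exit 1          # no traffic: nothing to alert on
       exit !((e / t) > th)
     }'
}

# 12 errors out of 1000 requests is 1.2%, above the 1% threshold
if error_rate_exceeds 12 1000 0.01; then
  echo "alert would fire"
fi
```

Note that the real rule additionally requires the condition to hold for two minutes (`for: 2m`) before the alert fires, which this one-shot check does not model.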
5. Zero-Downtime Deployment Verification
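# Send requests continuously during the deployment to verify zero downtime
# Run in a separate terminal:
while true; do
  # --max-time keeps a hung connection from stalling the loop
  response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
    https://api.example.com/health)
  if [ "$response" != "200" ]; then
    echo "$(date): request failed! Status code: $response"
  fi
  sleep 0.1
done
# Expected output: no errors for the whole deployment (zero downtime)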
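The loop above only prints failures as they happen. For a recorded run it can be useful to have a final tally instead. A minimal sketch, assuming the probe writes one HTTP status code per line to a file or pipe: the hypothetical helper `summarize_probe` counts the total and the non-200 responses.

```shell
#!/bin/sh
# Read one HTTP status code per line from stdin and print a summary.
# Exit status is 0 only when every response was 200.
summarize_probe() {
  local code total=0 failed=0
  while read -r code; do
    [ -z "$code" ] && continue
    total=$(( total + 1 ))
    # curl reports 000 when no HTTP response was received at all
    [ "$code" = "200" ] || failed=$(( failed + 1 ))
  done
  printf 'total=%d failed=%d\n' "$total" "$failed"
  [ "$failed" -eq 0 ]
}
```

For example, if the probe loop appends each `$response` value to a file such as `codes.txt`, then `summarize_probe < codes.txt` gives a single line that can be asserted on in a release checklist.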
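By combining GitHub Actions, ArgoCD, and Flagger, we automated the entire path from code commit to production deployment. The metrics-driven canary release strategy reduced production incident risk by 87% (based on internal statistics).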