Prometheus 监控容器启动时间并自动化处理

2025-03-24 tech prometheus 9 mins 1 图 3251 字

背景

在云原生场景中，容器的实际运行时长（非首次启动时间）是关键监控指标。本文通过 Node Exporter + 自定义脚本实现容器启动时间的精准采集，解决 cAdvisor 无法获取容器重启后运行时长的问题：

cAdvisor 提供了 container_start_time_seconds 指标，记录容器的启动时间戳（Unix 时间），但其记录的是容器的 初始创建时间 ，而非最近一次重启后的时间。

方案设计

数据源：通过 docker inspect 获取容器的 StartedAt 时间（最后一次启动时间）
暴露指标：利用 Node Exporter 的 textfile collector 将时间戳转换为 Prometheus 指标
可视化：通过 PromQL 计算 time() - container_last_started_time_seconds 获得运行时长

实现步骤

1. 编写采集脚本

#!/bin/bash
PROMETHEUS_FILE="/var/lib/node-exporter/textfile-collector/container_started.prom"

# 初始化指标文件
cat <<EOF > $PROMETHEUS_FILE
# HELP container_last_started_time_seconds Container last started time in Unix epoch
# TYPE container_last_started_time_seconds gauge
EOF

for s in $(docker ps -q); do
  # 获取容器最后一次启动时间
  started_at=$(docker inspect -f '{{.State.StartedAt}}' "$s" 2>/dev/null)
  clean_date=$(echo "$started_at" | sed 's/\.[0-9]*Z/Z/')  # 删除纳秒部分
  timestamp=$(date -u -d "$clean_date" +%s)  # 转换为时间戳

  # 写入指标文件
  name=$(docker inspect -f '{{.Name}}' "$s" | cut -c 2-)
  echo "container_last_started_time_seconds{name=\"$name\",id=\"$s\"} $timestamp" >> $PROMETHEUS_FILE
done

生成的内容如下：

image-20250325午後55240479

缺少 HELP 或 TYPE 注释会导致指标被忽略.

2. 配置 Node Exporter

# docker-compose.yml
node-exporter:
  image: prom/node-exporter:v0.17.0
  volumes:
    - ./textfile-collector:/var/lib/node-exporter/textfile-collector  # 挂载目录而非文件
  command:
    - '--collector.textfile'          # 启用 textfile collector
    - '--collector.textfile.directory=/var/lib/node-exporter/textfile-collector'

3. 验证指标

# 检查指标文件
cat /var/lib/node-exporter/textfile-collector/container_started.prom
# 输出示例：
# container_last_started_time_seconds{name="nginx",id="abc123"} 1717182000

# Prometheus 查询
time() - container_last_started_time_seconds{name="nginx"}

监控告警

设置 alert.rules，如下配置：

- name: docker-alerts
  rules:
  - alert: 容器-长时间运行
    expr: |
      (time() - container_last_started_time_seconds)/ 86400 > 14
    for: 5m                     # 持续5分钟触发
    labels:
      category: docker_long_uptime  # 新增：标记告警类别
    annotations:
      summary: "{{ $labels.node }}容器{{ $labels.name }}运行过长"
      description: |
        {{ $labels.node }}容器{{ $labels.name }}已运行 {{ printf "%.1f" $value }} 天

自动化处理

例如重启容器：

# 获取 Prometheus 告警信息
response=$(curl -s $PROMETHEUS_ALERT_API)
if [[ $? -ne 0 ]]; then
    echo "Error fetching alerts from Prometheus."
    exit 0
fi

# 检查是否是 JSON 格式
if ! echo "$response" | jq . >/dev/null 2>&1; then
    echo "Response is not valid JSON: $response"
    exit 1
fi

# 提取 firing 状态的告警
alerts=$(echo "$response" | jq -c '.data.alerts[] | select(.state=="firing")')

if [[ -z "$alerts" ]]; then
    echo -n "No firing "
    exit 0
fi

while read -r alert; do
    instance=$(echo "$alert" | jq -r '.labels.instance')
    node=$(echo "$alert" | jq -r '.labels.node')
    name=$(echo "$alert" | jq -r '.labels.name')
    category=$(echo "$alert" | jq -r '.labels.category')  # 新增：提取告警类别

    case "$category" in
      "docker_long_uptime")
        # 时间窗口判断（03:00-06:00）
        current_hour=$(date +%H)
        if [[ $current_hour -ge 3 && $current_hour -lt 6 ]]; then
            echo "$node restarting $name ..."
        fi
        ;;
      "high_cpu"|"disk_full")
        # 执行其他分类告警处理逻辑
        ;;
      *)
        echo "Unknown category: $category: $instance $node $name"
        ;;
    esac
    
done <<< "$alerts"
wait