Ignoring HTTPS Certificate Errors in Python Selenium

2025-04-15 tech python selenium chrome 1 min 473 characters

Chrome

Pass a launch argument through ChromeOptions to skip certificate validation:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')  

driver = webdriver.Chrome(options=options)
driver.get("https://expired.badssl.com") 

Firefox

Set the accept_insecure_certs attribute to True:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.accept_insecure_certs = True  

driver = webdriver.Firefox(options=options)
driver.get("https://self-signed.badssl.com")  

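Prometheus Tuning: Killing High Cardinality and Slow Queries

2025-03-29 tech prometheus 5 mins 1849 characters

Operations work is no dinner party: Prometheus tuning has to be fast, precise, and ruthless. These are my notes from a recent round of tuning.

I. Three PromQL Optimizations

Cut useless computation

Every query must carry label filters, so that Prometheus does not have to read every series of the metric.

 # Bad
 sum(rate(http_requests_total[5m]))
 # Better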
 sum(rate(http_requests_total{cluster="prod",service!~"test-.*"}[5m]))

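Precompute

Turn every frequently run query into a recording rule.

 # Example rule file (loaded through rule_files in prometheus.yml)
 groups:
   - name: http_rules   # group name is arbitrary, chosen for this example
     rules:
       - record: service:http_errors:rate5m
         expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)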

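Pick functions precisely
- Do not misuse count(): it counts series, not values; use sum() to add values up
- rate() vs irate(): the former averages over the whole window and shows the trend, the latter uses only the last two samples and catches spikes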
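
The difference between the two functions can be sketched in a few lines of Python. This is a simplified model rather than Prometheus's actual implementation: it ignores counter resets and the extrapolation that rate() applies at the window edges, and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the whole window (rate-like)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only (irate-like)."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15s: steady traffic, then a burst at the end.
window = [(0, 0), (15, 150), (30, 300), (45, 450), (60, 1650)]
print(simple_rate(window))   # 27.5
print(simple_irate(window))  # 80.0
```

The burst in the last interval barely moves the window average but dominates the instantaneous value, which is why irate() suits spike detection and rate() suits alerting and trends.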

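II. High-Cardinality Metrics

What is high cardinality?

As a rule of thumb, a metric whose label combinations produce more than 500 time series counts as high cardinality.
Typical case: using user_id as a label.

A label causes high cardinality when its values have any of these traits:

Dynamically generated: user ID, session ID, trace ID, IP address
Unbounded in number: error text such as error_message (can grow without limit)
Highly unique: order numbers, UUIDs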
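
A rough way to see why a single unbounded label is so damaging: in the worst case the series count of a metric is the product of the number of distinct values of each label. The sketch below uses made-up cardinalities, and `worst_case_series` is a hypothetical helper, not part of any Prometheus tooling.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Assumed cardinalities for illustration only.
bounded = {"status_code": 5, "http_method": 4, "env": 3}
with_user = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))    # 60
print(worst_case_series(with_user))  # 600000
```

Real metrics rarely hit the full product because not every combination occurs, but the growth is still multiplicative.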

Quantify the risk with these PromQL queries:

# Find the top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Number of time series for a single metric
count({__name__="your_metric_name"})

# Break the series count down by the values of one label
count by (label_name)({__name__="your_metric_name"})

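Reference risk thresholds (adjust to cluster size)

Time series count	Risk level	Recommended action
< 500	Safe	No action needed
500-2000	Warning	Monitor the growth trend
> 2000	Danger	Redesign the labels immediately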

Tips:

Dynamic-label blacklist. Never use these as metric labels:

user_id, ip, session_id, trace_id, request_id, 
order_no, email, phone, uuid, error_message

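Safe label examples. Their values are bounded, so they are safe to use:

status_code (e.g. 200/404/500), 
http_method (GET/POST), 
env (prod/stage/dev), 
region (fixed data center IDs)

Emergency fix. Apply this as soon as a high-cardinality metric is found: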

# prometheus.yml, under the affected job in scrape_configs
metric_relabel_configs:
  - regex: user_id           # labeldrop matches label names, so no source_labels
    action: labeldrop        # remove this label from every scraped series

III. Three Performance Checks

Memory guard

# Alert when head series exceed 10 million (a proxy for memory passing 16GB)
prometheus_tsdb_head_series > 10000000

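Slow-query hunt

# Alert when the average query evaluation time exceeds 2 seconds
rate(prometheus_engine_query_duration_seconds_sum{slice="inner_eval"}[1m]) / rate(prometheus_engine_query_duration_seconds_count{slice="inner_eval"}[1m]) > 2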

Storage health check

# Analyze the TSDB directly (point at the data directory, not its wal subdirectory)
promtool tsdb analyze /data/prometheus

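IV. Routine Checks

Run topk(10, count by (__name__)({__name__=~".+"})) every week
Resolve any high-cardinality metric within 24 hours of finding it
Optimize every query that takes longer than 1 second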

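Reducing Prometheus Resource Consumption with Multi-Layer Optimization

2025-03-28 tech prometheus 4 mins 1664 characters

I have been experimenting with Prometheus lately and used the following PromQL query in Grafana: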

sum by (user) (
  rate(iftop_traffic_bytes_total{direction="send",user!=""}[1m])
) > 1000

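This query drove up the system load on the Prometheus host. Analysis with count by (__name__)({__name__=~".+"}) showed that the iftop_traffic_bytes_total metric had about 50,000 time series, with the user label responsible for more than 75% of that cardinality.

I. Step-by-Step Optimization

Stage 1: Query-layer optimization

1. Filter out useless label values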

user!~"(?:^$|default|system)"

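This cut the number of time series by about 8%, but query latency only dropped from 3.2s to 3.0s, so the gain was limited.

2. Adjust the time window

Shorten the computation window from [1m] to [30s]:

Execution time: 3.2s → 2.7s

However, the traffic curve became noticeably jittery, so I dropped this change.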

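Stage 2: Data-source cleanup

3. Filter at metric generation

By modifying the metric generation script, invalid data is excluded at the source:

# Excerpt from the collection script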
awk '/iftop_traffic_bytes_total/ && $6 != "" {print $0}' raw_metrics.log > metrics.prom

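Results:

Generated metric volume down 12%
user label cardinality down from 5200 to 4600
Prometheus TSDB ingestion rate down 15%

4. Harden the scrape configuration

Add a filter rule to prometheus.yml: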

metric_relabel_configs:
- source_labels: [user]
  regex: '^(?:|default|system)$'
  action: drop

The two layers together keep invalid data from ever reaching storage.
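
To make the behavior of the drop rule concrete, here is a small Python sketch of how such a rule decides what to discard. It assumes the documented relabeling semantics (the regex must match the whole label value, and a missing label is treated as an empty string); `apply_drop` and the sample series are hypothetical.

```python
import re
from typing import Dict, List

# Same pattern as the relabel rule; Prometheus anchors relabel regexes on both ends.
DROP_USER = re.compile(r"^(?:|default|system)$")


def apply_drop(series: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only series whose user label does not fully match the drop regex."""
    return [s for s in series if not DROP_USER.fullmatch(s.get("user", ""))]


scraped = [
    {"user": "alice"},
    {"user": ""},
    {"user": "default"},
    {"user": "system"},
    {"user": "systemd"},  # only a prefix match, so it is kept
    {},                   # no user label at all, treated as empty
]
kept = apply_drop(scraped)
print([s.get("user") for s in kept])  # ['alice', 'systemd']
```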

Stage 3: Compute-layer optimization

5. Precomputed rules

Configure recording rules to pin down the query logic:

- record: job:iftop_traffic_bytes:rate1m
  expr: |
    rate(iftop_traffic_bytes_total{
      direction="send"
    }[1m])

The query then simplifies to:

sum by (user) (job:iftop_traffic_bytes:rate1m) > 1000

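II. Verification

With all the measures combined, the system metrics changed as follows:

Stage	Time series	Query latency	Memory usage
Baseline	52,000	3200ms	1.2GB
After query-layer optimization	47,800	2700ms	900MB
After data-source cleanup	42,000	2200ms	750MB
After enabling recording rules	820	420ms	150MB

Observed through rate(prometheus_engine_query_duration_seconds_sum[1h]), P99 query latency dropped by 89%. (Strictly speaking, that expression measures query time spent per second; the percentile itself is exposed by the summary's quantile="0.99" series.)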

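III. Side Benefits

On top of the optimized data model I added the following monitoring expression:

# Number of users currently online (a plain comparison filters; with "bool" every user would be counted)
count(sum by (user)(job:iftop_traffic_bytes:rate1m) > 1)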
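
In PromQL a plain comparison filters series out, while the bool modifier keeps every series and replaces its value with 0 or 1. That matters for counting: count() measures active users only with the filtering form. The sketch below models an instant vector as a Python dict; the helper names and rates are made up for illustration.

```python
from typing import Dict

Vector = Dict[str, float]  # label value -> sample value


def filter_gt(vector: Vector, threshold: float) -> Vector:
    """Model of `vector > threshold`: drop series that fail the comparison."""
    return {k: v for k, v in vector.items() if v > threshold}


def bool_gt(vector: Vector, threshold: float) -> Vector:
    """Model of `vector > bool threshold`: keep all series, values become 0 or 1."""
    return {k: (1.0 if v > threshold else 0.0) for k, v in vector.items()}


rates = {"alice": 1200.0, "bob": 0.0, "carol": 35.5}

print(len(filter_gt(rates, 1)))         # 2  -> count(... > 1)
print(len(bool_gt(rates, 1)))           # 3  -> count(... > bool 1) counts everyone
print(sum(bool_gt(rates, 1).values()))  # 2.0 -> sum(... > bool 1) also works
```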

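IV. Lessons Learned

Fix it at the source first: filtering invalid data where metrics are generated is more efficient than cleaning up later
Layered defense: combine script-level filtering with metric_relabel_configs for a double safeguard
Value of precomputation: recording rules turn an expensive query into a simple lookup, with a clear payoff
Cardinality monitoring: make count(count by (user)({__name__=~".+"})) a routine check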

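Shell Script while Loop Runs Only Once

2025-03-26 tech linux shell ssh 5 mins 1968 characters

I have been automating operations around Prometheus alerts, using a shell script that loops over multiple long_uptime alerts. However, the script left the loop right after the first SSH command, so the remaining alerts were never handled. This post records the full debugging process.

Symptoms

Original code snippet:

while read -r alert; do
    # extract alert fields...
    ssh "$node" "docker restart $container_name"
    # other logic...
done <<< "$alerts"

Behavior:

The loop exits after handling the first alert instead of iterating over all matching alerts.
Commenting out the ssh command makes the problem disappear, which shows it is tied to the SSH call.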

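Root Cause Analysis

1. SSH consumes stdin and ends the loop

Default behavior: ssh reads standard input (stdin). Inside the loop it swallows the rest of the here-string, so the next read hits end of input and the loop ends
Error propagation: if the SSH connection fails and the exit code is not handled, Bash terminates the script when set -e is in effect

2. No concurrency or timeout control

Serial execution: each SSH command waits for the previous one to finish, so slow nodes make the total runtime long
No timeout: during network glitches or node outages, SSH may hang indefinitely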
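
The stdin behavior is easy to reproduce without any SSH host. The sketch below drives a small shell loop from Python and uses `cat` as a stand-in for ssh, since both read their standard input until end of file; the alert names are placeholders and `run_loop` is a hypothetical helper.

```python
import subprocess
from typing import List


def run_loop(child_stdin: str) -> List[str]:
    """Run a while-read loop whose body calls `cat` as a stand-in for ssh."""
    script = (
        "printf 'alert1\\nalert2\\nalert3\\n' | "
        "while read -r alert; do "
        'echo "handling $alert"; '
        f"cat {child_stdin} >/dev/null; "
        "done"
    )
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# The child inherits the loop's stdin and drains it: only the first alert is handled.
print(run_loop(""))
# The child reads from /dev/null instead (what ssh -n does): all alerts are handled.
print(run_loop("</dev/null"))
```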

Solutions

1. Non-blocking SSH execution

```bash
ssh -n -o ConnectTimeout=10 "$node" "docker restart $container_name" </dev/null &>/dev/null &
```

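- **Key parameters**:  
  - `-n`: do not read from stdin
  - `ConnectTimeout=10`: 10-second connection timeout
  - `&`: run in the background and return control immediately

2. Concurrency control and error handling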

```bash
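max_jobs=5  # maximum number of concurrent jobs
while read -r alert; do
    # alert handling logic...
    ssh -n "$node" "docker restart $container_name" &
    
    # throttle concurrency
    if [[ $(jobs -r -p | wc -l) -ge $max_jobs ]]; then
        wait -n  # wait for any one job to finish (bash 4.3+)
    fi
done <<< "$alerts"
wait  # wait for all remaining background jobs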
```
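- **`jobs -r -p`**: lists the PIDs of background jobs that are still running
- **`wait -n`**: blocks until one job finishes, which caps concurrency and avoids exhausting resources

3. Error-tolerant design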

```bash
if ! ssh -n "$node" "docker restart $container_name"; then
    echo "Failed to restart $container_name on $node" >&2
    continue  # skip the failed task and move on to the next alert
fi
```
- **`continue`**: the loop keeps going even if a single SSH call fails

Verification

Simulate multiple alerts as input:

alerts='[
  {"labels": {"category": "long_uptime", "name": "app1", "node": "node1"}},
  {"labels": {"category": "long_uptime", "name": "app2", "node": "node2"}}
]'

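Confirm that the script handles every alert

Stress test:
Use tc to simulate a high-latency network and check that the script still runs correctly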
```
tc qdisc add dev eth0 root netem delay 2000ms
```

Final Optimized Code

while read -r alert; do
    # extract variables (some code omitted)
    if [[ "$category" == "long_uptime" ]]; then
        # run the SSH command in the background
        ssh -n -o ConnectTimeout=10 "$node" "docker restart $container_name" </dev/null &>/dev/null &
        
        # throttle concurrency (example: at most 10 parallel jobs)
        if [[ $(jobs -r -p | wc -l) -ge 10 ]]; then
            wait -n
        fi
    fi
done <<< "$alerts"
wait  # wait for all background jobs to finish
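
For comparison, the same pattern (bounded concurrency, detached stdin, a hard timeout per command) can be sketched in Python with a thread pool. This is an alternative technique, not a translation of the author's script: `run_one` and `run_all` are hypothetical helpers, and the commands below are harmless stand-ins where the real script would invoke ssh.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_one(cmd: List[str], timeout: float = 10.0) -> int:
    """Run one command with stdin detached and a hard timeout; return its exit code."""
    try:
        return subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # same effect as ssh -n
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        ).returncode
    except subprocess.TimeoutExpired:
        return 124  # exit code borrowed from the timeout(1) utility


def run_all(commands: Dict[str, List[str]], max_jobs: int = 10) -> Dict[str, int]:
    """Run all commands with at most max_jobs in flight; map name -> exit code."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {name: pool.submit(run_one, cmd) for name, cmd in commands.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stand-in commands: one succeeds, one fails with exit code 3.
results = run_all({
    "app1": [sys.executable, "-c", "raise SystemExit(0)"],
    "app2": [sys.executable, "-c", "raise SystemExit(3)"],
})
print(results)
```

A failing or hanging command only affects its own entry in the result map, so one bad node cannot stall the rest of the batch.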

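Configuring Prometheus Data Retention

2025-03-25 tech prometheus 2 mins 1 image 874 characters

Today I extended Prometheus data retention from the default 15 days to 30 days. These are my notes.

Configuration

Following a tutorial I found online, I added --storage.tsdb.retention.time=30d to the startup command, and Prometheus failed immediately: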

Error parsing commandline arguments: unknown long flag '--storage.tsdb.retention.time'

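prometheus --version showed that I was running 2.6.0. According to the documentation:

- **Version 2.7.0 and later** support the `--storage.tsdb.retention.time` flag
- **Older versions** use `--storage.tsdb.retention=30d` (without the `.time` suffix)

So I switched to the old-style flag: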

./prometheus --storage.tsdb.retention=30d --config.file=prometheus.yml

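Verifying Configuration Precedence

To confirm which setting actually takes effect, I ran the following test:

Command line vs config file conflict

- Set `storage.tsdb.retention: 15d` in `prometheus.yml`
- Pass `--storage.tsdb.retention=30d` on the command line

Result: the command-line flag overrides the config file, so 30 days is what takes effect

The effective settings can be checked on this page: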

http://<prometheus-server>/flags
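
The same check can be scripted against the HTTP API, which serves the flag values as JSON at /api/v1/status/flags. The sketch below is a minimal example under stated assumptions: `effective_retention` and `fetch_flags` are hypothetical helpers, treating "0s" as "not set" is my assumption about how an unset deprecated flag is reported, and the sample payload is made up so the example runs offline.

```python
import json
from typing import Optional
from urllib.request import urlopen


def effective_retention(flags: dict) -> Optional[str]:
    """Return the retention value, whichever of the two flag names carries it."""
    for key in ("storage.tsdb.retention.time", "storage.tsdb.retention"):
        value = flags.get(key)
        if value and value != "0s":  # assumption: "0s" means the flag was not set
            return value
    return None


def fetch_flags(base_url: str) -> dict:
    """Fetch the flag map from a running server, e.g. base_url='http://localhost:9090'."""
    with urlopen(f"{base_url}/api/v1/status/flags", timeout=5) as resp:
        return json.load(resp)["data"]


# Offline sample shaped like the "data" object of the API response.
sample = {"storage.tsdb.retention": "30d", "config.file": "prometheus.yml"}
print(effective_retention(sample))  # 30d
```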

[Screenshot: the /flags page]

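Complete Configuration Example

# Startup script example (for v2.6)
# --storage.tsdb.retention: old-style retention flag
# --config.file: path to the config file
# --web.enable-lifecycle: optional, enables reload over HTTP
./prometheus \
  --storage.tsdb.retention=30d \
  --config.file=prometheus.yml \
  --web.enable-lifecycle

# prometheus.yml (compatible with old and new versions)
storage:
  tsdb:
    retention: 30d  # newer versions write this as retention.time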
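Monitoring Container Start Time with Prometheus and Automating the Response

2025-03-24 tech prometheus 9 mins 1 image 3251 characters

Background

In cloud-native environments, how long a container has actually been running (as opposed to when it was first started) is a key metric. This post uses Node Exporter plus a custom script to collect the container start time accurately, working around cAdvisor's inability to report uptime since the last restart:

cAdvisor exposes container_start_time_seconds, which records the container's start timestamp (Unix time), but it reflects the container's initial creation time rather than the time of its most recent restart.

Design

Data source: read the container's StartedAt time (the most recent start) with docker inspect
Exposing the metric: use Node Exporter's textfile collector to publish the timestamp as a Prometheus metric
Visualization: compute time() - container_last_started_time_seconds in PromQL to get the uptime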
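
The core of the design is a timestamp conversion followed by a subtraction. A minimal Python sketch of those two steps, assuming StartedAt is an RFC 3339 UTC string with an optional fractional part; `started_at_to_epoch` and `uptime_days` are hypothetical helper names.

```python
from datetime import datetime, timezone


def started_at_to_epoch(started_at: str) -> int:
    """Convert a UTC RFC 3339 timestamp (optional fractional seconds) to Unix seconds."""
    seconds_part = started_at.split(".")[0].rstrip("Z")  # drop fraction and trailing Z
    dt = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


def uptime_days(started_epoch: int, now_epoch: int) -> float:
    """Mirror of the PromQL expression (time() - metric) / 86400."""
    return (now_epoch - started_epoch) / 86400


epoch = started_at_to_epoch("2024-05-31T19:00:00.123456789Z")
print(epoch)                                   # 1717182000
print(uptime_days(epoch, epoch + 15 * 86400))  # 15.0
```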

Implementation

1. Write the collection script

#!/bin/bash
PROMETHEUS_FILE="/var/lib/node-exporter/textfile-collector/container_started.prom"
TMP_FILE="${PROMETHEUS_FILE}.$$"  # write here first, then rename atomically

# Initialize the metric file
cat <<EOF > "$TMP_FILE"
# HELP container_last_started_time_seconds Container last started time in Unix epoch
# TYPE container_last_started_time_seconds gauge
EOF

for s in $(docker ps -q); do
  # Get the time of the container's most recent start
  started_at=$(docker inspect -f '{{.State.StartedAt}}' "$s" 2>/dev/null) || continue
  clean_date=$(echo "$started_at" | sed 's/\.[0-9]*Z/Z/')  # strip fractional seconds
  timestamp=$(date -u -d "$clean_date" +%s) || continue     # convert to Unix time (GNU date)

  # Append the sample
  name=$(docker inspect -f '{{.Name}}' "$s" | cut -c 2-)
  echo "container_last_started_time_seconds{name=\"$name\",id=\"$s\"} $timestamp" >> "$TMP_FILE"
done

# Rename into place so Node Exporter never reads a half-written file
mv "$TMP_FILE" "$PROMETHEUS_FILE"

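The generated content looks like this:

[Screenshot: the generated metric file]

HELP and TYPE comments are optional in the text format, but a metric without a TYPE line is exposed as untyped, and a malformed line makes the collector reject the whole file.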

2. Configure Node Exporter

# docker-compose.yml
node-exporter:
  image: prom/node-exporter:v0.17.0
  volumes:
    - ./textfile-collector:/var/lib/node-exporter/textfile-collector  # mount the directory, not the file
  command:
    - '--collector.textfile'          # enable the textfile collector
    - '--collector.textfile.directory=/var/lib/node-exporter/textfile-collector'

3. Verify the metric

# Inspect the metric file
cat /var/lib/node-exporter/textfile-collector/container_started.prom
# Example output:
# container_last_started_time_seconds{name="nginx",id="abc123"} 1717182000

# Prometheus query
time() - container_last_started_time_seconds{name="nginx"}

Alerting

Configure alert.rules as follows:

- name: docker-alerts
  rules:
  - alert: ContainerLongUptime
    expr: |
      (time() - container_last_started_time_seconds)/ 86400 > 14
    for: 5m                     # fire after the condition holds for 5 minutes
    labels:
      category: docker_long_uptime  # added: marks the alert category
    annotations:
      summary: "Container {{ $labels.name }} on {{ $labels.node }} has been running too long"
      description: |
        Container {{ $labels.name }} on {{ $labels.node }} has been running for {{ printf "%.1f" $value }} days

Automated Handling

For example, restarting the container:

# Fetch alerts from Prometheus (-f makes curl fail on HTTP errors)
response=$(curl -sf "$PROMETHEUS_ALERT_API")
if [[ $? -ne 0 ]]; then
    echo "Error fetching alerts from Prometheus." >&2
    exit 1
fi

# Make sure the response is valid JSON
if ! echo "$response" | jq . >/dev/null 2>&1; then
    echo "Response is not valid JSON: $response" >&2
    exit 1
fi

# Extract alerts in the firing state
alerts=$(echo "$response" | jq -c '.data.alerts[] | select(.state=="firing")')

if [[ -z "$alerts" ]]; then
    echo "No firing alerts."
    exit 0
fi

while read -r alert; do
    instance=$(echo "$alert" | jq -r '.labels.instance')
    node=$(echo "$alert" | jq -r '.labels.node')
    name=$(echo "$alert" | jq -r '.labels.name')
    category=$(echo "$alert" | jq -r '.labels.category')  # added: extract the alert category

    case "$category" in
      "docker_long_uptime")
        # Only act inside the maintenance window (03:00-06:00)
        current_hour=$((10#$(date +%H)))  # force base 10 so "08" and "09" are not parsed as octal
        if [[ $current_hour -ge 3 && $current_hour -lt 6 ]]; then
            echo "$node restarting $name ..."
        fi
        ;;
      "high_cpu"|"disk_full")
        # handling for other alert categories goes here
        ;;
      *)
        echo "Unknown category: $category: $instance $node $name"
        ;;
    esac
    
done <<< "$alerts"
wait
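
The decision logic of the script (keep firing alerts, act only on one category, and only inside the maintenance window) can also be expressed as small pure functions, which makes it easy to test without a live Prometheus. The Python sketch below assumes the response shape used by the jq filter above; the helper names and the sample payload are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def firing_alerts(response_text: str) -> List[Dict]:
    """Parse the alerts API response and keep only alerts in the firing state."""
    payload = json.loads(response_text)
    return [a for a in payload["data"]["alerts"] if a.get("state") == "firing"]


def in_maintenance_window(hour: int, start: int = 3, end: int = 6) -> bool:
    """True for hours in [start, end), matching the 03:00-06:00 window."""
    return start <= hour < end


def plan_restarts(alerts: List[Dict], hour: int) -> List[Tuple[str, str]]:
    """Return (node, container name) pairs that should be restarted now."""
    if not in_maintenance_window(hour):
        return []
    return [
        (a["labels"].get("node", ""), a["labels"].get("name", ""))
        for a in alerts
        if a["labels"].get("category") == "docker_long_uptime"
    ]


# Made-up sample payload.
sample = json.dumps({"status": "success", "data": {"alerts": [
    {"state": "firing", "labels": {"category": "docker_long_uptime", "node": "node1", "name": "app1"}},
    {"state": "pending", "labels": {"category": "docker_long_uptime", "node": "node2", "name": "app2"}},
    {"state": "firing", "labels": {"category": "high_cpu", "node": "node3", "name": "app3"}},
]}})

alerts = firing_alerts(sample)
print(plan_restarts(alerts, hour=4))   # [('node1', 'app1')]
print(plan_restarts(alerts, hour=12))  # []
```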

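Installing Xinference on macOS

2025-03-13 tech llm xinference 4 mins 12 images 1422 characters

Xorbits Inference is a powerful, general-purpose distributed inference framework for large language models (LLMs), speech recognition models, multimodal models, and more. It makes it easy to deploy your own models or built-in state-of-the-art open-source models with a single command. I run Xinference mainly to use Rerank in dify.

I. Installation

Official guide: https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html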

conda create -n xinference python=3.11
conda activate xinference
pip install "xinference[all]"
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

I hit some errors along the way, which is to be expected:

/private/var/folders/sl/j0g8fv0d5h97tc_xxsy3fkyr0000gn/T/pip-install-6lcea4jt/llama-cpp-python_47e4373c3e314d09bee185d9fcb17bda/vendor/llama.cpp/ggml/src/ggml-quants.c
ninja: build stopped: subcommand failed.

*** CMake build failed
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)

Installing ninja fixes it:

brew install ninja

Then run:

xinference-local # local only
xinference-local -H 0.0.0.0 # recommended, since dify needs to reach it

The UI language can be switched to Chinese, which is easier on the eyes:

[Screenshot]

II. Loading a Model

Built-in models: https://inference.readthedocs.io/zh-cn/latest/models/builtin/index.html

xinference launch --model-name bge-reranker-large --model-type rerank # load the model

This can also be done in the web UI:

[Screenshot]

By default Xinference downloads models from Hugging Face, which was unreachable from my network, so I switched the source to ModelScope and restarted:

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0

The log shows the download finally starting:

[Screenshot]

Download complete:

[Screenshot]

III. Connecting dify

Add the provider:

[Screenshots]

Change the knowledge base retrieval settings:

[Screenshot]

Connected successfully 🏅

IV. Miscellaneous

xinference terminate --model-uid ${model_uid}   # stop the model
