Subtitle: Go web application server + Tengine load balancing + keepalived high availability + MySQL master-slave replication with read/write splitting + Prometheus metrics collection + Grafana visualization + Feishu alerting
This project simulates the deployment workflow and common techniques of an enterprise website service. A Tengine proxy plus keepalived gives the web service high concurrency and high availability, and the MySQL cluster is deployed with master-slave replication (including a delayed backup) and a read/write split. Tengine extension modules (active health checks, request statistics and so on), the Prometheus metrics collector, Grafana dashboards and the open-source alert forwarding system PrometheusAlert monitor and alert on the cluster's core services, and the key signals (latency, traffic, errors, saturation) are made observable so that problems surface early and risks can be anticipated and avoided.
- Database
MySQL Community Edition, deployed with Docker for convenience using the mysql:5.7.43 image. mysql-master-001 handles writes only, mysql-slave-001 is read-only, and mysql-slave-002 is read-only with replication delayed by one hour to serve as a disaster-recovery copy. Read/write splitting is implemented with ProxySQL.
- Website
An open-source site built with Gin + Node.js: https://github.com/flipped-aurora/gin-vue-admin
The web server is Tengine, with an NFS share mounted to serve the frontend static assets from a single location. The image is built from a Dockerfile and three nodes run in Docker.
- Load balancing
Tengine acts as the load balancer, with active health checks enabled and statistics exposed; keepalived handles health checking of the load balancers and provides high availability.
- Metrics collection, observability, alerting
Prometheus acts as the metrics server and actively pulls data from the exporters. Grafana charts the collected metrics, and together with the PrometheusAlert component alert rules are defined that push alerts to Feishu.
- Operating system: centos-7.9.2009
grafana-v10.4.1
mysqld_exporter-0.15.1.linux-amd64
node_exporter-1.8.0.linux-amd64
prometheusAlert-v4.8.2
tengine-exporter
tengine-3.0.0
prometheus-2.45.5.linux-amd64
mysql-5.7.43
gin-vue-admin-v2.6.2
docker-26.1.1
ProxySQL-2.5.3-89-g86ce115
- Start three MySQL instances
docker run -p 30001:3306 --name mysql-master-001 -v ~/docments/docker/mysql/my_master_001.cnf:/etc/my.cnf -e MYSQL_ROOT_PASSWORD=200122@Root -d mysql:5.7.43
docker run -p 30002:3306 --name mysql-slave-001 -v ~/docments/docker/mysql/my_slave_001.cnf:/etc/my.cnf -e MYSQL_ROOT_PASSWORD=200122@Root -d mysql:5.7.43
docker run -p 30003:3306 --name mysql-slave-002 -v ~/docments/docker/mysql/my_slave_002.cnf:/etc/my.cnf -e MYSQL_ROOT_PASSWORD=200122@Root -d mysql:5.7.43
[root@loadbalancer-001 ~]# docker ps | grep mysql:5.7.43
abba83d34105 mysql:5.7.43 "docker-entrypoint.s…" 8 hours ago Up 8 hours 33060/tcp, 0.0.0.0:30003->3306/tcp, :::30003->3306/tcp mysql-slave-002
87a1d1584fc6 mysql:5.7.43 "docker-entrypoint.s…" 8 hours ago Up 8 hours 33060/tcp, 0.0.0.0:30002->3306/tcp, :::30002->3306/tcp mysql-slave-001
f9b211c2920d mysql:5.7.43 "docker-entrypoint.s…" 8 hours ago Up 8 hours 33060/tcp, 0.0.0.0:30001->3306/tcp, :::30001->3306/tcp mysql-master-001
mysql-master-001 configuration file:
[mysqld]
server_id=1
log_bin
gtid-mode=on
enforce-gtid-consistency=on
mysql-slave-001 configuration file:
[mysqld]
server_id=2
log_bin
gtid-mode=on
enforce-gtid-consistency=on
mysql-slave-002 configuration file:
[mysqld]
server_id=3
log_bin
gtid-mode=on
enforce-gtid-consistency=on
- Create a dedicated replication user on mysql-master-001
create user 'backup'@'%' identified by '200122@Backup';
grant replication slave on *.* to 'backup'@'%';
- Configure mysql-slave-001 and mysql-slave-002 as replicas; the latter is set to replicate with a one-hour delay
CHANGE MASTER TO MASTER_HOST='192.168.31.200',
MASTER_USER='backup',
MASTER_PASSWORD='200122@Backup',
MASTER_PORT=30001,
master_auto_position=1;
# on mysql-slave-002 only: configure the one-hour replication delay
CHANGE MASTER TO MASTER_DELAY = 3600;
# start replication on the slaves
start slave;
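As a quick verification step (not shown in the original output), check the replication status on each replica; both threads should be running, and slave-002 should report the configured delay:
# run on mysql-slave-001 and mysql-slave-002
show slave status\G
# look for: Slave_IO_Running: Yes, Slave_SQL_Running: Yes, and (on mysql-slave-002) SQL_Delay: 3600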
- Create the MySQL users that ProxySQL will use
Create the application user that client traffic will be routed through ProxySQL as:
# run on mysql-master-001
GRANT ALL PRIVILEGES ON *.* TO 'proxy'@'%' identified by '200122@Proxy' WITH GRANT OPTION;
Create the monitoring user, which ProxySQL uses to read each backend's read_only flag:
create user proxy_monitor@'%' identified by '200122@Proxy_monitor';
grant replication client on *.* to proxy_monitor@'%';
Set the slave servers to read-only (run on each slave):
SET GLOBAL read_only = ON;
SHOW GLOBAL VARIABLES LIKE 'read_only';
- Add records to the mysql_servers table
By inserting, updating or deleting records in mysql_servers, ProxySQL's connections to the backend MySQL servers and its load-balancing settings can be reconfigured dynamically.
mysql -uadmin -padmin -h127.0.0.1 -P6032 main
insert into mysql_servers(hostgroup_id,hostname,port) values(20,'192.168.31.200',30002);
insert into mysql_servers(hostgroup_id,hostname,port) values(10,'192.168.31.200',30001);
# load to runtime and persist to disk
load mysql servers to runtime;
save mysql servers to disk;
- Configure the writer/reader hostgroup IDs
insert into mysql_replication_hostgroups(writer_hostgroup,reader_hostgroup,check_type) values(10,20,'read_only');
load mysql servers to runtime;
save mysql servers to disk;
- Configure the mysql_users table
insert into mysql_users(username,password,default_hostgroup) values('proxy','200122@Proxy',10);
load mysql users to runtime;
save mysql users to disk;
- Set the MySQL monitoring user in ProxySQL
set mysql-monitor_username='proxy_monitor';
set mysql-monitor_password='200122@Proxy_monitor';
load mysql variables to runtime;
save mysql variables to disk;
- Configure the ProxySQL read/write split rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*', 20, 1),
(2, 1, '.*', 10, 1);
load mysql query rules to runtime;
save mysql query rules to disk;
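Before inspecting the digest table, you can push a couple of test statements through ProxySQL's traffic port (6033 by default) as the proxy user; under the rules above the SELECT should be routed to hostgroup 20 and the write to hostgroup 10 (a sketch, the test database name is arbitrary):
mysql -uproxy -p'200122@Proxy' -h127.0.0.1 -P6033 -e "select @@server_id;"
mysql -uproxy -p'200122@Proxy' -h127.0.0.1 -P6033 -e "create database if not exists rw_split_test;"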
- Verify read/write splitting. Log in to the ProxySQL stats interface: mysql -uadmin -padmin -h127.0.0.1 -P6032 stats
To see which hostgroup each SQL statement was routed to: select hostgroup,schemaname,username,digest_text,count_star from stats_mysql_query_digest\G;
[root@loadbalancer-001 ~]# mysql -uadmin -padmin -h127.0.0.1 -P6032 stats
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MySQL connection id is 32
Server version: 5.5.30 (ProxySQL Admin Module)
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MySQL [stats]> select hostgroup,schemaname,username,digest_text,count_star from stats_mysql_query_digest\G;
*************************** 1. row ***************************
hostgroup: 10
schemaname: information_schema
username: proxy
digest_text: flush privileges
count_star: 2
*************************** 2. row ***************************
hostgroup: 20
schemaname: mysql
username: proxy
digest_text: SELECT DATABASE()
count_star: 1
*************************** 3. row ***************************
hostgroup: 10
schemaname: information_schema
username: proxy
digest_text: drop go_gin_web
count_star: 1
*************************** 4. row ***************************
hostgroup: 10
schemaname: go_gin_web
username: proxy
digest_text: ALTER TABLE `sys_base_menus` ADD `id` bigint unsigned AUTO_INCREMENT,ADD PRIMARY KEY (`id`)
count_star: 2
*************************** 5. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT count(*) FROM information_schema.statistics WHERE table_schema = ? AND table_name = ? AND index_name = ?
count_star: 8
*************************** 6. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: select * from sys_users
count_star: 9
*************************** 7. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT * FROM `sys_users` WHERE ?=?
count_star: 1
*************************** 8. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT * FROM `sys_base_menus` WHERE ?=?
count_star: 1
*************************** 9. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT * FROM `sys_authority_menus` WHERE ?=?
count_star: 1
*************************** 10. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT * FROM `sys_apis` WHERE ?=?
count_star: 1
*************************** 11. row ***************************
hostgroup: 10
schemaname: go_gin_web
username: proxy
digest_text: CREATE TABLE `sys_users` (`id` bigint unsigned AUTO_INCREMENT,`created_at` datetime(?) NULL,`updated_at` datetime(?) NULL,`deleted_at` datetime(?) NULL,`uuid` varchar(?) COMMENT ?,`username` varchar(?) COMMENT ?,`password` varchar(?) COMMENT ?,`nick_name` varchar(?) DEFAULT ? COMMENT ?,`side_mode` varchar(?) DEFAULT ? COMMENT ?,`header_img` varchar(?) DEFAULT ? COMMENT ?,`base_color` varchar(?) DEFAULT ? COMMENT ?,`active_color` varchar(?) DEFAULT ? COMMENT ?,`authority_id` bigint unsigned DEFAULT ? COMMENT ?,`phone` varchar(?) COMMENT ?,`email` varchar(?) COMMENT ?,`enable` bigint DEFAULT ? COMMENT ?,PRIMARY KEY (`id`),INDEX `idx_sys_users_deleted_at` (`deleted_at`),INDEX `idx_sys_users_uuid` (`uuid`),INDEX `idx_sys_users_username` (`username`))
count_star: 1
*************************** 12. row ***************************
hostgroup: 10
schemaname: go_gin_web
username: proxy
digest_text: show tables
count_star: 2
*************************** 13. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT * FROM `sys_users` WHERE username = ? AND `sys_users`.`deleted_at` IS NULL ORDER BY `sys_users`.`id` LIMIT ?
count_star: 6
*************************** 14. row ***************************
hostgroup: 10
schemaname: go_gin_web
username: proxy
digest_text: ALTER TABLE `sys_dictionaries` ADD `id` bigint unsigned AUTO_INCREMENT,ADD PRIMARY KEY (`id`)
count_star: 1
*************************** 15. row ***************************
hostgroup: 20
schemaname: go_gin_web
username: proxy
digest_text: SELECT column_name,column_default,is_nullable = ?,data_type,character_maximum_length,column_type,column_key,extra,column_comment,numeric_precision,numeric_scale,datetime_precision FROM information_schema.columns WHERE table_schema = ? AND table_name = ? ORDER BY ORDINAL_POSITION
count_star: 10
For the application itself, refer to https://github.com/flipped-aurora/gin-vue-admin
cat >> /etc/exports << EOF
/mnt/shared/nginx *(rw,all_squash)
EOF
exportfs -a
chmod -R 0777 /mnt/shared/nginx
[root@loadbalancer-001 ~]# ls -al /mnt/shared/nginx/
total 70720
drwxrwxrwx. 6 root root 93 May 5 17:13 .
drwxr-xr-x. 3 root root 19 May 5 16:18 ..
-rwxrwxrwx. 1 root root 4934 May 5 17:12 config.yaml
drwxrwxrwx. 3 root root 38 May 5 16:19 dist
drwxrwxrwx. 3 root root 24 May 5 16:35 log
drwxrwxrwx. 5 root root 64 May 5 16:58 resource
-rwxrwxrwx. 1 root root 72405744 May 5 16:19 server
drwxr-xr-x. 3 root root 18 May 5 17:13 uploads
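If the web nodes ran on a separate host instead of as containers on the NFS server itself, the export could be mounted like this (a sketch; 192.168.31.200 is assumed to be the NFS server and the mount point is arbitrary):
yum install -y nfs-utils
mkdir -p /mnt/shared/nginx
mount -t nfs 192.168.31.200:/mnt/shared/nginx /mnt/shared/nginx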
- Build the Tengine image tengine:3.0.0
[root@loadbalancer-001 tengine-docker-image]# tree .
.
├── conf.d
│ ├── default.conf
│ └── loadbalancer.conf
├── Dockerfile
├── nginx.conf
└── README.md
1 directory, 5 files
[root@loadbalancer-001 tengine-docker-image]# cat Dockerfile
FROM centos:7.9.2009
ENV TENGINE_VERSION 3.0.0
ENV CONFIG "\
--prefix=/usr/local/nginx \
--add-module=modules/ngx_http_reqstat_module \
--add-module=modules/ngx_http_upstream_check_module \
"
WORKDIR /opt/tengine
RUN yum install -y gcc pcre pcre-devel openssl openssl-devel \
&& curl -L "https://github.com/alibaba/tengine/archive/$TENGINE_VERSION.tar.gz" -o tengine.tar.gz \
&& mkdir -p /usr/local/src \
&& tar -zxC /usr/local/src -f tengine.tar.gz \
&& rm tengine.tar.gz \
&& cd /usr/local/src/tengine-$TENGINE_VERSION \
&& ./configure $CONFIG \
&& make \
&& make install \
&& mkdir -p /usr/local/nginx/conf/conf.d
COPY nginx.conf /usr/local/nginx/conf/nginx.conf
COPY conf.d/default.conf /usr/local/nginx/conf/conf.d/default.conf
EXPOSE 80 443
CMD ["/usr/local/nginx/sbin/nginx", "-g", "daemon off;"]
- Build the Go web image go-gin-web:2.6.2 on top of tengine:3.0.0
[root@loadbalancer-001 docker]# cat Dockerfile
FROM tengine:3.0.0
WORKDIR /opt/go-gin-web
ENV LANG=en_US.utf8
COPY deploy/docker/entrypoint.sh .
COPY deploy/docker/build/ /usr/local/nginx/html/
COPY web/.docker-compose/nginx/conf.d/nginx.conf /usr/local/nginx/conf/conf.d/default.conf
RUN set -ex \
&& echo "LANG=en_US.utf8" > /etc/locale.conf \
&& chmod +x ./entrypoint.sh \
&& echo "start" > /dev/null
EXPOSE 80 443
ENTRYPOINT ["/opt/go-gin-web/entrypoint.sh"]
[root@loadbalancer-001 docker]# cat entrypoint.sh
#!/bin/bash
# start the go backend in the background, then run tengine in the foreground as PID 1
cd /usr/local/nginx/html/ && ./server &
echo "gva ALL start!!!"
exec /usr/local/nginx/sbin/nginx -g "daemon off;"
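Building the image is assumed to be run from the gin-vue-admin project root, with the Dockerfile kept under deploy/docker/ so the COPY paths resolve (a sketch):
docker build -t go-gin-web:2.6.2 -f deploy/docker/Dockerfile .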
docker run -d -p 20001:80 --name go-gin-web-001 -v /mnt/shared/nginx:/usr/local/nginx/html go-gin-web:2.6.2
docker run -d -p 20002:80 --name go-gin-web-002 -v /mnt/shared/nginx:/usr/local/nginx/html go-gin-web:2.6.2
docker run -d -p 20003:80 --name go-gin-web-003 -v /mnt/shared/nginx:/usr/local/nginx/html go-gin-web:2.6.2
[root@loadbalancer-001 ~]# docker ps | grep go-gin-web:2.6.2
caf5c3827ed4 go-gin-web:2.6.2 "/opt/go-gin-web/ent…" 7 hours ago Up 7 hours 443/tcp, 0.0.0.0:20003->80/tcp, :::20003->80/tcp go-gin-web-003
29d5cdfe8666 go-gin-web:2.6.2 "/opt/go-gin-web/ent…" 7 hours ago Up 7 hours 443/tcp, 0.0.0.0:20002->80/tcp, :::20002->80/tcp go-gin-web-002
af2d138dbb60 go-gin-web:2.6.2 "/opt/go-gin-web/ent…" 7 hours ago Up 7 hours 443/tcp, 0.0.0.0:20001->80/tcp, :::20001->80/tcp go-gin-web-001
Open 192.168.31.200:20003 in a browser to verify.
Main configuration file /usr/local/nginx/conf/nginx.conf:
#user  nobody;
worker_processes  auto;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;
#error_log  "pipe:rollback logs/error_log interval=1d baknum=7 maxsize=2G";

#pid        logs/nginx.pid;

events {
    worker_connections  5000;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    #log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
    #                  '$status $body_bytes_sent "$http_referer" '
    #                  '"$http_user_agent" "$http_x_forwarded_for"';

    #access_log  logs/access.log  main;
    #access_log  "pipe:rollback logs/access_log interval=1d baknum=7 maxsize=2G"  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;

    #gzip  on;

    include /usr/local/nginx/conf/conf.d/*.conf;
}
cat /usr/local/nginx/conf/conf.d/loadbalancer.conf
req_status_zone server "$host $server_addr:$server_port" 10M;

upstream web_cluster_001 {
    server 192.168.31.200:20001;
    server 192.168.31.200:20002;
    server 192.168.31.200:20003;

    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_keepalive_requests 100;
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen       80;
    server_name  localhost;

    #access_log  /var/log/nginx/host.access.log  main;

    location /upstream_check {
        check_status;
    }

    location /nginx_reqstat {
        req_status_show;
    }

    req_status server;

    location / {
        proxy_pass http://web_cluster_001;
    }

    location /nginx_status {
        stub_status on;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }

    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass   http://127.0.0.1;
    #}

    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root           html;
    #    fastcgi_pass   127.0.0.1:9000;
    #    fastcgi_index  index.php;
    #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
    #    include        fastcgi_params;
    #}

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny  all;
    #}
}
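After editing the load-balancer configuration, a syntax check followed by a reload avoids taking the proxy down (a sketch; the paths follow the --prefix used above):
/usr/local/nginx/sbin/nginx -t
/usr/local/nginx/sbin/nginx -s reload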
keepalived configuration on loadbalancer-001 (MASTER): /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id master-node
}

vrrp_script chk_http_port {
    script "/etc/keepalived/check_nginx.sh"
    interval 1
    weight -5
    fall 1
    rise 1
}

vrrp_instance VI_1 {
    state MASTER
    interface enp0s3
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.31.254
    }
    track_script {
        chk_http_port
    }
}
Load balancer health-check script: /etc/keepalived/check_nginx.sh
#!/bin/sh

APISERVER_DEST_PORT=80
APISERVER_VIP=192.168.31.254

errorExit() {
    echo "*** $*" 1>&2
    exit 1
}

curl --silent --max-time 2 --insecure http://localhost:${APISERVER_DEST_PORT}/ -o /dev/null || errorExit "Error GET http://localhost:${APISERVER_DEST_PORT}/"
if ip addr | grep -q ${APISERVER_VIP}; then
    curl --silent --max-time 2 --insecure http://${APISERVER_VIP}:${APISERVER_DEST_PORT}/ -o /dev/null || errorExit "Error GET http://${APISERVER_VIP}:${APISERVER_DEST_PORT}/"
fi
keepalived configuration on loadbalancer-002 (BACKUP): /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id master-node
}

vrrp_script chk_http_port {
    script "/etc/keepalived/check_nginx.sh"
    interval 1
    weight -5
    fall 1
    rise 1
}

vrrp_instance VI_1 {
    state BACKUP
    interface enp0s3
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.31.254
    }
    track_script {
        chk_http_port
    }
}
Everything else is identical to loadbalancer-001.
With the configuration above, the VIP 192.168.31.254 is available.
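Failover can be checked quickly: the VIP should sit on the MASTER node and move to the BACKUP once the check script starts failing (a verification sketch, not part of the original write-up):
# on loadbalancer-001 (MASTER): the VIP should be bound to enp0s3
ip addr show enp0s3 | grep 192.168.31.254
# stop nginx on the MASTER, then repeat the check on loadbalancer-002 and confirm the site still answers via the VIP
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.31.254/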
- Install mysqld_exporter
Reference: https://github.com/prometheus/mysqld_exporter
- Install node_exporter
Reference: https://github.com/prometheus/node_exporter
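A minimal sketch of starting the two exporters on the load-balancer host after unpacking the release tarballs. The mysqld_exporter credentials file is an assumption: it should point at a MySQL account (here called exporter) created with the privileges described in the mysqld_exporter README:
# node_exporter listens on :9100 by default
nohup ./node_exporter-1.8.0.linux-amd64/node_exporter &
# mysqld_exporter listens on :9104 by default and reads credentials from a my.cnf-style file
cat > .my.cnf << EOF
[client]
user=exporter
password=<exporter password>
host=192.168.31.200
port=30001
EOF
nohup ./mysqld_exporter-0.15.1.linux-amd64/mysqld_exporter --config.my-cnf=.my.cnf &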
- Develop tengine-exporter, a custom exporter for website-related metrics
#!/usr/bin/env python3
import re
import time
import requests
from prometheus_client import start_http_server, Gauge
# List of Nginx/Tengine servers to scrape
NGINX_SERVERS = [
{'name': 'loadbalancer_vip', 'type': 'loadbalancer', 'url': 'http://192.168.31.254:80/'},
{'name': 'web_001', 'type': 'web', 'url': 'http://192.168.31.200:20001/'},
{'name': 'web_002', 'type': 'web', 'url': 'http://192.168.31.200:20002/'},
{'name': 'web_003', 'type': 'web', 'url': 'http://192.168.31.200:20003/'}
]
# Status/statistics paths exposed by Tengine
PATH = {"status": "nginx_status", "reqstat": "nginx_reqstat", "upstream": "upstream_check?format=json"}
# Metrics derived from nginx_status (stub_status)
nginx_active_connections = Gauge('nginx_active_connections', 'Number of active connections in NGINX', labelnames=['server'])
nginx_accepted_connections = Gauge('nginx_accepted_connections', 'Number of accepted connections in NGINX', labelnames=['server'])
nginx_handled_connections = Gauge('nginx_handled_connections', 'Number of handled connections in NGINX', labelnames=['server'])
nginx_total_requests = Gauge('nginx_total_requests', 'Total number of requests handled by NGINX', labelnames=['server'])
nginx_request_time = Gauge('nginx_request_time', 'Average request time in NGINX (in milliseconds)', labelnames=['server'])
nginx_reading_connections = Gauge('nginx_reading_connections', 'Number of connections in NGINX reading request header', labelnames=['server'])
nginx_writing_connections = Gauge('nginx_writing_connections', 'Number of connections in NGINX writing response to client', labelnames=['server'])
nginx_waiting_connections = Gauge('nginx_waiting_connections', 'Number of idle connections in NGINX waiting for request', labelnames=['server'])
# Metrics derived from nginx_reqstat (ngx_http_reqstat_module)
nginx_kv = Gauge('nginx_kv', 'kv', labelnames=['server'])
nginx_bytes_in = Gauge('nginx_bytes_in', 'Total bytes received from clients in NGINX', labelnames=['server'])
nginx_bytes_out = Gauge('nginx_bytes_out', 'Total bytes sent to clients in NGINX', labelnames=['server'])
nginx_conn_total = Gauge('nginx_conn_total', 'Total connections handled by NGINX', labelnames=['server'])
nginx_req_total = Gauge('nginx_req_total', 'Total requests handled by NGINX', labelnames=['server'])
nginx_http_2xx = Gauge('nginx_http_2xx', 'Total number of 2xx responses in NGINX', labelnames=['server'])
nginx_http_3xx = Gauge('nginx_http_3xx', 'Total number of 3xx responses in NGINX', labelnames=['server'])
nginx_http_4xx = Gauge('nginx_http_4xx', 'Total number of 4xx responses in NGINX', labelnames=['server'])
nginx_http_5xx = Gauge('nginx_http_5xx', 'Total number of 5xx responses in NGINX', labelnames=['server'])
nginx_http_other_status = Gauge('nginx_http_other_status', 'Total number of other responses in NGINX', labelnames=['server'])
nginx_rt = Gauge('nginx_rt', 'Total request time in NGINX', labelnames=['server'])
nginx_ups_req = Gauge('nginx_ups_req', 'Total requests to upstream in NGINX', labelnames=['server'])
nginx_ups_rt = Gauge('nginx_ups_rt', 'Total upstream request time in NGINX', labelnames=['server'])
nginx_ups_tries = Gauge('nginx_ups_tries', 'Total upstream tries in NGINX', labelnames=['server'])
nginx_http_200 = Gauge('nginx_http_200', 'Total number of 200 responses in NGINX', labelnames=['server'])
nginx_http_206 = Gauge('nginx_http_206', 'Total number of 206 responses in NGINX', labelnames=['server'])
nginx_http_302 = Gauge('nginx_http_302', 'Total number of 302 responses in NGINX', labelnames=['server'])
nginx_http_304 = Gauge('nginx_http_304', 'Total number of 304 responses in NGINX', labelnames=['server'])
nginx_http_403 = Gauge('nginx_http_403', 'Total number of 403 responses in NGINX', labelnames=['server'])
nginx_http_404 = Gauge('nginx_http_404', 'Total number of 404 responses in NGINX', labelnames=['server'])
nginx_http_416 = Gauge('nginx_http_416', 'Total number of 416 responses in NGINX', labelnames=['server'])
nginx_http_499 = Gauge('nginx_http_499', 'Total number of 499 responses in NGINX', labelnames=['server'])
nginx_http_500 = Gauge('nginx_http_500', 'Total number of 500 responses in NGINX', labelnames=['server'])
nginx_http_502 = Gauge('nginx_http_502', 'Total number of 502 responses in NGINX', labelnames=['server'])
nginx_http_503 = Gauge('nginx_http_503', 'Total number of 503 responses in NGINX', labelnames=['server'])
nginx_http_504 = Gauge('nginx_http_504', 'Total number of 504 responses in NGINX', labelnames=['server'])
nginx_http_508 = Gauge('nginx_http_508', 'Total number of 508 responses in NGINX', labelnames=['server'])
nginx_http_other_detail_status = Gauge('nginx_http_other_detail_status', 'Total number of other detailed status responses in NGINX', labelnames=['server'])
nginx_http_ups_4xx = Gauge('nginx_http_ups_4xx', 'Total number of upstream 4xx responses in NGINX', labelnames=['server'])
nginx_http_ups_5xx = Gauge('nginx_http_ups_5xx', 'Total number of upstream 5xx responses in NGINX', labelnames=['server'])
# Metrics derived from upstream_check (ngx_http_upstream_check_module)
total_servers = Gauge('total_servers', 'Total number of servers', labelnames=['server'])
up_servers = Gauge('up_servers', 'Number of servers up', labelnames=['server'])
down_servers = Gauge('down_servers', 'Number of servers down', labelnames=['server'])
generation = Gauge('generation', 'Generation of servers', labelnames=['server'])
server_up = Gauge('server_up', 'Whether server is up', ['index', 'upstream', 'name', 'type'])
def update_metrics():
    for server in NGINX_SERVERS:
        response_text = get_nginx_metrics(server, PATH['status'])
        parse_nginx_status(server, response_text)
        if server['type'] == 'loadbalancer':
            response_text = get_nginx_metrics(server, PATH['reqstat'])
            parse_nginx_reqstat(server, response_text)
            response_json = get_nginx_metrics(server, PATH['upstream'])
            parse_upstream_check(server, response_json)

def get_nginx_metrics(server, metric):
    try:
        response = requests.get(url=server["url"] + metric)
        if response.status_code == 200:
            return response.json() if metric == PATH['upstream'] else response.text
        else:
            print(f"Failed to fetch NGINX status: {response.status_code}")
    except Exception as e:
        print(f"Failed to fetch NGINX status: {str(e)}")
    return None

def parse_nginx_status(server, nginx_status):
    if not nginx_status:
        return
    active_connections = re.search(r'Active connections:\s+(\d+)', nginx_status)
    if active_connections:
        nginx_active_connections.labels(server=server["name"]).set(int(active_connections.group(1)))
    accepted_handled_requests_time = re.search(r'\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)', nginx_status)
    if accepted_handled_requests_time:
        accepted = int(accepted_handled_requests_time.group(1))
        handled = int(accepted_handled_requests_time.group(2))
        requests = int(accepted_handled_requests_time.group(3))
        request_time = int(accepted_handled_requests_time.group(4))
        nginx_accepted_connections.labels(server=server["name"]).set(accepted)
        nginx_handled_connections.labels(server=server["name"]).set(handled)
        nginx_total_requests.labels(server=server["name"]).set(requests)
        nginx_request_time.labels(server=server["name"]).set(request_time)
    reading = re.search(r'Reading:\s+(\d+)', nginx_status)
    if reading:
        nginx_reading_connections.labels(server=server["name"]).set(int(reading.group(1)))
    writing = re.search(r'Writing:\s+(\d+)', nginx_status)
    if writing:
        nginx_writing_connections.labels(server=server["name"]).set(int(writing.group(1)))
    waiting = re.search(r'Waiting:\s+(\d+)', nginx_status)
    if waiting:
        nginx_waiting_connections.labels(server=server["name"]).set(int(waiting.group(1)))

def parse_nginx_reqstat(server, nginx_status):
    if not nginx_status:
        return
    lines = nginx_status.strip().split('\n')
    for line in lines:
        kv, bytes_in, bytes_out, conn_total, req_total, http_2xx, http_3xx, http_4xx, http_5xx, http_other_status, rt, ups_req, ups_rt, ups_tries, http_200, http_206, http_302, http_304, http_403, http_404, http_416, http_499, http_500, http_502, http_503, http_504, http_508, http_other_detail_status, http_ups_4xx, http_ups_5xx = line.split(",")
        nginx_kv.labels(server=server["name"]).set(1)
        nginx_bytes_in.labels(server=server["name"]).set(bytes_in)
        nginx_bytes_out.labels(server=server["name"]).set(bytes_out)
        nginx_conn_total.labels(server=server["name"]).set(conn_total)
        nginx_req_total.labels(server=server["name"]).set(req_total)
        nginx_http_2xx.labels(server=server["name"]).set(http_2xx)
        nginx_http_3xx.labels(server=server["name"]).set(http_3xx)
        nginx_http_4xx.labels(server=server["name"]).set(http_4xx)
        nginx_http_5xx.labels(server=server["name"]).set(http_5xx)
        nginx_http_other_status.labels(server=server["name"]).set(http_other_status)
        nginx_rt.labels(server=server["name"]).set(rt)
        nginx_ups_req.labels(server=server["name"]).set(ups_req)
        nginx_ups_rt.labels(server=server["name"]).set(ups_rt)
        nginx_ups_tries.labels(server=server["name"]).set(ups_tries)
        nginx_http_200.labels(server=server["name"]).set(http_200)
        nginx_http_206.labels(server=server["name"]).set(http_206)
        nginx_http_302.labels(server=server["name"]).set(http_302)
        nginx_http_304.labels(server=server["name"]).set(http_304)
        nginx_http_403.labels(server=server["name"]).set(http_403)
        nginx_http_404.labels(server=server["name"]).set(http_404)
        nginx_http_416.labels(server=server["name"]).set(http_416)
        nginx_http_499.labels(server=server["name"]).set(http_499)
        nginx_http_500.labels(server=server["name"]).set(http_500)
        nginx_http_502.labels(server=server["name"]).set(http_502)
        nginx_http_503.labels(server=server["name"]).set(http_503)
        nginx_http_504.labels(server=server["name"]).set(http_504)
        nginx_http_508.labels(server=server["name"]).set(http_508)
        nginx_http_other_detail_status.labels(server=server["name"]).set(http_other_detail_status)
        nginx_http_ups_4xx.labels(server=server["name"]).set(http_ups_4xx)
        nginx_http_ups_5xx.labels(server=server["name"]).set(http_ups_5xx)

def parse_upstream_check(host, upstream_check):
    if not upstream_check:
        return
    servers_info = upstream_check.get("servers")
    total_servers.labels(server=host['name']).set(servers_info.get('total'))
    up_servers.labels(server=host['name']).set(servers_info.get('up'))
    down_servers.labels(server=host['name']).set(servers_info.get('down'))
    generation.labels(server=host['name']).set(servers_info.get('generation'))
    for server in servers_info.get('server', []):
        server_up.labels(str(server['index']), server['upstream'], server['name'], server['type']).set(1 if server['status'] == 'up' else 0)

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(port=8888, addr='0.0.0.0')
    while True:
        update_metrics()
        time.sleep(10)
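The script only needs the requests and prometheus_client libraries. Running it and checking the /metrics endpoint might look like this (a sketch; the file name tengine_exporter.py is assumed):
pip3 install requests prometheus_client
nohup python3 tengine_exporter.py &
curl -s http://192.168.31.200:8888/metrics | head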
- Prometheus main configuration file: prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.31.200:9090"]
  - job_name: "node"
    file_sd_configs:
      - files:
          - 'node_targets.json'
  - job_name: "tengine"
    file_sd_configs:
      - files:
          - 'tengine_targets.json'
  - job_name: "mysql" # To get metrics about the mysql exporter's targets
    file_sd_configs:
      - files:
          - 'mysql_targets.json'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # The mysqld_exporter host:port
        replacement: 192.168.31.200:9104
- Configure file-based service discovery so that metrics from new targets can be picked up without restarting Prometheus
[root@loadbalancer-001 prometheus-2.45.5.linux-amd64]# cat node_targets.json
[
  {
    "labels": {
      "job": "node"
    },
    "targets": [
      "192.168.31.200:9100"
    ]
  },
  {
    "labels": {
      "job": "node"
    },
    "targets": [
      "192.168.31.201:9100"
    ]
  }
]
[root@loadbalancer-001 prometheus-2.45.5.linux-amd64]# cat mysql_targets.json
[
  {
    "labels": {
      "job": "mysql"
    },
    "targets": [
      "192.168.31.200:30001",
      "192.168.31.200:30002"
    ]
  }
]
[root@loadbalancer-001 prometheus-2.45.5.linux-amd64]# cat tengine_targets.json
[
  {
    "labels": {
      "job": "tengine"
    },
    "targets": [
      "192.168.31.200:8888"
    ]
  }
]
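Prometheus re-reads these files automatically, so new targets show up without a restart. Validating the configuration, starting the server from the unpacked release directory and confirming the targets are being scraped might look like this (a sketch; promtool ships with the release tarball):
./promtool check config prometheus.yml
nohup ./prometheus --config.file=prometheus.yml &
curl -s 'http://192.168.31.200:9090/api/v1/targets?state=active' | head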
- Install and start the Grafana service
nohup bin/grafana server &
Alternatively, manage the process with systemd or supervisor.
- Visit localhost:3000 and complete the initial account setup
- Add the Prometheus service built above as a data source
- Create dashboards as needed (import Grafana community templates or build custom panels)
The dashboards here focus on the key metrics of the Tengine load balancer and the backend web servers:
- Request error rate (4xx and 5xx) over the last five minutes
promql:(rate(nginx_http_4xx[5m]) + rate(nginx_http_5xx[5m])) / rate(nginx_req_total[5m]) * 100
- Dropped connections: the number of dropped connections equals the difference between accepted and handled connections; under normal conditions it should be zero.
promql:nginx_accepted_connections - nginx_handled_connections
- Average request time, in milliseconds: Nginx's request time measures how long each request takes, from reading the first client byte to completing the request. Long response times usually point to problems upstream, i.e. on the backend servers.
promql:nginx_request_time / nginx_req_total
- Basic Nginx activity metrics
Accepts, Handled and Requests are counters that increase monotonically; Active, Waiting, Reading and Writing fluctuate with request volume.
- Backend server status: observe the availability of the backend servers; 1 means up, 0 means down
promql:server_up
- 5xx error rate: 5xx (server error) codes such as 502 Bad Gateway or 503 Service Temporarily Unavailable are worth monitoring, especially as a proportion of total response codes
promql:rate(nginx_http_ups_5xx[5m]) / rate(nginx_ups_req[5m]) * 100
- Active connections per upstream server: the number of active connections on each upstream server verifies that the reverse proxy distributes work across the group correctly. When Nginx is used as a load balancer, a significant skew in the connections handled by any one server may indicate that it is struggling to process requests in time, or that the configured load-balancing method (e.g. round robin or IP hash) leaves room for optimization for this traffic pattern.
promql:nginx_active_connections
Key metrics to watch for the MySQL service:
- Latency: slow queries
- Traffic: writes (Com_insert + Com_update + Com_delete), reads (Com_select), total statements (Questions)
- Errors: failed client connections and statements that fail during execution both need failure counters
- Saturation: the current connection count (Threads_connected) divided by the maximum (max_connections) gives the connection usage ratio, a saturation metric that deserves close monitoring. Also watch the InnoDB buffer pool: its usage and its in-memory hit rate. The buffer pool is a memory area dedicated to caching table and index data to speed up queries. A PromQL sketch for these follows below.
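A PromQL sketch for the two saturation signals above, assuming the standard metric names exposed by mysqld_exporter:
# connection usage ratio (%)
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
# InnoDB buffer pool hit rate (%): share of read requests served from memory
(1 - rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])) * 100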
For host-level metrics, focus mainly on CPU, memory, I/O, disk usage and network traffic.
The following configures an alert using the mysql_up metric as an example.
- Install PrometheusAlert: https://github.com/feiyu563/PrometheusAlert
- Configure the Feishu webhook: https://feiyu563.gitbook.io/prometheusalert/conf/conf-feishu
- Define the alert rule
- Set up the contact point
- Configure the notification policy
- Trigger the alert and receive the message in Feishu
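The rule here is defined in Grafana; as a rough alternative sketch, an equivalent rule on the Prometheus side would be a rule file referenced from rule_files in prometheus.yml (the file and alert names are placeholders):
# mysql_alerts.yml -- hypothetical rule file loaded via rule_files
groups:
  - name: mysql
    rules:
      - alert: MySQLDown             # placeholder alert name
        expr: mysql_up == 0          # mysqld_exporter reports 0 when it cannot reach MySQL
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL instance {{ $labels.instance }} is down"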
- ab stress testing
- MySQL cluster failover
- Website access log aggregation and analysis with an ELK stack
- Introduce a message queue for peak shaving
- Introduce Redis as a cache to improve database performance
- Build a CI/CD platform