Posted by 訾懵, 6 days ago

Building a Kubernetes Cluster and DevOps in Practice (Part 2): Deployment

This article walks through building a production-grade Kubernetes cluster step by step, covering CI/CD pipeline configuration, monitoring deployment, and troubleshooting, with runnable commands and configuration files.
Intended readers: operations engineers, DevOps engineers, and anyone who wants to build a K8s cluster hands-on.
Prerequisite reading: the companion Architecture Design article, which covers the overall architecture and technology choices.
Contents


[*]1. Environment Preparation
[*]2. Middleware Deployment
[*]3. Kubernetes Cluster Setup
[*]4. CI/CD Pipeline
[*]5. Application Deployment in Practice
[*]6. Monitoring and Logging
[*]7. Troubleshooting and Problem Solving
[*]8. Summary and Checklists
1. Environment Preparation

1.0 Network Planning

Network planning is the first step of the deployment; a sound network layout ensures security isolation and efficient communication.

1.0.1 VPC and Subnet Planning

A three-tier subnet layout provides network isolation:

VPC network: 10.0.0.0/16
├── entry subnet:      10.0.10.0/24 (public entry tier)
├── middleware subnet: 10.0.20.0/24 (middleware tier)
└── k8s subnet:        10.0.30.0/24 (application tier)

| Subnet | CIDR | Purpose | Servers |
|---|---|---|---|
| entry | 10.0.10.0/24 | public entry, ops management | entry-01, jumpserver |
| middleware | 10.0.20.0/24 | middleware services | middleware-01 |
| k8s | 10.0.30.0/24 | K8s cluster | master-01~03, node-01~02 |

K8s internal network plan:

| Network | CIDR | Purpose |
|---|---|---|
| Pod network | 172.16.0.0/16 | Pod IP allocation (managed by Calico) |
| Service network | 10.96.0.0/12 | Service ClusterIPs |

1.0.2 Security Group Configuration
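Before carving up the VPC, it is worth verifying that the planned VPC, Pod, and Service CIDRs do not overlap — overlapping ranges cause hard-to-debug routing failures. A quick check using python3's ipaddress module (this helper script is not from the original article):

```shell
#!/bin/sh
# Verify that the planned CIDR ranges are pairwise disjoint
python3 - <<'PYEOF'
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")    # VPC network
pod = ipaddress.ip_network("172.16.0.0/16")  # Pod network (Calico)
svc = ipaddress.ip_network("10.96.0.0/12")   # Service network

for a, b, name in [(vpc, pod, "vpc/pod"), (vpc, svc, "vpc/service"), (pod, svc, "pod/service")]:
    print(name, "overlap:", a.overlaps(b))   # all three should print False
PYEOF
```

All three lines should report False; a True here means the plan must change before any nodes are provisioned.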

Entry subnet security group (public entry):

| Direction | Port(s) | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 80, 443 | 0.0.0.0/0 | web access |
| Inbound | 10022 | ops IP allowlist | SSH management (non-standard port) |
| Outbound | ALL | 0.0.0.0/0 | allow all outbound |

K8s subnet security group:

| Direction | Port(s) | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 6443 | entry subnet | K8s API Server |
| Inbound | 30080, 30443 | entry subnet | Ingress NodePorts |
| Inbound | ALL | within k8s subnet | intra-cluster traffic |
| Inbound | ALL | 172.16.0.0/16 | Pod network traffic |
| Outbound | ALL | 0.0.0.0/0 | allow all outbound |

Middleware subnet security group:

| Direction | Port(s) | Source/Destination | Purpose |
|---|---|---|---|
| Inbound | 3306, 6379, 8848, etc. | k8s subnet | middleware service ports |
| Inbound | 10022 | entry subnet | SSH management |
| Outbound | ALL | 0.0.0.0/0 | allow all outbound |

1.0.3 Server-to-Server Access Rules

graph LR
    subgraph "entry subnet"
      Entry
      Jump
    end
    subgraph "middleware subnet"
      MW
    end
    subgraph "k8s subnet"
      Master
      Worker
    end
    Internet((Internet)) --> |80/443| Entry
    Entry --> |6443| Master
    Entry --> |30080| Worker
    Jump --> |10022| Master
    Jump --> |10022| MW
    Worker --> |3306/6379/8848| MW
    Master --> |3306/6379/8848| MW

1.1 Server Inventory

| Role | Hostname | Example IP | Spec | Notes |
|---|---|---|---|---|
| Entry | entry-01 | 10.0.10.10 | 2C/4G | Nginx + Squid proxy |
| Middleware | middleware-01 | 10.0.20.10 | 8C/32G | MySQL, Redis, etc. |
| K8s Master | k8s-master-01 | 10.0.30.10 | 4C/8G | control plane |
| K8s Master | k8s-master-02 | 10.0.30.11 | 4C/8G | control plane |
| K8s Master | k8s-master-03 | 10.0.30.12 | 4C/8G | control plane |
| K8s Worker | k8s-node-01 | 10.0.30.20 | 8C/32G | worker node |
| K8s Worker | k8s-node-02 | 10.0.30.21 | 8C/32G | worker node |
| JumpServer | jumpserver | 10.0.10.20 | 4C/8G | bastion host |

1.2 Infrastructure Initialization

Before deploying K8s, complete the base infrastructure configuration.

1.2.1 Base Server Configuration

Run on all servers:
#!/bin/bash
# Base server configuration script

# 1. Set the hostname (adjust per server role)
HOSTNAME="k8s-master-01"
hostnamectl set-hostname $HOSTNAME
echo "127.0.0.1 $HOSTNAME" >> /etc/hosts

# 2. Time zone configuration
timedatectl set-timezone Asia/Shanghai
timedatectl set-ntp yes

# 3. Kernel parameter tuning
cat > /etc/sysctl.d/local.conf << EOF
# File descriptors
fs.file-max = 512000

# TCP tuning
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.somaxconn = 4096
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_max_syn_backlog = 4096

# Enable BBR congestion control (requires the tcp_bbr kernel module; fq is the recommended qdisc)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
sysctl -p /etc/sysctl.d/local.conf

# 4. System resource limits
cat > /etc/security/limits.conf << EOF
*         hard    nofile      512000
*         soft    nofile      512000
root      hard    nofile      512000
root      soft    nofile      512000
EOF

# 5. SSH hardening
cat > /etc/ssh/sshd_config << EOF
Include /etc/ssh/sshd_config.d/*.conf
Port 10022
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
ClientAliveInterval 60
ClientAliveCountMax 5
EOF
systemctl restart sshd

2. Middleware Deployment

2.2 Docker Compose Configuration

#!/bin/bash
# Entry server Nginx configuration

# 1. Install Nginx
apt-get update
apt-get install -y nginx

# 2. Configure Nginx (the stream module enables TCP load balancing)
cat > /etc/nginx/nginx.conf << EOF
user www-data;
worker_processes auto;
pid /run/nginx.pid;

events {
    worker_connections 20480;
    multi_accept on;
}

# TCP load balancing (for the K8s API Server)
stream {   
    include /data/nginx/stream-sites-enabled/*;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    client_max_body_size 0;
   
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
   
    log_format main '[\$time_local] \$remote_addr -> '
                  '"\$request" \$status \$body_bytes_sent '
                  '"\$http_user_agent" \$request_time';
   
    access_log /data/nginx/logs/access.log main;
    error_log /data/nginx/logs/error.log;
   
    gzip on;
    gzip_types text/plain text/css application/json application/javascript;
   
    include /data/nginx/conf.d/*.conf;
    include /data/nginx/sites-enabled/*;
}
EOF

# 3. Create the directory layout
mkdir -p /data/nginx/{stream-sites-enabled,logs,sites-enabled,conf.d}
chown -R www-data:www-data /data/nginx

2.3 MySQL Optimization Configuration
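The 2.3 heading covers MySQL tuning; as a starting point, a minimal my.cnf sketch sized for the 8C/32G middleware host — every value here is an assumption to benchmark and adapt, not a setting from the original article:

```ini
[mysqld]
# Buffer pool: roughly half the RAM of a shared middleware host (assumption)
innodb_buffer_pool_size = 16G
innodb_log_file_size    = 1G
innodb_flush_log_at_trx_commit = 1
max_connections  = 1000
slow_query_log   = 1
long_query_time  = 1
character-set-server = utf8mb4
collation-server     = utf8mb4_general_ci
```

The compose file in §8.3 mounts a ./config/my.cnf into the MySQL container; a file like this is what it would contain.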

# K8s API Server TCP load balancing (port 6443)
cat > /data/nginx/stream-sites-enabled/k8s-apiserver.conf << EOF
upstream k8s-apiserver {
    server 10.0.30.10:6443 max_fails=3 fail_timeout=30s;
    server 10.0.30.11:6443 max_fails=3 fail_timeout=30s;
    server 10.0.30.12:6443 max_fails=3 fail_timeout=30s;
}

server {
    listen 6443;
    proxy_pass k8s-apiserver;
    # API watch connections are long-lived; keep the session timeout generous
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
}
EOF

2.4 Start the Middleware

# HTTP load balancing across the Ingress NodePort nodes
cat > /data/nginx/conf.d/k8s-ingress.conf << EOF
upstream ingress_nodes {
    server 10.0.30.20:30080;
    server 10.0.30.21:30080;
}
EOF

# Example application site configuration
cat > /data/nginx/sites-enabled/app.conf << EOF
server {
    listen 80;
    server_name app.example.com;
   
    location / {
      proxy_pass http://ingress_nodes;
      proxy_set_header Host \$host;
      proxy_set_header X-Real-IP \$remote_addr;
      proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
    }
}
EOF

systemctl reload nginx

#!/bin/bash
# JumpServer one-command deployment

# Quick deployment with the official script
curl -sSL https://resource.fit2cloud.com/jumpserver/jumpserver/releases/download/v3.10.17/quick_start.sh | bash

# Adjust the configuration (optional)
# vim /opt/jumpserver/config/config.txt
# Common settings:
# - HTTP_PORT=80
# - HTTPS_PORT=443
# - DOMAINS="jumpserver.example.com"

# Restart the services
cd /opt/jumpserver-installer-v3.10.17
./jmsctl.sh restart

#!/bin/bash
# Docker engine installation and configuration

# 1. Install dependencies
apt-get update
apt-get install -y ca-certificates curl gnupg lsb-release

# 2. Add Docker's official GPG key (via the Aliyun mirror)
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
    gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# 3. Add the Docker repository (signed-by must match the keyring created above)
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
    https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
    tee /etc/apt/sources.list.d/docker.list

# 4. Install Docker
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

# 5. Configure Docker (JSON allows no comments, so options are annotated here:
#    "registry-mirrors" is an optional pull-through mirror;
#    "data-root" relocates Docker's data directory)
mkdir -p /data/docker
cat > /etc/docker/daemon.json << EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  },
  "registry-mirrors": [
    "https://docker.m.daocloud.io"
  ],
  "data-root": "/data/docker"
}
EOF

# 6. Enable and start Docker
systemctl enable docker
systemctl restart docker

# 7. Verify the installation
docker info
docker compose version

5. Application Deployment in Practice

5.2 Ingress Route Configuration
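For the Ingress routing this subsection names, a minimal sketch of an Ingress resource sitting behind the NodePorts (30080/30443) that the entry Nginx forwards to — the host and backend Service names are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: default
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com          # must match the entry Nginx server_name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app            # illustrative Service name
                port:
                  number: 8080       # illustrative Service port
```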

#!/bin/bash
# Server hardening

# 1. Install fail2ban to block brute-force login attempts
apt-get install -y fail2ban

# 2. Configure fail2ban (note the [DEFAULT] and [sshd] section headers)
cat > /etc/fail2ban/jail.local << EOF
[DEFAULT]
ignoreip = 127.0.0.1/8 ::1
bantime = 3600
maxretry = 3
findtime = 600
banaction = iptables-multiport

[sshd]
enabled = true
port = 10022
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
EOF

# 3. Start fail2ban
systemctl enable fail2ban
systemctl restart fail2ban

# 4. Check status
fail2ban-client status sshd

5.3 Using ConfigMaps and Secrets

# Turn swap off immediately
swapoff -a

# Make it permanent: remove the swap line from fstab
sed -i '/swap/d' /etc/fstab

Referencing them in a Deployment:
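A minimal sketch of a ConfigMap and Secret together with a Deployment that consumes them via envFrom — all names and values are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: "prod"
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:
  DB_PASSWORD: "change-me"        # placeholder value
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 1
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: nginx:1.25        # placeholder image
          envFrom:                 # inject all keys as environment variables
            - configMapRef: { name: app-config }
            - secretRef: { name: app-secret }
```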
cat > /etc/modules-load.d/k8s.conf << EOF
# overlay: OverlayFS storage driver
overlay
# br_netfilter: bridge traffic filtering
br_netfilter
EOF

modprobe overlay
modprobe br_netfilter

# Verify
lsmod | grep -E "overlay|br_netfilter"

6. Monitoring and Logging

6.1 Prometheus + Grafana Deployment

6.1.1 Deploying Node Exporter
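Node Exporter typically runs as a DaemonSet with host networking so every node serves host metrics on :9100; a minimal sketch (namespace and image tag are assumptions):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels: { app: node-exporter }
  template:
    metadata:
      labels: { app: node-exporter }
    spec:
      hostNetwork: true               # expose :9100 directly on each node
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          args:
            - --path.rootfs=/host     # read host filesystem metrics from the mount below
          ports:
            - containerPort: 9100
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
      volumes:
        - name: rootfs
          hostPath: { path: / }
```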

cat > /etc/sysctl.d/k8s.conf << EOF
# Parameters required by K8s
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1

# Connection-tracking tuning
net.netfilter.nf_conntrack_max = 524288

# TCP tuning
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.core.somaxconn = 32768

# File descriptors
fs.file-max = 2097152
EOF

sysctl --system

6.1.2 Prometheus Configuration
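A minimal prometheus.yml sketch scraping the node exporters on the hosts from §1.1 (intervals and rule paths are assumptions):

```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

rule_files:
  - /etc/prometheus/rules/*.yml      # alert rules live here (assumption)

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:                     # node IPs from the server inventory
          - 10.0.30.10:9100
          - 10.0.30.11:9100
          - 10.0.30.12:9100
          - 10.0.30.20:9100
          - 10.0.30.21:9100
```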

cat >> /etc/security/limits.conf << EOF
# Kubernetes resource limits
* soft nofile 655360
* hard nofile 655360
* soft nproc 655360
* hard nproc 655360
EOF

6.2 Alert Rule Examples
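Two examples in Prometheus alerting-rule syntax (thresholds and durations are assumptions to tune):

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0                # target failed its scrape
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} memory usage above 90%"
```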

# Add the Docker apt repository (it also ships containerd.io)
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [signed-by=/etc/apt/keyrings/docker.gpg] \
https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
tee /etc/apt/sources.list.d/docker.list

apt-get update
apt-get install -y containerd.io

6.3 Log Collection

Collect container logs into Elasticsearch with Filebeat:
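A minimal filebeat.yml sketch shipping container logs to the Elasticsearch on the middleware host (it assumes Filebeat runs as a DaemonSet with NODE_NAME injected; the address, index naming, and lack of authentication are assumptions):

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:     # enrich events with pod/namespace labels
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: /var/log/containers/

output.elasticsearch:
  hosts: ["http://10.0.20.10:9200"]  # middleware host from §1.1
  index: "k8s-logs-%{+yyyy.MM.dd}"

setup.template.name: "k8s-logs"      # required when overriding the index name
setup.template.pattern: "k8s-logs-*"
```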
mkdir -p /etc/containerd

cat > /etc/containerd/config.toml << 'EOF'
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Use a domestic mirror for the pause (sandbox) image
  sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      # Use systemd as the cgroup driver (matches the kubelet default)
      SystemdCgroup = true

  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://docker.m.daocloud.io"]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
      endpoint = ["https://k8s.m.daocloud.io"]
EOF

systemctl daemon-reload
systemctl restart containerd
systemctl enable containerd

7. Troubleshooting and Problem Solving

7.1 Quick Reference: Common Problems

| Symptom | Likely cause | Diagnosis | Fix |
|---|---|---|---|
| Pod stuck Pending | insufficient resources | kubectl describe pod | add nodes or adjust resource requests |
| Pod CrashLoopBackOff | application fails to start | kubectl logs | check app config and dependencies |
| ImagePullBackOff | image pull failure | kubectl describe pod | check image address and credentials |
| Service unreachable | empty Endpoints | kubectl get endpoints | check Pod labels and selector |
| Ingress 502 | backend Pods not ready | kubectl get pods | check readinessProbe |
| OOMKilled | out of memory | kubectl describe pod | raise the memory limit |
| Node NotReady | network or kubelet problem | kubectl describe node | check kubelet and the CNI plugin |
| DNS failures | CoreDNS problem | kubectl logs -n kube-system -l k8s-app=kube-dns | restart CoreDNS |

7.2 Diagnostic Command Cheat Sheet

# Add the Aliyun Kubernetes apt repository
curl -fsSL https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | \
gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] \
https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main' | \
tee /etc/apt/sources.list.d/kubernetes.list

# Install a pinned version
apt-get update
apt-get install -y kubelet=1.28.6-1.1 kubeadm=1.28.6-1.1 kubectl=1.28.6-1.1

# Hold the packages to prevent accidental upgrades
apt-mark hold kubelet kubeadm kubectl

# Enable kubelet
systemctl enable kubelet

7.3 Case Studies

Case 1: Pod repeatedly OOMKilled

Symptom: the Pod restarts every few hours with status OOMKilled.
Diagnosis:
apt-get install -y chrony

cat > /etc/chrony/chrony.conf << 'EOF'
server ntp.aliyun.com iburst
server ntp.tencent.com iburst
driftfile /var/lib/chrony/chrony.drift
makestep 1.0 3
rtcsync
EOF

systemctl restart chrony
systemctl enable chrony

Root cause: the JVM heap settings did not match the K8s memory limit.
Fix:
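The typical fix is to let the JVM derive its heap from the cgroup memory limit instead of a fixed -Xmx; a sketch of the relevant Deployment fragment (names, image, and percentages are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app                      # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels: { app: java-app }
  template:
    metadata:
      labels: { app: java-app }
    spec:
      containers:
        - name: app
          image: eclipse-temurin:17-jre   # placeholder image
          env:
            - name: JAVA_TOOL_OPTIONS
              # cap the heap at a fraction of the container limit instead of a fixed -Xmx
              value: "-XX:MaxRAMPercentage=75.0"
          resources:
            requests:
              memory: "1Gi"
            limits:
              memory: "2Gi"           # headroom left for metaspace and off-heap
```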
cat >> /etc/hosts << EOF
10.0.30.10 k8s-master-01
10.0.30.11 k8s-master-02
10.0.30.12 k8s-master-03
10.0.30.20 k8s-node-01
10.0.30.21 k8s-node-02
10.0.10.10 k8s-api-lb
EOF

Case 2: Cross-node Pod communication fails with 504

Symptom: Pod-to-Pod traffic within one node works, but cross-node requests return 504 timeouts at roughly a 50% failure rate.
Quick diagnosis:
# Hot-data disk (SSD): MySQL, Redis
mkdir -p /data/hot
mount /dev/vdb1 /data/hot

# Cold-data disk (HDD): Elasticsearch, MinIO
mkdir -p /data/cold
mount /dev/vdc1 /data/cold

# Persist the mounts in fstab
echo '/dev/vdb1 /data/hot ext4 defaults 0 0' >> /etc/fstab
echo '/dev/vdc1 /data/cold ext4 defaults 0 0' >> /etc/fstab

Root cause: the cloud platform's security group allowed only the node network, not the Pod network CIDR.
Fix: add security-group rules:

[*]Inbound: ANY from the Pod network CIDR (e.g. 172.16.0.0/16)
[*]Outbound: ANY to the Pod network CIDR (e.g. 172.16.0.0/16)
A detailed walkthrough is in the companion Troubleshooting in Practice article.
8. Summary and Checklists

8.1 Pre-Deployment Checklist

| Check | Command / action | Expected result |
|---|---|---|
| system time synchronized | timedatectl | System clock synchronized: yes |
| swap disabled | free -h | Swap row all zeros |
| kernel modules loaded | lsmod \| grep br_netfilter | non-empty output |
| containerd running | systemctl status containerd | active (running) |
| kubelet enabled | systemctl is-enabled kubelet | enabled |
| network connectivity | ping between nodes | all nodes reachable |
| image registry reachable | crictl pull nginx | pull succeeds |

8.2 Post-Deployment Verification Checklist

| Check | Command | Expected result |
|---|---|---|
| all nodes Ready | kubectl get nodes | STATUS is Ready everywhere |
| system Pods healthy | kubectl get pods -n kube-system | all Running |
| CNI plugin healthy | kubectl get pods -n calico-system | all Running |
| DNS resolution | kubectl run test --rm -it --image=busybox -- nslookup kubernetes | resolves |
| cross-node traffic | create two Pods and ping between them | traffic flows |
| Ingress works | create a test Ingress and access it | responds normally |

8.3 Command Cheat Sheet

# docker-compose.yml
version: '3'
services:
  mysql:
    image: mysql:8.0
    restart: always
    ports:
      - 3306:3306
    volumes:
      - /data/hot/mysql:/var/lib/mysql
      - ./config/my.cnf:/etc/mysql/conf.d/my.cnf
    environment:
      MYSQL_ROOT_PASSWORD: ${MYSQL_PASSWORD}
      TZ: Asia/Shanghai
    networks:
      - middleware

  redis:
    image: redis:7.2
    restart: always
    ports:
      - 6379:6379
    volumes:
      - /data/hot/redis:/data
    command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
    networks:
      - middleware

  nacos:
    image: nacos/nacos-server:v2.3.2
    restart: always
    depends_on:
      - mysql
    environment:
      MODE: standalone
      NACOS_AUTH_ENABLE: "true"
      SPRING_DATASOURCE_PLATFORM: mysql
      MYSQL_SERVICE_HOST: mysql
      MYSQL_SERVICE_DB_NAME: nacos
      MYSQL_SERVICE_USER: root
      MYSQL_SERVICE_PASSWORD: ${MYSQL_PASSWORD}
    ports:
      - 8848:8848
      - 9848:9848
    networks:
      - middleware

  rabbitmq:
    image: rabbitmq:3.12-management
    restart: always
    ports:
      - 5672:5672
      - 15672:15672
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
    volumes:
      - /data/hot/rabbitmq:/var/lib/rabbitmq
    networks:
      - middleware

  elasticsearch:
    image: elasticsearch:7.17.19
    restart: always
    volumes:
      - /data/cold/elasticsearch:/usr/share/elasticsearch/data
    environment:
      discovery.type: single-node
      ES_JAVA_OPTS: -Xms2g -Xmx2g
    ports:
      - 9200:9200
    networks:
      - middleware

networks:
  middleware:
    driver: bridge

8.4 Key Configuration File Locations

| Item | Path |
|---|---|
| kubeadm admin config | /etc/kubernetes/admin.conf |
| kubelet config | /var/lib/kubelet/config.yaml |
| containerd config | /etc/containerd/config.toml |
| Calico config | kubectl get installation default -o yaml |
| kubectl config | ~/.kube/config |

Keywords: Kubernetes, deployment practice, CI/CD, monitoring, troubleshooting

