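Kubectl Describe: A Go-To Tool for K8s Troubleshooting

Alongside kubectl logs and kubectl get events, kubectl describe is an essential command for troubleshooting K8s. It is a bit like docker inspect: its output is more detailed than kubectl get and easier to read than raw YAML. This post is long, but it is all practical material, so stick with it to the end and you will get something out of it!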

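1. Basic usage

1.1 Viewing Pod details

# Show full details for a Pod (the form opsnot uses most)
kubectl describe pod opsnot-postgresql

# Specify a namespace
kubectl describe pod my-pod -n opsnot-postgresql

# Describe every Pod in the current namespace (opsnot.com reminder: use with care, the output is huge!)
kubectl describe pods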

1.2 Viewing other resources

# Describe a Deployment (used a lot at opsnot.com)
kubectl describe deployment opsnot-mariadb

# Describe a Service
kubectl describe service opsnot-service

# Describe a Node
kubectl describe node node-mariadb

# Describe a ConfigMap
kubectl describe configmap mariadb-config

# Describe a Secret (the values are hidden)
kubectl describe secret mariadb-secret

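2. Pod troubleshooting tips

2.1 Checking Pod events (the key part)

# Describe the Pod and focus on the Events section at the very bottom
kubectl describe pod ops-elasticsearch -n prod

# opsnot.com experience: Events are the key to troubleshooting!
# For ImagePullBackOff and CrashLoopBackOff, the cause usually shows up here

Hands-on case: a Pod was stuck in Pending, and describe showed this under Events:

Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient memory.

Clearly no node has enough unreserved memory to satisfy the Pod's requests. Lower the requests (or add node capacity) and the Pod gets scheduled.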
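When the describe dump is long, it can help to cut straight to the Events block. Here is a minimal sketch, assuming Events is the last section of the output (as it is for Pods); events_section is a helper name made up for this example, not something kubectl provides:

```shell
# events_section: print everything from the "Events:" line to the end of
# the describe output read on stdin (hypothetical helper, not a kubectl feature).
events_section() {
    awk '/^Events:/ { found = 1 } found'
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl describe pod ops-elasticsearch -n prod | events_section
```

Unlike grep -A N, this does not need a guess at how many event lines there will be.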

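2.2 Checking container state

# describe pod shows:
 - Container state (Running/Waiting/Terminated)
 - Restart count
 - Exit code
 - Reason for the last restart

kubectl describe pod my-elasticsearch | grep -A 10 "State:"
kubectl describe pod my-elasticsearch | grep "Restart Count"

Hands-on case: a container kept restarting, and describe showed:

Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137

Exit code 137 means the container was killed with SIGKILL (128 + 9). Together with Reason: OOMKilled, that means it ran past its memory limit, so raise the limit (or fix the leak) and it is OK. If the Reason is only Error, check the Events as well, because other things can also send SIGKILL.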
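The number 137 is not arbitrary: by convention, a process killed by a signal is reported with exit code 128 plus the signal number, so 137 maps to signal 9 (SIGKILL), which is what the kernel OOM killer sends. A small sketch of that arithmetic (exit_code_signal is a name invented for this example):

```shell
# exit_code_signal: for exit codes above 128, print the signal number
# (code - 128); print nothing for ordinary exit codes.
exit_code_signal() {
    local code="$1"
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    fi
}

exit_code_signal 137   # 137 - 128 = 9, i.e. SIGKILL
exit_code_signal 143   # 143 - 128 = 15, i.e. SIGTERM
```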

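2.3 Checking resource settings

# describe shows requests and limits
kubectl describe pod opsnot-redis | grep -A 5 "Limits:"
kubectl describe pod opsnot-redis | grep -A 5 "Requests:"

# Check the QoS class derived from requests and limits (Guaranteed / Burstable / BestEffort)
kubectl describe pod ops-not-kafka | grep "QoS Class"

Hands-on case: when the cluster is short on resources, these commands show what each Pod has reserved. Compare that with actual usage from kubectl top pod to find Pods whose requests are far higher than what they really use, then trim them to free up capacity and cut waste.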

3. Node troubleshooting

3.1 Checking Node health

# Describe the Node
kubectl describe node worker-node-1

# Focus on the Conditions section
kubectl describe node worker-node-1 | grep -A 10 "Conditions:"

# Common conditions:
# Ready: True/False
# MemoryPressure: True/False
# DiskPressure: True/False
# PIDPressure: True/False

Hands-on case: Pods could not be scheduled onto a node, and describe node showed:

DiskPressure: True
Message: kubelet has disk pressure

Clearly the disk is full or close to it, so clean it up. The usual culprit is runaway logs.
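To scan many nodes without reading every Conditions table by eye, the check above can be scripted. This is only a sketch, and it assumes the tabular Conditions layout that kubectl prints (Type in the first column, Status in the second); pressure_conditions is a hypothetical helper name:

```shell
# pressure_conditions: read `kubectl describe node` output on stdin and print
# the name of every *Pressure condition whose Status column is True.
pressure_conditions() {
    awk '
        /^Conditions:/     { in_cond = 1; next }   # section starts here
        in_cond && /^[^ ]/ { in_cond = 0 }         # next top-level key ends it
        in_cond && $1 ~ /Pressure$/ && $2 == "True" { print $1 }
    '
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl describe node worker-node-1 | pressure_conditions
```

An empty result means no pressure condition is currently set on that node.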

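3.2 Checking Node resource allocation

# Show how much of the Node is already reserved by requests (recommended by 加班哥)
kubectl describe node worker-node-1 | grep -A 20 "Allocated resources:"

# It shows something like this (simplified):
# CPU Requests: 1200m (60% of 2 cores)
# Memory Requests: 4Gi (50% of 8Gi)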
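The percentages are simply the summed requests divided by what the node can allocate. A quick sketch of the arithmetic behind the two sample lines (request_percent is a made-up helper, and both arguments must be in the same unit):

```shell
# request_percent: integer percentage of a node's allocatable capacity that
# is already reserved by requests. Both arguments use the same unit.
request_percent() {
    local requested="$1"
    local allocatable="$2"
    echo $(( requested * 100 / allocatable ))
}

request_percent 1200 2000   # CPU in millicores: 1200m of 2 cores -> 60
request_percent 4096 8192   # memory in Mi: 4Gi of 8Gi -> 50
```

These numbers describe reservations, not live usage, so a node can be "full" for the scheduler while sitting mostly idle.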

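3.3 Listing the Pods on a Node

# describe node lists every Pod on the node along with its requests and limits (strongly recommended by 加班哥: really handy!!!)
kubectl describe node worker-node-1 | grep -A 50 "Non-terminated Pods:"

# The output usually looks like this: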
Non-terminated Pods:          (10 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  default                     nginx-opsnot-abc12  100m (2%)     200m (5%)   128Mi (1%)       256Mi (3%)     5d
  default                     redis-xyz34         50m (1%)      100m (2%)   64Mi (0%)        128Mi (1%)     3d
  kube-system                 kube-proxy-5678     100m (2%)     0 (0%)      64Mi (0%)        0 (0%)         15d
  kube-system                 coredns-1234        100m (2%)     200m (5%)   70Mi (0%)        170Mi (2%)     15d
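One thing this table makes easy to spot is Pods running without a CPU limit, such as kube-proxy in the sample. A rough sketch that assumes the column layout shown above; pods_without_cpu_limit is a hypothetical helper name:

```shell
# pods_without_cpu_limit: read the "Non-terminated Pods" table on stdin and
# print namespace/name for rows whose CPU Limits column is "0 (0%)".
pods_without_cpu_limit() {
    awk 'NF >= 10 && $5 == "0" && $6 == "(0%)" { print $1 "/" $2 }'
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl describe node worker-node-1 | pods_without_cpu_limit
```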

4. Service troubleshooting

4.1 Checking Service configuration

# Describe the Service
kubectl describe service ops-not-service

# Focus on:
# - Selector: which Pods it matches
# - Endpoints: the Pod IPs actually behind it
# - Port settings
kubectl describe svc my-service | grep Selector
kubectl describe svc my-service | grep Endpoints

Hands-on case: a Service was unreachable, and describe showed that Endpoints was empty:

Endpoints: <none>

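Usually the Selector is wrong and does not match the Pods' Labels, so fixing the Selector or the Labels solves it. If the Labels do match, check whether the Pods are Ready, since Pods that are not Ready are left out of Endpoints.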
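The matching rule is simple: every key=value pair in the Selector must appear in the Pod's Labels, and extra Labels on the Pod do not matter. A minimal sketch of that rule for equality-based selectors, using the comma-separated form that describe svc and kubectl get pods --show-labels print (selector_matches is a name invented here):

```shell
# selector_matches SELECTOR LABELS: succeed if every key=value pair in the
# comma-separated SELECTOR is also present in the comma-separated LABELS.
selector_matches() {
    local selector="$1"
    local labels="$2"
    local pair
    local IFS=','
    for pair in $selector; do
        case ",$labels," in
            *",$pair,"*) ;;          # this pair is present, keep checking
            *) return 1 ;;           # one missing pair means no match
        esac
    done
    return 0
}

selector_matches "app=web,tier=frontend" "app=web,tier=frontend,pod-template-hash=abc12" && echo "match"
selector_matches "app=web" "app=wbe,tier=frontend" || echo "no match"
```

The second call shows the classic cause of empty Endpoints: a one-character typo in a Label.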

4.2 Checking Service type and ports

# Check how the Service is exposed
kubectl describe svc ops-not-service | grep Type
# Type: ClusterIP / NodePort / LoadBalancer

# Check the port mappings (matches Port, TargetPort and NodePort)
kubectl describe svc ops-not-service | grep Port

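5. Deployment/StatefulSet troubleshooting

5.1 Checking replica status

# Describe the Deployment
kubectl describe deployment opsnot-rabbitmq

# Focus on:
# - Replicas: desired count vs. what is actually running
# - Conditions: rollout status
# - Events: rolling update history
kubectl describe deploy opsnot-rabbitmq | grep Replicas
kubectl describe deploy opsnot-rabbitmq | grep -A 5 "Conditions:"

Hands-on case: after a release only some of the Pods were updated, and describe showed:

Replicas: 3 desired | 2 updated | 3 total | 2 available
Conditions:
  Progressing: False
  Reason: ProgressDeadlineExceeded

This almost always means the new image is broken and its Pods cannot start. Roll back right away with kubectl rollout undo, then dig into the failing Pod.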
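The Replicas line is easy to check in a script, for example as a gate after a release. A sketch under the assumption that the line has the "N desired | N updated | N total | N available" shape shown above; rollout_complete is a hypothetical helper name:

```shell
# rollout_complete: read `kubectl describe deployment` output on stdin and
# succeed only if the desired, updated and available counts all agree.
rollout_complete() {
    awk '
        # Fields: $2 = desired, $5 = updated, $11 = available
        /^Replicas:/ { seen = 1; ok = ($2 == $5 && $2 == $11) }
        END { if (seen && ok) exit 0; exit 1 }
    '
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl describe deploy opsnot-rabbitmq | rollout_complete || echo "rollout incomplete"
```

For real pipelines, kubectl rollout status is the purpose-built command; this only shows how to read the describe line.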

5.2 Checking the rolling update strategy

# Show the update strategy
kubectl describe deploy opsnot-rabbitmq | grep -A 3 "StrategyType:"

# Sample output:
# StrategyType:           RollingUpdate
# MinReadySeconds:        0
# RollingUpdateStrategy:  25% max unavailable, 25% max surge

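6. PVC/PV storage troubleshooting

6.1 Checking PVC status

# Describe the PVC
kubectl describe pvc opsnot-pvc

# Focus on:
# - Status: Bound/Pending
# - Volume: name of the bound PV
# - Capacity: actual capacity
# - Events: why binding failed

Hands-on case: a PVC was stuck in Pending, and describe showed:

Events:
  Warning  ProvisioningFailed  no volume plugin matched

This is almost always a misconfigured StorageClass (wrong name or provisioner); fix it and the PVC binds.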

6.2 Checking PV details

# Describe the PV
kubectl describe pv opsnot-pv-name

# Focus on:
# - Status: Available/Bound/Released
# - Claim: which PVC is using it
# - Reclaim Policy: Delete/Retain
# - Access Modes: ReadWriteOnce/ReadWriteMany

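7. ConfigMap/Secret troubleshooting

7.1 Checking a ConfigMap

# Describe the ConfigMap
kubectl describe configmap my-config

# Lists the key-value pairs (opsnot.com reminder: with a lot of data, what you see gets cut off)
kubectl describe cm my-config | grep -A 20 "Data"

# Why is it cut off?
The grep -A 20 above prints only the 20 lines after the match. On top of that, kubectl describe is built as a human-readable summary, so it may abbreviate some long fields, such as long annotation values. The trimming is there to:

keep the terminal from being flooded with output
keep the result quick to scan

# What if the ConfigMap output is cut off?
Just look at the YAML, which is always complete: kubectl get cm my-config -o yaml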

7.2 Checking a Secret

# Describe the Secret (the data is hidden)
kubectl describe secret my-secret

# Sample output:
# Data
# ====
# password: 16 bytes
# username: 8 bytes

# Note: the real values are never shown, only their sizes
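If you do need the real value, it is stored base64-encoded in the Secret's data field, so it can be read with kubectl get and decoded. A sketch that assumes you have permission to read the Secret; decode_secret_value is a made-up helper name, and the key name password comes from the sample output above:

```shell
# decode_secret_value: base64-decode one Secret value read from stdin.
decode_secret_value() {
    base64 --decode
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl get secret my-secret -o jsonpath='{.data.password}' | decode_secret_value

# Local demonstration with a harmless string:
printf 'aGVsbG8=' | decode_secret_value   # prints: hello
```

Keep in mind that base64 is encoding, not encryption, so treat the decoded output as sensitive.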

8. Ingress troubleshooting

8.1 Checking Ingress rules

# Describe the Ingress
kubectl describe ingress my-ingress

# Focus on:
# - Rules: routing rules
# - Backend: backend Service
# - Events: configuration update history
kubectl describe ing my-ingress | grep -A 20 "Rules:"

8.2 Checking the Ingress address

# Show the IP assigned to the Ingress
kubectl describe ingress opsnot-ingress | grep Address

# Show the TLS configuration
kubectl describe ing opsnot-ingress | grep -A 5 "TLS:"

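9. Practical tips

9.1 Quickly locating problem Pods

#!/bin/bash
# Troubleshooting script: print the Events of every Pod that is not running normally
NS=${1:-default}

echo "=== Looking for abnormal Pods in namespace $NS ==="

# Make sure the namespace exists
if ! kubectl get ns "$NS" > /dev/null 2>&1; then
    echo "Error: namespace $NS does not exist!"
    exit 1
fi

# Collect Pods whose phase is neither Running nor Succeeded
# (CrashLoopBackOff Pods still report phase Running, so check those with kubectl get pods)
PODS=$(kubectl get pods -n "$NS" --field-selector=status.phase!=Running,status.phase!=Succeeded -o name 2>/dev/null)

if [ -z "$PODS" ]; then
    echo "加班哥 found no abnormal Pods"
    exit 0
fi

for pod in $PODS; do
    echo "--- $pod ---"
    kubectl describe "$pod" -n "$NS" | grep -A 15 "Events:"
    echo "================================"
done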
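One caveat with filtering on status.phase: a Pod in CrashLoopBackOff normally still reports phase Running, so a phase filter will not list it. A complementary sketch that filters the plain kubectl get pods output instead, assuming the default NAME / READY / STATUS columns; unhealthy_pods is a hypothetical helper name:

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print Pods that are neither Completed nor Running with all containers ready.
unhealthy_pods() {
    awk '{
        split($2, ready, "/")                 # READY column, e.g. 1/2
        if ($3 == "Completed") next           # finished Jobs are fine
        if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Usage (needs a cluster, so it is left commented out):
#   kubectl get pods -n prod --no-headers | unhealthy_pods
```

Each name it prints can then be passed to kubectl describe pod to read its Events.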

9.2 Checking resources in bulk

# Check the status of every Node
for node in $(kubectl get nodes -o name); do
    echo "=== $node ==="
    kubectl describe "$node" | grep -A 10 "Conditions:"
done

# Check the Endpoints of every Service in a namespace
kubectl get svc -n prod -o name | while read -r svc; do
    echo "$svc:"
    kubectl describe "$svc" -n prod | grep Endpoints
done

9.3 Viewing recent events

# Most recent events for a Pod (sorted by time)
kubectl describe pod ops-not-pod | grep -A 50 "Events:" | tail -20

# Events for all resources
kubectl get events --sort-by=.metadata.creationTimestamp

# opsnot experience: read describe and events together
kubectl describe pod ops-not-pod && kubectl get events --field-selector involvedObject.name=ops-not-pod

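9.4 Exporting full details for troubleshooting

# Export full details for a Pod
kubectl describe pod ops-not-pod > pod-describe.txt

# Export everything in the "all" category (Pods, Services, Deployments and other workloads; ConfigMaps, Secrets, Ingresses and PVCs are not included)
kubectl describe all -n production > cluster-info.txt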

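10. Checklist for common problems

10.1 Pod will not start

# 1. Check the Pod status first
kubectl get pod ops-not-mysql

# 2. Describe it and read the Events
kubectl describe pod ops-not-mysql

# Common causes:
 - ImagePullBackOff: the image could not be pulled
 - CrashLoopBackOff: the container keeps exiting shortly after it starts
 - Pending: not enough resources, or scheduling failed
 - Error: misconfiguration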

10.2 Service unreachable

# 1. Check the Service's Endpoints
kubectl describe svc opsnot-nginx-service | grep Endpoints

# 2. If Endpoints is empty, check the Selector against the Pod Labels
kubectl describe svc opsnot-nginx-service | grep Selector
kubectl get pods --show-labels

# 3. Check that the Pods are Ready (use the label from the Selector)
kubectl get pods -l app=opsnot-blog

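10.3 Network problems

# 1. Check the Pod IP
kubectl describe pod my-pod | grep "IP:"

# 2. Check the Service ClusterIP
kubectl describe svc my-service | grep "IP:"

# 3. Check DNS (describe pod does not print the Pod's DNS settings, so read the
#    dnsPolicy from the spec and check the cluster DNS Service, named kube-dns in most clusters)
kubectl get pod my-pod -o jsonpath='{.spec.dnsPolicy}'
kubectl describe svc kube-dns -n kube-system | grep Endpoints

# 加班哥's troubleshooting flow:
# Pod -> Service -> Ingress, one layer at a time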

10.4 Insufficient resources

# 1. Check Node resources
kubectl describe nodes | grep -A 10 "Allocated resources:"

# 2. Check the Pod's resource settings
kubectl describe pod opsnot-pod | grep -A 5 "Limits:"

# 3. Look for resource shortage warnings in the Events
kubectl describe pod opsnot-pod | grep "Insufficient"

11. Advanced usage

11.1 Combining with other commands

# describe + logs
kubectl describe pod my-pod && kubectl logs my-pod --tail=50

# describe + exec
kubectl describe pod my-pod
kubectl exec -it my-pod -- sh

# describe + top to compare settings with actual usage
kubectl describe pod my-pod
kubectl top pod my-pod

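11.2 Real-time monitoring with watch

# Watch Pod events in real time (加班哥 uses this a lot)
watch -n 2 'kubectl describe pod opsnot-pod | grep -A 10 "Events:"'

# Watch Node allocation in real time (quote the whole pipeline so grep runs inside watch)
watch 'kubectl describe node worker-1 | grep -A 20 "Allocated resources:"'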

11.3 Formatting the key information

#!/bin/bash
# Personal script: quick view of a Pod's key information
POD=${1:?usage: $0 POD [NAMESPACE]}
NS=${2:-default}

echo "=== Basic info ==="
kubectl describe pod "$POD" -n "$NS" | grep -E "Status:|IP:|Node:"

echo -e "\n=== Container state ==="
kubectl describe pod "$POD" -n "$NS" | grep -A 3 "State:"

echo -e "\n=== Restarts ==="
kubectl describe pod "$POD" -n "$NS" | grep "Restart Count"

echo -e "\n=== Recent events ==="
kubectl describe pod "$POD" -n "$NS" | grep -A 15 "Events:" | tail -10

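12. Performance tips

# describe prints a lot, so pipe it through a filter for the key fields
kubectl describe pod ops-not-pod | grep -E "Status:|Events:|State:|Restart"

# Limit the query to one namespace instead of searching everywhere
kubectl describe pods -n opsnot-namespace

# Use get -o wide for a quick overview, then describe the Pod you care about
kubectl get pods -o wide
kubectl describe pod ops-not-pod

# opsnot.com advice: locate the problem quickly with get, then dig in with describe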

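13. Wrapping up

kubectl describe output is extremely rich, and it really is a great tool for troubleshooting. The content above covers most everyday describe scenarios. Thanks for reading patiently to the end. If it helped, please like and share, and 加班哥 will keep working overtime to turn out more practical content!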
