1.1 kubectl describe troubleshooting techniques
1) Purpose of kubectl describe
kubectl describe shows a resource's detailed information and running state.
We can use the resource's status and event information to determine the cause of a problem.
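Common invocations, for reference (the pod, node, and namespace names in angle brackets are placeholders, not values from this lab):

```shell
# Describe a specific pod in the current namespace
kubectl describe pod <pod-name>

# Describe a pod in another namespace
kubectl describe pod <pod-name> -n <namespace>

# Describe every resource defined in a manifest file
kubectl describe -f <manifest.yaml>

# describe also works for other resource kinds, e.g. nodes
kubectl describe node <node-name>
```

The Events section at the bottom of the output is usually the fastest path to the root cause.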
2) Hands-on example
1. Write the resource manifest and create it
[root@master231 pods]# cat 03-pods-troubleshooting-describe.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-describe
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v11111111111111111111111111111
[root@master231 pods]# kubectl apply -f 03-pods-troubleshooting-describe.yaml
pod/troubleshooting-describe created
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-describe 0/1 ErrImagePull 0 6s 10.100.1.7 worker232 <none> <none>
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-describe 0/1 ImagePullBackOff 0 18s 10.100.1.7 worker232 <none> <none>
2. Inspect the error information
[root@master231 pods]# kubectl describe pod troubleshooting-describe
Name: troubleshooting-describe
Namespace: default
Priority: 0
Node: worker232/10.0.0.232
...
Events: # check the event information
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 69s default-scheduler Successfully assigned default/troubleshooting-describe to worker232
Normal Pulling 25s (x3 over 68s) kubelet Pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111"
Warning Failed 24s (x3 over 68s) kubelet Failed to pull image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111": rpc error: code = Unknown desc = Error response from daemon:
manifest for registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111 not found: manifest unknown: manifest unknown
Warning Failed 24s (x3 over 68s) kubelet Error: ErrImagePull
Normal BackOff 12s (x3 over 67s) kubelet Back-off pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111"
Warning Failed 12s (x3 over 67s) kubelet Error: ImagePullBackOff
3. Error analysis
From the 'Events' information above, it is clear the failure was caused by a failed image pull.
Troubleshooting approach:
- 1. The image name or tag may be misspelled; check the image reference;
- 2. Pulling from a private registry without credentials also produces this error; check whether authentication is required;
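For the private-registry case, credentials can be supplied through an image pull Secret. A minimal sketch, assuming a hypothetical registry address, account, and Secret name (harbor.example.com, jenny, my-registry-cred):

```shell
# Create a docker-registry type Secret holding the registry credentials
kubectl create secret docker-registry my-registry-cred \
  --docker-server=harbor.example.com \
  --docker-username=jenny \
  --docker-password='<password>'
```

The Pod then references it under spec.imagePullSecrets (an entry of `- name: my-registry-cred`), and the kubelet uses those credentials when pulling the image.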
4. Clean up the environment
[root@master231 pods]# kubectl delete -f 03-pods-troubleshooting-describe.yaml
pod "troubleshooting-describe" deleted
1.2 kubectl logs troubleshooting techniques
1) Purpose of kubectl logs
kubectl logs shows the log output of a specified container in a Pod.
It is typically used to inspect service logs during troubleshooting.
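Commonly used forms (the pod and container names in angle brackets are placeholders):

```shell
# Logs of a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Follow the log output in real time
kubectl logs -f <pod-name> -c <container-name>

# Only the most recent 20 lines
kubectl logs <pod-name> --tail=20

# Logs of the previous (terminated) instance of the container
kubectl logs <pod-name> -c <container-name> --previous
```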
2) Hands-on example
1. Write the resource manifest and create the resource
[root@master231 pods]# cat 04-troubleshooting-logs.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-logs
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
[root@master231 pods]# kubectl apply -f 04-troubleshooting-logs.yaml
pod/troubleshooting-logs created
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-logs 1/2 Error 1 (8s ago) 12s 10.100.2.13 worker233 <none> <none>
2. Inspect the detailed information
[root@master231 pods]# kubectl describe po troubleshooting-logs
Name: troubleshooting-logs
Namespace: default
...
Containers:
c1:
...
State: Running
Started: Mon, 01 Dec 2025 09:08:16 +0800
...
c2:
...
State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 01 Dec 2025 09:08:34 +0800
Finished: Mon, 01 Dec 2025 09:08:37 +0800
...
Conditions:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33s default-scheduler Successfully assigned default/troubleshooting-logs to worker232
Normal Pulled 33s kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
Normal Created 32s kubelet Created container c1
Normal Started 32s kubelet Started container c1
Normal Pulled 14s (x3 over 32s) kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
Normal Created 14s (x3 over 32s) kubelet Created container c2
Normal Started 14s (x3 over 32s) kubelet Started container c2
Warning BackOff 11s (x2 over 26s) kubelet Back-off restarting failed container
3. Problem analysis
Using the first of the 'three axes' (kubectl describe), the Containers field shows that the c2 container is in an abnormal state. The Events section only shows that the container keeps restarting; it does not reveal the underlying cause.
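When a container is crash-looping, the current instance may have been restarted by the time you look; the `--previous` (short form `-p`) flag retrieves the log of the last terminated instance, which usually contains the actual error:

```shell
# Logs of the last terminated instance of container c2
kubectl logs troubleshooting-logs -c c2 --previous
```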
4. View the logs of the specified container
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-logs 1/2 CrashLoopBackOff 5 (2m10s ago) 5m23s 10.100.1.10 worker232 <none> <none>
[root@master231 pods]# kubectl logs troubleshooting-logs -c c1
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2025/12/01 01:08:16 [notice] 1#1: using the "epoll" event method
2025/12/01 01:08:16 [notice] 1#1: nginx/1.20.1
2025/12/01 01:08:16 [notice] 1#1: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1)
2025/12/01 01:08:16 [notice] 1#1: OS: Linux 5.15.0-119-generic
2025/12/01 01:08:16 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 524288:524288
2025/12/01 01:08:16 [notice] 1#1: start worker processes
2025/12/01 01:08:16 [notice] 1#1: start worker process 33
2025/12/01 01:08:16 [notice] 1#1: start worker process 34
[root@master231 pods]# kubectl logs -c c2 -f troubleshooting-logs # -f follows the log in real time; the pod name may come after the flags
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: bind() to [::]:80 failed (98: Address in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address in use)
2025/12/01 01:11:25 [notice] 1#1: try again to bind() after 500ms
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: bind() to [::]:80 failed (98: Address in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address in use)
2025/12/01 01:11:25 [notice] 1#1: try again to bind() after 500ms
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()
5. Error analysis
The c1 container's logs are normal; the c2 container fails because it cannot bind to port 80.
6. Solutions:
- 1. Containers in the same Pod share a network namespace, so their ports must not conflict; avoid placing applications that listen on the same port in one Pod;
- 2. Change the container's listening port;
- 3. As a temporary workaround, start the container with a placeholder command, exec into it, and start the service manually; later, use other Kubernetes resources to solve the problem properly;
7. Clean up the environment
[root@master231 pods]# kubectl delete -f 04-troubleshooting-logs.yaml
pod "troubleshooting-logs" deleted
1.3 Troubleshooting by modifying the container startup command
1) Modifying the container's startup command
- In practice, a container may fail to start at all. We can first bring it up with a placeholder command, then exec into it to troubleshoot.
- The specified command must exist inside the container image; otherwise the container still will not start.
2) Failure case
1. Demonstration: the specified command does not exist
[root@master231 pods]# cat 05-troubleshooting-command-args-exec.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-command-args-exec
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
command:
- tail11111111111111111111111111
args:
- -f
- /etc/hosts
[root@master231 pods]#
[root@master231 pods]# kubectl apply -f 05-troubleshooting-command-args-exec.yaml
pod/troubleshooting-command-args-exec created
[root@master231 pods]#
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 1/2 RunContainerError 1 (2s ago) 3s 10.100.2.8 worker233 <none> <none>
[root@master231 pods]#
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 1/2 RunContainerError 1 (5s ago) 6s 10.100.2.8 worker233 <none> <none>
[root@master231 pods]#
[root@master231 pods]# kubectl describe po troubleshooting-command-args-exec
Name: troubleshooting-command-args-exec
Namespace: default
Priority: 0
Node: worker233/10.0.0.233
Start Time: Mon, 01 Dec 2025 09:23:36 +0800
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.100.2.8
IPs:
IP: 10.100.2.8
Containers:
c1:
...
State: Running
Started: Mon, 01 Dec 2025 09:23:36 +0800
...
c2:
...
Command:
tail11111111111111111111111111 # this command does not exist in the container
Args:
-f
/etc/hosts
State: Waiting
Reason: RunContainerError
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "tail11111111111111111111111111": executable file not found in $PATH: unknown
Exit Code: 127
Started: Mon, 01 Dec 2025 09:23:37 +0800
Finished: Mon, 01 Dec 2025 09:23:37 +0800
...
Conditions:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned default/troubleshooting-command-args-exec to worker233
Normal Pulled 11s kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
Normal Created 11s kubelet Created container c1
Normal Started 11s kubelet Started container c1
Normal Pulled 10s (x2 over 11s) kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
Normal Created 10s (x2 over 11s) kubelet Created container c2
Warning Failed 10s (x2 over 10s) kubelet Error: failed to start container "c2": Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "tail11111111111111111111111111": executable file not found in $PATH: unknown
Warning BackOff 9s kubelet Back-off restarting failed container
[root@master231 pods]#
[root@master231 pods]# kubectl delete -f 05-troubleshooting-command-args-exec.yaml
pod "troubleshooting-command-args-exec" deleted
[root@master231 pods]#
2. Start the container first, then start the nginx service inside it later
[root@master231 pods]# cat 06-troubleshooting-command-args-exec.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-command-args-exec
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
command:
- tail
args:
- -f
- /etc/hosts
[root@master231 pods]#
[root@master231 pods]# kubectl apply -f 06-troubleshooting-command-args-exec.yaml
pod/troubleshooting-command-args-exec created
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 2/2 Running 0 3s 10.100.1.11 worker232 <none> <none>
[root@master231 pods]#
3. Check the containers' startup commands
[root@master231 pods]# kubectl exec -c c1 troubleshooting-command-args-exec -- ps -ef
PID USER TIME COMMAND
1 root 0:00 nginx: master process nginx -g daemon off;
32 nginx 0:00 nginx: worker process
33 nginx 0:00 nginx: worker process
34 root 0:00 ps -ef
[root@master231 pods]#
[root@master231 pods]# kubectl exec -c c2 troubleshooting-command-args-exec -- ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
6 root 0:00 ps -ef
[root@master231 pods]#
4. Exec into the c2 container and start nginx
[root@master231 pods]# kubectl exec -c c2 -it troubleshooting-command-args-exec -- sh
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 :::80 :::* LISTEN -
/ #
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
12 root 0:00 sh
19 root 0:00 ps -ef
/ #
/ # nginx # at this step we discover the port conflict
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: still could not bind()
nginx: [emerg] still could not bind()
/ #
/ # cat /etc/nginx/nginx.conf # inspect the nginx configuration file
....
http {
include /etc/nginx/mime.types;
...
include /etc/nginx/conf.d/*.conf; # the configuration files to be loaded
}
/ #
/ # ls /etc/nginx/conf.d/*.conf # found the active configuration file
/etc/nginx/conf.d/default.conf
/ #
/ # grep listen /etc/nginx/conf.d/default.conf # confirm it is indeed listening on port 80
listen 80;
# proxy the PHP scripts to Apache listening on 127.0.0.1:80
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ #
/ # sed -i '/listen/s#80#81#g' /etc/nginx/conf.d/default.conf # change the default port
/ #
/ # grep listen /etc/nginx/conf.d/default.conf # the change took effect
listen 81;
# proxy the PHP scripts to Apache listening on 127.0.0.1:81
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ #
/ # nginx # start nginx again; this time it starts successfully
2025/12/01 01:31:03 [notice] 28#28: using the "epoll" event method
2025/12/01 01:31:03 [notice] 28#28: nginx/1.20.1
2025/12/01 01:31:03 [notice] 28#28: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1)
2025/12/01 01:31:03 [notice] 28#28: OS: Linux 5.15.0-119-generic
2025/12/01 01:31:03 [notice] 28#28: getrlimit(RLIMIT_NOFILE): 524288:524288
/ # 2025/12/01 01:31:03 [notice] 29#29: start worker processes
2025/12/01 01:31:03 [notice] 29#29: start worker process 30
2025/12/01 01:31:03 [notice] 29#29: start worker process 31
/ #
/ # netstat -untalp # check the listening ports
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:81 0.0.0.0:* LISTEN 29/nginx: master pr
tcp 0 0 :::80 :::* LISTEN -
/ #
5. Access test
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 2/2 Running 0 6m21s 10.100.1.11 worker232 <none> <none>
[root@master231 pods]#
[root@master231 pods]# curl 10.100.1.11:80
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>yinzhengjie apps v1</title>
<style>
div img {
width: 900px;
height: 600px;
margin: 0;
}
</style>
</head>
<body>
<h1 style="color: green">凡人修仙传 v1 </h1>
<div>
<img src="1.jpg">
<div>
</body>
</html>
[root@master231 pods]#
[root@master231 pods]# curl 10.100.1.11:81
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>yinzhengjie apps v2</title>
<style>
div img {
width: 900px;
height: 600px;
margin: 0;
}
</style>
</head>
<body>
<h1 style="color: red">凡人修仙传 v2 </h1>
<div>
<img src="2.jpg">
<div>
</body>
</html>
[root@master231 pods]#
6. Delete the resources
[root@master231 pods]# kubectl delete -f 06-troubleshooting-command-args-exec.yaml
pod "troubleshooting-command-args-exec" deleted
[root@master231 pods]#
Summary
- The 'three axes' of troubleshooting:
- kubectl describe # event information
- kubectl logs # container logs
- kubectl exec -> command & args
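The three steps chained together for an arbitrary failing pod (names in angle brackets are placeholders):

```shell
# 1. Check the resource's status and events
kubectl describe pod <pod-name>

# 2. Check container logs (add --previous if the container already restarted)
kubectl logs <pod-name> -c <container-name>

# 3. If the container cannot start, override its command in the manifest with a
#    placeholder (e.g. tail -f /etc/hosts), then debug interactively:
kubectl exec -it <pod-name> -c <container-name> -- sh
```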