1.1 kubectl describe troubleshooting techniques
1) Purpose of kubectl describe
kubectl describe shows a resource's detailed information and running state.
We can use the resource's status and event information to determine the cause of a problem.
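Common invocations, for reference (the pod, node, and namespace names in angle brackets are placeholders, not values from this lab):

```shell
# Describe a specific pod in the current namespace
kubectl describe pod <pod-name>

# Describe a pod in another namespace
kubectl describe pod <pod-name> -n <namespace>

# Describe every resource defined in a manifest file
kubectl describe -f <manifest.yaml>

# describe also works for other resource kinds, e.g. nodes
kubectl describe node <node-name>
```

The Events section at the bottom of the output is usually the fastest path to the root cause.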
2) Hands-on example
1. Write the resource manifest and create it
[root@master231 pods]# cat 03-pods-troubleshooting-describe.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-describe
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v11111111111111111111111111111
[root@master231 pods]# kubectl apply -f 03-pods-troubleshooting-describe.yaml
pod/troubleshooting-describe created
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-describe 0/1 ErrImagePull 0 6s 10.100.1.7 worker232 <none> <none>
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-describe 0/1 ImagePullBackOff 0 18s 10.100.1.7 worker232 <none> <none>
2. Inspect the error information
[root@master231 pods]# kubectl describe pod troubleshooting-describe
Name: troubleshooting-describe
Namespace: default
Priority: 0
Node: worker232/10.0.0.232
...
Events: # check the event information
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 69s default-scheduler Successfully assigned default/troubleshooting-describe to worker232
Normal Pulling 25s (x3 over 68s) kubelet Pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111"
Warning Failed 24s (x3 over 68s) kubelet Failed to pull image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111": rpc error: code = Unknown desc = Error response from daemon:
manifest for registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111 not found: manifest unknown: manifest unknown
Warning Failed 24s (x3 over 68s) kubelet Error: ErrImagePull
Normal BackOff 12s (x3 over 67s) kubelet Back-off pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v111111111111111"
Warning Failed 12s (x3 over 67s) kubelet Error: ImagePullBackOff
3. Error analysis
From the 'Events' information above, it is clear the failure was caused by a failed image pull.
Troubleshooting approach:
- 1. The image name or tag may be misspelled; check the image reference;
- 2. Pulling from a private registry without credentials also produces this error; check whether authentication is required;
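For the private-registry case, credentials can be supplied through an image pull Secret. A minimal sketch, assuming a hypothetical registry address, account, and Secret name (harbor.example.com, jenny, my-registry-cred):

```shell
# Create a docker-registry type Secret holding the registry credentials
kubectl create secret docker-registry my-registry-cred \
  --docker-server=harbor.example.com \
  --docker-username=jenny \
  --docker-password='<password>'
```

The Pod then references it under spec.imagePullSecrets (an entry of `- name: my-registry-cred`), and the kubelet uses those credentials when pulling the image.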
4. Clean up the environment
[root@master231 pods]# kubectl delete -f 03-pods-troubleshooting-describe.yaml
pod "troubleshooting-describe" deleted
1.2 kubectl logs troubleshooting techniques
1) Purpose of kubectl logs
kubectl logs shows the log output of a specified container in a Pod.
It is typically used to inspect service logs during troubleshooting.
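Commonly used forms (the pod and container names in angle brackets are placeholders):

```shell
# Logs of a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Follow the log output in real time
kubectl logs -f <pod-name> -c <container-name>

# Only the most recent 20 lines
kubectl logs <pod-name> --tail=20

# Logs of the previous (terminated) instance of the container
kubectl logs <pod-name> -c <container-name> --previous
```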
2) Hands-on example
1. Write the resource manifest and create the resource
[root@master231 pods]# cat 04-troubleshooting-logs.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-logs
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
[root@master231 pods]# kubectl apply -f 04-troubleshooting-logs.yaml
pod/troubleshooting-logs created
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-logs 1/2 Error 1 (8s ago) 12s 10.100.2.13 worker233 <none> <none>
2. Inspect the detailed information
[root@master231 pods]# kubectl describe po troubleshooting-logs
Name: troubleshooting-logs
Namespace: default
...
Containers:
c1:
...
State: Running
Started: Mon, 01 Dec 2025 09:08:16 +0800
...
c2:
...
State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 01 Dec 2025 09:08:34 +0800
Finished: Mon, 01 Dec 2025 09:08:37 +0800
...
Conditions:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33s default-scheduler Successfully assigned default/troubleshooting-logs to worker232
Normal Pulled 33s kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
Normal Created 32s kubelet Created container c1
Normal Started 32s kubelet Started container c1
Normal Pulled 14s (x3 over 32s) kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
Normal Created 14s (x3 over 32s) kubelet Created container c2
Normal Started 14s (x3 over 32s) kubelet Started container c2
Warning BackOff 11s (x2 over 26s) kubelet Back-off restarting failed container
3. Problem analysis
Using the first of the 'three axes' (kubectl describe), the Containers field shows that the c2 container is in an abnormal state. The Events section only shows that the container keeps restarting; it does not reveal the underlying cause.
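When a container is crash-looping, the current instance may have been restarted by the time you look; the `--previous` (short form `-p`) flag retrieves the log of the last terminated instance, which usually contains the actual error:

```shell
# Logs of the last terminated instance of container c2
kubectl logs troubleshooting-logs -c c2 --previous
```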
4. View the logs of the specified container
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-logs 1/2 CrashLoopBackOff 5 (2m10s ago) 5m23s 10.100.1.10 worker232 <none> <none>
[root@master231 pods]# kubectl logs troubleshooting-logs -c c1
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2025/12/01 01:08:16 [notice] 1#1: using the "epoll" event method
2025/12/01 01:08:16 [notice] 1#1: nginx/1.20.1
2025/12/01 01:08:16 [notice] 1#1: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1)
2025/12/01 01:08:16 [notice] 1#1: OS: Linux 5.15.0-119-generic
2025/12/01 01:08:16 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 524288:524288
2025/12/01 01:08:16 [notice] 1#1: start worker processes
2025/12/01 01:08:16 [notice] 1#1: start worker process 33
2025/12/01 01:08:16 [notice] 1#1: start worker process 34
[root@master231 pods]# kubectl logs -c c2 -f troubleshooting-logs # -f follows the log in real time; the pod name may come after the flags
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: bind() to [::]:80 failed (98: Address in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address in use)
2025/12/01 01:11:25 [notice] 1#1: try again to bind() after 500ms
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: bind() to [::]:80 failed (98: Address in use)
nginx: [emerg] bind() to [::]:80 failed (98: Address in use)
2025/12/01 01:11:25 [notice] 1#1: try again to bind() after 500ms
2025/12/01 01:11:25 [emerg] 1#1: bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:11:25 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()
5. Error analysis
The c1 container's logs are normal; the c2 container fails because it cannot bind to port 80.
6. Solutions:
- 1. Containers in the same Pod share a network namespace, so their ports must not conflict; avoid placing applications that listen on the same port in one Pod;
- 2. Change the container's listening port;
- 3. As a temporary workaround, start the container with a placeholder command, exec into it, and start the service manually; later, use other Kubernetes resources to solve the problem properly;
7. Clean up the environment
[root@master231 pods]# kubectl delete -f 04-troubleshooting-logs.yaml
pod "troubleshooting-logs" deleted
1.3 Troubleshooting by modifying the container startup command
1) Modifying the container's startup command
- In practice, a container may fail to start at all. We can first bring it up with a placeholder command, then exec into it to troubleshoot.
- The specified command must exist inside the container image; otherwise the container still will not start.
2) Failure case
1. Demonstration: the specified command does not exist
[root@master231 pods]# cat 05-troubleshooting-command-args-exec.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-command-args-exec
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
command:
- tail11111111111111111111111111
args:
- -f
- /etc/hosts
[root@master231 pods]#
[root@master231 pods]# kubectl apply -f 05-troubleshooting-command-args-exec.yaml
pod/troubleshooting-command-args-exec created
[root@master231 pods]#
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 1/2 RunContainerError 1 (2s ago) 3s 10.100.2.8 worker233 <none> <none>
[root@master231 pods]#
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 1/2 RunContainerError 1 (5s ago) 6s 10.100.2.8 worker233 <none> <none>
[root@master231 pods]#
[root@master231 pods]# kubectl describe po troubleshooting-command-args-exec
Name: troubleshooting-command-args-exec
Namespace: default
Priority: 0
Node: worker233/10.0.0.233
Start Time: Mon, 01 Dec 2025 09:23:36 +0800
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.100.2.8
IPs:
IP: 10.100.2.8
Containers:
c1:
...
State: Running
Started: Mon, 01 Dec 2025 09:23:36 +0800
...
c2:
...
Command:
tail11111111111111111111111111 # this command does not exist in the container
Args:
-f
/etc/hosts
State: Waiting
Reason: RunContainerError
Last State: Terminated
Reason: ContainerCannotRun
Message: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "tail11111111111111111111111111": executable file not found in $PATH: unknown
Exit Code: 127
Started: Mon, 01 Dec 2025 09:23:37 +0800
Finished: Mon, 01 Dec 2025 09:23:37 +0800
...
Conditions:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned default/troubleshooting-command-args-exec to worker233
Normal Pulled 11s kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
Normal Created 11s kubelet Created container c1
Normal Started 11s kubelet Started container c1
Normal Pulled 10s (x2 over 11s) kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
Normal Created 10s (x2 over 11s) kubelet Created container c2
Warning Failed 10s (x2 over 10s) kubelet Error: failed to start container "c2": Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "tail11111111111111111111111111": executable file not found in $PATH: unknown
Warning BackOff 9s kubelet Back-off restarting failed container
[root@master231 pods]#
[root@master231 pods]# kubectl delete -f 05-troubleshooting-command-args-exec.yaml
pod "troubleshooting-command-args-exec" deleted
[root@master231 pods]#
2. Start the container first, then start the nginx service inside it later
[root@master231 pods]# cat 06-troubleshooting-command-args-exec.yaml
apiVersion: v1
kind: Pod
metadata:
name: troubleshooting-command-args-exec
spec:
containers:
- name: c1
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
- name: c2
image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
command:
- tail
args:
- -f
- /etc/hosts
[root@master231 pods]#
[root@master231 pods]# kubectl apply -f 06-troubleshooting-command-args-exec.yaml
pod/troubleshooting-command-args-exec created
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 2/2 Running 0 3s 10.100.1.11 worker232 <none> <none>
[root@master231 pods]#
3. Check the containers' startup commands
[root@master231 pods]# kubectl exec -c c1 troubleshooting-command-args-exec -- ps -ef
PID USER TIME COMMAND
1 root 0:00 nginx: master process nginx -g daemon off;
32 nginx 0:00 nginx: worker process
33 nginx 0:00 nginx: worker process
34 root 0:00 ps -ef
[root@master231 pods]#
[root@master231 pods]# kubectl exec -c c2 troubleshooting-command-args-exec -- ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
6 root 0:00 ps -ef
[root@master231 pods]#
4. Exec into the c2 container and start nginx
[root@master231 pods]# kubectl exec -c c2 -it troubleshooting-command-args-exec -- sh
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 :::80 :::* LISTEN -
/ #
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
12 root 0:00 sh
19 root 0:00 ps -ef
/ #
/ # nginx # at this step we discover the port conflict
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: bind() to 0.0.0.0:80 failed (98: Address in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address in use)
2025/12/01 01:28:52 [notice] 20#20: try again to bind() after 500ms
2025/12/01 01:28:52 [emerg] 20#20: still could not bind()
nginx: [emerg] still could not bind()
/ #
/ # cat /etc/nginx/nginx.conf # inspect the nginx configuration file
....
http {
include /etc/nginx/mime.types;
...
include /etc/nginx/conf.d/*.conf; # the configuration files to be loaded
}
/ #
/ # ls /etc/nginx/conf.d/*.conf # found the active configuration file
/etc/nginx/conf.d/default.conf
/ #
/ # grep listen /etc/nginx/conf.d/default.conf # confirm it is indeed listening on port 80
listen 80;
# proxy the PHP scripts to Apache listening on 127.0.0.1:80
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ #
/ # sed -i '/listen/s#80#81#g' /etc/nginx/conf.d/default.conf # change the default port
/ #
/ # grep listen /etc/nginx/conf.d/default.conf # the change took effect
listen 81;
# proxy the PHP scripts to Apache listening on 127.0.0.1:81
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ #
/ # nginx # start nginx again; this time it starts successfully
2025/12/01 01:31:03 [notice] 28#28: using the "epoll" event method
2025/12/01 01:31:03 [notice] 28#28: nginx/1.20.1
2025/12/01 01:31:03 [notice] 28#28: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1)
2025/12/01 01:31:03 [notice] 28#28: OS: Linux 5.15.0-119-generic
2025/12/01 01:31:03 [notice] 28#28: getrlimit(RLIMIT_NOFILE): 524288:524288
/ # 2025/12/01 01:31:03 [notice] 29#29: start worker processes
2025/12/01 01:31:03 [notice] 29#29: start worker process 30
2025/12/01 01:31:03 [notice] 29#29: start worker process 31
/ #
/ # netstat -untalp # check the listening ports
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:81 0.0.0.0:* LISTEN 29/nginx: master pr
tcp 0 0 :::80 :::* LISTEN -
/ #
5. Access test
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
troubleshooting-command-args-exec 2/2 Running 0 6m21s 10.100.1.11 worker232 <none> <none>
[root@master231 pods]#
[root@master231 pods]# curl 10.100.1.11:80
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>yinzhengjie apps v1</title>
<style>
div img {
width: 900px;
height: 600px;
margin: 0;
}
</style>
</head>
<body>
<h1 style="color: green">凡人修仙传 v1 </h1>
<div>
<img src="1.jpg">
<div>
</body>
</html>
[root@master231 pods]#
[root@master231 pods]# curl 10.100.1.11:81
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>yinzhengjie apps v2</title>
<style>
div img {
width: 900px;
height: 600px;
margin: 0;
}
</style>
</head>
<body>
<h1 style="color: red">凡人修仙传 v2 </h1>
<div>
<img src="2.jpg">
<div>
</body>
</html>
[root@master231 pods]#
6. Delete the resources
[root@master231 pods]# kubectl delete -f 06-troubleshooting-command-args-exec.yaml
pod "troubleshooting-command-args-exec" deleted
[root@master231 pods]#
Summary
- The 'three axes' of troubleshooting:
- kubectl describe # event information
- kubectl logs # container logs
- kubectl exec -> command & args
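The three steps chained together for an arbitrary failing pod (names in angle brackets are placeholders):

```shell
# 1. Check the resource's status and events
kubectl describe pod <pod-name>

# 2. Check container logs (add --previous if the container already restarted)
kubectl logs <pod-name> -c <container-name>

# 3. If the container cannot start, override its command in the manifest with a
#    placeholder (e.g. tail -f /etc/hosts), then debug interactively:
kubectl exec -it <pod-name> -c <container-name> -- sh
```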