System Administration: April 2008 Archives

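A machine had many NFS partitions mounted, and one of the NFS servers went down (a hardware problem, so it could not be brought back up quickly). As a result, several df processes hung and could not be killed, not even with kill -9. The process state at the time was: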

[jianingy(0)@xxxxxx ~]$ ps ax -o pid,wchan,s,command | grep df$
3505 rpc_ex D df
3844 rpc_ex D df
4162 rpc_ex D df

[jianingy(0)@xxxxxx ~]$ pstree
init─┬─acpid
    ├─agetty
    ├─atd
    ├─crond
    ├─dbus-daemon-1
    ├─3*[df]
...

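All of the df processes were waiting in rpc_execute (shown truncated as rpc_ex in the wchan column). They were in uninterruptible sleep (the D state in ps), so they would not handle any signal. After a lot of searching, I found that the data rpc_execute waits for is supplied by rpciod. So running killall -KILL rpciod terminates the rpc_execute call, and rpciod restarts itself after being killed.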
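To spot such processes quickly, the ps output can be filtered on the state column. This is a minimal sketch, not something from the original incident: the function name filter_dstate is made up for illustration, and it assumes input lines with the same column order as the ps command above (pid, wchan, state, command).

```shell
# Keep only lines whose third column (process state) is "D",
# i.e. processes in uninterruptible sleep.
# Expected input columns: pid wchan state command
filter_dstate() {
    awk '$3 == "D"'
}

# Possible use on a live system (the trailing "=" suppresses the header):
#   ps ax -o pid=,wchan=,s=,comm= | filter_dstate
```

Processes it reports with a wchan starting with rpc_ are the ones stuck waiting on an NFS server, like the df processes above.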

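This situation usually arises because NFS was mounted with the default options, in which case the NFS client keeps retrying the server indefinitely when the connection fails. Mounting with the intr option avoids this kind of problem. Here is an excerpt from the description of the NFS mount options: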

       soft           If  an  NFS  file operation has a major timeout then report an I/O error to the calling
                      program.  The default is to continue retrying NFS file operations indefinitely.

       hard           If an NFS file operation has a major timeout then report "server not responding" on the
                      console and continue retrying indefinitely.  This is the default.

       intr           If an NFS file operation has a major timeout and it is hard mounted, then allow signals
                      to interrupt the file operation and cause it to return EINTR  to  the  calling  program.
                      The default is to not allow file operations to be interrupted.
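Based on the options above, one way to check a machine for mounts at risk is to scan the mount table for NFS entries that use neither intr nor soft. The following is only a sketch under stated assumptions: the function name nfs_mounts_without_intr is hypothetical, the input is assumed to be in /proc/mounts format (device, mount point, filesystem type, options), and the kernel is assumed to list intr among the reported options when it is set.

```shell
# Read lines in /proc/mounts format and print the mount point of every
# NFS mount whose option list contains neither "intr" nor "soft".
nfs_mounts_without_intr() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        n = split($4, opts, ",")
        ok = 0
        for (i = 1; i <= n; i++)
            if (opts[i] == "intr" || opts[i] == "soft") ok = 1
        if (!ok) print $2
    }'
}

# Possible use on a live system:
#   nfs_mounts_without_intr < /proc/mounts
```

Any mount point it prints could then be remounted with options such as -o hard,intr.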
