happyfish100/FastCFS

serverd.pid under /opt/fastcfs/fstore/ goes missing

Opened this issue · 9 comments

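As the title says: in a three-replica setup, faststore on nodes 01 and 02 suddenly went offline. The logs show the cause is that the file /opt/fastcfs/fstore/serverd.pid cannot be found. If I create it manually, the restart still fails and the file is deleted again. How can this be resolved?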

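Do not mix the two startup methods: starting through systemd and starting directly from the command line.
Also, check what errors the faststore server log reports.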

[2024-03-12 17:08:29] ERROR - file: connection_pool.c, line: 227, connect to fstore server 192.168.3.51:21014 fail, errno: 111, error info: Connection refused
[2024-03-12 17:08:29] WARNING - file: cluster_relationship.c, line: 1304, round 1th select leader, alive server count: 2 < server count: 3, try again after 1 seconds.
[2024-03-12 17:08:29] INFO - file: cluster_relationship.c, line: 998, the leader server id: 3, ip 192.168.3.52:21014, retry count: 1, time used: 740 ms
[2024-03-12 17:08:30] INFO - file: cluster_relationship.c, line: 1318, abort election because the leader exists, leader id: 3, ip 192.168.3.52:21014, election time used: 1s
[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused
[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused
[2024-03-12 17:08:30] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 3, 192.168.3.52:21015 successfully
[2024-03-12 17:08:30] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 3, 192.168.3.52:21015 successfully
[2024-03-12 17:08:31] INFO - file: cluster_relationship.c, line: 2251, connect to leader id: 3, 192.168.3.52:21014 successfully
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 6, slave id: 1, the replica connection NOT established!
[2024-03-12 17:08:33] WARNING - file: recovery/binlog_fetch.c, line: 551, data group id: 6, waiting count: 0, result: 16, time used: 0 ms
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 3, slave id: 1, the replica connection NOT established!
[2024-03-12 17:08:33] WARNING - file: recovery/binlog_fetch.c, line: 551, data group id: 3, waiting count: 0, result: 16, time used: 1 ms
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 8, slave id: 1, the replica connection NOT established!
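These are the errors from fs_serverd.log. I tried starting from the command line and it still fails.
I have confirmed that firewalld and SELinux are both disabled and that the key has been added. fdir is working normally.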

Judging from the log, one fstore server is not running.
Run ps and check whether the fs_serverd process exists.

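This is a three-replica cluster. Node 03 is healthy; the other two start and then kill themselves after a few tens of seconds. The commands were run at the same time. Could the cause be that the stored data is out of sync? I tried to recover the data with /usr/bin/fdir_serverd --data-rebuild /data/storage1,/data/storage2 /etc/fastcfs/fdir/server.conf restart, but it reported that this option no longer exists.
192.168.3.52 is the healthy node; 2.50 and 2.51 are the nodes that cannot start.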
[2024-03-12 17:42:20] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 2, 192.168.3.51:21015 successfully after 3 retries
[2024-03-12 17:42:21] INFO - file: replica_handler.c, line: 929, replication peer id: 3, 192.168.3.52:21015 join in
[2024-03-12 17:42:21] INFO - file: replica_handler.c, line: 929, replication peer id: 3, 192.168.3.52:21015 join in
[2024-03-12 17:42:21] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage2/0003/000007" fail, errno: 2, error info: No such file or directory
[2024-03-12 17:42:21] WARNING - file: /usr/include/sf/sf_func.h, line: 42, kill myself from caller {file: dio/trunk_write_thread.c, line: 722, func: batch_write}
[2024-03-12 17:42:21] CRIT - file: sf_service.c, line: 710, catch signal 3, program exiting...
[2024-03-12 17:42:22] WARNING - file: recovery/binlog_replay.c, line: 561, data group id: 132, is_online: 0, block {oid: 9007199257741014, offset: 0}, slice {offset: 32, length: 131040}, read bytes: 65504 != slice length, maybe delete later?
[2024-03-12 17:42:22] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage1/0002/000005" fail, errno: 2, error info: No such file or directory
[2024-03-12 17:42:22] WARNING - file: /usr/include/sf/sf_func.h, line: 42, kill myself from caller {file: dio/trunk_write_thread.c, line: 722, func: batch_write}
[2024-03-12 17:42:23] INFO - file: fs_serverd.c, line: 483, program exit normally.
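In the log above, each "kill myself" warning follows an ERROR line where opening a trunk file failed with errno 2. As a minimal sketch for collecting those paths from a larger fs_serverd.log, the Python below parses the line layout seen in this thread. The function name `find_missing_files` and both regular expressions are illustrative assumptions based only on the lines pasted here; they are not part of FastCFS.

```python
import re

# Header layout as seen in the pasted log:
# "[timestamp] LEVEL - file: <source>, line: <n>, <message>"
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] (?P<level>\w+) - file: (?P<file>[^,]+), '
    r'line: (?P<line>\d+), (?P<msg>.*)$'
)

# Message layout of the failed open() calls in the pasted log.
OPEN_FAIL_RE = re.compile(r'open file "(?P<path>[^"]+)" fail, errno: (?P<errno>\d+)')


def find_missing_files(log_text):
    """Return the sorted, de-duplicated paths whose open failed with errno 2 (ENOENT)."""
    missing = set()
    for raw in log_text.splitlines():
        header = LINE_RE.match(raw.strip())
        if header is None or header.group("level") != "ERROR":
            continue  # not an ERROR line in the expected layout
        failure = OPEN_FAIL_RE.search(header.group("msg"))
        if failure and failure.group("errno") == "2":
            missing.add(failure.group("path"))
    return sorted(missing)


if __name__ == "__main__":
    import sys

    # Usage: python find_missing.py /path/to/fs_serverd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for path in find_missing_files(fh.read()):
            print(path)
```

The resulting list only shows which files the server expected and could not open; it does not explain why they are absent.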

On 01 and 02, ps does not show the fs_serverd process. After I start fs_serverd, the process is killed automatically after a few tens of seconds.

Check the Linux system log to see how fs_serverd was killed.
There are three possibilities:

  1. killed by systemd
  2. killed by Linux due to OOM
  3. fs_serverd coredump

You can search the system log for the keyword fs_serverd.
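A rough sketch of that search in Python: keep only the system log lines that mention fs_serverd, then sort each one into the three cases listed above. The patterns are assumptions modeled on typical kernel and systemd message wording, which varies by distribution and version, so treat an "unknown" result as a line to read by hand. The function name `classify` is hypothetical.

```python
import re

# Assumed message shapes; adjust them to what your system actually logs.
PATTERNS = [
    # Kernel OOM killer, e.g. "Out of memory: Killed process <pid> (fs_serverd)"
    ("oom", re.compile(r"Out of memory: Kill(?:ed)? process \d+ \(fs_serverd\)")),
    # Kernel segfault notice or a systemd-coredump report
    ("coredump", re.compile(r"fs_serverd\[\d+\]: segfault|\(fs_serverd\).*dumped core")),
    # systemd stopping the unit's process
    ("systemd", re.compile(r"Killing process \d+ \(fs_serverd\) with signal")),
]


def classify(line):
    """Return 'oom', 'coredump', 'systemd', 'unknown', or None if fs_serverd is not mentioned."""
    if "fs_serverd" not in line:
        return None
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: journalctl --no-pager | python classify_kills.py
    for line in sys.stdin:
        label = classify(line)
        if label is not None:
            print(f"{label}\t{line.rstrip()}")
```

The script reads from standard input so it works with whichever system log source the host uses.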

This is the reason for the kill that I found earlier in fs_serverd.log. Do you know why this happens? I could not find the file dio/trunk_write_thread.c that it mentions.

It is in the libdiskallocator library.