hashicorp/raft

New snapshots are being deleted, leading to loss of data

Closed this issue · 1 comment

List of snapshot dirs before snapshot reaping kicked in:

1-7-1585811459919
2-4-1585769971269
2-5-1585770067955

1-7-1585811459919 contains the snapshot with the latest changes.

Log line for the snapshot being reaped:

2020/04/02 12:40:59 [INFO] snapshot: reaping snapshot /Users/moyukh/node0/snapshots/1-7-1585811459919

This happens after a restart. The term starts again at 1, so the latest snapshots are pushed to the end of the slice by the reverse sort performed before snapshot reaping, and they are the ones deleted.
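The sort behavior described above can be sketched as follows. This is a minimal reconstruction, not the library's actual code: it assumes the file snapshot store parses directory names of the form term-index-timestamp and orders snapshots newest-first by term, then index, then timestamp, before reaping everything past the retain count. The helper name reapOrder is made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// reapOrder sorts snapshot directory names of the form
// "term-index-timestamp" newest-first by (term, index, timestamp),
// mirroring the reverse sort performed before snapshot reaping.
func reapOrder(names []string) []string {
	type snap struct {
		term, index, ts uint64
		name            string
	}
	snaps := make([]snap, 0, len(names))
	for _, n := range names {
		p := strings.SplitN(n, "-", 3)
		t, _ := strconv.ParseUint(p[0], 10, 64)
		i, _ := strconv.ParseUint(p[1], 10, 64)
		ts, _ := strconv.ParseUint(p[2], 10, 64)
		snaps = append(snaps, snap{t, i, ts, n})
	}
	sort.Slice(snaps, func(a, b int) bool {
		if snaps[a].term != snaps[b].term {
			return snaps[a].term > snaps[b].term
		}
		if snaps[a].index != snaps[b].index {
			return snaps[a].index > snaps[b].index
		}
		return snaps[a].ts > snaps[b].ts
	})
	out := make([]string, len(snaps))
	for i, s := range snaps {
		out[i] = s.name
	}
	return out
}

func main() {
	order := reapOrder([]string{
		"1-7-1585811459919", // newest data, but term reset to 1 after restart
		"2-4-1585769971269",
		"2-5-1585770067955",
	})
	// The term-1 snapshot sorts last despite holding the newest data,
	// so it falls past the retain count and is the one reaped.
	fmt.Println(order)
}
```

With the three directory names from this issue, the term-1 snapshot ends up last in the sorted slice even though its timestamp is the most recent, which matches the reaping log above.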

I am losing the latest updates, and data is being restored to old versions from the stale snapshot dir.

Sharing the raft instantiation code below:

// Setup Raft configuration.
config := raft.DefaultConfig()
config.LocalID = raft.ServerID(localID)
config.SnapshotInterval = time.Second * 15
config.SnapshotThreshold = 1

// Setup Raft communication.
addr, err := net.ResolveTCPAddr("tcp", s.RaftBind)
if err != nil {
	return err
}
transport, err := raft.NewTCPTransport(s.RaftBind, addr, 3, 10*time.Second, os.Stderr)
if err != nil {
	return err
}

// Create the snapshot store. This allows the Raft to truncate the log.
snapshots, err := raft.NewFileSnapshotStore(s.RaftDir, retainSnapshotCount, os.Stderr)
if err != nil {
	return fmt.Errorf("file snapshot store: %s", err)
}

// Create the log store and stable store.
logStore := raft.NewInmemStore()
stableStore := raft.NewInmemStore()

logCache, err := raft.NewLogCache(512, logStore)
if err != nil {
	return fmt.Errorf("log cache error: %s", err)
}

// Instantiate the Raft systems.
ra, err := raft.NewRaft(config, (*fsm)(s), logCache, stableStore, snapshots, transport)
if err != nil {
	return fmt.Errorf("new raft: %s", err)
}

Hi @moyukhbera - I've noticed in your snippet that you are using in-memory storage for the stable/log stores, which guarantees data loss on restart or termination. It is meant for unit testing only, not for production use: https://pkg.go.dev/github.com/hashicorp/raft?tab=doc#InmemStore .

We'd recommend using boltdb storage: https://github.com/hashicorp/raft-boltdb (used by consul and nomad) which is meant to be durable and ensures no data loss. Upon restart, raft will ensure that the Term will start higher than previous values and the cluster will not be susceptible to the case you hit here. Also, using a disk stable storage will ensure that data since the last snapshot isn't lost either.