unit-io/unitdb

300%+ CPU load, go 1.17.2

guijun opened this issue · 10 comments

Single node.
1 client reconnected and sent some messages.
CPU load reaches 300% within 5-10 minutes.

This happens even if I disable all goroutines spawned in stats.go: New().

Please find the config below:
{
// Default HTTP(S) address:port to listen on for websocket. Either a
// numeric or a canonical name, e.g. ":80" or ":https". Could include a host name, e.g.
// "localhost:80".
// Could be blank: if TLS is not configured, will use ":80", otherwise ":443".
// Can be overridden from the command line, see option --listen.
"listen": ":6060",

// Default HTTP(S) address:port to listen on for grpc. Either a
// numeric or a canonical name, e.g. ":80" or ":https". Could include a host name, e.g.
// "localhost:80".
// Could be blank: if TLS is not configured, will use ":80", otherwise ":443".
// Can be overridden from the command line, see option --listen.
"grpc_listen": ":6080",

// Default logging level is "InfoLevel"; to enable the debug log, set "LogLevel" to "DebugLevel".
"logging_level": "Error",

// Maximum message size allowed from client in bytes (262144 = 256KB).
// Intended to prevent malicious clients from sending very large messages inband (does
// not affect out-of-band large files).
"max_message_size": 262144,

// Maximum number of subscribers per group topic.
"max_subscriber_count": 128,

// Encryption configuration
"encryption_config": {
    // chacha20poly1305 encryption key for client IDs and topic keys. 32 random bytes base64-encoded.
    // Generate your own and keep it secret (see the key-generation sketch after this config).
    "key": "4BWm1vZletvrCDGWsF6mex8oBSd59m6I",
    // Key identifier. It is useful when you use multiple keys.
    "identifier":"local",
    // The sealed flag tells if the key in the configuration is sealed.
    "sealed":false,
    // The timestamp is helpful to determine the latest key in case of key rollover.
    "timestamp":1522325758
},

// Cluster-mode configuration.
"cluster_config": {
	// Name of this node. Can be assigned from the command line.
	// Empty string disables clustering.
	"self": "",

	// List of available nodes.
	"nodes": [
		// Name and TCP address of every node in the cluster.
		{"name": "one", "addr":"localhost:12001"},
		{"name": "two", "addr":"localhost:12002"},
		{"name": "three", "addr":"localhost:12003"}
	],

	// Failover config.
	"failover": {
		// Failover is enabled.
		"enabled": true,
		// Time in milliseconds between heartbeats.
		"heartbeat": 100,
		// Initiate leader election when the leader is not available for this many heartbeats.
		"vote_after": 8,
		// Consider node failed when it missed this many heartbeats.
		"node_fail_after": 16
	}
},

// Database configuration
"store_config": {
	// reset message store on service restart 
	"reset": false,
	// Configurations of individual adapters.
	"adapters": {
		// unitdb configuration.
		"unitdb": {
			// Name of the database.
			"database": "unitdb",
			// Memdb message store size
			"mem_size": 500000000
		}
	}
}

}
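
For reference, here is a minimal standalone sketch (not from the unitdb codebase) of generating a key of the kind the encryption_config comment above asks for: 32 random bytes, base64-encoded.

package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	// 32 random bytes, as the encryption_config comment above suggests.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		log.Fatal(err)
	}
	// Base64-encode the raw bytes before pasting the value into "key".
	fmt.Println(base64.StdEncoding.EncodeToString(key))
}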

Unitdb uses the following default options in the options.go file, so you could adjust the unitdb options based on your hardware specifications. In a future release we will make these configurable in the server config file.

// WithDefaultOptions will open DB with some default values.
func WithDefaultOptions() Options {
	return newFuncOption(func(o *_Options) {
		if o.maxSyncDurations == 0 {
			o.maxSyncDurations = 1
		}
		if o.syncDurationType == 0 {
			o.syncDurationType = time.Second
		}
		if o.bufferSize == 0 {
			o.bufferSize = 1 << 32 // maximum size of a buffer to use in bufferpool (4GB).
		}
		if o.memdbSize == 0 {
			o.memdbSize = 1 << 33 // maximum size of blockcache (8GB).
		}
		if o.freeBlockSize == 0 {
			o.freeBlockSize = 1 << 27 // minimum size of a free block (128MB).
		}
	})
}

Note: the server uses memdb to store in-flight messages, and the unitdb database also uses memdb internally.
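
If you want to try smaller values before they become configurable, here is a rough sketch of how that could look when opening the database. The WithBufferSize and WithFreeBlockSize setters are the ones mentioned later in this thread; the Open signature and import path are assumptions and may differ in your unitdb version.

package main

import (
	"log"

	"github.com/unit-io/unitdb"
)

func main() {
	// Sketch only: assumes Open takes a path plus variadic Options.
	db, err := unitdb.Open("unitdb",
		unitdb.WithDefaultOptions(),
		unitdb.WithBufferSize(1<<26),    // ~64MB buffer pool instead of the 4GB default
		unitdb.WithFreeBlockSize(1<<24), // ~16MB free blocks instead of the 128MB default
	)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}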

I'll check if it will help.

I'm wondering if it's a heap issue. As I said, there is only one user and almost no messages are pushed.

There is no config item for them at this moment, right?
I'm testing with the following config:
"mem_size": 5000000,
"buffer_size": 3000000,
"freeblock_size": 2000000,
I added WithFreeBlockSize() and WithBufferSize().

I'm wondering why a single user (1 push/hour) causes 300% CPU. Is it because unitdb used all the heap? (There is 64GB of RAM in my MacBook.)
Is it true that unitdb flushes the whole DB to disk every minute?

After 1 hour (nothing published):
CPU load is 400%.
System free memory is 46GB.
unitdb has written 2.6MB and read 128KB.

So there must be some dead loop inside.

I disconnected the client and attached dlv to the process:
(dlv) bt
0  0x00000000014a001a in github.com/guijun/unitdb/memdb.(*_TinyLogManager).write
   at ./memdb/tiny_log.go:187
1  0x00000000014a001a in github.com/guijun/unitdb/memdb.(*_TinyLogManager).writeLoop
   at ./memdb/tiny_log.go:236
2  0x000000000149fc0a in github.com/guijun/unitdb/memdb.(*DB).newLogManager·dwrap·28
   at ./memdb/tiny_log.go:144
3  0x0000000001064d61 in runtime.goexit

Is this the issue?

go func() {
	done <- w.writeLog(timeID)
}()

If lots of such goroutines are spawned, won't they block each other?
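
To make the question concrete, here is a small standalone toy (none of this is unitdb code; writeLog below is a stand-in for w.writeLog(timeID)). Goroutines spawned in this pattern park on the channel send until a receiver drains them, so they pile up if the receiver falls behind, although parked goroutines by themselves do not burn CPU.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// writeLog is a stand-in for the real w.writeLog(timeID).
func writeLog(timeID int64) error {
	time.Sleep(time.Millisecond)
	return nil
}

func main() {
	done := make(chan error) // unbuffered, as in the snippet above

	// The pattern from the snippet: one goroutine per log write.
	for i := 0; i < 100; i++ {
		timeID := int64(i)
		go func() {
			done <- writeLog(timeID) // each goroutine blocks here until received
		}()
	}

	// With no receiver yet, the spawned goroutines finish writeLog and then
	// sit parked on the channel send.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines parked on send:", runtime.NumGoroutine()-1)

	// Draining the channel lets them all exit.
	for i := 0; i < 100; i++ {
		<-done
	}
}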

1 publish / 4 hours, still 300%+:
// db.startSyncer()
// logWriter
// stats.go
// w.writeLog() with no go routine.

I'll switch to gnats and retry unitdb later.

thanks.

I am still not able to reproduce the error.

What platform are you running unitdb on?

This may not be an issue with spawning multiple goroutines, but there could be a dead loop. The only dead loop I can think of is in DB.loadTrie(), as it iterates over the window file and loads the trie database. What does your window file look like? The file resides under DBPath -> unitdb -> window -> unitdb0000.win.

macOS 11.6
64GB RAM, Intel
go 1.17.2
1 instance (no cluster)
"db_path":"var/unitdb"

I reverted all the changes I made and restarted testing:

There is nothing in db_path at this moment.
macOS Activity Monitor shows that unitdb-server has done n KB of reads/writes.

You can try to reproduce it like this:
connect 1 client
publish 1 message
disconnect the client
publish 1 message

Then let the server run for about 10-30 minutes.

Cool.
Reproduced and fixed?

Yes, reproduced this issue and resolved it.

There was an issue with client-initiated disconnect, i.e. Conn.clientDisconnect(): it was called from readLoop, but Conn.close() was waiting for readLoop to exit. Calling Conn.clientDisconnect() in a separate goroutine resolved the issue with the dangling connection. The latest code for both the unitdb server and client has been pushed to git.
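
To make the shape of that deadlock concrete, here is a minimal standalone sketch; the type and method names mirror the description above, but none of this is the actual unitdb server code.

package main

import (
	"fmt"
	"time"
)

// Conn is a toy stand-in for the server-side connection.
type Conn struct {
	readDone chan struct{} // closed when readLoop returns
}

// Broken shape: readLoop handles the client's DISCONNECT by calling
// clientDisconnect directly. close() waits for readLoop to exit, but readLoop
// is the caller, so it can never return and the wait never completes.
func (c *Conn) readLoopBroken() {
	defer close(c.readDone)
	// ... read packets until the client sends DISCONNECT ...
	c.clientDisconnect() // blocks forever inside close()
}

// Fixed shape, as described above: run the disconnect in its own goroutine so
// readLoop can return and close() can observe that.
func (c *Conn) readLoopFixed() {
	defer close(c.readDone)
	// ... read packets until the client sends DISCONNECT ...
	go c.clientDisconnect()
}

func (c *Conn) clientDisconnect() { c.close() }

func (c *Conn) close() {
	<-c.readDone // wait for readLoop to exit before releasing resources
	fmt.Println("connection closed")
}

func main() {
	c := &Conn{readDone: make(chan struct{})}
	go c.readLoopFixed()
	time.Sleep(100 * time.Millisecond) // give the disconnect goroutine time to finish
}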