shiblon/entroq

Flaky SimpleWorker test for fine-grained lock in-memory implementation

Closed this issue · 3 comments

It was brought to my attention by @ButchIstook (thanks Butch!) that the SImpleWorker test became flaky after the fine-grained lock implementation of in-memory EntroQ was released.

The errors looked like this:

=== RUN   TestEQMemSimpleWorker
[112](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L112)    qtest.go:167: Tasks inserted not the same as tasks consumed:
[113](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L113)          []*entroq.Task{
[114](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L114)          	&{
[115](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L115)          		Queue:    "simple_worker",
[116](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L116)          		ID:       {0x44, 0x44, 0x84, 0xf6, ...},
[117](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L117)        - 		Version:  0,
[118](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L118)        + 		Version:  1,
[119](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L119)        - 		At:       s"1754-08-30 22:43:41.129 +0000 UTC",
[120](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L120)        + 		At:       s"2022-01-27 14:32:52.243 +0000 UTC",
[121](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L121)          		Claimant: {0x37, 0xfe, 0xd5, 0xcc, ...},
[122](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L122)        - 		Claims:   0,
[123](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L123)        + 		Claims:   1,
[124](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L124)          		Value:    nil,
[125](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L125)          		Created:  s"2022-01-27 14:32:22.243 +0000 UTC",
[126](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L126)          		... // 4 identical fields
[127](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L127)          	},
[128](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L128)          	&{
[129](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L129)          		Queue: "simple_worker",
[130](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L130)          		ID: uuid.UUID{
[131](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L131)        - 			0x2a, 0x5c, 0xec, 0xe4, 0xe8, 0xf5, 0x44, 0x95, 0xa6, 0x32, 0x73, 0xc7, 0xa6, 0x70, 0xa5, 0x39,
[132](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L132)        + 			0x35, 0x12, 0xad, 0xba, 0xa4, 0x0d, 0x4c, 0x12, 0x97, 0x9c, 0x47, 0x6e, 0xf9, 0x9d, 0x28, 0xc5,
[133](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L133)          		},
[134](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L134)        - 		Version:  0,
[135](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L135)        + 		Version:  1,
[136](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L136)        - 		At:       s"1754-08-30 22:43:41.129 +0000 UTC",
[137](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L137)        + 		At:       s"2022-01-27 14:32:52.244 +0000 UTC",
[138](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L138)          		Claimant: {0x37, 0xfe, 0xd5, 0xcc, ...},
[139](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L139)        - 		Claims:   0,
[140](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L140)        + 		Claims:   1,
[141](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L141)          		Value:    nil,
[142](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L142)          		Created:  s"2022-01-27 14:32:22.244 +0000 UTC",
[143](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L143)          		... // 4 identical fields
[144](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L144)          	},
[145](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L145)          	&{
[146](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L146)          		Queue: "simple_worker",
[147](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L147)          		ID: uuid.UUID{
[148](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L148)        - 			0x35, 0x12, 0xad, 0xba, 0xa4, 0x0d, 0x4c, 0x12, 0x97, 0x9c, 0x47, 0x6e, 0xf9, 0x9d, 0x28, 0xc5,
[149](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L149)        + 			0xd2, 0xe9, 0x4e, 0xd8, 0x01, 0x8c, 0x47, 0x4d, 0x82, 0xef, 0x89, 0x92, 0xd9, 0xdd, 0xe1, 0xf2,
[150](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L150)          		},
[151](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L151)        - 		Version:   0,
[152](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L152)        + 		Version:   1,
[153](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L153)        - 		At:        s"1754-08-30 22:43:41.129 +0000 UTC",
[154](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L154)        + 		At:        s"2022-01-27 14:32:52.245 +0000 UTC",
[155](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L155)          		Claimant:  {0x37, 0xfe, 0xd5, 0xcc, ...},
[156](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L156)        - 		Claims:    0,
[157](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L157)        + 		Claims:    1,
[158](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L158)          		Value:     nil,
[159](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L159)          		Created:   s"2022-01-27 14:32:22.244 +0000 UTC",
[160](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L160)        - 		Modified:  s"2022-01-27 14:32:22.244 +0000 UTC",
[161](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L161)        + 		Modified:  s"2022-01-27 14:32:22.245 +0000 UTC",
[162](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L162)          		FromQueue: "",
[163](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L163)          		Attempt:   0,
[164](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L164)          		Err:       "",
[165](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L165)          	},
[166](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L166)        - 	s"Task [\"simple_worker\" d2e94ed8-018c-474d-82ef-8992d9dde1f2:v0]\n\tat=\"1754-08-30 22:43:41.129 +0000 UTC\" claimant=37fed5cc-8b1d-4282-90df-585f338104dd claims=0 attempt=0 err=\"\"\n\tval=\"\"",
[167](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L167)          	&{
[168](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L168)          		Queue:    "simple_worker",
[169](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L169)          		ID:       {0x96, 0xe3, 0x80, 0xc8, ...},
[170](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L170)        - 		Version:  0,
[171](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L171)        + 		Version:  1,
[172](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L172)        - 		At:       s"1754-08-30 22:43:41.129 +0000 UTC",
[173](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L173)        + 		At:       s"2022-01-27 14:32:52.245 +0000 UTC",
[174](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L174)          		Claimant: {0x37, 0xfe, 0xd5, 0xcc, ...},
[175](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L175)        - 		Claims:   0,
[176](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L176)        + 		Claims:   1,
[177](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L177)          		Value:    nil,
[178](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L178)          		Created:  s"2022-01-27 14:32:22.245 +0000 UTC",
[179](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L179)          		... // 4 identical fields
[180](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L180)          	},
[181](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L181)          	&{
[182](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L182)          		Queue: "simple_worker",
[183](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L183)          		ID: uuid.UUID{
[184](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L184)          			0x32,
[185](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L185)        - 			0x3e, 0xed, 0x92, 0x24, 0xea, 0x4a, 0x13, 0x9f, 0x37, 0xca, 0xaa, 0x86, 0xc1, 0xa0, 0x9e,
[186](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L186)        + 			0x62, 0x15, 0xab, 0x23, 0x71, 0x4a, 0x73, 0xa7, 0xb4, 0x1d, 0xce, 0x17, 0x0f, 0xd5, 0xe6,
[187](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L187)          		},
[188](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L188)        - 		Version:   0,
[189](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L189)        + 		Version:   1,
[190](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L190)        - 		At:        s"1754-08-30 22:43:41.129 +0000 UTC",
[191](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L191)        + 		At:        s"2022-01-27 14:32:52.246 +0000 UTC",
[192](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L192)          		Claimant:  {0x37, 0xfe, 0xd5, 0xcc, ...},
[193](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L193)        - 		Claims:    0,
[194](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L194)        + 		Claims:    1,
[195](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L195)          		Value:     nil,
[196](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L196)        - 		Created:   s"2022-01-27 14:32:22.245 +0000 UTC",
[197](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L197)        + 		Created:   s"2022-01-27 14:32:22.246 +0000 UTC",
[198](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L198)        - 		Modified:  s"2022-01-27 14:32:22.245 +0000 UTC",
[199](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L199)        + 		Modified:  s"2022-01-27 14:32:22.246 +0000 UTC",
[200](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L200)          		FromQueue: "",
[201](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L201)          		Attempt:   0,
[202](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L202)          		Err:       "",
[203](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L203)          	},
[204](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L204)          	&{
[205](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L205)          		Queue: "simple_worker",
[206](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L206)          		ID: uuid.UUID{
[207](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L207)        - 			0xb3, 0x7b, 0x7a, 0x22, 0x30, 0xe8, 0x4c, 0x09, 0xad, 0x82, 0x4e, 0x98, 0xce, 0xf7, 0x6f, 0xac,
[208](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L208)        + 			0x37, 0xe2, 0xd5, 0x00, 0x91, 0x92, 0x43, 0xd0, 0x9c, 0x08, 0x63, 0x45, 0xbb, 0x75, 0x04, 0x5b,
[209](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L209)          		},
[210](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L210)        - 		Version:   0,
[211](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L211)        + 		Version:   1,
[212](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L212)        - 		At:        s"1754-08-30 22:43:41.129 +0000 UTC",
[213](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L213)        + 		At:        s"2022-01-27 14:32:52.247 +0000 UTC",
[214](https://gitlab.logx.cloud/support/entroq/-/jobs/239207#L214)

This test output is obviously terrible. Some changes to test output revealed that occasionally, a task would be inserted successfully, but never consumed by the worker!

Further testing indicated that it was always the second task to be inserted that disappeared. Always. It didn't matter whether there were 10 or 30 or 2 tasks inserted, when a failure occurred, it was always the second task.

This issue is to document findings and the fixes that went into them, and why those are the fixes.

This took some time to chase down. The issue happens when a queue becomes empty due to a Modify/Delete operation, while a Modify/Insert operation is underway elsewhere. The sequence of events that triggers the race condition is this:

  • Delete: lock global, collect the appropriate queue locks, unlock global
  • Insert: lock global, collect the same queue locks (or overlapping), unlock global
  • Delete: lock queues, remove tasks, note that queues are empty, mark for cleanup, unlock queues
  • Delete: lock global, clean up queues, unlock global
  • Insert (now holding an orphaned data structure for a queue that was deleted): merrily do your work with the queue locks

The issue is the moment between collecting queue locks and obtaining queue locks. The solution is to add a semaphore to the queue lock structure indicating that someone has asked for it and will be locking it later. Then the cleanup routine can simply run when unlocking the queue: decrement the semaphore, check for empty, and delete.

The presence of the semaphore makes it so that the delete procedure above would not delete the queue: the insert procedure is about to claim it and work on it, so the when the delete procedure finishes, the semaphore has not reached 0 (both procedures incremented it in the first two steps above).

The fixes are in the following commits:

b957946 (fix the test output)
1860242 (main fix for race, with flaws)
68af2bf (updated fix for race - semaphore management)
0f27f5b (comment updates)
884d60d (leaky goroutine with potential panic fixed)

Cut release v0.8.5