locuslab/SATNet

How did the CNN in your paper achieve only 0.04% accuracy?

dichen9412 opened this issue · 4 comments

I read your paper and I can't believe that a CNN with a sequence of 10 convolutional layers (each with 512 3×3 filters) can only achieve 0.04% accuracy on 9x9 Sudoku with on average 36 known cells, because, by coincidence, I have used the exact same architecture to solve 9x9 Sudoku with at most 32 given cells with 100% accuracy. I think you didn't train that network well enough to saturate its performance, since the training accuracy in your paper was only about 70%; you can actually train it to 100%.

Someone also did similar things before:
https://github.com/Kyubyong/sudoku
https://www.kaggle.com/bryanpark/sudoku

BTW, Sudoku with on average 36 known cells is actually almost trivial to solve; with a minimal number of known cells (17), however, it becomes much harder. I tried your network on 9x9 Sudoku with 17 known cells and it didn't work.
FYI: this paper https://arxiv.org/pdf/1711.08028.pdf (which you also mention in your paper) achieved around 96% accuracy, so I don't think you can say "our test accuracy is qualitatively similar to the results obtained in Palm et al. (2017)" in your paper.

Thanks for your comments! We’ll follow up shortly with more detailed numbers to go over everything in detail (and we’ll leave the issue open until then), but we wanted to respond quickly to these points.

  1. As cited in the paper, we are comparing against exactly the first repo you mention in your issue. However, we compare against its “one-shot” accuracy (i.e., predicting all cells at once) rather than its sequential accuracy, which chains the network together N times, once per unknown cell, making for a more involved process (this is discussed briefly in the repo README). Because SATNet uses one-shot prediction, this seemed to us to be the fairest comparison, though we’ll certainly emphasize this point more in the paper, and we will follow up in the more detailed response about a performance comparison with the sequential method (a rough sketch of the two evaluation modes is included after this list). Also note that we’re comparing full-puzzle accuracy, where an example counts as incorrect if even a single cell is predicted incorrectly; this differs from the way accuracy is reported in the repo you mention, which gives credit for partially-correct solutions.

  2. You’re right that it’s definitely possible to overfit the training sets here if trained for long enough. We stopped early based upon the test loss, which was no longer decreasing for the one-shot error, so the training numbers we show reflect that earlier stopping point.

  3. That’s a fair point regarding the comparison to Palm et al. and the number of filled entries. The comparison was meant more qualitatively, emphasizing that they still encode substantial structure into the problem, whereas our approach (and the approach from the repo) does not. But we’ll expand upon that in the paper and look to compare at different numbers of held-out cells. Thanks for pointing that out.
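For concreteness, here is a rough sketch of the difference between the one-shot and sequential evaluation modes, and of the full-puzzle metric we report. This is not the repo’s actual code; `model` stands for a hypothetical trained network that maps a 9x9 board (0 for blanks) to per-cell digit probabilities.

```python
import numpy as np

def one_shot_solve(model, board):
    """Predict every blank cell in a single forward pass."""
    probs = model(board)                      # hypothetical: (9, 9, 9) digit probabilities
    pred = probs.argmax(axis=-1) + 1          # digits 1..9
    return np.where(board == 0, pred, board)

def sequential_solve(model, board):
    """Fill only the most confident blank cell, then re-run the network,
    repeating until the board is complete (the slower, chained evaluation)."""
    board = board.copy()
    while (board == 0).any():
        probs = model(board)
        conf = probs.max(axis=-1)
        conf[board != 0] = -1.0               # only consider blank cells
        r, c = np.unravel_index(conf.argmax(), conf.shape)
        board[r, c] = probs[r, c].argmax() + 1
    return board

def full_puzzle_accuracy(preds, solutions):
    """A puzzle counts as correct only if all 81 cells match."""
    return np.mean([np.array_equal(p, s) for p, s in zip(preds, solutions)])
```

The numbers we report correspond to `one_shot_solve` scored with `full_puzzle_accuracy`.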

More shortly.

Thank you so much for your quick reply.

  1. I know you are counting exactly correct Sudoku boards rather than individual digits. My point was that you can train that CNN better. I just checked my old code for that problem, and the CNN still reaches 37.5% accuracy with one-shot prediction; with iterative prediction, the accuracy was 100%. I used a dataset with 24-32 known cells, while you were using a dataset with on average 36.2 known cells, so I'd guess the CNN could do even better on your dataset. I remember that the CNN took a relatively long time to train, but it eventually worked OK.
    I just want to be fair to the other baseline models.

  2. Another question: since your model is named SATNet, I'm wondering whether it could solve SAT problems, like 3-SAT.
    It looks like SATNet requires a fixed relationship among the variables, while each SAT instance may have different clauses. Thus, I think the only way to use SATNet for SAT instances would be to train on k-SAT (e.g., 3-SAT) and feed the indices of the literals in each clause as the variables in SATNet.
    e.g.: (x_1, x_2, -x_3) and (x_4, -x_5, x_6) => CNF form: [1, 2, -3], [4, -5, 6],
    where the variables given to SATNet are the integer indices in each clause together with the boolean literals of the solution.
    Then, hopefully, there is a chance that SATNet can learn the mapping from a 3-SAT CNF formula to its solution (a rough sketch of the encoding I have in mind is below). Did you try that?
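To make the encoding concrete, here is a rough sketch of what I mean (purely illustrative; the one-hot layout and the names `num_vars` / `num_clauses` are just one possible choice, nothing from your paper):

```python
import numpy as np

def encode_instance(clauses, num_vars, num_clauses):
    """Encode a k-SAT instance as a fixed-length vector: one one-hot block
    per literal slot for the variable index, plus one bit for the sign."""
    k = len(clauses[0])
    x = np.zeros((num_clauses, k, num_vars + 1), dtype=np.float32)
    for i, clause in enumerate(clauses):
        for j, lit in enumerate(clause):
            x[i, j, abs(lit) - 1] = 1.0                    # which variable appears
            x[i, j, num_vars] = 1.0 if lit > 0 else 0.0    # sign of the literal
    return x.reshape(-1)

# the example clauses above: [1, 2, -3] and [4, -5, 6]
inputs = encode_instance([[1, 2, -3], [4, -5, 6]], num_vars=6, num_clauses=2)
# SATNet's outputs would then be the 6 boolean assignments of the solution
```

As long as num_vars, num_clauses, and k are fixed across the dataset, every instance gets the same input layout.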

SATNet may not be able to solve 3-SAT problems. Is this correct?

  1. Thanks for the additional information! Another possible point of discrepancy is that all of our models use only 9K training examples, as we are explicitly looking to evaluate SATNet's ability to learn logical structure from relatively little data. The other repos (e.g., the first one cited) use a much larger number of examples (1M), which likely has a large impact on the CNN's performance. That said, if you're willing to share the code/dataset your results are based on, we would be eager to compare the setups in more detail.

  2. SATNet solves an approximation of MAXSAT based on the Goemans-Williamson MAXSAT SDP relaxation. Any specific setting S of the SATNet layer weights corresponds to a particular CNF formula. As such, you are right that we assume all inputs and outputs to SATNet are governed by the same CNF formula (though we do not know this formula ahead of time and therefore must learn S). In the setting you describe, where a CNF formula is explicitly given, we can simply set S to the value corresponding to that CNF and then solve the MAXSAT SDP (see https://arxiv.org/abs/1706.00476 for more information about the optimization approach for solving the MAXSAT SDP).
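As a rough illustration of that correspondence (schematic only: the actual layer also includes a "truth direction" column and per-clause normalization, and the exact parameterization in the code may differ), a given CNF maps to a clause matrix with one row per clause and entries +1 / -1 / 0 for positive, negated, and absent literals:

```python
import numpy as np

def cnf_to_S(clauses, num_vars):
    """Schematic clause matrix: rows are clauses, columns are variables,
    entries are the signs of the literals (0 if the variable is absent)."""
    S = np.zeros((len(clauses), num_vars))
    for i, clause in enumerate(clauses):
        for lit in clause:
            S[i, abs(lit) - 1] = np.sign(lit)
    return S

# the CNF from the example above: [1, 2, -3] and [4, -5, 6]
S = cnf_to_S([[1, 2, -3], [4, -5, 6]], num_vars=6)
# [[ 1.  1. -1.  0.  0.  0.]
#  [ 0.  0.  0.  1. -1.  1.]]
```

With S fixed in this way there is nothing left to learn; recovering a solution just amounts to running the MAXSAT SDP solver described in the paper linked above.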

Thanks for your reply.