Custom DIMACS parser misinterprets the number of variables in CNF Formula
Let's consider the file Benchmarks/GuidanceService.sk_4_27.cnf in the benchmarks folder as an example.
The first few lines of the file are:
c ind 2 3 4 5 6 7 8 9 0
c ind 10 11 12 13 14 15 16 17 18 19 0
c ind 20 21 22 23 24 25 26 27 28 0
p cnf 988 3088
71 0
145 0
336 0
343 0
332 0
-331 0
As per my understanding of the DIMACS-CNF format, the following line identifies the total number of boolean variables and clauses in the CNF file:
p cnf 988 3088
This line states that there are 988 variables and 3088 clauses.
From my study of your codebase, I've realized that you're using a non-standard DIMACS parser, which interprets this file as having only 28 variables. The samples generated by your code therefore contain boolean-variable assignments over 28 variables instead of the required 988.
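For reference, here is a minimal, self-contained sketch of how I would expect the "p cnf" header to be read under the standard interpretation (this is only an illustration I wrote, not code from your repository; the file path is just the benchmark above):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Hypothetical example, not part of the tool: read the declared counts
    // from the "p cnf <variables> <clauses>" header of a DIMACS file.
    std::ifstream f("Benchmarks/GuidanceService.sk_4_27.cnf");
    std::string line;
    int n_vars = 0, n_clauses = 0;
    while (std::getline(f, line)) {
        if (line.rfind("p cnf", 0) == 0) {
            std::istringstream iss(line);
            std::string p, cnf;
            iss >> p >> cnf >> n_vars >> n_clauses;  // "p" "cnf" 988 3088
            break;
        }
    }
    std::cout << n_vars << " variables, " << n_clauses << " clauses\n";
    return 0;
}

For this file, that would report 988 variables and 3088 clauses.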
The reason your code identifies 28 variables is that it interprets the following header lines of the DIMACS file GuidanceService.sk_4_27.cnf
c ind 2 3 4 5 6 7 8 9 0
c ind 10 11 12 13 14 15 16 17 18 19 0
c ind 20 21 22 23 24 25 26 27 28 0
not as comments, but as lists of integers representing the indices of the variables of the boolean formula under study.
Is this a standard way of interpreting DIMACS-CNF files? Is there a reason why this format has been assumed?
In your codebase, the function parse_cnf() is used to parse the CNF files and feed the information into the Z3.optimize() object. The following lines are the ones I've determined to be the cause of the misinterpretation. Here is a subsection of the code from the beginning of the parse_cnf() function:
while (getline(f, line)) {
    std::istringstream iss(line);
    if(line.find("c ind ") == 0) {
        std::string s;
        iss >> s;
        iss >> s;
        int v;
        while (!iss.eof()) {
            iss >> v;
            if (v && indset.find(v) == indset.end()) {
                indset.insert(v);
                ind.push_back(v);
                has_ind = true;
            }
        }
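In other words, every integer that follows the "c ind" token is collected into ind as an independent-support variable; the trailing 0 is skipped by the v && ... check and effectively acts only as a terminator.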
Here is also some code from the end of the parse_cnf() function:
if (!has_ind) {
    for (int lit = 0; lit <= max_var; ++lit) {
        if (indset.find(lit) != indset.end()) {
            ind.push_back(lit);
        }
    }
}
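So the all-variables behaviour I expected only applies when a file contains no "c ind" lines at all: in that case every variable index seen in a clause (up to max_var) is added to the support.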
The original function definition is provided below for reference:
void parse_cnf() {
    z3::expr_vector exp(c);
    std::ifstream f(input_file);
    if (!f.is_open()) {
        std::cout << "Error opening input file\n";
        abort();
    }
    std::unordered_set<int> indset;
    bool has_ind = false;
    int max_var = 0;
    std::string line;
    while (getline(f, line)) {
        std::istringstream iss(line);
        if(line.find("c ind ") == 0) {
            std::string s;
            iss >> s;
            iss >> s;
            int v;
            while (!iss.eof()) {
                iss >> v;
                if (v && indset.find(v) == indset.end()) {
                    indset.insert(v);
                    ind.push_back(v);
                    has_ind = true;
                }
            }
        } else if (line[0] != 'c' && line[0] != 'p') {
            z3::expr_vector clause(c);
            int v;
            while (!iss.eof()) {
                iss >> v;
                if (v > 0)
                    clause.push_back(literal(v));
                else if (v < 0)
                    clause.push_back(!literal(-v));
                v = abs(v);
                if (!has_ind && v != 0)
                    indset.insert(v);
                if (v > max_var)
                    max_var = v;
            }
            if (clause.size() > 0)
                exp.push_back(mk_or(clause));
        }
    }
    f.close();
    if (!has_ind) {
        for (int lit = 0; lit <= max_var; ++lit) {
            if (indset.find(lit) != indset.end()) {
                ind.push_back(lit);
            }
        }
    }
    z3::expr formula = mk_and(exp);
    opt.add(formula);
}
I only discovered this discrepancy because I tried to port your codebase to Python using the Z3 Python API and noticed a huge mismatch in the running time of the two implementations: the sample projection in your C++ implementation runs about 10 orders of magnitude faster than the Python implementation on the same DIMACS file. I couldn't understand how the standard Z3 API library alone could account for such a large performance difference.
I would really appreciate it if you could help me understand why this non-standard way of interpreting the DIMACS-CNF file has been adopted for this algorithm. Thank you for your guidance.
Hello,
As described in our paper (https://dl.acm.org/doi/pdf/10.1145/3180155.3180248), Section 3.2, our benchmarks are the same as the ones from UniGen2, which specify an independent support for the formula.
The benchmarks already come with a list of variables that form an "independent support", meaning that assigning values to those variables completely determines the values of all the other variables.
That's why our algorithm operates over the variables in the independent support and outputs assignments to those variables.
If you want to apply the tool to benchmarks where no independent support is specified, you could operate over all the variables, or apply techniques such as the one from @kuldeepmeel et al. ("On computing minimal independent support and its applications to sampling and counting") to compute an independent support for the formula.
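To illustrate the idea with a toy example (this snippet is only an illustration written for this reply, not part of the tool): for the formula x3 <-> (x1 AND x2), the set {x1, x2} is an independent support, because any assignment to x1 and x2 forces the value of x3. One way to check that property with the Z3 C++ API is to assert two copies of the formula that share the support variables but disagree on x3, and confirm the result is unsatisfiable:

#include <z3++.h>
#include <iostream>

int main() {
    z3::context c;
    // Toy formula: x3 <-> (x1 AND x2). Claim: {x1, x2} is an independent support.
    z3::expr x1 = c.bool_const("x1");
    z3::expr x2 = c.bool_const("x2");
    z3::expr x3 = c.bool_const("x3");
    z3::expr f  = (x3 == (x1 && x2));

    // Copy of the formula with x3 renamed to y3, sharing x1 and x2.
    z3::expr y3 = c.bool_const("y3");
    z3::expr g  = (y3 == (x1 && x2));

    // If two models agree on the support {x1, x2}, they must agree on x3,
    // so f && g && (x3 != y3) should be unsatisfiable.
    z3::solver s(c);
    s.add(f && g && (x3 != y3));
    std::cout << (s.check() == z3::unsat ? "{x1, x2} is an independent support"
                                         : "not an independent support")
              << std::endl;
    return 0;
}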