A Write-Based Solver for SAT Modulo the Theory of Arrays

The extensional theory of arrays is one of the most important ones for applications of SAT modulo theories (SMT) to hardware and software verification. Here we present a new T-solver for arrays in the context of the DPLL(T) approach to SMT. The main characteristics of our solver are: (i) no translation of writes into reads is needed, (ii) there is no axiom instantiation, and (iii) the T-solver interacts with the Boolean engine by asking to split on equality literals between indices. Unlike most state-of-the-art array solvers, it is not based on a lazy instantiation of the array axioms. This novelty might make it more convenient to apply this solver in some particular environments. Moreover, it is very competitive in practice, especially on problems that require heavy reasoning on array literals.


I. INTRODUCTION
Over the last few years, traditional generic proof-search methods have been increasingly replaced by more efficient domain-specific procedures or procedures for fragments of certain logics. Despite being more specific, these procedures have found their way into many real-world applications, since most problems can be decomposed, either manually or automatically, into smaller problems to which these concrete procedures can be applied.
Among them, Satisfiability Modulo Theories (SMT) tools have received special attention. These procedures decide the satisfiability of (usually ground) formulas modulo some background theory T . The choice of the theory depends on the particular application area: when verifying complex designs, the theory of Equality with Uninterpreted Functions (EUF) comes in handy since it allows one to abstract away uninteresting or too complicated details of the system to be verified [1]; if, on the other hand, one is interested in verifying low-level aspects of, e.g., a microprocessor, the theory of bit-vectors provides the necessary level of detail [2], [3]; and, just to give another example, for the verification of timed automata a certain fragment of linear arithmetic called Difference Logic is the most appropriate choice [4]. Hence, SMT tools deal with problems that may consist of thousands of clauses like: p ∨ a = f (b−c) ∨ read(s, f (b−c) ) = d ∨ a−g(c) ≤ 7 containing purely propositional atoms as well as atoms over (combined) theories.
One of the main approaches to SMT has been called DPLL(T) [5], which consists of a general DPLL(X) engine, very similar in nature to a SAT solver, that works in cooperation with a theory solver Solver T . In the most basic version of this approach the engine is in charge of enumerating propositional models of the formula, whereas Solver T is responsible for checking whether these models are consistent with the theory T (e.g., if T is the theory of linear arithmetic and the current Boolean model contains x + 2y ≤ 0, −y − (1/2)z ≤ 0 and x − z > 1, then Solver T has to detect that the current assignment is T-inconsistent).
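As a worked check of the linear arithmetic example above (reading its second constraint as −y − (1/2)z ≤ 0), the following short derivation shows why the three literals cannot hold together. It is only an illustration of what a T-inconsistency is, not a description of how an arithmetic solver would find it.

```latex
\begin{align*}
-y - \tfrac{1}{2} z \le 0 \;&\Longrightarrow\; -2y \le z \\
x + 2y \le 0 \;&\Longrightarrow\; x \le -2y \le z \;\Longrightarrow\; x - z \le 0
\end{align*}
```

The conclusion x − z ≤ 0 contradicts the third literal x − z > 1, so the assignment is T-inconsistent.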
In this paper, we present a new Solver T for the Theory of Arrays with Extensionality. This theory is useful for both software and hardware verification, since it can be used to model the behavior of, e.g., arrays and memories. The signature of the theory consists of two interpreted function symbols: read, used to retrieve the element stored at a certain index of the array, and write, used to modify an array by updating the element stored at a certain index. More formally, their behavior can be modeled with the following axioms:

$$\forall a : \mathit{Array},\ \forall i, j : \mathit{Index},\ \forall x : \mathit{Value}$$
$$i = j \;\Rightarrow\; \mathit{read}(\mathit{write}(a, i, x), j) = x$$
$$i \neq j \;\Rightarrow\; \mathit{read}(\mathit{write}(a, i, x), j) = \mathit{read}(a, j)$$

Finally, one still has to consider the extensionality property, stating that two arrays that coincide on all indices are indeed equal. This is enforced by the axiom:

$$\forall a, b : \mathit{Array} \quad \big(\forall i : \mathit{Index}\ \ \mathit{read}(a, i) = \mathit{read}(b, i)\big) \;\Rightarrow\; a = b$$

A possible approach to decide the satisfiability of ground literals in the theory of arrays with extensionality is to consider the necessary instances of the aforementioned axioms and let them be handled by the DPLL(X) engine in cooperation with a Solver T for EUF. In this setting, the array solver is reduced to a module that only generates clauses. This approach is used in state-of-the-art SMT solvers like YICES [6] and Z3 [7] but, as far as we know, there is no precise description of it in the literature.
Our approach does not follow the same lines but is rather based on a careful analysis of the problem that allows us to infer, for each array, which are the relevant indices and which values are stored at these indices. Once this information has been made transparent, checking satisfiability of sets of literals becomes an easy task. Unlike the approach of [8], we do not remove significant write's from the problem, and hence there is no need to use the notion of "equality of arrays except in a certain set of indices" used in [8].
As usual, the extensionality axiom is addressed by forcing that, if two arrays are different, there is an index, called a disequality witness, where the two arrays do not coincide. For efficiency reasons, the introduction of these witnesses is delayed so as to reduce the combinatorial explosion caused by the comparison between indices. This explosion is also mitigated in our solver by identifying situations where the set of literals can be proved to admit a model without the need to construct a total partition over the sets of values and indices.
These ideas lead to a very natural and easy-to-understand solver. As a result, all proofs also become noticeably intuitive. Moreover, with an adequate strategy, our solver performs very well on problems that require heavy reasoning on array literals.
The rest of the paper is structured as follows. In Section II we give basic definitions and notation. Then, in Section III we describe our solver in terms of transformation rules and prove their correctness. After that, in Section IV we give some details of the integration of the solver in a DPLL(T ) system. Finally, we present experimental results in Section V and we conclude in Section VI.

II. PRELIMINARIES
We will consider a set of index constants denoted by I and a set of value constants denoted by V. An array expression is either an array constant or write(A, i, x) where A is an array expression, i is an index constant and x is a value constant. Arrays of the form write(. . . write(a, i_1, x_1) . . . , i_n, x_n) will be written as write(a, i_1, . . . , i_n, x_1, . . . , x_n), and as write(a, I, X) for short. Note that the first elements of I and X correspond to the innermost write and the last elements to the outermost write. We are using vector notation for I and X to emphasize that the elements are ordered and to refer to the elements at position p as i_p and x_p, respectively. Given the vector I, we will also write I to denote the set of its elements. As a convention, the array constant a will be seen, when necessary, as write(a, ∅, ∅).
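The flattened notation write(a, I, X) can be illustrated with a small sketch. The class names `Const` and `Write` and the function `flatten` are hypothetical names chosen for this illustration; they are not taken from the implementation described in this paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    """An array constant such as a, b, c."""
    name: str


@dataclass(frozen=True)
class Write:
    """The array expression write(array, index, value)."""
    array: "Expr"
    index: str
    value: str


Expr = Union[Const, Write]


def flatten(e):
    """Return (a, I, X) for a nested write expression.

    I and X are tuples ordered so that position 0 is the innermost write
    and the last position is the outermost write.
    """
    indices, values = [], []
    while isinstance(e, Write):
        indices.append(e.index)   # collected outermost first
        values.append(e.value)
        e = e.array
    indices.reverse()             # innermost write comes first
    values.reverse()
    return e, tuple(indices), tuple(values)
```

For instance, `flatten(Write(Write(Const("a"), "i1", "x1"), "i2", "x2"))` yields the constant a together with the vectors (i1, i2) and (x1, x2), and an array constant alone yields empty vectors, matching the convention write(a, ∅, ∅).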
In what follows, possibly subscripted a, b, c denote array constants and A, B, C denote array expressions. When we write ≡ we mean syntactic equality between expressions, and =_set denotes equality between sets.
An array satisfaction (ASAT) problem is a conjunction of literals of the form: • ⊥ (representing an unsatisfiable formula).

A. The Extensional Theory of Arrays
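The extensional theory of arrays is defined by the following axioms:

read-write axioms:
$$\forall a : \mathit{Array},\ \forall i, j : \mathit{Index},\ \forall x : \mathit{Value}$$
$$i = j \;\Rightarrow\; \mathit{read}(\mathit{write}(a, i, x), j) = x$$
$$i \neq j \;\Rightarrow\; \mathit{read}(\mathit{write}(a, i, x), j) = \mathit{read}(a, j)$$

extensionality axiom:
$$\forall a, b : \mathit{Array} \quad \big(\forall i : \mathit{Index}\ \ \mathit{read}(a, i) = \mathit{read}(b, i)\big) \;\Rightarrow\; a = b$$

The following well-known axioms are not necessary if we consider extensionality, since they are entailed by the previous axioms:

write-write axioms:
$$\forall a : \mathit{Array},\ \forall i, j : \mathit{Index},\ \forall x, y : \mathit{Value}$$
$$i = j \;\Rightarrow\; \mathit{write}(\mathit{write}(a, i, x), j, y) = \mathit{write}(a, j, y)$$
$$i \neq j \;\Rightarrow\; \mathit{write}(\mathit{write}(a, i, x), j, y) = \mathit{write}(\mathit{write}(a, j, y), i, x)$$

Note, however, that the read-write and write-write axioms do not entail the extensionality axiom.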
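To make the axioms concrete, the following sketch checks the read-write and write-write axioms by brute force on one small finite model, where arrays are Python dictionaries over three indices and two values. The domain sizes and the function names are assumptions made for the illustration. Dictionary equality is extensional by construction, so extensionality holds trivially in this particular model.

```python
from itertools import product

INDICES = range(3)   # assumed finite index domain for the illustration
VALUES = range(2)    # assumed finite value domain for the illustration


def write(a, i, x):
    """Return a new array equal to a except that index i holds x."""
    b = dict(a)
    b[i] = x
    return b


def read(a, i):
    """Return the element stored at index i of array a."""
    return a[i]


def all_arrays():
    """Enumerate every total map from INDICES to VALUES."""
    for vals in product(VALUES, repeat=len(INDICES)):
        yield dict(zip(INDICES, vals))


def check_axioms():
    """Check read-write and write-write axioms on every array of the model."""
    for a in all_arrays():
        for i, j, x, y in product(INDICES, INDICES, VALUES, VALUES):
            w = write(a, i, x)
            if i == j:
                assert read(w, j) == x
                assert write(w, j, y) == write(a, j, y)
            else:
                assert read(w, j) == read(a, j)
                assert write(w, j, y) == write(write(a, j, y), i, x)
    return True
```

Such a check only confirms that the axioms hold in this one model; it says nothing about entailment between the axioms in general.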
Example 1: Let us consider the following set of literals: The two literals on the left imply that read(b, i) ≠ x. This can be seen by replacing a by b in the first disequality, applying a read at position i at both sides of the disequality and using the first read-write axiom. The two literals on the right imply that x = y. This can be deduced if we use that i = j and apply the first read-write axiom at the top-right hand equation. Together with the equation in the middle we get that read(b, i) = x, which shows that the set of literals is T-inconsistent. □

B. Models
The models we will consider are given by a mapping I_I from I to some domain D_I, a mapping I_V from V to some domain D_V and a mapping I_A from array constants to functions f_A :

III. A NEW SOLVER FOR THE EXTENSIONAL THEORY OF ARRAYS
In this section we present, in terms of transformation rules, a solver for deciding the satisfiability of ASAT problems. Our transformation system deals with conjunctions of literals, but they will be represented as M | P , so as to explicitly distinguish the subset M of literals that can be treated by a simple Union-Find algorithm, i.e., (dis)equalities between indices or values, from the conjunction P of those literals involving arrays. As expected, M | P has to be understood as M ∧ P . Given a conjunction of literals P , by P {a := A} we mean the result of replacing all occurrences of a by A in P .
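The M component can be pictured with a minimal Union-Find sketch that stores equalities as merged classes and disequalities as a list to be checked. The class name `EqualityStore` and its methods are hypothetical and are not taken from the implementation described in this paper.

```python
class EqualityStore:
    """Holds (dis)equalities between index or value constants."""

    def __init__(self):
        self.parent = {}   # constant -> parent constant in its class
        self.diseqs = []   # asserted disequalities (c, d)

    def find(self, c):
        """Return the representative of the class of c."""
        self.parent.setdefault(c, c)
        while self.parent[c] != c:
            self.parent[c] = self.parent[self.parent[c]]  # path halving
            c = self.parent[c]
        return c

    def assert_eq(self, c, d):
        """Record c = d by merging the two classes."""
        self.parent[self.find(c)] = self.find(d)

    def assert_neq(self, c, d):
        """Record the disequality between c and d."""
        self.diseqs.append((c, d))

    def consistent(self):
        """M is consistent iff no disequality relates two merged constants."""
        return all(self.find(c) != self.find(d) for c, d in self.diseqs)
```

With this store, asserting i = j, j = k and a disequality between i and l is consistent, and it becomes inconsistent once k = l is added.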
Definition 1: The system of transformation rules in Figure 1 is called A.
Let us briefly comment on some of the rules. First of all, for rules like True equality, Array inconsistency or Disequality witness introduction, recall that an array constant a can be expressed as write(a, ∅, ∅).
The Significance query rule is used to detect the set of significant indices in an array expression, whereas the Equality index query rule is used to detect equal indices in both sides of an equality.
In the Write introduction rule, by propagating b := write(c, i, x) in the conclusion we make explicit that b is an array with x at position i. Note that we could keep that literal in order to be able to later recover a model for the original set of array constants. This is also the case for the Substitution rule.
Finally, let us mention that the Disequality witness introduction rule takes care of extensionality: if we have write(a, I, X) ≠ write(b, J, Y) and we know that the difference is not in the significant indices in I (which coincide with the significant indices in J), then we can infer that a and b are different at some position not in I. The introduction of these witnesses can be delayed at convenience. Similarly to Write introduction and Substitution, we could keep the two substitutions as literals so as to later recover a model for the original problem.
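On a concrete model, the idea of a disequality witness is simple to state: two arrays over the same index domain differ if and only if some index separates them. The sketch below assumes arrays given as dictionaries with identical key sets; the function name is hypothetical. It illustrates only the semantic notion, not the transformation rule, which introduces a fresh index constant instead of searching a model.

```python
def disequality_witness(a, b):
    """Return an index where a and b differ, or None if they are equal.

    Assumes a and b are total maps over the same set of indices.
    """
    for i in sorted(a.keys() | b.keys()):
        if a.get(i) != b.get(i):
            return i
    return None
```

For example, `disequality_witness({0: 1, 1: 2}, {0: 1, 1: 3})` returns index 1, while two equal maps have no witness.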
As we will show below, in all these rules the premise and the disjunction of the conclusions are equisatisfiable. Hence, they can be used to produce a derivation tree rooted by the original array problem and where every non-terminal node has as children the conclusions of an application of a transformation rule. As we will see, the problem will be unsatisfiable if and only if all terminal nodes are ⊥.
Lemma 1: For all rules in A the premise and the disjunction of its conclusions are equisatisfiable.
Proof: We show that there is a model for the premise if and only if there is a model for the disjunction of the conclusions. We only detail the proof for the following two rules:
• Write introduction. For the left to right implication, let (I_I, I_V, I_A) be a model for M | P ∧ write(a, I, X) = write(b, J, Y). Since by the conditions of the transformation rule, I_I(i_p) ≠ I_I(i_k) for all k > p, and [...] and hence also of the conclusion. For the right to left implication, since b does not appear in the conclusion we only have to extend the model so that b is interpreted as write(c, i_p, x_p).
Definition 3: An A-derivation tree is a tree whose nodes are ASAT problems and where the children of a node M | P are the conclusions of a transformation rule in A applied to M | P . We will say that an A-derivation tree is solved iff all its leaves are solved forms.
Lemma 3: All paths in an A-derivation tree are finite.
Proof: First of all we show that given an ASAT problem M | P , the set of indices and values that can occur in any A-derivation tree rooted by M | P is finite. This is easy to see since the only rule that introduces new constants is Disequality witness introduction, but since every application of this rule removes a disequality literal and no other rule introduces new disequalities, this can only happen a finite number of times. Hence, this also proves that this rule cannot be applied an infinite number of times.
Regarding the rules Significance query, Equality index query and Equality values propagation, they can only be applied a finite number of times. This is because they add a (dis)equality between indices or values to M that was not previously entailed by it. Since there are only a finite number of indices and values, this can only happen finitely often.
Similarly, the Write introduction rule can only be applied a finite number of times. The key argument is based on the two following facts: (i) the number of literals appearing in any node of the derivation tree does not increase, since no rule adds new literals, and (ii) no rule removes write's from an equality between array expressions. Given these two facts, if we denote by N_I the maximum number of indices that can occur in the derivation tree, we know that for every equality literal in the original problem, we can apply Write introduction at most 2 · N_I times, since after 2 · N_I applications all indices will appear in both sides of the equality and the rule will no longer be applicable.
Hence, any infinite derivation would have to end with an infinite sequence of applications of UF-Dispatch, Read-Write, Read2Write, Substitution and True equality. But this cannot be the case: given an ASAT problem M | P , we can associate to it the triple of natural numbers (#literals in P, #reads, |P|), where |P| is the size of P . It is an easy exercise to check that, using a lexicographic ordering, all remaining rules decrease that triple of natural numbers.
The following lemma is easily proved by comparing the conditions imposed on solved forms and the conditions imposed on the rules.
Lemma 4: If an ASAT problem is not a solved form then a transformation rule in A can be applied.
Proof: If M is not consistent then we can apply the UF-inconsistency rule. Otherwise, we proceed by case analysis according to the kind of literal that is not in solved form.
1) The literal is of the form a = A. If a does not occur in A then the Substitution rule can be applied. Otherwise, if A ≡ a then True equality can be applied. Finally, if A is write(a, I, X) with I ≠ ∅, we can apply Write introduction for the last index i_n in I. Then, we can apply the UF-Dispatch rule. □
Since the premises and the disjunction of the conclusions of every transformation rule are equisatisfiable, the following lemma holds.
Lemma 5: An ASAT problem M | P is satisfiable if and only if at least one leaf of a solved A-derivation tree rooted by M | P is not ⊥.
Now we present our main result.
Theorem 1: The extensional theory of arrays can be decided with A.
Proof: By Lemmas 3 and 4, given an ASAT problem M | P , a solved A-derivation tree can be obtained. Then, by Lemma 5, we can decide whether the problem is satisfiable. □
Note that we can apply any strategy in the application of the transformation rules.

IV. INTEGRATION OF THE SOLVER IN DPLL(T )
In the previous section, we showed how to check the satisfiability of conjunctions of literals, but SMT deals with arbitrary formulas. For that purpose, in the DPLL(T ) approach to SMT a Boolean engine DPLL(X) works in cooperation with a theory solver Solver T . In its most basic version, the engine enumerates all propositional models of the formula and Solver T checks the models (seen as conjunctions of literals) for consistency over the theory T . As expected, the input formula is declared satisfiable as soon as a T -consistent propositional model is found.
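The most basic form of this cooperation can be sketched as a brute-force loop: enumerate propositional models and hand each one to a theory check. The sketch below uses a toy theory of equalities between constants instead of arrays, and every name in it is hypothetical. Real DPLL(T) engines do not enumerate assignments exhaustively.

```python
from itertools import product


def t_consistent(literals):
    """Toy theory check. A literal is (polarity, (lhs, rhs)) for lhs = rhs."""
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            c = parent[c]
        return c

    for polarity, (lhs, rhs) in literals:
        if polarity:                      # merge classes for each equality
            parent[find(lhs)] = find(rhs)
    # every negated equality must relate two distinct classes
    return all(find(l) != find(r) for pol, (l, r) in literals if not pol)


def naive_dpll_t(clauses, atoms):
    """Return the first T-consistent propositional model, or None.

    clauses is a list of clauses; a clause is a list of (polarity, atom).
    """
    for bits in product([True, False], repeat=len(atoms)):
        model = dict(zip(atoms, bits))
        if all(any(model[a] == pol for pol, a in c) for c in clauses):
            if t_consistent([(model[a], a) for a in atoms]):
                return model
    return None
```

With atoms a = b, b = c and a = c, the clause set that asserts the first two and negates the third has propositional models but no T-consistent one, so the function returns `None`.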
In our implementation of the Solver T described in this paper, array expressions are stored using a DAG, where the nodes are array constants and the edges are labelled by the index and the value of the corresponding write. In this way we can share information and have an efficient way of applying substitutions of array constants by write expressions. In addition, (dis)equalities between constants are stored in a Union-Find data structure. In what follows, we give some details about usual optimizations on DPLL(T) systems and how they are implemented in our solver.
Incrementality: there is no need to delay consistency checks until a full propositional model has been found. One can check the T-consistency of partial assignments while they are being built, with the aim of detecting T-inconsistencies at an earlier stage. In order to fully exploit this feature, it is interesting to ask Solver T to be incremental. That is, once an assignment M has been found T-consistent, processing the addition of a literal l should be done faster (on average) than reprocessing the whole assignment M ∪ {l} from scratch.
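One possible reading of this DAG is sketched below: each array constant defined by a write points to its base constant through an edge labelled with the written index and value. The class `ArrayDag` and its methods are hypothetical. The sketch makes a strong simplifying assumption, flagged in the code: indices not known to be equal are treated as distinct, whereas the solver described here would have to consult the Union-Find or request a split.

```python
class ArrayDag:
    """Array constants as nodes; an edge b -> a labelled (i, x) means b = write(a, i, x)."""

    def __init__(self):
        self.edge = {}   # target constant -> (base constant, index, value)

    def add_write(self, target, base, index, value):
        """Record target = write(base, index, value)."""
        self.edge[target] = (base, index, value)

    def read(self, array, index, same=lambda i, j: i == j):
        """Follow write edges looking for the value stored at index.

        ASSUMPTION: indices for which same() is false are treated as
        distinct, which is a simplification made for this illustration.
        """
        while array in self.edge:
            base, i, x = self.edge[array]
            if same(i, index):
                return x
            array = base
        return None   # not determined by the recorded writes
```

After recording b = write(a, i, x) and c = write(b, j, y), reading c at j gives y, reading c at i gives x, and reading c at an unrelated index k is left undetermined.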
For that purpose, for every array (dis)equality literal we have a watched pair of indices, one in each side, that is used for the analysis of satisfiability of the literal. If the literal is not in solved form and we know whether these two indices and their associated values are equal or not, and whether they are significant, we move the watched pair to other indices in order to avoid repeated work in future checks.
Splitting on demand: if some of this information is unknown, we allow the solver not to give a conclusive answer, but rather to ask DPLL(X) to split on a certain equality between indices. This refinement, presented in [9] and called splitting on demand, allows reusing the case-splitting infrastructure present in DPLL(X) instead of duplicating it inside Solver T . This simplifies the implementation of all splitting rules presented in the previous section.
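The interaction can be pictured as a theory-side function with three possible answers: a conclusive result in either direction, or a request to split. The sketch below handles a single step of evaluating read(write(a, i, x), j); the function name, the shape of the return values and the `known` map are assumptions made for the illustration.

```python
def read_over_write_step(i, j, known):
    """One step of evaluating read(write(a, i, x), j).

    known maps frozenset({i, j}) to True (equal) or False (distinct);
    a missing entry means the Boolean engine has not decided i = j yet.
    """
    status = True if i == j else known.get(frozenset((i, j)))
    if status is True:
        return ("value", "x")        # first read-write axiom applies
    if status is False:
        return ("read_base", j)      # second read-write axiom applies
    return ("split", (i, j))         # ask the engine to decide i = j
```

When the status of i = j is unknown, the function returns `("split", ("i", "j"))` rather than branching internally, which is the division of labour that splitting on demand aims for.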
Theory propagation: if, when checking a non-solved literal, we have all the information about equalities between indices but not about values, we can propagate an equality between values, applying the Equality values propagation rule. Similarly, by successive applications of Read-Write we can sometimes infer a disequality between values that is propagated by UF-Dispatch. If such a (dis)equality already exists in the input formula we can notify it to the DPLL(X) engine. This optimization, introduced in [10], is very effective in reducing the search space.
Backtracking: sometimes an inconsistency can be detected, and then it is beneficial to backtrack to a point where the assignment was still T -consistent, instead of restarting the search. Hence, we need Solver T to be backtrackable. Our solver annotates some information with timestamps, e.g. the one given by the Union-Find, and some other information is restored using a trail stack.
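As an illustration of backtrackability, the sketch below restores a Union-Find with a trail stack. Note that this swaps in a different technique from the one stated above for the Union-Find, which uses timestamps; the trail is used here only because it is the shorter of the two to show. All names are hypothetical.

```python
class BacktrackableUF:
    """Union-Find whose unions can be undone in reverse order."""

    def __init__(self):
        self.parent = {}
        self.trail = []   # roots whose parent pointer was redirected

    def find(self, c):
        """Return the class representative; no path compression, so undo stays trivial."""
        self.parent.setdefault(c, c)
        while self.parent[c] != c:
            c = self.parent[c]
        return c

    def union(self, c, d):
        """Merge the classes of c and d, remembering how to undo it."""
        rc, rd = self.find(c), self.find(d)
        if rc != rd:
            self.parent[rc] = rd
            self.trail.append(rc)

    def mark(self):
        """Return a backtrack point for the current state."""
        return len(self.trail)

    def backtrack(self, mark):
        """Undo every union performed after the given backtrack point."""
        while len(self.trail) > mark:
            root = self.trail.pop()
            self.parent[root] = root
```

Taking a mark before a decision and calling `backtrack` with it afterwards restores exactly the equalities that held at the mark.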
In addition, Solver T has to assist DPLL(X) in identifying the backtrack point by providing an inconsistency explanation, that is, given a T-inconsistent set of literals M , it has to provide a small subset of M that is also T-inconsistent. As is well known, generating short explanations is a determinant factor in the performance of an SMT solver. In our case, an explanation basically describes the conditions of the transformation rule of Figure 1 that has been applied on a given (dis)equality array literal, together with the explanation of why the relevant indices and values of both sides of the array literal are there. For this reason, when any of the rules introducing new write's is applied, namely the Read2Write, the Write introduction or the Disequality witness introduction rule, we remember the literal that has generated it. It is crucial to make these explanations for the introduced write's as short as possible.

V. EXPERIMENTS
In order to evaluate the Solver T for arrays described in this paper, we implemented it on top of our BARCELOGIC [11] system. We performed experiments on a 2GHz 2GB Intel Core Duo with a time limit of 300 seconds, comparing our implementation with the four systems that competed at SMT-COMP 2007 (the annual SMT competition) in the QF_AUFLIA division, the only one involving quantifier-free formulas with arrays. These systems are: CVC3 1.2 [12], YICES 1.0 and YICES 1.0.10 [6], and Z3 0.1 [7]. We ran all systems on all available benchmarks in SMT-LIB [13], the largest existing library for SMT problems, discarding families of benchmarks consisting only of trivial problems. The remaining benchmark families were:
• array-benchs (25 benchs): a variety of verification conditions involving arithmetic and arrays.
• cvc (25 benchs): processor verification conditions involving arithmetic and arrays.
• qlock2 (52 benchs): unbounded version of the queue lock algorithm. All benchmarks result from parameterizing two single problems. They all contain arithmetic and arrays.
• storecomm (2030 benchs), storeinv (172 benchs) and swap (1368 benchs): benchmarks from the paper [14] encoding simple properties about arrays. They do not contain any arithmetic.
As we can see, most benchmarks involve both arrays and arithmetic, hence forcing us to implement some method for combining the Solver T for arrays with our solvers for arithmetic (both the difference logic one and the one for full linear arithmetic). It is important to note that BARCELOGIC does not implement any sophisticated combination technique. Unlike what is done in YICES or Z3, where interface equalities are created on the fly and sophisticated techniques are used to reduce their number, BARCELOGIC implements Delayed Theory Combination [15]. This is much simpler but, since we create all interface equalities upfront, it may significantly slow down the search in some cases.
The main reason for that is that, since our arithmetic solvers do not admit (dis)equalities, we have to add clauses, for each interface equality x = y, expressing that x = y ↔ x ≤ y ∧ x ≥ y. This problem is especially acute in the qlock2 family where, in some cases, up to sixty thousand interface equalities had to be created.
Results are presented in Figure 2, where the column labeled Total contains the number of seconds needed to process the whole family. We only count the time for the instances solved within 300 seconds and, if not all problems could be handled, we write in parentheses how many instances could be processed. The column Max gives the largest time in seconds needed by a single instance. From the table it can be seen that BARCELOGIC can solve any instance in less than 3 minutes and in fact only one benchmark takes more than one minute. Our system is in general much faster than CVC3, but slower than Z3 and YICES. However, apart from the qlock2 family, where the huge number of interface equalities greatly affects the search space, the difference with respect to Z3 and YICES is similar to the difference we obtain when we run the systems on formulas not involving arrays. Hence, the difference is probably not due to the array solver but rather to other factors such as heuristics and the worse performance of the arithmetic solvers in BARCELOGIC. In fact, for benchmarks containing a big array component, such as the families storecomm, storeinv and swap, our system is comparable, if not better, which shows that our array solver behaves very well in practice.
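The clauses added for one interface equality can be written out explicitly. The sketch below produces a standard three-clause CNF encoding of e ↔ (x ≤ y ∧ x ≥ y), where e stands for the atom x = y, and checks it by brute force over the three atoms. The literal representation and function names are assumptions, and the exact clause set generated by BARCELOGIC is not given in the text.

```python
from itertools import product


def interface_equality_clauses(x, y):
    """CNF for e <-> (le and ge); a literal is (atom, polarity)."""
    e, le, ge = f"{x}={y}", f"{x}<={y}", f"{x}>={y}"
    return [
        [(e, False), (le, True)],               # e implies x <= y
        [(e, False), (ge, True)],               # e implies x >= y
        [(e, True), (le, False), (ge, False)],  # both bounds imply e
    ]


def encodes_equivalence(x="x", y="y"):
    """Check that the clauses hold exactly when e <-> (le and ge) holds."""
    clauses = interface_equality_clauses(x, y)
    atoms = [f"{x}={y}", f"{x}<={y}", f"{x}>={y}"]
    for bits in product([False, True], repeat=3):
        model = dict(zip(atoms, bits))
        satisfied = all(any(model[a] == pol for a, pol in c) for c in clauses)
        e, le, ge = bits
        if satisfied != (e == (le and ge)):
            return False
    return True
```

Three clauses per interface equality give a sense of why tens of thousands of interface equalities, as in the qlock2 family, enlarge the clause set noticeably.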

VI. CONCLUSIONS
In this paper we have described a new theory solver for extensional arrays. As far as we know, this is the first accurate description of an array solver integrated in a state-of-the-art SMT solver and, unlike most state-of-the-art solvers, it is not based on a lazy instantiation of the array axioms. Moreover, our solver is very intuitive and easy to understand: after performing a careful analysis on which indices are relevant in each array, checking the satisfiability of conjunctions of literals becomes an easy task. We have proved soundness, completeness, and termination of our procedure and shown how it can be integrated to work in a DPLL(T) setting. Finally, we have presented experimental results showing that it performs very well in practice. We want to note that this approach smoothly extends to multidimensional arrays by expressing a position (i_1, . . . , i_n) in an n-dimensional array as f(i_1, . . . , i_n), where f is an uninterpreted function symbol.