Monday, May 2, 2016

Automatically solving regular expression crosswords

Automatically solving regular expression crosswords

Automatically solving regular expression crosswords

Copyright 2016 Antoine Trux

MIT crossword

1. Introduction

As part of its annual Mystery Hunt tradition, the MIT once published an interesting regular expression puzzle, reproduced above. Since then, similar crosswords of various flavors have been published, most notably at https://regexcrossword.com.

Here, I present one possible way to automatically solve such crosswords. A program which implements these ideas is also provided – see the last section.

In the following, regex will be used as an abbreviation of regular expression.

2. An incursion into the theory of regular languages and regular expressions

2.1 Alphabets, strings, languages

An alphabet is a finite, nonempty set of symbols. In the context of regex crosswords, the symbols are often capital letters of the English alphabet. Thus, the alphabet of a regex crossword could be, for example, { A, B, C, D }. Alphabets are usually denoted Σ.

A string over some alphabet is a finite sequence of symbols from the alphabet. A, CD, DADADB are examples of strings over alphabet { A, B, C, D }. The empty string is usually denoted ε.

Two strings over the same alphabet can be concatenated into a third string.

For example, the concatenation of AB and C is ABC.

The set of all strings over some alphabet Σ, including the empty string ε, is denoted Σ*.

For example, if Σ is { A, B }, Σ* is the infinite set { ε, A, B, AA, AB, BA, BB, AAA, ... }.

A subset of Σ* is called a language over Σ. A language can be finite or infinite.

For example, languages over { A, B } include:

  • ∅ – the empty language
  • { A, B } – that is, Σ itself if we identify each symbol of Σ with the string which consists solely of that symbol
  • { ε, A, AA, AAA, AAAA, ... } – the infinite language which consists of strings with any number of A's
  • Σ* – the universal language over Σ

2.2 Operations on languages

Since languages are sets, the union of two languages is defined in the same way as the union of two sets.

For example, if L1 = { A, AB } and L2 = { ε, A, BA }, the union of L1 and L2, L1L2, is equal to { ε, A, AB, BA }.

The concatenation of two languages L1 and L2, denoted L1L2, is the set of all string concatenations s1s2, where s1 is an element of L1 and s2 is an element of L2.

For example, if L1 = { A, AB } and L2 = { ε, BA }, L1L2 is equal to { A, ABA, AB, ABBA }.

The Kleene star of a language L, denoted L*, is the set of concatenations of zero or more strings of L (where the concatenation of zero strings is ε, and the concatenation of one string is the string itself).

For example, if L = { A, AB }, L* is equal to { ε, A, AB, AA, AAB, ABA, ABAB, ... }.

2.3 Regular languages

Let Σ be some alphabet. The regular languages over Σ are defined as follows:

  1. The empty language, ∅, is a regular language.
  2. For each element A of Σ, { A } is a regular language.
  3. For any regular languages L1 and L2, L1L1, L1L2 and L1* are regular languages.
  4. The only regular languages are those which can be deduced from axioms 1, 2 and 3.

2.4 Regular expressions which represent a regular language

Regular languages can be represented by regular expressions, defined as follows:

  1. ∅ is a regular expression which represents the empty language. ε is a regular expression which represents the language { ε }.
  2. For any symbol A in the alphabet, A is a regular expression which represents the language { A }.
  3. If R1 and R2 are regular expressions, R1 R2 is a regular expression which represents the union of the languages represented by R1 and R2.
  4. If R1 and R2 are regular expressions, R1R2 is a regular expression which represents the concatenation of the languages represented by R1 and R2.
  5. If R is a regular expression, R* is a regular expression which represents the Kleene star of the language represented by R.
  6. If R is a regular expression, (R) is a regular expression which represents the same language as R.
  7. The only regular expressions are those that can be deduced from axioms 1 to 6.

Kleene star has higher precedence than concatenation, which has higher precedence than union.

For example, A BC* represents the language { A, B, BC, BCC, BCCC, ... }.

Parentheses can be used to change precedence rules.

For example, (A B)C* represents the language { A, B, AC, BC, ACC, BCC, ACCC, BCCC, ... }.

2.5 Convenience extensions to regular expressions

The syntax of regular expressions described in the previous section has been extended for convenience purposes. Such extensions include:

extension meaning
+ one or several repetitions; for example, A+ is equivalent to AA*
? zero or one repetition; for example, A? is equivalent to ε A
{m} exactly m repetitions; for example, A{3} is equivalent to AAA
{m,n} between m and n repetitions; for example, A{2,4} is equivalent to AA AAA AAAA
{m,} m or more repetitions; for example, A{2,} is equivalent to AAA*
[...] a character class, that is, a set of characters; for example, [ABC] is equivalent to A B C, [0-3] (a character range in a character class) is equivalent to 0 1 2 3
[^...] a negated character class: any character, except the ones listed; for example, if the alphabet is { A, B, C }, [^A] is equivalent to B C
. any symbol of the alphabet; for example, if the alphabet is { A, B }, this is equivalent to A B

These syntactic extensions, as convenient as they may be, do not increase the expressing power of regular expressions. Any (regular) language which can be represented with a regular expression using these extensions can also be represented with the syntax described in the previous section.

For further information on the theoretical aspects of regular languages, and on regular expressions which represent regular languages, see any textbook on the theory of computation.

2.6 Extensions of another kind

The syntax of regular expressions has also been extended with backreferences. A backreference refers to a group which appears before the backreference.

Groups are indicated by parentheses, and numbered (starting at 1) in the order in which they are opened.

For example, A(B)(CD) contains these two groups:

  • group 1 = (B)
  • group 2 = (CD)

Groups can be nested.

For example, A(B(CD)) contains these two groups:

  • group 1 = (B(CD))
  • group 2 = (CD)

A backreference is indicated by a backslash followed by the number of the group to which it refers.

For example, in (A*)B\1, \1 refers to group 1, that is, to (A*).

A backreference matches the same value as the group to which it refers.

For example, (A*)B\1 represents the language { B, ABA, AABAA, AAABAAA, ... }. ABAA, for example, does not belong to that language, because there would be one A in the backreferenced group, but two A's in the backreference.

In general, a regular expression which contains backreferences does not represent a regular language. Such languages cannot be represented by the regular expressions described in the previous sections.

For example, it can be shown that the language represented by (A*)B\1 is not regular. Therefore, that language cannot be represented by the regular expressions described in the previous sections.

That is, backreferences are an extension of a very different kind from the ones mentioned in the previous section, which were mere notational conveniences.

To summarize, there are two different kinds of regular expressions:

  • regular expressions which represent regular languages,
  • regular expressions which do not represent regular languages.

The fact that the term regular expression is also used to represent nonregular languages is unfortunate, and a source of confusion.

3. The alphabet of a crossword

Let's consider the following simplistic crossword, which consists of only one regex:

A.

This regex means "A followed by any character". What are the solutions?

"Any character" could be A, or B, ..., or Z, or any punctuation character, or any of the tens of thousands of Chinese characters, or ...

In the regex crosswords I have seen, the convention seems to be that the alphabet consists of all the characters which are explicitly mentioned in the crossword regexes. This is the convention we will use.

In the above example, only A is explicitly mentioned, so the alphabet of that "crossword" is the singleton { A }, and the unique solution is AA.

4. Solving a crossword

The method presented here uses constraint propagation. For further information on this topic, see, for example, [1].

4.1 The example

This simple example, copied from https://regexcrossword.com/challenges/beginner/puzzles/1, will be used throughout this section:

              [^SPEAK]+
                 |   EP|IP|EF
                 |      |
                 v      v
               ------ ------
              |      |      |
HE|LL|O+  --> |      |      |
              |      |      |
               ------ ------
              |      |      |
[PLEASE]+ --> |      |      |
              |      |      |
               ------ ------

4.2 Determining the alphabet

First, we need to determine the alphabet of the crossword.

When going through the four regexes of the crossword, we can see that the set of characters explicitly mentioned is: { A, E, F, H, I, K, L, O, P, S }. This is the alphabet.

4.3 Initializing the grid constraints

Next, we initialize each cell of the grid with all the characters of the alphabet, since we cannot exclude a priori any character from any cell:

              [^SPEAK]+
                 |   EP|IP|EF
                 |      |
                 v      v
               ------ ------
              | AEFH | AEFH |
HE|LL|O+  --> | IKLO | IKLO |
              |  PS  |  PS  |
               ------ ------
              | AEFH | AEFH |
[PLEASE]+ --> | IKLO | IKLO |
              |  PS  |  PS  |
               ------ ------

At this initial stage, all the rows and columns hold the same constraint:

({ A, E, F, H, I, K, L, O, P, S }, { A, E, F, H, I, K, L, O, P, S })

which we will abbreviate as:

(AEFHIKLOPS, AEFHIKLOPS)

4.4 Constraining the top row

Now, we constrain the grid by using the provided regexes, one regex at a time: top row, bottom row, left column, right column.

The top row regex is HE LL O+. This regex yields the following values of length two:

(H, E)
(L, L)
(O, O)

Intersecting the value (H, E) with the current constraint of the top row, (AEFHIKLOPS, AEFHIKLOPS), yields the new constraint:

(H, E)

Intersecting the value (L, L) with the current constraint of the top row, (AEFHIKLOPS, AEFHIKLOPS), yields the new constraint:

(L, L)

Intersecting the value (O, O) with the current constraint of the top row, (AEFHIKLOPS, AEFHIKLOPS), yields the new constraint:

(O, O)

So, applying the regex to the current constraint results in three constraints:

(H, E)
(L, L)
(O, O)

All these constraints are possible a priori, so we take the union of them, yielding the new top row constraint: (HLO, ELO).

Here is the new grid of constraints:

              [^SPEAK]+
                 |   EP|IP|EF
                 |      |
                 v      v
               ------ ------
              |      |      |
HE|LL|O+  --> | HLO  | ELO  |
              |      |      |
               ------ ------
              | AEFH | AEFH |
[PLEASE]+ --> | IKLO | IKLO |
              |  PS  |  PS  |
               ------ ------

4.5 Constraining the bottom row

The bottom row regex is [PLEASE]+. This regex yields many values of length two:

(P, P)
(P, L)
(P, E)
...
(L, P)
(L, L)
(L, E)
...

These values can be conveniently summarized as: (AELPS, AELPS).

Intersecting (AELPS, AELPS) with the current constraint of the bottom row, (AEFHIKLOPS, AEFHIKLOPS), yields the new bottom row constraint:

(AELPS, AELPS)

The new grid of constraints:

              [^SPEAK]+
                 |  EP|IP|EF
                 |     |
                 v     v
               ----- -----
HE|LL|O+  --> | HLO | ELO |
              |     |     |
               ----- -----
[PLEASE]+ --> | AEL | AEL |
              |  PS |  PS |
               ----- -----

4.6 Constraining the left column

The left column regex is [^SPEAK]+.

Let's recall that ^ as the first character within a character class stands for negation, so [^SPEAK] means any character in the alphabet except A, E, K, P and S. Given the alphabet of this crossword, these letters are F, H, I, L and O.

This regex yields many values of length two:

(F, F)
(F, H)
(F, I)
...
(H, F)
(H, H)
(H, I)
...

which can be summarized as: (FHILO, FHILO).

Intersecting (FHILO, FHILO) with the current constraint of the left column, (HLO, AELPS), yields the new left column constraint:

(HLO, L)

The grid of constraints becomes:

              [^SPEAK]+
                 |  EP|IP|EF
                 |     |
                 v     v
               ----- -----
HE|LL|O+  --> | HLO | ELO |
              |     |     |
               ----- -----
[PLEASE]+ --> |  L  | AEL |
              |     |  PS |
               ----- -----

4.7 Constraining the right column

The right column regex is EP IP EF.

This regex yields three values, all of length two:

(E, P)
(I, P)
(E, F)

Intersecting the value (E, P) with the current constraint of the right column, (ELO, AELPS), yields the new constraint:

(E, P)

Intersecting the value (I, P) with the current constraint of the right column, (ELO, AELPS), yields the new constraint:

(∅, P)
where ∅ is the empty set

This constraint has one of its elements equal to the empty set. Such a constraint is said to be impossible, since no string can satisfy it – no string of length two has a first character which belongs to the empty set.

Intersecting the value (E, F) with the current constraint of the right column, (ELO, AELPS), yields the new constraint:

(E, ∅)

Again, this constraint is impossible.

So, applying the regex to its current constraint results in three constraints, two of which are impossible. The one possible constraint, (E, P), is the new right column constraint.

The new grid of constraints:

              [^SPEAK]+
                 |  EP|IP|EF
                 |     |
                 v     v
               ----- -----
HE|LL|O+  --> | HLO |  E  |
              |     |     |
               ----- -----
[PLEASE]+ --> |  L  |  P  |
              |     |     |
               ----- -----

We are almost there. Only one cell contains several possible characters.

4.8 Constraining the top row (second pass)

We proceed as in section 4.4, but this time with a tighter constraint: (HLO, E) instead of (AEFHIKLOPS, AEFHIKLOPS).

When we intersect the values of HE LL O+ of length two with the top row constraint, we get:

(H, E) ∩ (HLO, E) = (H, E)
(L, L) ∩ (HLO, E) = (L, ∅)
(O, O) ∩ (HLO, E) = (O, ∅)

So, applying the regex to its current constraint results in three constraints, two of which are impossible. The one possible constraint, (H, E), is the new top row constraint.

The new grid of constraints:

              [^SPEAK]+
                 |  EP|IP|EF
                 |     |
                 v     v
               ----- -----
HE|LL|O+  --> |  H  |  E  |
              |     |     |
               ----- -----
[PLEASE]+ --> |  L  |  P  |
              |     |     |
               ----- -----

4.9 One more round

It looks as if our crossword is solved by now. However, this is not necessarily the case. We don't know a priori whether one of the last changes we made to the grid of constraints introduced an impossibility.

For example, when we constrained the right column in section 4.7, the bottom right cell was constrained from AELPS to P. But how do we know that this new constraint is compatible with the bottom row regex?

We answer this question by going once more through the regexes. Once we have gone through all the regexes without modifying the constraints, we know we are done.

In this case, as you can easily check, the next round does not change the constraints – no impossibility is detected. The crossword is solved, and it has a unique solution.

5. Searching

5.2 Searching may be needed even for unambiguous crosswords

Searching is required for ambiguous crosswords (that is, crosswords which have more than one solution), but sometimes also for unambiguous ones.

Consider for example this crossword:

           AA|BB
              |  AB|BA|AA
              |     |
              v     v
            ----- -----
AB|BA  --> |     |     |
           |     |     |
            ----- -----
AB|BA -->  |     |     |
           |     |     |
            ----- -----

The alphabet is { A, B }, so we initialize the grid to:

           AA|BB
              |  AB|BA|AA
              |     |
              v     v
            ----- -----
AB|BA  --> | AB  | AB  |
           |     |     |
            ----- -----
AB|BA -->  | AB  | AB  |
           |     |     |
            ----- -----

After going through the first round of constraints, we get:

           AA|BB
              |  AB|BA|AA
              |     |
              v     v
            ----- -----
AB|BA  --> | AB  | AB  |
           |     |     |
            ----- -----
AB|BA -->  | AB  | AB  |
           |     |     |
            ----- -----

No progress was made. The grid is still unsolved, so we resort to searching.

If we now choose A when searching the top-left cell, we arrive at an impossibility. If we select B for that cell, we arrive at a solution:

           AA|BB
              |  AB|BA|AA
              |     |
              v     v
            ----- -----
AB|BA  --> |  B  |  A  |
           |     |     |
            ----- -----
AB|BA -->  |  B  |  A  |
           |     |     |
            ----- -----

This is the unique solution – the crossword is unambiguous, yet searching was required.

5.3 Summary of the possible outcomes

To summarize, when attempting to solve a regex crossword, we may find that:

  • the crossword has no solutions,
  • the crossword has a unique solution,
  • the crossword has several solutions.

If the crossword has a unique solution, searching may or may not be required. If the crossword has several solutions, searching is required.

6. An alternative approach

It can be shown that any regular language can be represented by a deterministic finite automaton (DFA), and that any DFA represents a regular language.

Since DFAs are simple and efficient objects, we could think of using them in order to implement the constraining process.

However, as noted earlier, backreferences extend the expressive power of regular expressions in such a way that a regular expression with backreferences does not, in general, represent a regular language. Thus, there is no DFA which would correspond to such a regular expression.

In general.

But consider, for example, this regular expression: (A|B)C\1. Here, \1 refers to (A B). This group can take on one of two values, A and B. Thus, we can rewrite the regular expression as: ACA BCB. So, this regular expression does represent a regular language.

More generally, if the backreferences of a regular expression all refer to finite languages, the regular expression represents a regular language (this is actually the case for all the MIT puzzle's regexes with backreferences). Therefore, for such cases, an approach based on DFAs would be possible in principle.

7. Optimizations

Enumerating the possible values of a regex is at the heart of the algorithm described above. It is where the program spends most of its time, and is thus worth optimizing.

Here below are three possible optimizations, which precisely aim at speeding up enumeration.

7.1 Group optimizations

Groups are sometimes used in conjunction with backreferences, but sometimes only for syntactic purposes.

For example, AB C means (A concatenated with B) or C, because concatenation takes precedence over the operator. If we mean A concatenated with (B or C), we write A(B C).

Once a regex has been parsed, groups which are not backreferenced can be simply optimized away. This makes the enumeration process lighter, and may enable further optimizations (described in subsequent sections).

For example, the regex A(B C) would be parsed and simplified as follows:
   .                 .
  / \               / \
 /   \             /   \
A    ( )          A     |
      |                / \
      |      =>       /   \
      |              B     C
     / \
    /   \
   B     C

7.2 Union optimizations

When both sides of a union represent a subregex of length one (for example, a single character or a character class), they can be unified into a character class.

As a simple example, the regex A B is optimized into [AB]. With the original regex (A B), two values are enumerated: A and B. With the optimized regex ([AB]), only one value is enumerated: [AB].

As another example, the regex A BC [DE] would be parsed and simplified as follows:

   |                  |                        |                   |
  / \                / \                      / \                 / \
 /   \              /   \                    /   \               /   \
A     |            A     |                  /     \            [ADE]  .
     / \                / \                /       \                 / \
    /   \     =>       /   \       =>     |         .      =>       /   \
   .   [DE]   1.     [DE]   .      2.    / \       / \     3.      B     C
  / \                      / \          /   \     /   \
 /   \                    /   \        A   [DE]  B     C
B     C                  B     C

Here, the optimization consists of three steps:

  1. reorganize the parse tree by swapping BC and [DE] – allowed, because union is commutative,
  2. left-rotate the parse tree at the root level – allowed, because union is associative,
  3. the left union of the parse tree is unified into a character class.

7.3 Concatenation optimizations

This optimization transforms simple concatenations into "string" subregexes.

For example, regex [AB]C would be parsed and simplified as follows:
    .
   / \    =>  "[AB]C"
  /   \
[AB]   C
where "[AB]C" is a regex of length two which has only one possible value: [AB]C.

7.4 Speedups obtained from optimizations

In my implementation, the optimizations described above speed up the program by an amount which depends a lot on the characteristics of the crossword.

For the MIT puzzle, optimizations speed up the program by 6.7%.

Sometimes, the benefits are much more spectacular. The highest benefits I have found are for hexagonal_4: with optimizations enabled, the program finds the solution 38 times faster than without optimizations.

For some puzzles, optimizations are counterproductive: the program finds the solution(s) slightly faster without than with optimizations. This happens when optimizations bring no or little benefit (no groups can be eliminated, etc.). In those cases, the program wastes some time trying to optimize, but the "optimized" parse tree is identical or similar to the original one. The worst such case I have found is volapuk_3, where optimizations slow down the program by 12.9%. Such crosswords tend to be small and fast to solve anyway.

All the speeds reported in this section were measured in Linux, with Callgrind.

8. The program

I have implemented the above ideas in this program.

Highlights:

  • Supports grids of regular hexagonal shape, or of rectangular shape.
  • Supports one or several regexes per line.
  • With the --log option, one can follow the steps taken by the program, and visualize the intermediate results after each line constraint, with diagrams similar to the ones in section 4. This option also reveals whether the program had to resort to searching.
  • Can find whether a crossword is ambiguous or not. If a crossword is ambiguous, can find all its solutions.

  • Unsupported regular expression constructs include:

    • \u and \U (Unicode)
    • (?...
    • however, the following constructs are supported:
      • (?: – non-capturing group (supported since version 1.1.0)
      • (?= – positive lookahead (supported since version 1.1.0)
  • Many of the crosswords of https://regexcrossword.com can be solved. The ones which cannot be solved include:

    • grids of irregular hexagonal shape (sides of unequal lengths),
    • grids which have regexes with non ASCII characters,
    • grids which use unsupported regex constructs – see above for examples of such constructs.
  • No binaries are provided, only source code.

  • Written in Standard C++ (C++11).

  • Self-contained – no need to download and install external facilities.

  • Reasonably fast – for example, the MIT puzzle is solved in about 25 milliseconds on my PC.

  • Built and tested in the following environments:

    • Linux (64-bit Kubuntu): gcc 4.7.4, gcc 5.2.1, clang 3.6.2
    • Cygwin (32-bit): gcc 4.9.3, clang 3.5.2
    • Windows 7 (64-bit): Visual Studio Community 2015 (Update 2)

    In principle, any compiler which complies with the C++11 Standard should also work.

  • Tested with unit tests, black-box tests (grid tests) and fuzz testing – all tests are provided. Together, unit tests and black-box tests have 98% statement coverage, as reported by gcov.

  • Provided with an MIT-like license – what else?

For further information, see README.txt.

I chose to implement the program in C++, because I originally thought performance would be an issue. However, given the speeds achieved even for complex crosswords, a language such as Python, or a language which supports constraint programming, could well have been envisaged.

References

[1] Stuart Russell and Peter Norvig, Artificial Intelligence, A Modern Approach, 3rd edition (Prentice-Hall, 2010), chapter 6.