%%%%%plain TeX file of gj.tex a paper by J. Noonan and D. Zeilberger%%%
%begin macros
\def\C{{\cal C}}
\def\L{{\cal L}}
\def\M{{\cal M}}
\baselineskip=14pt
\parskip=10pt
\def\Tilde{\char126\relax}
\def\halmos{\hbox{\vrule height0.15cm width0.01cm\vbox{\hrule height
0.01cm width0.2cm \vskip0.15cm \hrule height 0.01cm width0.2cm}\vrule
height0.15cm width 0.01cm}}
\font\eightrm=cmr8 \font\sixrm=cmr6
\font\eighttt=cmtt8
\magnification=\magstephalf
\def\lheading#1{\bigbreak \noindent {\bf#1}
}
\parindent=0pt
\overfullrule=0in
%end macros
\bf
\centerline{The Goulden-Jackson Cluster Method: Extensions, Applications
and Implementations}
\medskip
\it
\centerline{John NOONAN$^1$ and Doron ZEILBERGER\footnote{$^1$}
{\baselineskip=9pt
\eightrm \raggedright
Department of Mathematics, Temple University,
Philadelphia, PA 19122.
{\baselineskip=9pt
\eighttt [noonan,zeilberg]@math.temple.edu ;
\break http://www.math.temple.edu/\Tilde
[noonan,zeilberg]. }
The work of the second author was
supported in part by the NSF. May 2, 1997.
}}
\medskip
\rm
{\bf Abstract:} The powerful
(and so far under-utilized) Goulden-Jackson Cluster method for finding
the generating function for the number of words avoiding, as factors,
the members of
a prescribed set of `dirty words', is tutorialized and extended in
various directions.
The authors' Maple implementations, contained in several
Maple packages available from this paper's website
{\tt http://www.math.temple.edu/\Tilde zeilberg/gj.html},
are described and explained.
{\bf Preface}
In New York City there is a hotel called ESSEX. Once in a while
the bulbs of the first two letters of its neon sign go out,
resulting in the wrong message. This motivates the following
problem. Given a finite alphabet, and a finite set (lexicon) of
`bad words',
find the number of
$n$-lettered words in the alphabet that avoid as {\it factors}
(i.e. strings of consecutive letters) any of the dirty words.
More generally, count the number of such words with a prescribed
number of occurrences of obscenities (the previous case being
$0$ bad words), and even more generally, count how many
words are there with a prescribed number of occurrences of
each letter of the alphabet, and a prescribed number of occurrences
of each of the bad words.
Many problems in combinatorics, probability, statistics, computer science,
engineering, and the natural and social sciences, are special cases of, or
can be formulated in terms of, the above scenario.
It is a rather well-kept secret that there exists a powerful
method, the Goulden-Jackson Cluster method[GoJ1][GoJ2], to tackle it.
In this paper we start with a motivated and
accessible account
of the method, and then we generalize it in various directions.
Most importantly, we describe our Maple
implementations of both the
original method and of our various extensions. These packages
are obtainable, free of charge, from this paper's very own
website
{\tt http://www.math.temple.edu/\Tilde zeilberg/gj.html }.
The Goulden-Jackson Cluster method is very similar, and in some
sense, a generalization of, the method of Guibas and Odlyzko[GuiO],
whose main motivation was Penney-ante games.
However, philosophically, psychologically, and conceptually, the
Goulden-Jackson and Guibas-Odlyzko
methods are quite distinct, and we find that the former
is more suitable for our purposes.
{\bf The Naive Approach}
Before describing the Cluster method, let's review the naive approach.
First, some notation.
Given a word $w=w_1 , \dots , w_n$, a {\it factor} (burrowing the
term from the theory of formal languages) is any of the ${{n+1} \choose {2}}$
words $w_i w_{i+1} \dots w_{j-1} w_j$, for $1 \leq i \leq j \leq n$.
For example the factors of {\it JOHN} are {\it J, O, H, N, JO, OH, HN,
JOH,OHN, JOHN}, while the factors of {\it DORON} are
{\it D, O, R, O, N, DO, OR, RO, ON, DOR, ORO, RON, DORO, ORON, DORON}.
Note that a given word may occur several times as a factor, for example
the one-letter word {\it O} in {\it DORON}, or the two-letter
words {\it CA} and {\it TI}
in {\it TITICACA}. Also as in formal languages, given
an alphabet $V$, we will denote
the set of all possible words in $V$ by $V^*$.
Consider a finite alphabet $V$ with $d$ letters, and
suppose that we want to keep
track of all factors of length $\leq R+1$, including individual
letters. For every word $w$, of length $\leq R+1$, introduce
a variable $x[w]$. All the $x[w]$ commute with each other.
Define a weight on words $w=w_1, \dots , w_n$, by:
$$
Weight(w)=\prod_{r=1}^{R+1} \prod_{i=1}^{n-r+1} x[w_i, \dots , w_{i+r-1}]
\quad
.
$$
For example, if $R=2$, then
$Weight(SEXY)=x[S]x[E]x[X]x[Y]x[SE]x[EX]x[XY]x[SEX]x[EXY]$.
The weight of a set of words (`language') $\L$, $Weight(\L)$, is
defined as the sum of the weights of all the words
belonging to that language. Also, given a language
$\L$ and a letter $v$, we will denote by $\L v$ the set of words obtained
from $\L$ by appending $v$ at the end of each of the words of
$\L$. Thus if $\L=\{SEX, LOON\}$, and $R=1$, then
$Weight(\L)=x[S]x[E]x[X]x[SE]x[EX]+x[L]x[O]^2x[N]x[LO]x[OO]x[ON]$,
and $\L Y=\{SEXY, LOONY\}$.
The {\it generating function}
$$
\Phi_R:=\sum_{w \in V^*} Weight(w) \quad,
$$
stores all the information about the number of words with a prescribed
number of factors of length $\leq R+1$. So, the number of words in
$V^{*}$ that have exactly $n_u$ factors that are $u$
for each $u \in V^*$ of $length(u)\leq R+1$, is the coefficient
in $\Phi_R$ of the monomial $\prod x[u]^{n_u}$,
where the product extends over the set $\{u \in V^*, length(u)\leq R+1 \} $.
If we want the generating function for the number of words
with a prescribed number of bad words and a prescribed number of letters,
we may first compute $\Phi_R$, (where $R+1$ is the maximum length of
a bad word),
and then set $x[v]=s$ for each letter
$v \in V$, and $x[w]=t$, if $w$ is a bad word, and $x[w]=1$ otherwise.
The coefficient of $s^n t^m$ in the resulting generating function would
be the number of $n$-letter words with exactly $m$ instances of
bad words occurring as factors.
If we want the generating function for words with no occurrences of
dirty words as factors, we set $t=0$.
How to compute $\Phi_R$? For each word $v \in V^*$, of length
$R$, let $Sof[v]$ be the subset of $V^*$ of words that ends with $v$.
Write $v=v_1, \dots, v_R$.
Every word in $Sof[v]$ is either $v$ itself or of length $> R$,
in which case
chopping the last letter results in an element of $Sof[u]$, for one
of the $d$ $u$'s of the form $i, v_1, \dots , v_{R-1}$.
In symbols
$$
Sof[v]=\{v\} \bigcup_{i \in V} Sof[i,v_1, \dots , v_{R-1}] v_R \quad.
\eqno(SetEq)
$$
Since, for any word $w=w_1, \dots, w_n \in V^*$, of length $>R$,
$$
Weight(w_1, \dots , w_n)=Weight(w_1, \dots , w_{n-1}) \cdot
\prod_{r=1}^{R+1}
x[w_{n-r+1}, \dots, w_n] \quad ,
$$
the system of set equations $(SetEq)$ translates to the linear system
of (algebraic) equations
$$
Weight(Sof[v])=
Weight(v)+ \left ( \prod_{r=1}^{R}
x[v_{R-r+1}, \dots, v_R] \right )
\sum_{i \in V}
x[i,v_1, \dots, v_R]
Weight(Sof[i,v_1, \dots , v_{R-1}]) \quad.
\eqno(Linear\_Algebra\_Eq)
$$
We have a system of $d^R$ linear equations for $d^R$ unknowns
$Weight(Sof[w])$, $w \in V^{*}, length(w)=R$,
that obviously has a unique solution (on combinatorial grounds!).
Since the coefficients are polynomials
(in fact monomials) in the variables
$x[w]$, $w \in V^{*}, length(w) \leq R+1$, the solutions
$Weight(Sof[v])$ must be rational functions in these variables.
After solving the system, we get $\Phi_R$ from
$$
\Phi_R= \sum_{w \in V^* \,, \, length(w)0$.\cr}
\eqno(ii)
$$
and for any finite set $A$,
$$
\prod_{a \in A} 0 = \prod_{a \in A} (1+(-1))=
\sum_{S \subset A} (-1)^{|S|} , \quad
$$
where as usual, $|S|$ denotes the cardinality of $S$.
We now have,
$$
f(s)= \sum_{w \in V^*} weight(w) 0^{\,[\,number \,\, of \,\,factors \,\,
of \,\,w \,\, that \,\,belong \,\,to\,\, B\,]}=
$$
$$
\sum_{w \in V^*} weight(w) (1+(-1))
^{\,[\,number \,\, of \,\,factors \,\,
of \,\,w \,\, that \,\,belong \,\,to\,\, B\,]}=
$$
$$
\sum_{w \in V^*}
\sum_{S \subset Bad(w)} (-1)^{|S|} s^{length(w)} \quad ,
$$
where $Bad(w)$ is the set of factors of $w$ that belong to
$B$. For example if $B=\{ SEX, EXE,XES \}$ and
$w=SEXES$, then $Bad(w)$ consists of the factors $SEX$
(occupying the first three letters), $EXE$, (occupying
letters 2,3,4), and $XES$ (occupying the last three letters).
So the desired generating function is also the weight-enumerator
of the much larger set consisting of pairs $(w,S)$, where
$S \subset Bad(w)$, and {\it now} the weight is defined by
$weight(w,S)=(-1)^{|S|}s^{length(w)}$. Surprisingly, it is
much easier to (weight-)count. We may think of them as
`marked words', where $S$ denotes the subset
consisting of those words that the censor, or teacher, was able to detect.
First, we need a convenient data-structure for these weird objects.
Any word $w$, of length $n$, $w=w_1 \dots w_n$, has
${{n+1} \choose {2}}$ factors $w_i, \dots , w_j$,
which we will denote by $[i,j]$.
$1 \leq i \leq j \leq n$. Hence any marked word may
be
represented by $(w; [i_1,j_1], [i_2,j_2], \dots , [i_l, j_l])$,
where $w_{i_r} w_{i_r+1} \dots w_{j_r-1} w_{j_r} \in B$,
for $r=1, \dots , l$, and we make it canonical by
ordering the $j_r$, i.e. we arrange the marked factors
such that $j_1