Suffix automaton
Deterministic finite automaton accepting set of all suffixes of particular string / From Wikipedia, the free encyclopedia
Dear Wikiwand AI, let's keep it short by simply answering these key questions:
Can you list the top facts and stats about Suffix automaton?
Summarize this article for a 10 year old
In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.
Suffix automaton | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Type | Substring index | ||||||||||||||
Invented | 1983 | ||||||||||||||
Invented by | Anselm Blumer; Janet Blumer; Andrzej Ehrenfeucht; David Haussler; Ross McConnell | ||||||||||||||
|
In terms of automata theory, a suffix automaton is the minimal partial deterministic finite automaton that recognizes the set of suffixes of a given string . The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton.
Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string having length at least two characters has at most states and at most transitions. Further works have shown a close connection between suffix automata and suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc.
Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings.
The concept of suffix automaton was introduced in 1983[1] by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner,[2] Vaughan Pratt[3] and Anatol Slissenko.[4] In their initial work, Blumer et al. showed a suffix automaton built for the string of length greater than has at most states and at most transitions, and suggested a linear algorithm for automaton construction.[5]
In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm[2] while building a suffix tree of the string constructs a suffix automaton of the reversed string as an auxiliary structure.[6] In 1987, Blumer et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG).[7] In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction.[1] In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie.[8]
Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular:[9]
- "Alphabet" is a finite set that is used to construct words. Its elements are called "characters";
- "Word" is a finite sequence of characters . "Length" of the word is denoted as ;
- "Formal language" is a set of words over given alphabet;
- "Language of all words" is denoted as (where the "*" character stands for Kleene star), "empty word" (the word of zero length) is denoted by the character ;
- "Concatenation of words" and is denoted as or and corresponds to the word obtained by writing to the right of , that is, ;
- "Concatenation of languages" and is denoted as or and corresponds to the set of pairwise concatenations :\alpha \in A,\beta \in B\}} ;
- If the word may be represented as , where , then words , and are called "prefix", "suffix" and "subword" (substring) of the word correspondingly;
- If and (with ) then is said to "occur" in as a subword. Here and are called left and right positions of occurrence of in correspondingly.