+
+ +
+

Collation Algorithm

+

The library uses an enhancement of the Needleman-Wunsch algorithm by Gotoh [Gotoh1982]. +This section provides a very high level overview of the algorithm.

+
+

Phase 1 - Build Table

+

In phase 1 the algorithm builds a table. For example this is the table built for the +two strings: the quick brown fox jumps over the lazy dog and sick fox is crazy.

+ + + + + + + + + + + +
sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96
+ +

Every cell in the table contains three values: \(D\), \(P\), and \(Q\), and an arrow, like this:

+ + + +
+ + + +
DP
Q
+

We define the score \(S\) for each cell as:

+
+\[S = \max(D, P, Q)\]
+

The grayed cells in the first row and first column are initialized using the gap start +and gap extension penalties. The numbers for each remaining cell are calculated using +only values from the three cells, to the top-left, the top, and the left, of the current +cell:

+
+\[D = S_↖ + \mbox{similarity}(word_←, word_↑)\]
+
+\[P = \max(S_↑ + openingpenalty, P_↑ + extensionpenalty)\]
+
+\[Q = \max(S_← + openingpenalty, Q_← + extensionpenalty)\]
+

Finally the arrow in the current cell is set to point to that cell which yielded the +highest of the current cell’s \(D\), \(P\), and \(Q\) values.

+
+
+

Phase 2 - Backtrack

+

When the table is thus completed, two empty sequences are created. Then the algorithm +starts backtracking from the last (bottom-right) cell following the arrows until it +reaches the first (top-left) cell. If the arrow points:

+
+

the word in the row header is added to the first sequence, a hyphen is added to the +second sequence,

+
+

the word in the row header is added to the first sequence, the word in the column +header is added to the second sequence,

+
+

a hyphen is added to the first sequence, the word in the column header is added to the +second sequence.

+
+
+ + + + + + + + + + + +
sickfoxiscrazy
0.000.00
0.00
-0.50-0.50
-0.50
-0.70-0.70
-0.70
-0.90-0.90
-0.90
-1.10-1.10
-1.10
the
-0.50-0.50
-0.50
0.00-0.70
-0.70
-0.50-0.90
-0.50
-0.70-1.10
-0.70
-0.90-1.30
-0.90
quick
-0.70-0.70
-0.70
-0.04-0.50
-0.90
0.00-1.00
-0.54
-0.50-1.20
-0.50
-0.70-1.40
-0.70
brown
-0.90-0.90
-0.90
-0.70-0.54
-1.10
-0.04-0.50
-1.04
0.00-1.00
-0.54
-0.50-1.20
-0.50
fox
-1.10-1.10
-1.10
-0.90-0.74
-1.30
0.46-0.54
-1.24
-0.04-0.50
-0.04
0.00-1.00
-0.24
jumps
-1.30-1.30
-1.30
-1.10-0.94
-1.50
-0.74-0.04
-1.44
0.64-0.54
-0.54
-0.04-0.50
0.14
over
-1.50-1.50
-1.50
-1.30-1.14
-1.70
-0.94-0.24
-1.64
-0.040.14
-0.74
0.64-0.36
-0.36
the
-1.70-1.70
-1.70
-1.50-1.34
-1.90
-1.14-0.44
-1.84
-0.24-0.06
-0.94
0.140.14
-0.56
lazy
-1.90-1.90
-1.90
-1.70-1.54
-2.10
-1.34-0.64
-2.04
-0.44-0.26
-1.14
0.40-0.06
-0.76
dog
-2.10-2.10
-2.10
-1.90-1.74
-2.30
-1.54-0.84
-2.24
-0.64-0.46
-1.34
-0.26-0.10
-0.96
+ +

Finally the two sequences are reversed and printed.

+ + +
thequickbrownfoxjumpsoverthelazydog
-sick-foxis--crazy-
+ +
+
+

Parameters

+

The algorithm can be customized by setting:

+
    +
  • a word comparison (similarity) function,

  • +
  • the starting gap penalty,

  • +
  • the gap opening penalty,

  • +
  • and the gap extension penalty.

  • +
+
+
+ + +
+