最佳字符串匹配算法(Damerau-Levenshtein距离算法)的Java实现

标签: 字符串 匹配 算法 | 发表时间:2013-11-11 10:13 | 作者:datamachine
出处:http://www.iteye.com

原文: http://www.javacodegeeks.com/2013/11/java-implementation-of-optimal-string-alignment.html
---------------------------------------------------------------------------------------------------------------------------------

For a while, I’ve used the Apache Commons lang StringUtils implementation of Levenshtein distance.  It implements a few well known tricks to use less memory by only hanging on to two arrays instead of allocating a huge n x m table for the memoisation table.  It also only checks a “stripe” of width 2 * k +1 where k is the maximum number of edits.

 

In most practical usages of levenshtein you just care if a string is within some small number (1, 2, 3) of edits from another string.  This avoid much of the n * m computation that makes levenstein “expensive”.  We found that with a k <= 3, levenshtein with these tricks was faster than  Jaro-Winkler distance, which is an approximate edit distance calculation that was created to be a faster approximate (well there were many reasons).

 

Unfortunately, the Apache Commons Lang implementation only calculates Levenshtein and not the possible more useful  Damerau-Levenshtein distance.  Levenshtein defines the edit operations insert, delete, and substitute.  The Damerau variant adds *transposition* to the list, which is pretty useful for most of the places I use edit distance.  Unfortunately DL distance is not a true metric in that it doesn’t respect the triangle inequality, but there are plenty of applications that are unaffected by this.  As you can see from that wikipedia page, there is often confusion between Optimal String Alignment and DL distance.  In practice OSA is a simpler algorithm and requires less book-keeping so the runtime is probably marginally faster.

 

I could not find any implementations of OSA or DL that used the memory tricks and “stripe” tricks that I saw in Apache Commons Lang.  So I implemented my own OSA using those tricks.  At some point I’ll also implement DL with the tricks and see what the performance differences are:

 

Here’s OSA in Java.  It’s public domain; feel free to use as you like. The unit tests are below. Only dependency is on Guava- but its just the preconditions class and an annotation for documentation so easy to remove that dependency if you like:

 

package com.github.steveash.util;

import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;
import static com.google.common.primitives.Shorts.checkedCast;
import static java.lang.Math.abs;
import static java.lang.Math.max;

import java.util.Arrays;

import com.google.common.annotations.VisibleForTesting;

/**
 * Implementation of the OSA which is similar to the Damerau-Levenshtein in that it allows for transpositions to
 * count as a single edit distance, but is not a true metric and can over-estimate the cost because it disallows
 * substrings to edited more than once.  See wikipedia for more discussion on OSA vs DL
 * <p/>
 * See Algorithms on Strings, Trees and Sequences by Dan Gusfield for more information.
 * <p/>
 * This also has a set of local buffer implementations to avoid allocating new buffers each time, which might be
 * a premature optimization
 * <p/>
 * @author Steve Ash
 */
public class OptimalStringAlignment {

    private static final int threadLocalBufferSize = 64;

    private static final ThreadLocal<short[]> costLocal = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    private static final ThreadLocal<short[]> back1Local = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    private static final ThreadLocal<short[]> back2Local = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    public static int editDistance(CharSequence s, CharSequence t, int threshold) {
        checkNotNull(s, "cannot measure null strings");
        checkNotNull(t, "cannot measure null strings");
        checkArgument(threshold >= 0, "Threshold must not be negative");
        checkArgument(s.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");
        checkArgument(t.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");

        if (s.length() + 1 > threadLocalBufferSize || t.length() + 1 > threadLocalBufferSize)
            return editDistanceWithNewBuffers(s, t, checkedCast(threshold));

        short[] cost = costLocal.get();
        short[] back1 = back1Local.get();
        short[] back2 = back2Local.get();
        return editDistanceWithBuffers(s, t, checkedCast(threshold), back2, back1, cost);
    }

    @VisibleForTesting
    static int editDistanceWithNewBuffers(CharSequence s, CharSequence t, short threshold) {
        int slen = s.length();
        short[] back1 = new short[slen + 1];    // "up 1" row in table
        short[] back2 = new short[slen + 1];    // "up 2" row in table
        short[] cost = new short[slen + 1];     // "current cost"

        return editDistanceWithBuffers(s, t, threshold, back2, back1, cost);
    }

    private static int editDistanceWithBuffers(CharSequence s, CharSequence t, short threshold,
            short[] back2, short[] back1, short[] cost) {

        short slen = (short) s.length();
        short tlen = (short) t.length();

        // if one string is empty, the edit distance is necessarily the length of the other
        if (slen == 0) {
            return tlen <= threshold ? tlen : -1;
        } else if (tlen == 0) {
            return slen <= threshold ? slen : -1;
        }

        // if lengths are different > k, then can't be within edit distance
        if (abs(slen - tlen) > threshold)
            return -1;

        if (slen > tlen) {
            // swap the two strings to consume less memory
            CharSequence tmp = s;
            s = t;
            t = tmp;
            slen = tlen;
            tlen = (short) t.length();
        }

        initMemoiseTables(threshold, back2, back1, cost, slen);

        for (short j = 1; j <= tlen; j++) {
            cost[0] = j; // j is the cost of inserting this many characters

            // stripe bounds
            int min = max(1, j - threshold);
            int max = min(slen, (short) (j + threshold));

            // at this iteration the left most entry is "too much" so reset it
            if (min > 1) {
                cost[min - 1] = Short.MAX_VALUE;
            }

            iterateOverStripe(s, t, j, cost, back1, back2, min, max);

            // swap our cost arrays to move on to the next "row"
            short[] tempCost = back2;
            back2 = back1;
            back1 = cost;
            cost = tempCost;
        }

        // after exit, the current cost is in back1
        // if back1[slen] > k then we exceeded, so return -1
        if (back1[slen] > threshold) {
            return -1;
        }
        return back1[slen];
    }

    private static void iterateOverStripe(CharSequence s, CharSequence t, short j,
            short[] cost, short[] back1, short[] back2, int min, int max) {

        // iterates over the stripe
        for (int i = min; i <= max; i++) {

            if (s.charAt(i - 1) == t.charAt(j - 1)) {
                cost[i] = back1[i - 1];
            } else {
                cost[i] = (short) (1 + min(cost[i - 1], back1[i], back1[i - 1]));
            }
            if (i >= 2 && j >= 2) {
                // possible transposition to check for
                if ((s.charAt(i - 2) == t.charAt(j - 1)) &&
                        s.charAt(i - 1) == t.charAt(j - 2)) {
                    cost[i] = min(cost[i], (short) (back2[i - 2] + 1));
                }
            }
        }
    }

    private static void initMemoiseTables(short threshold, short[] back2, short[] back1, short[] cost, short slen) {
        // initial "starting" values for inserting all the letters
        short boundary = (short) (min(slen, threshold) + 1);
        for (short i = 0; i < boundary; i++) {
            back1[i] = i;
            back2[i] = i;
        }
        // need to make sure that we don't read a default value when looking "up"
        Arrays.fill(back1, boundary, slen + 1, Short.MAX_VALUE);
        Arrays.fill(back2, boundary, slen + 1, Short.MAX_VALUE);
        Arrays.fill(cost, 0, slen + 1, Short.MAX_VALUE);
    }

    private static short min(short a, short b) {
        return (a <= b ? a : b);
    }

    private static short min(short a, short b, short c) {
        return min(a, min(b, c));
    }
}






 



已有 0 人发表留言,猛击->> 这里<<-参与讨论


ITeye推荐



相关 [字符串 匹配 算法] 推荐:

字符串匹配的Boyer-Moore算法

- - 阮一峰的网络日志
上一篇文章,我介绍了 KMP算法. 但是,它并不是效率最高的算法,实际采用并不多. 各种文本编辑器的"查找"功能(Ctrl+F),大多采用 Boyer-Moore算法. Boyer-Moore算法不仅效率高,而且构思巧妙,容易理解. 1977年,德克萨斯大学的Robert S. Boyer教授和J Strother Moore教授发明了这种算法.

字符串匹配的KMP算法

- - 博客园_知识库
   字符串匹配是计算机的基本任务之一.   举例来说,有一个字符串"BBC ABCDAB ABCDABCDABDE",我想知道,里面是否包含另一个字符串"ABCDABD".   许多算法可以完成这个任务, Knuth-Morris-Pratt算法(简称KMP)是最常用的之一. 它以三个发明者命名,起头的那个K就是著名科学家Donald Knuth.

字符串匹配 KMP 算法 Java实现

- - ITeye博客
字符串匹配过程中,如果使用蛮力算法,效率非常的差,在此介绍一种较为高效的匹配算法KMP算法. 其主要思想是从匹配的模版去分析,即去分析Pattern串的自身规律,进而去优化匹配的效率. 例如字符串“ababcb”,明显看出是ab出现一组重复,若出现如下匹配模式:. 此时发生错误,一般情况下会选择移动Pattern一个位置来继续,事实证明效果不佳.

字符串匹配(BF,BM,Sunday,KMP算法解析)

- - CSDN博客综合推荐文章
字符串匹配一直是计算机领域热门的研究问题之一,多种算法层出不穷. 字符串匹配算法有着很强的实用价值,应用于信息搜索,拼写检查,生物信息学等多个领域. 今天介绍几种比较有名的算法:. BF(Brute Force)算法又称为暴力匹配算法,是普通模式匹配算法. 其算法思想很简单,从主串S的第pos个字符开始,和模式串T的第一个字符进行比较,若相等,则主串和模式串都后移一个字符继续比较;若不相同,则回溯到主串S的第pos+1个字符重新开始比较.

最佳字符串匹配算法(Damerau-Levenshtein距离算法)的Java实现

- - Java - 编程语言 - ITeye博客
原文: http://www.javacodegeeks.com/2013/11/java-implementation-of-optimal-string-alignment.html.  It implements a few well known tricks to use less memory by only hanging on to two arrays instead of allocating a huge n x m table for the memoisation table.

转载一篇单字符串匹配KMP算法最好理解的文章

- - 编程语言 - ITeye博客
字符串匹配是计算机的基本任务之一.   举例来说,有一个字符串"BBC ABCDAB ABCDABCDABDE",我想知道,里面是否包含另一个字符串"ABCDABD".   许多算法可以完成这个任务,. Knuth-Morris-Pratt算法(简称KMP)是最常用的之一. 它以三个发明者命名,起头的那个K就是著名科学家Donald Knuth.

字符串匹配那些事(一)

- jiessie - 搜索技术博客-淘宝
本系列文章主要介绍几种常用的字符串比较算法,包括但不限于蛮力匹配算法,KMP算法,BM算法,Horspool算法,Sunday算法,fastsearch算法,KR算法等等. 本文主要介绍KMP算法和BM算法,它们分别是前缀匹配和后缀匹配的经典算法. 所谓前缀匹配是指:模式串和母串的比较从左到右,模式串的移动也是从左到右;所谓后缀匹配是指:模式串和母串的的比较从右到左,模式串的移动从左到右.

字符串相似算法-Jaro-Winkler Distance

- - 开源软件 - ITeye博客
Jaro-Winkler Distance 算法. 这是一种计算两个字符串之间相似度的方法,想必都听过Edit Distance,Jaro-inkler Distance 是Jaro Distance的一个扩展,而Jaro Distance(Jaro 1989;1995)据说是用来判定健康记录上两个名字是否相同,也有说是是用于人口普查,具体干什么就不管了,让我们先来看一下Jaro Distance的定义.

[记录]字符串相似度算法(编辑距离算法 Levenshtein Distance)

- - xilo's blog
在搞验证码识别的时候需要比较字符代码的相似度用到“编辑距离算法”,关于原理和C#实现做个记录. 编辑距离,又称Levenshtein距离(也叫做Edit Distance),是指两个字串之间,由一个转成另一个所需的最少编辑操作次数,如果它们的距离越大,说明它们越是不同. 许可的编辑操作包括将一个字符替换成另一个字符,插入一个字符,删除一个字符.