C#: the SimHash algorithm for similarity matching
Published: 2019-06-22


Use case: Google's simhash algorithm, applied here to finding similar texts.

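Extensive testing suggests that simhash works well for comparing longer texts, for example 500 characters or more: fingerprints whose Hamming distance is less than 3 are almost always similar texts, and the false-positive rate is fairly low.

In my experience, if N is the size of each chunk and M is the number of characters shared by adjacent chunks, N = 4 and M = 3 is the best choice.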
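To make the N = 4, M = 3 chunking concrete, here is a minimal sketch of overlapping chunking. The class and method names (`OverlapSketch`, `Chunks`) are hypothetical and only for illustration; they are not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapSketch
{
    // Splits text into chunks of chunkSize characters in which consecutive
    // chunks share overlapSize characters, so each step advances by
    // chunkSize - overlapSize characters.
    public static List<string> Chunks(string text, int chunkSize, int overlapSize)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (overlapSize < 0 || chunkSize <= overlapSize)
            throw new ArgumentException("chunkSize must be greater than overlapSize.");

        var chunks = new List<string>();
        int stride = chunkSize - overlapSize;

        // Stop as soon as a full chunk no longer fits.
        for (int start = 0; start + chunkSize <= text.Length; start += stride)
        {
            chunks.Add(text.Substring(start, chunkSize));
        }
        return chunks;
    }
}
```

With N = 4 and M = 3 the stride is one character, so `OverlapSketch.Chunks("simhash", 4, 3)` yields `simh`, `imha`, `mhas`, `hash`. A one-character insertion or deletion therefore changes only a few neighbouring chunks, which is presumably why this setting works well for near-duplicate text.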

  

using System;
using System.Collections.Generic;

public class SimHashAnalyser : IAnalyser
{
    private const int HashSize = 32;

    // Returns a score in [0, 1]: the fraction of fingerprint bits on which the two texts agree.
    public float GetLikenessValue(string needle, string haystack)
    {
        var needleSimHash = this.DoCalculateSimHash(needle);
        var hayStackSimHash = this.DoCalculateSimHash(haystack);
        return (HashSize - GetHammingDistance(needleSimHash, hayStackSimHash)) / (float)HashSize;
    }

    // Note: string.GetHashCode() is not guaranteed to be stable across processes
    // or .NET versions, so only compare fingerprints computed in the same process.
    private static IEnumerable<int> DoHashTokens(IEnumerable<string> tokens)
    {
        var hashedTokens = new List<int>();
        foreach (string token in tokens)
        {
            hashedTokens.Add(token.GetHashCode());
        }
        return hashedTokens;
    }

    // Counts the bit positions in which the two fingerprints differ.
    private static int GetHammingDistance(int firstValue, int secondValue)
    {
        var hammingBits = firstValue ^ secondValue;
        var hammingValue = 0;
        for (int i = 0; i < HashSize; i++)
        {
            if (IsBitSet(hammingBits, i))
            {
                hammingValue += 1;
            }
        }
        return hammingValue;
    }

    private static bool IsBitSet(int b, int pos)
    {
        return (b & (1 << pos)) != 0;
    }

    private int DoCalculateSimHash(string input)
    {
        ITokeniser tokeniser = new OverlappingStringTokeniser(4, 3);
        var hashedtokens = DoHashTokens(tokeniser.Tokenise(input));

        // One signed counter per bit position; new int[] is already zero-filled.
        var vector = new int[HashSize];
        foreach (var value in hashedtokens)
        {
            for (var j = 0; j < HashSize; j++)
            {
                if (IsBitSet(value, j))
                {
                    vector[j] += 1;
                }
                else
                {
                    vector[j] -= 1;
                }
            }
        }

        // A fingerprint bit is 1 where more token hashes had that bit set than not.
        var fingerprint = 0;
        for (var i = 0; i < HashSize; i++)
        {
            if (vector[i] > 0)
            {
                fingerprint |= 1 << i;
            }
        }
        return fingerprint;
    }
}

public interface IAnalyser
{
    float GetLikenessValue(string needle, string haystack);
}

public interface ITokeniser
{
    IEnumerable<string> Tokenise(string input);
}

// Splits the input into consecutive, non-overlapping chunks.
public class FixedSizeStringTokeniser : ITokeniser
{
    private readonly ushort tokensize = 5;

    public FixedSizeStringTokeniser(ushort tokenSize)
    {
        if (tokenSize < 2 || tokenSize > 127)
        {
            throw new ArgumentException("Token size must be between 2 and 127.");
        }
        this.tokensize = tokenSize;
    }

    public IEnumerable<string> Tokenise(string input)
    {
        var chunks = new List<string>();
        int offset = 0;
        while (offset < input.Length)
        {
            // The last chunk may be shorter than tokensize.
            int length = Math.Min(this.tokensize, input.Length - offset);
            chunks.Add(input.Substring(offset, length));
            offset += this.tokensize;
        }
        return chunks;
    }
}

// Splits the input into chunks of chunkSize characters that overlap by overlapSize characters.
public class OverlappingStringTokeniser : ITokeniser
{
    private readonly ushort chunkSize = 4;
    private readonly ushort overlapSize = 3;

    public OverlappingStringTokeniser(ushort chunkSize, ushort overlapSize)
    {
        if (chunkSize <= overlapSize)
        {
            throw new ArgumentException("Chunk size must be greater than overlap size.");
        }
        this.overlapSize = overlapSize;
        this.chunkSize = chunkSize;
    }

    public IEnumerable<string> Tokenise(string input)
    {
        var result = new List<string>();
        int position = 0;
        // "<=" so that the final full chunk is included (the original "<" skipped it).
        // Inputs shorter than chunkSize produce no tokens.
        while (position <= input.Length - this.chunkSize)
        {
            result.Add(input.Substring(position, this.chunkSize));
            position += this.chunkSize - this.overlapSize;
        }
        return result;
    }
}
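One caveat about the token hashing above: `string.GetHashCode()` is documented as not stable across processes or .NET versions, so fingerprints built from it should not be stored and compared later. If fingerprints need to be persisted, a deterministic hash could be swapped in. The sketch below uses 32-bit FNV-1a over the token's UTF-8 bytes. This is my substitution, not something the original code does, and `StableHashSketch` and `Fnv1a32` are hypothetical names.

```csharp
using System.Text;

public static class StableHashSketch
{
    // 32-bit FNV-1a over the UTF-8 bytes of the token.
    // The result depends only on the token's contents, not on the process.
    public static int Fnv1a32(string token)
    {
        const uint OffsetBasis = 2166136261; // standard FNV-1a 32-bit offset basis
        const uint Prime = 16777619;         // standard FNV 32-bit prime

        uint hash = OffsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(token))
        {
            unchecked
            {
                hash ^= b;     // mix in the next byte
                hash *= Prime; // multiply modulo 2^32
            }
        }
        // Reinterpret the 32 bits as a signed int so it fits the int-based code above.
        return unchecked((int)hash);
    }
}
```

To try it, the call `token.GetHashCode()` in `DoHashTokens` would become `StableHashSketch.Fnv1a32(token)`. Whether the similarity quality matches the author's test results with a different hash function is untested here.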

  

Usage:

const string HayStack = "中国香港………………";
const string Needle = "中国香港 2013………………";

IAnalyser analyser = new SimHashAnalyser();
var likeness = analyser.GetLikenessValue(Needle, HayStack);

Console.Clear();
Console.WriteLine("Likeness: {0}%", likeness * 100);
Console.ReadKey();
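The likeness percentage printed here and the "distance less than 3" rule of thumb describe the same quantity: with a 32-bit fingerprint, likeness = (32 - distance) / 32, so a distance below 3 corresponds to a likeness above 29/32, roughly 90.6%. The sketch below shows that relationship. The names `HammingSketch`, `Distance`, `DistanceFromLikeness` and `IsNearDuplicate` are hypothetical, and the default threshold of 3 is taken from the rule of thumb above; the source does not say whether it was tuned for 32-bit fingerprints.

```csharp
using System;

public static class HammingSketch
{
    // Counts the bits in which two 32-bit fingerprints differ.
    public static int Distance(int a, int b)
    {
        uint diff = unchecked((uint)(a ^ b));
        int count = 0;
        while (diff != 0)
        {
            diff &= diff - 1; // clear the lowest set bit
            count++;
        }
        return count;
    }

    // Converts a likeness score in [0, 1] back into a bit distance.
    public static int DistanceFromLikeness(float likeness, int hashSize = 32)
    {
        return hashSize - (int)Math.Round(likeness * hashSize);
    }

    // Applies the rule of thumb: a distance below the threshold counts as similar.
    public static bool IsNearDuplicate(int distance, int threshold = 3)
    {
        return distance < threshold;
    }
}
```

For example, `HammingSketch.IsNearDuplicate(HammingSketch.DistanceFromLikeness(likeness))` would turn the score from `GetLikenessValue` into a yes/no decision.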

  

 

Reposted from: https://www.cnblogs.com/zengxiangzhan/p/3311114.html
