An In-depth Discussion of Textual Similarity: Taking a look at the toolkit
The Introduction’s Introduction
This is the third post in a series of blog posts on textual similarity. If you haven’t already read the previous posts, I recommend doing so. TL;DR, those posts:
- Define the target audience
- Caveat the series
- Define what textual similarity is
- Discuss how current usage of the term is too broad
- Describe characteristics of similarity that matter and provide a taxonomy to reason about them
- Give specific examples of tasks and how they map onto the taxonomy
Introduction
Even with a framework to describe how texts can be similar, we still need to translate that into practice. Below is a list of similarity methods I’ve come across. It cannot be overstated how instrumental the SimMetrics website was in this process. I’m familiar with common GOF methods and a fair number of neural methods, but there are a large number of GOF ML approaches that are not typically covered or mentioned otherwise.
Below is a framework that groups methods by how they work. This will make more sense later when discussing how to pick between methods: the underlying mechanism limits what each method can do.
Considerations/how to pick between methods will be covered in the next post.
Methods
This is an incomplete list of textual similarity methods. This list won’t, and can’t, cover every single method and its variants. A lot of these methods can be tweaked or otherwise combined to create something novel not described here. When there are too many variants to cover, notes will be left to guide the reader.
Broadly, the methods are grouped by implementation/underlying mechanism. The reason is that many of the methods grouped together, while they may seem different, share some underlying qualities.
- String Matching
- Exact match: Straightforward, are these the same string? a = b
- Processed match: Applying some transformation (stripping excess spaces, lowercasing) and checking equality: f(a) = f(b)
- Overlap Matching: Process the strings and look at how much overlap exists between the processed values
- Set-overlap:
    - Jaccard Similarity: |intersection(a,b)| / |union(a,b)|
    - Overlap Coefficient: |intersection(a,b)| / min(|a|,|b|)
    - MinHash
- N-gram overlap:
- Substring matching
- Jaro distance metric
- Jaro Winkler
- Ukkonen Algorithm
- Compression Similarity: Using a lossless compression algorithm like gzip, let d = |compress(a + b)|; the range of d is [1, |a|+|b|]. It can be normalized by using (d - 1)/(|a| + |b| - 1).
- Difflib
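Two of the overlap measures above, plus the compressed size that compression similarity is built on, can be sketched in a few lines. This is a minimal illustration; the example strings are made up:

```python
import gzip

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def compressed_size(a: str, b: str) -> int:
    """Size in bytes of the gzip-compressed concatenation of a and b."""
    return len(gzip.compress((a + b).encode("utf-8")))

x = set("the quick brown fox".split())
y = set("the quick red fox".split())
print(jaccard(x, y))              # 3 shared / 5 total = 0.6
print(overlap_coefficient(x, y))  # 3 shared / min(4, 4) = 0.75
```

Note that the set-based measures ignore word order entirely; that is exactly the kind of limitation of the underlying mechanism discussed in the next post.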
- Dynamic programming
- Edit Distance
- Levenshtein distance
- Damerau-Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- Smith-Waterman distance
- Gotoh Distance or Smith-Waterman-Gotoh distance
- Monge-Elkan distance
- Dynamic Time Warping
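The dynamic-programming family above all share the same core recurrence; Levenshtein distance is the simplest instance. A minimal two-row sketch (not an optimized implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence.

    prev[j] holds the distance between a[:i-1] and b[:j] as we sweep row by row.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The other methods in this group (Needleman-Wunsch, Smith-Waterman, etc.) mainly vary the cost scheme and which cells of the table they consider, which is why they are grouped together.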
- Embedding Space Based (WIP, will add discussion on embeddings)
- L1
- L2
- Cosine Similarity
- BERT-Score
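The embedding-space measures are just geometric distances between vectors; cosine similarity, for instance, is the normalized dot product. A minimal sketch, where the three-dimensional vectors are toy stand-ins for real model embeddings:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity to a zero vector as 0
    return dot / (norm_u * norm_v)

# Toy vectors for illustration; real embeddings come from a trained model.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction, different magnitude
print(cosine_similarity(a, b))  # ~1.0: identical direction
```

L1 and L2 would instead sum absolute or squared coordinate differences; unlike cosine similarity, they are sensitive to vector magnitude.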
- Learned methods (WIP)
- Embeddings
- Similarity Classification
- Heuristic Methods (custom problem specific approaches)
Closing Comments
Some final remarks: a fair number of methods for working with strings were omitted since they
- Require a corpus (bloom filters, statistical methods such as KL divergence)
- Don’t make sense in the context of comparing two strings (grep)
While these can be used to compare strings, they didn’t fit into how the subject has been framed so far.
Another remark is that input granularity can often be varied. For example, edit distance is usually discussed at the character level, but nothing stops you from applying it at the word level instead; in fact, that might make more sense for most applications. When deduplicating, it is unlikely that the same word will be misspelled in one document but spelled correctly in another, and reducing the number of tokens will increase the speed.
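The granularity point can be sketched by writing the edit-distance recurrence over generic sequences, so the same function accepts a string (character level) or a token list (word level). The example sentences are made up:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance over any two sequences (characters, words, ...)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution / match
        prev = curr
    return prev[-1]

s1 = "the quick brown fox jumps"
s2 = "the quick red fox leaps"
print(edit_distance(s1, s2))                   # character level: larger
print(edit_distance(s1.split(), s2.split()))   # word level: 2 substitutions
```

The word-level call also runs over far fewer tokens, which is where the speedup for deduplication comes from.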
Beyond changing input granularity, methods can also be combined. If misspelled words are a concern, you could use something like Jaccard similarity at the character level to account for them. Nothing is stopping you from mixing and matching techniques; it’s just rarely discussed and becomes much more problem-specific. To see whether a specific combination is necessary, you would need metrics.
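One such combination, character-bigram Jaccard to tolerate misspellings, can be sketched as follows (the misspelling is a made-up example):

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams (bigrams by default) of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A misspelling scores 0 on exact match but still overlaps heavily on bigrams.
print(jaccard(char_ngrams("similarity"), char_ngrams("similarty")))  # 0.7
```

Here an exact or word-level match would treat the pair as entirely different, while the character-level overlap still recognizes them as near-duplicates.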