An In-depth Discussion of Textual Similarity: Taking a look at the toolkit
The Introduction’s Introduction
This is the third post in a series of blog posts on textual similarity. If you haven’t already read the previous posts, I recommend doing so. TL;DR, those posts:
- Define the target audience
- Caveat the series
- Define what textual similarity is
- Discuss how current usage of the term is too broad
- Describe characteristics of similarity that matter and provide a taxonomy to reason about them
- Give specific examples of tasks and how they map onto the taxonomy
Introduction
Even with a framework to describe how texts can be similar, we still need to translate that into practice. Below is a list of similarity methods I’ve come across. It cannot be overstated how instrumental the SimMetrics website was in this process. I’m familiar with common GOF methods and a fair number of neural methods, but there are a large number of GOF ML approaches that are not typically covered or mentioned otherwise.
Below is a framework that groups methods by how they work. This will make more sense later when discussing how to pick between methods: the underlying mechanism limits what each method can do.
Considerations/how to pick between methods will be covered in the next post.
Methods
This is an incomplete list of textual similarity methods. This list won’t, and can’t, cover every single method and its variants. A lot of these methods can be tweaked or otherwise combined to create something novel not described here. When there are too many variants to cover, notes will be left to guide the reader.
Broadly, the methods are grouped by implementation/underlying mechanism. The reason is that many of the methods grouped together, while they may seem different, share some underlying qualities.
- String Matching
- Exact match: Straightforward, are these the same string? a = b
- Processed match: Applying some transformation (stripping excess spaces, lowercasing) and checking equality: f(a) = f(b)
- Overlap Matching: Process the strings and look at how much overlap exists between the processed values
- Set-overlap:
    - Jaccard Similarity: |intersection(a,b)| / |union(a,b)|
    - Overlap Coefficient: |intersection(a,b)| / min(|a|,|b|)
    - MinHash
- N-gram overlap:
- Substring matching
- Jaro distance metric
- Jaro Winkler
- Ukkonen Algorithm
- Compression Similarity: Using a lossless compression algorithm like gzip, let d = |compress(a + b)|; the range of d is [1, |a|+|b|]. It can be normalized by using (d - 1)/(|a| + |b| - 1).
- Difflib
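Two of the overlap measures above, plus the compressed size that compression similarity is built on, can be sketched in a few lines. This is a minimal illustration; the example strings are made up:

```python
import gzip

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def compressed_size(a: str, b: str) -> int:
    """Size in bytes of the gzip-compressed concatenation of a and b."""
    return len(gzip.compress((a + b).encode("utf-8")))

x = set("the quick brown fox".split())
y = set("the quick red fox".split())
print(jaccard(x, y))              # 3 shared / 5 total = 0.6
print(overlap_coefficient(x, y))  # 3 shared / min(4, 4) = 0.75
```

Note that the set-based measures ignore word order entirely; that is exactly the kind of limitation of the underlying mechanism discussed in the next post.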
- Dynamic programming
- Edit Distance
- Levenshtein distance
- Damerau-Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- Smith-Waterman distance
- Gotoh Distance or Smith-Waterman-Gotoh distance
- Monge-Elkan distance
- Dynamic Time Warping
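The dynamic-programming family above all share the same core recurrence; Levenshtein distance is the simplest instance. A minimal two-row sketch (not an optimized implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence.

    prev[j] holds the distance between a[:i-1] and b[:j] as we sweep row by row.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The other methods in this group (Needleman-Wunsch, Smith-Waterman, etc.) mainly vary the cost scheme and which cells of the table they consider, which is why they are grouped together.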
- Embedding Space Based (WIP, will add discussion on embeddings)
- L1
- L2
- Cosine Similarity
- BERT-Score
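The embedding-space measures are just geometric distances between vectors; cosine similarity, for instance, is the normalized dot product. A minimal sketch, where the three-dimensional vectors are toy stand-ins for real model embeddings:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity to a zero vector as 0
    return dot / (norm_u * norm_v)

# Toy vectors for illustration; real embeddings come from a trained model.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction, different magnitude
print(cosine_similarity(a, b))  # ~1.0: identical direction
```

L1 and L2 would instead sum absolute or squared coordinate differences; unlike cosine similarity, they are sensitive to vector magnitude.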
- Learned methods (WIP)
- Embeddings
- Similarity Classification
- Heuristic Methods (custom problem specific approaches)
Closing Comments
Some final remarks: a fair number of methods for working with strings were omitted since they
- Require a corpus (bloom filters, statistical methods such as KL divergence)
- Don’t make sense in the context of comparing two strings (grep)
While these can be used to compare strings, they didn’t fit into how the subject has been framed so far.
Another remark is that input granularity can often be varied. For example, edit distance is usually discussed at the character level, but nothing stops you from applying it at the word level instead; in fact, that might make more sense for most applications. When deduplicating, it is unlikely that the same word will be misspelled in one document but spelled correctly in another, and reducing the number of tokens will increase the speed.
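The granularity point can be sketched by writing the edit-distance recurrence over generic sequences, so the same function accepts a string (character level) or a token list (word level). The example sentences are made up:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance over any two sequences (characters, words, ...)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution / match
        prev = curr
    return prev[-1]

s1 = "the quick brown fox jumps"
s2 = "the quick red fox leaps"
print(edit_distance(s1, s2))                   # character level: larger
print(edit_distance(s1.split(), s2.split()))   # word level: 2 substitutions
```

The word-level call also runs over far fewer tokens, which is where the speedup for deduplication comes from.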
Beyond changing input granularity, methods can also be combined. If misspelled words are a concern, you could use something like Jaccard similarity at the character level to account for them. Nothing is stopping you from mixing and matching techniques; it’s just rarely discussed and becomes much more problem-specific. To see whether a specific combination is necessary, you would need metrics.
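One such combination, character-bigram Jaccard to tolerate misspellings, can be sketched as follows (the misspelling is a made-up example):

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams (bigrams by default) of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A misspelling scores 0 on exact match but still overlaps heavily on bigrams.
print(jaccard(char_ngrams("similarity"), char_ngrams("similarty")))  # 0.7
```

Here an exact or word-level match would treat the pair as entirely different, while the character-level overlap still recognizes them as near-duplicates.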