Edit Distance

Something I didn't know till now

By ahchao

September 4, 2021

Today afternoon I was talking to one of my ex-colleagues/friend and he mentioned he gave an interview question regarding autocorrect, which basically simplifies to selecting the correct word out of 2, given an input string.

Example, given 2 words, cat and category, cats will return cat, while categorycatcat will return category. Seems straightfoward, however, etegorycat will return cat. Bet you didn’t expect that too like me.

My initially response to myself was very naive and I initially thought it was something like if first 3 char matched cat, then it becomes cat. Or some sort of string length matching, whichever is closer to the autocorrected word. Either way, it’s gonna be tough to implement it (side note, it is indeed not that straightforward). Little did I know, there is an algorithm for it.

As I was not exposed enough (either my game degree is not ‘CS’ enough or I just suck), I did not know how the examples selected the ‘correct word’. I got a hint from him, “proximity = # insertions + # deletions”, no substitution. It did not quite translate to me and the naive me thought it was about going through char by char using some sort of stack or what to track insertion/deletions as we iterate through the input string to find the correct word.

It was only then did I learn about LCSS or levenshtein distance, or edit distance in general. I think either I failed myself or my CS degree failed me (probably the former, but I don’t recall having come across these specifically, so I wonder if NTU/NUS students came across these as well), as I have no idea what it was originally about. Regardless, I think it is good to be exposed to such stuff, so when a problem comes, we at least have a rough idea of what to search/todo, rather than being stuck or do things that people have already solved. You can’t know what you don’t know.

Basically, edit distance is about measuring differences between 2 things, e.g. strings, speech samples, etc.. As it turns out, edit distance is quite commonly used. It is used in git diff, and even something like that is used in the React reconcilation logic (the process through which React updates the DOM, https://shripadk.github.io/react/docs/reconciliation.html) in React. https://vishnubharathi.codes/blog/levenshtein-distance/ has a really good blog post on it, including the algorithm. https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0 also has a good explanation on how to build the table. Try implementing the code (https://leetcode.com/problems/edit-distance/solution/) to calculate the edit distance, so you can pick the best word!

comments powered by Disqus