I have source code (20-30 lines of Java) that calculates an n-gram probability by looking up n-gram occurrence counts. The software is part of a grammar checker: it calculates the probability of, e.g., "It has for of the largest stadiums" and of "It has four of the largest stadiums" (note: for/four). The variant with the higher probability wins and is considered correct; the variant with the lower probability is considered an error.
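To make the setup concrete, here is a minimal sketch of how such a count-based score could be computed. It is not the actual code under review: the class name NgramScorer, the in-memory count map, the trigram order, and the add-one smoothing are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramScorer {
    // Hypothetical in-memory counts; a real checker would read them from an index.
    private final Map<String, Long> counts = new HashMap<>();
    // Assumed vocabulary size, used only for add-one smoothing.
    private final long vocabularySize;

    public NgramScorer(long vocabularySize) {
        this.vocabularySize = vocabularySize;
    }

    public void addCount(String ngram, long count) {
        counts.merge(ngram, count, Long::sum);
    }

    private long count(List<String> tokens) {
        return counts.getOrDefault(String.join(" ", tokens), 0L);
    }

    /**
     * Sum of log P(w_i | w_{i-2}, w_{i-1}) over all trigrams in the token list,
     * each estimated as (count(trigram) + 1) / (count(context) + vocabularySize).
     * Logs are summed instead of multiplying probabilities to avoid underflow.
     */
    public double logProbability(List<String> tokens) {
        double logP = 0.0;
        for (int i = 2; i < tokens.size(); i++) {
            long trigram = count(tokens.subList(i - 2, i + 1));
            long context = count(tokens.subList(i - 2, i));
            logP += Math.log((trigram + 1.0) / (context + vocabularySize));
        }
        return logP;
    }
}
```

With counts loaded for both variants, the checker would call logProbability on each token list and prefer the variant with the larger (less negative) value.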
The first task is to review the code to see whether it is correct. The second task is to find a way to correctly compare n-grams of different sizes. For example, we want to compare "their" to "they're", but because "they're" is stored internally as two tokens, "they" + "'re", the two variants differ in token count and the algorithm no longer works well.
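One possible starting point for the second task, offered only as a heuristic sketch and not as a theoretically justified answer, is to compare the average log probability per token, which is the log of the geometric mean of the per-token probabilities. A variant with more tokens multiplies more factors below 1, so its raw score is penalised for length alone. The class and method names below are hypothetical, and whether this normalisation is sound for the grammar checker is exactly the question for the expert.

```java
public class LengthNormalizer {

    /** Average log probability per scored token (log of the geometric mean). */
    public static double perToken(double logProbability, int tokenCount) {
        if (tokenCount <= 0) {
            throw new IllegalArgumentException("tokenCount must be positive");
        }
        return logProbability / tokenCount;
    }

    /** True if the first variant has the higher per-token log probability. */
    public static boolean firstWins(double logP1, int tokens1,
                                    double logP2, int tokens2) {
        return perToken(logP1, tokens1) > perToken(logP2, tokens2);
    }
}
```

For instance, a three-token variant with log probability -12.0 beats a four-token variant with -14.0 on the raw score, but loses after normalisation (-4.0 versus -3.5 per token). An alternative that avoids normalisation altogether would be to merge "they" + "'re" back into one unit before the lookup, provided counts for the merged form are available.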
Note: This is a job for a probability theory expert, not for a software developer.