A diff tool generates an output given two input files. This output can be thought of as a script which when applied to first input returns the second input. Diff tool is nothing but a straight forward implementation for finding edit distance at its core. In addition to finding minimum edit distance we also find actions which when applied first input would lead to second input. These actions constitute diff tools output. Lets first define some classes which denote these actions
Above we declare three forms of edit actions which can be performed on first input to arrive at second input. Delete action contains line number of the line which is deleted from first input. Insert action contains line number from second input which is inserted into first. Copy action contains two line numbers. First is the line number from first input which is copied and second is line number in second input where it is copied. Although, there would be many ways to do this but we would like a sequence of these action with minimum cost. Now, we assume that cost of Delete and Insert is more that Copy action since in case of Copy line in first and second input is same and we did not make a change. And since while finding minimum cost only relative cost matters we can assign cost 1 to Delete and Insert Actions and cost of 0 to Copy action. Below is cost function for these edit actions
Now, lets code up the actual diff method which takes line numbers in first and second input at which starts calculating the diff. The actual diff of two file can be found by invoking this function at position 0,0. This method also assumes that lines from first input are available in lines1 and lines from second input are available in lines2 variables.
Above is a naive implementation of our diff algorithm where in first if we just return empty list of EditAction since pos1 and pos2 are both equal to or greater than number of lines in files, which means that we have already processed all lines. In second if we see that we have already processed all lines in first file so we have no option but to insert lines from second file. Third if handles the case when we have processed all lines in second file, in this case we have no option but to delete remaining lines from first input. Next we handle case when current line in input 1 and input 2 is same, in this case it would be best to simply copy this line from file one to file two. Next we handle more complex case by calculating cost of both performing Insert and Delete and then comparing the diff cost and returning the diff with minimum cost. Now lets also add code to read input files and call diff method.
When running above program two files, the program literally prints a list of actions to be performed to convert input one into input two. Now this list of actions can be easily translated to either text patch format or html diff or html split view diff. Now, let us add a method to convert these actions to html.
Although, this is conceptually complete implementation for generating HTML diff of two files, this implementation is usable only for creating diffs of small files. For large files this implementation is likely to fail with Stack Overflow Exception, since the definition is recursive. One way to fix this is to have an implementation which is tail recursive, such implementations are converted to iteration by Scala compiler. Following is such an implementation of diff routine. We also define a type alias DiffScript for List[EditAction] type
Tail Recursive Diff implementation
Above diff method is tail recursive implementation of iterative lcs implementation. Instead of calculating diff(a,b) in terms of diff(a-1,b),diff(a,b-1) and diff(a-1,b-1), it sequentially calculates diff(a,0),diff(a,1)….diff(a,lines2.length) and stores it in last variable. It then uses this list to calculate diff(a+1,0),diff(a+1,1)…diff(a+1,lines2.length). Proceeding this way it calculates diff(lines1.length,lines2.length) which is the final result.