Diff in 2 dimensions
A spreadsheet solution for comparing tabular data, with potential for limited extension into a third dimension.
Challenge:
Discover differences between, for example, recordsets from parallel databases, row by row, field by field.
Background:
I had worked for a large institution that used a pipeline of n-1 large, similar database environments
for purposes of development, testing and acceptance
before software changes were trusted to be rolled into the final nth production environment.
Each environment had its own evolving host of administrators and users representing a range of aptitudes and skills.
Owing to inadequate methods, policies, practices and policing,
single point changes to tables storing application parameters
had, over time, altered the character of each environment diverging from a standard.
This situation became uncomfortably apparent
when an irritating, non-critical error message began appearing in the overnight processing
on the 4th or 5th environment in this pipeline.
The cause of the error message was found to be an almost insignificant difference
in one field of one record of one parameter table.
Sometime in the past, a powerful user had not completed all required changes when 'turning on' additional processing and reporting.
Once the cause was found, the remedy was obvious and applied quickly.
However, the implication of this discovery was unnerving.
The administration recognised the need for version control of application changes and their migration through
the environments through expensive version control software and dedicated staff implementing protocols.
Sadly, there existed a blind spot in regard of the effect of parameter values on the application's stability.
Finally convincing managers
that disregarding these configuration differences
could invalidate weeks of cautious testing through pre-production environments
and potentially cripple or topple the production environment,
I was assigned the unsavoury task of discovering and documenting the differences.
(The myriad real single point fixes,
re-implementing potentially badly executed procedure as code instead of printed instructions,
was viewed as 'dangerous'.
One of the many reasons I don't work there any more.)
The reason for this cautious analysis was that each of the database/software environments had reached its own equilibrium.
(It was the disturbance of one environment's equilibrium that exposed a poorly executed change sometime in the past
triggering this entire project.) There was no TRUE configuration that could be arbitrarily applied to all environments
to wash away all past sins in one swoop.
The application had been developed and managed 'outhouse' being the bread'n'butter for a certain software house for years.
When the application was brought 'inhouse',
this supplier was not keen to provide more than the minimum that the purchase contract required
including the minimum of useful documentation.
For bureaucratic reasons, my access to software tools was a hurdle as well.
I faced a challenge of differencing quite a large volume of data.
And the safeguards around the environments precluded ideas of ODBC connections to multiple datasources.
This had to be done external to 'the system'.
Solution:
An available utility could save the recordset from an arbitrary SQL query on any single database as a CSV text file.
Running the same query against each of the databases provided corresponding CSV files of rows and columns
to search for differences between each database.
The fastest and most flexible comparison tool available at that office was MS-Office Excel.
(This is not a plug for Microsoft.
OpenOffice-Calc functions just as well.
1-2-3 would probably work, too, but I've no experience with that product.)
Text comparison tools like Unix's diff or VisualStudio's WinDiff weren't quite right as they compare lists (1D), not tables (2D).
I needed something that highlighted field difference(s) while maintaining context.
The following steps, written in the jargon of Excel, yielded the necessary functionality.
Unless explicitly stated, all cells references in the procedure are on Sheet1.
- Begin with a fresh workbook and delete the unnecessary Sheet3
- Create 15 rows by 15 columns of arbitrary sample data on Sheet2
This can be text, numbers, dates...
Entries in the first row should represent column titles as would usually appear in a CSV file.
Entries in the leftmost column(s) should represent primary keys as they likely would in a sorted CSV file.
The dimensions are arbitrary for this procedure.
- Copy Sheet2 as Sheet2a
In the following step, the arbitrary names of these worksheets are entered into cells of Sheet1.
- Set B1 to Sheet2 and B14 to Sheet2a
These 2 text cells contain the names of the worksheets (tables) in this workbook to be compared.
- Set B2 to 1
As will be seen later, this is the first interesting row number to compare.
- Set B3 to the formula =B2+1 and copy B3 to B4:B11
B11 should contain =B10+1 and B2:B11 should display the values '1' through '10'.
- Set B15 to the formula =B3 and copy B15 to B16:B24
B24 should contain =B11 and B15:B24 should display the values '1' through '10'.
- Set C1 to =INDIRECT(ADDRESS(1;COLUMN()-2;1;$B$1))
This builds a cell reference that retrieves the value of, in this example, Sheet2!A1.
C1 should retrieve and display the value of the top, left cell in Sheet2.
The static value '1' as the first parameter to ADDRESS freezes Sheet1!C1 when scrolling is added later in this procedure.
- Apply conditional formating to C1
setting the background to a noticeable colour when
the cell value Is Not Equal to the value of cell C14.
NB: Use the relative reference C14, not the absolute reference $C$14.
This step uses the spreadsheet's highlighting function to draw attention to unequal values.
At this point in the procedure, because C14 is empty, C1 should be coloured.
Verify the correct operation of the comparison using 'Copy/Paste Value only' to replicate the value of C1 into C14.
When these values are the same, C1 should no longer have its background colour.
Caveat: Excel is not case-sensitive when comparing text.
- Copy/'paste all' from C1 to C2 and adjust the formula reference to =INDIRECT(ADDRESS($B2;COLUMN()-2;1;$B$1))
Almost the same formula as C1, but the row used in this indirection is dependent on the value of B2.
- Copy C2 to C3:C11
C11 should contain =INDIRECT(ADDRESS($B11;COLUMN()-2;1;$B$1))
and C2:C11 should now display, coloured, the first column of the first ten rows of Sheet2.
Using menubar Edit/Fill/Down will replicate the conditional formating required whereas
simply typing the formula into these other cells will not.
- Copy/'paste all' from C2 to C14 and adjust the formula refence to =INDIRECT(ADDRESS(1;COLUMN()-2;1;$B$14))
This formula is almost the same as C1, but in this instance the worksheet name is drawn from B14 so the reference is now Sheet2a!A1.
Use 'paste all' to ensure the conditional formating is replicated from C2.
- Select C14 and adjust the conditional formating so that C14 is coloured when Is Not Equal to C1.
C2:C11 will be highlighted when the cells differ Sheet2 != Sheet2a.
This step mirrors highlighting Sheet2a != Sheet2 to enhance visibility.
- Copy/'paste all' from C14 to C15 and adjust the formula refence to =INDIRECT(ADDRESS($B15;COLUMN()-2;1;$B$14))
Almost the same as C14 except that the row number is now indirect instead of static.
- Copy C15 to C16:C24
C24 should contain =INDIRECT(ADDRESS($B24;COLUMN()-2;1;$B$14)). Comments of Step 11 apply, but with reference to Sheet2a
- Reduce the widths of columns A thru S to about 1cm and select the range C2:S24
Shrinking these widths eases performing the following step.
- Select C2:C24 and use menubar Edit/Fill/Right to replicate column C to columns D thru S
This step adds the second dimension to the comparison tool. The number of columns is arbitrary; extra columns will display '0'.
- Insert a vertical slider control over the space in column A giving it an appropriate range (1-50 or 1-500) and link its value to
influence the value of B2.
A slider control eases scrolling these two (pseudo-)windows vertically.
Return to the sample data in Sheet2 (or Sheet2a)
and change or delete the content of a few arbitrary cells in the first few rows and/or columns so that the content
of the two worksheets is no longer exactly the same.
Going back to Sheet1, you should see coloured cells in the worksheet
highlighting where there are now differences when comparing the cells of Sheet2 and Sheet2a.
The background colour draws attention to these differences.
(One shortcoming is that MS-Excel Conditional Formatting 'condition' is not case sensitive. Bummer.)
Differences are visibly flagged in the context of their row and column.
Discussion:
The comparison does not have any 're-sync' heuristics like the Unix command 'diff'.
One or more 'missing' rows in the 2 worksheets being compared will cause all subsequent rows to be out of alignment.
The operator must use discretion and coerce synchronisation by inserting sufficient blank rows at the appropriate place
in the worksheet that is missing rows.
As described, the column widths were reduced for demonstration purposes only.
Narrow columns provide an overview of possible differences for more columns (i.e. wider recordsets).
Freezing Sheet1 columns A thru C at the left edge of the window
allows the operator to widen individual columns as needed
while scrolling the view horizontally
without losing the context of the rows being viewed.
By adjusting the zoom of the worksheet one can move from macro/overview to micro/detail view with ease.
In practice, an optimal SQL query designed with consideration
of important columns, records and practical sorting yields a recordset amiable for comparison.
To fulfill my requirements,
for any given table I created multiple CSV files (one from each database),
imported all each of the files into a workbook already set up with this special Sheet1,
named each resulting worksheet for its source database
and could happily search for significant differences between any two snapshots
by specifying the 2 interesting worksheet names in cells B1 and B14.
The dimensions of the comparison given above are arbitrary.
For the purpose of the initial set-up I've suggested a 15x15 'grid' of sample data.
One can use fewer or more columns in Sheet1
by replicating column C horizontally to meet the need.
I've suggested creating 2 pseudo-windows, each of 10 rows, in Sheet1.
Those using high resolution displays can easily adapt the size of the pseudo-windows to increase the number of rows shown.
Obviously, each CSV is a snapshot of the recordset at the time the query runs.
As the differences were examined, considered and corrected,
a fresh CSV was created and imported into the workbook related to that query.
By naming the worksheet for the source database and a version index (eg: foo_v12),
I was able to confirm that only appropriate changes had been made by comparing the new version against its predecessor.
It is not difficult to see that this tool can be adapted to simultaneously compare (fewer) rows from more worksheets.
Three or more pseudo-windows would display fewer rows, and the reference for the conditional formating comparison could
be adapted to accommodate the A Vs. B Vs. C ... extension.
Final words
Some may ask, "Why didn't you just do this in Perl?" There are two reasons.
First, any Perl utility would have died when I left this position (and I would have been cursed by my former co-workers).
Second, as Larry said, "TMTOWTDI", and this works within the knowledge-space of those who carry it forward.