Understanding a legacy database

Any database that has gone into production is a legacy database according to Tom Kyte . I agree, but to me some databases give me associations to legacy more often than others. In a post at the OraStory blog Dom complains about companies that hire contractors for a maximum of one year, and due to the fact that it takes half a year to understand the business and the data, you hardly get to show what you can deliver.

Understanding the business is one thing, but how it is projected into the database is another matter. With a crash introduction to SQL any developer can create a few tables and start dumping data. Everyone working with the code knows the connections, but lack of documentation, no constraints, and apparently inconsistency in the data makes it difficult for the newly hired DBA to understand what is going on or what is supposed to be going on. And this agile movement doesn’t make it easier, documentation is not exactly given high priority (…the primary goal of software development is to create software, not documents- Ambler; Agile Database Techniques 2003, p8). In places where more attention has been given to the applications than the database the database can easily become a hard to manage legacy system. The characteristics are tables with few or no constraints (besides the primary key on an id column with values from a sequence, of course), tables with similar column names but with different content and meaning for the same column name, same data stored twice or more (and not in sync), same column used again in another table but with different type, missing data (not null-constraint missing) and so on.

If you have a tuning problem, adding constraints can help, but how can you add a constraint if you don’t know the data model or the connections in the data? Adding an unique index can certainly speed things up, but what if a column has 99% unique data, is that an error in the data or just coincidental? I have found the following techniques and strategies useful for understanding the data model:

Ask questions - some like to talk about the code they wrote, but there is a limit to all interruptions.
Talk to staff likely to be in need for reports and find out if they use reports from the database, then you trace the reports back to the database, i.e. the actual SQL used. Sometimes there is an intermediate data dump that is later imported to Excel for further processing. Looking at each step, asking for an interpretation of the report (if it is not obvious) you may see some new connections.
Views in the database are useful to store complex queries, hide access to columns, create a standard interface to a set of tables, and more. But I have seen views defined as select col1 newname1, col2 newname2, … from foo; i.e. the only change from the underlying table is new column names that relates to reports or a term that is used outside the database.
Stored procedure and packages are more likely to have comments than tables and views. If some strange logic is applied in the source, it motivates a question and may reveal something about the data model.
Anticipate dirty data (if constraints are few). Say the quality of the applications are reasonable good, most of the data may be fine, but there will be some orphan rows, illegal values, etc. Doing some digging with simple select statements counting and looking for seemingly missing parent values (as per an imagined business rule, since the constraint is not defined) may show a set of dirty data that should be looked into. I have used this as a pretext to ask about the tables (“Is it not so that … is supposed to have …‘cause I have seen some … that does not have a corresponding …"). This may be an opportunity to suggest to clean up the data and add constraints (because management understands the value of the data and the danger of not be able to trust them). Invalid data are more likely to have entered early in the history of the database (“well, some testing was done on the production database”) or because the data model evolved with new columns added or business rules changed at an early stage.
Besides finding corrupt data, researching the statistical distribution is useful for tuning purposes and identifying other constraints. When a column with a discrete set of distinct values have relative few nulls I suspect a candidate for a not null constraint.

I wish the oracle blogosphere would discuss database refactoring more often. The biggest challenge is to identify what can break elsewhere when you try to fix wrongdoings (impact analysis is what they call it) It is not trivial. Do people undertake this, or is it sometimes a relief when the contracts expires after twelve months so you can move on?