
Code Coverage Deemed Dubious

Many organizations use code coverage tools to measure how effectively regression testing detects design errors. The practice is useful, but it has pitfalls that must be avoided if the measurements are to mean anything.

Code Coverage is a Benchmark

Code coverage metrics are benchmarks, in that they are quantitative measurements of something that is meaningless in and of itself, but is thought to be indicative of some important aspect of a deliverable. As with other benchmarks, there are many diverse methods of measuring the same aspect, none of which is universally accepted over the others. And as with other benchmarks, they are meaningful only if the deliverable (i.e. the regression) is not developed with the benchmark in mind.

Code Coverage is not Inherently Meaningful

Many will take exception to the claim that code coverage measurements are not necessarily meaningful. After all, what could be more thorough than executing every single line of code?

In my experience, if the development of diagnostics is guided by code coverage rather than an understanding of the design, then only about half of the bugs will be found by increasing the measured coverage from 0% to its maximal value (e.g. 99%). The remainder of the bugs will be found by tests that do not improve the coverage measurements, or by tests outside the regression (in the worst case, by your customers). Experiencing bugs outside of the regression is particularly costly, because it may be difficult or impossible to isolate and diagnose the problem.

So, then, what accounts for these "dark" bugs?

Combinatorial Coverage

Many bugs are detected only when a particular combination of design elements is covered simultaneously. Fully exercising every design element in isolation is not sufficient to detect such bugs. Consider the following subroutine:
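	#include <assert.h>
	/* Asserts that a and b are not both zero. */
	void ex1(int a, int b) {
		if(!a) {
			assert(b);	/* a is zero, so b must not be */
		}
		if(!b) {
			assert(a);	/* b is zero, so a must not be */
		}
	}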
Every line and conditional is covered by the cases (0, 1) and (1, 0), but the case (0, 0) exposes a bug.
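
One way to see the gap is to enumerate the combinations rather than the lines. The following is only a sketch: ex1_holds() and count_failing_combinations() are hypothetical names, and ex1_holds() is a non-aborting stand-in for ex1 that reports whether the assertions would hold.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, non-aborting stand-in for ex1: returns true when both of
 * the original assertions would hold for the given inputs. */
static bool ex1_holds(int a, int b) {
    if (!a) {
        if (!b) return false;   /* assert(b) would fail */
    }
    if (!b) {
        if (!a) return false;   /* assert(a) would fail */
    }
    return true;
}

/* Tries every zero/non-zero combination of the two inputs and returns the
 * number of combinations for which an assertion would fail. */
static int count_failing_combinations(void) {
    static const int values[] = {0, 1};
    int failures = 0;
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            if (!ex1_holds(values[i], values[j])) {
                failures++;
            }
        }
    }
    return failures;
}
```

With two inputs the full cross product is only four cases. The number of cases grows exponentially with the number of interacting elements, which is why exhaustive enumeration rarely scales to a real design.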

While it is a good idea to architect the system such that interdependencies among subsystems are minimized, there usually remain interdependencies that are beyond the comprehension of modern coverage tools. Furthermore, many real systems are poorly architected, which makes the problem even more intractable.

Observability

Bugs are not detected unless a diagnostic simultaneously provokes the bug and makes the effect of the bug visible to the verification apparatus. Simply provoking the bug is not sufficient. (This is actually just a special case of Combinatorial Coverage.)

Observability problems can sometimes be addressed by either adding assertions to the design or adding properties to the Verification Specification. There are legacy maintenance issues associated with adding such constraints, so it is often preferable to address observability by improving diagnostics unless the required effort is prohibitive.
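
As an illustration, here is a sketch in which two diagnostics apply the same stimulus and achieve the same coverage, but only one of them observes the effect of the bug. The counter structure, its deliberate bug, and both check functions are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical design element: a counter with a sticky overflow flag. */
struct counter {
    unsigned value;
    bool overflowed;
};

/* Should clear both fields. */
static void counter_reset(struct counter *c) {
    c->value = 0;
    /* BUG (deliberate): c->overflowed is not cleared. */
}

/* Weak diagnostic: provokes the bug but observes only `value`. */
static bool weak_check(void) {
    struct counter c = { 41u, true };
    counter_reset(&c);
    return c.value == 0;
}

/* Stronger diagnostic: same stimulus, but observes the whole state. */
static bool strong_check(void) {
    struct counter c = { 41u, true };
    counter_reset(&c);
    return c.value == 0 && !c.overflowed;
}
```

Both checks execute every line of counter_reset(), so a coverage tool cannot tell them apart. Only strong_check() reports the failure.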

Bug Modeling

Covering a design element does not necessarily imply exposing all the possible bugs associated with that design element, because a given element might operate correctly most of the time and still be incorrect. Consider the following subroutine:
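	#include <assert.h>
	void ex2(int a, int b) {
		/* The sum is assumed to fit in an int. Signed overflow is
		   undefined in C; on many targets it wraps to a negative value. */
		int c=a+b;
		if(c<a) {
			assert(b<0);	/* the sum shrank, so b should be negative */
		}
	}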
Every line and conditional is covered by the cases (0, 0) and (0, -1), but the case (INT_MAX, 1) exposes a bug.
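
A diagnostic written with overflow in mind as a likely bug would aim at the boundary values directly. The helper below is a sketch under that assumption. add_would_overflow() is a hypothetical name, and it tests the operands before adding so that it never performs the overflowing addition itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: reports whether a + b would not fit in an int.
 * The comparison is rearranged so that no intermediate value overflows. */
static bool add_would_overflow(int a, int b) {
    if (b > 0) {
        return a > INT_MAX - b;   /* sum would exceed INT_MAX */
    }
    if (b < 0) {
        return a < INT_MIN - b;   /* sum would fall below INT_MIN */
    }
    return false;                 /* adding zero never overflows */
}
```

The two cases that give ex2 full coverage, (0, 0) and (0, -1), are both reported as safe. (INT_MAX, 1) is flagged. Coverage alone would never have suggested trying it.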

Other Coverage Limitations

Beyond the limitations of specific code coverage metrics, there are several reasons not to ascribe too much importance to coverage without also addressing the considerations below.

Checking

Even if a bug is exposed by a diagnostic, it will remain undiscovered unless it is detected by the verification apparatus. The verification apparatus probably needs to be more efficient and reliable than humans observing the behavior under similar conditions after every change. The goal is to have an automated system that fails every incorrect behavior and no correct behaviors.
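
One common form of automated checking compares the design against a slower but obviously correct reference model. The sketch below assumes that approach. The bit-counting functions, the deliberately broken variant, and the fixed stimulus list are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned (*popcount_fn)(uint32_t);

/* Reference model: slow, but easy to verify by inspection. */
static unsigned popcount_ref(uint32_t x) {
    unsigned n = 0;
    for (int i = 0; i < 32; i++) {
        n += (x >> i) & 1u;
    }
    return n;
}

/* Implementation under test: clears the lowest set bit on each pass. */
static unsigned popcount_fast(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Deliberately broken implementation: ignores the top bit. */
static unsigned popcount_buggy(uint32_t x) {
    unsigned n = 0;
    for (int i = 0; i < 31; i++) {
        n += (x >> i) & 1u;
    }
    return n;
}

/* Checker: returns how many stimuli produce a result that differs
 * from the reference model. Zero means the implementation passed. */
static size_t count_mismatches(popcount_fn impl) {
    static const uint32_t stimuli[] = {
        0u, 1u, 0x80000000u, 0xFFFFFFFFu, 0x12345678u
    };
    size_t mismatches = 0;
    for (size_t i = 0; i < sizeof stimuli / sizeof stimuli[0]; i++) {
        if (impl(stimuli[i]) != popcount_ref(stimuli[i])) {
            mismatches++;
        }
    }
    return mismatches;
}
```

The checker is only as good as its stimuli. Without the two values that set the top bit, the broken implementation would pass.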

Design Modeling

The thing that you are testing is not necessarily your product, and therefore passing the regression does not necessarily mean that the product is correct. This is particularly important in the field of integrated circuit design, because code is not guaranteed to synthesize into something that has the same logical function as the simulation, and because delay and/or noise might cause the product to behave differently. In the software regime, one must still consider the effect of removing non-production instrumentation (e.g. assertions) and of the verification apparatus itself.
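
The effect of removing assertions can be sketched as follows. The standard assert() macro expands to ((void)0) when NDEBUG is defined, so any side effect inside its argument disappears from the production build. To show both builds side by side, this sketch uses two hypothetical macros, CHECK_ENABLED and CHECK_DISABLED, as stand-ins. The device initialization routine is likewise invented for the example.

```c
#include <assert.h>

/* Hypothetical stand-ins for assert() as built without and with NDEBUG.
 * A real assert() would also abort when the expression is false. */
#define CHECK_ENABLED(expr)  ((void)(expr))
#define CHECK_DISABLED(expr) ((void)0)

static int init_calls;

/* Hypothetical initialization routine with a side effect. */
static int init_device(void) {
    init_calls++;
    return 1;   /* success */
}

/* Start-up as seen by the regression: the check runs init_device(). */
static int startup_verified(void) {
    init_calls = 0;
    CHECK_ENABLED(init_device());
    return init_calls;
}

/* Start-up as shipped: the check, and the call inside it, are gone. */
static int startup_production(void) {
    init_calls = 0;
    CHECK_DISABLED(init_device());
    return init_calls;
}
```

The regression passes because the device is initialized once. The shipped code never initializes it at all.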

One way to combat design modeling issues is to devote some resources to verifying a model that has as much fidelity to the actual product as possible, regardless of its inefficiency in comparison to the usual verification model. This normally occurs shortly before release. Another approach is to use some form of static code analysis, such as lint or code reviews.

Initial Design Quality

One must be careful not to equate coverage with quality. According to testability theory, the final quality, Q, as a function of initial quality, Y, and coverage, c, is given by:

$Q = Y^{1-c} \approx 1 - (-\ln Y)\,(1 - c)$

Therefore, the level of initial design quality also plays an important role in determining final design quality.
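
As a worked illustration with made-up numbers (not measured data), take a coverage of c = 0.9 and compare a poor initial quality, Y = 0.5, with a good one, Y = 0.9. The right-hand column is the linear approximation, which is close when (-ln Y)(1 - c) is small:

```latex
\begin{align*}
Y = 0.5:\quad Q &= 0.5^{0.1} \approx 0.9330, & 1 - (-\ln 0.5)(0.1) &\approx 0.9307 \\
Y = 0.9:\quad Q &= 0.9^{0.1} \approx 0.9895, & 1 - (-\ln 0.9)(0.1) &\approx 0.9895
\end{align*}
```

At the same coverage, the shortfall 1 - Q is roughly six times larger for the poorer initial design (about 0.067 versus about 0.010).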

Alternatives to Code Coverage

With presently available technology, the best way to detect bugs is to have good design verification engineers who understand both the specification and the implementation, have the intuition to know what sort of bugs are likely to occur, and are competent enough to develop diagnostics that would detect any such bugs. Since knowledge of the design is required, it is a good idea to have the developers themselves involved in the design verification role. This is a black art, but when performed correctly it is considerably more reliable than having good coverage numbers.

Another alternative to code coverage as a metric of testing is to insert a statistically meaningful number of realistic bugs to see how many are detected. Unfortunately, automated means of bug insertion tend to be very unrealistic, and manual bug insertion is time-consuming and still not necessarily realistic. It is also somewhat dangerous to generate corrupted code, because somebody might mistake it for production code and incorporate it into a release.
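
A toy version of manual bug insertion might look like the sketch below. Three is nowhere near a statistically meaningful number of bugs. The function under test, the three hand-written corrupted variants, and the small test suite are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*max_fn)(int, int);

/* Function under test. */
static int max_original(int a, int b) { return a > b ? a : b; }

/* Hand-inserted bugs. Named so they cannot be mistaken for production code. */
static int max_bug_returns_min(int a, int b) { return a < b ? a : b; }
static int max_bug_ignores_b(int a, int b)   { (void)b; return a; }
static int max_bug_clamps_negative(int a, int b) {
    int m = a > b ? a : b;
    return m < 0 ? 0 : m;   /* wrong only when both inputs are negative */
}

/* A small suite that fully covers max_original. */
static bool suite_passes(max_fn f) {
    return f(1, 2) == 2 && f(2, 1) == 2 && f(3, 3) == 3;
}

/* Returns how many of the inserted bugs the suite detects. */
static size_t count_detected_bugs(void) {
    static const max_fn bugs[] = {
        max_bug_returns_min, max_bug_ignores_b, max_bug_clamps_negative
    };
    size_t detected = 0;
    for (size_t i = 0; i < sizeof bugs / sizeof bugs[0]; i++) {
        if (!suite_passes(bugs[i])) {
            detected++;
        }
    }
    return detected;
}
```

The suite detects two of the three inserted bugs even though it executes every line and both outcomes of the conditional in max_original(). The surviving bug shows only for negative inputs, which the suite never supplies.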

Using Code Coverage Effectively

Coverage tools should be used to see what you've missed only after you think you're done developing diagnostics, and not to direct the development of diagnostics.

Code coverage metrics are never a substitute for an intelligent and thorough verification effort, but they can be used to make good verification even better.

The only true measure of a verification effort is how effectively and rapidly it discovers real bugs. The problem with code coverage metrics is that they are easily subverted. When an organization attaches great importance to them, it gives itself an incentive to subvert them, and the actual effectiveness of verification suffers accordingly.

Anders Johnson, last modified 2002/02/05
