
Guidelines for Verifying Complex Electronic Designs

Thorough verification is absolutely essential to the success of complex electronic designs, and it is also extraordinarily difficult to achieve. Unfortunately, in the throes of development with looming deadlines, it is all too common for the tenets of verification to be cast aside in favor of accelerating intermediate milestones of limited meaningfulness. A great deal of painfully gained experience stands testament to the fact that this is always a mistake.

This document is intended to present and justify a nontrivial set of ground rules for development that is alleged to ensure that the success of the project is not undermined by ineffective verification. It is hoped that this will help to provide much needed perspective in the heat of battle.

Executive Summary

Verification is Always the True Critical Path to Production

Anything that isn't verified is broken.
— heard at Sun Microsystems, 1993

A reasonable quantitative metric for the complexity of an electronic design can be computed as follows: generate a minimal hierarchical transistor-level netlist; prune out the known-good modules (i.e. the gates and any other modules whose entire sub-hierarchy has been fully verified since its last modification); then count the remaining instance pins, weighting each asynchronous connection by 10 and each analog connection by 100. This complexity metric roughly corresponds to the number of possible design defects.
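
As a minimal sketch of the weighting step only, assuming the pruned netlist has already been reduced to three pin counts (the `PinCounts` structure and its field names are hypothetical, not taken from any tool):

```cpp
#include <cassert>
#include <cstdint>

// Pin counts for the unverified portion of a pruned netlist.
// The structure and field names are illustrative assumptions.
struct PinCounts {
    std::uint64_t synchronous;   // ordinary instance pins
    std::uint64_t asynchronous;  // pins on asynchronous connections
    std::uint64_t analog;        // pins on analog connections
};

// Weighted pin count: asynchronous pins count 10x, analog pins 100x.
inline std::uint64_t complexity_metric(const PinCounts& p) {
    return p.synchronous + 10 * p.asynchronous + 100 * p.analog;
}

// Usage: 80,000 ordinary pins, 1,500 asynchronous pins and 50 analog pins
// land exactly on the 100,000 threshold discussed in the text.
inline void example() {
    assert(complexity_metric({80000, 1500, 50}) == 100000);
}
```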

Assuming that most of the design's features are actually required by a customer, the probability of achieving a production-worthy design given a fixed verification methodology is exponentially decreasing in the number of possible design defects. In my experience, if the complexity metric exceeds 100,000 or so, and it takes at least a week to turn design modifications around, then the true critical path to production is always verification.

For example, if there are 100,000 possible defects, each having a 1% probability of occurring, where each occurring defect has a 99% probability of early discovery, then the probability of a defect-free design is:

[1 - 0.01 × (1 - 0.99)]^100,000 ≈ 0.005% ≈ 0
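
The arithmetic in this example can be reproduced with a few lines of C++. This is only a sketch, and it assumes that the possible defects are independent of one another; the function and parameter names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Probability that a design with n possible defects is defect-free at
// release, assuming independent defects. Each defect occurs with
// probability p_occur, and each occurring defect is discovered early
// with probability p_detect.
inline double defect_free_probability(double n, double p_occur, double p_detect) {
    assert(p_occur >= 0.0 && p_occur <= 1.0);
    assert(p_detect >= 0.0 && p_detect <= 1.0);
    const double p_escape = p_occur * (1.0 - p_detect);
    // (1 - p_escape)^n, computed via log1p to keep precision for tiny p_escape.
    return std::exp(n * std::log1p(-p_escape));
}
```

Calling `defect_free_probability(100000, 0.01, 0.99)` yields a value of roughly 4.5e-5, which rounds to the 0.005% quoted above.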

Understanding this is probably the single most crucial key to success in complex electronic design. Unfortunately, in my experience, most firms view compromising verification in order to accelerate first release to manufacturing as an acceptable risk. However, because diagnosing a problem in real hardware is much more difficult than in simulation or emulation, and because fixing one defect often creates another defect, cutting corners in early design verification always winds up delaying production on sufficiently complex designs.

In reality, in order to produce usable designs of increasing complexity, the importance imparted to design verification must increase proportionately, such that the expected number of defects that escape verification at the time of release remains small. This generally entails a substantial verification effort, definitive and reusable specifications, and designers who go out of their way to make their designs not only correct, but also verifiably correct.

Evolutionary Development

The credible software methodology literature seems to unanimously report that evolutionary development improves both quality and development time substantially [1, 2, 3]. While direct evidence that this applies to electronic design as well is difficult to obtain, my experience has been that the same is even more true of electronic design, with its similar degree of complexity, its even larger number of competing constraints, its less-evolved development environment, and its much longer turn-around time.

Evolutionary development relies on early integration, as well as frequent, thorough quality assessments throughout the development process. Because repeated manual testing is highly uneconomical and unreliable, it is important to establish a thorough automated testing infrastructure by the time of initial integration, even though automating the testing has an initial fixed cost. Therefore, in order to fully reap the benefits of evolutionary development, it is necessary to budget resources for writing tests essentially concurrently with design implementation.

Layers of Specification

Only a proposition, not an entity, can be verified. What we commonly refer to as "verifying a design" is actually verifying the proposition that the design satisfies its specification (even if it's an unwritten specification). Therefore, specification is an essential prerequisite for verification.

Because there are substantial, unpredictable variations in both production processes and operating conditions, useful specifications are always imprecise. Even a cycle-accurate specification is imprecise, because it has nonzero tolerances on signal timing, may allow additional unpredicted signal transition pairs within these tolerances, and may not specify a value for every output signal on every cycle. Note, however, that an imprecise specification can still be completely definitive (meaning that it is always possible to determine whether a given behavior does or does not satisfy the specification).

Specifications generally exist, at least conceptually, in distinct layers. For example:

  1. The Market Requirements Document (MRD) spells out the goals of the product (including its associated software, if any).
  2. The Product Specification specifies the guaranteed properties of the product in terms of its customer visible interfaces (including software interfaces) and at a level of detail that does not require unusual expertise in the product's application domain to understand.
  3. The Design Specification further concerns itself with the product's major internal interfaces (including the hardware/software interfaces), as well as compatibility with and reusability for other products.
  4. The Verification Specification specifies any additional properties that are necessary for it to be practical to determine whether the design satisfies the Design Specification.
  5. The Register Transfer Level Specification (RTL) recites every flip-flop in the design and the logical equations that relate their inputs and outputs.
  6. The Schematic Specification (gate netlist) enumerates the primitive components in the design and their interconnections.

This might seem to require a great deal of specification effort, but in reality it is not significantly more time consuming than specifying everything in a single monolithic specification. It is also easier to maintain, in part because which specification is modified makes it obvious whom a change affects. Of course, you can reduce the specification effort by neglecting to specify some of the properties that need to be relied upon, but in the long run that always winds up causing much more trouble than it saves. On the other hand, specifying properties that nobody relies upon is always a waste of time.

Note that each layer of specification is more precise than the last. (More precise means requiring no more and promising no less.) For example, it is not sufficient for the RTL to satisfy the Design Specification without satisfying the Verification Specification (because the goal is to prove that the design works, rather than to make it impractical to prove that the design doesn't work).

It bears mentioning that specification is generally a multi-contributor activity, because it typically consumes more than one full time body in total, and few individuals have both the talent and the inclination to write specifications for a living. Fortunately, the specification effort parallelizes nicely. For example, the party responsible for one layer of specification need not concern himself with the next lower-level specification, except perhaps to review it for violations of his intent.

Verification Specifications Shouldn't be Cycle-accurate

It seems to be popular these days to make the Verification Specification cycle-accurate. However, using a cycle-accurate model as a replacement for, rather than a sub-specification of, a lenient Verification Specification is troublesome for the following reasons:

Nondeterminism

A cycle-accurate specification is always deterministic, meaning that the output tolerances are always a deterministic function of the input vectors. Many useful systems (systems having unrelated clocks in particular) are nondeterministic, at least on a cycle-by-cycle basis. A cycle-accurate specification cannot definitively specify such a system.

Specification Complexity

A specification that comprehends the complex signal timing relationships of the design is necessarily substantially more complicated than one that doesn't.

Because a cycle-accurate specification is complex, determining whether it actually embodies the intended properties of the system requires a great deal of effort (especially when the intended properties are not documented). If this task is neglected, then the design is very likely to inherit defects from the specification, and you'll have no way of knowing about them until customers start complaining, at which point they are extremely costly either to tolerate or to correct.

Furthermore, because the design depends on the cycle-accurate specification, and complex specifications require more time to develop, the product's schedule is adversely impacted.

Design Flexibility

Perhaps the greatest problem with cycle-accurate specifications is that they overconstrain the design. That's because the optimal cycle relationships depend on an extremely complicated set of design trade-offs, and therefore cannot be accurately predicted a priori.

If you are conservative and specify cycle relationships that are easily achieved, then the design becomes suboptimal. The overall effect of this can be substantial. For example, more latency means more buffering, which means more area, which means longer wires, which means longer delays.

On the other hand, aggressive cycle relationships will need to change frequently to track design changes. That requires a great deal of maintenance, especially when designers can't distinguish between changes that affect the specification and those that do not.

Even worse, when the cycle relationships in the specification are modified, the intended properties are not necessarily preserved, so the specification itself must then be re-verified. If this is a manual process, then the effort required to carry it out after every change would be prohibitive. If it's an automatic process, then that process would probably serve as an excellent basis for a lenient specification.

Lenient Specifications

A lenient specification is a specification that is less precise than a cycle-accurate specification. The goal of a lenient specification is to specify only those properties that are meaningful and useful to customers of the design (meaning the product's customers, as well as designers of future products). A lenient specification should specify additional properties (i.e. overspecify) only when it is otherwise impractical to fully verify the meaningful properties. In other words, a lenient Verification Specification should be as similar to the Design Specification as possible without inciting test coverage issues or making the checking process computationally infeasible.

The advantages of lenient specification are that the specification is easy to maintain, that design efficiency is not compromised, and that violations of the specification are almost always real issues that the design's customers actually care about. The disadvantages are that it is more challenging, both conceptually and in terms of initial implementation, and that violations are not detectable until they propagate to a major interface, which might be much later than the point at which the defect was actually provoked.

A lenient specification typically uses some combination of fuzzy modeling and partially ordered transactions to describe the set of allowable behaviors, although in principle there may be other effective techniques of which I'm not aware.

Fuzzy Modeling

Fuzzy modeling is an extension of using unknowns in logic simulation. In the most common case, the numerical value of a bit vector is allowed to fall within a specified range, rather than having a precise value or having certain bits individually unspecified.

In principle, you can always model fuzziness with simple unknowns, but in most cases you don't want to. For example a bit vector of width 8 whose allowable numerical range is 0x1e through 0x20 could be modeled as 8'b00xxxxxx, but that would also allow the numerical values 0x00 through 0x1d and 0x21 through 0x3f, which isn't correct in this case.
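
The difference between the two models can be sketched as follows. Both types are hypothetical illustrations, not part of any existing checking library.

```cpp
#include <cassert>
#include <cstdint>

// Per-bit unknowns: bits set in x_mask are "x"; all other bits must
// match the corresponding bits of value.
struct MaskedValue {
    std::uint8_t value;
    std::uint8_t x_mask;
    bool admits(std::uint8_t v) const {
        return ((v ^ value) & static_cast<std::uint8_t>(~x_mask)) == 0;
    }
};

// Fuzzy value: any number in the inclusive range [lo, hi] is allowed.
struct FuzzyRange {
    std::uint8_t lo;
    std::uint8_t hi;
    FuzzyRange(std::uint8_t lo_, std::uint8_t hi_) : lo(lo_), hi(hi_) {
        assert(lo <= hi);
    }
    bool admits(std::uint8_t v) const { return lo <= v && v <= hi; }
};
```

With `MaskedValue{0x00, 0x3f}` standing in for 8'b00xxxxxx, `admits(0x00)` and `admits(0x21)` both return true, whereas `FuzzyRange(0x1e, 0x20)` rejects both and accepts only 0x1e, 0x1f, and 0x20.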

Fuzzy modeling can allow behaviors that you don't expect. For example, if a counter is expected to be decremented from 0x31 to 0x2a at some imprecise moment, then it can be modeled as having the range 0x2a through 0x31 during the uncertain time interval. However, this allows the counter to jump from 0x31 to 0x2d to 0x30 to 0x2a. If that's considered a bug, then don't use fuzzy modeling.
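
To make the loophole concrete, the sketch below compares a range-only check with one hypothetical stricter alternative that also forbids the counter from increasing. Whether the stricter rule is the right one depends on the actual specification; it is shown only as an illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Range-only (fuzzy) check: every sample must lie in [lo, hi].
inline bool within_range(const std::vector<std::uint8_t>& samples,
                         std::uint8_t lo, std::uint8_t hi) {
    assert(lo <= hi);
    for (std::uint8_t s : samples)
        if (s < lo || s > hi) return false;
    return true;
}

// Stricter check: in range, and never increasing between samples.
inline bool within_range_non_increasing(const std::vector<std::uint8_t>& samples,
                                        std::uint8_t lo, std::uint8_t hi) {
    if (!within_range(samples, lo, hi)) return false;
    for (std::size_t i = 1; i < samples.size(); ++i)
        if (samples[i] > samples[i - 1]) return false;
    return true;
}
```

The sequence 0x31, 0x2d, 0x30, 0x2a passes `within_range` for the range 0x2a through 0x31 but fails `within_range_non_increasing`, because 0x30 follows 0x2d.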

Partially Ordered Transactions

It is common for specifications to be written in terms of transactions. A transaction is loosely defined as an abstract chunk of data that is determined by observing a set of signals over some number of cycles. In particular, this is a good way to specify asynchronous systems, because there is nothing that prevents transactions observed from different clock domains from interacting. It also makes it easy to change the physical format of data in the design without affecting a large portion of the specification.

To model nondeterminism, the specification simply refrains from imposing a full ordering on the expected transactions. Instead, the expected transactions may be partially ordered (meaning that certain pairs of expected transactions have a required sequence), and the allowable latency of an expected transaction can be constrained.
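
A minimal sketch of such a checker follows. It assumes that transactions can be compared by an abstract payload string and that latency is constrained by an absolute deadline cycle; the class and member names are hypothetical. It also matches greedily against the first eligible expectation, which is a simplification that a production checker would need to revisit when payloads can repeat.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One expected transaction. All names here are illustrative.
struct Expected {
    std::string payload;             // abstract transaction content
    std::vector<std::size_t> after;  // expectations that must be matched first
    std::uint64_t deadline;          // latest cycle at which it may be observed
    bool matched;
};

class PartialOrderChecker {
public:
    // Registers an expectation and returns its index, which later
    // expectations may list in `after` to impose a required sequence.
    std::size_t expect(std::string payload, std::vector<std::size_t> after,
                       std::uint64_t deadline) {
        for (std::size_t i : after) assert(i < expected_.size());
        expected_.push_back(
            Expected{std::move(payload), std::move(after), deadline, false});
        return expected_.size() - 1;
    }

    // Accepts an observed transaction if some unmatched expectation has the
    // same payload, an unexpired deadline, and only matched predecessors.
    bool observe(const std::string& payload, std::uint64_t cycle) {
        for (Expected& e : expected_) {
            if (e.matched || e.payload != payload || cycle > e.deadline) continue;
            if (!predecessors_matched(e)) continue;
            e.matched = true;
            return true;
        }
        return false;  // not an allowable behavior at this point
    }

    // True once every expectation has been matched.
    bool done() const {
        for (const Expected& e : expected_)
            if (!e.matched) return false;
        return true;
    }

private:
    bool predecessors_matched(const Expected& e) const {
        for (std::size_t i : e.after)
            if (!expected_[i].matched) return false;
        return true;
    }
    std::vector<Expected> expected_;
};
```

If "C" is registered as required to follow "A", while "B" is unordered, then observing B, A, C or A, C, B is accepted, but observing C before A is rejected.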

Observed transactions are not necessarily derived from specified interfaces. Observed transactions that are neither input transactions nor expected transactions are classified as hint transactions. They provide a mechanism for the implementation to indicate to the specification which of the allowable behaviors should be anticipated. One drawback to using a lot of hint transactions is that they require maintenance to track design changes that affect their generation.

Any behavior that results from any hint transaction that does not itself violate requirements is deemed an allowable behavior. This is a subtle but important point, because a carelessly crafted specification might accidentally allow unintended behaviors when hint transactions are used. For example, a specification that observes an output as a hint, and then verifies the output against the hint it just observed, is equivalent (though not obviously so) to a specification that does not specify the output at all.
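
One way to avoid that trap is to validate every hint against the requirements before trusting it. The sketch below assumes a hypothetical arbiter whose specification allows a grant to any active requester; the hint says which requester the design has chosen, and the checker rejects hints that name a requester that was not requesting. All names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical arbiter checker: the specification allows a grant to any
// requester whose bit is set in request_mask. A hint transaction from the
// design tells the checker which of those requesters to expect.
class ArbiterChecker {
public:
    // Validates the hint against the requirements before trusting it.
    // Returns false if the hinted requester was not actually requesting.
    bool accept_hint(std::uint8_t request_mask, unsigned hinted) {
        assert(hinted < 8);
        if (((request_mask >> hinted) & 1u) == 0) return false;  // illegal hint
        expected_ = hinted;
        has_expectation_ = true;
        return true;
    }

    // Checks the observed grant against the validated hint.
    bool check_grant(unsigned granted) {
        if (!has_expectation_) return false;  // grant without a valid hint
        has_expectation_ = false;
        return granted == expected_;
    }

private:
    unsigned expected_ = 0;
    bool has_expectation_ = false;
};
```

Because `accept_hint` rejects a choice that the specification does not allow, a design that grants an idle requester is still caught even when the hint and the grant agree with each other.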

Although hint transactions must be carefully specified, they are often indispensable. Without hints to expediently prune the set of allowable behaviors, the checking problem is likely to become exponential in the number of outstanding transactions. That makes the design impractical to verify.

Note that accepting hints from the design does not corrupt the specification with implementation information, because the format of hints is defined by the specification rather than the design, and the discretionary content of the hints is always correct by definition.

It takes a little while to appreciate the power of modeling uncertainty this way, but in my experience I have yet to encounter a useful Design Specification that cannot be effectively expressed as a Verification Specification with partially ordered transactions.

Micro-architecture Specification

Whether or not there should be a Micro-architecture Specification that comprises a layer between the Verification Specification and the RTL is open to debate. A Micro-architecture Specification is usually cycle-accurate at the major internal interfaces recited by the Design Specification.

On the one hand, a Micro-architecture Specification detects provoked defects promptly (and therefore they are easier to diagnose), and it provides a very rapid means of simulation with fidelity to the cycle behavior of the design. On the other hand, a great deal of additional effort is required to develop it and to verify it against the Verification Specification, and even more effort to track design timing changes and repeatedly re-verify. Verifying the Micro-architecture Specification is absolutely mandatory, because tracking timing changes in the design is very likely to cause some design defects to be tracked as well, which makes them otherwise impossible to detect.

Unlike the other layers of specification above RTL, a Micro-architecture Specification will probably require an effort that is significant in comparison to the design effort. You need to weigh that against the potential benefits.

Executable Specification

It is generally appropriate for the MRD to be informal. However, the Verification Specification really needs to be completely definitive. Otherwise, the development schedule tends to be dominated by debates over what constitutes a defect.

Unfortunately, definitive natural language specifications tend to be prohibitively difficult to read. (For example, try reading any part of the U.S. Tax Code.) Even worse, it is common for two readers of a natural language specification both to believe that it is definitive, and yet have mutually contradictory interpretations. (This is especially true of unwritten specifications.) For these reasons, it is a practical necessity that definitive specifications be executable.

An executable specification is a program that observes the stimulus and response of a design (typically in the context of a simulation), and determines whether the observed behavior satisfies the specification under the given stimulus conditions. If not, it should produce sufficient diagnostic output for a human to determine at least one reason that the specification is not satisfied. An executable specification is also called a checker. A lenient executable specification is also called a smart checker.
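
As a minimal sketch of what a checker looks like, assume a hypothetical specification under which every input word must reappear at the output, in order, within a bounded number of cycles. The class below observes stimulus and response and accumulates human-readable diagnostics; its names and interface are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical specification: every input word must reappear at the output,
// in order, no later than max_latency cycles after it was accepted.
class PassThroughChecker {
public:
    explicit PassThroughChecker(std::uint64_t max_latency)
        : max_latency_(max_latency) {}

    void observe_stimulus(std::uint32_t word, std::uint64_t cycle) {
        pending_.push_back(Item{word, cycle});
    }

    void observe_response(std::uint32_t word, std::uint64_t cycle) {
        if (pending_.empty()) {
            fail(cycle, "unexpected output word");
            return;
        }
        const Item head = pending_.front();
        pending_.pop_front();
        assert(cycle >= head.cycle);  // responses cannot precede their stimulus
        if (word != head.word) {
            fail(cycle, "output data mismatch");
        } else if (cycle - head.cycle > max_latency_) {
            fail(cycle, "latency bound exceeded");
        }
    }

    bool passed() const { return diagnostics_.empty(); }
    const std::vector<std::string>& diagnostics() const { return diagnostics_; }

private:
    struct Item {
        std::uint32_t word;
        std::uint64_t cycle;
    };

    // Records one reason that the specification is not satisfied.
    void fail(std::uint64_t cycle, const char* why) {
        diagnostics_.push_back("cycle " + std::to_string(cycle) + ": " + why);
    }

    std::uint64_t max_latency_;
    std::deque<Item> pending_;
    std::vector<std::string> diagnostics_;
};
```

In a simulation, monitors on the design's input and output interfaces would call `observe_stimulus` and `observe_response`, and the test would fail if `passed()` is false at the end of the run.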

Currently, C++ stands out as the language of choice for executable specifications, whether they are lenient or cycle-accurate. In order to be useful, an executable specification must be much faster than the RTL simulation that it verifies, and ideally about as fast as emulation. C++ has a speed advantage of 100× to 1000× over its chief competitors, Vera and Specman. (Verilog and VHDL are not serious candidates, because they don't have dynamic data allocation.) Furthermore, C++ has a mature feature set, including templates, exceptions, multiple inheritance, and a very useful and general standard library. It's also free and stable, and you can even obtain free hardware description libraries (such as cynlib) for it.

Unfortunately, C++ also has a very steep learning curve. You can compensate for this to some extent by using coding guidelines. (Some sample coding guidelines can be found here.) However, coding guidelines are meaningless unless they are enforced, because the contributors who are most likely to submit mistakes are often also the most likely to ignore the guidelines.

As far as I'm aware, there are no readily available libraries in any language for smart checking. However, it is necessary to abstract the checking functionality from the behavioral description functionality if the smart checker is to be practical to understand and maintain. You're pretty much stuck with writing a checking library yourself, but fortunately this library can be reused for every project.

Validating the Specification

Validating the specification is not as tautological as it might sound. It is not at all uncommon for the specified properties not to match the intent of the specification writer, and detecting such discrepancies is not impossible.

Architecture Simulation

The most reliable means of validating that the behavior of a specification is consistent with the intended behavior is to simulate the specification and observe the behavior. In order to do this, there must be a means of computing at least one of the allowable behaviors in response to any stimulus.

For the RTL, such simulators are readily available. On the other hand, simulating a lenient specification is not so straightforward. The recommended approach is to build simulation functionality into the checking library, such that a given behavioral description can be used as either a checker or a simulator. This is a nontrivial programming design challenge. However, when it is implemented correctly, the Design Specification and the Verification Specification are both executable, and share the vast majority of the product-specific description code, such that they are guaranteed to be consistent.
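
The idea of one description serving both roles can be sketched in miniature. The example assumes a hypothetical rule that a burst of `beats` data beats completes between 2 and 2 + beats cycles after the request; the rule is written once in `describe`, and both the checker and the simulator are derived from it. A real checking library would generalize this far beyond a single latency window.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive window of allowable response latencies, in cycles.
struct LatencyWindow {
    std::uint32_t min_cycles;
    std::uint32_t max_cycles;
};

// Hypothetical behavioral description, written once: a burst of `beats`
// data beats must complete between 2 and 2 + beats cycles after the request.
inline LatencyWindow describe(std::uint32_t beats) {
    return LatencyWindow{2u, 2u + beats};
}

// Checker mode: is the observed latency one of the allowable behaviors?
inline bool check(std::uint32_t beats, std::uint32_t observed_latency) {
    const LatencyWindow w = describe(beats);
    return w.min_cycles <= observed_latency && observed_latency <= w.max_cycles;
}

// Simulator mode: deterministically pick one allowable behavior.
inline std::uint32_t simulate(std::uint32_t beats) {
    const LatencyWindow w = describe(beats);
    assert(w.min_cycles <= w.max_cycles);
    return w.min_cycles;
}
```

Because both modes call the same `describe`, the simulated behavior always passes the check, and the deterministic choice in `simulate` keeps simulation results repeatable from run to run.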

If the executable Design Specification is written in C++, then it is possible to supply customers with its object code, such that they can simulate and approve the proposed behavior, without exposing the implementation (assuming that you trust your customers not to reverse-engineer the object code, which is probably prohibitively difficult anyway). Customers must understand that the architectural simulator is not absolutely definitive, because it does not necessarily exhibit all of the allowable behaviors.

In order for it to be practical to validate a changing specification, it is necessary to accumulate a battery of regression tests to be re-simulated after each change. It is generally adequate to perform a relatively unintelligent comparison of the results against the previously approved results. There will be cases in which the behavior changes without violating the intent of the specification, and in those cases the specification writer should simply update the expected behavior. This approach is practical only if the simulator does not use a random process to select among allowable behaviors.

Rough Modeling

If the software development schedule is particularly critical, then it probably makes sense to develop a simplified hardware model whose software interfaces satisfy all of the guarantees that the real hardware makes to the software. This model can be used to verify the software and to guide hardware development, but it cannot be used for verifying hardware because it has no formal fidelity to the hardware. However, it may share code with the executable Design Specification.

The rough model may simply comprise an early version of the Design Specification, but maintaining it as a separate entity makes sense if it runs much faster.

Application-level Properties

Another approach toward validating a specification is to check that intended properties of the integrated system are achievable, even if they don't make sense in the isolated context of the specified component.

For example, if the application demands that a particular action is taken in response to a case that is sufficiently rare to be handled in software, then you can check that either the action is taken by hardware or that sufficient information is supplied to the software that the action can be taken.

The main drawback to this approach is that it requires modeling the entire application (without regard for the hardware/software partition). That's probably rather involved, so it might not be ready until the design is near completion, assuming that it is deemed worthwhile to do it at all. On the other hand, the earlier that such problems are discovered, the better.

Verification Responsibilities of Designers

In order for the verification effort to be effective, Design must assume some responsibilities in support of Verification, in particular, those that require intimate knowledge of the design. Because fostering cooperation between the design and verification functions is necessary, it is essential that designers not be criticized for defects, except for defects that are discovered late in the development cycle because the designer neglected his verification responsibilities. Those responsibilities include:

Pointing Out Vulnerabilities

Designers are expected to bring any implementation-specific vulnerabilities to the attention of test writers. For example, if the implementation contains a FIFO that isn't recited by the specification, then verification needs to make sure the full and empty states of the FIFO are covered by some test.
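
For the FIFO example, the coverage goal could be tracked with something as small as the following sketch, which assumes the FIFO's occupancy can be sampled every cycle. The class name and interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Records whether a FIFO of the given depth was ever observed empty and full.
class FifoCoverage {
public:
    explicit FifoCoverage(std::size_t depth) : depth_(depth) {
        assert(depth > 0);
    }

    // Call once per cycle with the current number of stored entries.
    void sample(std::size_t occupancy) {
        assert(occupancy <= depth_);
        if (occupancy == 0) seen_empty_ = true;
        if (occupancy == depth_) seen_full_ = true;
    }

    // True only when both boundary states have been exercised.
    bool covered() const { return seen_empty_ && seen_full_; }

private:
    std::size_t depth_;
    bool seen_empty_ = false;
    bool seen_full_ = false;
};
```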

Defining Invariants

An invariant is some implementation-specific property of the state of a design that is alleged always to hold. By stating an invariant, it is easier to reason about whether it actually holds, and what guarantees hold regarding the behavior of the system as a result. Invariants are especially useful when failures that result from violating the invariant may be difficult to observe. Such invariants are called subtle.
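
As an illustration, here is a sketch of an invariant for a hypothetical FIFO implemented with free-running read and write pointers. The state structure and the particular invariant are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical FIFO state sampled from the design each cycle.
struct FifoState {
    std::uint32_t write_ptr;  // free-running count of words written
    std::uint32_t read_ptr;   // free-running count of words read
    std::uint32_t depth;      // capacity in words
    bool full_flag;
    bool empty_flag;
};

// Invariant: occupancy never exceeds depth, and the full and empty flags
// agree with the occupancy implied by the pointers.
inline bool invariant_holds(const FifoState& s) {
    assert(s.depth > 0);
    // Unsigned subtraction is modular, so pointer wraparound is handled.
    const std::uint32_t occupancy = s.write_ptr - s.read_ptr;
    return occupancy <= s.depth
        && s.full_flag == (occupancy == s.depth)
        && s.empty_flag == (occupancy == 0);
}
```

A stale empty flag might not corrupt any output for thousands of cycles, which is what makes this kind of invariant subtle; checking it every cycle flags the violation at the moment it occurs.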

It is a good idea to incorporate important subtle invariants into the Verification Specification, such that violations of invariants will be detected promptly. Of course, that also means that the Verification Specification needs to track changes to those invariants.

Invariants can usually be verified at the module level.

Supplying Hint Transactions

Because hint transactions derive from implementation-specific signals by definition, supplying those transactions to the smart checker is most easily accomplished by the designers. This turns out to be a point of contention, because it requires the designers to understand a little about programming. However, if you make somebody else responsible for extracting the hints from the design, then changing the design in such a way that the hints must be extracted differently requires a great deal of coordination.

Because designers might want to take advantage of some of the unique features of C++ themselves, the most reasonable interface at which to draw the boundary between Design's responsibility and Verification's responsibility is in C++. This has the unfortunate consequence that Design winds up with the responsibility for the tedious data marshaling, for example, from Verilog to PLI to C to RPC to C to C++. It is recommended that scripts be written to automate the generation of such code.

Because hint transactions will also need to be supplied after the netlist is optimized, it is recommended that transactors should rely only on terminals of unflattened modules. Even that isn't necessarily completely safe, but it ought to minimize the amount of necessary post-optimization tracking. Such tracking is also most effectively carried out by designers.

It is observed that hint transactions can often be derived from signals that are already used for module-level verification, so you can sometimes kill two birds with one stone by having the module checker supply the hint transaction to the smart checker.

Negotiating the Verification Specification

The Verification Specification should be negotiated between Design and Verification under the constraint that it must satisfy the Design Specification. Topics of this negotiation include invariants and hint transactions. It is important that the result optimizes the success of the organization, rather than the convenience of one side over the other.

Responding with Integrity to Problem Reports

Designers often believe that it is their responsibility to dispatch problem reports as quickly as possible. The easiest way to accomplish this is to immediately claim that the problem isn't a bug, that it's somebody else's fault, or that private changes have already addressed it. Unfortunately, although that makes the designer's bug response statistics look good, it's also quite counterproductive.

Every problem report should be addressed with at least some analysis. If the designer believes he is not at fault, then he still needs to explain in the problem report why this is the case. Furthermore, it is important for the owner of a bug to be the person who is empowered to act on it, and therefore a problem report should not be closed until the fix is made available (and therefore passes the current regression).

Using Prudent Design Practices

There are a number of questionable design practices that require special attention when they are utilized (in particular because they tend to be modeled optimistically). It is the responsibility of the designer to avoid such practices by default, and failing that, to point out to verification and to others that the practice was employed.

Leveraging Verification Effort

Because the verification effort is becoming an increasingly dominant portion of the overall effort required to successfully execute complex electronic designs, it is extremely valuable to leverage verification infrastructure among similar designs.

Specification Reuse

Specification reuse is essential for sustainable verification leverage. If every product has its own copy of the entire specification, then the specifications can be expected to diverge to the extent that the majority of the effective verification infrastructure will be unique for every product, unless a heroic effort is made to keep them consistent. Poorly structured specifications are extremely troublesome, because once the intent of the specification is decimated, it's hopeless to recover it downstream, and the remainder of the development process is then confounded.

If the product roadmap is planned properly, then the majority of any new product specification will consist of content that is shared as-is with other products. This is something of a delicate art, but it pays off sooner than you think. See On Planning.

It is essential that customers appreciate that they mustn't rely on unspecified properties, because otherwise future compatible specifications must be augmented to include all of the properties on which customers rely. It is furthermore then technically impossible to add features, because customers might rely on the exact behavior of the product. This, of course, means that the Product Specification must recite properties in sufficient detail for all promised features to be utilized. The same goes for internal interface properties in the Design Specification, whose customers are the verification infrastructure and the design of the subsystems that use those interfaces.

It is also essential that the unique differences between any two specifications (including two versions of the same specification) can be determined. Otherwise, concurrent contributors cannot be accommodated, and updates cannot be propagated in a timely manner, because it takes too much time to figure out what changed. This argues strongly for specifications to be in some text-based format, such as HTML or SDF.

Specifications mustn't be modified at will or without notice. Changes must be made only in consideration of the ramifications, which are usually substantial and beyond the understanding of any single individual. This is best achieved through cooperative negotiation among affected parties.

Software Reuse

Assuming that the product interacts with software, the Product Specification should define the software interfaces (collectively known as an application programming interface, or API) that the product supports. In order to promote specification reuse and backward compatibility, this API should change as little as possible from one product to the next, and any changes should conform to the open-closed principle.

To make software compatibility maintainable while retaining the freedom to further optimize the implementation, the API should be defined at a level of abstraction that is meaningful to the application, rather than to the implementation. The product should then include a device driver software component that translates the API into low-level hardware accesses. The driver compensates for changes to the implementation without affecting clients of the API.

For example, the API should be aware of the entries within a given lookup table, but it should not be aware of whether the hardware does a binary lookup or a hash lookup, and should not be aware of the number of copies of the table that exist in hardware.
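
As a minimal sketch of that separation, the following C++ defines a table API in terms of entries only, with two interchangeable drivers behind it. All of the names here (LookupTable, HashDriver, BinaryDriver, lookup_or) are hypothetical, and the drivers merely stand in for code that would translate the calls into hardware accesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical application-level API: clients see table entries, nothing more.
class LookupTable {
public:
    virtual ~LookupTable() = default;
    virtual void set_entry(std::uint32_t key, std::uint32_t value) = 0;
    // Returns true and fills `value` if `key` has an entry.
    virtual bool lookup(std::uint32_t key, std::uint32_t& value) const = 0;
};

// Stand-in for a driver whose hardware does a hash lookup.
class HashDriver : public LookupTable {
public:
    void set_entry(std::uint32_t key, std::uint32_t value) override { table_[key] = value; }
    bool lookup(std::uint32_t key, std::uint32_t& value) const override {
        auto it = table_.find(key);
        if (it == table_.end()) return false;
        value = it->second;
        return true;
    }
private:
    std::unordered_map<std::uint32_t, std::uint32_t> table_;
};

// Stand-in for a driver whose hardware does a binary lookup and keeps
// several identical copies of the table.
class BinaryDriver : public LookupTable {
    using Entry = std::pair<std::uint32_t, std::uint32_t>;
public:
    explicit BinaryDriver(std::size_t copies) : copies_(copies) { assert(copies > 0); }
    void set_entry(std::uint32_t key, std::uint32_t value) override {
        for (std::vector<Entry>& copy : copies_) {  // keep every copy in sync
            auto it = std::lower_bound(copy.begin(), copy.end(), Entry(key, 0));
            if (it != copy.end() && it->first == key) it->second = value;
            else copy.insert(it, Entry(key, value));
        }
    }
    bool lookup(std::uint32_t key, std::uint32_t& value) const override {
        const std::vector<Entry>& copy = copies_.front();
        auto it = std::lower_bound(copy.begin(), copy.end(), Entry(key, 0));
        if (it == copy.end() || it->first != key) return false;
        value = it->second;
        return true;
    }
private:
    std::vector<std::vector<Entry>> copies_;
};

// Client code sees only the API, so it is unaffected by the choice of driver.
inline std::uint32_t lookup_or(const LookupTable& t, std::uint32_t key, std::uint32_t fallback) {
    std::uint32_t v = 0;
    return t.lookup(key, v) ? v : fallback;
}
```

A test written against lookup_or() runs unchanged whichever driver is supplied, which is the property that makes it reusable from one product to the next.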

The API should also be responsible for generating warnings if the software attempts to interact with the hardware in an unsupported manner, or preferably for preventing that from happening at all. This provides a definitive deliverable for such requirements. To improve efficiency, such checking code can be disabled in the production environment (for example, using NDEBUG).
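
One way to arrange this is sketched below. The macro API_REQUIRE, the function write_burst and the 16-word limit are all invented for illustration; only NDEBUG itself is standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical limit, imagined to come from the Product Specification.
constexpr std::size_t kMaxBurstWords = 16;

// Count of warnings issued so far, so that tests can observe them.
inline int& api_warning_count() { static int n = 0; return n; }

#ifndef NDEBUG
// Development build: warn about unsupported usage.
#define API_REQUIRE(cond, msg)                               \
    do {                                                     \
        if (!(cond)) {                                       \
            ++api_warning_count();                           \
            std::fprintf(stderr, "API WARNING: %s\n", msg);  \
        }                                                    \
    } while (0)
#else
// Production build: the check compiles away to nothing.
#define API_REQUIRE(cond, msg) do { } while (0)
#endif

// Hypothetical API call; a real driver would turn it into register writes.
// Returns the number of words accepted.
inline std::size_t write_burst(const unsigned* words, std::size_t count) {
    API_REQUIRE(words != nullptr || count == 0, "write_burst: null buffer");
    API_REQUIRE(count <= kMaxBurstWords, "write_burst: burst longer than supported");
    if (words == nullptr) return 0;
    // Prevent the unsupported access outright by clamping the length.
    return count <= kMaxBurstWords ? count : kMaxBurstWords;
}
```

Note that the clamp remains in the production build; only the diagnostic is compiled out.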

Design Reuse

The best form of verification reuse is design reuse. If a subsystem has been fully verified since the last time any part of it was changed, and no unspecified properties of the subsystem are relied upon, then it is generally unnecessary to reverify the subsystem. Furthermore, if the subsystem has also already been successfully deployed in the field, then the probability of experiencing a latent defect when the subsystem is reused is small.

Another advantage of as-is design reuse is that fixes for defects discovered in the subsystem automatically propagate to all of its clients. This avoids the all-too-common problem of corrected defects reappearing in future releases. Don't worry if that means changing a product that has already been released, so long as you don't break it. (See On Concurrent Development.) In fact, this is quite beneficial, because the previous product can be used as a verification platform for the fixes.

Unfortunately, design reuse is often impractical because the design must optimize a number of different and often competing metrics, such as:

For example, the similarity between two designs that satisfy exactly the same specification, but with substantially different core clock frequencies, is probably unrecognizable by necessity at the RTL representation, at least within the timing-critical portions of the design.

Verification Reuse

Because design reuse is often impractical, we need to rely on verification reuse in order to maintain intended consistencies among products. Because the optimization concerns of the verification infrastructure are few, it is entirely reasonable for it to be developed with reusability as a primary objective. This requires roughly 50% more effort up-front, but given the inevitable specification changes, it probably pays off even before the first design goes into production.

The principal challenge in making the verification infrastructure reusable is making the tests reusable. Here are a bunch of tips to that end:

Use Customer-Visible Interfaces

To the extent possible, tests should be written in terms of customer-visible interfaces, that is, the API as well as the exposed hardware signals. Since customer-visible interfaces are required to be stable, using only those interfaces maximizes the reusability of the test.

This, of course, requires that the API is exposed to tests, and that all supported stimulus patterns are realizable by the test harness. The effort required to accomplish this is nontrivial, but worthwhile. It's a wasted duplicate effort for the test harness to have its own API, and not realizing all possible stimulus necessarily hides defects.

Consider the Product Roadmap

It's much easier to make tests reusable when you know what they will be reused for. The product roadmap can provide a great deal of insight as far as that is concerned. You can't assume that the product roadmap won't change, but it should yield a reasonable picture of the kinds of variation that are likely.

Avoid Code Replication

Because of the diversity of testing that is required for thorough coverage, it is especially common for a test suite to consist of families of closely related tests. If a substantial amount of common source code is copied into each of a family of tests, it becomes prohibitively expensive to maintain the common code.

You should avoid this situation by storing shared source code in a shared location. This also drastically reduces the amount of data in the source control database. Sharing source code is most effectively accomplished by using an object-oriented language to express the tests. The Template Method Pattern is known to be particularly useful for this.
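
A minimal sketch of the Template Method Pattern applied to a family of tests follows. The class names and the string log are hypothetical; a real harness would drive the simulator rather than append to a log.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Shared skeleton for a hypothetical family of packet tests.
class PacketTest {
public:
    virtual ~PacketTest() = default;

    // Template method: the fixed sequence lives here, once.
    std::vector<std::string> run() {
        std::vector<std::string> log;
        log.push_back("reset");
        log.push_back("configure:" + configure());
        const int n = packet_count();
        assert(n >= 0);
        for (int i = 0; i < n; ++i)
            log.push_back("send:" + make_packet(i));
        log.push_back("check");
        return log;
    }

protected:
    // Hooks that an individual test overrides only where it differs.
    virtual std::string configure() { return "default"; }
    virtual int packet_count() const { return 2; }
    virtual std::string make_packet(int index) = 0;
};

// Overrides only the one step that is germane to it.
class MinSizePacketTest : public PacketTest {
protected:
    std::string make_packet(int) override { return "64B"; }
};

class JumboBurstTest : public PacketTest {
protected:
    std::string configure() override { return "jumbo_enabled"; }
    int packet_count() const override { return 3; }
    std::string make_packet(int index) override {
        return "9000B#" + std::to_string(index);
    }
};
```

A change to the common sequence, such as an extra step after reset, is then made in one place rather than in every member of the family.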

Inherit Any Unused Settings

Although the test harness should allow tests to fix any setting it chooses, tests should fix only settings that are germane to the test. The remainder of the settings should be inherited from the environment surrounding the test, for example, its base class, the simulation options, or the system defaults. This permits those settings to be easily varied independently of the core test.

Note that this also argues for using an object-oriented language to express tests.
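
The inheritance of settings can be sketched as a chain of layers, each of which defers to its surroundings for anything it does not fix. SettingsLayer and the setting names used with it are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// A layer of settings that defers to its parent for anything it does not fix.
class SettingsLayer {
public:
    explicit SettingsLayer(const SettingsLayer* parent = nullptr) : parent_(parent) {}

    // Fix a setting at this layer, hiding any value from the outer layers.
    void fix(const std::string& name, int value) { fixed_[name] = value; }

    // Look up a setting, walking outward until some layer defines it.
    int get(const std::string& name) const {
        for (const SettingsLayer* layer = this; layer != nullptr; layer = layer->parent_) {
            auto it = layer->fixed_.find(name);
            if (it != layer->fixed_.end()) return it->second;
        }
        assert(false && "setting has no default");
        return 0;
    }

private:
    const SettingsLayer* parent_;
    std::map<std::string, int> fixed_;
};
```

With system defaults at the bottom, simulation options above them and the test on top, a test that fixes only its FIFO depth picks up whatever clock frequency the simulation options supply, so the same test can be rerun at a different frequency without being edited.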

Write Comprehensive Directed Random Tests Last

You cannot achieve genuinely thorough coverage of complex systems without testing all the legal combinations of discrete settings and stimulus. (Whether a continuous setting is 0, 1, -1 or any other special value, is also considered a discrete setting.) Because the number of such combinations is probably similar to the number of atoms in the universe (literally), it is impractical to attack this using directed tests. Instead, this can be addressed by carefully crafting randomized tests, and hoping that any possible defect is provoked in enough of the legal combinations that such tests can cover it (which is almost always the case).

Crafting effective randomized tests is generally very challenging. Settings and stimulus must be very carefully weighted such that any given legal combination of at least two discrete conditions is covered within about an hour of simulation on average. Estimating the probability of a given discrete condition generally involves running the executable specification in reverse, which is impossible in principle and very difficult in practice. Furthermore, this tends to be quite brittle in the face of changes to the specification, so you'll have to redo the analysis when that happens. Therefore, you should wait until all of the directed tests have been written before you write the randomized tests.

Because it is important to be able to reproduce the results of a random test, you shouldn't use a truly random sequence. Instead, use a pseudo-random sequence with a settable, reported seed. Since it is not uncommon for bugs to hide behind weaknesses in the random number generator, it is critical to use one in which every bit of each number in the sequence is independently sensitive to the seed. In particular, rand48() is usually good enough, whereas random() usually isn't.

It is not possible in general to guarantee that a random test will exhibit similar behavior after a design change. However, you should try to preserve the character of the test for a given seed by using separate pseudo-random sequences for unrelated things.
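
The following sketch shows one way to derive separate, reproducible sequences from a single reported seed. It substitutes the standard C++ std::mt19937_64 generator for the rand48 family mentioned above, purely for portability of the example; the function make_stream and the stream names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string>

// FNV-1a hash, written out so that stream seeds do not depend on the
// implementation-defined std::hash.
inline std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// One pseudo-random stream per purpose, all derived from a single
// master seed that the test reports in its log.
inline std::mt19937_64 make_stream(std::uint64_t master_seed, const std::string& purpose) {
    assert(!purpose.empty());
    const std::uint64_t h = fnv1a(purpose);
    std::seed_seq seq{
        static_cast<std::uint32_t>(master_seed),
        static_cast<std::uint32_t>(master_seed >> 32),
        static_cast<std::uint32_t>(h),
        static_cast<std::uint32_t>(h >> 32)};
    return std::mt19937_64(seq);
}
```

Because the packet-length stream and the inter-packet-gap stream are seeded independently, a design change that causes more gaps to be drawn does not perturb the sequence of packet lengths for a given seed.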

The amount of random testing that is necessary to reliably detect defects must be determined empirically. It is typically thousands of simulation hours, so you'll need a substantial compute farm to obtain prompt results.

"All Bets Off"

Environmental requirements are every bit as important as operational guarantees, and therefore must be stated explicitly in the specification.

If the environmental requirements of the specification are not satisfied at any point in time, then the resulting future behavior is unspecified, at least until the next reset. In principle, the executable specification should simply stop checking things when this is detected, but in practice that can be misleading, because the test writer might believe that his tests are still actually capable of detecting defects. Therefore, the "all bets off" condition should terminate the simulation promptly, with a descriptive error message.

Similarly, a "many bets off" condition results when a violation of a set of conditional requirements causes the result of some customer-visible transaction to be substantially undefined, but the future behavior must still satisfy at least one operational guarantee. Such a condition should result in a descriptive warning containing some agreed-upon identifiable phrase, such as "UNDEFINED BEHAVIOR". Tests that don't expect this to happen should be capable of configuring the simulator to terminate at that point.
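
A sketch of how a checker might report the two conditions is given below. The class RequirementMonitor and its members are hypothetical, and throwing an exception stands in for terminating the simulation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Thrown to end the simulation when an environmental requirement is violated.
struct AllBetsOff : std::runtime_error {
    explicit AllBetsOff(const std::string& what)
        : std::runtime_error("ALL BETS OFF: " + what) {}
};

class RequirementMonitor {
public:
    // Tests that do not expect undefined results set this to true.
    bool terminate_on_undefined = false;

    // Environmental requirement violated: stop promptly with a description.
    void all_bets_off(const std::string& what) const {
        assert(!what.empty());
        throw AllBetsOff(what);
    }

    // Conditional requirement violated: warn using the agreed-upon phrase,
    // or terminate if the test asked for that.
    void many_bets_off(const std::string& what) {
        assert(!what.empty());
        const std::string msg = "UNDEFINED BEHAVIOR: " + what;
        if (terminate_on_undefined) throw std::runtime_error(msg);
        warnings_.push_back(msg);
    }

    const std::vector<std::string>& warnings() const { return warnings_; }

private:
    std::vector<std::string> warnings_;
};
```

Keeping the identifiable phrase at the start of every warning makes it straightforward to search a regression's logs for tests that wandered into undefined territory.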

Care must be taken to ensure that specified input requirements are reasonable and satisfiable. For example, setup and hold requirements of synchronous inputs are reasonable. Requirements on how a CPU interacts with the product are also reasonable, provided that there is some means of controlling the behavior of the software. Requiring an input signal to be synchronized to an unobservable and uncontrollable internal node is clearly unreasonable.

Checking the Device Driver

Since the device driver constitutes part of the product, it makes sense to verify it as well as the hardware. Otherwise, defects can hide behind the fact that tests using the driver aren't covering what they were intended to.

The smart checker generally derives API transactions based on the CPU bus activity. Rather than trusting the derived transactions, they can be verified by comparing them to an expect queue based on the actual API calls.
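
A minimal expect queue might look like the following. The Transaction fields are hypothetical, and the sketch assumes that the driver issues bus activity in the same order as the API calls; a driver that reorders would need a more forgiving match.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical API-level transaction.
struct Transaction {
    std::string op;
    unsigned address;
    unsigned data;
    bool operator==(const Transaction& o) const {
        return op == o.op && address == o.address && data == o.data;
    }
};

class ExpectQueue {
public:
    // Called when the test makes an actual API call.
    void expect(const Transaction& t) { pending_.push_back(t); }

    // Called when the checker derives a transaction from CPU bus activity.
    // Returns false if it was unexpected or differs from the oldest expectation.
    bool observe(const Transaction& derived) {
        if (pending_.empty() || !(pending_.front() == derived)) {
            ++mismatches_;
            return false;
        }
        pending_.pop_front();
        return true;
    }

    // At end of test: everything expected was seen, and nothing mismatched.
    bool clean() const { return pending_.empty() && mismatches_ == 0; }

private:
    std::deque<Transaction> pending_;
    int mismatches_ = 0;
};
```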

Checking Should be Independent of Stimulus

In order to verify that the test harness produces the stimulus intended by the test, it is often convenient for the checker to receive information directly from the test.

Don't do this. Checkers should be reusable with different test harnesses, in particular, at a higher level of integration. If you want to verify the test harness, do it by comparing the derived stimulus transactions against an expect queue, similarly to how you verify the driver.

On the other hand, a test is allowed to "cheat" by looking at the checker or the design in order to produce the worst-case stimulus, because any stimulus that can be produced by such a closed-loop apparatus could also have been produced by an open-loop apparatus. However, avoid doing that if possible, because it limits the portability of the test.

Checking and Simulation Should be Independent of One Another

Checkers should be capable in principle of running with any simulator, and simulators should be capable of running with any checker. Interactions between the two should occur only via the interfaces recited by the Verification Specification.

In particular, since the checker is written in C++, which is not a pointer-safe language, it is advisable for the checker to be a separate process from the simulator. RPC can be used as the basis of this interface, even if the processes are required to execute on the same host.

State Forwarding

Simulating the initialization sequence through which the device reaches a particular state of interest might be prohibitive. This can be addressed by defining in the Verification Specification calls that can be made to the design which set its state directly. This is justified only when the benefit outweighs the cost of implementation and maintenance.

This approach can also be used to parallelize a very long simulation. The test sequence can be broken into sub-sequences with known beginning and ending states. Each sub-sequence can be simulated starting with its beginning state, provided that its ending state is verified to prove that the concurrent simulation of the next sub-sequence is valid.
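
The validity condition can be sketched as follows. The simulate function is a toy stand-in for the design, and the names are hypothetical; what matters is the comparison of each observed ending state with the beginning state that the next sub-sequence assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for the design: the state is a single word that each
// stimulus item mixes into.
using State = std::uint32_t;

inline State simulate(State begin, const std::vector<int>& stimulus) {
    State s = begin;
    for (int x : stimulus) s = s * 31u + static_cast<State>(x);
    return s;
}

struct SubSequence {
    State assumed_begin;        // state forwarded into the design at the start
    std::vector<int> stimulus;
    State observed_end;         // filled in by the simulation
};

// Each sub-sequence depends only on its own assumed beginning state, so
// these runs could be farmed out to separate hosts.
inline void run_all(std::vector<SubSequence>& parts) {
    for (SubSequence& p : parts)
        p.observed_end = simulate(p.assumed_begin, p.stimulus);
}

// The split run is valid only if every observed ending state equals the
// beginning state assumed by the following sub-sequence.
inline bool chain_is_valid(const std::vector<SubSequence>& parts) {
    for (std::size_t i = 0; i + 1 < parts.size(); ++i)
        if (parts[i].observed_end != parts[i + 1].assumed_begin) return false;
    return true;
}
```

If chain_is_valid() fails, the results of every sub-sequence after the first mismatch must be discarded.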

Reaching remote states may also be accomplished by adding mechanisms to the actual design that allow secret API calls to modify the state. For example, if a given counter normally overflows every 100,000,000 cycles, then a secret register setting might cause it to overflow every 100,000 cycles instead, such that the overflow condition can be practically simulated. When this approach is used, considerable care must be taken to verify that the correctness of the overall behavior does not rely on using the special unsupported configuration.

Module-level Verification

A unit is defined as a module whose interfaces are recited by the Design Specification. Because the smart checker is designed to verify properties recited by the Design Specification, it is generally impossible to use it to verify anything smaller than a unit.

However, there are some distinct advantages to simulating at the module level (meaning some level below the unit level). For example, a module can then be verified before the remainder of the unit is completed and integrated. Also, simulation will run faster with a lesser amount of logic to be simulated. Furthermore, the module's inputs are trivial to control and its outputs are trivial to observe, and therefore its internal nodes are also easier to control and observe.

Because typically only the designer knows the intended role of a given module in satisfying the specified properties of its unit, the properties to be verified in module-level simulation are generally recited by designers. The apparatus that verifies those properties, called the module-level checker, documents them.

Because ease of integration with the design is a primary criterion for module-level checkers, the leading candidate languages for expressing them are Vera and Specman. The fact that their licensing cost is significant compared to that of a simulator is a matter of some concern, but you're probably going to use one of them to write the tests anyway.

It is typical for module-level checkers to be developed by dedicated verification engineers working closely with designers. However, since the function of those checkers is dictated exclusively by design, it actually makes more sense for designers to develop them, since (given a little practice) that can be accomplished in about the same amount of time as correctly describing the properties to somebody else.

Because module-level checkers are tied closely to the implementation, which is expected to change much more than the Design Specification, you should expect module-level checkers to be much less reusable than the smart checker.

Don't fall into the trap of relying on information from a module-level checker. In particular, don't assume that simulating an integrated system, with every module being checked simultaneously at the module-level, is tantamount to checking the system as a whole. A module-level checker reflects only the designer's intended behavior, which is often not correct at all, and very often not what is needed to interact correctly with other modules to satisfy system-level properties.

On the other hand, it is safe for the smart checker to get hints from module-level checkers, because hints are not actually relied upon. That is, the smart checker will still flag defects (notwithstanding the fact that they might be false defects) if incorrect hints are supplied.

It is probably impractical to run the module-level checkers in emulation, but you might be able to specify additional emulation logic to detect violations of subtle invariants and drive an external signal low if that happens.

Formal Verification

Formal verification refers to proving properties of a design without simulating. The main advantage of formal verification over simulating and checking is that coverage is a non-issue.

Because the smart checker takes the form of a general program in a Turing-complete language, formal verification for lenient specifications reduces to the halting problem, which is known to be undecidable. There might be ways to address this problem, but I wouldn't expect it to be solved within a decade.

On the other hand, it is practical to formally verify a substantial portion of module-level properties, probably most of them in fact. However, the effort required transcends stating the properties. For example, a module's tables and buffers usually need to be reduced to a trivial size in order for formal verification to be computationally practical. It is open to debate whether such effort is worthwhile.

Any properties that are stated for formal verification should also be monitored in simulation, to account for any modeling discrepancies. If the formal verification tool doesn't do that for you, then it probably makes sense to write your own scripts to automate the generation of monitors.

Optimistic Modeling

Because modeling timing uncertainties with unknowns is computationally expensive, and because unknown states tend to propagate uncontrollably, it is generally impractical for the results of a logic simulation to be truly definitive. This is generally addressed by having the logic simulation produce the "least desirable" of the possible results, such that a functioning logic simulation still implies a functioning design.

Unfortunately, the simulator doesn't always know what you want well enough not to give it to you. For example, if you always simulate with maximum delays, you won't catch hold time problems. Fortunately, static timing analysis tools will catch that one for you. Here are some other problems that are more likely to slip through the cracks:

Asynchronous Logic

When logic fed by flip-flops using one clock feeds flip-flops using an unrelated clock, all hell can break loose, and you won't detect it unless you are careful to vary the clock phase relationships and the clock-to-out timing randomly for every flip-flop independently every cycle.

Another technique for finding problems related to asynchronous logic is to replace the flip-flop model with a "glitching" model that produces an unknown for about 1/4 cycle after it transitions. This will make the simulation run more slowly, but it's better to run slowly than not to find bugs.

In order to avoid uncontrolled unknown propagation, you'll probably need to call out some of the flip-flops that feed another clock domain by using a special module that resolves to the equivalent glitch-free flip-flop model. This facilitates static analysis of the design, makes it clear that the flip-flop output cannot fan out until it is synchronized, and draws attention to the fact that special physical design attention may be required to optimize the metastability characteristics of the synchronizer flip-flop(s). If your logic synthesis flow supports re-timing, then you'll need to make sure that it respects these synchronization boundaries.

This is really a ball of snakes that you should prefer not to pick up at all. It's better to observe the conservative design practice of using a proven bullet-proof synchronizer module from the library whenever you cross clock domains.

Latches

Latches sometimes lead to optimistic modeling because the static timing analyzer might not know what to do with them, and therefore can't report setup and hold violations. Moreover, they simulate less efficiently than flip-flops, and they cause problems for BIST.

In general, latches should be used only in arrays, where the area savings are substantial, they can be encapsulated in a BIST collar, and their timing requirements can easily be analyzed manually.

Gated Clocks

Gated clocks also tend to confound static timing analysis. Furthermore, they typically require special care in physical design, because they require both low self-skew and low skew relative to the clock from which they are derived.

Even worse, the distinction between a gated clock and an ordinary synchronous logic node is not always immediately apparent. You can avoid that by always generating gated clocks using a particular module, including both the AND gate and the negative-edge triggered flip-flop, for that purpose.

Tristate Busses

Tristate busses also tend to confound static timing analysis. Furthermore, electrical problems can arise if the number of drivers of a given net is not exactly one, so that needs to be verified.

It is also somewhat tricky to verify a module having an interface to a tristate bus. A good way to do this is to verify the tristate bus driver independently, and then to verify the module inside the driver, treating the input path as an input signal, and the output path and tristate enable as output signals.
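
The exactly-one-driver condition is simple enough to monitor continuously in simulation. A sketch, with hypothetical names, of the per-cycle check:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class BusStatus { Ok, Floating, Contention };

// `enables[i]` is true when driver i is driving the net in this cycle.
inline BusStatus check_tristate_net(const std::vector<bool>& enables) {
    assert(!enables.empty());
    std::size_t driving = 0;
    for (bool en : enables)
        if (en) ++driving;
    if (driving == 0) return BusStatus::Floating;    // nobody drives the net
    if (driving == 1) return BusStatus::Ok;
    return BusStatus::Contention;                    // drivers fight each other
}
```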

Disabled Timing Arcs

It is often alleged that the precise cycle in which a particular signal arrives is unimportant, and therefore the static timing analyzer is instructed to ignore it. The problem with this practice is that it is unsafe to do so unless it is verified that the arrival cycle is unimportant.

You can verify the unimportance of arrival cycle by simulating with a random delay whose maximum value is several times the minimum clock period, independently on each fanout of a signal whose timing arcs are disabled.

Forcing Initialization

The initial states of certain flip-flops are often alleged to be unimportant, but initializing them to the unknown state leads to an uncontrollable propagation of unknowns. For example, both 1 and 0 produce 0 when XOR'ed with themselves. However, X produces X when XOR'ed with itself.

This problem can be averted by initially forcing such flip-flops to a given value, but it won't be safe to do so unless you verify that the initial state is unimportant. In most cases, the prudent solution is to add a reset input to such flip-flops.

If the netlist is already frozen, you might get away with initializing such flip-flops to all zeros for the first simulation, then re-run initializing them to all ones the second time, and then re-run initializing each flip-flop to an independent random state several more times. This yields a partial degree of confidence that the design is functional regardless of the initial state.
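
The sequence of re-runs can be generated mechanically. In this sketch the function name and its parameters are hypothetical, and the random states come from a seeded generator so that a failing run can be reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Build the initial-state vectors for successive re-runs: all zeros,
// all ones, then `random_runs` vectors of independently random bits.
inline std::vector<std::vector<bool>> initial_state_plan(
        std::size_t num_flops, std::size_t random_runs, std::uint64_t seed) {
    assert(num_flops > 0);
    std::vector<std::vector<bool>> plan;
    plan.push_back(std::vector<bool>(num_flops, false));
    plan.push_back(std::vector<bool>(num_flops, true));
    std::mt19937_64 rng(seed);  // seeded, so the plan is reproducible
    for (std::size_t r = 0; r < random_runs; ++r) {
        std::vector<bool> v(num_flops);
        for (std::size_t i = 0; i < num_flops; ++i)
            v[i] = (rng() >> 63) != 0;  // take the top bit of each draw
        plan.push_back(v);
    }
    return plan;
}
```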

Denying Unknowns

The RTL might model unknowns in an impossible way. In particular, an if statement in procedural Verilog treats an unknown condition as false, for example:
	// 1=>0, 0=>1, X=>1, Z=>1
	module inverter(a,b);
		input a;
		output b;
		reg b; // assigned procedurally, so it must be a reg
		always @(a) begin
			if(a)
				b=0;
			else
				b=1;
		end
	endmodule // inverter
In reality, if an output depends on an input, and the input is unknown, then the output must also be unknown.

The ideal solution is to add code (invisible to the logic synthesis tool, in order to avoid confusing it) to handle this correctly. Unfortunately, that's unreliable, a lot of work, and inefficient to simulate. Instead, such problems are normally covered by post-synthesis simulation, which you'll need to do anyway in order to verify that the logic synthesis tool did nothing unexpected.
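
To make the discrepancy concrete, here is a small C++ model of the two behaviors. The enumeration and function names are hypothetical; the first function mirrors what the procedural if in the inverter module does, and the second mirrors what the hardware can actually promise.

```cpp
#include <cassert>

// Four-state logic value, as in Verilog.
enum class Logic { L0, L1, X, Z };

// Optimistic: anything that is not a definite 1 takes the else branch.
inline Logic inverter_optimistic(Logic a) {
    return a == Logic::L1 ? Logic::L0 : Logic::L1;
}

// Pessimistic: an unknown or floating input yields an unknown output.
inline Logic inverter_pessimistic(Logic a) {
    switch (a) {
        case Logic::L0: return Logic::L1;
        case Logic::L1: return Logic::L0;
        default:        return Logic::X;
    }
}

// True when the RTL model hides an unknown that real hardware would expose.
inline bool rtl_denies_unknown(Logic a) {
    return inverter_optimistic(a) != inverter_pessimistic(a);
}
```

The two models agree for 0 and 1 and disagree for X and Z, which is exactly the set of cases in which the RTL simulation is optimistic.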

Emulation

Emulation refers to the use of powerful hardware acceleration for simulating a design. Its main advantage over traditional simulation is that it runs around 1000× faster, and therefore can be used to cover product states that cannot be reached by simulation. Its disadvantages are that it typically does not use unknowns, is generally more difficult to provide well-controlled stimulus for, and is substantially more difficult to check the results of.

Because emulation is typically no more accurate than simulation, but more precise (because it doesn't use unknowns), it is necessarily less definitive, in addition to the fact that it is harder to provoke existing defects under emulation. Therefore, emulation does not trump simulation. In other words, the failure to observe a defect in emulation does not negate its existence when it is observed in simulation. Similarly, hardware does not trump emulation.

If it is possible to fit an RTL-accurate version of the design into a single FPGA that performs at speed under some favorable conditions (such as low temperature and high voltage), then it usually makes sense to use that for emulation. FPGAs are relatively economical, and can be used for early prototyping of the system that incorporates the product.

Otherwise, it's not always clear whether it makes sense to emulate at all. Before investing millions of dollars in emulation capability, you should consider whether the same goals could be accomplished more economically, for example, by designing for verifiability, using state forwarding, or using static analysis to prove that difficult-to-provoke defects are absent.

Tests should be written at an abstract level, such that they can in principle be used either for simulation or for emulation. In order for emulation to be effective in detecting defects, you'll need to check its results against the smart checker. In general, that requires adding a "hint bus" in emulation to provide hints to the checker. Since the hint bus is probably a bottleneck, you'll need to buffer its input, and timestamp the hints to account for the unpredictable latency that results.

Coverage

Coverage refers to the likelihood that a typical defect would be detected by a given test suite. Contemporary coverage tools cannot determine coverage reliably. (See Code Coverage Deemed Dubious.) Good coverage as reported by tools should be seen as necessary, but not at all sufficient.

If you're paying attention, you've realized that I just claimed that the most important dimension of progress on a complex project is not measurable, at least not until well after the fact. The harsh reality is that unmeasurable objectives are not necessarily unimportant. This presents a substantial challenge for project management.

On the other hand, user-defined high-level coverage metrics (for example, Vera coverage objects) may be capable of effectively measuring the coverage of interaction among distant modules. I don't know how well this works.

The jury is still out on whether it is worthwhile to invest in coverage measurement capabilities beyond the common sense of the verification engineers who write the tests.

Verification is Not Characterization

The verification task is defined as determining whether or not a given design satisfies a definitive (but mutable) specification, and producing at least one counterexample in case it does not. Since this is already the single most difficult challenge in developing complex electronic products, it is important that you don't overburden this responsibility by also demanding characterization of the design.

Characterization is defined as determining all of the "important" properties of a given design, which is much more difficult than verification. For example, "verifying" a design without a specification is characterization. Enumerating the bugs of a design is also characterization, because defects very often hide behind one another. Verifying an arbitrary specification that is not optimized for practical verifiability is also characterization, because it requires guessing the intermediate properties that make the design verifiable. Verifying specified properties that don't gate the product release is also characterization, because it requires guessing which of the properties are actually needed.

If the alleged characteristics of a defect are simple (which is unusual), then you can verify around it by tracking the defect with temporary code in the specification. The set of waived defects is defined in some central location that is approved as part of the release procedure. For example:

	#include "the_project/waivers.h"
	#ifdef WA_PR0123
		// Do the expected thing
	#else
		// Do the correct thing
	#endif // WA_PR0123

On the other hand, verifying around a defect that is difficult to characterize requires avoiding cases that provoke the defect. That's likely to hide additional bugs, including bugs that you wouldn't waive were they known. Don't do that if you require the avoided cases to work at all.

Since you probably can't recruit enough good verification engineers to thoroughly verify the design as it is, don't make their job any harder than it has to be. Instead, optimize your Verification Specification for verifiability, and decide your release criteria with little or no regard for the outstanding defects.

When to Release

As with software, releasing a hardware design is a nontrivial process that is too costly to undertake lightly. Moreover, whereas prototyping software is trivial, obtaining prototypes of a hardware design is typically a very expensive and time-consuming process. For these reasons, it is especially important to manage releases well.

Releases are Expensive

Releasing an electronic design to manufacturing is a costly process. The cost of non-recurring engineering associated with the release, such as mask making and process development, can rival the design budget. The amount of time required to receive prototypes can be weeks or even months, and in the meantime any further design work requires branching off from the release, which is difficult to manage. Furthermore, a significant amount of work is required to execute the release itself. It is therefore important to minimize the number of releases.

Prototype Validation is Difficult

Diagnosing design defects in hardware is much more difficult than in simulation. Among the reasons are:

Because of this, it is much better to discover design defects in simulation rather than in hardware. If the number of bugs that are discovered in hardware is greater than about 10, then it is very likely that those bugs will hide even more bugs, and that the interactions between those bugs will be nontrivial; therefore, reliably extrapolating the effect of prospective design changes will be essentially impossible.

Verification Gates Release

It is obvious that you shouldn't release when there are known design defects that defy the release criteria. It's less obvious, but even more important, that you shouldn't release until the verification effort is complete. Otherwise, it is a virtual certainty that the number of bugs to be discovered in hardware will be prohibitively large, and your very expensive and time-consuming release will be in vain.

Furthermore, when the verification effort is incomplete, one cannot reliably draw conclusions from the shape of the bug curve. Given a fixed level of verification capability, the total number of previously discovered bugs, or bug curve, typically approaches a bug target (in a manner similar to exponential decay) as the design is refined. However, the bug target represents only those bugs that are detectable through verification. As verification capability increases, the bug target increases in a discontinuous and unpredictable manner. Therefore, the final bug target can be extrapolated only after all of the release criteria are verified. You'll probably need to trust your verification engineers to tell you when that has happened (see Coverage).

The "Bug Number"

To calculate the expected number of unknown defects, or bug number, associated with a given release, I recommend the following heuristic:
  1. Start with the number of known bugs that are not fully characterized (see Verification is Not Characterization). You can expect each of these to have on average one unknown bug hiding behind it.
  2. Add to this the difference between the final bug target and the current value of the bug curve. This approximates the number of additional bugs that would be found with an infinite simulation schedule.

As a rule, you should release only when the bug number is about 5 or less.
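
The heuristic amounts to one addition and one subtraction, sketched here with hypothetical names. The threshold defaults to 5, but since the rule says "about 5", it is left as a parameter.

```cpp
#include <cassert>

struct ReleaseStatus {
    int uncharacterized_known_bugs;  // step 1: known bugs not fully characterized
    int final_bug_target;            // extrapolated limit of the bug curve
    int bugs_found_so_far;           // current value of the bug curve
};

// Expected number of unknown defects associated with the release.
inline int bug_number(const ReleaseStatus& s) {
    assert(s.final_bug_target >= s.bugs_found_so_far);
    return s.uncharacterized_known_bugs
         + (s.final_bug_target - s.bugs_found_so_far);  // step 2
}

inline bool ok_to_release(const ReleaseStatus& s, int threshold = 5) {
    return bug_number(s) <= threshold;
}
```

For example, with 2 uncharacterized bugs, a final bug target of 240 and 238 bugs found so far, the bug number is 2 + (240 - 238) = 4, which is within the threshold.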

Final Regression

It is important that time be allotted for the final design to undergo a complete verification cycle prior to release, because late changes are quite likely to introduce new bugs. If the final regression discovers unwaivable or uncharacterizable bugs, then you'll need to unfreeze the design and try again.

The design needs to be completely frozen (in terms of any data seen by Verification) before you can start the timer for the final regression. (I am aware of a case in which a prototype failed because a single comment was changed after the final regression.) It is also a good idea to coordinate releases with the system administrators, to make sure that upgrades are not taking place during the final regression.

Chain of Equivalence

It is necessary to ensure that what is released to manufacturing is what was verified. For example, the schematic is verified against the RTL using formal logic equivalence (such as Conformal), and the masks are verified against the schematic using LVS (such as Assura). This is not a design verification function per se, but it is nonetheless essential.

You shouldn't trust automated processes to translate the design from one form to another. Tools make mistakes too. However, once a translation process has proven itself reliable, it might make sense to defer the verification of that process until after release, and then withdraw the release only in the unlikely event that a problem is discovered.

It is important to have a definitive copy of all the equivalence-checked design specifications and databases (including unreleased data) corresponding to each release. This is essential for effective root cause analysis, and the data can be difficult or impossible to recover after the fact.

Conclusion

Successfully executing a complex electronic design is difficult mostly due to the great number of properties that the design must satisfy simultaneously, coupled with the long turn-around time. Because the risk factors are multiplicative, it is essential to manage risks comprehensively. Doing so requires up-front effort and increases the amount of work required before initial release to manufacturing. However, it is essentially certain to expedite volume production, which is what really matters.


1 Steve McConnell, Rapid Development
2 Brown, et al., Anti Patterns
3 John Lakos, Large-Scale C++ Software Design

Anders Johnson, last modified $Date: 2003/12/07 $
