Dynamic testing of operating systems and applications
It is important to realize from the start that program testing can only demonstrate the presence of errors, not their absence. If we assume that only roughly half of the errors present are actually found, we can safely conclude that an average program still contains a number of weaknesses. In practice, an error rate of 0.7 per 100 lines of code after testing is often assumed for assembly-language programming. If we take the previously mentioned figure of 30,000 lines of source code for a smart card operating system as representative and subtract two-thirds of these as comments, we can calculate that around 70 errors remain in a fully tested and released operating system. In the areas of military and medical technology, where security is critical, it is assumed that four undiscovered errors remain for every 10,000 lines of source code, despite the tremendous amount of effort expended on testing and quality assurance [Thaller 93]. Although most undiscovered errors will never manifest themselves, under the right circumstances a single error is sufficient to bypass all the security barriers of a smart card operating system. It is well worth bearing this in mind as a motivation for careful and well-considered testing.

Of course, there are natural limits to testing. Particularly in commercial projects, in contrast to research projects, the amount of time available and the maximum affordable cost are strongly limiting factors. In addition, testing becomes increasingly difficult and demanding as the number of errors remaining in the program decreases. The search for the last few errors must at some point come to an end, since the time and resources that can be expended on it are fundamentally limited.

When a new version of the software is released, it can generally be assumed to contain fewer errors, since there has been an opportunity to analyze errors discovered in use and eliminate them. Interestingly enough, this reduction in the number of errors does not continue indefinitely. Instead, the number of errors usually reaches a minimum around the second version and then generally increases. This comes about simply because the necessary corrections are based on the original specifications and source code. After a certain time, which can vary, it becomes likely that correcting one error will introduce one or more new errors. This leads to the curve shown in Figure 9.13. After a certain number of versions, it is thus significantly better to make a completely fresh start than to continue building on outdated concepts and repeatedly revised source code. Incidentally, this is true in almost all fields of technology.

In accordance with the IEEE 1008 standard (‘Standard for Software Unit Testing’), three test levels can be distinguished for dynamic testing. The first is the basic test level, which essentially tests the basic functions and successful execution of the individual commands. The second level is the capability test, which encompasses boundary values and non-successful execution. The third level is the behavior test, in which commands are tested in combination with each other.

Test methodology
There is a major difference between testing a new operating system and testing a new application. When a smart card operating system is tested, the entire program code must be tested for a wide variety of application cases, which requires a large number of different tests. In the case of a new application, which consists of only a DF and several EFs, the number of tests is reduced to match the amount of additional data and the identification and authentication procedure defined for the application.

If a new operating system must be tested, several test applications that are similar to typical real applications are usually generated. This essentially amounts to creating equivalence classes for the usual applications, and these equivalence classes form the basis for the individual tests that are subsequently performed.

The approach to testing new smart card operating systems described here has become established over several years in a wide variety of projects. Testing always starts with the data transmission functions, since they form the basis for all further activities. Following this, all available commands are tested. If an application is involved, the next stage is file tests. If all these tests are completed successfully, testing of the defined procedures can begin.

There are currently only a few international standards that govern the construction and execution of tests for smart card operating systems and applications. A European standard (EN 1292) defines a few tests for the ATR and the T = 1 transmission protocol. For GSM smart cards, relatively extensive tests for the operating system and application are defined in the GSM 11.17 specification.

In order to provide an overview, a selection of possible tests in a conventional sequence is presented below. This list does not claim to be complete; it is only meant to serve as a detailed illustrative example. The purpose of the listed tests is to verify the essential general parameters of a new operating system, including one or more applications.
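To illustrate what such equivalence classes can look like in practice, the following Python sketch groups the test cases for a single command parameter (the number of bytes requested by READ BINARY from a transparent EF) into classes covering normal execution, boundary values and non-successful execution, corresponding to the basic and capability test levels mentioned above. The file size and the class boundaries are assumptions chosen purely for this example.

# Illustrative equivalence classes for a single command parameter: the number
# of bytes requested by READ BINARY from a transparent EF.  The file size and
# the class boundaries are assumptions made only for this example.

EF_SIZE = 100          # assumed size of the transparent EF in bytes

EQUIVALENCE_CLASSES = {
    # basic test: representatives of normal, successful execution
    "valid_mid_range":     [1, EF_SIZE // 2, EF_SIZE],
    # capability test: boundary values and non-successful execution
    "zero_length":         [0],
    "upper_boundary":      [EF_SIZE - 1],
    "beyond_end_of_file":  [EF_SIZE + 1, 255],
}

def test_cases():
    """Yield one (class name, requested length) pair per representative value."""
    for name, representatives in EQUIVALENCE_CLASSES.items():
        for length in representatives:
            yield name, length

if __name__ == "__main__":
    for name, length in test_cases():
        print(f"{name:20s} READ BINARY with Le = {length}")

Each test application defined for the operating system contributes a set of such classes, so that every command is exercised with representatives of each class rather than with every conceivable parameter value.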

Data transmission tests
–ATR (parity error detection, and if T = 0 is present, character repetition and ATR structure and contents; a structure check is sketched after this list)
–PTS (PTS structure and contents)
–Data transmission test at OSI layer 2 (start bit, data bits and stop bit, divider, and data transmission convention)
–T = 0 transmission protocol (parity error detection and character repetition, various processes)
–T = 1 transmission protocol (CWT, BWT, BGT, resync, error mechanisms, various processes)
–Secure messaging
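As an illustration of the first item in the list above, the following sketch checks the static structure of a received ATR: the TS byte, the chain of interface bytes, the number of historical bytes and, where applicable, the TCK checksum. It covers only these structural checks; parity error detection and character repetition have to be observed at the reader level and are not covered here.

def check_atr(atr: bytes) -> list[str]:
    """Minimal static structure check of an Answer to Reset (ISO/IEC 7816-3).
    Returns a list of findings; an empty list means the checks pass."""
    findings = []
    if len(atr) < 2:
        return ["ATR shorter than two bytes (TS or T0 missing)"]
    if atr[0] not in (0x3B, 0x3F):
        findings.append(f"TS = 0x{atr[0]:02X}, expected 0x3B (direct) or 0x3F (inverse convention)")

    historical_count = atr[1] & 0x0F     # low nibble of T0: number of historical bytes
    y = atr[1] >> 4                      # high nibble of T0: presence of TA1, TB1, TC1, TD1
    protocols = set()                    # protocol types offered by the TD bytes
    index = 2                            # position of the first interface byte, if any
    while True:
        td_present = bool(y & 0x08)
        index += bin(y & 0x07).count("1")   # skip TAi, TBi, TCi if indicated
        if not td_present:
            break
        if index >= len(atr):
            return findings + ["ATR truncated: an indicated TD byte is missing"]
        protocols.add(atr[index] & 0x0F)     # low nibble of TDi: offered protocol type
        y = atr[index] >> 4                  # high nibble of TDi: presence of further interface bytes
        index += 1

    # TCK is present whenever a protocol other than T = 0 is offered; the XOR of
    # all bytes from T0 up to and including TCK must then be zero
    tck_expected = any(protocol != 0 for protocol in protocols)
    expected_length = index + historical_count + (1 if tck_expected else 0)
    if len(atr) != expected_length:
        findings.append(f"ATR is {len(atr)} bytes long, {expected_length} bytes were expected")
    elif tck_expected:
        checksum = 0
        for byte in atr[1:]:
            checksum ^= byte
        if checksum != 0:
            findings.append("TCK checksum error")
    return findings

if __name__ == "__main__":
    print(check_atr(bytes.fromhex("3B800181")))   # well-formed ATR offering T = 1 -> []
    print(check_atr(bytes.fromhex("3B800182")))   # corrupted checksum -> ['TCK checksum error']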
Testing available commands
–Test all possible class bytes (a scan of this type is sketched after this list)
–Test all possible instruction bytes
–Test all available commands using equivalence classes for the supported functions
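A sketch of the class-byte scan referred to above is shown below. It assumes a function transmit(apdu) that sends a list of APDU bytes to the card and returns (data, SW1, SW2); such a function would typically be provided by the reader library of the test environment and is not defined here. The instruction bytes can be scanned in the same way with the class byte held fixed.

from collections import Counter

def scan_class_bytes(transmit, ins=0xA4):
    """Send a minimal case-1 APDU with every class byte and tally the returned
    status words.  On a typical card, most class bytes should be answered with
    '6E00' (class not supported); any other status word marks a class byte that
    the operating system at least partially accepts and that must be examined
    more closely."""
    results = Counter()
    for cla in range(0x100):
        if cla == 0xFF:
            continue            # 'FF' is reserved for the PTS and is not a valid class byte
        _, sw1, sw2 = transmit([cla, ins, 0x00, 0x00])
        results[(sw1, sw2)] += 1
    return results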
Testing available files
–Test whether all files are present in the correct locations (MF, DF)
–Test for correct file size (a size check for a transparent EF is sketched after this list)
–Test for correct file structure
–Test for correct file attributes
–Test for correct file contents
–Test the defined access conditions (read, write, block, unblock etc.)
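The file size test mentioned above can be sketched as follows for a transparent EF: a READ BINARY of one byte at the last offset inside the file must succeed, while the same read starting just past the end of the file must be rejected. The transmit(apdu) function, the file identifier and the expected size are assumptions supplied by the test environment; the exact error status word for the out-of-range read (for instance '6B00') depends on the operating system.

def check_transparent_ef_size(transmit, fid, expected_size):
    """Verify the exact size of a transparent EF.  Returns None if the size
    matches, otherwise a description of the finding."""
    # SELECT FILE by its two-byte file identifier; P2 = 0x0C requests that no FCI be returned
    _, sw1, sw2 = transmit([0x00, 0xA4, 0x00, 0x0C, 0x02, fid >> 8, fid & 0xFF])
    if (sw1, sw2) != (0x90, 0x00):
        return f"EF '{fid:04X}' could not be selected (SW = {sw1:02X} {sw2:02X})"

    # reading one byte at the last valid offset must succeed ...
    offset = expected_size - 1
    _, sw1, sw2 = transmit([0x00, 0xB0, offset >> 8, offset & 0xFF, 0x01])
    if (sw1, sw2) != (0x90, 0x00):
        return f"read at offset {offset} failed; the file appears to be smaller than expected"

    # ... while reading one byte just past the end of the file must be rejected
    offset = expected_size
    _, sw1, sw2 = transmit([0x00, 0xB0, offset >> 8, offset & 0xFF, 0x01])
    if (sw1, sw2) == (0x90, 0x00):
        return f"read at offset {offset} succeeded; the file appears to be larger than expected"
    return None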
Testing available processes
–Test the defined state machines (e.g., the command sequence)
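The following sketch illustrates a simple behavior test of such a state machine: reading a PIN-protected transparent EF must be refused before VERIFY and must be possible afterwards. The transmit(apdu) function, the file identifier, the PIN and the reference data number used with VERIFY are assumptions supplied by the test environment.

def check_read_requires_verify(transmit, ef_fid, pin, pin_number=0x01):
    """Behavior test of the security state machine protecting a transparent EF."""
    findings = []
    # select the protected EF (P2 = 0x0C: no FCI requested)
    transmit([0x00, 0xA4, 0x00, 0x0C, 0x02, ef_fid >> 8, ef_fid & 0xFF])

    # 1. before authentication, the read access condition must be enforced
    _, sw1, sw2 = transmit([0x00, 0xB0, 0x00, 0x00, 0x01])
    if (sw1, sw2) == (0x90, 0x00):
        findings.append("EF is readable without prior PIN verification")

    # 2. VERIFY with the correct PIN
    _, sw1, sw2 = transmit([0x00, 0x20, 0x00, pin_number, len(pin)] + list(pin))
    if (sw1, sw2) != (0x90, 0x00):
        findings.append(f"VERIFY rejected the correct PIN (SW = {sw1:02X} {sw2:02X})")

    # 3. after successful verification, the same READ BINARY must succeed
    _, sw1, sw2 = transmit([0x00, 0xB0, 0x00, 0x00, 0x01])
    if (sw1, sw2) != (0x90, 0x00):
        findings.append("EF is not readable after successful PIN verification")
    return findings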
As can easily be imagined, even if equivalence classes are generated and various other minimization techniques are used, a relatively large number of individual tests are required. It can be assumed that 4000 to 8000 different tests must be prepared to cover the essential test cases for a 20-kB smart card operating system, with tests that perform the same operation many times in a single loop (such as sending several hundred different values to the smart card) being counted as single tests. The number of commands sent to the smart card by these tests can easily be on the order of 40,000, and the time required to perform them all is in the range of one to two days. The only way to manage such a large number of tests with a reasonable amount of effort is to use a suitable database, which can also store the test results.

The ‘tree and tabular combined notation’ (TTCN), which is standardized in ISO/IEC 9646-3, is one of the techniques that can be used to formally describe the tests. Any desired test case can be described in a general and standardized form using this notation. An interpreter can then use this description to automatically generate the command APDUs for the card being tested. This allows largely automated test procedures to be defined.

The structure of a test tool for smart cards is shown in Figure 9.17. The specification of the card’s software, which is written completely in pseudocode, is contained in an appropriate database. If the specification changes, the necessary modifications to the tests are made automatically. Another database contains all of the tests, which are defined in a high-level language that can also be read directly by a computer. The two databases feed a test pattern generator, which generates the commands (i.e., TPDUs or APDUs as appropriate) for the card being tested. A simulation of the real card, which is largely defined by the specification, is run in parallel. Since some processes in the real card cannot be fully predicted (e.g. generating a random number), additional data must be sent to the simulated card. The real and simulated cards send their command responses to a comparator. If the responses are the same, the real card has provided the correct result, insofar as the simulation is the proper reference. All the data generated during a test run are stored in a log database so that they can later be manually evaluated.
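The comparator stage of such a test tool can be sketched as follows. The test pattern generator is represented here by an iterable of command APDUs, the real and simulated cards by two callables that each map an APDU to a response, and the log database by a simple list; all of these names are illustrative assumptions rather than parts of any particular tool.

def run_comparison(test_apdus, real_card, simulated_card, log):
    """Send every command APDU to both the real and the simulated card,
    compare the two responses and record everything in the log.
    Returns the number of mismatches."""
    mismatches = 0
    for number, apdu in enumerate(test_apdus, start=1):
        response_real = real_card(apdu)            # (data, SW1, SW2) from the card under test
        response_expected = simulated_card(apdu)   # (data, SW1, SW2) from the reference simulation
        passed = response_real == response_expected
        if not passed:
            mismatches += 1
        log.append({"test": number, "apdu": apdu,
                    "real": response_real, "expected": response_expected,
                    "passed": passed})
    return mismatches

Responses that the simulation cannot fully predict, such as a random number returned by the card, must either be fed into the simulation beforehand, as described above, or masked out before the comparison.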