These are the notes for Software Quality at USJ

Early Draft

Introduction to Software Quality

Why Software Quality matters

Let's have a look at one of the most famous bugs in software history: the outage AT&T suffered in 1990.

An example of the consequences of Poor Quality Software

What happened?

At 2:25 PM on Monday, January 15th, network managers at AT&T's Network Operations Centre began noticing an alarming number of red warning signals coming from various parts of their network. Within seconds, the warnings were spreading rapidly from one computer-operated switching centre to another. For nine hours, managers tried to bring the network back up to speed while engineers raced to stabilize it; almost 50% of the calls placed through AT&T failed to go through until 11:30 PM, when network loads were low enough to allow the system to stabilize.

AT&T alone lost more than $60 million in unconnected calls. Of course, there were many additional consequences that are difficult to measure, such as business that could not be done because it relied on network connectivity.

The system was failure tolerant, wasn't it?

AT&T's long-distance network was a model of reliability and strength. In 1990, AT&T's long-distance service carried over 70% of the US long-distance traffic.

The backbone of this massive network was a system of 114 computer-operated electronic switches (4ESS) distributed across the United States. These switches, each capable of handling up to 700,000 calls an hour, were linked via a cascading network known as Common Channel Signalling System No. 7 (SS7). When a telephone call was received by the network from a local exchange, the switch would assess 14 different possible routes to complete the call. At the same time, it passed the telephone number to a parallel signalling network that checked the alternate routes to determine if the switch at the other end could deliver the call to its local company. If the destination switch was busy, the original switch sent the caller a busy signal and released the line. If the switch was available, a signal-network computer made a reservation at the destination switch and ordered it to pass the call, after the switches had checked that the connection was good. The entire process took only four to six seconds.

What went wrong?

The day the bug popped up, a team of 100 frantically searching telephone technicians identified the problem, which began in New York City. The New York switch had performed a routine self-test that indicated it was nearing its load limits. As standard procedure, the switch performed a four-second maintenance reset and sent a message over the signalling network that it would take no more calls until further notice. After the reset, the New York switch began to distribute the signals that had backed up during the time it was off-line. Across the country, another switch received a message that a call from New York was on its way, and began to update its records to show the New York switch back online. A second message from the New York switch then arrived, less than ten milliseconds after the first. Because the first message had not yet been handled, the second message should have been saved until later. A software defect instead caused the second message to be written over crucial communications information. Software in the receiving switch detected the overwrite and immediately activated a backup link while it reset itself, but another pair of closely timed messages triggered the same response in the backup processor, causing it to shut down as well. When the second switch recovered, it began to route its backlogged calls, and propagated the cycle of closely timed messages and shutdowns throughout the network. The problem repeated iteratively throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system.

The roots of the issue

The cause of the problem had been introduced months before. In early December, technicians had upgraded the software to speed up the processing of certain types of messages. Although the upgraded code had been rigorously tested, a one-line bug was inadvertently added to the recovery software of each of the 114 switches in the network. The defect was a C program that featured a break statement located within an if clause nested within a switch clause. In pseudo-code, the program read as follows:

                1   while (ring receive buffer not empty and side buffer not empty) DO
                2     Initialize pointer to first message in side buffer or ring receive buffer
                3     get copy of buffer
                4     switch (message) {
                5       case (incoming_message):
                6         if (sending switch is out of service) DO {
                7           if (ring write buffer is empty) DO
                8             send "in service" to status map
                9           else
                10            break
                          } // END IF
                11        process incoming message, set up pointers to optional parameters
                12        break
                        } // END SWITCH
                13    do optional parameter work
              

When the destination switch received the second of the two closely timed messages while it was still busy with the first (buffer not empty, line 7), the program should have dropped out of the if clause (line 7), processed the incoming message, and set up the pointers to the database (line 11). Instead, because of the break statement in the else clause (line 10), the program dropped out of the case statement entirely and began doing the optional parameter work, which overwrote the data (line 13). Error correction software detected the overwrite and shut the switch down while it reset. Because every switch contained the same software, the resets cascaded through the network, incapacitating the system.
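The same pitfall can be reproduced in any C-family language: a break placed inside an if/else that is nested in a switch terminates the switch, not the if. A minimal Java sketch (illustrative only, not AT&T's actual code) of this behaviour:

            // Illustrative sketch of the pitfall: a break inside an if/else nested in a
            // switch exits the switch, not the if.
            public class SwitchBreakPitfall {
              static String handle(int message, boolean ringWriteBufferEmpty) {
                String result = "nothing done";
                switch (message) {
                  case 1: // incoming message from a switch that is coming back into service
                    if (ringWriteBufferEmpty) {
                      result = "status map updated";
                    } else {
                      break; // meant to leave the if, but it leaves the whole switch
                    }
                    result = "incoming message processed"; // skipped when the break above runs
                    break;
                  default:
                    result = "message ignored";
                }
                return result;
              }

              public static void main(String[] args) {
                System.out.println(handle(1, true));  // prints "incoming message processed"
                System.out.println(handle(1, false)); // prints "nothing done": processing was skipped
              }
            }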

Lessons Learned

Unfortunately, it is not difficult for a simple software error to remain undetected and later bring down even the most reliable systems. The software update loaded in the 4ESS switches had already passed through layers of testing and had remained unnoticed through the busy Christmas season. AT&T was fanatical about reliability. The entire network was designed so that no single switch could bring down the system. The software contained self-healing features that isolated defective switches. The network used a system of "paranoid democracy," where switches and other modules constantly monitored each other to determine whether they were "sane" or "crazy." Sadly, the January 1990 incident showed that all of the modules can go "crazy" at once, how bugs in self-healing software can bring down healthy systems, and how difficult it is to detect obscure load- and time-dependent defects in software.

Software Crisis

We might think that this bug occurred a long time ago and that nowadays we have more advanced technologies, methodologies, training systems and developers.

Is this really true? Only partially: it is true that Software Development has evolved a lot, but the type of problems that are solved via software has also evolved; every day we try to solve more, and more complex, problems via software.

The term Software Crisis was coined by the US Department of Defence years ago to describe the fact that the complexity of the problems addressed by software has outpaced the improvements in the software creation process, as shown graphically in the figure below.

"Few fields have so large a gap between best current practice and average current practice."

Department of Defence

The Software Complexity Evolution

In other words, the software creation process has evolved very little, while the problems software is solving have become far more complex.

"We have repeatedly reported on cost rising by millions of dollars, schedule delays, of not months but years, and multi-billion-dollar systems that don't perform as envisioned. The understanding of software as a product and of software development as a process is not keeping pace with the growing complexity and software dependence of existing and emerging mission-critical systems."

Government Accounting Office

Additionally, as depicted in the figure below, the need for software developers has increased exponentially, because more software is needed: software is used in nearly every product with a minimum of complexity. Whereas the need for developers has increased exponentially, the availability of developers has unfortunately not grown at the same pace, i.e. there are fewer developers than are needed. Because of that, people without the right skills have started developing software, with the belief that developing software is an easy task that nearly everybody can do. Developing software with people who are not properly trained or lack the right skills inherently leads to bad quality software.

The Software Resources Evolution

Legal Warranties

Mortenson, a construction contractor, purchased software from Timberline Software Corporation, which Timberline installed on Mortenson's computers. Mortenson, relying on the software, placed a bid which was $1.95 million too low because of a bug in the software of which Timberline was aware. The State of Washington Supreme Court ruled in favour of Timberline Software. However, a simple bug in the software led to multiple problems for both companies. In US warranty law, Article 2 of the Uniform Commercial Code includes the "Uniform Computer Information Transaction Act" (UCITA), which allows software manufacturers to:

  • Disclaim all liability for defects
  • Prevent the transfer of software from person to person
  • Remotely disable licensed software during a dispute

In practice, that act means that software distributors can limit their liability through appropriate clauses in their contracts. For instance, the disclaimer of warranties of a Microsoft product is shown below.

DISCLAIMER OF WARRANTIES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, MICROSOFT AND ITS SUPPLIERS PROVIDE TO YOU THE SOFTWARE COMPONENT, AND ANY (IF ANY) SUPPORT SERVICES RELATED TO THE SOFTWARE COMPONENT ("SUPPORT SERVICES") AS IS AND WITH ALL FAULTS; AND MICROSOFT AND ITS SUPPLIERS HEREBY DISCLAIM WITH RESPECT TO THE SOFTWARE COMPONENT AND SUPPORT SERVICES ALL WARRANTIES AND CONDITIONS, WHETHER EXPRESS, IMPLIED OR STATUTORY, INCLUDING, BUT NOT LIMITED TO, ANY (IF ANY) WARRANTIES OR CONDITIONS OF OR RELATED TO: TITLE, NON- INFRINGEMENT, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, LACK OF VIRUSES, ACCURACY OR COMPLETENESS OF RESPONSES, RESULTS, LACK OF NEGLIGENCE OR LACK OF WORKMANLIKE EFFORT, QUIET ENJOYMENT, QUIET POSSESSION, AND CORRESPONDENCE TO DESCRIPTION. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE COMPONENT AND ANY SUPPORT SERVICES REMAINS WITH YOU.

Although the law overprotects software developers and distributors, and these disclaimers may prevent legal problems, they are just a way of avoiding the legal consequences of bad quality software; they do not solve the real problem, which is what affects and frustrates end users.

What is Software Quality?

Many people have tried to define what Software Quality means. However, it is not an easy task. Quality in general (not only in software) is such a subjective topic that trying to define it formally is extremely challenging.

There is a very interesting book called "Zen and the Art of Motorcycle Maintenance" [[ZEN-AND-THE-ART-OF-MOTORCYCLE-MAINTENANCE]] in which the narrator talks about the process of creative writing, and especially about quality. The quality of a written text is difficult to define. If you ask people to rank essays (or programs) from best to worst, it is very likely they will reach a consensus (they have an intuitive understanding that one essay has more quality than another), but it is much more difficult to identify the parts of the essay that give it quality.

In Zen and the Art of Motorcycle Maintenance, Pirsig (the author) explores the meaning and concept of quality, a term he deems to be undefinable. Pirsig's thesis is that to truly experience quality one must both embrace and apply it as best fits the requirements of the situation. According to Pirsig, such an approach would avoid a great deal of frustration and dissatisfaction common to modern life.

Let's think about another example of how the situation determines the quality. For instance, a master chef has prepared an exquisite meal and invited a group of friends to share it at her restaurant on a lovely summer evening. Unfortunately the air conditioning isn't working at the restaurant, the waiters are surly, and two of the friends have had a nasty argument on the way to the restaurant that dominates the dinner conversation. The meal itself is of the highest quality but the experiences of the diners are not.

You could think that writing code is very different from writing an essay, but that is not the case. Usually, when you look at a piece of code it is easy to determine whether you like it or not, but it becomes quite complicated to explain why.

View 1: Formal Definition

Software quality may be defined as conformance to explicitly stated functional and performance requirements, explicitly documented development standards and implicit characteristics that are expected of all professionally developed software.

This definition emphasizes three points:

  1. Software requirements are the foundation from which quality is measured: lack of conformance to requirements is lack of quality.
  2. Specified standards define a set of development criteria that guide the manner in which software is engineered: if the criteria are not followed, lack of quality will almost surely result.
  3. A set of implicit requirements often goes unmentioned, for example ease of use, maintainability, etc.: if software conforms to its explicit requirements but fails to meet implicit requirements, software quality is suspect.

For the first item, explicit software requirements, it is relatively easy to check conformance objectively; for the second one, it is more complicated and depends on how well documented those standards are; for the implicit expected characteristics it is even tougher, as measuring conformance to something that is implicit is, by definition, impossible.

View 2: The Human Point of View

Those "implicit" requirements mentioned in the formal definition are a hint that there is something more to software that goes beyond the explicit requirements. At the end of the day, software is going to be used by people, who do not care about the requirements but about their expectations. Hence the need to look for another point of view.

"A product's quality is a function of how much it changes the world for the better." [[MANAGEMENT-VS-QUALITY]] or "Quality is value to some person" [[QUALITY-SOFTWARE-MANAGEMENT]]. Both definitions stress that quality may be subjective, i.e. different people will perceive different quality in the same software. Software developers should also think about end users and ask themselves questions such as "How are users going to use the software?".

In order to provide a more complete picture, IEEE standard 610.12-1990 combines both views in its definition of quality:

Software quality is

  • The degree to which a system, component, or process meets specified requirements.
  • The degree to which a system, component or process meets customer or user needs or expectations.

View 3: Internal vs. External Quality

There is another dimension of Software Quality that depends on whether we focus on the part of the software that is exposed to the users or on the part that is not.

External Quality is the fitness for purpose of the software, i.e. does the software do what it is supposed to do? The typical way to measure external quality is through functional tests and bug measurements.

Usually this is related to the conformance requirements that affect end-users (formal definition) as well as to meeting the end-user expectations (human point of view).

Some of the properties that determine the external quality of software are:

  • Conformance to the product specifications and user expectations.
  • Reliability: does the software deliver the same level of performance under different conditions and over time?
  • Accuracy: does the software do exactly what it is supposed to do?
  • Ease of use and comfort: is the software easy to use, and does it respond within a time frame that matches user expectations?
  • Robustness: does the software adapt to unforeseen situations, e.g. invalid input parameters, loss of connectivity...

Internal Quality is everything the software does but is never seen directly by the end-user. It's the implementation, which the customer never directly sees. Internal quality can be measured by conformance requirements (not focused on end-users but on software structure), software analysis and adherence to development standards or best practices.

If it is not visible to the end-user, and our target is to make customers happy, we could ask ourselves whether Internal Quality is something we should pay attention to.

Internal quality is related to the design of the software and is purely in the interest of development. If internal quality starts falling, the system will be less amenable to change in the future. Because of that, code reviews, refactoring and testing are essential, as otherwise internal quality will slip.

An interesting analogy between debt and bad code design was developed by Ward Cunningham [[DEBT-ANALOGY]]. Sometimes companies need to get credit from the banks in order to be able to invest; however, it is also critical to understand that it is impossible to ask for credit continuously, as paying the interest will kill the company financially. The same applies to software: sometimes it is good to assume some technical debt to achieve a goal, for instance meeting a critical milestone to reach users before our competitors, but it is important to understand that assuming technical debt endlessly will kill the project, as it makes the product unmaintainable.

Sometimes, after achieving the target External Quality, we need to refactor our code to improve the Internal Quality. Software Quality is sometimes the art of a continuous refactor.

Let's go back to the analogy of writing an essay or a paper, in that case most people write out the first draft as a long brain-dump saying everything that should be said. After that, the draft is constantly changed (refactored) until it is a cohesive piece of work.

When developing software (for instance in University assignments :-D) the first draft is often finished when it meets the general requirements of the task. So, after that, there is an immediate need to refactor the work into a better state without breaking the external quality. Maybe writing software is also kind of an art?

This is universally true, and the danger of neglecting to refactor your code grows with project size: on a larger project, poor quality code can cost you days of debugging and rework.

Some of the properties that characterize software with good internal quality are listed below (a small illustrative code sketch follows the list):

  • Concision: the code does not suffer from duplication.
  • Cohesion: each [module|class|routine] serves a particular purpose (i.e. it does one thing) and does it well.
  • Low coupling: minimal inter-dependencies and interrelations between objects and modules.
  • Simplicity: the software is always designed in the simplest possible manner so that errors are less likely to be introduced.
  • Generality: specific solutions are only used if they are really needed.
  • Clarity: the code is largely self-documenting, so that it is easy to maintain.
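As a small illustration of two of these properties, the following made-up Java fragment shows a version with duplicated, mixed-responsibility code and a refactored version in which the calculation lives in a single cohesive routine:

            // Made-up before/after fragment illustrating concision (no duplication) and
            // cohesion (each routine has a single, clear purpose).

            // Before: the same VAT calculation is duplicated and mixed with output concerns.
            class InvoiceBefore {
              void print(double net)  { System.out.println("Total: " + (net + net * 0.21)); }
              void export(double net) { System.out.println("CSV;" + (net + net * 0.21)); }
            }

            // After: the calculation lives in one cohesive, well-named routine that the others reuse.
            class InvoiceAfter {
              private static final double VAT_RATE = 0.21;

              double totalWithVat(double net) { return net + net * VAT_RATE; }

              void print(double net)  { System.out.println("Total: " + totalWithVat(net)); }
              void export(double net) { System.out.println("CSV;" + totalWithVat(net)); }
            }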

External quality is sometimes summarized as "doing the right things", whereas internal quality is about "doing things right".

Usually, problems with the external quality characteristics (correctness, reliability...) are simply visible symptoms of software problems that are usually rooted in internal quality attributes: program structure, complexity, coupling, testability, reusability, readability, maintainability... Sometimes, when the internal quality is bad, external quality can still be met for a short period of time, but in the longer term the external quality will be affected.

An excellent analogy is the Quality Iceberg created by Steve McConnell (see the figure below).

The Software Quality Iceberg

View 4: ISO 9126

ISO 9126 defines Software Quality as the totality of characteristics of an entity that bears on its ability to satisfy stated and implied needs.

It recognizes that quality is not only determined by the software itself but also by the process used to develop it and by the use made of the software. Hence, the following qualities are defined:

  • Quality of the process of constructing software: process quality
  • Quality of software product in itself: internal and external quality
  • Quality of software product in use: quality in use
ISO 9126 View on Software Quality

ISO identified six characteristics of software quality that are sub-divided into sub-characteristics:

  • Functionality: A set of attributes that bear on the existence of a set of functions and their specified properties. The following sub-characteristics were defined: Suitability, Accuracy, Interoperability and Security.
  • Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time. The following sub-characteristics were defined: Maturity, Fault Tolerance and Recoverability.
  • Usability: A set of attributes that bear on the effort needed for use. The following sub-characteristics were defined: Understandability, Learnability and Operability.
  • Efficiency: A set of attributes that bear on the relationship between the level of performance of the software and the amount of resources used. The following sub-characteristics were defined: Time Behaviour and Resource Behaviour.
  • Maintainability: A set of attributes that bear on the effort needed to make specified modifications. The following sub-characteristics were defined: Analyzability, Changeability, Stability and Testability.
  • Portability: A set of attributes that bear on the ability of software to be transferred from one environment to another. The following sub-characteristics were defined: Adaptability, Installability, Conformance and Replaceability.
ISO 9126 Quality Characteristics

Quality in use is defined by ISO as "the extent to which a product used by specified users meets their needs to achieve specified goals with effectiveness, productivity, and satisfaction in specified contexts of use". The quality in use hence depends on the context in which the product is used and its intrinsic quality.

Summary

There is no single definition of quality. However, the importance of Software Quality is continuously increasing. The concepts of external and internal quality are commonly used across the software industry, but despite that, the properties used to measure quality diverge across different methodologies, standards and companies.

Key Definitions

Despite the availability of different quality definitions, characteristics and entities, a common understanding is that high quality is usually linked to products with a low number of defects. Therefore, it is assumed that a quality problem is due to the impact of a defect.

But in order to identify what high quality is, defining what a defect is becomes necessary. In general, there are three concepts used in software quality to refer to defects:

  • Fault: an incorrect step, process or data definition in the software (e.g. a bug in the code), caused by a missing or incorrect human action.
  • Error: an incorrect internal state of the system, resulting from the activation of a fault during execution.
  • Failure: the external, observable manifestation of an error, i.e. the system behaves differently from what is expected.

A real life example of all these concepts is described in Example 2.
            Peter is driving his car towards Oxford. While he is driving, the road 
            diverts into two different directions:
              1.  Left road to Oxford
              2.  Right road to Cambridge

            By mistake, Peter takes the road to Cambridge. That is a fault committed 
            by Peter.

            Suddenly, Peter is in an error situation or state: Peter is heading to 
            Cambridge and not Oxford.

            If Peter goes on and arrives in Cambridge, that would be a failure: 
            Peter was planning to get to Oxford but he has arrived in Cambridge instead.

            If Peter realizes the error situation while he is driving towards Cambridge, 
            returns to the junction and takes the correct road to Oxford, no failure 
            occurs, as Peter recovers from the error condition.
          
            public static int numZero (int[] x) { 
              // effects: if x == null throw NullPointerException
              // else return the number of occurrences of 0 in x 
              int count = 0; 
              for (int i = 1; i < x.length; i ++) { // FAULT: the loop should start at i = 0
                if (x[i] == 0) { 
                 count ++;
                }
              }
              return count;
            }
          

The fault in the code above is that it starts looking for zeroes at index 1 instead of index 0. For example, numZero([2, 7, 0]) correctly evaluates to 1, while numZero([0, 7, 2]) incorrectly evaluates to 0. In both cases the fault is present and is executed. Although the code is in an error state in both cases, only in the second case is there a failure: the result is different from the expected one. In the first case, the error condition (the for loop starting at 1) does not propagate to the output.
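To make the distinction explicit, the following minimal Java harness (illustrative only) copies the faulty numZero method above into a runnable class and executes both cases:

            // Minimal harness (illustrative only) that copies the faulty numZero method
            // above into a runnable class and executes both cases.
            public class NumZeroDemo {
              static int numZero(int[] x) {
                int count = 0;
                for (int i = 1; i < x.length; i++) { // FAULT: should start at i = 0
                  if (x[i] == 0) {
                    count++;
                  }
                }
                return count;
              }

              public static void main(String[] args) {
                // Fault executed and error state present, but it does not reach the output: no failure.
                System.out.println(numZero(new int[] {2, 7, 0})); // prints 1 (expected 1)
                // Fault executed and the error propagates to the output: a failure is observed.
                System.out.println(numZero(new int[] {0, 7, 2})); // prints 0 (expected 1)
              }
            }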

Some early conclusions can already be identified: a fault does not necessarily lead to a failure, and a failure is only observed when the error condition propagates to the output; this is why some faults can remain undetected for a long time.

Software Quality Assurance

Introduction to SQA

Software Quality Assurance (SQA) is the set of methods used to improve internal and external qualities. SQA aims at preventing, identifying and removing defects throughout the development cycle as early as possible, as such reducing test and maintenance costs.

SQA consists of a systematic, planned set of actions necessary to provide adequate confidence that the software development process or the maintenance process of a software system product conforms to established functional and technical requirements, as well as to the managerial requirements of keeping to the schedule and operating within the budget.

The ultimate target of the SQA activities is that few, if any, defects remain in the software system when it is delivered to its customers or released to the market. As it is virtually impossible to remove all defects, another aim of QA is to minimize the disruptions and damages caused by these remaining defects.

The SQA methodology also depends on the software development methodology used, as they are inherently coupled. For instance, different software development models will focus the test effort at different points in the development process. Newer development models, such as Agile, often employ test-driven development and place an increased portion of the testing in the hands of the developer, before it reaches a formal team of testers. In a more traditional model, most of the test execution occurs after the requirements have been defined and the coding process has been completed.

An example of an SQA methodology is available at [[IEEE-QA-TEMPLATE]].

SQA activities are not only carried out by the Software Quality group; the software engineering group is responsible for putting in place the SQA methodology defined, which may include different activities such as testing, inspections, reviews...

SQA Activities

Classification of SQA Activities

The activities that are carried out as part of the SQA process can be divided into three different categories.

  1. Defect Prevention: Defect prevention consists of preventing certain types of faults from being injected into the software. As explained in the previous section, a fault results from missing or incorrect human actions that lead to error situations in the software. There are two generic ways to prevent defects:
    • Eliminating certain fault sources such as ambiguities or human misconceptions
    • Fault prevention or blocking: Breaking the causal relation between error sources and faults through the use of certain tools and technologies.
  2. Defect reduction: Consists of removing faults from the software through fault detection and removal. These QA alternatives detect and remove certain faults once they have been injected into the software system. The most traditional QA activities fall into this category, such as:
    • Inspection: directly detects and removes faults from the software code, design, etc.
    • Testing: removes faults based on related failure observations during program execution.
  3. Defect containment: Consists of minimizing the impact of software faults. The most important techniques in this area are:
    • Fault-tolerance techniques: Try to break the causal relationship between faults and failures. E.g. ensuring that error conditions do not lead to a software failure.
    • Containment measures: Once the error has occurred, if there is no way to prevent the failure, ideally, it should be possible to perform some actions to minimize the impact and consequences of the failure.

The following sections explain in detail these SQA activities.

Fault Prevention

The main goal of these activities is reducing the chance for defect injections and the subsequent cost to deal with these injected defects.

Most of the defect prevention activities assume that there are known error sources or missing/incorrect actions that result in fault injections. Each error source calls for a different countermeasure:

  • If the error source is human misconceptions: education and training.
  • If the error source is imprecise designs & implementations: formal methods.
  • If the error source is non-conformance to standards: standard enforcement.
  • If the error source is a lack of adequate tools and techniques: technique and tool adoption.
Education and Training

People are the most important factor that determines the quality and, ultimately, the success or failure of most software projects. Hence, it is important that the people involved in software planning, design and development have the right capabilities for doing their jobs. The education and training effort for error source elimination should focus on the following areas:

  • Product and domain specific knowledge.
  • Software development knowledge and expertise.
  • Knowledge about Development methodology, technology, and tools.
  • Development process knowledge.
Formal Methods

Formal methods provide a way to eliminate certain error sources and to verify the absence of related faults. Formal development methods, or formal methods in short, include formal specification and formal verification.

  • Formal specification is concerned with producing an unambiguous set of product specifications. An unclear specification implies that the software target and behaviour may depend on the interpretation of the developer, and as a result the likelihood of introducing defects is higher.
  • Formal verification checks the conformance of software design or code against these formal specifications, thus ensuring that the software is fault-free with respect to its formal specifications.

Fault Removal

Even if the best software developers in the world are involved in a software project, and even if they follow the formal methods described in the previous section, some faults will be injected in the software code. Due to that, defect prevention needs to be complemented with other techniques focused on removing as many of the injected faults as possible under project constraints.

Fault distribution is highly uneven for most software products, regardless of their size. Much empirical evidence has accumulated over the years to support the so-called 80:20 rule, which states that 20% of the software components are responsible for 80% of the problems (Pareto Law). There is a great need for risk identification techniques to detect the areas in which the fault removal activities should be focused.

There are two key activities that deal with fault removal: Code Inspection and Testing.

Inspections

Software inspections were first introduced by Michael E. Fagan in the 1970s, when he was a software development manager at IBM. Inspections are a means of verifying intellectual products by manually examining the developing product, a piece at a time, in small groups of peers, to ensure that it is correct and conforms to product specifications and requirements. Inspections may be done on the software code itself and also on other related items such as design or requirements documents.

Code inspections should check for technical accuracy and completeness of the code, verify that it implements the planned design, and ensure good coding practices and standards are used. Code inspections should be done after the code has been compiled and all syntax errors removed, but before it has been unit tested.

There are different kinds of inspections depending on factors such as the formality (formal vs. informal), the size of the team (peer review, team review), and whether it is guided or not. The type of inspection to be done depends on the software to be reviewed, the team involved and the target of the review.

Regardless of the inspection type used, there are clear benefits when inspections are used. For instance, according to Bell-Northern Research, the cost of detecting a defect is much lower with inspections (1 hour per defect) than with testing (2-4 hours per defect).

More information about inspections can be found at [[INSPECTIONS-AND-REVIEWS]] and [[TRUTHS-PEER-REVIEWS]] and in the last chapter.

Testing

Testing is the execution of software and the observation of the program behaviour and outcome. As in the case of software inspections, there are different kinds of testing, usually applied in different phases of the software development process.

Some of the most typical testing types are:

  • Unit Testing: individual units of source code are tested to determine if they are fit for use. A unit is the smallest testable part of an application.
  • Module Testing: A complete module is tested to determine if it fulfils its requirements.
  • Integration Testing: Any type of software testing that seeks to verify the interfaces between modules against a software design.
  • System Testing: It tests a completely integrated system to verify that it meets its requirements
  • Acceptance Testing: Testing performed often in production or pre-production environment to check that the software is ready for being delivered and deployed.

A concept tightly related to testing (although applicable in other areas such as reviews) is the handling of defects. In particular, it is very important that the defects detected are properly recorded (defect logging) with all the relevant information, as in many situations finding the fault behind an observed failure is not trivial. It is also very important that the issues detected are monitored, so that everybody knows the status of every defect after the initial discovery (defect tracking).
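As an illustration, a defect log entry could record information along the lines of the following sketch (the class and field names are assumptions for the example, not a standard):

            import java.time.LocalDate;

            // Minimal sketch of a defect log entry (field names are assumptions, not a standard),
            // recording enough information to reproduce the failure and track the defect status.
            public class DefectRecord {
              enum Status { NEW, CONFIRMED, IN_PROGRESS, FIXED, VERIFIED, CLOSED }

              String id;                // unique identifier, e.g. "BUG-1024"
              String summary;           // short description of the observed failure
              String stepsToReproduce;  // crucial when tracing the failure back to its fault
              String detectedInVersion; // build or version in which the failure was observed
              Status status = Status.NEW;
              LocalDate reportedOn = LocalDate.now();
              String assignee;          // who is responsible for the next action
            }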

Defect Containment

The defect reduction activities can only reduce the number of faults to a fairly low level, but not eliminate them completely. For instance, in many situations the number of possible situations is so large that it is impossible to test all of them, especially those linked to rare conditions or unusual dynamic scenarios.

Depending on the purpose of the software, the remaining faults, and the failure risk due to them, may still be unacceptable, so some additional QA techniques are needed:

  • Software fault tolerance.
  • Failure containment.

For instance, the software used in the flight control systems is one example of software with very extreme requirements about failures. The report [[CHALLENGES-FAULT-TOLERANT-SYSTEMS]] provides more details about the challenges that this kind of systems pose to software developers.

Software Fault Tolerance

Software fault tolerance ideas originate from fault tolerance designs in traditional hardware systems that require higher levels of reliability, availability, or dependability.

All fault tolerance systems must be based on the provision of useful redundancy that allows switching between components when one of them fails (due to software or hardware faults). That implies that there have to be some extra components, which ideally should have a different design to avoid the same error happening twice. Based on how those redundant components are structured and used (e.g. when to switch from one to another), there are different kinds of systems:

  1. Recovery blocks: Use repeated executions (or redundancy over time) as the basic mechanism for fault tolerance. The software includes a set of "recovery points" in which the status is recorded so that it can be used as a fallback in case something goes wrong. When a piece of code is executed, an "acceptance test" is internally run; if the result is OK, a new "recovery point" is set up; if the result is not acceptable, the software returns to the previous "recovery point" and an alternative to the faulty code is executed. This process continues until the "acceptance test" is passed or no more alternatives are available, which leads to a failure. Some key characteristics of this scheme (depicted in the figure below; a minimal code sketch of both schemes follows this list):
    • It is a backward error recovery technique: when an error occurs, appropriate actions are taken to react, but no preventive action is taken.
    • It is a "serial technique" in which the same functionality (the recovery block) is never executed in parallel.
    • The "acceptance test" algorithm is critical to success, as is the availability of recovery blocks designed in different ways from the original code.
    Recovery Blocks
  2. NVP (N-version programming):

    This technique uses parallel redundancy, where N copies, each of a different version of code fulfilling the same functionality, run in parallel with the same inputs. When all of those N copies have completed the operation, an adjudication process (decision unit) takes place to determine the output (based on a more or less complex vote).

    Some key characteristics of this scheme (depicted in the figure below):

    • It is a forward error recovery technique: preventive actions are taken. Even if no error occurs the same functionality is executed multiple times.
    • It is a "parallel technique" in which the same functionality is always executed in parallel by different versions of the same functionality.
    • The "decision unit" algorithm is the critical part to success as well as the availability of different versions of the same code designed in different ways.
    N version programming

    Obviously, a wide range of variants of those systems have been proposed, based on multiple combinations of them [[COST-EFFECTIVE-FAULT-TOLERANCE]], and multiple comparisons of their performance are also available [[PERFORMANCE-RB-NVP-SCOP]].
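To make the two schemes more concrete, the following Java sketch (illustrative only; the acceptance test, versions and vote are invented for the example) shows a serial recovery block driven by an acceptance test and a trivial N-version decision unit based on a majority-style vote:

            import java.util.Arrays;
            import java.util.List;
            import java.util.function.IntPredicate;
            import java.util.function.IntSupplier;

            // Illustrative sketches of the two schemes; the acceptance test and versions are invented.
            public class FaultToleranceSketch {

              // Recovery blocks: try the alternatives serially (backward recovery) and return the
              // first result that passes the acceptance test; if none does, the system fails.
              static int recoveryBlock(IntPredicate acceptanceTest, List<IntSupplier> alternatives) {
                for (IntSupplier alternative : alternatives) {
                  // A real system would restore the state saved at the last recovery point here.
                  int result = alternative.getAsInt();
                  if (acceptanceTest.test(result)) {
                    return result; // acceptance test passed: establish a new recovery point
                  }
                }
                throw new IllegalStateException("All alternatives failed: system failure");
              }

              // N-version programming: run every version (conceptually in parallel, forward recovery)
              // and let a decision unit adjudicate; here, the median acts as a simple majority vote.
              static int nVersionVote(List<IntSupplier> versions) {
                int[] outputs = versions.stream().mapToInt(IntSupplier::getAsInt).toArray();
                Arrays.sort(outputs);
                return outputs[outputs.length / 2];
              }

              public static void main(String[] args) {
                // The primary alternative returns an unacceptable (negative) value, so the fallback is used.
                System.out.println(recoveryBlock(r -> r >= 0, List.of(() -> -1, () -> 42))); // 42
                // One of the three versions disagrees; the vote masks the faulty version.
                System.out.println(nVersionVote(List.of(() -> 7, () -> 7, () -> 9)));        // 7
              }
            }

Note how the recovery block only runs an alternative when the previous one fails the acceptance test, whereas N-version programming always runs every version and pays that cost up front.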

What do you think are the key advantages and disadvantages of the two fault tolerance techniques described (Recovery Blocks & N-Version)?
Exercise 3: Recovery Blocks vs. N-Version

Failure Containment

Some software is used in safety-critical systems, where a failure can have severe consequences. In those situations it is very important to avoid the potential accidents or, at least, to limit the damage they cause.

Various specific techniques are used for this kind of systems, most of them based on the analysis of the potential hazards linked to the failures:

  • Hazard Elimination through substitution, simplification, decoupling, elimination of specific human errors and reduction of hazardous materials or conditions. These techniques reduce certain defect injections or substitute non-hazardous ones for hazardous ones. The general approach is similar to the defect prevention and defect reduction techniques surveyed earlier, but with a focus on those problems involved in hazardous situations.
  • Hazard Reduction through design for controllability (for example, automatic pressure release in boilers), use of locking devices (for example, hardware/software interlocks), and failure minimization using safety margins and redundancy. These techniques are similar to fault tolerance, where local failures are contained without leading to system failures.
  • Hazard control through reducing exposure, isolation and containments (for example barriers between the system and the environment), protection systems (active protection activated in case of hazard), and fail-safe design (passive protection, fail in a safe state without causing further damages). These techniques reduce the severity of failures, therefore weakening the link between failures and accidents.
  • Damage control through escape routes, safe abandonment of products and materials, and devices for limiting physical damage to equipment or people. These techniques reduce the severity of accidents, thus limiting the damage caused by these accidents and related software failures.

Notice that both hazard control and damage control above are post-failure activities that attempt to "contain" the failures so that they do not lead to accidents, or so that the accident damage can be controlled or minimized. All these techniques are usually very expensive and process/technology intensive; hence they should only be applied when safety matters and to deal with the rare conditions related to accidents.

Software Quality Engineering

Whereas Quality Assurance defines a set of methods to improve Software Quality, it does not define other aspects that are key to ensuring that good quality software is delivered, such as which quality level to target, which QA activities to perform and when, and how quality is measured and assessed.

In order to address these questions, the QA activities should be considered not in an isolated manner, but as part of a full engineering problem. Software Quality Engineering is the discipline that defines the processes to ensure high quality products. QA activities are only a part of that process, which requires further activities such as Quality Planning, Goal Setting or Quality Assessment. The figure below provides an overview of the typical SQE cycle:

SQE Cycle

Pre-QA Activities: Quality Planning

Before doing any QA activity, it is important to consider aspects such as the target quality, the most appropriate QA activities to be performed and when they should be performed, and how quality is going to be measured. All those activities are usually called Pre-QA or Quality Planning activities.

The first activity that should be done in SQE is defining the specific quality goals for the software to be delivered. In order to do so, it is important to understand the expectations of the software end-user/customer. Obviously, it is also key to recognize that the budget is limited and that the quality target should be financially feasible. The following activities are key to identifying the target quality of the software:

  1. Identify quality views and attributes meaningful to target customers and users. Which aspects will be key for them to perceive the software as a high-quality one? This may depend a lot on the type of product and on the target customers.
  2. Select direct quality measures that can be used to measure those quality attributes that are key for the customers.
  3. Quantify these quality measures to set quality goals while considering the market environment and the cost of achieving different quality goals.

Once that the quality goals are clear, the QA strategy should be defined. Two key decisions should be made during this stage:

  1. Which QA activities are the most appropriate to meet the customers' quality expectations? To decide this, it is important to translate the quality views, attributes and goals into the QA activities to be performed. It is also very important to determine when each QA activity is going to be executed as part of the full Software Development Process.
  2. The external quality measures should be mapped into internal indirect ones via selected quality models. Good models are required in order to predict external quality based on internal indicators. It is also very important to identify how the results of this measure are going to be collected and used (e.g. what happens if the quality is not good enough or how the feedback is going to be used).

In-QA Activities

These activities have been described in section 1.4.2 and basically consist of executing the QA activities planned and handling the defects discovered as a result of them.

Post-QA Activities

These activities consist of measuring the quality of the software (after the QA activities), assessing the quality of the software product, and defining the decisions and actions needed to improve its quality.

All these activities are usually carried out after normal QA activities have started but as part of these "normal" QA activities. Their goal is to provide feedback so that decisions can be made and improvements can be suggested. The key activities include:

  • Measurement: Besides the direct measure of tracking the defects during the in-QA activities, various other measurements are needed in order to track the QA activities and for project management purposes. The data resulting from this analysis is important to manage the software project and its quality.
  • Analysis and Modelling: These activities analyse measurement data from software projects and fit them to analytical models that provide quantitative assessment of selected quality characteristics and sub-characteristics. This is key to obtain an objective assessment of the current product quality, predict future quality or identify problematic areas.
  • Providing feedback and identifying improvement potentials: The results of the previous activities can lead to some suggestions to improve the process followed with the software being assessed (e.g. more testing resources are needed, test cases are not sufficient...) or the general SQE methodology.
  • Follow-up Activities: Besides immediate actions, some actions resulting from the analysis may require a longer time. For instance, if major changes are suggested to the SQE process, they usually cannot be implemented until the current process has finished.

Quality Improvement Process (QIP)

The overall framework for quality improvement is called QIP, and it includes three interconnected steps:

The figure below graphically describes the flow of those steps in relation to the SQE process.
QIP Flow

The Deming Quality Cycle

In the 1950s, W. Edwards Deming proposed that business processes should be analysed and measured to identify sources of variation that cause products to deviate from customer requirements. He recommended that business processes be placed in a continuous feedback loop so that managers can identify and change the parts of the process that need improvement. As a teacher, Deming created a (rather oversimplified) diagram to illustrate this continuous process, commonly known as the PDCA cycle, for Plan, Do, Check, Act:

  • Plan Quadrant: one defines the objectives and determines the conditions and methods required to achieve them.
  • Do Quadrant: the conditions are created and the necessary training to execute the plan is performed (new procedures). The work is then performed according to these procedures.
  • Check Quadrant: One must check to determine whether work is progressing according to the plan and whether the expected results are obtained.
  • Action Quadrant: If the checkup reveals that the work is not being performed according to plan or results are not what was anticipated, measures must be devised for appropriate action.

Deming's PDCA cycle can be illustrated as in the figure below:

PDCA Circle

By going around the PDCA circle, the working methods are continuously improved, as are the results obtained. However, it is important to avoid a situation called the "spiral of death", which happens when an organization goes around and around the quadrants without ever actually bringing a system into production.

QE in Software Development Process

The quality engineering process cannot be considered in an isolated manner, but as part of the overall software engineering process. For instance, most of the SQE activities should be included as part of the Software Development activities (see the figure below):

SQE and Software Development

However, it should be considered that SQE activities have different timing requirements, activities and focus. For instance, the figure below represents the typical effort spent on the different quality activities over the software development timeline.

Focus of SQE Activities during the development process

Focusing on the QA activities, in a typical waterfall development model, the figure below provides an estimate of the key QA activities performed during each project phase:

Focus of QA Activities during the development process

Another important aspect to consider is that some quality measurements cannot be taken until it is already too late. For example, for safety-critical systems, post-accident measurements provide a direct measure of safety, but because of the damage linked to those accidents, they should be avoided by all means. In order to take early measures, appropriate models that link quality measures taken during the development process with the end product quality are needed. Last but not least, it should be stressed that the cost of fixing problems grows the later they are fixed, because a hidden problem may lead to other related problems, and the longer it stays in the system, the more difficult it is to discover.

The cost of quality

In section 1.1, some of the implications of bad quality software were introduced. The cost of poor quality (COPQ) is not the only cost that Software Quality Engineering should take into account. The cost of good quality (COGQ), linked to SQA activities (e.g. testing or code inspections), should not be underestimated and must be considered when the total quality cost is assessed.

As in the case of external and internal quality, the different costs linked to quality have been represented by some authors as an iceberg, in which some of the costs are easy to identify (e.g. testing costs, customer returns...) while others are not always taken into account (e.g. unused capacity, excessive IT costs...). In [[COST-OF-QUALITY]] there is a detailed analysis of this approach to identifying quality costs.

Bad and Good Software Quality Cost

Software Quality Metrics

"Quality metrics let you know when to laugh and when to cry", Tom Gilb

"If you can't measure it, you can't manage it", Deming

"Count what is countable, measure what is measurable. What is not measurable, make measurable", Galileo

These are just some sample quotes about the importance of measuring in general. Obviously, the capability of quantifying characteristics of a product is extremely helpful to manage that product. However, it is also important to stress that it is essential to understand the attributes that are being measured, so that the metric doesn't end up being just a number but a proper indicator with a very clear meaning. Additionally, we studied in Unit 1 that quality is an extremely subjective thing, so we should not assume that every aspect related to product quality can be quantified, or at least quantified easily. Albert Einstein put it very nicely in his famous sentence: "Not everything that can be counted counts, and not everything that counts can be counted."

Also, we should bear in mind that the act of measuring some software attribute, is not intended to improve that metric but firstly to understand its impact and its validity as an indicator for some software characteristic. A typical mistake is trying to improve any metric you are calculating in your projects. Doing this for the sake of it is a mistake, as Goodhart explained in his law: "When a measure becomes a target, it ceases to be a good measure". Imagine you are working in a development team and you are told that you need to increase the number of defects found during the development cycles. What is likely to happen is that the team is going to start reporting anything they could suspect is a bug, it doesn't matter how small, difficult to detect, or difficult to reproduce it is. At the end of the day, the team has been asked to find more bugs, and that is what they are going to do!

Introduction

What does this all mean? It means that we should try to get metrics, but more importantly, we need to understand how those metrics affect the quality of the product, and how they can become indicators of certain software attributes.

If we do that, we can use those metrics to improve our process and increase product quality over releases, predict potential issues (e.g. last time we got these indicators, the product was a disaster in the field), re-use successful experiences (if a product worked extremely well, check its metrics and what made it different), etc. In summary, we should use metrics to understand first and improve afterwards.

But what is the relationship between the terms measurement, metric and indicator?

  
Measurement, Metrics, Indicator

Collect (Data)
  • Measure: a quantitative indication of the exact amount (dimension, capacity, etc.) of some attribute.
  • Measurement: the act of determining a measure.
  • Examples: 120 detected bugs; 12 months of project duration; 10 engineers working on the project; 100,000 Lines of Code (LOC).

Calculate (Metrics)
  • Metric: a quantitative measure of the degree to which a system, component or process possesses a given attribute.
  • Examples: 2 bugs found per engineer-month; 1.2 bugs per KLOC.

Evaluate (Metrics)
  • Indicator: a metric that provides insight into the product, process or project.
  • Examples: bugs found per engineer-month might be an indicator of test process efficiency; bugs/KLOC might be an indicator of code quality.

There are three main kinds of metrics related to software: product metrics, which describe the characteristics of the product itself; process metrics, which describe the software development process; and project metrics, which describe the execution of the project (cost, schedule, productivity...).

Example of measurements:

          * 120 defects detected during 6 months by 2 engineers
          * Defects detected every month: 10, 10, 20, 20, 25, 35
          * Defects remaining in the final product: 40
          * Size of the Product: 40,000 Lines of Code (40 KLOC)
        

Metrics and Indicator Examples:

          * Process Metric: Defect Arrival Pattern per month: 10, 10, 20, 20, 25, 35 -> Indicator of Maturity
          * Project Metric: 40 KLOC / (2 engineers x 6 months) = 3.3 KLOC per eng-month -> Indicator of Productivity
          * Product Metric: 40 defects / 40 KLOC = 1 defect / KLOC -> Indicator of Quality
        

Software quality metrics are a subset of software metrics that focus on the quality aspects of the product, process, and project. Software quality metrics can be divided further into end-product quality metrics and in-process quality metrics.

In general, the quality of a developed product (end-product metrics) is influenced by the quality of the production process (in-process metrics). Identifying the link between those two types of metrics is essential for software development, as end-product metrics can most of the time only be obtained when it is too late (i.e. the product is already on the market). However, establishing the link between both types of metrics is hard and complex, as in most cases the relationship is poorly understood.

The model linking process and product for manufactured goods is in most cases simple. However, for software this model is in general more complex, because the influence of the humans involved in software development is much higher than in goods manufacturing, and the degree of automation is smaller in software development than in manufacturing.

As engineers, our target is:

The ultimate goal of software quality engineering is to investigate the relationships among in-process metrics, project characteristics, and end-product quality, and based on these findings to engineer improvements in both process and product quality.

Product quality metrics

Intrinsic Product Quality Metrics

Reliability, Error Rate and Mean Time To Failure

Software reliability is a measure of how often the software encounters an error that leads to a failure. From a formal point of view, reliability can be defined as the probability of not failing during a specified length of time:

R(n) (where n is the number of time units)

The probability of failing in a specified length of time is 1 minus the reliability for that length of time and it's usually denoted by a capital F letter:

F(n) = 1 - R(n)

If time is measured in days, R(1) is the probability of the software system having zero failures during one day (i.e. the probability of not failing in 1 day)

Two metrics related to software reliability are the "Error Rate" and the "Mean Time To Failure" (MTTF). The MTTF can be defined as the average time elapsed between two system failures. The Error Rate is the average number of failures suffered by the system during a given amount of time. Both metrics are related by the following formula:

Error Rate = 1 / MTTF

The relationship between the error rate and the reliability depends on the statistical distribution of the errors, not only on the error rate.

For instance, the following table shows the errors that occurred per day in two different systems during one week.

 
Example of same error rate with different distribution
Defects per Day
DAY 1 DAY 2 DAY 3 DAY 4 DAY 5 DAY 6 DAY 7
Project A 1 2 3 4 5 6 7
Project B 7 6 5 4 3 2 1

It can be seen that both systems suffered the same number of errors during the week (28) and hence the error rate for both systems is the same: 28/7 = 4 errors/day. However, the reliability of each system for the first day is very different.

Unless detailed statistics or models are available, the best estimate of the short-term future behaviour is the current behaviour. For instance, if a system suffers 24 failures during one day, the best estimate for the next day is that 24 failures will occur (24 errors/day), which corresponds to a 1-hour MTTF. That means that, by default, we can assume that system failures follow an exponential distribution, and hence the following formula can be used to calculate the reliability:

R(t) = e^(−λt)

Where λ is the error rate and t is the amount of time for which the system reliability is calculated. A key property of the exponential distribution is that the error rate is constant and hence does not change over time. The following figures show the Probability and Cumulative Density Functions of exponential curves with different values of λ.

Probability Density Function for Exponential Distributions
Cumulative Density Function for Exponential Distributions
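As an illustration of the exponential model, the following C sketch computes R(t) = e^(−λt) for the 24-failures-per-day example mentioned above (λ = 1 failure/hour). The time spans evaluated are arbitrary; this is only a minimal sketch of the formula.

    /* Sketch: reliability under an exponential failure model, R(t) = exp(-lambda * t).
       Uses the 24-failures-per-day example from the text (lambda = 1 failure/hour). */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double lambda = 24.0 / 24.0;          /* failures per hour (24 failures in 24 hours) */
        double hours[] = {0.25, 1.0, 8.0};    /* time spans to evaluate (illustrative) */

        for (int i = 0; i < 3; i++) {
            double r = exp(-lambda * hours[i]);   /* probability of zero failures in t hours */
            printf("R(%.2f h) = %.4f, F(%.2f h) = %.4f\n",
                   hours[i], r, hours[i], 1.0 - r);
        }
        return 0;
    }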

However, a constant error rate that does not change over time is not very common in the real world. For instance, in hardware components the error rate evolves over time in different ways:

  • During the early life of the product the error rate starts at a high value, but it tends to go down gradually over time.
  • After the early life of the product passes (its duration depends on the product), the error rate tends to be stable during a period called the Useful Time.
  • After the useful time is over, the wear-out period starts. During that period, the error rate keeps growing until the hardware component fails.
Error Rate evolution in hardware products
A canonical example of this behaviour is the lightbulb. In order to minimize the number of failures users suffer with lightbulbs, right after producing them the manufacturers keep them switched on for a period similar to the early life of the product. By doing this, they guarantee that the lightbulbs that reach the customers' hands do so at a time when the error rate is as low as possible. Any component with early failures is discarded before it can get to the end-users.

You can find a lot of information about reliability and how maths is used for calculating it at [[RELIABILITY-MATHS]]

Although the exponential distribution may be a good compromise that can be applied to any software system, there are other distributions that may describe more accurately the idea of a non-constant error rate. For instance, the Weibull distribution is frequently used in reliability analysis [[WEIBULL-BASICS]].

In a Weibull distribution, the error rate can change over time. The reliability function is:

R(t) = e^(−(t/η)^β)

It depends on two parameters, η and β, that define the shape of the distribution function. For a fixed value of η, the failure rate can be constant (β = 1), decreasing (β < 1) or increasing (β > 1). Note that by combining those three options we can describe the phases of the hardware component failure rate.

Weibull Error Rate depending on β
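The following C sketch evaluates the Weibull reliability function for three values of β (below, equal to, and above 1), corresponding to decreasing, constant and increasing error rates. The value of η and the time points are arbitrary illustrative choices, not taken from any real system.

    /* Sketch: Weibull reliability R(t) = exp(-(t/eta)^beta) for different shape parameters. */
    #include <math.h>
    #include <stdio.h>

    static double weibull_reliability(double t, double eta, double beta) {
        return exp(-pow(t / eta, beta));
    }

    int main(void) {
        double eta = 100.0;                 /* characteristic life (arbitrary units) */
        double betas[] = {0.5, 1.0, 2.0};   /* decreasing, constant and increasing error rate */

        for (int b = 0; b < 3; b++) {
            printf("beta = %.1f:", betas[b]);
            for (double t = 25.0; t <= 100.0; t += 25.0)
                printf("  R(%.0f) = %.3f", t, weibull_reliability(t, eta, betas[b]));
            printf("\n");
        }
        return 0;
    }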

If we take into account software upgrades, there are some interesting analyses about how a sawtooth pattern is observed [[SOFTWARE-RELIABILITY]]. Again, such a curve could be described by combining different Weibull distributions.

Sawtooth pattern of the error rate

Defect Density

Defect Density is the number of confirmed defects detected in a software product or component during a defined period of development/operation, divided by the size of the software/component.

Defect Density = Number of Confirmed Defects / Software Size

The "defects" are usually counted as confirmed and agreed defects (not just reported). For instance, dropped defects are not counted.

The "period" or metrics time frame, might be for one of the following:

  • for a duration (say, the first month, the quarter, or the year).
  • for each phase of the software life cycle.
  • for the whole software life cycle, usually known as Life Of Product (LOP), which may include time after the software product's release to the market.

The "opportunities for error" (OFE) or sofware "size" is measured in one of the following:

  • Source Lines of Code that are usually counted as thousands of Lines Of Code (KLOC)
  • Function Points (FP)

In the following chapters both ways of measuring OFE will be studied separately.

Lines of Code

Counting the lines of code (LOC) is far more complex than it might initially seem. The major problem when counting lines of code comes from the ambiguity of the operational definition, the actual counting. In the early days of Assembler programming, in which one physical line was the same as one instruction, the LOC definition was clear. With the availability of high-level languages the one-to-one correspondence broke down. Differences between physical lines and instruction statements (or logical lines of code) and differences among languages contribute to huge variations when counting LOCs. Even within the same language, the methods and algorithms used by different counting tools can cause significant differences in the final counts. Multiple variations were already described by Jones in 1986, such as:

  • Count only executable lines.
  • Count executable lines plus data definitions.
  • Count executable lines, data definitions, and comments.
  • Count executable lines, data definitions, comments, and job control language.
  • Count lines as physical lines on an input screen.
  • Count lines as terminated by logical delimiters.

For instance, the next example includes two approaches for coding the same functionality. As the functionality is the same, and it is written in the same language, the opportunities for error should be the same; however, the lines of code differ. If we count all the lines (job control language, comments...), in the first case only one line of code is used, whereas in the second case six lines are used.

                  for (i=0; i<100; ++i) printf("I love compact coding"); /* what is the number of lines of code in this case? */

                  /* How many lines of code is this? */
                  for (i=0; i<100; ++i)
                  {
                    printf("I am the most productive developer"); 
                  }
                  /* end of for */
                

Some authors have considered LOC not only a less useful way to measure software size but also harmful for software economics and productivity. For instance, the paper written by Capers Jones called "A Short History of Lines of Code (LOC) Metrics" [[LOC-HISTORY]] offers a very interesting historical view of the evolution of software programming languages and LOC metrics.

Regardless of the LOC measurement used, when a software product is released to the market for the first time, and when a certain way to measure lines of code is specified, it is relatively easy to state its quality level (projected or actual). However, when enhancements are made and subsequent versions of the product are released, the measurement is more complicated. In order to have good insight into product quality it is important to follow a two-fold approach:

  • Measure the quality of the entire product.
  • Measure the quality of the new/changed parts of the product.

The first measure may improve over releases due to aging and defect removal, but that improvement in the overall defect rate may hide problems in the development/quality process (e.g. new code containing a higher defect density than the "old" code, which indicates a problem in the process). In order to be able to calculate the defect rate for the new and changed code, the following must be available:

  • LOC count: The entire software product as well as the new and changed code of the release must be available.
  • Defect tracking: Defects must be tracked to the release origin, i.e. the portion of the code that contains the defects and at what release the portion was added, changed, or enhanced. When calculating the defect rate of the entire product, all defects are used; when calculating the defect rate for the new and changed code, only defects of the release origin of the new and changed code are included.

These tasks are enabled by the practice of change flagging. Specifically, when a new function is added or an enhancement is made to an existing function, the new and changed lines of code are flagged. The change-flagging practice is also important for the developers who deal with problem determination and maintenance. When a defect is reported and the fault zone determined, the developer can determine in which function or enhancement, pertaining to what requirements and at what release origin, the defect was injected. The following is an example of how the overall defect rate and the defect rate for new code are measured at IBM, according to the book "Metrics and Models in Software Quality Engineering" by Stephen H. Kan.

In the first version of a software product, 30 defects were reported by end-users and the software size was 30 KLOC. After fixing all the discovered bugs, the team works on a new version that includes 10 new KLOC. End-users report 10 additional defects in this new version, all of which were injected in the new 10 KLOC.

The Defect Density of the first version was 1 defect/KLOC.

If we calculate the Defect Density of the second version in the same way, it would be: DD = 10/40 = 0.25 defects/KLOC. Comparing these values, we could conclude that the second version was far better than the first one.

But this could be misleading, because the defects of the second version lie only in the new 10 KLOC, not in the old code. We could calculate the same metric counting only the new lines of code. If we do so, the result would be: DD = 10/10 = 1 defect/KLOC, which is the same as for the first release.

We could conclude that, for end-users, the second version is going to be a significant improvement, as the number of defects they are going to perceive is smaller both in absolute and in relative terms. However, the team has been doing a similar job in terms of defects remaining after releasing the product (defect injection and detection).
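The example above can be reproduced with a few lines of code. The following C sketch computes the defect density of the whole product and of the new code only, using the figures from the example; the variable names are illustrative.

    /* Sketch: overall defect rate vs. defect rate of new and changed code only. */
    #include <stdio.h>

    int main(void) {
        /* Release 1 */
        double kloc_v1 = 30.0;
        int defects_v1 = 30;

        /* Release 2: 10 new KLOC, 10 defects, all injected in the new code */
        double new_kloc_v2 = 10.0;
        int defects_v2 = 10;
        double total_kloc_v2 = kloc_v1 + new_kloc_v2;

        printf("Release 1 defect density: %.2f defects/KLOC\n", defects_v1 / kloc_v1);
        printf("Release 2, whole product: %.2f defects/KLOC\n", defects_v2 / total_kloc_v2);
        printf("Release 2, new code only: %.2f defects/KLOC\n", defects_v2 / new_kloc_v2);
        printf("Release-to-release defect reduction seen by customers: %.1f%%\n",
               100.0 * (defects_v1 - defects_v2) / defects_v1);
        return 0;
    }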

It is important to think about how useful this metric is from two points of view:

  • Drive Quality Improvement: Very important for the development team.
  • Meet customer expectations.

From the customer's point of view, the defect rate is not as relevant as the total number of defects that might affect their business. Therefore, a good defect rate target should lead to a release-to-release reduction in the total number of defects, regardless of size; not only should the defect rate be reduced, but also the total number of defects. If a new release is larger than its predecessors, the defect rate goal for the new and changed code has to be significantly better than that of the previous release in order to reduce the total number of defects.

In the example above, from the initial release to the second release the defect rate didn't improve. However, customers experienced a 66% reduction [(30 - 10)/30] in the number of defects because the second release is smaller.

Function Points

As explained in the previous chapter, measuring the opportunities for error through lines of code has some problems. Counting lines of code is but one way to measure size. An alternative is using function points. In recent years the function point has been gaining acceptance in application development in terms of both productivity (e.g., function points per person-year) and quality (e.g., defects per function point).

A function can be defined as a collection of executable statements that performs a certain task, together with declarations of the formal parameters and local variables manipulated by those statements. The ultimate measure of software productivity is the number of functions a development team can produce given a certain amount of resources, regardless of the size of the software in lines of code. The defect rate metric, ideally, is indexed to the number of functions a software product provides. If the number of defects per unit of function is low, then the software should have better quality, even though the defects per KLOC value could be higher when the functions were implemented in fewer lines of code. Although this approach seems very powerful and promising, from a practical point of view it is very difficult to use.

The function point metric was originated by Albrecht and his colleagues at IBM in the mid-1970s. The name can be a bit misleading, as the technique itself does not count functions. Instead, it tries to measure some aspects that determine the software complexity without taking into account the differences between programming languages and development styles that distort the LOC metric. In order to do so, it takes into account five major components that comprise a software product:

  • External Inputs (EIs): Elementary process in which data crosses the boundary from outside to inside. This data may come from a data input screen or another application. The data may be used to maintain one or more internal logical files. The data can be either control information or business information. If the data is control information it does not have to update an internal logical file.
  • External Outputs (EOs): Elementary process in which derived data passes across the boundary from inside to outside. Additionally, an EO may update an ILF. The data creates reports or output files sent to other applications. These reports and files are created from one or more internal logical files and external interface files.
  • External Inquiries (EQs): Elementary process with both input and output components that result in data retrieval from one or more internal logical files and external interface files. The input process does not update any Internal Logical Files, and the output side does not contain derived data.
  • Internal Logical Files (ILFs): A user-identifiable group of logically related data that resides entirely within the application's boundary and is maintained through external inputs.
  • External Interface Files (EIFs): A user-identifiable group of logically related data that is used for reference purposes only. The data resides entirely outside the application and is maintained by another application. The external interface file is an internal logical file for another application.

The following figure provides a graphical example of how all these components work together and how they interact with the end-users.

Function Points Overview

Apart from being technology independent, this way of identifying the key software functions is very interesting because it is focused on the end-user's point of view: most of the components are defined from the user's perspective (not the developer's), hence it works well with use cases.

The number of function points is obtained by adding the number of occurrences of those components (each of them weighted by a different factor) and multiplying the result by an adjustment factor chosen based on the software characteristics:

FP = FC x VAF

Where:

  • FP is the number of Function Points
  • FC is the weighted function count
  • VAF is the Value Adjustment Factor, which depends on the software characteristics

In order to calculate the Function Points, every component is classified into three categories according to its complexity (low/medium/high). A different weight factor is assigned to every component type and category. The following weights are defined for every component and complexity:

  • Number of external inputs: 3-4-6.
  • Number of external outputs: 4-5-7.
  • Number of external inquiries: 3-4-6.
  • Number of logical internal files: 7-10-15.
  • Number of external interface files: 5-7-10.

When the number of components (classified by complexity) is available, given the previous weighting factors, the Function Counts (FCs) can be calculated based on the following formula:

FC = Σ (i=1..3) Σ (j=1..5) w_ij × x_ij

Where w_ij are the weighting factors and x_ij the number of occurrences of each component in the software; i denotes the complexity level and j the component type. The following table shows how this sum can easily be calculated (a small calculation sketch follows the table).

 
FC Calculation
Type Low Complexity Mid Complexity High Complexity Total
EI _ x 3 + _ x 4 + _ x 6 + =
EO _ x 4 + _ x 5 + _ x 7 + =
EQ _ x 3 + _ x 4 + _ x 6 + =
ILF _ x 7 + _ x 10 + _ x 15 + =
EIF _ x 5 + _ x 7 + _ x 10 + =
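As an illustration, the following C sketch computes the unadjusted function count using the weights from the table above. The component counts are hypothetical placeholders, not taken from any real project.

    /* Sketch: unadjusted Function Count using the standard weights. */
    #include <stdio.h>

    int main(void) {
        /* Rows: EI, EO, EQ, ILF, EIF; columns: low, medium, high complexity */
        int weights[5][3] = {
            {3, 4, 6},    /* External Inputs */
            {4, 5, 7},    /* External Outputs */
            {3, 4, 6},    /* External Inquiries */
            {7, 10, 15},  /* Internal Logical Files */
            {5, 7, 10}    /* External Interface Files */
        };
        /* Hypothetical counts of components per type and complexity */
        int counts[5][3] = {
            {6, 2, 1},   /* EIs */
            {4, 1, 0},   /* EOs */
            {3, 0, 0},   /* EQs */
            {2, 1, 0},   /* ILFs */
            {1, 0, 0}    /* EIFs */
        };

        int fc = 0;
        for (int j = 0; j < 5; j++)        /* component type */
            for (int i = 0; i < 3; i++)    /* complexity */
                fc += weights[j][i] * counts[j][i];

        printf("Function Count (FC) = %d\n", fc);  /* prints 91 for these counts */
        return 0;
    }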

The complexity classification of each component is based on a set of standards that define complexity in terms of objective guidelines. For instance, for the external output component, if the number of data element types is 20 or more and the number of file types referenced is 2 or more, then complexity is high. If the number of data element types is 5 or fewer and the number of file types referenced is 2 or 3, then complexity is low. The following tables provide the standard categorization where:

  • DETs are equivalent to non-repeated fields or attributes.
  • RETs are equivalent to mandatory or optional sub-groups.
  • FTRs are equivalent to ILFs or EIFs referenced by that transaction.

ILF and EIF Complexity Matrix
RETs 1-19 DETs 20-50 DETs 51+ DETs
1 Low Low Medium
2-5 Low Medium High
6+ Medium High High

EI Complexity Matrix
FTRs 1-4 DETs 5-15 DETs 16+ DETs
0-1 Low Low Medium
2 Low Medium High
3+ Medium High High

EO and EQ Complexity Matrix
FTRs 1-5 DETs 6-19 DETs 20+ DETs
0-1 Low Low Medium
2-3 Low Medium High
4+ Medium High High

In order to calculate the Value Adjustment Factor (VAF), 14 characteristics of the software system must be scored (on a scale from 0 to 5) in terms of their effect on the software. The list of characteristics is:

  1. Data communications: How many communication facilities are there to aid in the transfer or exchange of information with the application or system?
  2. Distributed data processing: How are distributed data and processing functions handled?
  3. Performance: Did the user require response time or throughput?
  4. Heavily used configuration: How heavily used is the current hardware platform where the application will be executed?
  5. Transaction rate: How frequently are transactions executed daily, weekly, monthly, etc.?
  6. On-Line data entry: What percentage of the information is entered On-Line?
  7. End-user efficiency: Was the application designed for end-user efficiency?
  8. On-Line update: How many ILFs are updated by On-Line transactions?
  9. Complex processing: Does the application have extensive logical or mathematical processing?
  10. Reusability: Was the application developed to meet one or many users' needs?
  11. Installation ease: How difficult is conversion and installation?
  12. Operational ease: How effective and/or automated are start-up, back-up, and recovery procedures?
  13. Multiple sites: Was the application specifically designed, developed, and supported to be installed at multiple sites for multiple organizations?
  14. Facilitate change: Was the application specifically designed, developed, and supported to facilitate change?

Once all these characteristics are assessed, they are summed, based on the following formula, to arrive at the value adjustment factor (VAF):

VAF = 0.65 + 0.01 × Σ (i=1..14) c_i

Where c_i is the score for general system characteristic i.
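The following C sketch continues the previous one and applies the VAF formula. The 14 characteristic scores are hypothetical placeholders (each on the 0-5 scale), and the function count value is the one obtained in the previous sketch.

    /* Sketch: Value Adjustment Factor and adjusted Function Points,
       following VAF = 0.65 + 0.01 * sum(c_i). */
    #include <stdio.h>

    int main(void) {
        int scores[14] = {3, 2, 4, 1, 3, 5, 3, 2, 4, 0, 2, 3, 1, 2};  /* illustrative */
        double fc = 91.0;  /* weighted function count, e.g. from the previous sketch */

        int sum = 0;
        for (int i = 0; i < 14; i++)
            sum += scores[i];

        double vaf = 0.65 + 0.01 * sum;   /* ranges from 0.65 to 1.35 */
        double fp = fc * vaf;

        printf("Sum of characteristics = %d, VAF = %.2f, FP = %.1f\n", sum, vaf, fp);
        return 0;
    }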

Over the years the function point metric has gained acceptance as a key productivity measure from a practical point of view. However, the meaning of function points, the derivation algorithm and its rationale may need more research and more theoretical groundwork. Furthermore, function point counting can be time-consuming and expensive, and accurate counting requires certified function point specialists.

Customer satisfaction metrics

Customer Problem Metrics

Another product quality metric widely used in the software industry measures the problems customers encounter when using the product.

For the defect density metric (section 1.2.1.2), the numerator was the number of valid defects. However, from the customers' standpoint, all problems they encounter while using the software product, not just the valid defects, are problems with the software. Software problems suffered by end-users that are not valid defects may be:

  • Usability problems.
  • Unclear documentation or information.
  • Duplicates of valid defects (defects that were reported by other customers and fixes were available but the current customers did not know of them).
  • User errors.

These so-called non-defect-oriented problems, together with the defect problems, constitute the total problem space of the software from the customers’ perspective.

The problems metric is usually expressed in terms of problems per user month (PUM):

PUM = Total number of problems that customers reported during a period of time / Total number of license-months of the software during that period

Where the total number of license-months is the number of months all the users have been using the software, and may be calculated by multiplying the number of installed licenses of the software by the number of months in the calculation period.
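As a small illustration, the following C sketch computes PUM for an invented reporting period; the numbers of problems, licenses and months are arbitrary placeholder values.

    /* Sketch: problems per user month (PUM) for an illustrative reporting period. */
    #include <stdio.h>

    int main(void) {
        int problems_reported = 480;     /* all problems reported by customers in the period */
        int installed_licenses = 2000;   /* licenses in use during the period */
        int months_in_period = 3;

        double license_months = (double)installed_licenses * months_in_period;
        double pum = problems_reported / license_months;

        printf("PUM = %.3f problems per user-month\n", pum);
        return 0;
    }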

PUM is usually calculated for each month after the software is released to the market, and also as monthly averages by year. Note that the denominator is the number of license-months instead of thousands of lines of code or function points, and the numerator is all the problems customers encountered. Basically, whereas defect density focuses on the number of real problems with regard to the software size, this metric relates detected problems to software usage.

There are different approaches to minimize PUM:

  • Improve the development process and reduce the product defects.
  • Reduce the non-defect-oriented problems by improving all aspects of the products (such as usability, documentation), customer education, and support.
  • Increase sales of the product (the number of installed licenses).

The first two approaches reduce the numerator of the PUM metric, and the third increases the denominator. The result of any of these actions is a lower PUM value. All three approaches make good sense as quality improvement and business goals for any organization. The PUM metric, therefore, is a good metric. The only minor drawback is that when the business is in excellent condition and the number of software licenses is rapidly increasing, the PUM metric will look extraordinarily good (low value) and, hence, the need to keep reducing the number of customers' problems (the numerator of the metric) may be undermined. Therefore, the total number of customer problems should also be monitored, and aggressive year-to-year or release-to-release improvement goals set as the number of installed licenses increases. However, unlike valid code defects, customer problems are not totally under the control of the software development organization. Therefore, it may not be feasible to set a goal requiring that the total number of customer problems not increase from release to release, especially when software sales are increasing.

The key points of the defect rate metric and the customer problems metric are briefly summarized in the following table. The two metrics represent two perspectives of product quality. For each metric the numerator and denominator match each other well: Defects relate to source instructions or the number of function points, and problems relate to usage of the product. If the numerator and denominator are mixed up, poor metrics will result. Such metrics could be counterproductive to an organization’s quality improvement effort because they will cause confusion and wasted resources.

DDR vs PUM
Defect Density Rate PUM
Numerator Valid and Unique defects All customer problems
Denominator Size of Product Usage of Product
Measurement Producer Perspective Consumer Perspective
Scope Intrinsic Product Quality Intrinsic Product Quality + Other

The customer problems metric can be regarded as an intermediate measurement between defects measurement and customer satisfaction. To reduce customer problems, one has to reduce the functional defects in the products and, in addition, improve other factors (usability, documentation, problem rediscovery, etc.)

Customer Satisfaction Metrics

Customer satisfaction is often measured through customer survey data in which the users are asked to rate the software, or certain characteristics of the software, on a scale.

Based on the survey result data, several metrics with slight variations can be constructed and used, depending on the purpose of analysis. For example:

  • Percent of completely satisfied customers.
  • Percent of satisfied customers (satisfied and completely satisfied)
  • Percent of dissatisfied customers (dissatisfied and completely dissatisfied)
  • Percent of nonsatisfied (neutral, dissatisfied, and completely dissatisfied)

In addition to forming percentages for various satisfaction or dissatisfaction categories, the net satisfaction index (NSI) is also used to facilitate comparisons across products. NSI ranges from 0% (all customers are completely dissatisfied) to 100% (all customers are completely satisfied). If all customers are satisfied (but not completely satisfied), NSI will have a value of 75%. This weighting approach, however, may mask the satisfaction profile of one's customer set. For example, if half of the customers are completely satisfied and half are neutral, NSI's value is also 75%, which is equivalent to the scenario in which all customers are satisfied. If satisfaction is a good indicator of product loyalty, then half completely satisfied and half neutral is certainly less positive than all satisfied.
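The text does not give the exact NSI weights, so the following C sketch assumes the common 100/75/50/25/0 weighting for the five survey categories. This assumption is consistent with the two examples above (all satisfied gives 75%, and half completely satisfied plus half neutral also gives 75%); a given organization may use different weights. The customer counts are invented.

    /* Sketch: a net satisfaction index with an assumed 100/75/50/25/0 weighting. */
    #include <stdio.h>

    int main(void) {
        /* counts: completely satisfied, satisfied, neutral, dissatisfied, completely dissatisfied */
        int counts[5]     = {50, 0, 50, 0, 0};        /* half completely satisfied, half neutral */
        double weights[5] = {1.00, 0.75, 0.50, 0.25, 0.00};

        int total = 0;
        double weighted = 0.0;
        for (int i = 0; i < 5; i++) {
            total += counts[i];
            weighted += weights[i] * counts[i];
        }

        printf("NSI = %.1f%%\n", 100.0 * weighted / total);  /* prints 75.0 for this profile */
        return 0;
    }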

In-Process Quality Metrics

Defect Density After a Development Cycle

The defect rate during a development cycle is usually positively correlated with the defect rate in the following phases. For instance, the defect rate after integration testing is usually positively correlated with the defect rate in the field. Higher defect rates found during a phase are an indicator that the software experienced higher error injection during that phase, unless the higher testing defect rate is due to an extraordinary testing effort (for example, additional testing or a new testing approach that was deemed more effective in detecting defects). The rationale for the positive correlation is simple: software defect density never follows a uniform distribution. If a piece of code or a product has more testing defects, it is either the result of more effective testing or of more latent defects in the code. Myers suggested a counterintuitive principle: the more defects found during testing, the more defects will be found later.

This simple metric of defects per KLOC or function point is especially useful to monitor subsequent releases of a product in the same development organization. The development team or the project manager can use the following scenarios to judge the release quality:

  1. If the defect rate during testing is the same or lower than that of the previous release (or a similar product), then ask: Did the testing for the current release deteriorate?
    • If the answer is no, the quality perspective is positive.
    • If the answer is yes, you need to do extra testing (e.g., add test cases to increase coverage, blitz test, customer testing, stress testing, etc.).
  2. If the defect rate during testing is substantially higher than that of the previous release (or a similar product), then ask: Did we plan for and actually improve testing effectiveness?
    • If the answer is no, the quality perspective is negative. Ironically, the only remedial approach that can be taken at this stage of the life cycle is to do more testing, which will yield even higher defect rates.
    • If the answer is yes, then the quality perspective is the same or positive.

This concept is shown graphically in the next diagram:

Understanding Evolution of Defect Density

Defect Arrival Pattern

Overall defect density during testing is a summary indicator. The pattern of defect arrivals (or for that matter, times between failures) gives more information. Even with the same overall defect rate during testing, different patterns of defect arrivals indicate different quality levels in the field.

The next figure shows two contrasting patterns for both the defect arrival rate and the cumulative defect rate. Data were plotted from 44 weeks before code-freeze until the week prior to code-freeze. In both projects the overall defect count is the same; however, the forecast of quality in the field is quite different. In the first project, during the last weeks, the number of defects reported every week decreases and tends to zero. The second project, represented by the charts on the right side, follows the opposite pattern. This indicates that testing started late, that the test suite was not sufficient, and that testing ended prematurely. It is extremely likely that this project, if released as is, would lead to even more defects in the field.

Two Contrasting Arrival Patterns during Testing

The objective is always to look for defect arrivals that stabilize at a very low level, or times between failures that are far apart, before ending the testing effort and releasing the software to the field. Such declining patterns of defect arrival during testing are indeed the basic assumption of many software reliability models. The time unit for observing the arrival pattern is usually weeks and occasionally months. For reliability models that require execution time data, the time interval is in units of CPU time.

When we talk about the defect arrival pattern, there are actually three slightly different metrics, which should be looked at simultaneously:

  • The defect arrivals (defects reported) during the testing phase by time interval (e.g. week). These are the raw number of arrivals, not all of which are valid defects.
  • The pattern of valid defect arrivals when problem determination is done on the reported problems. This is the true defect pattern.
  • The pattern of the defect backlog over time. This metric is needed because development organizations cannot investigate and fix all reported problems immediately. This metric is a workload statement as well as a quality statement. If the defect backlog is large at the end of the development cycle and a lot of fixes have yet to be integrated into the system, the stability of the system (and hence its quality) will be affected. Retesting (regression testing) is needed to ensure that the targeted product quality levels are reached.

Defect Removal Metrics

We have just introduced an interesting concept: detecting a defect doesn't mean it is going to be automatically removed. This can happen for many different reasons:

  • Lack of time: Fixing the defect properly requires a lot of time (finding the root cause, fixing it, ensuring no other functionality breaks because of the fix, etc.)
  • Lack of bandwidth: The team is focused on implementing features, so no immediate attention is paid to fixing that defect.
  • Lack of importance: The defect is minor or not important enough to become a priority for the team.
  • Lack of knowledge: The defect is well known but its root cause is not known yet (remember that a defect is just a symptom).

A metric intended to distinguish between defect detection and defect removal is the Defect Removal Pattern, which describes the evolution of the number of defects removed over time. This metric can be calculated per unit of time or per phase of the project (e.g. iteration).

Some related metrics are the "average time to detect a defect" (which provides an indication of how good the process is at detecting defects) and the "average time to fix a defect", which is an indicator of how good the process is at fixing defects once they have been detected. These metrics are key, as we should remember that the later a defect is detected and fixed, the more expensive it is.

Defect Backlog / Burndown

A burndown chart is a graphical representation of the amount of work left to be done versus time. It is typically used in Agile methodologies to check sprint evolution, measure team velocity, etc. In those cases, the amount of work is always decreasing, as the tasks for the sprint are identified before the sprint starts.

Typical Burndown Chart

An equivalent concept is the Defect Backlog / Burndown. In this case the graph does not represent the amount of work to be done but the number of unfixed defects. The curve goes down if no more defects are found and the remaining ones are fixed, or goes up if the rate of detected defects outpaces the fix rate.

Defect Burndown Chart

In the example above, we can see that the red line shows the number of cumulative defects (fixed or unfixed), the green one shows the number of fixed ones, whereas the black line shows the "delta" between detected and fixed defects (i.e. the size of the defect backlog).
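The following C sketch shows how the backlog ("delta") line of such a chart can be derived from weekly detected and fixed counts. The weekly numbers are invented for illustration.

    /* Sketch: deriving the defect backlog curve from weekly detected and fixed counts. */
    #include <stdio.h>

    int main(void) {
        int detected[] = {12, 9, 15, 7, 5, 3};  /* new defects found per week (illustrative) */
        int fixed[]    = { 4, 8, 10, 9, 6, 4};  /* defects fixed per week (illustrative) */
        int weeks = 6;

        int cum_detected = 0, cum_fixed = 0;
        for (int w = 0; w < weeks; w++) {
            cum_detected += detected[w];
            cum_fixed += fixed[w];
            int backlog = cum_detected - cum_fixed;  /* the "delta" (black) line */
            printf("Week %d: detected=%d fixed=%d backlog=%d\n",
                   w + 1, cum_detected, cum_fixed, backlog);
        }
        return 0;
    }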

Ideally, the defect backlog count should be zero before releasing a product, but in big products that is nearly impossible. This requires product managers to play with the four key aspects of any software project: resources, scope, time and quality. In particular, the following actions can be taken:

  • Increase the number of people fixing defects: add more resources to work on fixing defects. However, we should be extremely cautious about this, as adding resources late to a project can lead to additional delays.
  • Reduce the product scope: don't allow more features to be added after a given point in time (e.g. no more features one month before the target release date). This leaves more resources focused on fixing defects and fewer bugs injected by new features landing on the project.
  • Postpone the release date: ask for more time, so more time can be spent fixing defects.
  • Reduce the target quality: pay attention only to the critical bugs (e.g. blocker defects) instead of every bug.

One approach that is sometimes used to drive the defect backlog to zero is holding triage meetings to determine which defects are blockers and which are not. The idea is that the closer the release date is, the more difficult it is to consider a bug a blocker for the release. However, having a clear set of guidelines about what is a blocker and what is not is also very helpful; for instance, Mozilla used a particular one for FirefoxOS.

Defect Removal Effectiveness

Defect removal effectiveness can be defined as follows:

DRE = (Defects removed in a development phase / Defects latent in the product) × 100%

It provides a measure of the percentage of defects removed in one phase with respect to the overall number of defects present in the code when entering that phase. As the total number of latent defects in the product at any given phase is not known, the denominator of the metric can only be approximated, which is usually done as:

Defects found in the phase + Defects found later

The metric can be calculated for the entire development process, for the front end (before code integration), in which case it is called early defect removal effectiveness, and for each phase.

The higher the value of the metric, the more effective the development process and the fewer the defects escape to the next phase or to the field. This metric is a key concept of the defect removal model for software development.

For instance, if during the development of a product 80 bugs were found and fixed, but there were still 20 defects latent that were found by the customers when the product hit the field, the DRE would be:

DRE = 80 / (80 + 20) = 80%

On average, 80 out of every 100 defects were removed during development.

Another view of this metric is depicted in the following Figure.

Defect Removal and Injection

It shows how defects are injected, detected and repaired during a project phase. Based on it, another way to calculate the defect removal effectiveness can be derived:

DRE = Defects removed (at the step) / (Defects existing on step entry + Defects injected during the step) × 100%

The following table is an example providing data about when errors are injected and detected in a software project.

Origin of the defect (columns) vs. iteration in which the defect was found (rows)
                Iteration 1   Iteration 2   Iteration 3   Iteration 4   TOTAL REMOVED
Iteration 1     5             -             -             -             5
Iteration 2     10            15            -             -             25
Iteration 3     5             5             10            -             20
Iteration 4     5             5             0             5             15
Total Injected  25            25            10            5             65

With that data, the DRE could be calculated for different phases of the software development process. Some examples are shown below:

              During the first iteration the total number of defects injected was 25.

              The meaning of the value "5" in the intersection of the "Iteration 1"
              row and column is that during that phase only 5 defects were
              removed.

              The meaning of the value 10 in the intersection of row "Iteration
              2" and column "Iteration 1" is that during Iteration 2, 10
              defects that had originated in Iteration 1 were removed.

              Equally, the meaning of the value 15 in the intersection of row
              "Iteration 2" and column "Iteration 2" is that during Iteration
              2, 15 defects that had also originated in Iteration 2 were
              removed.

              We can calculate the DRE of all the different iterations quite
              easily (a small calculation sketch follows the example):

              * Iteration 1: DRE = 5/25 = 20%
              * Iteration 2: DRE = (10+15) / [(25+25) - 5] = 25/45 ≈ 56%
              * Iteration 3: DRE = (5+5+10) / [(25+25+10) - (5+25)] = 20/30 ≈ 67%
              * Iteration 4: DRE = (5+5+0+5) / [(25+25+10+5) - (5+25+20)] = 15/15 = 100%
            
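The following C sketch computes the DRE of each iteration directly from the injection/removal matrix above, reproducing the percentages listed in the example.

    /* Sketch: DRE per iteration from the injection/removal matrix.
       removed[f][o] = defects found in iteration f that were injected in iteration o. */
    #include <stdio.h>

    int main(void) {
        int removed[4][4] = {
            { 5,  0,  0, 0},   /* found in iteration 1 */
            {10, 15,  0, 0},   /* found in iteration 2 */
            { 5,  5, 10, 0},   /* found in iteration 3 */
            { 5,  5,  0, 5}    /* found in iteration 4 */
        };
        int injected[4] = {25, 25, 10, 5};  /* total defects injected per iteration */

        int removed_so_far = 0, injected_so_far = 0;
        for (int it = 0; it < 4; it++) {
            int removed_now = 0;
            for (int o = 0; o <= it; o++)
                removed_now += removed[it][o];

            injected_so_far += injected[it];
            /* defects present during this iteration = all defects injected so far
               minus those already removed in earlier iterations */
            int present = injected_so_far - removed_so_far;
            printf("Iteration %d: DRE = %d/%d = %.0f%%\n",
                   it + 1, removed_now, present, 100.0 * removed_now / present);
            removed_so_far += removed_now;
        }
        return 0;
    }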

The following table describes for each of the software development process phases the most important sources of defect injection and removal.

Development Phase   | Defect Injection                                             | Defect Removal
Requirements        | Requirements gathering process and specification development | Requirements analysis and review
High Level Design   | Design                                                       | High level design inspections
Low Level Design    | Design                                                       | Low level design inspections
Code Implementation | Coding                                                       | Code inspections, Testing
Integration / Build | Integration and build process                                | Build verification testing
Unit Test           | Bad fixes                                                    | Testing itself
Component Test      | Bad fixes                                                    | Testing itself
System Test         | Bad fixes                                                    | Testing itself

Software Metrics VS. Quality Metrics

So far, this chapter has addressed metrics that are directly related to measuring software quality. However, it is also critical to consider that those metrics usually have a direct relationship with some software characteristics that are not, by themselves, quality measurements.

Some intrinsic characteristics of the software that usually affect the software quality (either internal or external) are:

Different metrics exist to take all those aspects into account in early phases of the software development process and to take preventive measures. E.g. if the code is extremely complex, a refactoring of the software should be done in order to minimize the likelihood of defects.

Some examples of metrics used (in Object Oriented Programming) are:

Software Configuration Management

What is SCM?

Quality Assurance is only one part of the activities that are used to improve software quality. However, QA by itself is not enough, as it does not define how the software is managed, for instance:

SCM could be defined as a framework for managing the evolution of software throughout all the stages of Software Development Process.

There are multiple definitions for SCM and in some cases the SCM acronym is used with different meanings (Software/Source Code Management, Software/Source Code Change Control Management, Software/Source Configuration Management...). Roger Pressman states [[SOFTWARE-ENGINEER-PRACTICIONER]] that SCM is a "set of activities designed to control change by identifying the work products that are likely to change, establishing relationships among them, defining mechanisms for managing different versions of these work products, controlling the changes imposed, and auditing and reporting on the changes made."

In summary, SCM is a set of activities intended to guarantee:

Why SCM?

When used effectively during a product's whole life cycle, SCM identifies software items to be developed, avoids chaos when changes to software occur, provides needed information about the state of development, and assists the audit of both the software and the SCM processes. Therefore, its purposes are to support software development and to achieve better software quality. Additionally, a good SCM system should also help to reduce (or at least control) costs and effort involved in making changes to a system.

Key SCM Activities

IEEE's traditional definition of SCM (IEEE Std. 828-1990) included four key activities: configuration identification, configuration control, configuration status accounting and configuration audits. However, a successful implementation of SCM also requires careful planning and good release management and processing. The next figure represents all these activities graphically:

SCM Activities

The following figure provides a breakdown of all the SCM activities into more granular topics.

SCM Activities breakdown

Management and Planning

A successful SCM implementation requires careful planning and management. This, in turn, requires an understanding of the organizational context for, and the constraints placed on, the design and implementation of the SCM process.

Some aspects that should be decided during this activity are:

  • The types of documents to be managed and a document-naming scheme.
  • Who takes responsibility for the CM procedures and creation of baselines?
  • Policies for change control and version management.
  • Tools to be used and process linked to their usage.

Configuration Identification

The software configuration identification activity identifies items to be controlled, establishes identification schemes for the items and their versions, and establishes the tools and techniques to be used in acquiring and managing controlled items. These activities provide the basis for the other SCM activities.

Configuration Item: A configuration item is any possible part of the development or delivery of a system or product that it is necessary to identify, produce, store, use and change individually. Many people associate configuration items with source code files, but configuration items are not limited to that; many other items can be identified and managed, such as:

  • System data files
  • System build files and scripts
  • Requirements, Interface, Design specifications
  • Test plans, procedures, data sets and results
  • User documentation
  • Compilers, Linkers, Debuggers
  • Shell scripts
  • Other related support tools

For each configuration item, additional information apart from the item itself is controlled by the SCM system. As it is data about data, it is called metadata. Every configuration item must have a unique identification, sometimes also called a label. The metadata may include information such as:

  • Name
  • Version
  • Status
  • Date
  • Location
  • ...

A first step in controlling change is to identify the software items to be controlled. This involves understanding the software configuration within the context of the system configuration, selecting software configuration items, developing a strategy for labelling software items and describing their relationships, and identifying the baselines to be used.

Software Configuration: A software configuration is the set of functional and physical characteristics of software as set forth in the technical documentation or achieved in a product.

Selecting Configuration Items: It is an important process in which a balance must be achieved between providing adequate visibility for project control purposes and providing a manageable number of controlled items. The items of a configuration should include all the items that are part of a given software release.

Defining relationships and interfaces between the various configuration items is key as it also affects other SCM activities such as software building or assessing the impact of suggested changes. The identification or labelling scheme used should support the need to evolve software items and their relationships (e.g. configuration item X requires version A of configuration item Y).

Identifying the baselines is another critical task of SCM Identification. A software baseline is a set of software configuration items formally designated and fixed at a specific time during the software life cycle. The term is also used to refer to a particular version of a software configuration item that has been agreed on. In either case, the baseline can only be changed through formal change control procedures. A baseline, together with all approved changes to the baseline, represents the current approved configuration.

Configuration Change Control

The software is subject to continuous changes that are coming from different sources:

  • Users that have new needs
  • Developers that identify issues on the software
  • Market forces that identify new business needs and opportunities

Change Control takes care of keeping track of these changes and ensures that they are implemented in a controlled manner.

The most important activity from a Change Control point of view is the definition of how changes are made:

  • Can any developer change any configuration items?
  • Do developers need to raise an issue or a Change Request before changing a configuration item?
  • Which configurations or baselines can developers modify? E.g. some baselines should be read-only (tags), some others may be modifiable only after a Change Request has been made (branches) and some others may be modifiable with no restriction.
  • Do changes need a third-party approval before they are added to the SCM system?
  • How are other developers notified about changes?
  • What is the link to the Issue Tracker system?
  • Is there any need to test anything or build the system before accepting a change?

In short, the key thing is having a clear workflow (you can find an example in the FirefoxOS flow). Once the process for making changes is clear, it is also important to specify how the revision history of configuration items is going to be kept and how other developers are going to be notified about those changes:

  • Maintaining baselines
  • Processing changes
  • Developing change report forms
  • Controlling release of the product

Configuration Status Accounting

The main target of configuration status accounting is the recording and reporting of the information needed for effective management of the software configuration.

The information that should be available is diverse:

  • Which baselines are available.
  • Which configuration is the approved one.
  • Which issues have been raised for every configuration.
  • What the status of all the issues and changes is.

In order to provide and control all this information, good tool support is needed. This could be part of the Configuration Item Management system or another independent tool integrated with it.

Reported information can be used by various organizational and project elements, including the development team, the maintenance team, project management, and software quality activities. Reporting can take the form of ad hoc queries to answer specific questions or the periodic production of predesigned reports. Some information produced by the status accounting activity during the course of the life cycle might become quality assurance records.

In addition to reporting the current status of the configuration, the information obtained by this system can serve as a basis for various measurements of interest to management, development, and SCM. Examples include the number of change requests per configuration item and the average time needed to implement a change request, defect arrival pattern per release/component...

Configuration Auditing

The purpose of configuration audits is to ensure that the software product has been built according to specified requirements (Functional Configuration Audit, FCA), to determine whether all the items identified as a part of CI are present in the product baseline (Physical Configuration Audit, PCA), and whether defined SCM activities are being properly applied and controlled (SCM system audit or in-process audit). A representative from management, the QA department, or the customer usually performs such audits. The auditor should have competent knowledge of both SCM activities and the project.

The auditor should check that the product is complete and consistent (e.g. "Are the correct versions of all files used in this release?"), that no outstanding issues exist (e.g. "There are no critical defects or CRs") and that the product has passed all the required tests to ensure its quality.

The output of the audit should specify whether the product's performance requirements have been achieved by the product design and the product design has been accurately documented in the configuration documentation.

In order to properly perform this activity it is important to:

  • Define the audit schedule and procedures.
  • Identify who will perform the audits.
  • Do the audits on the established baselines.
  • Generate audit reports.

Release Build, Management and Delivery

The term "release" is used to refer to a software configuration that is distributed outside of the development team. This includes internal releases as well as distribution to end-users. When different versions of software are available for different platform configurations it is frequently necessary to create multiple releases for delivery.

Building the release:

In order to release a software product, the configuration items must be combined, packaged with the right configuration and in most of the cases built into an executable program that can be installed by the customers. Build instructions ensure that the proper build steps are taken and in the correct sequence. In addition to building software for new releases, it is usually also necessary for SCM to have the capability to reproduce previous releases for recovery, testing, maintenance, or additional release purposes.

Software is built using particular versions of supporting tools, such as compilers. It might be necessary to rebuild an exact copy of a previously built software configuration item. In this case, the supporting tools and associated build instructions need to be under SCM control to ensure the availability of the correct versions of the tools (i.e. not only does source code evolve, but also the tools we use).

A tool capability is useful for selecting the correct versions of software items for a given target environment and for automating the process of building the software from the selected versions and appropriate configuration data. For large projects with parallel development or distributed development environments, this tool capability is necessary. Most software engineering environments provide this capability.

Release Management:

Software release management encompasses the identification, packaging, and delivery of the elements of a product, for example, executable program, documentation, release notes, and configuration data.

Given that product changes can occur on a continuing basis, one concern for release management is determining when to issue a release. Some aspects to consider when taking such a decision are the severity of the problems addressed by the release and the measurements of the fault densities of prior releases.

The packaging task must identify which product items are to be delivered, and then select the correct variants of those items, given the intended application of the product. The information documenting the physical contents of a release is known as a version description document. The release notes typically describe new capabilities, known problems, and platform requirements necessary for proper product operation. The package to be released also contains installation or upgrading instructions. The latter can be complicated because some current users might have versions that are several releases old.

Finally, in some cases, the release management activity might need to track the distribution of the product to various customers or target systems. An example would be a case where the supplier was required to notify a customer of newly reported problems. A tool capability is needed for supporting these release management functions. It is useful to have a connection with the tool capability supporting the issue tracker in order to map release contents to the issues that have been received. This tool capability might also maintain information on various target platforms and on various customer environments.

Summary

SCM Activities Relationships

SCM In Practice

Introduction

This chapter provides a set of best practices or patterns that should be used in SCM. There are multiple tools that can be used for SCM. Some of them focus on configuration identification and change control, others pay special attention to auditing and accounting, and others are focused on the build and release part. In most cases different tools are required, and what is important is that all the tools are properly integrated. For instance, a typical situation is using one tool for managing the source code (e.g. Subversion or git), another one for keeping track of the issues, defects or releases (e.g. Redmine or Bugzilla), another one for Agile management (e.g. Trello) and maybe another one for Continuous Integration (e.g. Travis). In such a multi-tool environment it is important to ensure that the changes to the configuration items can be linked to the issues and releases in the issue tracker and the agile management tool.

It is important to stress that there are multiple paradigms for managing the source code, the most important distinction being whether the system is centralized or distributed. Linus Torvalds (the creator of git) gave an interesting talk at a Google Tech Talk event in which he compared both approaches [[LINUS-SCM-GOOGLE]].

The patterns described in this chapter try to be generic enough so that they can be applied to both centralized and distributed systems, although some of them may only apply to one of the two. It is also worth noting that a distributed system can usually be configured to work in a centralized way. Finally, during the last years distributed systems have proliferated and they seem to have become the de-facto standard for SCM.

Configuration Identification

Baselines

In the previous chapter the formal definition of a baseline (according to IEEE) was provided. From a more "practical" point of view, a baseline is a consistent set of configuration items (creating one is sometimes also called tagging or labelling). A baseline is a reference basis for evolution and releasing.

The frequency of baseline releases depends a lot on the software development methodology that is used:

  • Waterfall: Consists of performing the development process in a single pass. Simplistically: determine user needs, define requirements, design the system, implement the system, test, fix, and deliver.
  • Incremental: The incremental strategy determines user needs and defines the system requirements, then performs the rest of the development in a sequence of builds. The first build incorporates part of the planned features and subsequent builds are released until the system is complete.
  • Evolutionary: Similar to the incremental approach, but acknowledges that the end-user needs may not be fully understood and that all the requirements may not be identified before starting the development. User needs and requirements are partially defined at the beginning and are further refined in every build.

Obviously, working in an evolutionary manner requires more frequent baseline releases than a waterfall model. Hence, although having an easy way to release is valuable in general, it is even more important in agile approaches.

Repositories

A repository is a system that stores the different versions of all the configuration items. The repository remembers every change ever written to it: every change to every configuration item as well as changes to the repository structure (such as the addition, deletion and rearrangement of files and directories).

Depending on the type of approach (centralized or distributed) there may be a central repository that is considered the master copy of the project.

  • In centralized systems, such as SVN, there is usually a single repository, the central one. All the contributors to the project have working copies of that repository but they do not usually have their own repositories, as all the work is done in the central repository.
  • In distributed systems, every contributor to the project has a repository he/she works with. Contributors' repositories are continuously synchronized (via Pulls and Pull Requests) by the users. However, in most cases there is a central repository, usually called upstream. The upstream could be the repository of the organization or the repository of an individual with a very high reputation. Sometimes, small projects start from an individual repository and, when they grow, they end up in an organization repository.

A workspace (or working copy) is a copy of the repository that developers have on their machines and use to progress the software development. The changes that developers make in their working copies are not available to other developers until they have transferred the data to the repository. A working copy does not contain all the versions of the configuration items but just one; however, developers can retrieve from the repository any version of any configuration item they are interested in.

Synchronization

When a developer wants to create a working copy based on the content of the repository, he should perform a "checkout" or "clone" of the repository. A checkout is the operation that copies the configuration items of a repository to create a new working copy. The checkout operation can be requested for any version of the repository but, by default, it retrieves the latest one (a.k.a. HEAD). A clone does not only retrieve the content of the configuration items but also all their revisions, configuration information and branches.

When a developer makes some changes in his working copy that he wants to submit to the repository (so that other developers can use them), he should perform a "commit" operation (a.k.a. check-in). The commit operation allows developers to contribute new versions of one or multiple configuration items to the repository. In some systems (e.g. SVN) the commit is submitted directly to the repository. However, in distributed systems (e.g. Git) a commit is recorded locally and then needs to be sent to the remote repository. This can be achieved in different ways:

  • By "pushing" the commit from the working copy to the repository. This is mostly done for the repositories owned by the developer as it requires "push" permission.
  • By sending a "Pull Request" from one repository to another. This is useful when you don't have permission to push directly to a repository or when you want other developers to review your commits before merging them (accepting and including them in the repository).

Once a developer has a working copy he can request at any time to synchronize with the latest version available in another repository. In centralized systems, the synchronization will be performed with just one repository, the central one. However, in distributed systems, developers can (and usually do) synchronize with multiple repositories. For instance, a typical Git configuration has a remote named upstream pointing to the project's upstream repository and another one named "origin" that points to the developer's own repository. This operation is called "update" in centralized systems and "pull" in distributed ones. When this operation is requested, the configuration items that have been changed in the repository are updated in the developer's working copy.

An SCM system will never incorporate other people's changes (update), nor make your own changes available to others (commit), until you explicitly tell it to do so.
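
As an illustration of this workflow, the sketch below wraps the usual Git commands in a small Python script; the repository URLs, remote names and commit message are hypothetical, and a plain shell session would work equally well:

  # sync_sketch.py - minimal, illustrative synchronization workflow (Git assumed).
  # URLs, remote names and messages below are placeholders, not a prescribed setup.
  import subprocess

  def git(*args):
      subprocess.run(("git",) + args, check=True)   # fail loudly on errors

  # One-off setup: clone your own fork ("origin") and register the shared repository as "upstream".
  git("clone", "https://example.com/alice/project.git")
  git("-C", "project", "remote", "add", "upstream", "https://example.com/org/project.git")

  # Daily cycle: bring in other people's changes, record your own work, publish it.
  git("-C", "project", "pull", "upstream", "master")        # update / pull
  git("-C", "project", "commit", "-am", "Fix issue #123")   # commit (local in Git)
  git("-C", "project", "push", "origin", "master")          # make it available to others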

Revisions

Different systems have different approaches with regards to configuration item versioning and identification. In Subversion or Git, every time a commit is performed in the repository a new revision of the repository is created.

  • In Subversion, a revision ID is a number that identifies a version of a repository in a given moment of time and that increases every time a new commit is performed.
  • In Git, a revision ID is a hash string that is calculated after the commit is performed; hence two consecutive revisions do not have consecutive revision numbers.

As revisions are always linked to a commit, they are also called "commit IDs". In Git, as they are hashes, they are also referred to as "Hash IDs". For instance, the revision identifier of the first commit in the repository of these notes is 53a3797f7f406f15220955f5f6883cbae36e826f as you can see here.
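
To illustrate where these hash IDs come from, the snippet below reproduces, in a deliberately simplified way, how Git derives an object ID: it is the SHA-1 of a small header plus the object content, so the ID changes whenever the content changes. The tree, parent and author values are made up for the example.

  # hash_id_sketch.py - simplified illustration of how Git computes object/commit IDs.
  # A real commit object has more fields; only the hashing principle is shown here.
  import hashlib

  content = b"tree 9fce...\nparent 53a3...\nauthor Alice <alice@example.org>\n\nFix typo\n"
  header = b"commit %d\0" % len(content)              # "<type> <size>\0" header used by Git
  print(hashlib.sha1(header + content).hexdigest())   # 40-character revision / commit ID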

A commit may include changes to one or more configuration items; therefore, between two subsequent revisions, more than one item may differ.

For instance, this commit modifies one file and adds two new ones, yet it just adds a single new revision on top of the previous one.

It is important to stress that in modern SCM systems the configuration items are not identified individually but as part of a revision. This is an important change with regards to older systems such as CVS (Concurrent Versioning System). In order to identify a particular version of a configuration item, the revision in which that configuration item version was available should be referred to.

Branches and Tags

The master or trunk is the main line of development of the repository, that is, the place where the evolution of the software product should happen. However, a single line of development in the repository is not enough for most software products.

Branching: A non-software example. Suppose your job is to maintain a document for a division in your company, a handbook of some sort. One day a different division asks you for the same handbook, but with a few parts "tweaked" for them, since they do things slightly differently. What do you do in this situation? You do the obvious thing: you make a second copy of your document, and begin maintaining the two copies separately. As each department asks you to make small changes, you incorporate them into one copy or the other. You often want to make the same change to both copies. For example, if you discover a typo in the first copy, it's very likely that the same typo exists in the second copy. The two documents are almost the same, after all; they only differ in small, specific ways. Maintaining the two branches is an extra burden.

As you have seen, maintaining extra branches is expensive, hence before creating long-lived parallel branches you need to consider whether there are alternatives: configuration parameters, specific modules, different runtime behaviours...

When a developer wants to create another development line in the repository he creates a branch. A branch is a line of development that exists independently of another line but shares a common history if you look far enough back in time. A branch always begins life as a copy of something, and moves on from there, generating its own history. However, branches that started from a common point and diverged later on can eventually merge again.

In the software development process it is sometimes convenient to identify a particular version, release or baseline of the software. This is achieved by tags. A tag is a snapshot of the repository at a specific point in history. Typically people use this functionality to mark release points (v1.0, and so on). Tags are not intended to change at all. Different SCMs have different strategies for implementing tags, but most of them implement this feature as a specific branch that does not change over time.

Best Practices and Patterns

Tips for branching

Before Git became widespread, branches were used with a lot of care, since merging in other SCM systems such as SVN was very difficult. Merging is the process by which two configuration items are combined into a new one. Depending on the number of configuration items to be combined, the type of changes made to them, and the SCM system used, merging can be a very difficult operation.

Branches are created to save work by allowing developers to work on independent features in an isolated manner. However, that may sometimes end up costing extra time in a difficult merge. The reason why Git is so successful nowadays is that it has simplified the way merges are done and hence has enabled developers to create and work on separate branches.

However, easy merging does not mean branches should be used without care. For instance, in general, overcomplicated structures where branches are continuously created from branches other than master (an arborescent approach) should be avoided.

Branching works better when you integrate with the origin of the branch as quickly as possible.

Best Practice 1: Simplify the branching model.

Although branching is cheap in systems such as Git, that should not be an excuse for creating overly complex tree structures diverging from the master branch. Ask developers to branch from the master branch, which is the "home codeline" into which all development is merged, except in special circumstances. Branching always from master reduces merging and synchronization effort by requiring fewer transitive change propagations.

It is also important that the expected branches are planned in advance and that a branch diagram is used. Having a diagram is of huge help as it gives, at a glance, a clear understanding of the different branches available and the relationships among them. There are many tools for generating such a diagram automatically.

You can see below an example of such a diagram:

Branch Diagram Example

Best Practice 2: Create specific development branches for every feature you implement

As shown in the previous diagram (branches Story A and Story B), for every feature to be added or for every bug you fix you should create a separate branch so you can work in an isolated and independent manner.

Best Practice 3: Development branches should be short-lived.

More information about when a development branch should be merged will be provided in the following sections, but by having a look at the diagram it is easy to understand that the later we merge, the more difficult it will be, as the branches will have diverged more.

Best Practice 4: When development branches must live for a long time, relatively frequent intermediate merges should be done.

When you create a development branch and it is going to take a long time before you can merge your changes into the master branch, try to sync with master frequently so that your working branch does not diverge too much from master. The longer you wait, the more difficult the merge will be.

Best Practice 5: Branch Customer Releases.

When a new software version is released to users, the "usual" situation is that the team must work on at least two versions in parallel:

  • A new version with new features (that is developed in the master branch)
  • The released version, where typically only bugfixes should be added. This is usually done in a "release" branch. Please note that these bugfixes should also land on the master branch if that branch is also affected.

Due to that, when a version is released to customers a release branch should be created. In this way bugfixing can be done on the release branch without exposing the customer to new feature work in progress on the mainline.

The typical workflow for customer releases is:

  • A release branch (e.g. v1.0) is created based on the master branch when the team thinks the software is ready for release (say, a 1.0 release). At that moment the content and history of both branches (v1.0 and master) will be the same.
  • The team continues to work in both branches in parallel. Depending on the release strategy the product could be released at this moment to all the users, maybe to some early/beta users or maybe just for another testing round.
  • In any case, if bugs are discovered in either version, fixes are ported back and forth as necessary. Usually, as time passes, only critical fixes will be landed in the release branch, that eventually will be frozen as new releases supersede it.
  • Fixing bugs that affect both branches might be done using different strategies. A recommended one is to always land the code on the master branch and then uplift or cherry-pick the changes that affect the release to the release branch, as sketched below. Obviously, as time passes and branches diverge, there might be some bugs that affect only the release branch; in that case, the fix can be landed directly on the release branch.
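
As a concrete illustration of this workflow in Git, the fix lands on master first and is then cherry-picked onto the release branch. The branch name and commit hash below are hypothetical; a shell session would work just as well as this Python wrapper.

  # release_flow_sketch.py - illustrative only; branch names and hashes are placeholders.
  import subprocess

  def git(*args):
      subprocess.run(("git",) + args, check=True)

  git("checkout", "-b", "v1.0", "master")   # create the release branch at release time
  git("checkout", "master")
  # ... the bug fix is developed, reviewed and committed on master, say as commit abc1234 ...
  git("checkout", "v1.0")
  git("cherry-pick", "abc1234")             # uplift only that fix to the release branch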

Best Practice 6: Branch long-lived parallel efforts.

Long-lived parallel efforts that multiple people will be working on should be done in independent branches. Imagine you want to experiment with a new feature and you know that a lot of time will be needed before having something that can be merged into the master branch. In that case, it makes sense to create a specific branch (similar to the release branch) so that the team can work on that feature while others can check the progress.

Best Practice 7: Be always flexible, there may be some very strong reasons for breaking these "rules".

These are just a set of recommendations; there are different ways to work with branches and all of them are right and wrong at the same time, as it is impossible to have a perfect framework. For instance, some authors [[SUCCESSFUL-GIT-BRANCHING]] promote the idea of having an integration branch called develop with an infinite lifetime (like master), as shown in the following figure:

Use of integration branches

Merging

When working with multiple branches, the task of combining them into a single line of code (merging) is of paramount importance.

When the work in the two branches to merge has no overlapping configuration items (no configuration item has been modified in both), the merging task is easier. However, even though no conflicts should occur during the merge, that does not mean the result of the merge is going to be good enough. Let's have a look at a non-software example:

Imagine you are Dr. Frankenstein and you want to build a human being. You have a development team composed of two developers; in order to avoid problems when merging their contributions you ask one to develop the legs and the other to develop the arms. When both have finished their task the merging is done with no problem, i.e. 2 arms and 2 legs are assembled on the body. However, imagine what happens if the left leg is twice as long as the right leg: the merge worked OK but the result is a monster.

In conclusion, a merge without conflicts can also be a bad merge.

When some configuration items are modified in both branches, the merging task is not immediate, as manual intervention is required to resolve the conflicts that result from modifying the same file separately. A conflict in a merge is said to occur when two configuration items have been modified with divergent changes.

Best Practice 8: Developers making the changes should be the ones responsible to fix the conflicts.

They are the ones who know best the code they have modified, so the best way to prevent a Frankenstein from being created is to ask them to ensure the merge leads to a fully functional result.

Working Copies vs. Repository

Software is developed in teams because concurrent work is needed. Nonetheless, the more people in your team, the more potential for conflicting changes.

In order to minimize the number of conflicts and facilitate the work of the team it is important to encourage team members to:

  • Work on features that are as small as possible.
  • Create the Pull Request as soon as possible, but only if the changes work properly, to avoid others suffering chained problems.

But finding the right balance for this last issue (checking in stable code, but soon) is usually difficult.

Working from a highly tested stable line is not always an option when new features are being developed; otherwise the frequency of the commits would not be as high as needed. However, even if it is not highly tested, it is at least expected that the code retrieved from the repository has a reasonable quality. In order to reach a good trade-off it is important to require developers to perform simple procedures before submitting code to the codeline, such as a preliminary build and some level of testing.

The good trade-off is having a development line stable enough for the work it needs to do. Do not aim for a perfect active development line, but rather for a mainline that is usable and active enough for your needs.

An active development line will have frequent changes, some well-tested checkpoints that are guaranteed to be "good", and other points in the codeline that are likely to be good enough for someone to do development on the tip of the line.

Some aspects that should be considered by developers are:

  • Work in your development branch and test your changes on it.
  • Before creating a Pull Request run Regression Test to make sure that you have not broken anything.
  • Ask for a code review if needed and repeat the previous steps iteratively depending on the review feedback.
  • After the review has been positively completed, an Automated Integration Build should be done before accepting the Pull Request or right after accepting it.

Many of the concepts we have just described are closely related to the concept of Continuous Integration and will be explained in the next chapter.

Best Practice 9: Before pushing a contribution (Pull Request or Direct Push), ensure that the latest version of the repository is available in the working copy.

Best Practice 10: Think globally by building locally. Ensure the system builds before pushing.

The only way to truly test that any change is 100% compatible with the system is through the centralized integration build. However, if we do not test it in our working copy, it is highly likely that our changes break the build and disturb the work of other developers. Before making a submission to source control, developers should build the system using a Private System Build that is similar to the centralized build. A private system build does take time, but this is time spent by only one person rather than by each member of the team should there be a problem.

Best Practice 11: Code can be committed with bugs if they are known and do not introduce regressions.

Do not wait until you have the final version of your software. Sometimes it is better to have the code available in the master branch soon (even with known bugs) than to wait extra time to fix and land the code later (when more conflicts can happen and other developers will have less time to test it).

Continuous Integration and Building

Since many people are making changes in the repository, it is impossible for developers to be 100% sure that the entire system builds correctly after they integrate their changes in the repository, even if they create and extensively test a local build beforehand.

Continuous Integration (CI) is a software development practice where members of a team commit their work frequently, leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. This approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.

Build Process

Building is the process of turning the sources into a running system. This can often be a complicated process involving compilation, moving files around, generating configuration files, loading schemas into the databases, and so on. However, this process can (and as a result should) be automated.

Automated environments for builds are a common feature of software systems. The Unix world has had make for decades, the Java community developed Ant, the .NET community had NAnt and now has MSBuild, and for Node.js and JavaScript we now have Grunt, Gulp and many more. What is important, regardless of the programming language and framework, is to make sure you can build and launch your system with a single command using these scripts. A common mistake is not to include everything in the automated build. This should be avoided by all means, as anyone should be able to bring in a virgin machine, check the sources out of the repository, issue a single command, and have a running system on their machine.
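
A minimal sketch of such a single entry point is shown below. The project layout and the individual step commands (npm scripts in this case) are hypothetical placeholders; the point is simply that one command drives the whole build.

  # build.py - hypothetical single-command build driver; the individual steps are placeholders.
  import subprocess, sys

  STEPS = [
      ["npm", "install"],          # fetch dependencies
      ["npm", "run", "compile"],   # compile / transpile the sources (project-defined script)
      ["npm", "test"],             # run the automated test suite
      ["npm", "run", "package"],   # produce the deliverable artefact (project-defined script)
  ]

  for step in STEPS:
      if subprocess.run(step).returncode != 0:
          sys.exit("Build failed at step: " + " ".join(step))
  print("Build OK")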

Best Practice 12: The full build process should be automated and include everything that is required.

A big build often takes time and, with CI, we want to detect issues as soon as practical, so optimizing build time is key to meeting this target: in some cases, building a complete system might take hours. In order to save time, good build tools analyse what has changed and perform only the required actions. The common way to do this is to check the dates of the source and object files and only compile if the source date is later. One of the trickiest aspects of building in an incremental way is managing dependencies: if one object file changes, those that depend on it may also need to be rebuilt.
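
The timestamp check itself is simple, as the sketch below illustrates (the file names are hypothetical and a real tool would also walk the dependency graph):

  # incremental_build_sketch.py - the timestamp comparison used by tools such as make.
  import os

  def needs_rebuild(source, target):
      # Rebuild if the target is missing or older than its source.
      return (not os.path.exists(target)
              or os.path.getmtime(source) > os.path.getmtime(target))

  if needs_rebuild("main.c", "main.o"):
      print("recompiling main.c")   # a real tool would invoke the compiler here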

Best Practice 13: Try to minimize the time required to generate the build.

As explained before, multiple tools exist to perform the build, depending, for instance, on the OS of the host machine and the technology of the project: Make, Ant, Grunt... There are also cross-platform tools that allow creating a custom centralized build process on any OS and with any SCM system.

The build process should take into account that different targets or configurations may be supported. For instance, desktop software must usually be built for Windows, OS X and Linux, so the build system should be able to create builds for all these systems.

Having a central build ensures the software is always built in the same manner. The software build process should be reproducible, so the same build could be created as many times as needed and as close as possible to the final product build.

Best Practice 14: Have a centralized and reproducible build system.

Self Testing Builds

A build may be successfully created and it may run, but that doesn't mean it does the right thing. Modern statically typed languages can catch many bugs, but far more are not detected by the compiler.

A good way to catch bugs quickly and efficiently is to include automated tests in the build process. Testing isn't perfect, of course, but it can catch a lot of bugs.

The good news is that the rise of TDD has led to a wide availability of automated testing frameworks and tools, such as the xUnit family, Selenium and plenty of others.

Of course self-testing is not going to find everything, as tests do not prove the absence of bugs, but they help to detect bugs early and hence minimize their impact. As in the case of build generation, passing the tests takes time and we should try to optimize the testing process (in terms of performance and the number of relevant tests to be run).
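
For instance, a self-testing build can simply run the whole automated test suite and fail if any test fails. A minimal sketch, assuming the tests live in a hypothetical tests directory:

  # self_testing_build_sketch.py - run the test suite as part of the build and fail
  # the build when any test fails. The tests folder name is an assumption for the example.
  import sys
  import unittest

  suite = unittest.defaultTestLoader.discover("tests")
  result = unittest.TextTestRunner(verbosity=1).run(suite)
  sys.exit(0 if result.wasSuccessful() else 1)   # a non-zero exit code breaks the build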

Every commit creates a build

As we are encouraging developers to commit frequently, ensuring the mainline stays in a healthy state is an important but difficult task.

The best way to ensure that is by having regular builds on an integration machine and only if this integration build succeeds should the commit be considered to be done. Since the developer who commits is responsible for this, that developer needs to monitor the mainline build so they can fix it if it breaks. Your work is not completely done until the mainline build is finished and has passed all the self-tests.

Best Practice 15: Create a new build with every commit.

A continuous integration server acts as a monitor of the repository. Every time a commit against the repository is done, the server automatically checks out the sources onto the integration machine, initiates a build, runs the self-tests and notifies the committer of the result of the build and tests.

The best way to monitor the repository is by using tools such as hooks. Hooks are actions that can be configured in Git to be executed every time a user commits to the repository. A hook can be pre-commit or post-commit, depending on whether it is executed before or after the commit is done.

Pre-commit hooks may be used, for instance, to reject commits that have some errors (e.g. the system does not build with the changes); if such an error is detected, the commit is rejected and the user performing the commit is notified.

If post-commit hooks are used, it is the developer or the repository administrator who is responsible for performing corrective actions in case the build is not generated properly or the tests do not pass.
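
As an illustration, a client-side pre-commit hook is just an executable file placed at .git/hooks/pre-commit; Git aborts the commit when it exits with a non-zero status. The sketch below (Python, although any language works) rejects the commit when the test suite fails; the tests location is an assumption.

  #!/usr/bin/env python3
  # .git/hooks/pre-commit - hypothetical hook: abort the commit if the tests fail.
  import subprocess, sys

  result = subprocess.run([sys.executable, "-m", "unittest", "discover", "tests"])
  if result.returncode != 0:
      print("Tests failed: commit rejected.")
      sys.exit(1)   # a non-zero exit makes Git abort the commit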

Fix Broken mainline immediately

A key part of doing a continuous build is that if the mainline build fails, it needs to be fixed right away. The whole point of working with CI is that you're always developing on a known stable base.

It's not a terrible thing for the mainline build to break, although if it's happening all the time it suggests people aren't being careful enough about updating and building locally before a commit. When the mainline build does break, however, it's important that it gets fixed fast. Usually, the fastest way to fix the build is to revert the latest commit from the mainline, taking the system back to the last known good build; this is sometimes known as backing out the commit. Unless the cause of the breakage is immediately obvious and can be fixed really quickly, developers should just revert the mainline and debug the problem in the working copy, leaving the repository clean.

Best Practice 16: Back out any commit that breaks the master build immediately.

Visibility of CI

Continuous Integration is all about communication, so it is important to ensure that everyone can easily see the state of the system and the changes that have been made to it.

SCM systems such as Git provide us with information about the changes made, but Git as such does not communicate the state of the mainline build. The ideal solution is to provide a web site (either integrated with the SCM or standalone) that shows whether there is a build in progress and what the state of the last mainline build was. An example of such a system is Travis [[TRAVIS-CI]].

Releasing

A release is a version of the product that is made available to its intended customers. External releases are published to end-users, whereas internal releases are made available only to developers. Releases are identified by release numbers, which are totally independent of the SCM version numbers.

Releases can also be classified as full or partial releases, depending on whether they require a complete installation or not. Partial releases require a previous full release to be installed.

Release creation involves collecting all the files and documentation required to create a system release. Configuration descriptions have to be written for different hardware configurations, and installation scripts have to be created. The specific release must be documented to record exactly which files were used to create it. This allows it to be re-created if necessary.

Release planning is concerned with when to issue a system version as a release. The following factors should be taken into account for defining a release strategy:

  • Technical Quality of the System: If serious system faults are reported which affect the way in which many customers use the system, it may be necessary to issue a fault repair release. However, minor system faults may be repaired by issuing patches (often distributed over the Internet) that can be applied to the current release of the system.
  • Platform Changes: You may have to create a new release of a software application when a new version of the operating system platform is released.
  • Lehman's fifth law: This suggests that the increment of functionality that is included in each release is approximately constant. Therefore, if there has been a system release with significant new functionality, then it may have to be followed by a repair release.
  • Competition: A new system release may be necessary because a competing product is available.
  • Marketing Requirements: The marketing department of an organisation may have made a commitment for releases to be available at a particular date.
  • Customer Change Proposals: For customised systems, customers may have made and paid for a specific set of system change proposals and they expect a system release as soon as these have been implemented.

Controlling Changes

There is a need to continuously submit changes to the repository. The reasons for checking in changes are multiple:

  • Defects: A defect has been detected and needs to be fixed.
  • New Features: New features must be added to the software.
  • Improvements: An already existing functionality can be improved.

Even in a continuous integration model, it is important to be able to control the changes that have been made to the configuration items in the repository. Control in this context does not mean approval but traceability; i.e. it is not always necessary for someone to approve a change, but it must be possible to identify, for every change committed, the reasons for it. Lack of control in the process leads to project failures, confusion and chaos. Using a good control mechanism enables communication, sharing data and efficiency.

Depending on the codeline in which the changes are made, the level of information required and the flow that should be followed for implementing and approving them will be different.

For instance, changes in master should be encouraged rather than discouraged. In order to do so, developers should be free to commit their changes to the repository if:

  • The change is in line with the product backlog or the requirements he was developing for.
  • He raises an issue on the issue tracker describing the reason for the changes and the changes themselves.
  • When he performs the commit, he indicates in the commit information the related issue (the one he created).
  • After the commit is done, the issue is either automatically or manually marked as resolved.

Additionally, anybody in the development team should be free to raise additional issues that can be assigned to anybody within the team. Giving freedom to the development team (within some limits) is usually a good idea.

If the changes are going to be applied in a branch that was created based on a commercial release, a stricter control process is usually followed. For instance:

  • The developer creates an issue in the issue tracker; this kind of issue is usually called a Change Request (CR). A CR typically has the following information:
    • Project name, date, requestor, and priority.
    • Description of the problem.
    • Affected configuration items, branches and releases.
    • Suggested fix.
    • Severity.
    • Log files, screen shots...
  • The CR is then analysed by the configuration control board, which decides whether to approve or reject it.
    • In case it is rejected, the initiator can review it and create a new version (depending on the reasons for the rejection).
    • In case it is approved, it is assigned to a developer who will implement it and add it to the repository.

Regardless of how "controlled" the implementation of changes in the repository is, the system used should provide features such as:

  • Identify which changes have been implemented between two different baselines.
  • Check the status of a particular issue, defect or CR for the different platforms and releases.
  • Check if the resolution of a problem can be merged in a branch.

With respect to tools, there are multiple tools that support issue tracking. There are commercial ones, such as Jira, and free ones, such as Bugzilla or Redmine. The latter is very powerful as it is quite flexible and can be integrated with the most popular source code management tools such as Git and SVN. Additionally, other agile management tools such as Trello can be used.

Additionally, the source code management tools should also have an adequate authentication and authorization mechanism to ensure the traceability of changes, i.e. to identify who made any particular change in the repository. For instance, SVN offers an authentication mechanism and allows using others such as LDAP. The important thing is that SVN identifies which user has done which commit, in order to trace a change back to its author. SVN also allows the definition of permissions on a per-branch or per-configuration-item basis, so access to some branches may be allowed only to some users.

Testing

Introduction

The purpose of software testing is to ensure that the software systems work as expected when their target customers and users use them.

The basic idea of testing involves the execution of software and the observation of its behavior or outcome. If a deviation from the expected behavior is observed, the execution record is analyzed to find and fix the bug(s) that caused the failure.

Testing can hence be defined as controlled experimentation through program execution in a controlled environment before product release. Therefore, testing fulfils two primary purposes: demonstrating that the software behaves as expected, and detecting failures so that the underlying defects can be found and fixed.

Types of Testing

Testing could be categorized in different types based on different criteria.

Criteria 1: Functional and Structural Testing

The main difference between functional and structural testing is the knowledge (or lack of knowledge) about the software internals and hence the related focus:

  • Functional testing assumes that there is no information about the software internals or implementation details. Hence, this type of testing is usually known as "black-box" testing. This affects both the test definition and the test execution.
  • Structural testing assumes that information about the internals of the software is known and should be used for both test definition and execution. This type of testing is usually known as "white-box" testing.

Functional Testing

When a "black-box" approach is followed, the definition of the test cases to be executed does not take into account the structure of the software. The execution of the test cases focuses on the observation of the program's external behaviour during execution: it checks what the external output of the software is for given inputs.

There are different levels in which Black-Box testing can be performed:

  • At the most detailed level, individual program elements, such as functions or methods, can be tested.
  • At the intermediate level, various program elements or program components may be treated as an interconnected group, and tested accordingly.
  • At the most abstract level, the whole software system can be treated as a "black box", while we focus on the functions or input-output relations instead of the internal implementation.

Structural Testing

Structural testing requires the knowledge of the internals of the software implementation. It verifies the correct implementation of internal units, such as program statements, data structures, blocks... and the relations among them.

Defining test cases in a structural way consists in using knowledge of the software implementation in order to reduce the number of test cases to be executed. With the current tendency of defining the test cases (and even automating them, as we will see in the TDD section) before the software is implemented, this kind of technique is not that useful nowadays.

When executing tests in a structural way, as the key focus is the connection between execution behaviour and internal units, observing the results is not enough and additional software tools are also required: for instance, debuggers, which help us trace through program executions. By doing so, the tester can see whether a specific statement has been executed, and whether the result or behaviour is as expected.

This kind of testing is usually quite complex due to the use of these tools. However, its key advantage is that once a problem is detected it is also localized (the failure leads directly to the bug). Because of this complexity, this testing is only done when the root cause of a bug discovered via functional testing cannot be found, or in very late stages of the project.

Criteria 2: Coverage vs. Usage based testing

One of the most important decisions that should be taken by the QA (and the whole software and product) team is to decide when to stop testing.

Obviously, an easy (but wrong) decision would be to stop based on resources, e.g. stop when you run out of time or money. As such a decision would lead to quality problems, we need to find a quality-based criterion to decide when our product has passed enough tests. In order to identify when the product has reached the quality goals, there are different points of view:

  • Measure the quality directly by in-use metrics. The issue is that this approach requires actual customer usage.
  • Measure the quality indirectly through the execution of a set of tests that provide a good test coverage.

Usage-Based Statistical Testing (UBST)

Actual customer usage of software products can be viewed as a form of usage-based testing. Measuring quality directly in a real environment is the most accurate way to identify whether the software quality targets have been achieved.

The so-called beta test builds on this idea through controlled software releases, so that beta customers help software development organizations improve their software quality.

In usage-based statistical testing (UBST), the overall testing environment resembles the actual operational environment for the software product in the field, and the overall testing sequence, as represented by the orderly execution of specific test cases in a test suite, resembles the usage scenarios, sequences, and patterns of actual software usage by the target customers.

Although very useful, as this approach helps to detect not only bugs but also other types of problems, it could be dangerous if not used with care, as it could damage the software vendor's reputation, for instance if the product released as beta has very bad quality. Because of that, it is recommended to use this approach mainly in the final stages of the software or when the team feels very confident about its stability.

Coverage-Based Testing (CBT)

Most traditional testing techniques, either black-box or white-box, use various forms of test coverage as the stopping criterion. This means that the testing process is stopped when a set of tests has been executed successfully on the software. In this case, the key aspects are identifying what the required test coverage is and what "executed successfully" means.

With respect to coverage, in the case of functional testing it could consist of completing a checklist of major functions based on the product specification (system requirements), or of having a minimum number of test cases per user story, etc.

In the case of structural testing, it could consist of covering all the product components or all the statements of the software.

With respect to what "executed successfully" means, as we know it is impossible to have 100% bug-free software, so in some cases some known bugs may be accepted when deciding that the product is finished. Obviously, this depends on the type of product, the criticality of the bugs, the timescales...

Comparing CBT with UBST

The key differences that distinguish CBT from UBST are the perspective and the related stopping criteria.

With regards to the perspective, UBST views the objects of testing from a user's perspective and focuses on the usage scenarios, sequences, patterns, and associated frequencies or probabilities. On the other hand, CBT views the objects from a developer's perspective and focuses on covering functional or implementation units and related entities.

With regards to the stopping criteria, UBST uses product-in-use metrics as the exit criterion, whereas CBT uses coverage goals - which are supposed to be approximations of the in-use goals - as the exit criterion.

Criteria 3: Test Target

Tests are frequently grouped by where they are added in the software development process, or by the target (element) to be tested.

Unit Testing

Unit testing refers to tests that verify the functionality of a specific section of code, usually at the function level. In an object-oriented environment, this is usually at the class level, and the minimal unit tests include the constructors and destructors.

These types of tests are usually written by developers as they work on code (white-box style), to ensure that the specific function is working as expected. One function might have multiple tests, to catch corner cases or other branches in the code. Unit testing alone cannot verify the functionality of a piece of software, but rather is used to assure that the building blocks the software uses work independently of each other.

Unit testing is also called component testing.
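
As an illustration, a minimal xUnit-style unit test written by a developer might look like the sketch below; the Stack class is a hypothetical unit under test, included here so the example is self-contained.

  # test_stack.py - minimal unit test sketch; Stack is a hypothetical class under test.
  import unittest

  class Stack:
      def __init__(self):
          self._items = []
      def push(self, item):
          self._items.append(item)
      def pop(self):
          return self._items.pop()

  class StackTest(unittest.TestCase):
      def test_push_then_pop_returns_last_item(self):
          s = Stack()
          s.push(1)
          s.push(2)
          self.assertEqual(s.pop(), 2)

      def test_pop_on_empty_stack_raises(self):
          self.assertRaises(IndexError, Stack().pop)   # corner case

  if __name__ == "__main__":
      unittest.main()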

Integration Testing

Integration testing is any type of software testing that seeks to verify the interfaces between components against a software design. Software components may be integrated in an iterative way or all together ("big bang"). Normally the former is considered a better practice since it allows interface issues to be localised more quickly and fixed.

Integration testing works to expose defects in the interfaces and interaction between integrated components (modules). Progressively larger groups of tested software components corresponding to elements of the architectural design are integrated and tested until the software works as a system.

System Testing

System testing tests a completely integrated system to verify that it meets its requirements.

System Integration Testing

System integration testing verifies that a system is integrated to any external or third-party systems defined in the system requirements.

Criteria 4: Objectives of testing

Although testing has a common set of goals, the targets of testing can be very different. Some examples of types of testing based on their goals are listed in this section.

Regression Testing

Regression testing focuses on finding defects after a major code change has occurred. Specifically, it seeks to uncover software regressions, or old bugs that have come back. Such regressions occur whenever software functionality that was previously working correctly stops working as intended. Typically, regressions occur as an unintended consequence of program changes, when the newly developed part of the software collides with the previously existing code. Common methods of regression testing include re-running previously run tests and checking whether previously fixed faults have re-emerged. The depth of testing depends on the phase in the release process and the risk of the added features. It can range from complete, for changes added late in the release or deemed to be risky, to very shallow, consisting of positive tests on each feature, if the changes are early in the release or deemed to be of low risk.

Acceptance Testing

Acceptance testing can mean one of two things:

  • A smoke test is used as an acceptance test prior to introducing a new build to the main testing process, i.e. before integration or regression.
  • Acceptance testing performed by the customer, often in their lab environment on their own hardware, is known as user acceptance testing (UAT). Acceptance testing may be performed as part of the hand-off process between any two phases of development.

Alpha Testing

Alpha testing is simulated or actual operational testing by potential users/customers or an independent test team at the developers' site. Alpha testing is often employed for off-the-shelf software as a form of internal acceptance testing, before the software goes to beta testing.

Beta Testing

Beta testing comes after alpha testing and can be considered a form of external user acceptance testing. Versions of the software, known as beta versions, are released to a limited audience outside of the programming team. The software is released to groups of people so that further testing can ensure the product has few faults or bugs. Sometimes, beta versions are made available to the open public to increase the feedback field to a maximal number of future users.

Test Activities

As in many other software related activities, the typical plan, execute and assess flow is also used in testing as depicted in the figure below.

Test Activities as part of the Software Development process

Testing Planning and Preparation

Most of the key decisions about testing are made during this stage. During this phase an overall testing strategy is fixed by making the following decisions:

  • Overall objectives and goals, which can be refined into specific goals for specific testing. Some specific goals include reliability for usage-based statistical testing or coverage for various traditional testing techniques.
  • Objects to be tested and the specific focus: Functional testing views the software product as a black-box and focuses on testing the external functional behavior; while structural testing views the software product or component as a (transparent) whitebox and focuses on testing the internal implementation details.

As soon as the first models are being generated (for example, usage models, system models, architectural models, etc), they can be used to generate test cases: A test case is a collection of entities and related information that allows a test to be executed or a test run to be performed. The collection of individual test cases that will be run in a test sequence until some stopping criteria are satisfied is called a test suite. IEEE Standard 610 (1990) defines test case as follows:

  • A set of test inputs, execution conditions, and expected results developed for a particular objective, such as to exercise a particular program path or to verify compliance with a specific requirement.
  • (IEEE Std 829-1983) Documentation specifying inputs, predicted results, and a set of execution conditions for a test item.

According to Ron Patton: "Test cases are the specific inputs that you'll try and the procedures that you'll follow when you test the software."

From a more practical point of view, a test case is composed of:

  • Preconditions that should be established before the test is conducted.
  • Clear sequence of actions and input data that constitutes the test sequence.
  • Expected Result.

On the other hand, a test run is a dynamic unit of specific test activities in the overall testing sequence on a selected testing object. Each time a static test case is invoked, an individual dynamic test run is created.

One aspect that should be considered when planning the test cases is the sequencing of the individual test cases and the switch-over from one test run to another. Several concerns affect the specific test procedure to be used, including:

  • Dependencies among individual test cases. For instance, does a test case require the execution of another test case before?
  • Defect detection related sequencing. Many problems can only be effectively detected after others have been discovered and fixed.
  • Natural grouping of test cases, such as by functional and structural areas or by usage frequencies, can also be used for test sequencing and to manage parallel testing.

Testing Execution

The most important activities related with test execution are:

  • Allocating test time and resources.
  • Invoking and running tests, and collecting execution information and measurements.
  • Checking testing results and identifying system failures.

One of the critical aspects in order to fulfill the objectives of testing is checking if the result of the test run is successful or not. In order to do so, it must be possible to observe the results of the test and determine whether the expected result was achieved or not.

Is observing the results enough? In some situations, such as in object-oriented software, the execution of a test run may have affected the state of an object. That state might, in turn, affect the software under test in the future. Because of that, in some situations it is helpful to examine the state of some objects before and after a test is conducted, since only a small percentage of the overall functionality of an object can be observed via the return values.

This may conflict with using a "black-box" testing approach, in which only events observable from outside can be used to verify the results of a test run. However, the meaning of "observable from outside" may differ between software projects: outside a method? Outside an object? Outside the whole software?

When a failure is observed, it needs to be recorded and tracked until its resolution. In order to allow developers to trace the failure back to the bug causing it, it is important to register detailed information about the failure observations and the related activities.

Not only failures must be registered; successful executions also need to be recorded, as this is very important for regression testing.

In general for every test-run the following information should be gathered:

  • Run identification.
  • Timing. Start and end time.
  • Tester. The tester who attempted the test run.
  • Transactions. Transactions handled by the test run.
  • Results. Result of the test run.

Testing Analysis and Follow-up

The results of the testing activities (i.e. the measurement data collected during test execution), together with other data about the testing and the overall environment provide valuable feedback to test execution and other testing and development activities.

Obviously, as a consequence of testing, there are some direct follow-up activities:

  • Defect fixing. The development team must repair detected defects.
  • Management decisions, such as product release and transition from one development phase or sub-phase to another. For instance, given the results of testing, one can decide whether the product is mature enough to be published.

In order to fix an issue, it is important to follow these steps:

  1. Understanding the problem by studying the execution record.
  2. Being able to recreate the same problem scenario and observe the same problem.
  3. Problem diagnosis, to examine what kind of problem it is, where and when it occurs, and its possible causes.
  4. Fault locating, to identify the exact location(s) of fault(s).
  5. Defect fixing, to fix the located fault(s) by adding, removing, or correcting certain parts of the code. Sometimes, design and requirement changes could also be triggered or propagated from the above changes due to logical linkage among the different software components.

In order to take appropriate management decisions, some analysis can be performed on the overall testing results:

  1. Reliability analysis for usage-based testing, which can be used to assess current product reliability. Sometimes, low-reliability areas can be identified for focused testing and reliability improvement. In this type of testing it is also very helpful to detect problems not linked to defects, using the PUM metric.
  2. Coverage analysis for coverage-based testing, which can be used as a surrogate for reliability and used as the stopping criterion.
  3. Overall defect analysis, which can be used to examine defect distribution and to identify high-defect areas for focused remedial actions.

Test Cases and Test Coverage definition

Exhaustive testing is the execution of every possible test case. Rarely can we do exhaustive testing. Even simple systems have too many possible test cases. For example, a program with two integer inputs on a machine with a 32-bit word would have 2^64 possible test cases. Thus, testing always executes a very small percentage of the possible test cases.

Two basic concerns in software testing have already been introduced: (1) what test cases to use (test case selection) and (2) how many test cases are necessary (stopping criterion). Test case selection can be based on the specifications (functional), the structure of the code (structural), the flow of data (data flow), or random selection of test cases. Test case selection can be viewed as an attempt to space the test cases throughout the input space. Some areas in the domain may be especially error-prone and may need extra attention. It has also been mentioned that the stopping criterion can be based on a coverage criterion, such as executing N test cases in each subdomain, or on a behaviour criterion, such as testing until the error rate is less than a threshold x.

A program can be thought of as a mapping from a domain space to an answer space or range. Given an input, which is a point in the domain space, the program produces an output, which is a point in the range. Similarly, the specification of the program is a map from a domain space to an answer space.

Please also remember, that a specification is essential to software testing. Correctness in software is defined as the program mapping being the same as the specification mapping. A good saying to remember is "a program without a specification is always correct". A program without a specification cannot be tested against a specification, and the program does what it does and does not violate its specification.

Test Coverage criterion

A test coverage criterion is a rule about how to select tests and when to stop testing. One basic issue in testing research is how to compare the effectiveness of different test coverage criteria. The standard approach is to use the subsumes relationship.

Subsumes

A test criterion A subsumes test coverage criterion B if any test set that satisfies criterion A also satisfies criterion B. This means that the test coverage criterion A somehow includes the criterion B. For example, if one test coverage criterion required every statement to be executed and another criterion required every statement to be executed and some additional tests, then the second criterion would subsume the first criterion.

Researchers have identified subsumes relationships among most of the conventional criteria. However, although subsumes is a characteristic used for comparing test criteria, it does not measure the relative effectiveness of two criteria. This is because most criteria do not specify how the set of test cases will be chosen. Picking the minimal set of test cases that satisfies a criterion is not as effective as choosing good test cases until the criterion is met. Thus, a good set of test cases that satisfies a "weaker" criterion may be much better than a poorly chosen set that satisfies a "stronger" criterion.

Functional Testing

In functional testing, the specification of the software is used to identify subdomains that should be tested. One of the first steps is to generate a test case for every distinct type of output of the program. For example, every error message should be generated. Next, all special cases should have a test case. Tricky situations should be tested. Common mistakes and misconceptions should be tested. The result should be a set of test cases that will thoroughly test the program when it is implemented. This set of test cases may also help clarify to the developer some of the expected behavior of the proposed software.

In the book "The Art of Software Testing", Glenford Myers poses the following functional testing problem: Develop a good set of test cases for a program that accepts three numbers, a, b, c, interprets those numbers as the lengths of the sides of a triangle, and outputs the type of the triangle. Myers reports that in his experience most software developers will not respond with a good test set.

An approach to defining the test cases for this classic triangle problem is to divide the domain space into three subdomains, one for each type of triangle that we will consider: scalene (no sides equal), isosceles (two sides equal), and equilateral (all sides equal). We can also identify two error situations: a subdomain with bad inputs and a subdomain where sides of those lengths would not form a triangle. Additionally, since the order of the sides is not specified, all combinations should be tried. Finally, each test case needs to specify the expected output. The following table shows a possible solution.

Test Cases for Triangle Problem - Functional Testing Approach
Subdomain Test Description Test Case
Scalene Increasing Size (3,4,5) -> Scalene
Decreasing Size (5,4,3) -> Scalene
Largest is second (4,5,3) -> Scalene
Isosceles a=b & other side larger (5,5,8) -> Isosceles
a=c & other side larger (5,8,5) -> Isosceles
b=c & other side larger (8,5,5) -> Isosceles
a=b & other side smaller (8,8,5) -> Isosceles
a=c & other side smaller (8,5,8) -> Isosceles
b=c & other side smaller (5,8,8) -> Isosceles
Equilateral a=b=c (5,5,5) -> Equilateral
Not a triangle Largest first (6,4,2) -> Not a triangle
Largest second (4,6,2) -> Not a triangle
Largest third (1,2,3) -> Not a triangle
Bad Inputs One bad input (-1,4,2) -> Bad Inputs
Two bad inputs (-1,2,0) -> Bad Inputs
Three Bad Inputs (0,0,0) -> Bad Inputs

This list of subdomains could be increased to distinguish other subdomains that might be considered significant. For example, in scalene subdomains, there are actually six different orderings, but the placement of the largest might be the most significant based on possible mistakes in programming.

Note that one test case in each subdomain is usually considered minimal but acceptable.
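
As an illustration of how such functional test cases can be automated, the following JUnit 4 sketch encodes some of the test cases from the table. The classify() helper is a hypothetical implementation included only to make the sketch self-contained; it is one possible way of satisfying the specification, not necessarily the program under test.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Sketch (JUnit 4): functional test cases for the triangle problem.
    public class TriangleFunctionalTest {

        // Hypothetical implementation, included only to make the sketch self-contained.
        static String classify(int a, int b, int c) {
            if (a <= 0 || b <= 0 || c <= 0) return "bad inputs";
            if (a >= b + c || b >= a + c || c >= a + b) return "not a triangle";
            if (a == b && b == c) return "equilateral";
            if (a == b || b == c || a == c) return "isosceles";
            return "scalene";
        }

        @Test public void scaleneInAnyOrder() {
            assertEquals("scalene", classify(3, 4, 5));
            assertEquals("scalene", classify(5, 4, 3));
            assertEquals("scalene", classify(4, 5, 3));
        }

        @Test public void isoscelesWithEqualPairInEveryPosition() {
            assertEquals("isosceles", classify(5, 5, 8));
            assertEquals("isosceles", classify(5, 8, 5));
            assertEquals("isosceles", classify(8, 5, 5));
        }

        @Test public void equilateral() {
            assertEquals("equilateral", classify(5, 5, 5));
        }

        @Test public void errorSubdomains() {
            assertEquals("not a triangle", classify(1, 2, 3));
            assertEquals("bad inputs", classify(-1, 4, 2));
            assertEquals("bad inputs", classify(0, 0, 0));
        }
    }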

Test Matrices

A way to formalize this identification of subdomains is to build a matrix using the conditions that we can identify from the specification and then to systematically identify all combinations of these conditions as being true or false.

The conditions in the triangle problem might be:

  1. a=b or a=c or b=c
  2. a=b and b=c
  3. a >= b + c OR b >= a + c OR c >= a + b
  4. a <= 0 or b <= 0 or c <= 0 (i.e. the negation of a>0 and b>0 and c>0).

These four conditions can be put on the rows of a matrix. The columns of the matrix will each be a subdomain. For each subdomain, a T will be placed in each row whose condition is true and an F where the condition is false. All valid combinations of T and F will be used. With four conditions there are 2^4 = 16 possible combinations, but not all of them are feasible, as some of the conditions depend on others being true or false; in this example only 8 subdomains (columns) remain. Additional rows will be used for defining possible values of a, b, and c and for the expected output of each subdomain's test case.
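
To see why not all of the 16 combinations are feasible, a small brute-force sketch like the one below can be used: it evaluates the four conditions over a grid of candidate integer inputs and records which truth-value combinations actually occur. The ranges chosen are arbitrary assumptions made only for the example.

    import java.util.Set;
    import java.util.TreeSet;

    // Sketch: enumerate which combinations of the four triangle-problem
    // conditions are feasible, by scanning a small grid of integer inputs.
    public class FeasibleCombinations {
        public static void main(String[] args) {
            Set<String> seen = new TreeSet<>();
            for (int a = -2; a <= 10; a++)
                for (int b = -2; b <= 10; b++)
                    for (int c = -2; c <= 10; c++) {
                        boolean c1 = (a == b) || (a == c) || (b == c);
                        boolean c2 = (a == b) && (b == c);
                        boolean c3 = (a >= b + c) || (b >= a + c) || (c >= a + b);
                        boolean c4 = (a <= 0) || (b <= 0) || (c <= 0);
                        seen.add("" + t(c1) + t(c2) + t(c3) + t(c4));
                    }
            // Prints the feasible subset of the 16 possible T/F combinations.
            seen.forEach(System.out::println);
        }

        private static char t(boolean b) { return b ? 'T' : 'F'; }
    }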

Next table shows an example of this matrix:

Test Cases for Triangle Problem - Test Matrices - Functional Testing
Conditions 1 2 3 4 5 6 7 8
a=b or a=c or b=c T T T T T F F F
a=b and b=c T T F F F F F F
a>=b+c or b>=a+c or c>=a+b T F T T F T T F
a<=0 or b<=0 or c<=0 T F T F F T F F
Sample Test Case 0,0,0 3,3,3 0,4,0 3,8,3 5,8,5 0,5,6 3,4,8 3,4,5
Expected Output Bad inputs Equilateral Bad inputs Not a triangle Isosceles Bad inputs Not a triangle Scalene

Structural Testing

Structural testing coverage is based on the structure of the source code. The simplest structural testing criterion is every statement coverage, often called C0 coverage.

C0 - Every Statement Coverage

This criterion is that every statement of the source code should be executed by some test case. The normal approach to achieving C0 coverage is to select test cases until a coverage tool indicates that all statements in the code have been executed.

The pseudocode in the following table implements the triangle problem. The table also shows which lines are executed by which test cases: a * means that the node is executed by the test case in that column, and a - means that it is not. Note that the first three statements (A, B, and C) can be considered parts of the same node.

Test Cases for Triangle Problem - C0 - Structural Testing
Node Source (3,4,5) (3,5,3) (0,1,0) (4,4,4)
A read a,b,c * * * *
B type="scalene" * * * *
C if((a==b) || (b==c) || (a==c)) * * * *
D type="isosceles" - * * *
E if((a==b) && (b==c)) * * * *
F type="equilateral" - - - *
G if((a>=b+c) || (b>=a+c) || (c>=a+b)) * * * *
H type="not a triangle" - - * -
I if((a<=0) || (b<=0) || (c<=0)) * * * *
J type="bad inputs" - - * -
K print type * * * *

By the fourth test case, every statement has been executed. This set of test cases is not the smallest set that would cover every statement; however, searching for the smallest possible test set would often not produce a good test set.

C1 - Every Branch Testing

A more thorough test criterion is every-branch testing, which is often called C1 test coverage. In this criterion, the goal is to go both ways out of every decision.

If we model the program of previous table as a control flow graph, this coverage criterion requires covering every arc in the following control flow diagram.

The next table shows the test cases identified with this criterion; again, a * marks an arc exercised by the test case in that column and a - marks one that is not.

Control Flow Graph
Test Cases for Triangle Problem - C1 Approach - Structural Testing
Arcs (3,4,5) (3,5,3) (0,1,0) (4,4,4)
ABC-D - * * *
ABC-E * - - -
D-E - * * *
E-F - - - *
E-G * * * -
F-G - - - *
G-H - - * -
G-I * * - *
H-I - - * -
I-J - - * -
I-K * * - *
J-K - - * -

Every Path Testing

Even more thorough is the every-path testing criterion. A path is a unique sequence of program nodes executed by a test case. In the testing matrix above, there were eight subdomains, and each of them happens to correspond to a path. In that example, there are sixteen different combinations of T and F; however, eight of those combinations correspond to infeasible paths. That is, there is no test case that could produce that combination of T and F for the decisions in the program. It can be exceedingly hard to determine whether a path is infeasible or whether it is just hard to find a test case that executes it.

Most programs with loops will have an infinite number of paths. In general, every-path testing is not reasonable.

Next table shows the eight feasible paths in the triangle pseudocode as well as the test cases required for testing all of them.

Test Cases for Triangle Problem - Every Path Approach - Structural Testing
Path T/F Test Case Output
ABCEGIK FFFF 3,4,5 Scalene
ABCEGHIK FFTF 3,4,8 Not a triangle
ABCEGHIJK FFTT 0,5,6 Bad inputs
ABCDEGIK TFFF 5,8,5 Isosceles
ABCDEGHIK TFTF 3,8,3 Not a triangle
ABCDEGHIJK TFTT 0,4,0 Bad Inputs
ABCDEFGIK TTFF 3,3,3 Equilateral
ABCDEFGHIJK TTTT 0,0,0 Bad Inputs

Multiple Condition Coverage

A multiple-condition testing criterion requires that each primitive relation condition is evaluated both true and false. Additionally, all combinations of T/F for the primitive relations in a condition must be tried. Note that lazy evaluation of expressions will eliminate some combinations. For example, in an "and" of two primitive relations, the second will not be evaluated if the first one is false.

In the pseudocode for the triangle example, there are multiple conditions in each decision statement as displayed in the tables below. Primitives that are not executed because of lazy evaluation are shown with an 'X'.

Test Cases for Triangle Problem - Multiple Condition: Condition if(a==b||b==c||a==c) - Structural Testing
Combination Possible Test Case Branch
TXX 3,3,4 ABC-D
FTX 4,3,3 ABC-D
FFT 3,4,3 ABC-D
FFF 3,4,5 ABC-E
Test Cases for Triangle Problem - Multiple Condition: Condition (a==b&&b==c) - Structural Testing
Combination Possible Test Case Branch
TT 3,3,3 E-F
TF 3,3,4 E-G
FX 4,3,3 E-G
Test Cases for Triangle Problem - Multiple Condition: Condition (a>=b+c||b>=a+c||c>=a+b) - Structural Testing
Combination Possible Test Case Branch
TXX 8,4,3 G-H
FTX 4,8,3 G-H
FFT 4,3,8 G-H
FFF 3,3,3 G-I
Test Cases for Triangle Problem - Multiple Condition: Condition (a<=0||b<=0||c<=0) - Structural Testing
Combination Possible Test Case Branch
TXX 0,4,5 I-J
FTX 4,-2,-2 I-J
FFT 5,4,-3 I-J
FFF 3,3,3 I-K

Subdomain Testing

Subdomain testing is the idea of partitioning the input domain into mutually exclusive subdomains and requiring an equal number of test cases from each subdomain. This was basically the idea behind the test matrix. Subdomain testing is more general in that it does not restrict how the subdomains are selected. Generally, if there is a good reason for picking the subdomains, then they may be useful for testing. Additionally, the subdomains from other approaches might be subdivided into smaller subdomains. Theoretical work has shown that subdividing subdomains is only effective if it tends to isolate potential errors into individual subdomains.

Every-statement coverage and every-branch coverage are not subdomain tests. There are not mutually exclusive subdomains related to the execution of different statements or branches. Every-path coverage is a subdomain coverage, since the subdomain of test cases that execute a particular path through a program is mutually exclusive with the subdomain for any other path.

For the triangle problem, we might start with a subdomain for each output. These might be further subdivided into new subdomains based on whether the largest or the bad element is in the first position, second position, or third position (when appropriate). Next table shows the subdomains and test cases for every subdomain.

Test Cases for Triangle Problem - Subdomain Testing
Subdomain Possible Test Case
Equilateral 3,3,3
Isosceles first largest 8,5,5
Isosceles second largest 5,8,5
Isosceles third largest 5,5,8
Scalene first largest 5,4,3
Scalene second largest 3,5,4
Scalene third largest 3,4,5
Not a triangle first largest 8,3,3
Not a triangle second largest 3,8,4
Not a triangle third largest 4,3,8
Bad Inputs first largest 4,3,0
Bad Inputs second largest 3,4,0
Bad Inputs third largest -1,4,5

Data Flow Testing

Data flow testing is testing based on the flow of data through a program. Data flows from where it is defined to where it is used.

A definition of data, or DEF, is when a value is assigned to a variable. For example, with respect to a variable x, nodes containing statements such as input x or x = 2 would both be defining nodes.

Usage nodes (USE) refer to situations in which a variable is used by the software. Two main kinds of use have been identified:

  • The computation use, or C-USE, is when the variable is used in a computation (e.g. it appears on the right-hand side of an assignment statement) such as in print x or a = 2+x. A C-USE is said to occur on the assignment statement.
  • The predicate use, or P-USE, is when the variable is used in the condition of a decision statement (e.g. if x>6). A P-USE is assigned to both branches out of the decision statement.

There are also three other types of usage node, which are all, in effect, subclasses of the C-USE type:

  • O-use: output use - the value of the variable is output to the external environment (for instance, the screen or a printer, e.g. print(x)).
  • L-use: location use - the value of the variable is used, for instance, to determine which position of an array is used (e.g. a[x]).
  • I-use: iteration use - the value of the variable is used to control the number of iterations made by a loop (for example: for (int i = 0;i <= x; i++))

A definition-free path, or def-free path, is a path from a definition of a variable to a use of that variable that does not include any other definition of that variable.
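
The following small Java fragment (an invented example, not part of the triangle program) illustrates the different kinds of nodes for a variable x; the method and variable names are assumptions made only for the illustration.

    // Sketch: DEF and USE nodes for the variable x.
    public class DataFlowExample {
        static void example(int input, int[] table) {
            int x = input;                 // DEF: x is assigned a value
            int a = 2 + x;                 // C-USE: x appears on the right-hand side of an assignment
            if (x > 6) {                   // P-USE: x is used in a predicate (assigned to both branches)
                System.out.println(x);     // O-USE: the value of x is output to the environment
            } else {
                int y = table[x];          // L-USE: x determines which array position is used
            }
            for (int i = 0; i <= x; i++) { // I-USE: x controls the number of loop iterations
                a = a + i;                 // no new DEF of x, so the paths above are def-free for x
            }
        }
    }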

The next figure depicts the control flow graph of the triangle problem, annotated with the definitions and uses of the variables type, a, b, and c.

Control Flow Graph

More details about the control-flow procedure and examples can be found in the paper "Data Flow Testing - CS-399: Advanced Topics in Computer Science, Mark New (321917)"

Random Testing

Random testing is accomplished by randomly selecting the test cases. This approach has the advantage of being fast and it also eliminates biases of the testers. Additionally, statistical inference is easier when the tests are selected randomly. Often the tests are selected randomly from an operational profile.

For example, for the triangle problem, we could use a random number generator and group each successive set of three numbers into a test case. We would have the additional work of determining the expected output. One problem with this is that the chance of ever generating an equilateral test case would be very small; if it actually happened, we would probably start questioning our pseudo-random number generator.
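
A minimal sketch of such a random test generator is shown below; the value range 1..10, the fixed seed and the number of test cases are arbitrary assumptions, and determining the expected output for each generated triple remains to be done separately (by hand or with a trusted reference implementation).

    import java.util.Random;

    // Sketch: random selection of test cases for the triangle problem.
    public class RandomTriangleTests {
        public static void main(String[] args) {
            Random rnd = new Random(42);   // fixed seed so the run itself is repeatable
            for (int i = 0; i < 20; i++) {
                int a = 1 + rnd.nextInt(10);
                int b = 1 + rnd.nextInt(10);
                int c = 1 + rnd.nextInt(10);
                // The expected output for each triple still has to be determined.
                System.out.printf("test case %2d: (%d, %d, %d)%n", i + 1, a, b, c);
            }
        }
    }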

Operational Profile

Testing in the development environment is often very different than execution in the operational environment. One way to make these two more similar is to have a specification of the types and the probabilities that those types will be encountered in the normal operations. This specification is called an operational profile. By drawing the test cases from the operational profile, the tester will have more confidence that the behavior of the program during testing is more predictive of how it will behave during operation.

A possible operational profile for the triangle problem is shown in next table:

Operational Profile for the Triangle Problem
# Description Probability
1 Equilateral 20%
2 Isosceles - Obtuse 10%
3 Isosceles - Right 20%
4 Scalene - Right 10%
5 Scalene - All Acute 25%
6 Scalene - Obtuse Angle 15%
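
Drawing test cases according to such a profile can be sketched as a weighted random choice over the categories, as below; the category names and probabilities are the ones from the table, while turning a chosen category into a concrete input triple is left out of the sketch.

    import java.util.Random;

    // Sketch: selecting triangle-problem test categories according to the
    // operational profile above (cumulative probabilities over the six types).
    public class OperationalProfileSampler {
        private static final String[] TYPES = {
            "Equilateral", "Isosceles - Obtuse", "Isosceles - Right",
            "Scalene - Right", "Scalene - All Acute", "Scalene - Obtuse Angle"
        };
        private static final double[] PROBS = { 0.20, 0.10, 0.20, 0.10, 0.25, 0.15 };

        static String nextCategory(Random rnd) {
            double p = rnd.nextDouble();
            double cumulative = 0.0;
            for (int i = 0; i < TYPES.length; i++) {
                cumulative += PROBS[i];
                if (p < cumulative) return TYPES[i];
            }
            return TYPES[TYPES.length - 1]; // guard against rounding
        }

        public static void main(String[] args) {
            Random rnd = new Random();
            for (int i = 0; i < 10; i++) {
                System.out.println("draw " + (i + 1) + ": " + nextCategory(rnd));
            }
        }
    }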

Statistical Inference from testing

If random testing has been done by randomly selecting test cases from an operational profile, then the behavior of the software during testing should be the same as its behavior in the operational environment.

For instance, if we selected 1000 test cases randomly using an operational profile and found three failures, we could estimate that the software will exhibit a failure rate of roughly three failures per 1000 executions in the operational environment.

Boundary Testing

Often errors happen at the boundaries between subdomains. In source code, decision statements determine the boundaries. If a decision statement is written as x<1 instead of x<0, the boundary has shifted. If a decision is written x<=1, then the boundary, x=1, is in the true subdomain. In the terminology of boundary testing, we say that the on tests are in the true domain, while the off tests (values of x greater than 1) are in the false domain.

If the decision is written x<1 instead of x<=1, then the boundary, x=1, is in the false subdomain instead of the true subdomain.

Boundary testing is aimed at ensuring that the actual boundary between two subdomains is as close as possible to the specified boundary. Thus, test cases are selected on the boundary and off the boundary as close as reasonable to the boundary. The standard boundary test is to do two on tests as far apart as possible and one off test close to the middle of the boundary.

Next figure shows a simple boundary. The arrow indicates that the on tests of the boundary are in the subdomain below the boundary. The two on tests are at the ends of the boundary and the off test is just above the boundary halfway along the boundary.

Boundary Conditions

In the triangle example, for the primitive conditions a>=b+c, b>=a+c and c>=a+b, we could consider the boundary of the first one. Since the condition involves three variables, the boundary is a plane in 3D space. The on tests would be two (or more) widely separated tests that satisfy the equality - for example, (8,1,7) and (8,7,1); both make the condition true. The off test would be in the other (false) domain and near the middle of the boundary - for example, (7.9,4,4).
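
Expressed as a small, self-contained sketch, the boundary tests above could be checked as follows (the notATriangle() helper is an assumption that simply evaluates the condition under test):

    // Sketch: on and off tests for the boundary a >= b + c (and its symmetric variants).
    public class BoundaryTests {
        static boolean notATriangle(double a, double b, double c) {
            return a >= b + c || b >= a + c || c >= a + b;
        }

        public static void main(String[] args) {
            System.out.println(notATriangle(8, 1, 7));    // on test  -> expected true
            System.out.println(notATriangle(8, 7, 1));    // on test  -> expected true
            System.out.println(notATriangle(7.9, 4, 4));  // off test -> expected false
        }
    }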

Test Automation

For large software systems, the test coverage required to ensure adequate quality may be huge. Because of that, it is impossible to run all the tests manually, and mechanisms to automate the tests are used. However, it should be noted that in many cases (if not all) full automation of the procedure is impossible due to the need for manual intervention or analysis of the results. Hence, when automation is used, it should be assessed in which areas of the software functionality it is going to lead to the greatest benefits.

Among the three major test activities, preparation, execution, and follow-up, execution is a prime candidate for automation.

The testing that programmers do is generally called unit testing (also known as Object Testing). The rhythm of an Object Test is similar to that of any other test: set up a test fixture, exercise the code under test, verify that the expected outcome was obtained, and clean up.

In order to facilitate this process, a number of frameworks have been built for different programming languages, such as JUnit for Java, NUnit for .NET or CppUnit for C++, and many more covering over 30 programming languages and environments.

Although the implementations are different for every environment, the concepts are the same in all of these frameworks, which are known in the abstract as xUnit.
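
As an illustration, a minimal JUnit 4 test following the common xUnit rhythm (set up, exercise, verify, tear down) might look like the sketch below; the class under test is just a standard Java list, chosen only to keep the example self-contained.

    import static org.junit.Assert.assertEquals;
    import java.util.ArrayList;
    import java.util.List;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    // Sketch of the xUnit structure using JUnit 4.
    public class ShoppingListTest {
        private List<String> list;          // the test fixture

        @Before
        public void setUp() {               // set up: runs before each test method
            list = new ArrayList<>();
        }

        @Test
        public void addedItemIsStored() {
            list.add("milk");               // exercise the code under test
            assertEquals(1, list.size());   // verify (self-checking)
            assertEquals("milk", list.get(0));
        }

        @After
        public void tearDown() {            // tear down: runs after each test method
            list = null;
        }
    }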

Most software developers just want to write code; testing is simply a necessary evil in our line of work. Automated tests provide a nice safety net so that we can write code more quickly, but we will run the automated tests frequently only if they are really easy to run.

What makes tests easy to run? Four specific goals answer this question: tests should be fully automated, self-checking, repeatable, and simple.

With these four goals satisfied, one click of a button (or keyboard shortcut) is all it should take to get the valuable feedback the tests provide. Let's look at these goals in a bit more detail.

Goal1: Fully Automated Tests

A test that can be run without any Manual Intervention is a Fully Automated Test. Satisfying this criterion is a prerequisite to meeting many of the other goals, but on its own it is not enough: it is possible to write Fully Automated Tests that don't check the results and that can be run only once. A main() program that runs the code and directs print statements to the console is a good example of such a test.

Goal2: Self-checking Tests

A Self-Checking Test has encoded within it everything that the test needs to verify that the expected outcome is correct. The Test Runner "calls us" only when a test did not pass; as a consequence, a clean test run requires zero manual effort. Many members of the xUnit family provide a Graphical Test Runner that uses a green bar to signal that everything is OK; a red bar indicates that a test has failed and warrants further investigation.

Goal3: Repeatable Tests

A Repeatable Test can be run many times in a row and will produce exactly the same results without any human intervention between runs. Unrepeatable Tests increase the overhead of running tests significantly. This outcome is very undesirable because we want all developers to be able to run the tests very frequently, as often as after every "save". Unrepeatable Tests can be run only once before whoever is running the tests must perform a Manual Intervention. Just as bad are nondeterministic tests that produce different results at different times; they force us to spend lots of time chasing down failing tests. The power of the red bar diminishes significantly when we see it regularly without good reason. All too soon, we begin ignoring the red bar, assuming that it will go away if we wait long enough. Once this happens, we have lost a lot of the value of our automated tests, because the feedback indicating that we have introduced a bug and should fix it right away disappears. The longer we wait, the more effort it takes to find the source of the failing test.

Tests that run only in memory and that use only local variables or fields are usually repeatable without us expending any additional effort. Unrepeatable Tests usually come about because we are using a Shared Fixture of some sort. In such a case, we must ensure that our tests are self-cleaning as well. When cleaning is necessary, the most consistent and foolproof strategy is to use a generic Automated Teardown mechanism. Although it is possible to write teardown code for each test, this approach can result in Erratic Tests when it is not implemented correctly in every test.

Goal4: Simplicity

Coding is a fundamentally difficult activity because we must keep a lot of information in our heads as we work. When we are writing tests, we should stay focused on testing rather than coding of the tests. This means that tests must be simple - simple to read and simple to write. They need to be simple to read and understand because testing the automated tests themselves is a complicated endeavor. They can be tested properly only by introducing the very bugs that they are intended to detect; this is hard to do in an automated way so it is usually done only once (if at all), when the test is first written. For these reasons, we need to rely on our eyes to catch any problems that creep into the tests, and that means we must keep the tests simple enough to read quickly.

Of course, if we are changing the behavior of part of the system, we should expect a small number of tests to be affected by our modifications. We want to Minimize Test Overlap so that only a few tests are affected by any one change. Contrary to popular opinion, having more tests pass through the same code doesn't improve the quality of the code if most of the tests do exactly the same thing.

Tests become complicated for two reasons:

  • We try to verify too much functionality in a single test.
  • Too large an "expressiveness gap" separates the test scripting language (e.g. Java) and the before/after relationships between domain concepts that we are trying to express in the test.

The tests should be small and test one thing at a time. Keeping tests simple is particularly important during test-driven development because code is written to pass one test at a time and we want each test to introduce only one new bit of behavior. We should strive to Verify One Condition per Test by creating a separate Test Method for each unique combination of pre-test state and input.

The major exception to the mandate to keep Test Methods short occurs with customer tests that express real usage scenarios of the application. Such extended tests offer a useful way to document how a potential user of the software would go about using it; if these interactions involve long sequences of steps, the Test Methods should reflect this reality.

Goal5: Maintainability

Tests should be maintained along with the rest of the software. Testware must be much easier to maintain than production software, as otherwise:

  • It will slow the development down.
  • It will get left behind.
  • It will have less value.
  • Developers will go back to manual testing.

Test Driven Development

What is TDD

The steps of test first design (TFD) are overviewed in the UML activity diagram of the next figure. The first step is to quickly add a test, basically just enough code to fail. Next you run your tests, often the complete test suite, although for the sake of speed you may decide to run only a subset, to ensure that the new test does in fact fail. You then update your functional code to make it pass the new test. The fourth step is to run your tests again. If they fail, you need to update your functional code and retest. Once the tests pass, the next step is to start over (you may first need to refactor any duplication out of your design as needed, which is what converts TFD into TDD).

TFD Steps

Dean Leffingwell describes TDD with this simple formula:

TDD = Refactoring + TFD.

TDD completely turns traditional development around. When you first go to implement a new feature, the first question you ask is whether the existing design is the best design possible to enable you to implement that functionality. If so, you proceed via a TFD approach. If not, you refactor it locally to change the portion of the design affected by the new feature, enabling you to add that feature as easily as possible. As a result, you will always be improving the quality of your design, thereby making it easier to work with in the future.

Instead of writing functional code first and then your testing code as an afterthought, if you write it at all, you instead write your test code before your functional code. Furthermore, you do so in very small steps - one test and a small bit of corresponding functional code at a time. A programmer taking a TDD approach refuses to write a new function until there is first a test that fails because that function isn't present. In fact, they refuse to add even a single line of code until a test exists for it. Once the test is in place, they then do the work required to ensure that the test suite now passes (your new code may break several existing tests as well as the new one). This sounds simple in principle, but when you are first learning to take a TDD approach, it proves to require great discipline because it is easy to "slip" and write functional code without first writing a new test.

An underlying assumption of TDD is that you have a testing framework available to you. Agile software developers often use the xUnit family of open source tools, such as JUnit or VBUnit, although commercial tools are also viable options. Without such tools TDD is virtually impossible. Next figure presents a UML state chart diagram for how people typically work with the xUnit tools (source Keith Ray).

Testing via xUnit

Kent Beck, who popularized TDD, defines two simple rules for TDD (Beck 2003):

  • First, you should write new business code only when an automated test has failed.
  • Second, you should eliminate any duplication that you find.

Beck explains how these two simple rules generate complex individual and group behavior:

  • You design organically, with the running code providing feedback between decisions.
  • You write your own tests because you can't wait 20 times per day for someone else to write them for you.
  • Your development environment must provide rapid response to small changes (e.g. you need a fast compiler and regression test suite).
  • Your designs must consist of highly cohesive, loosely coupled components (e.g. your design is highly normalized) to make testing easier (this also makes evolution and maintenance of your system easier too).

For developers, the implication is that they need to learn how to write effective unit tests.

TDD also improves documentation

Most programmers don't read the written documentation for a system, instead they prefer to work with the code. And there's nothing wrong with this. When trying to understand a class or operation most programmers will first look for sample code that already invokes it. Well-written unit tests do exactly this - they provide a working specification of your functional code - and as a result unit tests effectively become a significant portion of your technical documentation. The implication is that the expectations of the pro-documentation crowd need to reflect this reality. Similarly, acceptance tests can form an important part of your requirements documentation. This makes a lot of sense when you stop and think about it. Your acceptance tests define exactly what your stakeholders expect of your system, therefore they specify your critical requirements. Your regression test suite, particularly with a test-first approach, effectively becomes detailed executable specifications.

Are tests sufficient documentation? Very likely not, but they do form an important part of it. For example, you are likely to find that you still need user, system overview, operations, and support documentation. You may even find that you require summary documentation overviewing the business process that your system supports. When you approach documentation with an open mind, I suspect that you will find that these two types of tests cover the majority of your documentation needs for developers and business stakeholders. Furthermore, they are an important part of your overall efforts to remain as agile as possible regarding documentation.

Why TDD

A significant advantage of TDD is that it enables you to take small steps when writing software. This is far more productive than attempting to code in large steps. For example, assume you add some new functional code, compile, and test it. Chances are pretty good that your tests will be broken by defects that exist in the new code. It is much easier to find, and then fix, those defects if you've written two new lines of code than two thousand. The implication is that the faster your compiler and regression test suite, the more attractive it is to proceed in smaller and smaller steps. I generally prefer to add a few new lines of functional code, typically less than ten, before I recompile and rerun my tests.

The act of writing a unit test is more an act of design than of verification. It is also more an act of documentation than of verification. The act of writing a unit test closes a remarkable number of feedback loops, the least of which is the one pertaining to verification of function.

The first reaction that many people have to agile techniques is that they're ok for small projects, perhaps involving a handful of people for several months, but that they wouldn't work for "real" projects that are much larger. That's simply not true. Beck (2003) reports working on a Smalltalk system taking a completely test-driven approach which took 4 years and 40 person years of effort, resulting in 250,000 lines of functional code and 250,000 lines of test code. There are 4000 tests running in under 20 minutes, with the full suite being run several times a day. Although there are larger systems out there, it's clear that TDD works for good-sized systems.

A simple example

You are asked to write code to extract the UK postal area code from any given full UK postcode. For example, for the input "SS17 7HN" the output should be "SS"; for the input "B43 4RW" the output should be "B".

Before starting to code, you need to create the tests, and for doing so you need to think about the structure of the software. An obvious approach is creating a class named PostCode with a method that retrieves the postal area code (e.g. areaCode()). In that way, in order to obtain the area code, something similar to the following would be written:

PostCode postCode = new PostCode("SS17 7HN");

String area = postCode.areaCode();

Hence, the first thing to do is to develop the test cases for such a solution. As you have already been provided with two examples, a good idea is to use them as the test cases:

  • Test Case 1: ("SS17 7HN") -> "SS"
  • Test Case 2: ("B43 4RW") -> "B"

Depending on the programming language, you would build those test cases using the corresponding xUnit tool. If you try to execute them, both are going to fail because the class PostCode does not exist yet.
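
For instance, using JUnit (Java) the two test cases could be written as in the following sketch, with the class and method names assumed above:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    // Sketch of the two initial test cases for the postal area code example.
    // At this point the PostCode class does not exist yet, so this does not
    // even compile, which is the expected first "failure" of the TDD cycle.
    public class PostCodeTest {

        @Test
        public void areaCodeOfTwoLetterArea() {
            PostCode postCode = new PostCode("SS17 7HN");
            assertEquals("SS", postCode.areaCode());
        }

        @Test
        public void areaCodeOfOneLetterArea() {
            PostCode postCode = new PostCode("B43 4RW");
            assertEquals("B", postCode.areaCode());
        }
    }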

The next step is building an empty class PostCode with an areaCode() method that always returns a fixed dummy string.

In this case, again the tests are going to fail, as the return value is not the expected one.

In the following iteration we could implement the solution for the first part of the problem: when the postcode is 8 characters long, the area code is composed of the first two characters. If we run the tests again, the first one ("SS17 7HN") will complete successfully whereas the second one will fail. We have added a very small piece of functionality, and we have almost immediately checked that what we added is correct.

In the following iteration we could implement the solution for the second part of the problem: when the postcode is 7 characters long, the area code is composed of the first character alone. If we run the tests again, both will complete successfully. Again, we have added only a few lines of code (2-3, perhaps), and have immediately checked that what we added is correct.

Imagine now that you are told that there are 8-character postcodes in which only the first character denotes the area, e.g. "W1Y2 3RD". You should first create a new test case for that new example. Obviously, it will fail, as the code will return "W1" instead of "W", but adding the functionality to implement the new feature should be quite safe, as it is easy to check whether, in the process of implementing it, you are breaking the existing features.
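
After the three iterations, one possible implementation is sketched below. Taking the leading alphabetic characters as the area code is an assumption chosen because it satisfies all the examples given; it is not necessarily how the exercise intends the rules to be coded.

    // Sketch: one possible PostCode implementation after the iterations above.
    public class PostCode {
        private final String code;

        public PostCode(String code) {
            this.code = code;
        }

        public String areaCode() {
            // Assumption: the area code is the leading run of letters,
            // which yields "SS", "B" and "W" for the three examples.
            int i = 0;
            while (i < code.length() && Character.isLetter(code.charAt(i))) {
                i++;
            }
            return code.substring(0, i);
        }
    }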

QA Activities beyond Testing

Although testing is one of the key QA activities, there are many additional actions that could be taken in order to assure that the quality of the product meets some targets. Some of them are going to be analysed in this chapter.

Defect Prevention and Process Improvement

The best way to avoid defects is preventing them and the most common technique for doing so is Defect Causal Analysis (DCA). This type of analysis consists in identifying the causes of defects and other problems and taking action to prevent them from occurring in the future by improving the process and reducing the causes that originate the defects.

DCA can be seen as a systematic process to identify and analyze causes associated with the occurrence of specific defect types, allowing the identification of improvement opportunities for the organizational process assets and the implementation of actions to prevent the occurrence of that same defect type in future projects. DCA is used by many companies; for instance, HP has extensively used it with very good results [[SOFTWARE-FAILURE-ANALYSIS-HP]].

There are multiple methodologies to implement a DCA system, but in general, the following activities should be conducted in all of them:

  1. Defect Identification: Defects are found by QA activities specifically intended to detect defects, such as design reviews, code inspections, and function and unit testing.

  2. Defect Classification: Once defects are identified, they need to be classified. There are multiple ways and techniques to classify defects, for instance: Requirements, Design, Logical and Documentation. These categories can be further divided into second and third levels depending on the complexity and size of the product.

    Orthogonal Defect Classification (ODC) [[ODC]] is one of the most important techniques used for classifying defects. A defect is categorized into classes that collectively point to the part of the process which needs attention, much like characterizing a point in a Cartesian system of orthogonal axes by its (x, y, z) coordinates.

  3. Defect Analysis: After defects are logged and classified, the next step is to review and analyze them using root cause analysis (RCA) techniques.

    As performing a defect analysis for all the defects is a big effort, a useful tool before doing this kind of analysis is a Pareto chart. This kind of chart shows the frequencies of occurrence of the various categories of problems encountered, in order to determine which of the existing problems occur most frequently. The problem categories or causes are shown on the x-axis of the bar graph and the cumulative percentage is shown on the y-axis. Such a diagram helps us to identify the defect types that should be given higher priority and must be attended to first (see the sketch after this list for the numbers behind such a chart).

    For instance, the following picture shows an example of a Pareto diagram:

    Pareto Diagram Example

    Root-cause analysis is the process of finding the activity or process which causes the defects and finding ways of eliminating or reducing their effect by providing remedial measures.

    Defects are analyzed to determine their origins. A collection of such causes will help in doing the root cause analysis. One of the tools used to facilitate root cause analysis is a simple graphical technique called the cause-and-effect (fishbone) diagram, which is drawn to sort and relate the factors that contribute to a given situation.

    It is important that this process uses the knowledge and expertise of the team, and that it keeps in mind that the target is providing information and analysis in a way that helps implement changes in the processes that prevent defects later on.

    For instance, the following picture shows an example of a Fishbone diagram:

    Fishbone Diagram Example
  4. Defect Prevention: Once the causes of the defects are known, it is key to identify actions that can be put in place to remove those causes. This can be achieved, for instance, with meetings where all the possible causes are identified from the cause-and-effect diagram and debated among the team. All suggestions are listed and then the ones identified as the main causes are separated out. For these causes, possible preventive actions are discussed and finally agreed among project team members.

  5. Process Improvement: Once the preventive actions have been identified, they need to be put in place and their effectiveness verified, for instance by observing the defect density and comparing it with previous projects.
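
As mentioned in the defect analysis step above, a Pareto chart is built from the defect counts per category and their cumulative percentages. The sketch below shows the arithmetic behind such a chart; the defect counts are invented example data, and the categories are the classification examples used above.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch: the numbers behind a Pareto chart (counts are invented example data).
    public class ParetoExample {
        public static void main(String[] args) {
            // Example data, already sorted by decreasing frequency.
            Map<String, Integer> defects = new LinkedHashMap<>();
            defects.put("Logical", 45);
            defects.put("Requirements", 25);
            defects.put("Design", 20);
            defects.put("Documentation", 10);

            int total = defects.values().stream().mapToInt(Integer::intValue).sum();
            double cumulative = 0.0;
            for (Map.Entry<String, Integer> e : defects.entrySet()) {
                cumulative += 100.0 * e.getValue() / total;
                System.out.printf("%-14s %3d defects, cumulative %.0f%%%n",
                        e.getKey(), e.getValue(), cumulative);
            }
        }
    }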

You can find some examples and more details about this process at [[DEFECT-PREVENTION-NUTSHELL]] and [[DEFECT-ANALYSIS-AND-PREVENTION]].

Code Inspection and Formal Verification

For many years, people considered that the only consumers of software were machines and that human beings were not meant to read the code after it was written. This attitude began to change in the early 1970s through the efforts of many developers who saw the value in reading code as part of a QA culture.

Nowadays, not all companies apply techniques based on reading code as part of their software development (including QA) process, but the concept of studying program code as part of the defect removal process is widely accepted as beneficial. Of course, the likelihood of those techniques being successful depends on multiple factors: the size and complexity of the software, the size of the development team, the timeline for development and, of course, the background and culture of the programming team.

Part of the skepticism about this kind of method is that many people believe that tasks led by humans are bound to give worse results than mathematical proofs conducted by a computer. However, it has been shown that simple and informal code review techniques contribute substantially to productivity and reliability in three major ways.

Code reviews are generally effective in finding from 30 to 70 percent of the logic-design and coding errors in typical programs. They are not effective, however, in detecting high-level design errors, such as errors made in the requirements analysis process. Note that a success rate of 30 to 70 percent doesn't mean that up to 70 percent of all errors in the program might be found, but up to 70 percent of the defects that are eventually going to be detected (remember that we never know how many defects a piece of software contains).

Of course, a possible criticism of these statistics is that the human processes find only the easy errors (those that would be trivial to find with computer-based testing) and that the difficult, obscure, or tricky errors can be found only by computer-based testing. However, some testers using these techniques have found that the human processes tend to be more effective than the computer-based testing processes in finding certain types of errors, while the opposite is true for other types of errors. This means that reviews and computer-based testing are complementary; error-detection efficiency will suffer if one or the other is not present.

Different ways of performing code reviews exist, and in the following sections we are going to assess a few of them.

Formal Code Inspection

For historical reasons, formal reviews are usually called inspections. This is due to the work Michael Fagan conducted and presented in his 1976 study at IBM regarding the efficacy of peer reviews. We are going to call them Formal Code Inspections to distinguish them from other types of code reviews.

There is always an inspection team, which usually consists of four people. One of them plays the role of moderator, who should be an expert programmer but not the author of the program (he does not need to be familiar with the software either).

Moderator duties include:

  • Distributing materials for, and scheduling, the inspection session.
  • Leading the session.
  • Recording all errors found.
  • Ensuring that the errors are subsequently corrected.

The rest of the team consists of the developer of the code, a software architect (who could be the architect of the software under review) and a quality assurance engineer.

The inspection agenda is distributed some days in advance of the inspection session. Together with the agenda, the moderator distributes the software, the specification and any relevant material to the inspection team so they can become familiar with the material before the meeting takes place.

During the review session the moderator ensures that two key activities take place:

  1. The programmer describes, statement by statement, the logic of the software. Other participants are free (and encouraged) to raise questions in order to determine whether errors exist. It is likely that the developer himself, rather than the rest of the team, is the one who finds many of the errors identified during this stage. In other words, the simple act of reading a program aloud to an audience seems to be a remarkably effective error-detection technique.
  2. The program is analyzed with respect to checklists of historically common programming errors.

When the session is over, the programmer receives an error list that includes all the errors that have been discovered. Hence, the session is focused on finding defects, not fixing them. Despite that, on some occasions, when a problem is discovered, the review team may propose and discuss some design changes. When some of the detected defects require significant changes in the code, the review team may agree to hold follow-up meetings in order to review the code again after the changes are implemented.

The list of errors is not only used by the developer in order to fix them; it is also used by the moderator to check whether the error checklist could be improved with the results.

The review sessions are typically very dynamic, and hence the moderator is responsible not only for reviewing the code but also for keeping the session focused so that time is used efficiently (these sessions should last 90-120 minutes at most).

This kind of approach requires the right attitude, especially from the developer whose work is going to be under scrutiny. He must forget about his ego and think about the process as a way to improve the quality of his work and his development skills, as he usually receives a lot of feedback about programming styles, algorithms and techniques. And it is not only the developer but also the rest of the team who can learn from such an open exchange of ideas.

The following diagram describes this process graphically:

Formal Code Inspections Flow

The following tables describe some checklists used in formal code reviews, as explained in [[ART-OF-TESTING]].

Checklist for Formal Inspections - 1
Checklist for Formal Inspections - 2

Walkthrough

A walkthrough is quite similar to a formal code inspection, as it is also very formal, it is conducted by a team, and it takes place during a pre-scheduled session of 90-120 minutes. However, there is a key difference: the procedure during the meeting. Instead of simply reading the software and using checklists, the participants "play computer": a person designated as the tester comes to the meeting with a set of pre-defined test cases for the software. During the meeting, each test case is mentally executed; that is, the test data are "walked through" the logic of the program. The state of the program is monitored on paper or on a whiteboard.

The test cases need not be a complete set of test cases, especially because every mental execution of a test case tends to take a lot of time. The test cases themselves are not the critical thing; they are just an excuse for questioning the developer about the assumptions and decisions taken.

Although the size of the team is quite similar (three to five people), the roles of the participants are slightly different. Apart from the author of the software and a moderator, there are two key roles in walkthroughs: a tester, responsible for guiding the execution of the test cases, and a secretary, who writes down all the errors found. Additionally, other participants are welcome, typically experienced programmers.

Over the shoulder Review

The two formal approaches described above are good and help to detect many defects. Additionally, they provide extra metrics and information about the effectiveness of the reviews themselves. However, they require a lot of effort and consume a lot of extra developer time. Many studies during the last years have shown that there are other, less formal methods that can achieve similar results while requiring less training and time.

The first one we are going to study is the over-the-shoulder review. This is the most common and informal kind of code review. An over-the-shoulder review is just that: a reviewer standing behind the author at the author's computer while the author walks the reviewer through a set of code changes.

Typically the author "drives" the review by sitting at the computer, opening various files, pointing out the changes and explaining why they were done that way. Multiple tools can be used by the developer, and it is usual to move back and forth between files.

If the reviewer sees something wrong, they can take different actions, such as doing a bit of "pair programming" while the developer implements the fix, or just taking note of the issue to be solved offline.

With cooperation tools such as videoconferencing, desktop sharing and so on, it is possible to perform this kind of review remotely, but obviously it is not as effective, as the greatest asset of this technique is the closeness between developers and the ease of taking ad-hoc actions while being together.

The key advantage of this approach is its simplicity: no special training is required and it can be done at any time without any preparation. It also encourages human interaction and cooperation. Reviewers tend to be more verbose and braver when speaking than when they need to record their reviews in a system such as a database or review tool.

Of course, it has some drawbacks. The first one is that, due to its informal nature, it is really difficult to enforce, i.e. there is no way (document, tool, etc.) to check whether such a review has been conducted. The second one is that, as the author is the one leading the whole process, he might omit parts of the code. The third one is the lack of traceability to check that the detected defects have been properly addressed.

The following diagram describes this process graphically:

Over the shoulder reviews workflow

Offline Reviews

This is the second-most common form of informal code review, and the technique preferred by most open-source projects. Here, whole files or sets of changes are packaged up (ZIP file, URL, pull request, etc.) by the author and sent to reviewers via e-mail or any other tool. Reviewers examine the files offline, ask questions, discuss with the author and other developers, and suggest changes.

Collecting the files to be reviewed was formerly a difficult task, but nowadays, with source code management (SCM) systems such as Git, it is extremely easy to identify the files that the developer has modified and hence the changes he wants to merge into the main repository.

But SCM tools have helped not only in identifying the changes made by the developer, but also in multiple other areas such as:

  • Sending E-mail notifications: Request for review, review done, comments need to be addressed, etc...
  • Recording review comments: Things to be changed, result of the review, whether the changes have been implemented or not, etc. This is key for having a way to enforce reviews.
  • Combined display: Allow developers and reviewers to easily check the differences between files allowing different views.
  • Discussion: Sometimes some kind of discussion is needed between the developer and the reviewer in order to understand the code a bit better, clarify the reasons behind some decisions, etc.

Obviously, the main advantage with respect to over-the-shoulder reviews is that it can work perfectly with developers that are not based in the same place, whether across a building or across an ocean. Additionally, with this technique it is extremely easy to have multiple reviewers review the code in parallel; in many cases, if the reviews are done in an SCM system, anyone with access to the SCM can comment on the review, even if he/she is not a designated reviewer.

The main disadvantage with respect to an over-the-shoulder review is that it takes longer, as it usually requires several interactions; this can be especially painful if people are in different time zones.

In general, we could say that offline code reviews, when properly integrated with an SCM, strike a good balance between speed, effectiveness and traceability.

The following diagram describes this process graphically:

Offline Code Reviews Workflow

Pair Programming

Pair programming is a development practice that incorporates continuous code review into the development process itself. It consists of two developers writing code at a single terminal, with only one developer typing at a time and continuous free-form discussion and review.

Studies of pair programming have shown it to be very effective at both finding bugs and promoting knowledge transfer. However, having the reviewing developer so involved in the development itself is seen by many people as a risk of bias: it is going to be more difficult for him to take a step back and critique the code from a fresh point of view. On the other hand, it could be argued that deep knowledge and understanding also give him the capability to provide more effective comments.

The key difference with the other techniques mentioned above is that introducing this way of working affects not only how QA activities are performed but also how development itself is done (i.e. you could combine all the other review techniques with different ways of developing code). Adopting this way of working requires properly evaluating how developers are going to work in such an environment and the time required to work in this way.

Code Review Techniques: Summary

Each type of review is useful in its own way. Offline reviews strike a balance between time invested and ease of implementation. In any case, any kind of code review is better than nothing, but it should also be acknowledged that code reviews alone are not enough to guarantee the quality of the final product.

Assertion Driven Development & Design by contract

Defensive Programming

Defensive programming consists of including in the software as many checks as possible, even if they are redundant (e.g. checks made by both callers and callees). Sometimes it is said that "maybe they don't help, but they don't harm either".

The problem with this way of working is that, in some cases, it ends up adding a lot of redundancy "just in case", which means adding unnecessary complexity and increasing the software size. The bigger and more complex a piece of software is, the more easily defects can affect it.

The ideas behind defensive programming are interesting, but in order for them to have a positive effect, a more systematic approach should be followed.

Contract Concept

A contract, in the real world, is an agreement between two parties in which each party expects some benefits from the contract provided it meets some obligations. Both are linked, i.e. if the obligations are not met by either of the parties, there is no guarantee the benefits will materialize. Those benefits and obligations are clearly documented so that there are no misunderstandings between the parties.

Imagine a courier company that has an express service within the city of Madrid. That express service is only provided if the customer meets some conditions (e.g. the package is within the limits, the address is valid and in Madrid, the customer pays...). If the customer meets these conditions, he gets the benefit of the package being delivered within 4 hours. If he does not meet them, there is no guarantee he gets the express delivery benefits. The following table shows the obligations/benefits of this example:

Party Obligations Benefits
Client Provide letter or package of no more than 5 kilograms, each dimension no more than 2 meters. Pay 100 Euros. Provide a valid recipient address in Madrid. Get package delivered without any damage to recipient in 4 hours or less.
Supplier Deliver package to recipient in 4 hours or less. No need to deal with deliveries that are too big, too heavy or unpaid.

One important remark is that when a contract is exhaustive, every obligation is tied to a benefit and there are no obligations beyond the ones stated. This is also called the "No hidden clauses" rule. This does not mean that the contract cannot refer to external laws, best practices, regulations...; it only means that they do not need to be explicitly stated. For instance, if the courier fails to meet its obligations, it is highly likely that a law establishes a compensation for the customer.

Contracts in Software

It is easy to understand how the concept of contracts in the real world can be extrapolated to software development. In software, every task can be split into multiple sub-tasks, and the idea of a sub-task is similar to contracting something out to a company: we create a function, module, etc. that handles a part that is essential to completing the overall task.

           task is
           do
             subtask1:
             subtask2:
             subtask3:
           end
          

If all the subtasks are completed correctly, the task will also finish successfully. If there is a contract between the task and the subtasks, the task has some guarantees about their completion. Subtasks in software development are typically functions, object methods...

Think also about the Spotify way of working, in which an architecture was created that allows every team to deliver different parts of the Spotify client independently. It is quite similar: they have divided the main task (the Spotify client) into multiple subtasks (the components of the architecture). If all the components behave properly, the overall task will work properly too.

Design By Contract

Design by Contract (DbC) is based on the definition of formal, precise and verifiable interface specifications for every software component. These specifications extend the ordinary definition of abstract data types with preconditions, postconditions and invariants. These specifications are also known as contracts.

A software contract could be defined as the set of three different things:

  • Preconditions: Conditions that must be guaranteed on entry by any client module that calls the routine. They are an obligation for the client module and a benefit for the supplier (no need to handle cases outside of the precondition).
  • Postconditions: Properties guaranteed on exit. This is an obligation for the supplier and a benefit for the client.
  • Class invariants: Guarantees that certain properties of the class are preserved, i.e. not changed on exit.

This can be formalized as three questions that developers must try to answer when implementing a function:

  • What does the contract expect?
  • What does the contract guarantee?
  • What does the contract maintain?

Using Design By Contract

The ideal environment for Design by Contract is one in which the language being used supports it natively. Unfortunately, not many languages support this capability, Eiffel being the best known one. For those languages, the contract is part of the routine definition. For instance, see the Eiffel example below:

          class ACCOUNT create
            make
          feature
            ... Attributes as before:
                balance, minimum_balance, owner, open ...
            deposit (sum: INTEGER) is
                -- Deposit sum into the account.
              require
                sum >= 0
              do
                add (sum)
              ensure
                balance = old balance + sum
              end
          end

In programming languages with no direct support, assertions are in most cases used as a way to implement DbC techniques, and there are libraries that try to simplify the process of defining these assertions. An assertion is a predicate used to indicate that, if the software is in a correct state, the predicate should always be true at that point. If an assertion evaluates to false, that means the software is in a wrong state (e.g. the contract has been broken).
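
For instance, in Java the deposit contract of the Eiffel example above could be approximated with assert statements, as in the sketch below. Assertions are only checked when the JVM is started with the -ea flag, which ties in with the monitoring levels discussed in the next section.

    // Sketch: approximating the ACCOUNT.deposit contract with Java assertions.
    // Assertions are only evaluated when the JVM is started with -ea; otherwise
    // they have no effect at runtime.
    public class Account {
        private int balance;

        public void deposit(int sum) {
            assert sum >= 0 : "precondition violated: sum must be non-negative";
            int oldBalance = balance;      // kept to express the postcondition
            balance += sum;
            assert balance == oldBalance + sum : "postcondition violated";
        }

        public int getBalance() {
            return balance;
        }
    }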

Of course, the functions can still perform some checks, but only for conditions that are not part of the contract. The idea of DbC is removing any duplication and minimizing the amount of code necessary to check that the contract is met.

Monitoring the assertions

One question that could be raised is: "What happens if one of these conditions fails during execution?". This depends on whether assertions are monitored or not at runtime (and this is usually configurable depending on developer needs), but it is not a critical aspect. The target of DbC is implementing reliable software that works; what happens when it does not work is interesting, but not the main target.

Developers can choose from various levels of assertion monitoring: no checking, preconditions only, pre- and postconditions, conditions and invariants...

If a developer decides not to check assertions, the assertions or contracts have no impact on system execution. If a condition is not met, the software may be in an erroneous situation and no extra action is taken; these are just bugs. This is the typical configuration for released products.

If a developer decides to check assertions, the effect of an unmet assertion is typically an exception being raised. The typical use case for enabling assertion checking is debugging, i.e. detecting defects not blindly but based on consistency conditions. This is the typical configuration during development and testing.
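As an illustration of how such a switch can look in practice, Python checks assert statements by default but strips them entirely when the interpreter runs in optimized mode, so the same code base can monitor contracts while debugging and ignore them in a release (account.py below is a hypothetical file name for the sketch above):

          # Development / debugging: assertions (i.e. the contracts) are monitored.
          #     python account.py
          #
          # Release: -O strips assert statements, so contracts have no runtime cost.
          #     python -O account.py
          if __debug__:
              print("assertion monitoring is enabled in this run")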

There may also be special treatment of these exceptions. For instance, Eiffel routines can include a rescue clause that expresses the alternate behaviour of the routine (similar to the clauses in human contracts that allow for exceptional, unplanned circumstances). When a routine includes a rescue clause, any exception occurring during the routine's execution interrupts the execution of the body and starts the execution of the rescue clause. This can be used to shield the code in some situations.
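Languages without a rescue clause can approximate this behaviour with ordinary exception handling. The Python fragment below is only a rough analogue of the idea (it reuses the hypothetical Account sketch above and does not reproduce Eiffel's exact semantics, such as retrying the routine):

          def safe_deposit(account, amount) -> None:
              """Deposit with a fallback, roughly analogous to an Eiffel rescue clause."""
              try:
                  account.deposit(amount)
              except AssertionError:
                  # Alternate behaviour for the exceptional, unplanned case:
                  # leave the account untouched and report the broken contract.
                  print("deposit rejected: the contract was not met")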

Fault Tolerance and Failure Containment

There is no practical way to guarantee that a given piece of software has no bugs. It does not matter how good our tools, methodologies and engineers are, nor how thorough our inspections and testing are. In many cases the presence of bugs is tolerated as something "natural", but there are systems where the reliability and security requirements are so important that extra measures should be taken to mitigate the consequences of undetected bugs.

When a system has extreme reliability requirements, fault tolerant solutions should be put in place. The idea behind fault tolerant solutions is to break the cause/effect relationship between bugs and failures. The result is an increase in reliability, as reliability is inversely proportional to the frequency of failures. These techniques are usually expensive, since they typically require some form of redundancy, so they are only used in systems that require them.

For instance, the software used in flight control systems is one example of software with extreme requirements regarding failures. The report [[CHALLENGES-FAULT-TOLERANT-SYSTEMS]] provides more details about the challenges that this kind of system poses to software developers.

However, in some situations failures cannot be prevented, and hence reliability cannot be improved further. In those cases there are still ways to minimize the consequences of failures, with the goal of maximizing safety. It is important not to confuse reliability with safety: for instance, a medical system may not be 100% reliable, but it should be 100% safe. The techniques intended to increase system safety are called failure containment techniques.

In this chapter we are going to study both types of techniques.

Fault Tolerance

Fault tolerance techniques are used to tolerate software faults and prevent system failures from occurring when a fault occurs. These kinds of techniques are used in software that is very sensitive to failures, such as aerospace software, nuclear power plants, healthcare...

Single Version Software Environments (No Redundancy)

In this case, only one instance of the software exists, and it tries to detect faults and recover from them without replicating the software. These techniques are really difficult to implement: studies have shown that efficient fault tolerant systems require some kind of redundancy, as we will see in the next section.

Multiple Version Software Environments (Redundancy)

Redundancy in real-world activities is the best way to increase reliability: multiple engines in a plane, lights in a car... For instance, NASA performed research to calculate the probability of surviving a mission depending on the amount of redundant equipment in the spacecraft, and the results showed that the survival chances are extremely dependent on redundancy, as shown in the figure below.

Mission Survival vs. Material Redundancy (source NASA)

The same is true for software systems, but with some caveats. Redundancy in software only works if the redundant system keeps working when the original one fails: this is normal in hardware systems but not in software. If we have two identical software systems and the first one fails, it is extremely likely that the second one fails too. Because of this, it is important that redundant software systems are uncorrelated. Designing uncorrelated systems usually requires two teams working in isolation, with different techniques... which means at least duplicating the development cost.

In these systems, the multiple independently developed instances of the software can work in different configurations: N-version programming (NVP), recovery blocks (RcB), N self-checking programming (NSCP)...

Software Fault Tolerance
  1. Recovery blocks: Use repeated executions (redundancy over time) as the basic mechanism for fault tolerance. The software includes a set of "recovery points" at which the state is recorded, so that they can be used as fallbacks if something goes wrong. When a piece of code is executed, an "acceptance test" is run internally: if the result is OK, a new "recovery point" is set up; if the result is not acceptable, the software returns to the previous "recovery point" and an alternative to the faulty code is executed. This process continues until the "acceptance test" is passed or no more alternatives are available, which leads to a failure. A simplified code sketch of both schemes is given after this list. Some key characteristics of this scheme, depicted in the figure below:
    • It is a backward error recovery technique: when an error occurs, appropriate actions are taken to react, but no preventive action is taken beforehand.
    • It is a "serial technique" in which the same functionality (the recovery block) is never executed in parallel.
    • The "acceptance test" algorithm is the critical part to success as well as the availability of recovery blocks designed in different ways to the original code.
    Recovery Blocks
  2. NVP (N-version programming):

    This technique uses parallel redundancy: N copies of code fulfilling the same functionality, each one a different version, run in parallel with the same inputs. When all of the N copies have completed the operation, an adjudication process (the decision unit) takes place to determine the output (based on a more or less complex vote).

    Some key characteristics of this scheme, depicted in the figure below:

    • It is a forward error recovery technique: preventive actions are taken, i.e. even if no error occurs the same functionality is executed multiple times.
    • It is a "parallel technique" in which the same functionality is always executed in parallel by different versions of the same functionality.
    • The "decision unit" algorithm is the critical part to success as well as the availability of different versions of the same code designed in different ways.
    N version programming

    Obviously, a wide range of variants of those systems have been proposed based on multiple combinations of them [[COST-EFFECTIVE-FAULT-TOLERANCE]], and several comparisons of their performance are also available [[PERFORMANCE-RB-NVP-SCOP]].
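To make the two schemes more concrete, the following Python sketch (hypothetical helper functions, not taken from any real fault tolerance framework, with no real checkpointing or parallelism) shows a simplified recovery block that tries alternates until an acceptance test passes, and a simplified N-version decision unit based on a majority vote:

          from collections import Counter

          def recovery_block(alternates, acceptance_test):
              """Try each alternate in order until one passes the acceptance test."""
              for alternate in alternates:
                  # In a real system the state would be checkpointed here and rolled
                  # back before trying the next alternate (backward error recovery).
                  try:
                      result = alternate()
                  except Exception:
                      continue                  # a crash counts as a failed alternate
                  if acceptance_test(result):
                      return result             # acceptance test passed
              raise RuntimeError("all alternates failed: the recovery block fails")

          def n_version(versions):
              """Run every version (sequentially here, conceptually in parallel) and vote."""
              results = [version() for version in versions]
              value, votes = Counter(results).most_common(1)[0]
              if votes <= len(versions) // 2:
                  raise RuntimeError("no majority: the decision unit cannot adjudicate")
              return value

          # Usage sketch: three independently written routines computing |x| for x = -7.
          versions = [lambda: abs(-7), lambda: max(-7, 7), lambda: (-7) ** 2 // 7]
          print(recovery_block(versions, acceptance_test=lambda r: r == 7))   # 7
          print(n_version(versions))                                          # 7

Note the trade-off visible even in this toy sketch: the recovery block only runs extra versions when the acceptance test fails, while N-version programming always pays for running every version.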

Exercise 3: Recovery Blocks vs. N-Version. What do you think are the key advantages and disadvantages of the two fault tolerance techniques described (Recovery Blocks & N-Version)?

Types of recovery

Error recovery is the key part of fault tolerant systems. However, although it is the most important one, it is the last step in a sequence of four phases:

  1. Error Detection: the ability to detect that the software is in an erroneous state, for instance via assertions, as studied in this unit.
  2. Error Diagnosis: assessing the causes of the error situation into which the software has fallen.
  3. Error Containment: before trying to correct the error, stopping its propagation so that no further damage happens.
  4. Error Recovery: replacing the erroneous state with an error-free state.

Depending on how the new error-free state is calculated, we can distinguish two approaches:

  • Backward Recovery: With this mechanism the system tries to go back to a previously saved state that is known to be correct. Recording or saving these states is called checkpointing; a minimal sketch is given after this list. Some systems, instead of recording complete states, keep deltas with respect to previous states, which is known as differential checkpointing. Recovery blocks are one example of this type of technique.
  • Forward Recovery: In this approach, the system tries to find a new state from which it can continue operating. This state can be calculated by error compensation, which relies on redundancy-based algorithms: the redundant system provides a set of potential results from which a compensation step derives an answer deemed acceptable. N-version programming is an example of this technique. These techniques are more complex and require more resources, so they are mostly used when the system is critical with respect to delays (i.e. there is no time for a backward recovery).
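As a minimal illustration of backward recovery through checkpointing (a toy state dictionary, not a complete mechanism):

          import copy

          state = {"balance": 100, "pending": []}
          checkpoint = copy.deepcopy(state)     # checkpointing: record a known-good state

          try:
              state["balance"] -= 250           # faulty update drives the balance negative
              assert state["balance"] >= 0, "invariant violated"
          except AssertionError:
              state = copy.deepcopy(checkpoint) # backward recovery: roll back to the checkpoint

          print(state)                          # {'balance': 100, 'pending': []}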

Failure Containment

Some software is used in safety-critical systems, where a failure has severe consequences. In those situations it is very important to avoid potential accidents or at least minimize their consequences.

Various specific techniques are used for these kinds of systems, most of them based on the analysis of the potential hazards linked to the failures:

  • Hazard Elimination through substitution, simplification, decoupling, elimination of specific human errors, and reduction of hazardous materials or conditions. These techniques reduce certain defect injections or substitute non-hazardous conditions for hazardous ones. The general approach is similar to the defect prevention and defect reduction techniques surveyed earlier, but with a focus on the problems involved in hazardous situations.
  • Hazard Reduction through design for controllability (for example, automatic pressure release in boilers), use of locking devices (for example, hardware/software interlocks), and failure minimization using safety margins and redundancy. These techniques are similar to fault tolerance, where local failures are contained without leading to system failures.
  • Hazard Control through reducing exposure, isolation and containment (for example, barriers between the system and the environment), protection systems (active protection activated in case of hazard), and fail-safe design (passive protection that fails into a safe state without causing further damage; see the sketch at the end of this section). These techniques reduce the severity of failures, thereby weakening the link between failures and accidents.
  • Damage Control through escape routes, safe abandonment of products and materials, and devices for limiting physical damage to equipment or people. These techniques reduce the severity of accidents, thus limiting the damage caused by these accidents and related software failures.

Notice that both hazard control and damage control above are post-failure activities that attempt to contain the failures so that they do not lead to accidents, or so that the accident damage can be controlled or minimized. All these techniques are usually very expensive and process/technology intensive, so they should only be applied when safety matters and when dealing with rare conditions related to accidents.
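Finally, as a small illustration of the fail-safe design idea mentioned above (a purely hypothetical controller, sketched only to show the shape of the technique): the point is that any unhandled failure drives the system into a passive state that cannot cause further damage.

          def run_control_cycle(read_sensor, actuate, shut_down_safely):
              """One control cycle that fails safe: any error leads to the safe state."""
              try:
                  temperature = read_sensor()
                  assert 0 <= temperature <= 120, "sensor reading out of plausible range"
                  actuate(temperature)
              except Exception:
                  # Fail-safe design: instead of continuing in an unknown state, the
                  # controller moves the plant to a passive, safe configuration.
                  shut_down_safely()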