These are the notes for Software Quality at USJ
Early Draft
Let's have a look at one of the most famous bugs in the whole history of software: the outage that AT&T suffered in 1990.
At 2:25 PM on Monday, January 15th, network managers at AT&T's Network Operations Centre began noticing an alarming number of red warning signals coming from various parts of their network. Within seconds, the network warnings were rapidly spreading from one computer-operated switching centre to another. For nine hours, managers and engineers raced to bring the network back up to speed and stabilize it; during that time almost 50% of the calls placed through AT&T failed to go through, until at 11:30 PM network loads were low enough to allow the system to stabilize.
AT&T alone lost more than $60 million in unconnected calls. Of course, there were many additional consequences that are difficult to measure, such as business that could not be done because it relied on network connectivity.
AT&T's long-distance network was a model of reliability and strength: in 1990 its long-distance service carried over 70% of the US long-distance traffic.
The backbone of this massive network was a system of 114 computer-operated electronic switches (4ESS) distributed across the United States. These switches, each capable of handling up to 700,000 calls an hour, were linked via a cascading network known as Common Channel Signalling System No. 7 (SS7). When a telephone call was received by the network from a local exchange, the switch would assess 14 different possible routes to complete the call. At the same time, it passed the telephone number to a parallel signalling network that checked the alternate routes to determine if the switch at the other end could deliver the call to its local company. If the destination switch was busy, the original switch sent the caller a busy signal and released the line. If the switch was available, a signal-network computer made a reservation at the destination switch and ordered it to pass the call, after the switches had checked that the connection was good. The entire process took only four to six seconds.
The day the bug popped up, a team of 100 frantically searching telephone technicians identified the problem, which began in New York City. The New York switch had performed a routine self-test that indicated it was nearing its load limits. As standard procedure, the switch performed a four-second maintenance reset and sent a message over the signalling network that it would take no more calls until further notice. After the reset, the New York switch began to distribute the signals that had backed up while it was off-line. Across the country, another switch received a message that a call from New York was on its way, and began to update its records to show the New York switch back online. A second message from the New York switch then arrived, less than ten milliseconds after the first. Because the first message had not yet been handled, the second message should have been saved for later. Instead, a software defect caused the second message to be written over crucial communications information. Software in the receiving switch detected the overwrite and immediately activated a backup link while it reset itself, but another pair of closely timed messages triggered the same response in the backup processor, causing it to shut down as well. When the second switch recovered, it began to route its backlogged calls and propagated the cycle of closely timed messages and shutdowns throughout the network. The problem repeated itself throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system.
The cause of the problem had been introduced months before. In early December, technicians had upgraded the software to speed up the processing of certain types of messages. Although the upgraded code had been rigorously tested, a one-line bug was inadvertently added to the recovery software of each of the 114 switches in the network. The defect was in a C program that featured a break statement located within an if clause that was nested within a switch clause. In pseudo-code, the program read as follows:
```
 1  while (ring receive buffer not empty and side buffer not empty) DO
 2      Initialize pointer to first message in side buffer or ring receive buffer
 3      get copy of buffer
 4      switch (message) {
 5      case (incoming_message):
 6          if (sending switch is out of service) DO {
 7              if (ring write buffer is empty) DO
 8                  send "in service" to status map
 9              else
10                  break }    // END IF
11          process incoming message, set up pointers to optional parameters
12          break }            // END SWITCH
13      do optional parameter work
```
When the destination switch received the second of the two closely timed messages while it was still busy with the first (buffer not empty, line 7), the program should have dropped out of the if clause (line 7), processed the incoming message, and set up the pointers to the database (line 11). Instead, because of the break statement in the else clause (line 10), the program dropped out of the case statement entirely and began doing optional parameter work, which overwrote the data (line 13). Error correction software detected the overwrite and shut the switch down so it could reset. Because every switch contained the same software, the resets cascaded across the network, incapacitating the system.
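For illustration only, here is a minimal, compilable C sketch of what the intended control flow could have looked like. The identifiers are hypothetical stand-ins, not AT&T's actual code; the point is that the inner break must not skip the message processing step.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of the intended control flow, not AT&T's actual code.
   The 1990 defect: a "break" inside the inner if/else exited the whole switch,
   so the incoming message was never processed and the "optional parameter
   work" below overwrote live data. */

/* Stubs standing in for the real switch-maintenance routines. */
static bool sending_switch_out_of_service(void) { return true; }
static bool ring_write_buffer_empty(void)       { return false; }
static void send_in_service_to_status_map(void) { puts("status map updated"); }
static void process_incoming_message(void)      { puts("message processed"); }
static void do_optional_parameter_work(void)    { puts("optional parameter work"); }

enum message_type { INCOMING_MESSAGE, OTHER_MESSAGE };

static void handle(enum message_type message)
{
    switch (message) {
    case INCOMING_MESSAGE:
        if (sending_switch_out_of_service()) {
            if (ring_write_buffer_empty())
                send_in_service_to_status_map();
            /* No "break" here: whether or not the write buffer is empty,
               execution must continue and process the incoming message. */
        }
        process_incoming_message();   /* line 11 in the pseudo-code above */
        break;                        /* leave the switch only after processing */
    default:
        break;
    }
    do_optional_parameter_work();     /* line 13: safe only after processing */
}

int main(void)
{
    handle(INCOMING_MESSAGE);
    return 0;
}
```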
Unfortunately, it is not difficult for a simple software error to remain undetected and later bring down even the most reliable systems. The software update loaded in the 4ESSs had already passed through layers of testing, and the defect had remained unnoticed through the busy Christmas season. AT&T was fanatical about its reliability. The entire network was designed so that no single switch could bring down the system. The software contained self-healing features that isolated defective switches. The network used a system of "paranoid democracy," where switches and other modules constantly monitored each other to determine if they were "sane" or "crazy." Sadly, the January 1990 incident showed the possibility of all the modules going "crazy" at once, how bugs in self-healing software can bring down healthy systems, and the difficulty of detecting obscure load- and time-dependent defects in software.
But we could think that this bug occurred a long time ago and that nowadays we have more advanced technologies, methodologies, training systems and developers.
Is this really true? Only partially. It is true that software development has evolved a lot, but the type of problems that are solved via software has also evolved: every day we try to solve more, and more complex, problems via software.
The term Software Crisis was coined by the US Department of Defence years ago to describe the fact that the complexity of the problems addressed by software has outpaced the improvements in the software creation process, as shown graphically in the figure.
"Few fields have so large a gap between best current practice and average current practice."
Department of Defence
In other words, the software creation process has evolved very little while the problems software is solving have become far more complex.
"We have repeatedly reported on cost rising by millions of dollars, schedule delays, of not months but years, and multi-billion-dollar systems that don't perform as envisioned. The understanding of software as a product and of software development as a process is not keeping pace with the growing complexity and software dependence of existing and emerging mission-critical systems."
Government Accounting Office
Additionally, as depicted in the figure, the need for software developers has increased exponentially, because more software is needed: software is used in nearly every product with a minimum of complexity. Whereas the need for developers has increased exponentially, the availability of developers has unfortunately not grown at the same pace, i.e. there are fewer developers than needed. Due to that, people without the right skills have started developing software, with the belief that developing software is an easy task that nearly everybody can do. Developing software with people who are not properly trained or who lack the right skills inherently leads to bad quality software.
Mortenson, a construction contractor, purchased software from Timberline Software Corporation, which Timberline installed on Mortenson's computers. Mortenson, relying on the software, placed a bid which was $1.95 million too low because of a bug in the software of which Timberline was aware. The State of Washington Supreme Court ruled in favour of Timberline Software. However, a simple bug in the software led to multiple problems for both companies. In US warranty law, Article 2 of the Uniform Commercial Code includes the "Uniform Computer Information Transactions Act" (UCITA), which allows software manufacturers to:
In practice, that act means that software distributors can limit their liability through appropriate clauses in their contracts. For instance, the disclaimer of warranties of a Microsoft product is shown below.
DISCLAIMER OF WARRANTIES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, MICROSOFT AND ITS SUPPLIERS PROVIDE TO YOU THE SOFTWARE COMPONENT, AND ANY (IF ANY) SUPPORT SERVICES RELATED TO THE SOFTWARE COMPONENT ("SUPPORT SERVICES") AS IS AND WITH ALL FAULTS; AND MICROSOFT AND ITS SUPPLIERS HEREBY DISCLAIM WITH RESPECT TO THE SOFTWARE COMPONENT AND SUPPORT SERVICES ALL WARRANTIES AND CONDITIONS, WHETHER EXPRESS, IMPLIED OR STATUTORY, INCLUDING, BUT NOT LIMITED TO, ANY (IF ANY) WARRANTIES OR CONDITIONS OF OR RELATED TO: TITLE, NON- INFRINGEMENT, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, LACK OF VIRUSES, ACCURACY OR COMPLETENESS OF RESPONSES, RESULTS, LACK OF NEGLIGENCE OR LACK OF WORKMANLIKE EFFORT, QUIET ENJOYMENT, QUIET POSSESSION, AND CORRESPONDENCE TO DESCRIPTION. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE COMPONENT AND ANY SUPPORT SERVICES REMAINS WITH YOU.
Although the law overprotects software developers and distributors, and these disclaimers may prevent legal problems, they are just a way to avoid the legal consequences of bad quality software; they do not solve the real problem, which is what affects and frustrates end users.
Many people have tried to define what Software Quality means. However, it is not an easy task. Quality in general (not only in software) is such a subjective topic that trying to define it formally is extremely challenging.
There is a very interesting book called "Zen and the Art of Motorcycle Maintenance" [[ZEN-AND-THE-ART-OF-MOTORCYCLE-MAINTENANCE]] in which the narrator talks about the process of creative writing, and especially about quality. The quality of a written text is difficult to define. If you ask people to rank essays (or programs) from best to worst, it is very likely they will reach a consensus (they have an intuitive understanding that one essay has more quality than another), but it is much more difficult to identify the parts of the essay that give it that quality.
In Zen and the Art of Motorcycle Maintenance, Pirsig (the author) explores the meaning and concept of quality, a term he deems to be undefinable. Pirsig's thesis is that to truly experience quality one must both embrace and apply it as best fits the requirements of the situation. According to Pirsig, such an approach would avoid a great deal of frustration and dissatisfaction common to modern life.
Let's think about another example of how the situation determines the quality. For instance, a master chef has prepared an exquisite meal and invited a group of friends to share it at her restaurant on a lovely summer evening. Unfortunately the air conditioning isn't working at the restaurant, the waiters are surly, and two of the friends have had a nasty argument on the way to the restaurant that dominates the dinner conversation. The meal itself is of the highest quality but the experiences of the diners are not.
You could think that writing code is very different from writing an essay, but that is not the case. Usually, when you have a look at a piece of code it is easy for you to determine whether you like it or not, but it is quite complicated to assess why.
Software quality may be defined as conformance to explicitly stated functional and performance requirements, explicitly documented development standards and implicit characteristics that are expected of all professionally developed software.
This definition emphasizes three points:
For the first item, explicit software requirements, it is relatively easy to check conformance objectively. For the second one, it is more complicated and depends on how well documented those standards are. For the implicit characteristics expected, it is even tougher, as measuring conformance to something that is implicit is, by definition, impossible.
Those "implicit" requirements mentioned in the formal definition are a hint to indicate that there is something more about Software that goes beyond the explicit requirements. At the end of the day, software is going to be used by people, which do not care about the requirements but about their expectations. Hence, the need to look for another point of view.
"A product's quality is a function of how much it changes the world for the better." [[MANAGEMENT-VS-QUALITY]] or "Quality is value to some person" [[QUALITY-SOFTWARE-MANAGEMENT]]. Both definitions stress that the quality may be subjective. I.e. different people are going to perceive different quality in the same software. The software developers should also think about end users and asking themselves questions such as "How are users going to use the software?".
In order to provide a more complete picture, IEEE standard 610.12-1990 combines both views in its definition of quality:
Software quality is (1) the degree to which a system, component, or process meets specified requirements, and (2) the degree to which a system, component, or process meets customer or user needs or expectations.
There is another dimension of Software Quality that depends on whether we focus on the part of the software that is exposed to the users or on the part that is not.
External Quality is the fitness for purpose of the software, i.e. does the software do what it is supposed to do? The typical way to measure external quality is through functional tests and bug measurements.
Usually this is related to the conformance requirements that affect end-users (formal definition) as well as to meeting the end-user expectations (human point of view).
Some of the properties that determine the external quality of software are:
Internal Quality is everything about the software that is never seen directly by the end-user: the implementation, which the customer never sees. Internal quality can be measured through conformance requirements (focused not on end-users but on the software structure), software analysis, and adherence to development standards or best practices.
If it is not visible to the end-user, and our target is to make customers happy, we could ask ourselves whether Internal Quality is something we should pay attention to.
Internal quality is related to the design of the software and is purely in the interest of development. If internal quality starts falling, the system will be less amenable to change in the future. Because of that, code reviews, refactoring and testing are essential; otherwise the internal quality will slip.
An interesting analogy between debt and bad code design was developed by Ward Cunningham [[DEBT-ANALOGY]]. Sometimes companies need to get credit from banks in order to be able to invest; however, it is also critical to understand that it is impossible to ask for credit continuously, as paying the interest will kill the company financially. The same applies to software: sometimes it is good to take on some technical debt to achieve a goal, for instance meeting a critical milestone to reach users before our competitors, but it is important to understand that taking on technical debt endlessly will kill the project, as it will make the product unmaintainable.
Sometimes, after achieving the target external quality, we need to refactor our code to improve the internal quality. Software quality is sometimes the art of continuous refactoring.
Let's go back to the analogy of writing an essay or a paper, in that case most people write out the first draft as a long brain-dump saying everything that should be said. After that, the draft is constantly changed (refactored) until it is a cohesive piece of work.
When developing software (for instance in University assignments :-D) the first draft is often finished when it meets the general requirements of the task. So, after that, there is an immediate need to refactor the work into a better state without breaking the external quality. Maybe writing software is also kind of an art?
This is true for projects of any size, but the danger of not refactoring your code is bigger on a larger project, where poor quality code can cost you days in debugging and rework.
Some of the properties that enable the production of software with good internal quality are:
External quality is sometimes described as "doing the right things", as opposed to "doing things right", which is what internal quality is about.
Usually, problems with the external quality characteristics (correctness, reliability...) are simply visible symptoms of software problems that are in turn related to internal quality attributes: program structure, complexity, coupling, testability, reusability, readability, maintainability... Sometimes, when the internal quality is bad, the external quality targets can still be met for a short period of time, but in the longer term the external quality will be affected.
An excellent analogy is the Quality Iceberg created by Steve McConnell (see the figure).
ISO 9126 defines Software Quality as the totality of characteristics of an entity that bears on its ability to satisfy stated and implied needs.
It recognizes that quality is not only determined by the software itself but also by the process used for software development and the use made of the software. Hence, the following entities are defined:
ISO identified 6 characteristics of the software quality that are sub-divided into sub-characteristics:
Quality in use is defined by ISO as "the extent to which a product used by specified users meets their needs to achieve specified goals with effectiveness, productivity, and satisfaction in specified contexts of use". The quality in use hence depends on the context in which the product is used and its intrinsic quality.
There is no single definition of quality. However, the importance of Software Quality is continuously increasing. The concepts of external and internal quality are commonly used across the software industry, but despite that, the properties used to measure quality diverge across different methodologies, standards and companies.
Despite the availability of different quality definitions, characteristics and entities, a common understanding is that high quality is usually linked to products with a low number of defects. Therefore, it is assumed that a quality problem is due to the impact of a defect.
But in order to identify what high quality is, we first need to define what a defect is. In general, there are three concepts used in software quality to refer to defects:
Peter is driving his car towards Oxford. While he is driving, the road splits in two directions: left to Oxford, right to Cambridge. By mistake, Peter takes the road to Cambridge. That is a fault committed by Peter. Suddenly, Peter is in an error situation or state: he is heading to Cambridge and not to Oxford. If Peter goes on and arrives in Cambridge, that is a failure: Peter was planning to get to Oxford but he has arrived in Cambridge instead. If Peter realizes the error while he is driving towards Cambridge, returns to the junction and takes the correct road to Oxford, no failure happens, as Peter recovers from the error condition.
```java
public static int numZero(int[] x) {
    // effects: if x == null throw NullPointerException
    // else return the number of occurrences of 0 in x
    int count = 0;
    for (int i = 1; i < x.length; i++) {
        if (x[i] == 0) {
            count++;
        }
    }
    return count;
}
```
The fault in the code above is that it starts looking for zeroes at index 1 instead of index 0. For example, numZero([2, 7, 0]) correctly evaluates to 1, while numZero([0, 7, 2]) incorrectly evaluates to 0. In both cases the fault is present and is executed, and the code is in an error situation, but only in the second case is there a failure: the result is different from the expected one. In the first case, the error condition (the loop starts at index 1) does not propagate to the output.
Some early conclusions can be already identified:
Software Quality Assurance (SQA) is the set of methods used to improve internal and external qualities. SQA aims at preventing, identifying and removing defects throughout the development cycle as early as possible, as such reducing test and maintenance costs.
SQA consists of a systematic, planned set of actions necessary to provide adequate confidence that the software development process or the maintenance process of a software system product conforms to established functional and technical requirements as well as to the managerial requirements of keeping the schedule and operating within the budgetary confines.
The ultimate target of the SQA activities is that few, if any, defects remain in the software system when it is delivered to its customers or released to the market. As it is virtually impossible to remove all defects, another aim of QA is to minimize the disruptions and damages caused by these remaining defects.
The SQA methodology also depends on the software development methodology used, as they are inherently coupled. For instance, different software development models will focus the test effort at different points in the development process. Newer development models, such as Agile, often employ test-driven development and place an increased portion of the testing in the hands of the developer, before it reaches a formal team of testers. In a more traditional model, most of the test execution occurs after the requirements have been defined and the coding process has been completed.
An example of an SQA methodology is available at [[IEEE-QA-TEMPLATE]].
SQA activities are not carried out only by the Software Quality group; the software engineering group is responsible for putting in place the SQA methodology defined, which may include different activities such as testing, inspections and reviews.
The activities that are carried out as part of the SQA process can be divided into three different categories.
The following sections explain in detail these SQA activities.
The main goal of these activities is reducing the chance for defect injections and the subsequent cost to deal with these injected defects.
Most of the defect prevention activities assume that there are known error sources or missing/incorrect actions that result in fault injections, as follows:
People are the most important factor determining the quality and, ultimately, the success or failure of most software projects. Hence, it is important that people involved in software planning, design and development have the right capabilities for doing their jobs. The education and training effort for error source elimination should focus on the following areas:
Formal methods provide a way to eliminate certain error sources and to verify the absence of related faults. Formal development methods, or formal methods in short, include formal specification and formal verification.
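As an illustration only, the sketch below approximates a formal specification with runtime-checked pre/postconditions in C; a genuine formal method would express this contract in a specification language and discharge it with a prover or model checker rather than at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Illustration only: a lightweight approximation of a formal specification
   using runtime-checked pre/postconditions. A genuine formal method would
   state the contract in a specification language and verify it with a prover
   rather than check it at run time. */

/* Specification: requires x != NULL and len >= 0;
   ensures 0 <= result <= len, where result is the number of zeroes in x. */
static int count_zeroes(const int *x, int len)
{
    assert(x != NULL && len >= 0);       /* precondition */
    int count = 0;
    for (int i = 0; i < len; i++)
        if (x[i] == 0)
            count++;
    assert(count >= 0 && count <= len);  /* (partial) postcondition */
    return count;
}

int main(void)
{
    int data[] = {0, 7, 2, 0};
    printf("%d\n", count_zeroes(data, 4));  /* prints 2 */
    return 0;
}
```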
Even if the best software developers in the world are involved in a software project, and even if they follow the formal methods described in the previous section, some faults will be injected in the software code. Due to that, defect prevention needs to be complemented with other techniques focused on removing as many of the injected faults as possible under project constraints.
Fault distribution is highly uneven for most software products, regardless of their size. Much empirical evidence has accumulated over the years to support the so-called 80:20 rule, which states that 20% of the software components are responsible for 80% of the problems (Pareto Law). There is a great need for risk identification techniques to detect the areas in which the fault removal activities should be focused.
There are two key activities that deal with fault removal: Code Inspection and Testing.
Software inspections were first introduced by Michael E. Fagan in the 1970s, when he was a software development manager at IBM. Inspections are a means of verifying intellectual products by manually examining the developing product, a piece at a time, in small groups of peers, to ensure that it is correct and conforms to product specifications and requirements. Inspections may be done on the software code itself and also on other related items such as design or requirements documents.
Code inspections should check for technical accuracy and completeness of the code, verify that it implements the planned design, and ensure good coding practices and standards are used. Code inspections should be done after the code has been compiled and all syntax errors removed, but before it has been unit tested.
There are different kinds of inspections depending on factors such as formality (formal vs. informal), the size of the team (peer review, team review), and whether it is guided or not. The type of inspection to be done depends on the software to be reviewed, the team involved and the target of the review.
Regardless of the inspection type used, there are clear benefits when inspections are used. For instance, according to Bell-Northern Research, the cost of detecting a defect is much lower with inspections (1 hour per defect) than with testing (2-4 hours per defect).
More information about inspections can be found at [[INSPECTIONS-AND-REVIEWS]] and [[TRUTHS-PEER-REVIEWS]] and in the last chapter.
Testing is the execution of software and the observation of the program behaviour and outcome. As with software inspections, there are different kinds of testing, usually applied at different phases of the software development process.
Some of the most typical testing types are:
A concept tightly related to testing (although also applicable in other areas such as reviews) is the handling of defects. In particular, it is very important that the defects detected are properly recorded (defect logging) with all the relevant information, as in many situations finding the fault behind a reported failure is not trivial. It is also very important that the issues detected are monitored so that everybody knows the status of every defect after the initial discovery (defect tracking).
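As a hypothetical illustration (field and state names are invented, not taken from any particular tool), the following C sketch shows the kind of information a logged defect record might carry and the states it typically moves through while being tracked:

```c
#include <stdio.h>

/* Illustration only: the minimum information a defect log entry might carry and
   the states a tracked defect typically moves through. Field and state names
   are hypothetical, not taken from any specific tool. */
typedef enum { REPORTED, CONFIRMED, IN_PROGRESS, FIXED, VERIFIED, CLOSED } DefectStatus;

typedef struct {
    int          id;
    const char  *summary;
    const char  *steps_to_reproduce;
    const char  *affected_version;
    int          severity;      /* e.g. 1 = critical ... 4 = cosmetic */
    DefectStatus status;        /* updated as the defect is tracked */
} DefectRecord;

int main(void)
{
    DefectRecord d = {1042, "Crash when saving an empty file",
                      "1. Open the app  2. File > Save with no content",
                      "2.3.1", 1, CONFIRMED};
    printf("Defect #%d [%s], severity %d, status %d\n",
           d.id, d.summary, d.severity, (int)d.status);
    return 0;
}
```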
Defect reduction activities can only reduce the number of faults to a fairly low level, not completely eliminate them. In many situations, the number of possible situations is so big that it is impossible to test all of them, especially those linked to rare conditions or unusual dynamic scenarios.
Depending on the purpose of the software, these remaining faults, and the failure risk due to them, may still be unacceptable, so some additional QA techniques are needed:
For instance, the software used in the flight control systems is one example of software with very extreme requirements about failures. The report [[CHALLENGES-FAULT-TOLERANT-SYSTEMS]] provides more details about the challenges that this kind of systems pose to software developers.
Software fault tolerance ideas originate from fault tolerance designs in traditional hardware systems that require higher levels of reliability, availability, or dependability.
All fault-tolerant systems are based on the provision of useful redundancy that allows switching between components when one of them fails (due to software or hardware faults). That implies that there have to be some extra components, which ideally should have a different design to avoid the same error happening twice. Based on how those redundant components are structured and used (e.g. when to switch from one to another), there are different kinds of systems:
NVP (N-version programming):
This technique uses parallel redundancy, where N copies, each a different version of code fulfilling the same functionality, run in parallel with the same inputs. When all of those N copies have completed the operation, an adjudication process (decision unit) takes place to determine the output (based on a more or less complex vote).
Some key characteristics of this scheme, which is depicted in the figure, are:
Obviously, a wide range of variants of these systems have been proposed based on multiple combinations of them [[COST-EFFECTIVE-FAULT-TOLERANCE]], and multiple comparisons of their performance are also available [[PERFORMANCE-RB-NVP-SCOP]].
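To make the NVP adjudication step concrete, here is a minimal C sketch with three stand-in "versions" and a simple majority vote; a real NVP system would run independently developed implementations, usually in parallel:

```c
#include <stdio.h>

/* Minimal sketch of the NVP adjudication idea. The three "versions" below are
   stand-ins for independently developed implementations of the same
   specification; in a real NVP system they would run in parallel. */
static int version_a(int input) { return input * 2; }      /* implementation 1 */
static int version_b(int input) { return input << 1; }     /* implementation 2 */
static int version_c(int input) { return input + input; }  /* implementation 3 */

/* Decision unit: majority vote. Returns -1 when no two versions agree. */
static int adjudicate(int r1, int r2, int r3)
{
    if (r1 == r2 || r1 == r3) return r1;
    if (r2 == r3) return r2;
    return -1;  /* no agreement: raise an alarm or switch to a safe state */
}

int main(void)
{
    int input = 21;
    int result = adjudicate(version_a(input), version_b(input), version_c(input));
    printf("adjudicated result: %d\n", result);
    return 0;
}
```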
Exercise 3: Recovery Blocks vs. N-Version. What do you think are the key advantages and disadvantages of the two fault tolerance techniques described (Recovery Blocks & N-Version)?
There is software that is used in safety-critical systems, where a failure can have severe consequences. In those situations it is very important to avoid potential accidents or, at least, to contain their damage.
Various specific techniques are used for this kind of systems, most of them based on the analysis of the potential hazards linked to the failures:
Notice that both hazard control and damage control above are post-failure activities that attempt to "contain" the failures so that they will not lead to accidents or the accident damage can be controlled or minimized. All these techniques are usually very expensive and process/technology intensive, hence they should be only applied when safety matters and deal with rare conditions related to accidents.
Whereas Quality Assurance defines a set of methods to improve Software Quality, it does not define aspects that are key in order to ensure good quality software is delivered such as:
In order to address these questions, the QA activities should be considered not in an isolated manner, but as part of a full engineering problem. Software Quality Engineering is the discipline that defines the processes to ensure high quality products. QA activities are only a part of that process, which requires further activities such as Quality Planning, Goal Setting or Quality Assessment. The figure provides an overview of a typical SQE process:
Before doing any QA activity, it is important to consider aspects such as the target quality, the most appropriate QA activities to be done and when they should be done, and how the quality is going to be measured. All these activities are usually called Pre-QA or Quality Planning activities.
The first activity that should be done in SQE is defining the specific quality goals for the software to be delivered. In order to do so, it is important to understand the expectations of the software end-user/customer. Obviously, it is also key to recognize that the budget is limited and that the quality target should be financially feasible. The following activities are key to identifying the target quality of the software:
Once the quality goals are clear, the QA strategy should be defined. Two key decisions should be made during this stage:
These activities have been described in section 1.4.2 and basically consist of executing the QA activities planned and handling the defects discovered as a result of them.
These activities consist of measuring the quality of the software (after the QA activities), assessing the quality of the software product, and defining the decisions and actions needed to improve its quality.
All these activities are usually carried out after normal QA activities have started but as part of these "normal" QA activities. Their goal is to provide feedback so that decisions can be made and improvements can be suggested. The key activities include:
The overall framework for quality improvement is called QIP, and it includes three interconnected steps:
W. Edwards Deming in the 1950's proposed that business processes should be analysed and measured to identify sources of variations that cause products to deviate from customer requirements. He recommended that business processes be placed in a continuous feedback loop so that managers can identify and change the parts of the process that need improvements. As a teacher, Deming created a (rather oversimplified) diagram to illustrate this continuous process, commonly known as the PDCA cycle for Plan, Do, Check, Act:
Deming's PDCA cycle can be illustrated as in the figure:
By going around the PDCA circle, the working methods are continuously improved, as well as the results obtained. However, it is important to take care to avoid a situation called the "spiral of death", which happens when an organization goes around and around the quadrants, never actually bringing a system into production.
The quality engineering process cannot be considered in an isolated manner, but as part of the overall software engineering process. For instance, most of the SQE activities should be included as part of the software development activities (see the figure):
However, it should be considered that SQE activities have different timing requirements, activities and focus. For instance, the figure represents the typical effort spent on the different quality activities over the software development timeline.
Focusing on the QA activities, in a typical waterfall development model, the following provides an estimate of the key QA activities done during each of the project phases:
Another important aspect to be considered is that some of the QA measures cannot be taken until it is already too late. For example, for safety-critical systems, post-accident measurements provide a direct measure of safety, but due to the damage linked to those accidents, they should be avoided by all means. In order to take early measures, appropriate models that link quality measures taken during the development process with the end-product quality are needed. Last but not least, it should be stressed that the cost of fixing problems grows the later they are fixed, because a hidden problem may lead to other related problems, and the longer it stays in the system, the more difficult it is to discover.
In section 1.1, some of the implications of bad quality software were introduced. The cost of poor quality (COPQ) is not the only cost that Software Quality Engineering should take into account. The cost of good quality (COGQ), which is linked to SQA activities (e.g. testing or code inspections), should not be underestimated and must be considered when the total quality cost is assessed.
As in the case of external and internal quality, the different costs linked to quality have been represented by some authors as an iceberg, in which some of the costs are easy to identify (e.g. testing costs, customer returns...) while others are not always taken into account (e.g. unused capacity, excessive IT costs...). In [[COST-OF-QUALITY]] there is a detailed analysis of this approach for identifying quality costs.
"Quality metrics let you know when to laugh and when to cry", Tom Gilb
"If you can't measure it, you can0t manage it", Deming
"Count what is countable, measure what is measurable. What is not measurable, make measurable", Galileo
These are just some sample sentences about the importance of measuring in general. Obviously, the capability of quantifying characteristics of a product is extremely helpful to manage that product. However, it is also important to stress that it is essential to understand the attributes that are being measured, so the metric does not end up being just a number but a proper indicator with a very clear meaning. Additionally, we studied in Unit 1 that Quality is an extremely subjective thing, so we should not assume that every aspect related to product quality can be quantified, or at least quantified easily. Albert Einstein put it very nicely in his famous sentence: "Not everything that can be counted counts, and not everything that counts can be counted."
Also, we should bear in mind that the act of measuring a software attribute is not intended to improve that metric but, first of all, to understand its impact and its validity as an indicator of some software characteristic. A typical mistake is trying to improve every metric you are calculating in your projects. Doing this for the sake of it is a mistake, as Goodhart explained in his law: "When a measure becomes a target, it ceases to be a good measure". Imagine you are working in a development team and you are told that you need to increase the number of defects found during the development cycles. What is likely to happen is that the team is going to start reporting anything they suspect could be a bug, no matter how small, difficult to detect, or difficult to reproduce it is. At the end of the day, the team has been asked to find more bugs, and that is what they are going to do!
What does all this mean? It means that we should try to get metrics, but more importantly, we need to understand how those metrics affect the quality of the product, and how they could become indicators of certain software attributes.
If we do that, we can use those metrics to improve our process and increase product quality over releases, predict potential issues (e.g. last time we got those indicators, the product was a disaster in the field), re-use successful experiences (if a product worked extremely well, check its metrics and see what made it different), etc. In summary, we should use metrics to understand first and improve afterwards.
But what is the relationship between the terms measurement, metric and indicator?
Action | Concept | Examples |
---|---|---|
Collect (Data) | Measurement | 120 defects detected during 6 months by 2 engineers |
Calculate (Metrics) | Metric | 1 defect/KLOC; 3.3 KLOC per engineer-month |
Evaluate (Metrics) | Indicator | Defect density as an indicator of product quality |
There are three main kind of metrics related to software:
Examples of measurements:
* 120 defects detected during 6 months by 2 engineers
* Defects detected every month: 10, 10, 20, 20, 25, 35
* Defects remaining in the final product: 40
* Size of the product: 40,000 lines of code
Metrics and Indicator Examples:
* Process Metric: Defect Arrival Pattern per month: 10, 10, 20, 20, 25, 35 -> Indicator of Maturity
* Project Metric: 40 KLOC / 2 engineers / 6 months ≈ 3.3 KLOC per eng-month -> Indicator of Productivity
* Product Metric: 40 defects / 40 KLOC = 1 defect / KLOC -> Indicator of Quality
Software quality metrics are a subset of software metrics that focus on the quality aspects of the product, process, and project. Software quality metrics can be divided further into:
In general, the quality of a developed product (end-product metrics) is influenced by the quality of the production process (in-process metrics). Identifying the link between those two types of metrics is essential for software development, as the end-product metrics, most of the time, can only be discovered when it is too late (i.e. the product is already in the market). However, establishing the link between both types of metrics is hard and complex, as most of the time the relationship is poorly understood.
The model linking process and product for manufactured goods is in most cases simple. However, for software, this model is in general more complex, because the influence of the humans involved in software development is much higher than in goods manufacturing, and the degree of automation is smaller in software development than in manufacturing.
As engineers, our target is:
The ultimate goal of software quality engineering is to investigate the relationships among in-process metrics, project characteristics, and end-product quality, and based on these findings to engineer improvements in both process and product quality.
Software reliability is a measure of how often the software encounters an error that leads to a failure. From a formal point of view, reliability can be defined as the probability of not failing during a specified length of time:
R(n) (where n is the number of time units)
The probability of failing in a specified length of time is 1 minus the reliability for that length of time, and it is usually denoted by a capital F:
F(n) = 1 - R(n)
If time is measured in days, R(1) is the probability of the software system having zero failures during one day (i.e. the probability of not failing in 1 day)
A couple of metrics related to software reliability are the "Error Rate" and the "Mean Time To Failure" (MTTF). The MTTF can be defined as the average time that passes between two system failures. The Error Rate is the average number of failures suffered by the system during a given amount of time. Both metrics are related by the following formula: Error Rate = 1 / MTTF
The relationship between the error rate and the reliability depends on the statistical distribution of the errors, not only on the error rate.
For instance, the following table shows the errors that occurred per day in two different systems during one week.
Defects per Day:

 | DAY 1 | DAY 2 | DAY 3 | DAY 4 | DAY 5 | DAY 6 | DAY 7 |
---|---|---|---|---|---|---|---|
Project A | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Project B | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
It can be seen that both systems suffered the same number of errors during the week (28) and hence the error rate for both systems is the same: 28/7 = 4 errors/day. However, the reliability of the two systems for the first day is very different: Project A suffered only 1 error on day 1 while Project B suffered 7.
Unless detailed statistics/models are available, the best estimate of the short-term future behaviour is the current behaviour. For instance, if a system suffers 24 failures during one day, the best estimate for the next day is that 24 failures will occur (24 errors/day), which corresponds to a 1-hour MTTF. That means that, by default, we can assume that most system failures follow an exponential distribution, and hence the following formula can be used to calculate their reliability: R(t) = e^(-λt)
Where λ is the error rate and t is the amount of time for which the system reliability is calculated. A key property of exponential distributions is that the error rate is constant and hence does not change over time. The figures below show the probability density and cumulative distribution functions of exponential curves with different values of λ.
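To make the exponential model concrete, here is a minimal C sketch that evaluates R(t) = e^(-λt) for the 24-failures-per-day example mentioned above (the numbers are illustrative):

```c
#include <math.h>
#include <stdio.h>

/* Minimal sketch: reliability under the exponential model,
   R(t) = exp(-lambda * t), where lambda is the error rate expressed in
   failures per time unit and t is the mission time. */
static double reliability_exponential(double lambda, double t)
{
    return exp(-lambda * t);
}

int main(void)
{
    double lambda = 1.0;  /* 24 failures per day = 1 failure per hour -> 1-hour MTTF */
    printf("R(1 hour)   = %f\n", reliability_exponential(lambda, 1.0));   /* ~0.368 */
    printf("R(24 hours) = %e\n", reliability_exponential(lambda, 24.0));  /* practically 0 */
    return 0;
}
```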
However, this notion of a system having a constant error rate that does not change over time is not very common in the real world. For instance, in hardware components the error rate evolves over time in different ways:
You can find a lot of information about reliability and how maths is used for calculating it at [[RELIABILITY-MATHS]]
Although the exponential distribution may be a good compromise that can be applied to any software system, there are other distributions that may describe more accurately the idea of a non-constant error rate. For instance, the Weibull distribution is frequently used in reliability analysis [[WEIBULL-BASICS]].
In a Weibull distribution, the error rate can change with time. The reliability function is: R(t) = e^(-(t/η)^β)
It depends on two parameters, η and β, that define the shape of the distribution function. For a fixed value of η, the failure rate can be constant (β = 1), decreasing (β < 1) or increasing (β > 1). Please note that by combining those three options we can describe the hardware component failure rate phases.
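A similar C sketch evaluates the Weibull reliability function for the three regimes of β (the values of η, β and t below are illustrative):

```c
#include <math.h>
#include <stdio.h>

/* Minimal sketch: Weibull reliability R(t) = exp(-(t/eta)^beta).
   beta < 1 models a decreasing failure rate, beta = 1 a constant one
   (the exponential case), and beta > 1 an increasing one. */
static double reliability_weibull(double t, double eta, double beta)
{
    return exp(-pow(t / eta, beta));
}

int main(void)
{
    double eta = 10.0;  /* characteristic life, in the same time unit as t */
    printf("beta = 0.5: R(5) = %.3f\n", reliability_weibull(5.0, eta, 0.5));
    printf("beta = 1.0: R(5) = %.3f\n", reliability_weibull(5.0, eta, 1.0));
    printf("beta = 3.0: R(5) = %.3f\n", reliability_weibull(5.0, eta, 3.0));
    return 0;
}
```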
If we take into account software upgrades, there are some interesting analyses of how a sawtooth pattern is observed [[SOFTWARE-RELIABILITY]]. Again, such a curve can be described by combining different Weibull distributions.
Defect Density is the number of confirmed defects detected in software/component during a defined period of development/operation divided by the size of the software/component.
The "defects" are usually counted as confirmed and agreed defects (not just reported). For instance, dropped defects are not counted.
The "period" or metrics time frame, might be for one of the following:
The "opportunities for error" (OFE) or sofware "size" is measured in one of the following:
In the following chapters both ways of measuring OFE will be studied separately.
Counting the lines of code (LOC) is far more complex than it might initially seem. The major problem in counting lines of code comes from the ambiguity of the operational definition, the actual counting. In the early days of Assembler programming, in which one physical line was the same as one instruction, the LOC definition was clear. With the availability of high-level languages the one-to-one correspondence broke down. Differences between physical lines and instruction statements (or logical lines of code) and differences among languages contribute to the huge variations in counting LOCs. Even within the same language, the methods and algorithms used by different counting tools can cause significant differences in the final counts. Multiple variations were already described by Jones in 1986, such as:
For instance, the next example includes two approaches to coding the same functionality. As the functionality is the same, and it is written in the same manner, the opportunities for error should be the same; however, the lines of code differ. If we count all the lines (job control language, comments...), in the first case only one line of code is used whereas in the second case 5 lines of code are used.
```c
for (i=0; i<100; ++i) printf("I love compact coding"); /* what is the number of lines of code in this case? */

/* How many lines of code is this? */
for (i=0; i<100; ++i)
{
    printf("I am the most productive developer");
} /* end of for */
```
Some authors have considered LOC not only a less useful way to measure software size but also harmful to software economics and productivity. For instance, the paper by Capers Jones called "A Short History of Lines of Code (LOC) Metrics" [[LOC-HISTORY]] offers a very interesting historical view of the evolution of software programming languages and LOC metrics.
Regardless of the LOC measurements used, when a software product is released to the market for the first time, and when a certain way to measure lines of code is specified, it is relatively easy to state its quality level (projected or actual). However, when enhancements are made and subsequent versions of the product are released, the measurement is more complicated. In order to have good insight into the product quality, it is important to follow a two-fold approach:
The first measure may improve over releases due to aging and defect removal, but that improvement in the overall defect rate may hide problems in the development/quality process (e.g. the new code contains a higher defect density than the "old" code, which indicates a problem in the process). In order to be able to calculate the defect rate for the new and changed code, the following must be available:
These tasks are enabled by the practice of change flagging. Specifically, when a new function is added or an enhancement is made to an existing function, the new and changed lines of code are flagged. The change-flagging practice is also important to the developers who deal with problem determination and maintenance. When a defect is reported and the fault zone determined, the developer can determine in which function or enhancement, pertaining to what requirements, at what release origin, the defect was injected. The following is an example of how the overall defect rate and the defect rate for new code are measured at IBM, according to the book "Metrics and Models in Software Quality Engineering" by Stephen H. Kan.
In the first version of a software product, 30 defects were reported by end-users and the software size was 30 KLOC. After fixing all the discovered bugs, the team works on a new version that includes 10 new KLOC. End-users report 10 additional defects in this new version, all of which were injected in the new 10 KLOC.
The Defect Density of the first version was 1 defect/KLOC.
If we calculate the defect density of the second version in the same way, it would be: DD = 10/40 = 0.25 defects/KLOC. Comparing this value, we could conclude that the second version was much better than the first one.
But this could be misleading since, as stated above, the defects of the second version were all injected in the new 10 KLOC, not in the old code. We could calculate the same metric counting only the new lines of code. If we do so, the result is: DD = 10/10 = 1 defect/KLOC, which is the same as for the first release.
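The arithmetic above can be captured in a small C sketch, using the numbers from the example:

```c
#include <stdio.h>

/* Sketch reproducing the arithmetic of the release example above. */
static double defect_density(int defects, double kloc)
{
    return defects / kloc;
}

int main(void)
{
    /* Release 1: 30 defects, 30 KLOC */
    printf("Release 1:                        %.2f defects/KLOC\n", defect_density(30, 30.0));
    /* Release 2: 10 defects, 40 KLOC in total, of which 10 KLOC are new/changed */
    printf("Release 2 (overall):              %.2f defects/KLOC\n", defect_density(10, 40.0));
    printf("Release 2 (new and changed code): %.2f defects/KLOC\n", defect_density(10, 10.0));
    return 0;
}
```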
We could conclude that, for end-users, the second version is going to be a significant improvement, as the number of defects they are going to perceive is smaller both in absolute and relative terms. However, the team has done a similar job in terms of defects remaining after releasing the product (defect injection and detection).
It is important to think about how useful this metric is from two points of view:
From the customer's point of view, the defect rate is not as relevant as the total number of defects that might affect their business. Therefore, a good defect rate target should lead to a release-to-release reduction in the total number of defects, regardless of size; i.e. not only should the defect rate be reduced but also the total number of defects. If a new release is larger than its predecessors, then the defect rate goal for the new and changed code has to be significantly better than that of the previous release in order to reduce the total number of defects.
In the example above, from the initial release to the second release the defect rate didn't improve. However, customers experienced a 66% reduction [(30 - 10)/30] in the number of defects because the second release is smaller.
As explained in the previous chapter, measuring the opportunities for error through the lines of code has some problems. Counting lines of code is but one way to measure size. An alternative is using function points. In recent years the function point has been gaining acceptance in application development in terms of both productivity (e.g., function points per person-year) and quality (e.g., defects per function point).
A function can be defined as a collection of executable statements that performs a certain task, together with declarations of the formal parameters and local variables manipulated by those statements. The ultimate measure of software productivity is the number of functions a development team can produce given a certain amount of resources, regardless of the size of the software in lines of code. The defect rate metric, ideally, is indexed to the number of functions a software product provides. If the number of defects per unit of function is low, then the software should have better quality, even though the defects per KLOC value could be higher (for instance, when the functions were implemented with fewer lines of code). Although this approach seems very powerful and promising, from a practical point of view it is very difficult to use.
The function point metric was originated by Albrecht and his colleagues at IBM in the mid-1970s. The name can be a bit misleading, as the technique itself does not count functions. Instead, it tries to measure some aspects that determine the software complexity without taking into account the differences between programming languages and development styles that change the LOC metric. In order to do so, it takes into account five major components that comprise a software product:
The following figure provides a graphical example of how all these components work together and how they interact with the end-users.
Apart from being technology independent, this way of identifying the key software functions is very interesting because it is focused on the end-user point of view: most of the components are considered from the user's perspective (not the developers'), hence it works well with use cases.
The number of function points is obtained by the addition of the number of occurrences of those components (each of them weighted by a different factor) multiplied by an adjustment factor chosen based on the software characteristics:
FP = FC x VAF
Where:
In order to calculate the Function Points, every component is classified in three categories according to its complexity (low/medium/high). A different weight factor is assigned to every component type and category. The following weights are defined for every component and complexity:
When the number of components (classified by complexity) is available, given the previous weighting factors, the Function Counts (FCs) can be calculated with the following formula: FC = Σ(i=1..3) Σ(j=1..5) wij × xij
Where wij are the weighting factors and xij the number of occurrences of each component in the software; i denotes the complexity and j the component type. The following table shows graphically how this function can be calculated easily.
Type | Low Complexity | Mid Complexity | High Complexity | Total |
---|---|---|---|---|
EI | _ x 3 + | _ x 4 + | _ x 6 + | = |
EO | _ x 4 + | _ x 5 + | _ x 7 + | = |
EQ | _ x 3 + | _ x 4 + | _ x 6 + | = |
ILF | _ x 7 + | _ x 10 + | _ x 15 + | = |
EIF | _ x 5 + | _ x 7 + | _ x 10 + | = |
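Using the weights from the table, a minimal C sketch of the unadjusted function count calculation could look as follows; the component counts are hypothetical:

```c
#include <stdio.h>

/* Sketch: unadjusted function count FC = sum of (occurrences x weight) using
   the weights from the table above. Rows: EI, EO, EQ, ILF, EIF; columns: low,
   medium, high complexity. The component counts below are hypothetical. */
static const int weights[5][3] = {
    {3, 4, 6},    /* EI  */
    {4, 5, 7},    /* EO  */
    {3, 4, 6},    /* EQ  */
    {7, 10, 15},  /* ILF */
    {5, 7, 10},   /* EIF */
};

static int function_count(const int counts[5][3])
{
    int fc = 0;
    for (int j = 0; j < 5; j++)          /* component type */
        for (int i = 0; i < 3; i++)      /* complexity level */
            fc += counts[j][i] * weights[j][i];
    return fc;
}

int main(void)
{
    const int counts[5][3] = {
        {4, 0, 0},  /* 4 low-complexity external inputs */
        {0, 2, 0},  /* 2 medium-complexity external outputs */
        {1, 0, 0},  /* 1 low-complexity external inquiry */
        {0, 1, 0},  /* 1 medium-complexity internal logical file */
        {0, 0, 1},  /* 1 high-complexity external interface file */
    };
    printf("FC = %d\n", function_count(counts));  /* 12 + 10 + 3 + 10 + 10 = 45 */
    return 0;
}
```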
The complexity classification of each component is based on a set of standards that define complexity in terms of objective guidelines. For instance, for the external output component, if the number of data element types is 20 or more and the number of file types referenced is 2 or more, then complexity is high. If the number of data element types is 5 or fewer and the number of file types referenced is 2 or 3, then complexity is low. The following tables provide the standard categorization where:
RETs | 1-19 DETs | 20-50 DETs | 51+ DETs |
---|---|---|---|
1 | Low | Low | Medium |
2-5 | Low | Medium | High |
6+ | Medium | High | High |
FTRs | 1-4 DETs | 5-15 DETs | 16+ DETs |
---|---|---|---|
0-1 | Low | Low | Medium |
2 | Low | Medium | High |
3+ | Medium | High | High |
FTRs | 1-5 DETs | 6-19 DETs | 20+ DETs |
---|---|---|---|
0-1 | Low | Low | Medium |
2-3 | Low | Medium | High |
4+ | Medium | High | High |
In order to calculate the Value Adjustment Factor (VAF), 14 characteristics of the software system must be scored (in a scale from 0 to 5) in terms of their effect on the software. The list of characteristics is:
Once all these characteristics are assessed, their scores are summed and used, in the following formula, to arrive at the value adjustment factor (VAF): VAF = 0.65 + 0.01 × Σ(i=1..14) ci
Where ci is the score for general system characteristic i.
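Continuing the previous sketch, a minimal C example of the VAF calculation and the resulting adjusted function points (the 14 characteristic scores are hypothetical):

```c
#include <stdio.h>

/* Sketch: value adjustment factor VAF = 0.65 + 0.01 * sum(ci), and the
   adjusted function points FP = FC x VAF. The 14 scores are hypothetical. */
static double value_adjustment_factor(const int gsc[14])
{
    int total_degree_of_influence = 0;
    for (int i = 0; i < 14; i++)
        total_degree_of_influence += gsc[i];  /* each score ranges from 0 to 5 */
    return 0.65 + 0.01 * total_degree_of_influence;
}

int main(void)
{
    const int gsc[14] = {3, 2, 4, 3, 3, 2, 1, 0, 3, 2, 2, 1, 4, 3};
    double vaf = value_adjustment_factor(gsc);  /* 0.65 + 0.33 = 0.98 */
    double fc  = 45.0;                          /* unadjusted count from the previous sketch */
    printf("VAF = %.2f, FP = %.1f\n", vaf, fc * vaf);
    return 0;
}
```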
Over the years the function point metric has gained acceptance as a key productivity measure from a practical point of view. However, the meaning of the function point, the derivation algorithm and its rationale may need more research and more theoretical groundwork. Furthermore, function point counting can be time-consuming and expensive, and accurate counting requires certified function point specialists.
Another product quality metric widely used in the software industry measures the problems customers encounter when using the product.
For the defect density metric (section 1.2.1.2), the numerator was the number of valid defects. However, from the customers' standpoint, all problems they encounter while using the software product, not just the valid defects, are problems with the software. Software problems suffered by end-users that are not valid defects may be:
These so-called non-defect-oriented problems, together with the defect problems, constitute the total problem space of the software from the customers’ perspective.
The problems metric is usually expressed in terms of problems per user month (PUM): PUM = Total problems reported by customers during a period / Total number of license-months during that period
Where the total number of license-months is the number of months all the users have been using the software, and may be calculated by multiplying the number of installed licenses of the software by the number of months in the calculation period.
PUM is usually calculated for each month after the software is released to the market, and also as monthly averages by year. Note that the denominator is the number of license-months instead of thousands of lines of code or function points, and the numerator is all problems customers encountered. Basically, whereas defect density relates the number of real problems to the size of the software, this metric relates the detected problems to software usage.
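A minimal C sketch of the PUM calculation, with hypothetical numbers:

```c
#include <stdio.h>

/* Sketch: problems per user month (PUM) for a given calculation period.
   The numbers used below are hypothetical. */
static double pum(int total_problems, int installed_licenses, int months_in_period)
{
    double license_months = (double)installed_licenses * months_in_period;
    return total_problems / license_months;
}

int main(void)
{
    /* 150 problems reported by 1000 installed licenses over a 3-month period */
    printf("PUM = %.3f problems per license-month\n", pum(150, 1000, 3));
    return 0;
}
```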
There are different approaches to minimize PUM:
The first two approaches reduce the numerator of the PUM metric, and the third increases the denominator. The result of any of these actions will be that the PUM metric has a lower value. All three approaches make good sense for quality improvement and business goals for any organization. The PUM metric, therefore, is a good metric. The only minor drawback is that when the business is in excellent condition and the number of software licenses is rapidly increasing, the PUM metric will look extraordinarily good (low value) and, hence, the need to continue to reduce the number of customers' problems (the numerator of the metric) may be undermined. Therefore, the total number of customer problems should also be monitored and aggressive year-to-year or release-to-release improvement goals set as the number of installed licenses increases. However, unlike valid code defects, customer problems are not totally under the control of the software development organization. Therefore, it may not be feasible to set a PUM goal that the total customer problems cannot increase from release to release, especially when the sales of the software are increasing.
The key points of the defect rate metric and the customer problems metric are briefly summarized in the following table. The two metrics represent two perspectives of product quality. For each metric the numerator and denominator match each other well: Defects relate to source instructions or the number of function points, and problems relate to usage of the product. If the numerator and denominator are mixed up, poor metrics will result. Such metrics could be counterproductive to an organization’s quality improvement effort because they will cause confusion and wasted resources.
 | Defect Density Rate | PUM |
---|---|---|
Numerator | Valid and unique defects | All customer problems |
Denominator | Size of product | Usage of product |
Measurement | Producer perspective | Consumer perspective |
Scope | Intrinsic product quality | Intrinsic product quality + other |
The customer problems metric can be regarded as an intermediate measurement between defect measurement and customer satisfaction. To reduce customer problems, one has to reduce the functional defects in the products and, in addition, improve other factors (usability, documentation, problem rediscovery, etc.).
Customer satisfaction is often measured by customer survey data in which the users are asked to rate the software, or specific characteristics of the software, on a scale.
Based on the survey result data, several metrics with slight variations can be constructed and used, depending on the purpose of analysis. For example:
In addition to forming percentages for various satisfaction or dissatisfaction categories, the net satisfaction index (NSI) is also used to facilitate comparisons across products. NSI ranges from 0% (all customers are completely dissatisfied) to 100% (all customers are completely satisfied). If all customers are satisfied (but not completely satisfied), NSI will have a value of 75%. This weighting approach, however, may mask the satisfaction profile of one's customer set. For example, if half of the customers are completely satisfied and half are neutral, NSI's value is also 75%, which is equivalent to the scenario in which all customers are satisfied. If satisfaction is a good indicator of product loyalty, then half completely satisfied and half neutral is certainly less positive than all satisfied.
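A small sketch of the weighting approach described above. The weights (completely satisfied = 100%, satisfied = 75%, neutral = 50%, dissatisfied = 25%, completely dissatisfied = 0%) are an assumption, chosen because they reproduce the 75% examples in the paragraph:

```python
# Assumed five-point weighting scheme for the NSI calculation.
WEIGHTS = {
    "completely_satisfied": 1.00,
    "satisfied": 0.75,
    "neutral": 0.50,
    "dissatisfied": 0.25,
    "completely_dissatisfied": 0.00,
}

def net_satisfaction_index(responses):
    """responses maps each category to its number of answers."""
    total = sum(responses.values())
    return sum(WEIGHTS[cat] * count for cat, count in responses.items()) / total

# All customers merely "satisfied" -> 75%
print(net_satisfaction_index({"satisfied": 40}))                            # 0.75
# Half completely satisfied, half neutral -> also 75%, masking the profile
print(net_satisfaction_index({"completely_satisfied": 20, "neutral": 20}))  # 0.75
```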
Defect rate during a development cycle is usually positively correlated with the defect rate in the next phases. For instance, the defect rate after integration testing is usually positively correlated with the defect rate in the field. A higher defect rate found during a phase is an indicator that the software has experienced higher error injection during that phase, unless the higher testing defect rate is due to an extraordinary testing effort (for example, additional testing or a new testing approach that was deemed more effective in detecting defects). The rationale for the positive correlation is simple: software defect density never follows a uniform distribution. If a piece of code or a product has more testing defects, it is either the result of more effective testing or the consequence of a higher number of latent defects in the code. Myers suggested the counterintuitive principle that the more defects found during testing, the more defects will be found later.
This simple metric of defects per KLOC or function point is especially useful to monitor subsequent releases of a product in the same development organization. The development team or the project manager can use the following scenarios to judge the release quality:
This concept is shown graphically in the next diagram:
Overall defect density during testing is a summary indicator. The pattern of defect arrivals (or for that matter, times between failures) gives more information. Even with the same overall defect rate during testing, different patterns of defect arrivals indicate different quality levels in the field.
The next figure shows two contrasting patterns for both the defect arrival rate and the cumulative defect rate. Data were plotted from 44 weeks before code-freeze until the week prior to code-freeze. In both projects the overall defect count is the same; however, the quality forecast for the field is quite different. In the first project, during the last weeks the number of defects reported every week is smaller and tends to zero. The second project, represented by the charts on the right side, follows the opposite pattern. This indicates that testing started late, the test suite was not sufficient, and testing ended prematurely. It is extremely likely that this project, if released as it is, would lead to even more defects in the field.
The objective is always to look for defect arrivals that stabilize at a very low level, or times between failures that are far apart, before ending the testing effort and releasing the software to the field. Such declining patterns of defect arrival during testing are indeed the basic assumption of many software reliability models. The time unit for observing the arrival pattern is usually weeks and occasionally months. For reliability models that require execution time data, the time interval is in units of CPU time.
When we talk about the defect arrival pattern, there are actually three slightly different metrics, which should be looked at simultaneously:
We have just introduced an interesting concept: detecting a defect does not mean it is going to be automatically removed. This can happen for many different reasons:
A metric intended to distinguish between defect detection and defect removal is the Defect Removal Pattern, which describes the evolution of the number of defects removed over time. This metric can be calculated per unit of time or per phase of the project (e.g. iteration).
Some related metrics are the "average time to detect a defect" (which provides an indication of how good the process is at detecting defects) and the "average time to fix a defect", which is an indicator of how good the process is at fixing defects once they have been detected. These metrics are key, as we should remember that the later a defect is detected and fixed, the more expensive it is.
A burndown chart is a graphical representation of the amount of work left to be done versus time. It is typically used in Agile methodologies to check sprint evolution, measure team speed, etc. In those cases, the amount of work is always decreasing, as the tasks for the sprint are identified before the sprint starts.
An equivalent concept is the Defect Backlog / Burndown. In that case the graphic does not represent the amount of work to be done but the amount of unfixed defects. The curve can go down if no more defects are found and the remaining ones are fixed, or go up if the rate of detected defects outpaces the fix rate.
In the example above, we can see that the red line shows the cumulative number of defects (fixed or unfixed), the green one shows the number of fixed ones, whereas the black line shows the "delta" between detected and fixed defects (i.e. the size of the defect backlog).
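A minimal sketch of how the backlog ("delta") series can be derived from weekly detection and fix counts (the weekly figures are hypothetical):

```python
# backlog(week) = cumulative defects detected - cumulative defects fixed
from itertools import accumulate

detected_per_week = [4, 7, 9, 6, 3, 2]   # hypothetical weekly counts
fixed_per_week    = [1, 3, 6, 8, 5, 4]

cumulative_detected = list(accumulate(detected_per_week))
cumulative_fixed = list(accumulate(fixed_per_week))
backlog = [d - f for d, f in zip(cumulative_detected, cumulative_fixed)]

print(backlog)  # [3, 7, 10, 8, 6, 4] -> the backlog grows, then burns down
```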
Ideally, the defect backlog count should be zero before releasing a product, but in big products that is nearly impossible. This requires product managers to play with the four key aspects of any software project: resources, scope, time and quality. In particular, the following actions could be taken:
One approach that is sometimes used to make the defect backlog go to zero is using triage meetings to determine which defects are blockers and which ones are not. The idea is that the closer the release date is, the more difficult it is to consider a bug a blocker for the release. However, having a clear set of guidelines about what is a blocker and what is not is also very helpful; for instance, Mozilla used a particular one for FirefoxOS.
Defect removal effectiveness can be defined as follows:

DRE = (defects removed during a development phase / defects latent in the product at the entry of that phase) × 100%

It provides a measure of the percentage of defects removed in one phase with regard to the overall number of defects present in the code when entering that phase. As the total number of latent defects in the product at any given phase is not known, the denominator of the metric can only be approximated, which is usually done through:
The metric can be calculated for the entire development process, for the front end (before code integration), and for each phase. When calculated for the front end it is called early defect removal effectiveness.
The higher the value of the metric, the more effective the development process and the fewer the defects escape to the next phase or to the field. This metric is a key concept of the defect removal model for software development.
For instance, if during the development of a product 80 bugs were found and fixed, but there were still 20 latent defects that were found by customers when the product hit the field, the DRE would be:
DRE = 80 / (80 + 20) = 80%
On average, 80 out of every 100 defects were removed.
Another view of this metric is depicted in the following Figure.
It shows how defects are injected, detected and repaired during a project phase. Based on it, another way to calculate the defect removal effectiveness can be derived:

DRE = defects removed during the phase / (defects existing at phase entry + defects injected during the phase) × 100%

The following table is an example providing data about when errors are injected and detected in a software project.
Where found \ Origin of the defect | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Total Removed |
---|---|---|---|---|---|
Iteration 1 | 5 | - | - | - | 5 |
Iteration 2 | 10 | 15 | - | - | 25 |
Iteration 3 | 5 | 5 | 10 | - | 20 |
Iteration 4 | 5 | 5 | 0 | 5 | 15 |
Total Injected | 25 | 25 | 10 | 5 | 65 |
With that data, the DRE could be calculated for different phases of the software development process. Some examples are shown below:
During the first iteration the total number of defects injected was 25. The value "5" in the intersection of the "Iteration 1" row and column means that during that phase only 5 defects were removed. The value 10 in the intersection of row "Iteration 2" and column "Iteration 1" means that during Iteration 2, 10 defects that originated in Iteration 1 were removed. Equally, the value 15 in the intersection of row "Iteration 2" and column "Iteration 2" means that during Iteration 2, 15 defects that also originated in Iteration 2 were removed. We can calculate the DRE of the different iterations quite easily:

* Iteration 1: DRE = 5/25 = 20%
* Iteration 2: DRE = (10+15) / [(25+25) - 5] = 25/45 = 55%
* Iteration 3: DRE = (5+5+10) / [(25+25+10) - (5+25)] = 20/30 = 66%
* Iteration 4: DRE = (5+5+0+5) / [(25+25+10+5) - (5+25+20)] = 15/15 = 100%
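The same calculation can be scripted directly from the injection/removal matrix; the following sketch reproduces the figures above (the "-" cells are treated as zero):

```python
# Per-iteration DRE from the injection/removal matrix
# (rows = iteration where the defect was found, columns = iteration of origin).
removed = [          # removed[found][origin]
    [5,  0,  0, 0],  # found in Iteration 1
    [10, 15, 0, 0],  # found in Iteration 2
    [5,  5, 10, 0],  # found in Iteration 3
    [5,  5,  0, 5],  # found in Iteration 4
]
injected = [25, 25, 10, 5]  # defects injected per iteration

removed_so_far = 0
for i, row in enumerate(removed):
    removed_now = sum(row)
    latent_at_entry = sum(injected[: i + 1]) - removed_so_far
    print(f"Iteration {i + 1}: DRE = {removed_now}/{latent_at_entry} "
          f"= {removed_now / latent_at_entry:.1%}")
    removed_so_far += removed_now
```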
The following table describes, for each phase of the software development process, the most important sources of defect injection and removal.
Development Phase | Defect Injection | Defect Removal |
---|---|---|
Requirements | Requirements Gathering Process and Specification Development | Requirement Analysis and Review |
High Level Design | Design | High Level Design Inspections |
Low Level Design | Design | Low Level Design Inspections |
Code Implementation | Coding | Code Inspections, Testing |
Integration Build | Integration and Build Process | Build Verification Testing |
Unit Test | Bad Fixes | Testing Itself |
Component Test | Bad Fixes | Testing Itself |
System Test | Bad Fixes | Testing Itself |
This chapter has so far been addressing metrics that are related to a direct measurement of software quality. However, it is also critical to consider that those metrics usually have a direct relationship with some software characteristics that are not, in themselves, direct measures of software quality.
Some intrinsic characteristics of the software that usually affect the software quality (either internal or external) are:
Different metrics exist to take all those aspects into account in early phases of the software development process and to take preventive measures; e.g. if the code is extremely complex, a refactoring of the software should be done in order to minimize the likelihood of defects.
Some examples of metrics used (in Object Oriented Programming) are:
Quality Assurance is only one part of the activities that are used to improve Software Quality. However, QA per se is not enough, as it does not define how the software is managed; for instance:
SCM could be defined as a framework for managing the evolution of software throughout all the stages of Software Development Process.
There are multiple definitions for SCM and in some cases the SCM acronym is used with different meanings (Software/Source Code Management, Software/Source Code Change Control Management, Software/Source Configuration Management...). Roger Pressman states [[SOFTWARE-ENGINEER-PRACTICIONER]] that SCM is a "set of activities designed to control change by identifying the work products that are likely to change, establishing relationships among them, defining mechanisms for managing different versions of these work products, controlling the changes imposed, and auditing and reporting on the changes made."
In summary, SCM is a set of activities intended to guarantee:
When used effectively during a product's whole life cycle, SCM identifies software items to be developed, avoids chaos when changes to software occur, provides needed information about the state of development, and assists the audit of both the software and the SCM processes. Therefore, its purposes are to support software development and to achieve better software quality. Additionally, a good SCM system should also help to reduce (or at least control) costs and effort involved in making changes to a system.
IEEE's (IEEE Std. 828-1990) traditional definition of SCM included four key activities: configuration identification, configuration control, configuration status accounting and configuration audits. However, a successful implementation of SCM also requires careful planning and good release management and processing. The next figure represents all these activities graphically:
The following figure provides a breakdown of all the SCM activities into more granular topics.
A successful SCM implementation requires careful planning and management. This, in turn, requires an understanding of the organizational context for, and the constraints placed on, the design and implementation of the SCM process.
Some aspects that should be decided during this activity are:
The software configuration identification activity identifies items to be controlled, establishes identification schemes for the items and their versions, and establishes the tools and techniques to be used in acquiring and managing controlled items. These activities provide the basis for the other SCM activities.
Configuration Item: A configuration item is any possible part of the development or delivery of a system or product that it is necessary to identify, produce, store, use and change individually. Many people associate configuration items with source code files, but configuration items are not limited to that; many other items can be identified and managed, such as:
For each configuration item, additional information apart from the item itself is controlled by the SCM. As it is data about data, it is called metadata. Every configuration item must have a unique identification, sometimes also called a label. Additionally, metadata may include further information such as:
A first step in controlling change is to identify the software items to be controlled. This involves understanding the software configuration within the context of the system configuration, selecting software configuration items, developing a strategy for labelling software items and describing their relationships, and identifying the baselines to be used.
Software Configuration: A software configuration is the set of functional and physical characteristics of software as set forth in the technical documentation or achieved in a product.
Selecting Configuration Items: It is an important process in which a balance must be achieved between providing adequate visibility for project control purposes and providing a manageable number of controlled items. The items of a configuration should include all the items that are part of a given software release.
Defining relationships and interfaces between the various configuration items is key as it also affects other SCM activities such as software building or assessing the impact of suggested changes. The identification or labelling scheme used should support the need to evolve software items and their relationships (e.g. configuration item X requires version A of configuration item Y).
Identifying the baselines is another critical task of SCM Identification. A software baseline is a set of software configuration items formally designated and fixed at a specific time during the software life cycle. The term is also used to refer to a particular version of a software configuration item that has been agreed on. In either case, the baseline can only be changed through formal change control procedures. A baseline, together with all approved changes to the baseline, represents the current approved configuration.
The software is subject to continuous changes that are coming from different sources:
Change Control takes care of keeping track of these changes and ensures that they are implemented in a controlled manner.
The most important activity from a Change Control point of view is the definition of how changes are made:
In short, the key thing is having a clear workflow (you can find an example in the FirefoxOS flow). Once the process for making changes is clear, it is also important to specify how the revision history of configuration items is going to be kept and how other developers are going to be notified about those changes:
The main target of configuration status accounting is the recording and reporting of information needed for effective management of the software configuration.
The information that should be available is diverse:
In order to provide and control all this information, good tool support is needed. This could be part of the Configuration Item Management system or an independent tool integrated with it.
Reported information can be used by various organizational and project elements, including the development team, the maintenance team, project management, and software quality activities. Reporting can take the form of ad hoc queries to answer specific questions or the periodic production of predesigned reports. Some information produced by the status accounting activity during the course of the life cycle might become quality assurance records.
In addition to reporting the current status of the configuration, the information obtained by this system can serve as a basis for various measurements of interest to management, development, and SCM. Examples include the number of change requests per configuration item and the average time needed to implement a change request, defect arrival pattern per release/component...
The purpose of configuration audits is to ensure that the software product has been built according to specified requirements (Functional Configuration Audit, FCA), to determine whether all the items identified as a part of CI are present in the product baseline (Physical Configuration Audit, PCA), and whether defined SCM activities are being properly applied and controlled (SCM system audit or in-process audit). A representative from management, the QA department, or the customer usually performs such audits. The auditor should have competent knowledge of both SCM activities and the project.
The auditor should check that the product is complete and consistent (e.g. "Are all the correct versions of files used in this current release?"), that no outstanding issues exist (e.g. there are no critical defects or CRs) and that the product has passed all the required tests to ensure its quality.
The output of the audit should specify whether the product's performance requirements have been achieved by the product design and whether the product design has been accurately documented in the configuration documentation.
In order to properly perform this activity, it is important to:
The term "release" is used to refer to a software configuration that is distributed outside of the development team. This includes internal releases as well as distribution to end-users. When different versions of software are available for different platform configurations it is frequently necessary to create multiple releases for delivery.
Building the release:
In order to release a software product, the configuration items must be combined, packaged with the right configuration and, in most cases, built into an executable program that can be installed by the customers. Build instructions ensure that the proper build steps are taken in the correct sequence. In addition to building software for new releases, it is usually also necessary for SCM to have the capability to reproduce previous releases for recovery, testing, maintenance, or additional release purposes.
Software is built using particular versions of supporting tools, such as compilers. It might be necessary to rebuild an exact copy of a previously built software configuration item. In this case, the supporting tools and associated build instructions need to be under SCM control to ensure availability of the correct versions of the tools (i.e. not only does the source code evolve, but also the tools we use).
A tool capability is useful for selecting the correct versions of software items for a given target environment and for automating the process of building the software from the selected versions and appropriate configuration data. For large projects with parallel development or distributed development environments, this tool capability is necessary. Most software engineering environments provide this capability.
Release Management:
Software release management encompasses the identification, packaging, and delivery of the elements of a product, for example, executable program, documentation, release notes, and configuration data.
Given that product changes can occur on a continuing basis, one concern for release management is determining when to issue a release. Some aspects to consider when making such a decision are the severity of the problems addressed by the release and the fault densities measured in prior releases.
The packaging task must identify which product items are to be delivered, and then select the correct variants of those items, given the intended application of the product. The information documenting the physical contents of a release is known as a version description document. The release notes typically describe new capabilities, known problems, and platform requirements necessary for proper product operation. The package to be released also contains installation or upgrading instructions. The latter can be complicated because some current users might have versions that are several releases old.
Finally, in some cases, the release management activity might need to track the distribution of the product to various customers or target systems. An example would be a case where the supplier was required to notify a customer of newly reported problems. A tool capability is needed for supporting these release management functions. It is useful to have a connection with the tool capability supporting the issue tracker in order to map release contents to the issues that have been received. This tool capability might also maintain information on various target platforms and on various customer environments.
This chapter provides a set of best practices or patterns that should be used in SCM. There are multiple tools that can be used for SCM. Some of them focus on configuration identification and change control, others pay special attention to auditing and accounting, and others are focused on the build and release part. In most cases different tools are required, and what is important is that all the tools are properly integrated. For instance, a typical situation is using one tool for managing the source code (e.g. Subversion or Git), another one for keeping track of the issues, defects or releases (e.g. Redmine or Bugzilla), another one for Agile management (e.g. Trello) and maybe another one for Continuous Integration (e.g. Travis). In such a multi-tool environment it is important to ensure that the changes on the configuration items can be linked to the issues and releases in the issue tracker and the Agile management tool.
It is important to stress that there are multiple paradigms for managing the source code, the most important distinction being whether the system is centralized or distributed. Linus Torvalds (the creator of Git) gave an interesting talk at a Google Tech Talk event in which he compared both approaches [[LINUS-SCM-GOOGLE]].
The patterns described in this chapter try to be generic enough so that they can be applied in both centralized and distributed systems, although some of them may apply only to one of the two approaches. Additionally, it is important to stress that a distributed system can usually be configured to work in a centralized way. Finally, during the last years distributed systems have proliferated and seem to have become the de-facto standard for SCM.
In the previous chapter the formal definition of baseline (according to IEEE) was provided. From a more "practical" point of view, a baseline is a consistent set of configuration items (creating one is sometimes also called tagging or labelling). A baseline is a reference basis for evolution and releasing.
The frequency of baseline releasing depends a lot on the software development methodology that is used:
Obviously, working in an evolutionary manner requires more frequent baseline releases than a waterfall model. Hence, although having an easy way to release is desirable in general, it is even more important in agile approaches.
A repository is a system that stores the different versions of all the configuration items. The repository remembers every change ever written to it: every change to every configuration item and changes on the repository structure (such as the addition, deletion and rearrangement of files and directories).
Depending on the type of approach (centralized or distributed) there may be a central repository that is considered the master copy of the project.
A workspace is a copy of the repository that developers have in their machines and that is used to progress on the software development. The changes that developers make in their working copies are not available to other developers until they have transferred the data to the repository. A working copy does not have all the versions of the configuration items but just one. However, developers have the opportunity to retrieve from the repository any version of any configuration item they are interested in.
When a developer wants to create a working copy based on the content of the repository, he should perform a "checkout" or "clone" of the repository. A checkout is the operation that copies the configuration items of a repository to create a new working copy. The checkout operation can request any version of the repository, but by default it requests the latest one (a.k.a. HEAD). In distributed systems, the clone operation does not only retrieve the content of the configuration items but also all their revisions, configuration information and branches.
When a developer makes some changes in his working copy that he wants to submit to the repository (so that other developers can use them), he should perform a "commit" operation (a.k.a. check-in). The commit operation allows developers to contribute new versions of one or multiple configuration items to the repository. In some systems (e.g. SVN) the commit is directly submitted to the central repository. However, in distributed systems (e.g. Git), a commit is local and needs to be sent to the remote repository afterwards. This can be achieved in different ways:
Once a developer has a working copy, he can request at any time to synchronize with the latest version available in another repository. In centralized systems, the synchronization is performed with just one repository, the central one. However, in distributed systems, developers can (and usually do) synchronize with multiple repositories. For instance, a typical Git configuration is having a remote repository named "upstream" pointing to the project upstream repository and another one named "origin" that points to the developer's repository. This operation is called "update" in centralized systems and "pull" in distributed ones. When this operation is requested, the configuration items that have been changed in the repository are updated in the developer's working copy.
SCM systems will never incorporate other people's changes (update), nor make your own changes available to others (commit), until you explicitly tell them to do so.
Different systems have different approaches with regards to configuration item versioning and identification. In Subversion or Git, every time a commit is performed in the repository a new revision of the repository is created.
As revisions are always linked to a commit, they are also called "commit IDs". In Git, as they are hashes, they are also referred to as "hash IDs". For instance, the revision identifier of the first commit in the repository of these notes is 53a3797f7f406f15220955f5f6883cbae36e826f, as you can see here.
A commit may include changes to one or more configuration items; due to that, between two subsequent revisions more than one item may differ.
For instance, this commit includes changes in one file, adds two new ones, and simply adds a new revision on top of the previous one.
It is important to stress that in modern SCM systems the configuration items are not identified individually but as part of a revision. This is an important change with regard to older systems such as CVS (Concurrent Versions System). In order to identify a particular version of a configuration item, the revision in which that version of the item was available should be referred to.
The master or trunk is the main line of development of the repository, that is, the place where the evolution of the software product should happen. However, having a single development line in the repository is not enough for most software products.
Branching: A non-software example. Suppose your job is to maintain a document for a division in your company, a handbook of some sort. One day a different division asks you for the same handbook, but with a few parts "tweaked" for them, since they do things slightly differently. What do you do in this situation? You do the obvious thing: you make a second copy of your document, and begin maintaining the two copies separately. As each department asks you to make small changes, you incorporate them into one copy or the other. You often want to make the same change to both copies. For example, if you discover a typo in the first copy, it's very likely that the same typo exists in the second copy. The two documents are almost the same, after all; they only differ in small, specific ways. Maintaining the two branches is an extra burden.
As you have seen, maintaining extra branches is expensive; hence, before creating long-lived parallel branches you need to think whether there are alternatives: configuration parameters, specific modules, different runtime behaviours...
When a developer wants to create another development line in the repository he creates a branch. A branch is a line of development that exists independently of another line but shares a common history if you look far enough back in time. A branch always begins life as a copy of something, and moves on from there, generating its own history. However, branches that started from a common point and diverged later on can merge eventually again.
In the Software Development process it is sometimes convenient to identify a particular version, release or baseline of the software. This is achieved by tags. A tag is a snapshot of the repository at a specific point in history. Typically people use this functionality to mark release points (v1.0, and so on). Tags are not intended to change by any means. Different SCMs have different strategies for implementing tags, but most of them implement this feature as a specific branch that does not change with time.
Before Git became widespread, branches were used with a lot of care, since merging in other SCM systems such as SVN was very difficult. Merging is the process by which two configuration items are combined into a new one. Depending on the number of configuration items to be combined, on the type of changes done in them, and on the SCM system used, merging can be a very difficult operation.
Branches are created to save some work by allowing developers to work on independent features in an independent manner. However, that may sometimes end up in spending extra time doing a difficult merge. The reason why Git is so successful nowadays is that it has simplified the way merges are done and has hence enabled developers to create and work on separate branches.
However, easy merging does not mean branches should be used without care. For instance, in general, overcomplicated structures where branches are continuously created from branches other than master (an arborescent approach) should be avoided.
Branching works better when you integrate with the origin of the branch as quickly as possible.
Best Practice 1: Simplify the branching model.
Although branching is cheap in systems such as Git, that should not be an excuse for creating overly complex tree structures diverging from the master branch. Ask developers to branch from the master branch, which is the "home codeline" into which you merge all of your development, except in special circumstances. Branching always from master reduces merging and synchronization effort by requiring fewer transitive change propagations.
It is also important that the expected branches are planned in advance and that a branch diagram is used. Having a diagram is of huge help to the development team, as it allows, at a glance, a clear understanding of the different branches available and the relationships across them. There are many tools for generating such a diagram automatically.
You can see below an example of such a diagram:
Best Practice 2: Create specific development branches for every feature you implement
As shown in the previous diagram (branches Story A and Story B), for every feature to be added or for every bug you fix you should create a separate branch, so you can work in an isolated and independent manner.
Best Practice 3: Development branches should be short-lived.
More information about when a development branch should be merged will be provided in the following sections, but by looking at the diagram it is easy to understand that the later we merge, the more difficult it will be, as the branches will have diverged more.
Best Practice 4: When development branches must live for a long time, relatively frequent intermediate merges should be done.
When you create a development branch and it is going to take a long time before you can merge your changes into the master branch, try to sync with master frequently so that you avoid your working branch diverging from master. The more time you wait, the more difficult the merge will be.
Best Practice 5: Branch Customer Releases.
When a new software version is released to users, the "usual" situation is that the team must work on at least two versions in parallel:
Due to that, when a version is released to customers a release branch should be created. In this way bugfixing can be done on the release branch without exposing the customer to new feature work in progress on the mainline.
The typical workflow for customer releases is:
Best Practice 6: Branch long-lived parallel efforts.
Long-lived parallel efforts that multiple people will be working on should be done in independent branches. Imagine you want to experiment with a new feature and you know that a lot of time will be needed before having something that can be merged into the master branch. In that case, it makes sense to create a specific branch (similar to the release branch) so that the team can work on that feature while others can check the progress.
Best Practice 7: Be always flexible, there may be some very strong reasons for breaking these "rules".
These are just a set of recommendations, but there are different ways to work with branches, and all of them are right and wrong at the same time, as it is impossible to have a perfect framework. For instance, some authors [[SUCCESSFUL-GIT-BRANCHING]] promote the idea of having an integration branch called "develop" with an infinite lifetime (like master), as shown in the following figure:
When working with multiple branches, the task of combining them into a single line of code (merging) is of paramount importance.
When the work in the two branches to merge has no overlapping configuration items (no configuration item has been modified in both), the merging task is easier. However, although no conflicts should occur during the merge, that does not mean that the result of the merge is going to be good enough. Let's have a look at a non-software example:
Imagine you are Dr. Frankenstein and you want to build a human being. You have a development team composed of two developers; in order to avoid problems when merging their contributions you ask one to develop the legs and the other to develop the arms. When both have finished their task, the merge is done with no problem, i.e. two arms and two legs are assembled onto the body. However, imagine what happens if the left leg is twice as long as the right leg: the merge worked OK but the result is a monster.
In conclusion, a merge without conflicts can also be a bad merge.
When some configuration items have been modified in both branches, the merging task is not immediate, as manual intervention is required to resolve the conflicts that result from modifying the same file separately. A conflict in a merge is said to occur when two configuration items have been modified with divergent changes.
Best Practice 8: Developers making the changes should be the ones responsible to fix the conflicts.
They are the ones who know best the code they have modified, so the best way to prevent a Frankenstein's monster from being created is asking them to ensure the merge work leads to a fully functional result.
Software is developed in teams because concurrent work is needed. Nonetheless, the more people in your team, the more potential for conflicting changes.
In order to minimize the number of conflicts and facilitate the work of the team it is important to encourage team members to:
But finding the right balance for this last point (checking in stable code, but doing it soon) is usually difficult.
Working from a highly tested stable line is not always an option when new features are being developed; otherwise the frequency of the commits would not be as high as needed. However, even if it is not highly tested, the code retrieved from the repository is at least expected to have a reasonable quality. In order to reach a good trade-off, it is important to require developers to perform simple procedures before submitting code to the codeline, such as a preliminary build and some level of testing.
The good trade-off is having a development line stable enough for the work it needs to do. Do not aim for a perfect active development line, but rather for a mainline that is usable and active enough for your needs.
An active development line will have frequent changes, some well-tested checkpoints that are guaranteed to be "good", and other points in the codeline that are likely to be good enough for someone to do development on the tip of the line.
Some aspects that should be considered by developers are:
Many of the concepts we have just described are closely related to the concept of Continuous Integration and will be explained in the next chapter.
Best Practice 9: Before pushing a contribution (Pull Request or Direct Push), ensure that the latest version of the repository is available in the working copy.
Best Practice 10: Think globally by building locally. Ensure the system builds before pushing.
The only way to truly test that a change is 100% compatible with the system is through the centralized integration build. However, if we do not test it in our working copy, it is highly likely that our changes break the build and disturb the work of other developers. Before making a submission to source control, developers should build the system using a Private System Build that is similar to the centralized build. A private system build does take time, but this is time spent by only one person rather than by each member of the team should there be a problem.
Best Practice 11: Code can be committed with bugs if they are known and do not introduce regressions.
Do not wait until you have the final version of your software. Sometimes it is better to have the code available in the master branch soon (even with known bugs) than to wait extra time to fix and land the code later (when more conflicts can happen and less time will be spent in testing by other developers).
Since many people are making changes in the repository, it is impossible for a developer to be 100% sure that the entire system builds correctly after they integrate their changes in the repository, even if they create a local build and extensively test it beforehand.
Continuous Integration (CI) is a software development practice where members of a team commit their work frequently, leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. This approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.
Building is the process of turning the sources into a running system. This can often be a complicated process involving compilation, moving files around, generating configuration files, loading schemas into the databases, and so on. However, this process can (and as a result should) be automated.
Automated build environments are a common feature of systems. The Unix world has had make for decades, the Java community developed Ant, the .NET community had NAnt and now has MSBuild, and for Node.js and JavaScript we now have Grunt, Gulp and many more... What is important, regardless of the programming language and framework, is to make sure you can build and launch your system using these scripts with a single command. A common mistake is not to include everything in the automated build. This should be avoided by all means, as anyone should be able to take a virgin machine, check the sources out of the repository, issue a single command, and have a running system on their machine.
Best Practice 12: The full build process should be automated and include everything that is required.
A big build often takes time, and with CI we want to detect issues as soon as practical, so optimizing build time is key to meeting this target, as in some cases building a complete system might take hours. In order to save time, good build tools analyse what has changed and perform only the required actions. The common way to do this is to check the dates of the source and object files and only compile if the source date is later. One of the trickiest aspects of building in an incremental way is managing dependencies: if one object file changes, those that depend on it may also need to be rebuilt.
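As an illustration, the timestamp check that incremental build tools perform boils down to something like the following sketch (the file names are hypothetical):

```python
# Rebuild a target only if it is missing or older than any of its dependencies.
import os

def needs_rebuild(target, dependencies):
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# Hypothetical example: an object file depending on a source file and a header
if needs_rebuild("main.o", ["main.c", "main.h"]):
    print("main.o is out of date -> recompile")
```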
Best Practice 13: Try to minimize the time required to generate the build.
As explained before, multiple tools exist to perform the build, depending, for instance, on the OS of the machine hosting the repository: Make, Ant, Grunt... There are also some cross-platform tools that allow creating a custom centralized build process on any OS and SCM system.
The build process should take into account that different targets or configurations may be supported. For instance, desktop software must usually be built for Windows, OS X and Linux, so the build system should be able to create builds for all these systems.
Having a central build ensures the software is always built in the same manner. The software build process should be reproducible, so the same build could be created as many times as needed and as close as possible to the final product build.
Best Practice 14: Have a centralized and reproducible build system.
A build may be successfully created and it may run, but that doesn't mean it does the right thing. Modern statically typed languages can catch many bugs, but far more are not detected by the compilers.
A good way to catch bugs quickly and efficiently is to include automated tests in the build process. Testing isn't perfect, of course, but it can catch a lot of bugs.
The good news is that the rise of TDD has led to a wide availability of automated testing frameworks and tools, such as the xUnit family, Selenium and plenty of others.
Of course self-testing is not going to find everything, as tests do not prove the absence of bugs, but they help to detect bugs early and hence minimize their impact. As in the case of build generation, running the tests takes time, so we should try to optimize the testing process (in terms of performance and the amount of relevant tests to be run).
As we are encouraging developers to commit frequently, ensuring the mainline stays in a healthy state is an important but difficult task.
The best way to ensure that is by having regular builds on an integration machine: only if this integration build succeeds should the commit be considered done. Since the developer who commits is responsible for this, that developer needs to monitor the mainline build so they can fix it if it breaks. Your work is not completely done until the mainline build has finished and has passed all the self-tests.
Best Practice 15: Create a new build with every commit.
A continuous integration server acts as a monitor of the repository. Every time a commit against the repository is done, the server automatically checks out the sources onto the integration machine, initiates a build, runs the self-tests and notifies the committer of the result of the build and tests.
The best way to monitor the repository is by using mechanisms such as hooks. Hooks are actions that can be configured in Git to be executed every time a user commits to the repository. A hook can be pre-commit or post-commit, depending on whether it is executed before or after the commit is done.
Pre-commit hooks may be used, for instance, to reject commits that have some errors (e.g. the system does not build with the changes); in that case, if the error is detected, the commit is rejected and the user doing the commit is notified.
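As an illustration, a pre-commit hook can be a small script saved as .git/hooks/pre-commit and made executable; the sketch below assumes the project can be built and tested with a hypothetical `make test` command, and relies on the fact that Git aborts the commit when the hook exits with a non-zero status:

```python
#!/usr/bin/env python3
# Sketch of a pre-commit hook: run the (hypothetical) build/test command
# and reject the commit if it fails.
import subprocess
import sys

result = subprocess.run(["make", "test"])  # placeholder for the project's build/test step
if result.returncode != 0:
    print("pre-commit: build or tests failed, commit rejected", file=sys.stderr)
    sys.exit(1)  # non-zero exit makes Git abort the commit
```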
If post-commit hooks are used, it is the developer or the repository administrator who is responsible for performing corrective actions in case the build is not properly generated or the tests do not pass.
A key part of doing a continuous build is that if the mainline build fails, it needs to be fixed right away. The whole point of working with CI is that you're always developing on a known stable base.
It is not a terrible thing for the mainline build to break, although if it happens all the time it suggests people are not being careful enough about updating and building locally before a commit. When the mainline build does break, however, it is important that it gets fixed fast. Usually, the fastest way to fix the build is to revert the latest commit from the mainline, taking the system back to the last-known good build; this is sometimes known as backing out the commit. Unless the cause of the breakage is immediately obvious and can be fixed really fast, developers should just revert the mainline and debug the problem in the working copy, leaving the repository clean.
Best Practice 16: Back out any commit that breaks the master build immediately.
Continuous Integration is all about communication, so it is important to ensure that everyone can easily see the state of the system and the changes that have been made to it.
SCM systems such as Git provide information about the changes done, but Git as such does not communicate the state of the mainline build. The ideal solution is to provide a web site (either integrated in the SCM or standalone) that shows whether there is a build in progress and what the state of the last mainline build was. An example of such a system is Travis [[TRAVIS-CI]].
A release is a version of the product that is made available to its intended customers. External releases are published to end-users, whereas internal releases are made available only to developers. Releases are identified by release numbers, which are totally independent from the SCM version numbers.
Releases can also be classified as full or partial releases, depending on whether they require a complete installation or not. Partial releases require a previous full release to be installed.
Release creation involves collecting all the files and documentation required to create a system release. Configuration descriptions have to be written for different hardware, and installation scripts have to be written. The specific release must be documented to record exactly which files were used to create it. This allows it to be re-created if necessary.
Release planning is concerned with when to issue a system version as a release. The following factors should be taken into account for defining a release strategy:
There is a need to continuously submit changes to the repository. The reasons for checking in changes are multiple:
Even in a continuous integration model, it is important to be able to control the changes that have been made to the configuration items in the repository. Control in this context does not mean approval but traceability, i.e. it is not always necessary that someone approves a change, but it must be possible to identify, for every change committed, the reasons for it. Lack of control in the process leads to project failures, confusion and chaos. Using a good control mechanism enables communication, data sharing and efficiency.
Depending on the codeline in which the changes are made, the level of information required and the flow that should be followed for implementing and approving them may differ.
For instance, changes in master should be encouraged rather than discouraged. In order to do so, developers should be free to commit their changes to the repository if:
Additionally, anybody in the development team should be free to raise additional issues that can be assigned to anybody within the team. Giving freedom to the development team (within some limits) is usually a good idea.
If the changes are going to be applied in a branch that was created based on a commercial release, the process usually follows a stricter control. For instance:
Regardless of how "controlled" the implementation of changes in the repository is, the system used should support features such as:
With respect to the tools, there are multiple tools for issue tracking. There are commercial ones such as Jira and free ones such as Bugzilla or Redmine. The latter is very powerful, as it is quite flexible and can be integrated with the most popular source code management tools such as Git and SVN. Additionally, other Agile management tools such as Trello can be used.
Additionally, the source code management tools should also have adequate authentication and authorization mechanisms to ensure the traceability of the changes, i.e. to identify who made any particular change in the repository. For instance, SVN offers an authentication mechanism and allows using others such as LDAP. The important thing is that SVN identifies which user has done which commit in order to trace a change back to its author. SVN also allows the definition of permissions on a per-branch or per-configuration-item basis, so access to some branches may be allowed only to some users.
The purpose of software testing is to ensure that the software systems work as expected when their target customers and users use them.
The basic idea of testing involves the execution of software and the observation of its behavior or outcome. If a deviation from the expected behavior is observed, the execution record is analyzed to find and fix the bug(s) that caused the failure.
Testing could be hence defined as a controlled experimentation through program execution in a controlled environment before product release. Therefore testing fulfills two primary purposes:
Testing could be categorized in different types based on different criteria.
The main difference between functional and structural testing is the knowledge (or lack of knowledge) about the software internals and hence the related focus:
When a "black-box" approach is followed, the definition of the test-cases that should be executed does not take into account the structure of the software. The execution of the test cases focuses on the observation of the program external behavior during execution. It checks what is the external output of software based in some inputs.
There are different levels in which Black-Box testing can be performed:
Structural testing requires the knowledge of the internals of the software implementation. It verifies the correct implementation of internal units, such as program statements, data structures, blocks... and the relations among them.
Defining test cases in a structural way consists in using the knowledge of the software implementation in order to reduce the number of test cases to be run. Given the tendency to define the test cases (or even automate them, as we will see in the TDD section) before the software is implemented, this kind of technique is less useful nowadays.
When executing tests in a structural way, as the key focus is the connection between execution behavior and internal units, the observation of the results is not enough, and additional software tools are also required: for instance, debuggers, which help us trace through program executions. By doing so, the tester can see if a specific statement has been executed and if the result or behavior is as expected.
This kind of testing is usually very complex due to the use of these tools. However, its key advantage is that once a problem is detected it is also localized (the failure leads directly to the bug). Because of this complexity, this testing is usually only done when the root cause of a bug discovered via functional testing cannot be found, or in very late stages of the project.
One of the most important decisions that should be taken by the QA team (and the whole software and product team) is when to stop testing.
Obviously, an easy (but wrong) decision would be stopping based on the resources, e.g. stopping when you run out of time or money. As such a decision would lead to quality problems, we need to find quality-based criteria to decide when our product has passed enough tests. In order to identify when the product has reached the quality goals, there are different points of view:
Actual customer usage of software products can be viewed as a form of usage-based testing. Measuring directly the quality in a real environment is the most accurate way to identify if the software quality targets have been achieved.
The so-called beta test makes use of continuous iterations, through controlled software releases, so that these beta customers help software development organizations improve their software quality.
In usage-based statistical testing (UBST), the overall testing environment resembles the actual operational environment for the software product in the field, and the overall testing sequence, as represented by the orderly execution of specific test cases in a test suite, resembles the usage scenarios, sequences, and patterns of actual software usage by the target customers.
Although very useful, as this approach is helpful to detect not only bugs but also other type of prblems, this approach could be dangerous if it is not use with care as it could damage the software vendor's reputation, for instance if the produce released as beta has very bad quality. Due to that, it is recommended to use this approach mainly for final software stages or when the team feels very confident about software stability.
Most traditional testing techniques, either Black or White Box, use various forms of test coverage as the stopping criteria. This means that the testing process is stopped when a set of tests are executed successfully in the software. In this case, the key aspects are identifying which is the required test coverage and what executed successfully means.
With respect to coverage, in the case of Functional Testing, it could consist of completing a checklist of major functions based on the product specification (system requirements), of having a minimum number of Test Cases per User Story, etc.
In the case of Structural Testing, it could consist of covering a checklist of all the product components or all the statements of the software.
With respect to what "executed successfully" means, as we know, it is impossible to have 100% bug-free software, so in some cases it is accepted that some known bugs remain open when deciding that the product is finished; obviously, this depends on the type of product, the criticality of the bugs, the timescales...
The key differences that distinguish coverage-based testing (CBT) from UBST are the perspective and the related stopping criteria.
With regard to the perspective, UBST views the objects of testing from a user's perspective and focuses on the usage scenarios, sequences, patterns, and associated frequencies or probabilities. On the other hand, CBT views the objects from a developer's perspective and focuses on covering functional or implementation units and related entities.
With regard to the stopping criteria, UBST uses product-in-use metrics as the exit criterion, whereas CBT uses coverage goals - which are supposed to be approximations of in-use goals - as the exit criterion.
Tests are frequently grouped by where they are added in the software development process, or by the target (element) to be tested.
Unit testing refers to tests that verify the functionality of a specific section of code, usually at the function level. In an object-oriented environment, this is usually at the class level, and the minimal unit tests include the constructors and destructors.
These types of tests are usually written by developers as they work on code (white-box style), to ensure that the specific function is working as expected. One function might have multiple tests, to catch corner cases or other branches in the code. Unit testing alone cannot verify the functionality of a piece of software, but rather is used to assure that the building blocks the software uses work independently of each other.
Unit testing is also called component testing.
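As a minimal illustration of a unit test at the class level, the sketch below assumes JUnit 5 on the classpath and uses a small hypothetical Account class; it checks both the normal behavior of a single method and one corner case.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

// Hypothetical class under test: a trivial bank account.
class Account {
    private int balance = 0;

    void deposit(int amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        balance += amount;
    }

    int getBalance() {
        return balance;
    }
}

class AccountTest {

    @Test
    void depositIncreasesBalance() {
        Account account = new Account();
        account.deposit(50);
        assertEquals(50, account.getBalance());
    }

    @Test
    void depositRejectsNonPositiveAmounts() {
        // Corner case: the error branch gets its own dedicated test.
        Account account = new Account();
        assertThrows(IllegalArgumentException.class, () -> account.deposit(0));
    }
}
```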
Integration testing is any type of software testing that seeks to verify the interfaces between components against a software design. Software components may be integrated in an iterative way or all together ("big bang"). Normally the former is considered a better practice since it allows interface issues to be localised more quickly and fixed.
Integration testing works to expose defects in the interfaces and interaction between integrated components (modules). Progressively larger groups of tested software components corresponding to elements of the architectural design are integrated and tested until the software works as a system.
System testing tests a completely integrated system to verify that it meets its requirements.
System integration testing verifies that a system is integrated to any external or third-party systems defined in the system requirements.
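By contrast, an integration test exercises the interface between components rather than a single unit in isolation. The sketch below is only illustrative (the repository and service classes are hypothetical) and again assumes JUnit 5.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashMap;
import java.util.Map;

// Hypothetical components: a repository and a service that depends on it.
class UserRepository {
    private final Map<Integer, String> users = new HashMap<>();

    void save(int id, String name) {
        users.put(id, name);
    }

    String findName(int id) {
        return users.get(id);
    }
}

class GreetingService {
    private final UserRepository repository;

    GreetingService(UserRepository repository) {
        this.repository = repository;
    }

    String greet(int userId) {
        return "Hello, " + repository.findName(userId) + "!";
    }
}

class GreetingIntegrationTest {

    @Test
    void serviceAndRepositoryCooperate() {
        // Both real components are wired together; the test exercises their interface.
        UserRepository repository = new UserRepository();
        repository.save(1, "Ada");
        GreetingService service = new GreetingService(repository);
        assertEquals("Hello, Ada!", service.greet(1));
    }
}
```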
Although testing has a common set of goals, the targets of testing can be very different. Some examples of types of testing based on their goals are listed in this section.
Regression testing focuses on finding defects after a major code change has occurred. Specifically, it seeks to uncover software regressions, or old bugs that have come back. Such regressions occur whenever software functionality that was previously working correctly stops working as intended. Typically, regressions occur as an unintended consequence of program changes, when the newly developed part of the software collides with the previously existing code. Common methods of regression testing include re-running previously run tests and checking whether previously fixed faults have re-emerged. The depth of testing depends on the phase in the release process and the risk of the added features. It can range from complete, for changes added late in the release or deemed to be risky, to very shallow, consisting of positive tests on each feature, if the changes are early in the release or deemed to be of low risk.
Acceptance testing can mean one of two things:
Alpha testing is simulated or actual operational testing by potential users/customers or an independent test team at the developers' site. Alpha testing is often employed for off-the-shelf software as a form of internal acceptance testing, before the software goes to beta testing.
Beta testing comes after alpha testing and can be considered a form of external user acceptance testing. Versions of the software, known as beta versions, are released to a limited audience outside of the programming team. The software is released to groups of people so that further testing can ensure the product has few faults or bugs. Sometimes, beta versions are made available to the open public to increase the feedback field to a maximal number of future users.
As in many other software related activities, the typical plan, execute and assess flow is also used in testing as depicted in the figure below.
Most of the key decisions about testing are made during this stage. During this phase an overall testing strategy is fixed by making the following decisions:
As soon as the first models are being generated (for example, usage models, system models, architectural models, etc), they can be used to generate test cases: A test case is a collection of entities and related information that allows a test to be executed or a test run to be performed. The collection of individual test cases that will be run in a test sequence until some stopping criteria are satisfied is called a test suite. IEEE Standard 610 (1990) defines test case as follows:
According to Ron Patton: "Test cases are the specific inputs that you'll try and the procedures that you'll follow when you test the software."
From a more practical point of view, a test case is composed of:
On the other hand, a test run is a dynamic unit of specific test activities in the overall testing sequence on a selected testing object. Each time a static test case is invoked, an individual dynamic test run is created.
One aspect that should be considered when planning the test cases is the sequencing of the individual test cases and the switch-over from one test run to another. Several concerns affect the specific test procedure to be used, including:
The most important activities related with test execution are:
One of the critical aspects in order to fulfill the objectives of testing is checking if the result of the test run is successful or not. In order to do so, it must be possible to observe the results of the test and determine whether the expected result was achieved or not.
Is observing the results enough? In some situations, such as in object-oriented software, the execution of a test run may have affected the state of an object. That state might also affect, in the future, the behavior of the software that has been tested. Due to that, in some situations it is helpful to examine the state of some objects before and after a test is conducted, because only a small percentage of the overall functionality of an object can be observed via the return values.
This may be in conflict with using a "black-box" testing approach, in which only events observable from the outside can be used to verify the results of a test run. However, the meaning of "observable" may be different for different software projects: outside a method? Outside an object? Outside the whole software?
When a failure is observed, it needs to be recorded and tracked until its resolution. In order to allow developers to trace the failure back to the bug causing it, it is important that detailed information about failure observations and the related activities is registered.
Not only failures must be registered: successful executions also need to be recorded, as this information is very important for regression testing.
In general for every test-run the following information should be gathered:
The results of the testing activities (i.e. the measurement data collected during test execution), together with other data about the testing and the overall environment provide valuable feedback to test execution and other testing and development activities.
Obviously, as a consequence of testing, there are some direct follow-up activities:
In order to fix an issue, it is important to follow these steps:
In order to take appropriate management decisions, some analysis can be performed on the overall testing results:
Exhaustive testing is the execution of every possible test case. Rarely can we do exhaustive testing. Even simple systems have too many possible test cases. For example, a program with two integer inputs on a machine with a 32-bit word would have 2^64 possible test cases. Thus, testing always executes a very small percentage of the possible test cases.
Two basic concerns in defining software testing have already been introduced: (1) what test cases to use (test case selection) and (2) how many test cases are necessary (stopping criterion). Test case selection can be based on the specifications (functional), the structure of the code (structural), the flow of data (data flow), or random selection of test cases. Test case selection can be viewed as an attempt to space the test cases throughout the input space. Some areas in the domain may be especially error-prone and may need extra attention. It has also been mentioned that the stopping criterion can be based on a coverage criterion, such as executing N test cases in each subdomain, or on a behavior criterion, such as testing until the error rate is less than a threshold x.
A program can be thought of as a mapping from a domain space to an answer space or range. Given an input, which is a point in the domain space, the program produces an output, which is a point in the range. Similarly, the specification of the program is a map from a domain space to an answer space.
Please also remember, that a specification is essential to software testing. Correctness in software is defined as the program mapping being the same as the specification mapping. A good saying to remember is "a program without a specification is always correct". A program without a specification cannot be tested against a specification, and the program does what it does and does not violate its specification.
A test coverage criterion is a rule about how to select tests and when to stop testing. One basic issue in testing research is how to compare the effectiveness of different test coverage criteria. The standard approach is to use the subsumes relationship.
A test criterion A subsumes test coverage criterion B if any test set that satisfies criterion A also satisfies criterion B. This means that the test coverage criterion A somehow includes the criterion B. For example, if one test coverage criterion required every statement to be executed and another criterion required every statement to be executed and some additional tests, then the second criterion would subsume the first criterion.
Researchers have identified subsumes relationships among most of the conventional criteria. However, although subsumption is a characteristic used for comparing test criteria, it does not measure the relative effectiveness of two criteria. This is because most criteria do not specify how the set of test cases will be chosen. Picking the minimal set of test cases to satisfy a criterion is not as effective as choosing good test cases until the criterion is met. Thus, a good set of test cases that satisfies a "weaker" criterion may be much better than a poorly chosen set that satisfies a "stronger" criterion.
In functional testing, the specification of the software is used to identify subdomains that should be tested. One of the first steps is to generate a test case for every distinct type of output of the program. For example, every error message should be generated. Next, all special cases should have a test case. Tricky situations should be tested. Common mistakes and misconceptions should be tested. The result should be a set of test cases that will thoroughly test the program when it is implemented. This set of test cases may also help clarify to the developer some of the expected behavior of the proposed software.
In the book "The Art of Software Testing", Glenford Myers poses
the following functional testing problem: Develop a good set of
test cases for a program that accepts three numbers, a, b,
c,
interprets those numbers as the lengths of the sides of
a triangle, and outputs the type of the triangle. Myers reports
that in his experience most software developers will not respond
with a good test set.
An approach to defining the test cases for this classic triangle problem is dividing the domain space into three subdomains, one for each type of triangle that we will consider: scalene (no sides equal), isosceles (two sides equal), and equilateral (all sides equal). We can also identify two error situations: a subdomain with bad inputs and a subdomain where the sides of those lengths would not form a triangle. Additionally, since the order of the sides is not specified, all combinations should be tried. Finally, each test case needs to specify the value of the output. The following table shows a possible solution.
Subdomain | Test Description | Test Case |
---|---|---|
Scalene | Increasing size | (3,4,5) -> Scalene |
 | Decreasing size | (5,4,3) -> Scalene |
 | Largest is second | (4,5,3) -> Scalene |
Isosceles | a=b & other side larger | (5,5,8) -> Isosceles |
 | a=c & other side larger | (5,8,5) -> Isosceles |
 | b=c & other side larger | (8,5,5) -> Isosceles |
 | a=b & other side smaller | (8,8,5) -> Isosceles |
 | a=c & other side smaller | (8,5,8) -> Isosceles |
 | b=c & other side smaller | (5,8,8) -> Isosceles |
Equilateral | a=b=c | (5,5,5) -> Equilateral |
Not a triangle | Largest first | (6,4,2) -> Not a triangle |
 | Largest second | (4,6,2) -> Not a triangle |
 | Largest third | (1,2,3) -> Not a triangle |
Bad inputs | One bad input | (-1,4,2) -> Bad inputs |
 | Two bad inputs | (-1,2,0) -> Bad inputs |
 | Three bad inputs | (0,0,0) -> Bad inputs |
This list of subdomains could be increased to distinguish other subdomains that might be considered significant. For example, in scalene subdomains, there are actually six different orderings, but the placement of the largest might be the most significant based on possible mistakes in programming.
Note that one test case in each subdomain is usually considered minimal but acceptable.
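These functional test cases can be automated almost verbatim. The following sketch assumes JUnit 5 with the junit-jupiter-params module; the classify method is just a stand-in mirroring the specification (not the author's implementation), so that the example is self-contained.

```java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
import static org.junit.jupiter.api.Assertions.assertEquals;

class TriangleFunctionalTest {

    // Stand-in implementation mirroring the specification, so the test compiles on its own.
    static String classify(int a, int b, int c) {
        if (a <= 0 || b <= 0 || c <= 0) return "Bad inputs";
        if (a >= b + c || b >= a + c || c >= a + b) return "Not a triangle";
        if (a == b && b == c) return "Equilateral";
        if (a == b || b == c || a == c) return "Isosceles";
        return "Scalene";
    }

    // One automated test case per subdomain of the table above.
    @ParameterizedTest
    @CsvSource({
        "3,4,5,Scalene",
        "5,4,3,Scalene",
        "4,5,3,Scalene",
        "5,5,8,Isosceles",
        "5,8,5,Isosceles",
        "8,5,5,Isosceles",
        "8,8,5,Isosceles",
        "5,5,5,Equilateral",
        "6,4,2,Not a triangle",
        "1,2,3,Not a triangle",
        "-1,4,2,Bad inputs",
        "0,0,0,Bad inputs"
    })
    void classifiesEachSubdomain(int a, int b, int c, String expected) {
        assertEquals(expected, classify(a, b, c));
    }
}
```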
A way to formalize this identification of subdomains is to build a matrix using the conditions that we can identify from the specification and then to systematically identify all combinations of these conditions as being true or false.
The conditions in the triangle problem might be:
- a=b or a=c or b=c (at least two sides are equal)
- a=b and b=c (all three sides are equal)
- a>=b+c or b>=a+c or c>=a+b (the lengths cannot form a triangle)
- a<=0 or b<=0 or c<=0 (bad inputs)
These four conditions can be put on the rows of a matrix. The columns of the matrix will each be a subdomain. For each subdomain, a T will be placed in each row whose condition is true and an F when the condition is false. All valid combinations of T and F will be used: with four conditions there are up to 2^4 = 16 combinations, but not all of them are possible, as some of the conditions depend on others being true or false; in this example only 8 subdomains (columns) remain. Additional rows will be used for defining possible values of a, b, and c and for the expected output of each subdomain test case.
Next table shows an example of this matrix:
Conditions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
a=b or a=c or b=c | T | T | T | T | T | F | F | F |
a=b and b=c | T | T | F | F | F | F | F | F |
a>=b+c or b>=a+c or c>=a+b | T | F | T | T | F | T | T | F |
a<=0 or b<=0 or c<=0 | T | F | T | F | F | T | F | F |
Sample Test Case | 0,0,0 | 3,3,3 | 0,4,0 | 3,8,3 | 5,8,5 | 0,5,6 | 3,4,8 | 3,4,5 |
Expected Output | Bad inputs | Equilateral | Bad inputs | Not a triangle | Isosceles | Bad inputs | Not a triangle | Scalene |
Structural testing coverage is based on the structure of the source code. The simplest structural testing criterion is every statement coverage, often called C0 coverage.
This criterion is that every statement of the source code should be executed by some test case. The normal approach to achieving C0 coverage is to select test cases until a coverage tool indicates that all statements in the code have been executed.
The pseudocode in the following table implements the triangle problem. The table also shows which lines are executed by which test cases. Note that the first three statements (A, B, and C) can be considered parts of the same node.
Node | Source | 3,4,5 | 3,5,3 | 0,1,0 | 4,4,4 |
---|---|---|---|---|---|
A | read a,b,c | * | * | * | * |
B | type="scalene" | * | * | * | * |
C | if ((a==b) || (b==c) || (a==c)) | * | * | * | * |
D | type="isosceles" |  | * | * | * |
E | if ((a==b) && (b==c)) | * | * | * | * |
F | type="equilateral" |  |  |  | * |
G | if ((a>=b+c) || (b>=a+c) || (c>=a+b)) | * | * | * | * |
H | type="not a triangle" |  |  | * |  |
I | if ((a<=0) || (b<=0) || (c<=0)) | * | * | * | * |
J | type="bad inputs" |  |  | * |  |
K | print type | * | * | * | * |
By the fourth test case, every statement has been executed. This set of test cases is not the smallest set that would cover every statement. However, finding the smallest test set would often not find a good test set.
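For readers who prefer compilable code, a possible Java translation of the pseudocode is sketched below; the node letters from the table are kept as comments, and running the four test cases under a coverage tool (JaCoCo, for example) would reproduce the statement coverage discussed above.

```java
import java.util.Scanner;

// Sketch: direct translation of the triangle pseudocode (nodes A..K).
public class Triangle {

    static String classify(int a, int b, int c) {
        String type = "scalene";                               // B
        if ((a == b) || (b == c) || (a == c)) {                // C
            type = "isosceles";                                // D
        }
        if ((a == b) && (b == c)) {                            // E
            type = "equilateral";                              // F
        }
        if ((a >= b + c) || (b >= a + c) || (c >= a + b)) {    // G
            type = "not a triangle";                           // H
        }
        if ((a <= 0) || (b <= 0) || (c <= 0)) {                // I
            type = "bad inputs";                               // J
        }
        return type;                                           // K
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);                   // A: read a, b, c
        int a = in.nextInt(), b = in.nextInt(), c = in.nextInt();
        System.out.println(classify(a, b, c));
        in.close();
    }
}
```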
A more thorough test criterion is every-branch testing, which is often called C1 test coverage. In this criterion, the goal is to go both ways out of every decision.
If we model the program of previous table as a control flow graph, this coverage criterion requires covering every arc in the following control flow diagram.
The next table shows the test cases identified with this criterion.
Arcs | Test Case: (3,4,5) | Test Case: (3,5,3) | Test Case: (0,1,0) | Test Case: (4,4,4) |
---|---|---|---|---|
ABC-D |  | * | * | * |
ABC-E | * |  |  |  |
D-E |  | * | * | * |
E-F |  |  |  | * |
E-G | * | * | * |  |
F-G |  |  |  | * |
G-H |  |  | * |  |
G-I | * | * |  | * |
H-I |  |  | * |  |
I-J |  |  | * |  |
I-K | * | * |  | * |
J-K |  |  | * |  |
Even more thorough is the every-path testing criterion. A path is a unique sequence of program nodes that are executed by a test case. In the testing matrix above, there were eight subdomains, and each of them happens to correspond to a path. In that example, there are sixteen different combinations of T and F; however, eight of those combinations are infeasible paths. That is, there is no test case that could produce that combination of T and F for the decisions in the program. It can be exceedingly hard to determine whether a path is infeasible or whether it is just hard to find a test case that executes it.
Most programs with loops will have an infinite number of paths. In general, every-path testing is not reasonable.
Next table shows the eight feasible paths in the triangle pseudocode as well as the test cases required for testing all of them.
Path | T/F | Test Case | Output |
---|---|---|---|
ABCEGIK | FFFF | 3,4,5 | Scalene |
ABCEGHIK | FFTF | 3,4,8 | Not a triangle |
ABCEGHIJK | FFTT | 0,5,6 | Bad inputs
ABCDEGIK | TFFF | 5,8,5 | Isosceles |
ABCDEGHIK | TFTF | 3,8,3 | Not a triangle |
ABCDEGHIJK | TFTT | 0,4,0 | Bad Inputs |
ABCDEFGIK | TTFF | 3,3,3 | Equilateral |
ABCDEFGHIJK | TTTT | 0,0,0 | Bad Inputs |
A multiple-condition testing criterion requires that each primitive relation condition is evaluated both true and false. Additionally, all combinations of T/F for the primitive relations in a condition must be tried. Note that lazy evaluation of expressions will eliminate some combinations. For example, in an "and" of two primitive relations, the second will not be evaluated if the first one is false.
In the pseudocode for the triangle example, there are multiple conditions in each decision statement, as displayed in the tables below; the four tables correspond, in order, to the decisions at nodes C, E, G, and I. Primitives that are not evaluated because of lazy evaluation are shown with an 'X'.
Combination | Possible Test Case | Branch |
---|---|---|
TXX | 3,3,4 | ABC-D |
FTX | 4,3,3 | ABC-D |
FFT | 3,4,3 | ABC-D |
FFF | 3,4,5 | ABC-E |
Combination | Possible Test Case | Branch |
---|---|---|
TT | 3,3,3 | E-F |
TF | 3,3,4 | E-G |
FX | 4,3,3 | E-G |
Combination | Possible Test Case | Branch |
---|---|---|
TXX | 8,4,3 | G-H |
FTX | 4,8,3 | G-H |
FFT | 4,3,8 | G-H |
FFF | 3,3,3 | G-I |
Combination | Possible Test Case | Branch |
---|---|---|
TXX | 0,4,5 | I-J |
FTX | 4,-2,-2 | I-J |
FFT | 5,4,-3 | I-J |
FFF | 3,3,3 | I-K |
Subdomain testing is the idea of partitioning the input domain into mutually exclusive subdomains and requiring an equal number of test cases from each subdomain. This was basically the idea behind the test matrix. Subdomain testing is more general in that it does not restrict how the subdomains are selected. Generally, if there is a good reason for picking the subdomains, then they may be useful for testing. Additionally, the subdomains from other approaches might be subdivided into smaller subdomains. Theoretical work has shown that subdividing subdomains is only effective if it tends to isolate potential errors into individual subdomains.
Every-statement coverage and every-branch coverage are not subdomain tests. There are not mutually exclusive subdomains related to the execution of different statements or branches. Every-path coverage is a subdomain coverage, since the subdomain of test cases that execute a particular path through a program is mutually exclusive with the subdomain for any other path.
For the triangle problem, we might start with a subdomain for each output. These might be further subdivided into new subdomains based on whether the largest or the bad element is in the first position, second position, or third position (when appropriate). Next table shows the subdomains and test cases for every subdomain.
Subdomain | Possible Test Case |
---|---|
Equilateral | 3,3,3 |
Isosceles first largest | 8,5,5 |
Isosceles second largest | 5,8,5 |
Isosceles third largest | 5,5,8 |
Scalene first largest | 5,4,3 |
Scalene second largest | 3,5,4 |
Scalene third largest | 3,4,5 |
Not a triangle first largest | 8,3,3 |
Not a triangle second largest | 3,8,4 |
Not a triangle third largest | 4,3,8 |
Bad Inputs first largest | 4,3,0 |
Bad Inputs second largest | 3,4,0 |
Bad Inputs third largest | -1,4,5 |
Data flow testing is testing based on the flow of data through a program. Data flows from where it is defined to where it is used.
A definition of data, or DEF, is when a value is assigned to a variable. For example, with respect to a variable x, nodes containing statements such as `input x` or `x = 2` would both be defining nodes.
Usage nodes (USE) refer to situations in which a variable is used by the software. Two main kinds of use have been identified:
- Computation use (C-USE): the variable appears in a computation or output statement, such as `print x` or `a = 2+x`. A C-USE is said to occur on the assignment statement.
- Predicate use (P-USE): the variable appears in a decision, such as `if x>6`. A P-USE is assigned to both branches out of the decision statement.
There are also three other types of usage node, which are all, in effect, subclasses of the C-USE type:
- Output use: the variable is used in an output statement (e.g. `print(x)`).
- Location use: the variable is used as an array index or location (e.g. `a[x]`).
- Iteration use: the variable is used to control a loop (e.g. `for (int i = 0; i <= x; i++)`).
A definition-free path, or def-free path, is a path from a definition of a variable to a use of that variable that does not include another definition of the same variable in between.
Next figure depicts the Control Flow Graph of Triangle Problem and is annotated with the definitions and uses of the variables type, a, b, and c.
More details about the control-flow procedure and examples can be found in the paper "Data Flow Testing - CS-399: Advanced Topics in Computer Science, Mark New (321917)"
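To make the DEF/USE vocabulary concrete, here is a small, purely illustrative Java fragment (not taken from the triangle program) annotated with the data-flow events for the variable x:

```java
// Hypothetical example annotated with data-flow events for the variable x.
public class DefUseExample {
    public static void main(String[] args) {
        int[] values = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        int x = 7;                        // DEF of x: a value is assigned
        int a = 2 + x;                    // C-USE of x: computation use on the assignment statement
        if (x > 6) {                      // P-USE of x: predicate use, associated with both branches
            System.out.println(x);        // output use of x (a kind of C-USE)
            a = values[x];                // location use of x as an array index (a kind of C-USE)
        }
        for (int i = 0; i <= x; i++) {    // iteration use of x controlling the loop (a kind of C-USE)
            a += i;                       // DEF and C-USE of a
        }
        System.out.println(a);            // output use of a
    }
}
```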
Random testing is accomplished by randomly selecting the test cases. This approach has the advantage of being fast and it also eliminates biases of the testers. Additionally, statistical inference is easier when the tests are selected randomly. Often the tests are selected randomly from an operational profile.
For example, for the triangle problem, we could use a random number generator and group each successive set of three numbers as a test set. We would have the additional work of determining the expected output. One problem with this is that the chance of ever generating an equilateral test case would be very small. If it actually happened, we would probably start questioning our pseudo random number generator.
Testing in the development environment is often very different than execution in the operational environment. One way to make these two more similar is to have a specification of the types and the probabilities that those types will be encountered in the normal operations. This specification is called an operational profile. By drawing the test cases from the operational profile, the tester will have more confidence that the behavior of the program during testing is more predictive of how it will behave during operation.
A possible operational profile for the triangle problem is shown in next table:
# | Description | Probability |
---|---|---|
1 | Equilateral | 20% |
2 | Isosceles - Obtuse | 10% |
3 | Isosceles - Right | 20% |
4 | Scalene - Right | 10% |
5 | Scalene - All Acute | 25% |
6 | Scalene - Obtuse Angle | 15% |
If random testing has been done by randomly selecting test cases from an operational profile, then the behavior of the software during testing should be the same as its behavior in the operational environment.
For instance, if we selected 1000 test cases randomly using an operational profile and found three errors, we could predict that this software would have an error rate of less than three failures per 1000 executions in the operational environment.
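A sketch of how test categories could be drawn at random according to the operational profile above is shown below; the cumulative-probability selection is the standard technique, while how concrete triangles would be generated for each category is deliberately left out (the class and method names are assumptions made for the example).

```java
import java.util.Random;

// Sketch: selecting test categories at random according to the operational profile.
// The categories and probabilities come from the table above; generating a concrete
// triangle for each category is not shown.
public class OperationalProfileSampler {

    static final String[] CATEGORIES = {
        "Equilateral", "Isosceles - Obtuse", "Isosceles - Right",
        "Scalene - Right", "Scalene - All Acute", "Scalene - Obtuse Angle"
    };
    static final double[] PROBABILITIES = {0.20, 0.10, 0.20, 0.10, 0.25, 0.15};

    static String sampleCategory(Random random) {
        double r = random.nextDouble();   // uniform in [0,1)
        double cumulative = 0.0;
        for (int i = 0; i < CATEGORIES.length; i++) {
            cumulative += PROBABILITIES[i];
            if (r < cumulative) {
                return CATEGORIES[i];
            }
        }
        return CATEGORIES[CATEGORIES.length - 1]; // guard against rounding error
    }

    public static void main(String[] args) {
        Random random = new Random(42);   // fixed seed so the sampling run is repeatable
        for (int i = 0; i < 10; i++) {
            System.out.println(sampleCategory(random));
        }
    }
}
```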
Often errors happen at boundaries between domains. In source code, decision statements determine the boundaries. If a decision statement is written as x<1 instead of x<0, the boundary has shifted. If a decision is written x<=1, then the boundary, x=1, is in the true subdomain. In the terminology of boundary testing, we say that the on tests are in the true domain and the off tests are values of x greater than 1, which are in the false domain.
If a decision is written x<1 instead of x<=1, then the boundary, x=1, is now in the false subdomain instead of in the true subdomain.
Boundary testing is aimed at ensuring that the actual boundary between two subdomains is as close as possible to the specified boundary. Thus, test cases are selected on the boundary and off the boundary as close as reasonable to the boundary. The standard boundary test is to do two on tests as far apart as possible and one off test close to the middle of the boundary.
Next figure shows a simple boundary. The arrow indicates that the on tests of the boundary are in the subdomain below the boundary. The two on tests are at the ends of the boundary and the off test is just above the boundary halfway along the boundary.
In the triangle example, for the primitive conditions a>=b+c or b>=a+c or c>=a+b, we could consider the boundary a = b + c; since these conditions involve three variables, the boundary is a plane in 3D space. The on tests would be two (or more) widely separated tests that satisfy the equality - for example, (8,1,7) and (8,7,1); these are both true. The off test would be in the other (false) domain and would be near the middle of the boundary - for example, (7.9, 4, 4).
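The "two on tests, one off test" recipe can be automated directly. The sketch below assumes JUnit 5 and reuses a stand-in classifier (this time accepting doubles, so that the off test (7.9, 4, 4) can be expressed).

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class TriangleBoundaryTest {

    // Stand-in classifier accepting doubles so the off test can be expressed.
    static String classify(double a, double b, double c) {
        if (a <= 0 || b <= 0 || c <= 0) return "Bad inputs";
        if (a >= b + c || b >= a + c || c >= a + b) return "Not a triangle";
        if (a == b && b == c) return "Equilateral";
        if (a == b || b == c || a == c) return "Isosceles";
        return "Scalene";
    }

    @Test
    void onTestsLieOnTheBoundary() {
        // Two widely separated points with a == b + c: the boundary itself
        // belongs to the "not a triangle" (true) subdomain.
        assertEquals("Not a triangle", classify(8, 1, 7));
        assertEquals("Not a triangle", classify(8, 7, 1));
    }

    @Test
    void offTestLiesJustInsideTheOtherSubdomain() {
        // Slightly off the boundary, near its middle: a < b + c, so it is a triangle.
        assertEquals("Isosceles", classify(7.9, 4, 4));
    }
}
```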
For large software systems, the test coverage required to ensure proper quality may be huge. Because of that, it is impossible to run all the tests manually, and mechanisms to automate the tests are used. However, it should be noted that in many cases (if not all) full automation of the procedure is impossible due to the need for manual intervention or analysis of the results. Hence, when automation is used, it should be assessed in which areas of the software functionality it is going to lead to the greatest benefits.
Among the three major test activities, preparation, execution, and follow-up, execution is a prime candidate for automation.
The testing that programmers do is generally called unit testing (aka Object Testing):
The rhythm of an Object Test is similar to any other test:
In order to facilitate this process, a number of frameworks have been built for different programming languages such as:
And many more, covering over 30 programming languages and environments.
Although the implementations are different for every environment, the concepts are the same in any of these frameworks that are known in the abstract as xUnit.
Most software developers just want to write code; testing is simply a necessary evil in our line of work. Automated tests provide a nice safety net so that we can write code more quickly, but we will run the automated tests frequently only if they are really easy to run.
What makes tests easy to run? Four specific goals answer this question:
With these four goals satisfied, one click of a button (or keyboard shortcut) is all it should take to get the valuable feedback the tests provide. Let's look at these goals in a bit more detail.
A test that can be run without any Manual Intervention is a Fully Automated Test. Satisfying this criterion is a prerequisite to meeting many of the other goals. Yes, it is possible to write Fully Automated Tests that don't check the results and that can be run only once. The main() program that runs the code and directs print statements to the console is a good example of such a test.
A Self-Checking Test has encoded within it everything that the test needs to verify that the expected outcome is correct. The Test Runner "calls us" only when a test did not pass; as a consequence, a clean test run requires zero manual effort. Many members of the xUnit family provide a Graphical Test Runner (see Test Runner) that uses a green bar to signal that everything is OK; a red bar indicates that a test has failed and warrants further investigation.
A Repeatable Test can be run many times in a row and will produce exactly the same results without any human intervention between runs. Unrepeatable Tests increase the overhead of running tests significantly. This outcome is very undesirable because we want all developers to be able to run the tests very frequently, as often as after every "save". Unrepeatable Tests can be run only once before whoever is running the tests must perform a Manual Intervention. Just as bad are nondeterministic Tests that produce different results at different times; they force us to spend lots of time chasing down failing tests. The power of the red bar diminishes significantly when we see it regularly without good reason. All too soon, we begin ignoring the red bar, assuming that it will go away if we wait long enough. Once this happens, we have lost a lot of the value of our automated tests, because the feedback indicating that we have introduced a bug and should fix it right away disappears. The longer we wait, the more effort it takes to find the source of the failing test.
Tests that run only in memory and that use only local variables or fields are usually repeatable without us expending any additional effort. Unrepeatable Tests usually come about because we are using a Shared Fixture of some sort. In such a case, we must ensure that our tests are self-cleaning as well. When cleaning is necessary, the most consistent and foolproof strategy is to use a generic Automated Teardown mechanism. Although it is possible to write teardown code for each test, this approach can result in Erratic Tests when it is not implemented correctly in every test.
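As a rough illustration of these ideas, the first test below is naturally repeatable because it only uses local, in-memory state, while the second relies on a shared fixture (a file on disk) and therefore uses a simple automated teardown (JUnit 5's @AfterEach) to stay self-cleaning. The example assumes JUnit 5 and Java 11 or later.

```java
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class RepeatabilityExampleTest {

    @Test
    void inMemoryTestIsNaturallyRepeatable() {
        // Only local state: running this a thousand times gives the same result.
        List<String> names = new ArrayList<>();
        names.add("Ada");
        assertEquals(1, names.size());
    }

    private Path sharedFile;

    @Test
    void fileBasedTestNeedsCleanupToStayRepeatable() throws IOException {
        // A shared fixture (a file on disk) would make the test unrepeatable
        // if it were left behind between runs.
        sharedFile = Files.createTempFile("fixture", ".txt");
        Files.writeString(sharedFile, "hello");
        assertEquals("hello", Files.readString(sharedFile));
    }

    @AfterEach
    void teardown() throws IOException {
        // Self-cleaning: remove the shared fixture so the next run starts fresh.
        if (sharedFile != null) {
            Files.deleteIfExists(sharedFile);
            sharedFile = null;
        }
    }
}
```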
Coding is a fundamentally difficult activity because we must keep a lot of information in our heads as we work. When we are writing tests, we should stay focused on testing rather than coding of the tests. This means that tests must be simple - simple to read and simple to write. They need to be simple to read and understand because testing the automated tests themselves is a complicated endeavor. They can be tested properly only by introducing the very bugs that they are intended to detect; this is hard to do in an automated way so it is usually done only once (if at all), when the test is first written. For these reasons, we need to rely on our eyes to catch any problems that creep into the tests, and that means we must keep the tests simple enough to read quickly.
Of course, if we are changing the behavior of part of the system, we should expect a small number of tests to be affected by our modifications. We want to Minimize Test Overlap so that only a few tests are affected by any one change. Contrary to popular opinion, having more tests pass through the same code doesn't improve the quality of the code if most of the tests do exactly the same thing.
Tests become complicated for two reasons:
The tests should be small and test one thing at a time. Keeping tests simple is particularly important during test-driven development because code is written to pass one test at a time and we want each test to introduce only one new bit of behavior. We should strive to Verify One Condition per Test by creating a separate Test Method for each unique combination of pre-test state and input.
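For instance, instead of one long Test Method that deposits, withdraws and checks everything at once, verifying one condition per test leads to two focused methods; the Account class here is a hypothetical stand-in, and JUnit 5 is assumed.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class OneConditionPerTestExample {

    // Minimal hypothetical class under test.
    static class Account {
        private int balance = 0;
        void deposit(int amount) { balance += amount; }
        void withdraw(int amount) { balance -= amount; }
        int getBalance() { return balance; }
    }

    // Each test sets up one pre-test state, applies one input and verifies one condition.
    @Test
    void depositAddsToBalance() {
        Account account = new Account();
        account.deposit(100);
        assertEquals(100, account.getBalance());
    }

    @Test
    void withdrawSubtractsFromBalance() {
        Account account = new Account();
        account.deposit(100);
        account.withdraw(40);
        assertEquals(60, account.getBalance());
    }
}
```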
The major exception to the mandate to keep Test Methods short occurs with customer tests that express real usage scenarios of the application. Such extended tests offer a useful way to document how a potential user of the software would go about using it; if these interactions involve long sequences of steps, the Test Methods should reflect this reality.
Tests should be maintained along with the rest of the software. Testware must be much easier to maintain than production software, as otherwise:
The steps of test-first design (TFD) are shown in the UML activity diagram in the next figure. The first step is to quickly add a test, basically just enough code to fail. Next you run your tests, often the complete test suite, although for the sake of speed you may decide to run only a subset, to ensure that the new test does in fact fail. You then update your functional code to make it pass the new test. The fourth step is to run your tests again. If they fail, you need to update your functional code and retest. Once the tests pass, the next step is to start over (you may first need to refactor any duplication out of your design as needed, which is what converts TFD into TDD).
Dean Leffingwell describes TDD with this simple formula:
TDD = Refactoring + TFD.
TDD completely turns traditional development around. When you go to implement a new feature, the first question you ask is whether the existing design is the best design possible to enable you to implement that functionality. If so, you proceed via a TFD approach. If not, you refactor it locally to change the portion of the design affected by the new feature, enabling you to add that feature as easily as possible. As a result you will always be improving the quality of your design, thereby making it easier to work with in the future.
Instead of writing functional code first and then your testing code as an afterthought, if you write it at all, you instead write your test code before your functional code. Furthermore, you do so in very small steps - one test and a small bit of corresponding functional code at a time. A programmer taking a TDD approach refuses to write a new function until there is first a test that fails because that function isn't present. In fact, they refuse to add even a single line of code until a test exists for it. Once the test is in place they then do the work required to ensure that the test suite now passes (your new code may break several existing tests as well as the new one). This sounds simple in principle, but when you are first learning to take a TDD approach it proves to require great discipline, because it is easy to "slip" and write functional code without first writing a new test.
An underlying assumption of TDD is that you have a testing framework available to you. Agile software developers often use the xUnit family of open source tools, such as JUnit or VBUnit, although commercial tools are also viable options. Without such tools TDD is virtually impossible. Next figure presents a UML state chart diagram for how people typically work with the xUnit tools (source Keith Ray).
Kent Beck, who popularized TDD, defines two simple rules for TDD (Beck 2003):
Beck explains how these two simple rules generate complex individual and group behavior:
For developers, the implication is that they need to learn how to write effective unit tests.
Most programmers don't read the written documentation for a system, instead they prefer to work with the code. And there's nothing wrong with this. When trying to understand a class or operation most programmers will first look for sample code that already invokes it. Well-written unit tests do exactly this - they provide a working specification of your functional code - and as a result unit tests effectively become a significant portion of your technical documentation. The implication is that the expectations of the pro-documentation crowd need to reflect this reality. Similarly, acceptance tests can form an important part of your requirements documentation. This makes a lot of sense when you stop and think about it. Your acceptance tests define exactly what your stakeholders expect of your system, therefore they specify your critical requirements. Your regression test suite, particularly with a test-first approach, effectively becomes detailed executable specifications.
Are tests sufficient documentation? Very likely not, but they do form an important part of it. For example, you are likely to find that you still need user, system overview, operations, and support documentation. You may even find that you require summary documentation overviewing the business process that your system supports. When you approach documentation with an open mind, I suspect that you will find that these two types of tests cover the majority of your documentation needs for developers and business stakeholders. Furthermore, they are an important part of your overall efforts to remain as agile as possible regarding documentation.
A significant advantage of TDD is that it enables you to take small steps when writing software. This is far more productive than attempting to code in large steps. For example, assume you add some new functional code, compile, and test it. Chances are pretty good that your tests will be broken by defects that exist in the new code. It is much easier to find, and then fix, those defects if you've written two new lines of code than two thousand. The implication is that the faster your compiler and regression test suite, the more attractive it is to proceed in smaller and smaller steps. I generally prefer to add a few new lines of functional code, typically less than ten, before I recompile and rerun my tests.
The act of writing a unit test is more an act of design than of verification. It is also more an act of documentation than of verification. The act of writing a unit test closes a remarkable number of feedback loops, the least of which is the one pertaining to verification of function.
The first reaction that many people have to agile techniques is that they're ok for small projects, perhaps involving a handful of people for several months, but that they wouldn't work for "real" projects that are much larger. That's simply not true. Beck (2003) reports working on a Smalltalk system taking a completely test-driven approach which took 4 years and 40 person years of effort, resulting in 250,000 lines of functional code and 250,000 lines of test code. There are 4000 tests running in under 20 minutes, with the full suite being run several times a day. Although there are larger systems out there, it's clear that TDD works for good-sized systems.
You are asked to write code to extract the UK postal area code from any given full UK postcode. For example, for input "SS17 7HN" the output should be "SS", and for input "B43 4RW" the output should be "B".
Before starting to code, you need to create the tests, and for doing so you need to think about the structure of the software. An obvious approach is creating a class named PostCode with a method that retrieves the postal area code (e.g. areaCode()). In that way, in order to calculate the area code, something similar to the following would be used:
PostCode postCode = new PostCode("SS17 7HN");
String area = postCode.areaCode();
Hence, the first thing that should be done is developing the test cases for such a solution, as you have been already provided with two examples, a good idea is using them as the test cases:
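A possible JUnit 5 sketch of these two test cases is shown below; note that it will not even compile at first, which is exactly the point of the next step.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class PostCodeTest {

    @Test
    void areaCodeOfSS177HNisSS() {
        assertEquals("SS", new PostCode("SS17 7HN").areaCode());
    }

    @Test
    void areaCodeOfB434RWisB() {
        assertEquals("B", new PostCode("B43 4RW").areaCode());
    }
}
```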
Depending on the programming language, you should build those test cases through the xUnit tool. If you try to execute them, both are going to fail because the class PostCode does not exist yet.
The next step is building an empty class PostCode with an areaCode() method that always returns a fixed placeholder string.
In this case, again the tests are going to fail, as the return value is not the expected one.
In the following iteration we could implement the solution for the first part of the problem: when the postal code is 8 characters long, the area code is composed of the first 2 characters. If we run the tests again, the first one ("SS17 7HN") will complete successfully whereas the second one will fail. We have added a very small piece of functionality, and we have almost immediately checked that what we added is correct.
In the following iteration we could implement the solution for the second part of the problem: when the postal code is 7 characters long, the area code is composed of the first character. If we run the tests again, both will complete successfully. Again, we have just added very few lines of code (perhaps 2-3), and have checked immediately that what we added is correct.
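After these two iterations, the (still deliberately naive) implementation could look like the following sketch:

```java
// Sketch of the PostCode class after the two iterations above:
// 8-character postcodes -> first two characters, 7-character postcodes -> first character.
public class PostCode {

    private final String fullCode;

    public PostCode(String fullCode) {
        this.fullCode = fullCode;
    }

    public String areaCode() {
        if (fullCode.length() == 8) {
            return fullCode.substring(0, 2);
        }
        return fullCode.substring(0, 1);   // 7-character postcodes
    }
}
```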
Imagine now that you are told that there are postal codes of 8 characters in which only the first one denotes the area, e.g. "W1Y2 3RD". Now you should first create a new test case for that new example. Obviously, it will fail, as the method will return "W1" instead of "W", but adding the functionality to implement the new feature should be quite safe, as it is easy to check whether, in the process of implementing it, you are breaking the existing features.
Although testing is one of the key QA activities, there are many additional actions that could be taken in order to assure that the quality of the product meets some targets. Some of them are going to be analysed in this chapter.
The best way to avoid defects is preventing them and the most common technique for doing so is Defect Causal Analysis (DCA). This type of analysis consists in identifying the causes of defects and other problems and taking action to prevent them from occurring in the future by improving the process and reducing the causes that originate the defects.
DCA can be seen as a systematic process to identify and analyze causes associated with the occurrence of specific defect types, allowing the identification of improvement opportunities for the organizational process assets and the implementation of actions to prevent the occurrence of that same defect type in future projects. DCA is used by many companies; for instance, HP has extensively used it with very good results [[SOFTWARE-FAILURE-ANALYSIS-HP]].
There are multiple methodologies to implement a DCA system, but in general, the following activities should be conducted in all of them:
Defect Identification: Defects are found by QA activities specifically intended to detect defects such as Design review, Code Inspection, function and unit testing.
Defect Classification: Once defects are identified they need to be classified. There are multiple ways and techniques to classify defects, for instance: Requirements, Design, Logical and Documentation. These categories can be again divided in second and third levels depending on the complexity and size of the product.
Orthogonal Defect Classification (ODC) [[ODC]] is one of the most important techniques used for classifying defects. It means that a defect is categorized into classes that collectively point to the part of the process which needs attention, much like characterizing a point in a Cartesian system of orthogonal axes by its (x, y, z) coordinates.
Defect Analysis: After defects are logged and classified, the next step is to review and analyze them using root cause analysis (RCA) techniques.
As doing a defect analysis for all the defects requires a big effort, a useful tool before this kind of analysis is a Pareto chart. This kind of chart shows the frequencies of occurrence of the various categories of problems encountered, in order to determine which of the existing problems occur most frequently. The problem categories or causes are shown on the x-axis of the bar graph and the cumulative percentage is shown on the y-axis. Such a diagram helps us to identify the defect types that should be given higher priority and attended to first.
For instance, the following picture shows an example of a Pareto diagram:
Root-cause analysis is the process of finding the activity or process which causes the defects and find out ways of eliminating or reducing the effect of that by providing remedial measures.
Defects are analyzed to determine their origins. A collection of such causes will help in doing the root cause analysis. One of the tools used to facilitate root cause analysis is a simple graphical technique called cause-and-effect diagram / fishbone diagram which is drawn for sorting and relating factors that contribute to a given situation.
It is important that this process uses the knowledge and expertise of the team and that it keeps in mind that the target is providing information and analysis in a way that helps implement changes in the processes that prevent defects later on.
For instance, the following picture shows an example of a Fishbone diagram:
Defect Prevention: Once the causes of the defects are known, it is key to identify actions that can be put in place to eliminate those causes. This can be achieved, for instance, with meetings where all the possible causes are identified from the cause-and-effect diagram and debated among the team. All suggestions are listed and then the ones identified as the main causes are separated out. For these causes, possible preventive actions are discussed and finally agreed among project team members.
Process Improvement: Once the preventive actions have been identified, they need to be put in place and their effectiveness verified, for instance by observing the Defect Density and comparing it with previous projects.
You can find some examples and more details about this process at [[DEFECT-PREVENTION-NUTSHELL]] and [[DEFECT-ANALYSIS-AND-PREVENTION]].
For many years, people considered that the only consumers of software were machines and that human beings were not meant to read the code after it was written. This attitude began to change in the early 1970s through the efforts of many developers who saw the value in reading code as part of a QA culture.
Nowadays, not all companies apply techniques based on reading code as part of their software development (including QA) process, but the concept of studying program code as part of the defect removal process is widely accepted as beneficial. Of course, the likelihood of these techniques being successful depends on multiple factors: the size or complexity of the software, the size of the development team, the timeline for development and, of course, the background and culture of the programming team.
Part of the skepticism towards this kind of method comes from the belief that tasks led by humans could lead to worse results than mathematical proofs conducted by a computer. However, it has been proven that simple and informal code review techniques contribute substantially to productivity and reliability in three major ways.
Code Reviews are generally effective in finding from 30 to 70 percent of the logic-design and coding errors in typical programs. They are not effective, however, in detecting high-level design errors, such as errors made in the requirements analysis process. Note that a success rate of 30 to 70 percent doesn't mean that up to 70 percent of all errors are found, but rather up to 70% of the defects that will eventually be detected (remember that we never know how many defects a piece of software contains).
Of course, a possible criticism of these statistics is that the human processes find only the easy errors (those that would be trivial to find with computer-based testing) and that the difficult, obscure, or tricky errors can be found only by computer-based testing. However, some testers using these techniques have found that the human processes tend to be more effective than the computer-based testing processes in finding certain types of errors, while the opposite is true for other types of errors. This means that reviews and computer-based testing are complementary; error-detection efficiency will suffer if one or the other is not present.
Different ways of performing code reviews exist, and in the following sections we are going to assess a few of them.
For historical reasons, formal reviews are usually called inspections. This is due to the work Michael Fagan conducted and presented in his 1976 study at IBM regarding the efficacy of peer reviews. We are going to call them Formal Code Inspections to distinguish them from other types of Code Reviews.
There is always an inspection team, which usually consists of four people. One of them plays the role of moderator, who should be an expert programmer but not the author of the program (he does not need to be familiar with the software either).
Moderator duties include:
The rest of the team consists of the developer of the code, a software architect (who could be the architect of the software under review) and a Quality Assurance engineer.
The Inspection Agenda is distributed some days in advance of the Inspection Session. Together with the agenda, the moderator distributes the software, the specification and any relevant material to the inspection team so they can become familiar with the material before the meeting takes place.
During the review session the moderator ensures that two key activities take place:
When the session is over, the programmer receives an error list that includes all the errors that have been discovered. Hence, the session is focused on finding defects, not fixing them. Despite that, on some occasions, when a problem is discovered, the review team may propose and discuss some design changes. When some of the detected defects require significant changes in the code, the review team may agree to hold follow-up meetings in order to review the code again after the changes are implemented.
The list of errors is not only used by the developer in order to fix them; it is also used by the moderator to verify whether the error checklist could be improved with the results.
The review sessions are typically very dynamic and hence the moderator is responsible not only for reviewing the code but also for keeping the session focused so that time is used efficiently (these sessions should last 90-120 minutes at most).
This kind of approach requires the right attitude, especially from the developer whose work is going to be under scrutiny. He must forget about his ego and think about the process as a way to improve the quality of his work and improve his development skills, as he usually receives a lot of feedback about programming styles, algorithms and techniques. And it is not only the developer: the rest of the team can also learn from such an open exchange of ideas.
The following diagram describes this process graphically:
The following tables describe some checklists used in formal code reviews, as explained in [[ART-OF-TESTING]].
A Walkthrough is quite similar to a "Formal Code Inspection" as it is also very formal, it is conducted by a team, and it takes place during a pre-scheduled session of 90-120 minutes. However, there is a key difference: the procedure during the meeting. Instead of simply reading the software and using checklists, the participants "play computer": a person designated as the tester comes to the meeting with a set of pre-defined test cases for the software. During the meeting, each test case is mentally executed; that is, the test data are "walked through" the logic of the program. The state of the program is monitored on paper or a whiteboard.
The test cases need not be a complete set of test cases, especially because each mental execution of a test case usually takes a lot of time. The test cases themselves are not the critical thing; they are just an excuse for questioning the developer about the assumptions and decisions taken.
Although the size of the team is quite similar (three to five people), the roles of the participants are slightly different. Apart from the author of the software and a moderator, there are two key roles in walkthroughs: a tester (who is responsible for guiding the execution of the test cases) and a secretary, who writes down all the errors found. Additionally, other participants are welcome, typically experienced programmers.
The two formal approaches described above are good and help to detect many defects. Additionally, they provide extra metrics and information about the effectiveness of the reviews themselves. However, they require a lot of effort and consume a lot of extra developer time. Many studies during the last years have shown that there are other, less formal methods that can achieve similar results while requiring less training and time.
The first one we are going to study is over-the-shoulder reviews. This is the most common and informal type of code review. An over-the-shoulder review is just that: a reviewer standing over the author's computer while the author walks the reviewer through a set of code changes.
Typically the author "drives" the review by sitting at the computer opening various files, pointing out the changes and explaining why it was done that way. Multiple tools can be used by the developer and it's usual to move back and forth between files.
If the reviewer sees something wrong, they can take different actions, such as doing a little of "pair-programming" while the developer implements the fix or just take note of the issue to be solved offline.
With cooperation tools such as videoconferencing, desktop sharing and so on, it is possible to perform this kind of review remotely, but obviously it is not as effective, as the greatest asset of this technique is the closeness between developers and the ease of taking ad-hoc actions when they are sitting together.
The key advantage of this approach is its simplicity: no special training is required and it can be done at any time without any preparation. It also encourages human interaction and cooperation. Reviewers tend to be more verbose and brave when speaking than when they need to record their reviews in a tool or database.
Of course it has some drawbacks. The first one is that, due to its informal nature, it is really difficult to enforce, i.e. there is no way (document, tool, etc...) to check whether such a review has been conducted. The second one is that, as the author is the one leading the whole process, he might omit parts of the code. The third one is the lack of traceability to check that the detected defects have been properly addressed.
The following diagram describes this process graphically.
This is the second-most common form of informal code review, and the technique preferred by most open-source projects. Here, whole files, or changes are packaged up (ZIP file, URL, Pull Request, etc...) by the author and sent to reviewers via e-mail or any other tool. Reviewers examine the files offline, ask questions and discuss with the author and other developers, and suggest changes.
Collecting the files to be reviewed was formerly a difficul task but nowadays, with Source Code Management systems such as Git, it is extremely easy to identify the files that the developer has modified and hence the changes he wants to merge into the main repository.
But SCM tools have helped not only to identifying the changes made by the developer, but also in other multiple areas such as:
Obviously, the main advantage with respect to over-the-shoulder reviews is that it can work perfectly with developers that are not based in the same place, either across a building or across an ocean. Additionally, by using this technique is extremely easy to allow multiple reviewers to review the code in parallel, in many cases, if the reviews are done in an SCM system, even anyone with access to the SCM could comment in the review, even if he/she is not a reviewer.
The main disadvantage with an over-the-shoulder review is that it takes longer as it usually requires different interactions, this could be especially painful if people are in different timezones.
In general, we could say that offline code reviews, when properly integrated into an SCM, strike a good balance between speed, effectiveness and traceability.
The following diagram describes this process graphically.
Pair programming is a development practice that incorporates continuous code review into the development process itself. It consists of two developers writing code at a single terminal, with only one developer typing at a time and a continuous free-form discussion and review.
Studies of pair programming have shown it to be very effective at both finding bugs and promoting knowledge transfer. However, having the reviewing developer so involved in the development itself is seen by many people as a risk of bias: it is harder for them to step back and critique the code from a fresh point of view. On the other hand, it could be argued that deep knowledge and understanding also gives them the capability to provide more effective comments.
The key difference with respect to the other techniques mentioned above is that introducing this way of working affects not only how QA activities are performed but also how development itself is done (i.e. you could combine all the other review techniques with different ways of developing code). Adopting it requires properly evaluating how developers are going to work in such an environment and the time required to work this way.
Each type of review is useful in its own way. Offline reviews strike a balance between time invested and ease of implementation. In any case, any kind of code review is better than none, but it should also be acknowledged that code reviews alone are not enough to guarantee the quality of the final product.
Defensive programming consists of including as many checks as possible in the software, even if they are redundant (e.g. checks made both by callers and by callees). As it is sometimes said, "maybe they don't help, but they don't harm either".
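As an illustration, the following minimal Python sketch (with hypothetical function names) shows the kind of duplication this style tends to produce, with the same validations repeated in the caller and in the callee:

```python
def charge_account(account: dict, amount: float) -> None:
    # Callee re-checks conditions "just in case", even if callers already did
    if account is None:
        raise ValueError("account must not be None")
    if amount < 0:
        raise ValueError("amount must be non-negative")
    account["balance"] -= amount


def pay_order(account: dict, amount: float) -> None:
    # Caller-side checks, duplicating the ones inside charge_account
    if account is None or amount < 0:
        raise ValueError("invalid payment request")
    charge_account(account, amount)
```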
The problem with this way of working is that, in some cases, it ends up adding a lot of redundancy "just in case", which means adding unnecessary complexity and increasing software size. And the bigger and more complex a piece of software is, the more easily defects can affect it.
The ideas behind defensive programming are interesting, but in order for them to have a positive effect, a more systematic approach should be followed.
A contract, in the real world, is an agreement between two parties in which each party expects some benefits from the contract if they meet some obligations. Both are linked, i.e. if the obligations are not met by one of the parties, there is no guarantee the benefits will happen. Those benefits and obligations are clearly documented so that there are no misunderstandings between the parties.
Imagine a courier company that offers an express service within the city of Madrid. That express service can only be provided if the customer meets some conditions (e.g. the package is within the limits, the address is valid and in Madrid, the customer pays...). If the customer meets these conditions, they get the benefit of the package being delivered within 4 hours. If the customer does not meet them, there is no guarantee they will get the express delivery benefits. The following table shows the obligations/benefits of this example:
Party | Obligations | Benefits |
---|---|---|
Client | Provide a letter or package of no more than 5 kilograms, each dimension no more than 2 meters. Pay 100 euros. Provide a valid recipient address in Madrid. | Get the package delivered without any damage to the recipient in 4 hours or less. |
Supplier | Deliver package to recipient in four hours or less. | No need to deal with deliveries too big, too heavy or unpaid. |
One important remark is that when a contract is exhaustive, there is a guarantee that all the obligations are related to the benefits. This is also called the "No hidden clauses" rule. This does not mean that the contract cannot refer to external laws, best practices, regulations...; it only means that they do not need to be explicitly stated. For instance, in case the courier fails to meet their obligations, it is highly likely that a law establishes a compensation for the customer.
It is easy to understand how the concept of contracts in the real world can be extrapolated to software development. In software, every task can be split into multiple sub-tasks; the idea of a sub-task is similar to contracting something out to a company: I create a function, module, etc. that handles one part that is essential to complete the overall task.
```
task is
    do
        subtask1;
        subtask2;
        subtask3;
    end
```
If all the subtasks are completed correctly, the task will also finish successfully. If there is a contract between the task and the subtasks, the task will have some guarantees about their completion. Subtasks in software development are typically functions, object methods, etc.
Think also about the Spotify way of working, in which they created an architecture that allows every team to deliver different parts of the Spotify client independently. It is quite similar: they divided the main task (the Spotify client) into multiple subtasks (the components of the architecture). If all the components behave properly, the final product will work properly too.
Design by Contract (DbC) is based on the definition of formal, precise and verifiable interface specifications for every software component. These specifications extend the ordinary definition of abstract types with preconditions, postconditions and invariants. Those specifications are also known as contracts.
A software contract can be defined as the combination of three different things:
This can be formalized as three questions developers must try to answer when implementing a function:
The ideal environment for Design by Contract is one in which the programming language has native support for it. Unfortunately, not many languages offer this capability, Eiffel being the best-known one that does. In those languages, the contract is part of the function definition. For instance, see the Eiffel example below:
```eiffel
class ACCOUNT
create
    make
feature
    ... Attributes as before: balance, minimum_balance, owner, open ...

    deposit (sum: INTEGER) is
            -- Deposit sum into the account.
        require
            sum >= 0
        do
            add (sum)
        ensure
            balance = old balance + sum
        end
```
In programming languages with no direct support, assertions are in most cases used as a way to implement DbC techniques, and there are libraries that try to simplify the process of defining them. An assertion is a predicate used to indicate that, if the software is in a correct state, the predicate should always be true at that point. If an assertion evaluates to false, the software is in a wrong state (e.g. the contract has been broken).
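For instance, the Eiffel deposit example above can be approximated with plain assertions in a language without native support. The following Python sketch is only illustrative (the class and attribute names are assumptions, and Eiffel's `old balance` is emulated manually):

```python
class Account:
    def __init__(self, balance: int = 0) -> None:
        self.balance = balance

    def deposit(self, amount: int) -> None:
        # Precondition (Eiffel's `require sum >= 0`)
        assert amount >= 0, "precondition violated: amount must be >= 0"
        old_balance = self.balance          # emulates Eiffel's `old balance`
        self.balance += amount
        # Postcondition (Eiffel's `ensure balance = old balance + sum`)
        assert self.balance == old_balance + amount, "postcondition violated"
```

Note that, unlike in Eiffel, a class invariant would also have to be checked by hand at the beginning and end of every public method.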
Of course, functions can still perform some checks, but only for conditions that are not part of the contract. The idea of DbC is to remove any duplication and minimize the amount of code necessary to check that the contract is met.
One question that could be raised is: "What happens if one of these conditions fails during execution?". This depends on whether assertions are monitored or not at runtime (which is usually configurable depending on developer needs), but it is not a critical aspect. The goal of DbC is implementing reliable software that works; what happens when it does not work is interesting, but not the main target.
Developers can choose from various levels of assertion monitoring: no checking, preconditions only, preconditions and postconditions, conditions and invariants...
If a developer decides not to check assertions, the assertions or contracts have no impact on system execution. If a condition is not met, the software could be in an error situation and no extra action will be taken; these are just bugs. In most cases this is the typical configuration for released products.
If a developer decides to check assertions, the effect of an assertion not being met is typically an exception being raised. The typical use case for enabling assertion checking is debugging, i.e. detecting defects not blindly but based on consistency conditions. In most cases this is the typical configuration during development and testing.
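As a small, concrete example of switching assertion monitoring on and off, standard Python removes `assert` statements when the interpreter runs with the `-O` option, so the same code can serve both the debugging configuration (contracts monitored) and the released product (contracts ignored). The function below is a hypothetical example:

```python
# account_demo.py
# Run with `python account_demo.py`    -> assertions are checked (development/debugging)
# Run with `python -O account_demo.py` -> assert statements are stripped (released product)

def withdraw(balance: int, amount: int) -> int:
    assert amount >= 0, "precondition violated: amount must be >= 0"
    return balance - amount

if __name__ == "__main__":
    print("assertion monitoring enabled:", __debug__)
    print(withdraw(100, 30))
```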
There may also be special treatment for these exceptions; for instance, Eiffel routines can have a rescue clause which expresses the alternate behaviour of the routine (similar to the clauses that occur in human contracts to allow for exceptional, unplanned circumstances). When a routine includes a rescue clause, any exception occurring during the routine's execution interrupts the execution of the body and starts the execution of the rescue clause. This can be used to shield the code in some situations.
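For languages without a rescue clause, a roughly equivalent (and less elegant) pattern can be sketched with ordinary exception handling. The following Python fragment is only an analogy, with hypothetical `do_transfer` and `restore_state` callables standing in for the routine body and the rescue logic:

```python
def run_with_rescue(do_transfer, restore_state, max_attempts: int = 2) -> None:
    attempts = 0
    while True:
        try:
            do_transfer()       # the routine body
            return              # normal termination: the contract was fulfilled
        except Exception:
            restore_state()     # "rescue": re-establish a consistent state
            attempts += 1
            if attempts >= max_attempts:
                raise           # give up and propagate the failure to the caller
            # otherwise behave like Eiffel's `retry`: execute the body again
```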
There is no practical way to guarantee that a given piece of software has no bugs. It does not matter how good our tools, methodologies and engineers are, nor how deep our inspections and testing go. In many cases the presence of bugs is tolerated as something "natural", but there are systems whose reliability and security requirements are so important that extra measures must be taken to mitigate the consequences of undetected bugs.
When a system has extreme reliability requirements, fault-tolerance solutions should be put in place. The idea behind fault tolerance consists in breaking the cause/effect relationship between bugs and failures. The result is an increase in reliability, as reliability is inversely proportional to the frequency of failures. These techniques are usually expensive, since they typically require some sort of redundancy, so they are only used in systems that require them.
For instance, the software used in flight control systems is one example of software with very extreme requirements regarding failures. The report [[CHALLENGES-FAULT-TOLERANT-SYSTEMS]] provides more details about the challenges that these kinds of systems pose to software developers.
However, in some situations failures cannot be prevented and hence reliability cannot be improved. Still, there are ways to minimize the consequences of failures with the aim of maximizing safety. It is important not to confuse reliability with safety: for instance, a medical system might not be 100% reliable, but it should be 100% safe. The techniques intended to increase system safety are called failure containment techniques.
In this chapter we are going to study both types of techniques.
Fault tolerance techniques are used to tolerate software faults and prevent system failures from occurring when a fault occurs. These kinds of techniques are used in software that is very sensitive to failures, such as aerospace software, nuclear power or healthcare.
In this case, only one instance of the software exists, and it tries to detect faults and recover from them without the need to replicate the software. These techniques are really difficult to implement, as studies have shown that efficient fault-tolerant systems require some kind of redundancy, as we will see in the next section.
Redundancy is, in real-world activities, the best way to increase reliability: multiple engines in a plane, duplicated lights in a car... For instance, NASA carried out research to calculate the probability of survival of a mission depending on the amount of redundant equipment in the spacecraft, and the results demonstrated that the survival chances are extremely dependent on redundancy, as shown in the figure below.
The same is true for software systems, but with some caveats. Redundancy in software only works if the redundant system keeps working when the original one fails: this is normal in hardware systems but not in software. If I have two identical software systems and the first one fails, it is extremely likely that the second one fails too. Because of this, it is important that redundant software systems are uncorrelated. Designing uncorrelated systems usually requires two teams, working in isolation, with different techniques, etc. This means at least duplicating the development cost.
In these systems, the multiple instances of the software developed independently can work in different configurations: N-Version Programming (NVP), Recovery Blocks (RcB), N Self-Checking Programming (NSCP)...
NVP (N-version programming):
This technique uses parallel redundancy: N copies of code fulfilling the same functionality, each a different version, run in parallel with the same inputs. When all of the N copies have completed the operation, an adjudication process (decision unit) takes place to determine the output (based on a more or less complex vote).
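A minimal Python sketch of the idea is shown below; the three deliberately different square-root implementations and the simple majority vote in the decision unit are only illustrative assumptions, not a real NVP deployment:

```python
import math
from collections import Counter

# Three "versions" of the same functionality, ideally developed by independent teams.
def sqrt_v1(x: float) -> float:
    return x ** 0.5

def sqrt_v2(x: float) -> float:
    return math.sqrt(x)

def sqrt_v3(x: float) -> float:
    # Newton-Raphson iteration, deliberately implemented differently
    guess = x / 2 or 1.0
    for _ in range(60):
        guess = (guess + x / guess) / 2
    return guess

def decision_unit(outputs, tolerance: float = 1e-9):
    # Group numerically equivalent results and require a majority agreement
    keys = [round(value / tolerance) for value in outputs]
    key, votes = Counter(keys).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority: the versions disagree")
    return outputs[keys.index(key)]

# All versions run with the same input; the decision unit adjudicates the output.
result = decision_unit([version(2.0) for version in (sqrt_v1, sqrt_v2, sqrt_v3)])
print(result)
```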
Some key characteristics of this scheme, which is depicted in the figure below, are:
Obviously, a wide range of variants of these systems has been proposed, based on multiple combinations of them [[COST-EFFECTIVE-FAULT-TOLERANCE]], and multiple comparisons of their performance are also available [[PERFORMANCE-RB-NVP-SCOP]].
Exercise 3: Recovery Blocks vs. N-Version. What do you think are the key advantages and disadvantages of the two fault tolerance techniques described (Recovery Blocks and N-Version Programming)? See the sketch below for the structure of a recovery block.
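As a structural reference for the exercise, a recovery block runs a primary algorithm first and falls back to alternate versions only when an acceptance test rejects the result. The following Python sketch is purely illustrative; the variant functions and the acceptance test are hypothetical:

```python
def recovery_block(inputs: dict, primary, alternates, acceptance_test):
    # Try the primary version first, then the alternates, in order.
    for variant in (primary, *alternates):
        checkpoint = dict(inputs)           # recovery point: work on a copy of the state
        try:
            result = variant(checkpoint)
            if acceptance_test(result):     # only accepted results leave the block
                return result
        except Exception:
            pass                            # a crash counts as a failed variant
        # the copy is simply discarded, which rolls the state back before the next try
    raise RuntimeError("all variants failed the acceptance test")
```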
Error recovery is the key part of fault-tolerant systems. However, although it is the most important one, it is the last step in a series of four stages:
Depending on how the new error-free state is calculated we could distinguish two approaches:
There is software used in safety-critical systems whose failures have severe consequences. In those situations it is very important to avoid some of the potential accidents or at least limit the damage they cause.
Various specific techniques are used for these kinds of systems, most of them based on the analysis of the potential hazards linked to the failures:
Notice that both hazard control and damage control above are post-failure activities that attempt to contain failures so that they do not lead to accidents, or so that the accident damage can be controlled or minimized. All these techniques are usually very expensive and process/technology intensive; hence they should only be applied when safety matters and to deal with the rare conditions related to accidents.