Computer Programming Principles/Maintaining/Debugging

Introduction

Debugging is the art of diagnosing errors in programs and determining how to correct them. "Bugs" come in a variety of forms, including: coding errors, design errors, complex interactions, poor user interface designs, and system failures. Learning how to debug a program effectively, then, requires that you learn how to identify which sort of problem you're looking at, and apply the appropriate techniques to eliminate the problem.

Bugs are found throughout the software life cycle. The programmer may find an issue, a software tester might identify a problem, or an end user might report an unexpected result. Part of debugging effectively involves using the appropriate techniques to get necessary information from the different sources of problem reports.

The most common types of mistakes when programming are:

Programming without thinking
Writing code in an unstructured manner

Bugs in Detail

What are these different kinds of bugs, then?

With coding errors, the source of the problem lies in erroneous or improper code written by the programmer. Some examples of coding errors include:

Disregarding adopted conventions.
Calling the wrong function ("moveUp" instead of "moveDown").
Using variables in the wrong places, such as swapped arguments ("moveTo(y, x)" instead of "moveTo(x, y)").
Failing to initialize a variable before it is used ("y = x + 1", where x has not been set).
Skipping a check for an error return.

Software users readily see some design errors, while in other cases design flaws make a program more difficult to improve or fix, and those flaws are not obvious to a user. Obvious design flaws are often demonstrated by programs that run up against the limits of a computer, such as available memory, available disk space, available processor speed, and overwhelming input/output devices. More difficult design errors fall into several categories:

Failure to hide complexity
Incomplete or ambiguous "contracts"
Undocumented side effects

Complex interactivity bugs arise in scenarios where multiple parts of a single program, multiple programs, or multiple computers interact.

Poor user interface designs often lead users to use the program in ways that accomplish something other than what they intend. For example, a "search" page for a web site might have an option for "case-insensitive" searching. When the option is hard for the user to find or see, that user might report a bug that some of their data is "lost", simply because it is not found by the default case-sensitive search.

Sometimes, computer hardware simply fails, and it usually does so in wildly unexpected ways. Determining that the problem lies not with the software itself, but with the computer(s) on which it runs, is usually complicated by the fact that the person debugging the software may not have access to the hardware that shows the problem.

Preventing Bugs

No discussion of debugging software would be complete without a discussion of how to prevent bugs in the first place. No matter how well you write code, if you write the wrong code, it won't help anyone. If you create the right code, but users cannot work the user interface, you might as well have not written the code. In short, a good debugger should keep an open mind about where the problem might lie.

Although it is outside the scope of this discussion to describe the myriad techniques for avoiding bugs, many of the techniques here are equally useful after the fact, when you have a bug and need to uncover it and fix it. Thus, a brief discussion follows.

Understand the Problem

In order to write effective software, the developer must solve the problem the user needs solved. Users, naturally enough, do not think in terms of strict algorithms, windowing systems, web pages, or command line interfaces. As a result, users may not think of problems in the same way that the developer does.

To address this difference, sit down with the intended user, and ask them what they want from the software. Users frequently want more than software can actually deliver, or have contradictory aims, such as software that does more, but doesn't require that they learn anything new. In short, ask the users what their goals are. Absent those goals, users will keep reporting bugs that do not add up to a coherent whole.

Effective Processes

Development Tools

Unit Testing

Unit testing means checking how a module behaves in each state it can enter. You should therefore prepare a "test list" that defines the possible inputs for the current module or, where there are too many to enumerate, a representative input from each class of inputs.
For example: we have a program that gets positive numbers from the user and processes them. First we need to check whether the input is a number (it could be a character), and then whether it is positive. Here, "checking" means entering the input and observing what happens.
Hint: When you start to write this test list, you'll notice that it's quite hard to predict all the possibilities. If you can ask someone else (who didn't help write the module) to contribute, it can be fruitful.
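The test list for the positive-number example can be sketched with Python's standard unittest module. The function name parse_positive_number and the specific inputs are assumptions made for illustration, not part of any particular program.

```python
import math
import unittest


def parse_positive_number(text):
    """Return text as a positive, finite float, or raise ValueError.

    Hypothetical module under test, written only for this illustration.
    """
    value = float(text)  # raises ValueError if text is not a number
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"expected a positive number, got {text!r}")
    return value


class ParsePositiveNumberTest(unittest.TestCase):
    """The "test list": one test per class of input."""

    def test_accepts_positive_numbers(self):
        for text, expected in [("1", 1.0), ("3.5", 3.5), (" 42 ", 42.0)]:
            self.assertEqual(parse_positive_number(text), expected)

    def test_rejects_input_that_is_not_a_number(self):
        for text in ["a", "", "12abc"]:
            with self.assertRaises(ValueError):
                parse_positive_number(text)

    def test_rejects_zero_and_negative_numbers(self):
        for text in ["0", "-1", "-0.5"]:
            with self.assertRaises(ValueError):
                parse_positive_number(text)

    def test_rejects_special_values(self):
        # Easy to overlook: float() accepts these strings.
        for text in ["nan", "inf"]:
            with self.assertRaises(ValueError):
                parse_positive_number(text)
```

Saved in a file, the tests can be run with "python -m unittest". The last test is the kind of case a second person often suggests: "nan" and "inf" are accepted by float(), so a check for "is it a number" alone would let them through.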

Documenting Code

Basic debugging steps

Although each debugging experience is unique, certain general principles can be applied in debugging. This section particularly addresses debugging software, although many of these principles can also be applied to debugging hardware.

The basic steps in debugging are:

Recognize that a bug exists
Isolate the source of the bug
Identify the cause of the bug
Determine a fix for the bug
Apply the fix and test it

Recognize a bug exists

Detection of bugs can be done proactively or passively.

An experienced programmer often knows where errors are more likely to occur, based on the complexity of sections of the program as well as possible data corruption. For example, any data obtained from a user should be treated suspiciously. Great care should be taken to verify that the format and content of the data are correct. Data obtained from transmissions should be checked to make sure the entire message (data) was received. Complex data that must be parsed and/or processed may contain unexpected combinations of values that were not anticipated, and not handled correctly. By inserting checks for likely error symptoms, the program can detect when data has been corrupted or not handled correctly.
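As one possible illustration of checking that an entire message was received, the sketch below prefixes each message with its length and a CRC-32 checksum. The frame layout and the names frame_message and check_message are assumptions chosen for the example, not a standard protocol.

```python
import zlib


def frame_message(payload: bytes) -> bytes:
    """Prefix payload with a 4-byte length and a 4-byte CRC-32 (assumed layout)."""
    header = len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big")
    return header + payload


def check_message(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the frame is incomplete or corrupted."""
    if len(frame) < 8:
        raise ValueError("frame is shorter than its header")
    expected_length = int.from_bytes(frame[:4], "big")
    expected_crc = int.from_bytes(frame[4:8], "big")
    payload = frame[8:]
    if len(payload) != expected_length:
        raise ValueError(
            f"expected {expected_length} bytes of payload, received {len(payload)}"
        )
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch: payload was corrupted")
    return payload
```

With such a check in place, a truncated or altered message is reported at the point where it is received, rather than surfacing later as a puzzling symptom elsewhere in the program.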

If an error is severe enough to cause the program to terminate abnormally, the existence of a bug becomes obvious. If the program detects a less serious problem, the bug can be recognized, provided error and/or log messages are monitored. However, if the error is minor and only causes the wrong results, it becomes much more difficult to detect that a bug exists; this is especially true if it is difficult or impossible to verify the results of the program.

The goal of this step is to identify the symptoms of the bug. Observing the symptoms of the problem, under what conditions the problem is detected, and what work-arounds, if any, have been found, will greatly help the remaining steps to debugging the problem.

Isolate source of bug

This step is often the most difficult (and therefore rewarding) step in debugging. The idea is to identify what portion of the system is causing the error. Unfortunately, the source of the problem isn't always the same as the source of the symptoms. For example, if an input record is corrupted, an error may not occur until the program is processing a different record, or performing some action based on the erroneous information, which could happen long after the record was read.

This step often involves iterative testing. The programmer might first verify that the input is correct, then that it was read correctly, then that it was processed correctly, and so on. In modular systems, this step can be made a little easier by checking the validity of data passed across the interfaces between modules. If the input to a module was correct but its output was not, then the source of the error is within that module. By iteratively testing inputs and outputs, the debugger can narrow the location of the error to within a few lines of code.

Skilled debuggers are often able to hypothesize where the problem might be (based on analogies to previous similar situations), and test the inputs and outputs of the suspected areas of the program. This form of debugging is an instance of the scientific method. Less skilled debuggers often step sequentially through the program, looking for a place where the behavior of the program is different from that expected. Note that this is still a form of the scientific method, as the programmer must decide which variables to examine when looking for unusual behavior. Another approach is to use a "binary search" type of isolation process. By testing the data at a point near the middle of the data/processing flow, the programmer can determine whether the error happens in the earlier or the later sections of the program. If no data problems are detected at that point, then the error is probably later in the process; otherwise it is probably earlier. Repeating the test on the suspect half narrows the search further.
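The "binary search" isolation process can be sketched as follows, under two assumptions: the program can be modelled as a list of stages applied in order, and data that has been corrupted by one stage stays detectably invalid afterwards. The stage functions and the validity check are hypothetical.

```python
def run_stages(stages, data, count):
    """Apply the first `count` stages to data and return the result."""
    for stage in stages[:count]:
        data = stage(data)
    return data


def find_faulty_stage(stages, data, is_valid):
    """Return the index of the first stage whose output fails is_valid, or None."""
    if not is_valid(data):
        raise ValueError("the input is already invalid")
    if is_valid(run_stages(stages, data, len(stages))):
        return None  # no symptom to isolate
    # Invariant: data is valid after `low` stages and invalid after `high` stages.
    low, high = 0, len(stages)
    while high - low > 1:
        mid = (low + high) // 2
        if is_valid(run_stages(stages, data, mid)):
            low = mid   # the error happens later
        else:
            high = mid  # the error happens at or before this point
    return high - 1


# Hypothetical pipeline with a deliberate bug in the third stage.
def sort_values(values):
    return sorted(values)


def double(values):
    return [2 * v for v in values]


def subtract_offset(values):
    return [v - 10 for v in values]  # bug: produces negative values


def reverse(values):
    return list(reversed(values))


def all_non_negative(values):
    return all(v >= 0 for v in values)


STAGES = [sort_values, double, subtract_offset, reverse]
faulty = find_faulty_stage(STAGES, [3, 1, 2], all_non_negative)
```

Here faulty is 2, the index of subtract_offset, found after checking the data at two intermediate points rather than after every stage. The saving grows with the number of stages, but the method depends on having a validity check that can be applied at any point in the flow.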

Identify cause of bug

Having found the location of the bug, the next step is to determine the actual cause of the bug, which might involve other sections of the program. For example, if it has been determined that the program faults because a field is wrong, the next step is to identify why the field is wrong. This is the actual source of the bug, although some would argue that the inability of a program to handle bad data can be considered a bug as well.

A good understanding of the system is vital to successfully identifying the source of the bug. A trained debugger can isolate where a problem originates, but only someone familiar with the system can accurately identify the actual cause behind the error. In some cases it might be external to the system: the input data was incorrect. In other cases it might be due to a logic error, where correct data was handled incorrectly. Other possibilities include unexpected values, where the initial assumptions were that a given field can have only "n" values, when in fact, it can have more, as well as unexpected combinations of values in different fields (field x was only supposed to have that value when field y was something different). Another possibility is incorrect reference data, such as a lookup table containing incorrect values relative to the record that was corrupted.

Having determined the cause of the bug, it is a good idea to examine similar sections of the code to see if the same mistake is repeated elsewhere. If the error was clearly a typo, this is less likely, but if the original programmer misunderstood the initial design and/or requirements, the same or similar mistakes could have been made elsewhere.

Determine fix for bug

Having identified the source of the problem, the next task is to determine how the problem can be fixed. An intimate knowledge of the existing system is essential for all but the simplest of problems. This is because the fix will modify the existing behavior of the system, which may produce unexpected results. Furthermore, fixing an existing bug can often either create additional bugs, or expose other bugs that were already present in the program, but never exposed because of the original bug. These problems are often caused by the program executing a previously untested branch of code, or under previously untested conditions.

In some cases, a fix is simple and obvious. This is especially true for logic errors where the original design was implemented incorrectly. On the other hand, if the problem uncovers a major design flaw that permeates a large portion of the system, then the fix might range from difficult to impossible, requiring a total rewrite of the application.

In some cases, it might be desirable to implement a "quick fix", followed by a more permanent fix. This decision is often made by considering the severity, visibility, frequency, and side effects of the problem, as well as the nature of the fix, and product schedules (e.g., are there more pressing problems?).

Fix and test

After the fix has been applied, it is important to test the system and determine that the fix handles the former problem correctly. Testing should serve two purposes: (1) to confirm that the fix now handles the original problem correctly, and (2) to make sure that the fix hasn't created any undesirable side effects.

For large systems, it is a good idea to have regression tests, a series of test runs that exercise the system. After significant changes and/or bug fixes, these tests can be repeated at any time to verify that the system still executes as expected. As new features are added, additional tests can be included in the test suite.
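A regression suite can be as simple as a table of recorded inputs and expected outputs that is re-run after every change. The sketch below assumes a hypothetical function normalize_name as the code under test; the cases are invented for illustration.

```python
def normalize_name(name):
    """Hypothetical function under test: collapse whitespace and title-case."""
    return " ".join(name.split()).title()


# Each entry is (input, expected output). New entries are appended as
# features are added or bugs are fixed; old entries are never removed.
REGRESSION_CASES = [
    ("alice", "Alice"),
    ("  bob   smith ", "Bob Smith"),
    ("", ""),  # e.g. added after a (hypothetical) report about empty input
]


def run_regression_suite(function, cases):
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for given, expected in cases:
        actual = function(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures
```

An empty list from run_regression_suite(normalize_name, REGRESSION_CASES) means the system still behaves as recorded. Any entry in the list points directly at an input whose behavior has changed.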

Steps to reduce debugging

There are concrete steps that can be taken to reduce the amount of time spent debugging software. These are listed in the sections below.

The correct mindset

Probably the most important thing you can do when you are starting to debug a program is to realize that you don't understand what is going on. Programmers who are convinced that their program should work fine are less likely to find errors simply because they are refusing to admit their confusion. If the program behaved the way you think it does, you wouldn't be debugging; the program would be working fine. Even when the program appears to work, if you examine it with the thought that there is at least one bug remaining and you are going to find it, then you are more likely to find something wrong with the program.

Start at the source

The time when you are most aware of where problems are more likely to arise is usually when first designing and writing the code. By inserting integrity checks at various places within the program, problems can be detected and reported by the program itself. In addition to detecting problems, considerations should be given as to how best to handle each error. Options include:

Report error, set invalid fields to a default value, and continue
Report error, discard the record associated with the invalid value, and continue
Report error, transfer invalid record into separate file/table so the user can examine and possibly correct the problem
Report error and terminate the program
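The four options above might be sketched as a policy passed to the code that performs the integrity check. The record layout (a dictionary with an "age" field), the valid range, and all names are assumptions made for the example.

```python
import enum


class OnError(enum.Enum):
    USE_DEFAULT = "default"
    DISCARD = "discard"
    QUARANTINE = "quarantine"
    TERMINATE = "terminate"


def process_records(records, policy, default_age=0):
    """Check the 'age' field of each record and handle failures per policy.

    Returns (accepted, quarantined, errors). Every failure is reported in
    errors, whichever policy is chosen.
    """
    accepted, quarantined, errors = [], [], []
    for index, record in enumerate(records):
        age = record.get("age")
        if isinstance(age, int) and 0 <= age <= 150:
            accepted.append(record)
            continue
        errors.append(f"record {index}: invalid age {age!r}")
        if policy is OnError.USE_DEFAULT:
            # Set the invalid field to a default value and continue.
            accepted.append({**record, "age": default_age})
        elif policy is OnError.QUARANTINE:
            # Set the record aside so the user can examine and correct it.
            quarantined.append(record)
        elif policy is OnError.TERMINATE:
            raise SystemExit(errors[-1])
        # OnError.DISCARD: drop the record and continue.
    return accepted, quarantined, errors
```

In a real system the errors would typically go to a log and the quarantined records to a separate file or table. Returning them as lists keeps the sketch self-contained.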

Treat user input with suspicion

Any data that originates from users (including external systems) should be treated with suspicion. Carefully validate all such input data, performing both syntactic and semantic integrity checks. Invalid data are a common source of program errors. Think not just of data entered in error, but of malicious data as well, as in buffer overflow exploits.

If data are entered interactively by users, you can provide appropriate error messages and allow the user to correct the invalid field(s). If data are not from an interactive source, then the erroneous records should be handled as described above.
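The difference between a syntactic check and a semantic check can be illustrated with a date field. This is a minimal sketch: the field (a birth date), the YYYY-MM-DD format, and the function name are assumptions chosen for the example.

```python
import datetime
import re

DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def validate_birth_date(text, today):
    """Return the date described by text, or raise ValueError saying what is wrong."""
    # Syntactic check: does the input have the expected form at all?
    if not DATE_PATTERN.fullmatch(text):
        raise ValueError("expected a date in the form YYYY-MM-DD")
    year, month, day = (int(part) for part in text.split("-"))
    # Semantic check: is it a real calendar date?
    try:
        value = datetime.date(year, month, day)
    except ValueError:
        raise ValueError(f"{text} is not a real calendar date") from None
    # Semantic check: does the value make sense for this field?
    if value > today:
        raise ValueError("a birth date cannot lie in the future")
    return value
```

For interactive input, the message of the ValueError can be shown to the user next to the offending field so that it can be corrected. For example, "2001-02-29" passes the syntactic check but fails the first semantic check, because 2001 was not a leap year.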

Use of log files

Programs that write information to log files can provide significant information that can be used to analyze what was going on before, during, and after problems are encountered. The number of entries to be searched can be reduced by creating various log files, such as a separate log for each major component of the system, plus one log file strictly for errors. Each entry should be date/time stamped so that entries from different logs can be correlated.
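One way to arrange such logs with Python's standard logging module is sketched below. The component names, the file names, and the helper build_logger are assumptions made for illustration.

```python
import logging
import os


def build_logger(component, log_dir):
    """Return a logger writing to <component>.log, plus errors.log for errors.

    Call this once per component; calling it again for the same component
    would attach duplicate handlers.
    """
    # %(asctime)s stamps every entry so that separate logs can be correlated.
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    logger = logging.getLogger(component)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep entries out of the root logger

    component_handler = logging.FileHandler(os.path.join(log_dir, f"{component}.log"))
    component_handler.setFormatter(formatter)
    logger.addHandler(component_handler)

    # A second handler copies entries of level ERROR and above to a shared file.
    error_handler = logging.FileHandler(os.path.join(log_dir, "errors.log"))
    error_handler.setLevel(logging.ERROR)
    error_handler.setFormatter(formatter)
    logger.addHandler(error_handler)
    return logger
```

A call such as build_logger("parser", "logs").error("bad record") then appears in both logs/parser.log and logs/errors.log (the directory must already exist), while informational messages stay in the component's own log.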

Test suites

A standard set of tests that can be run on demand can help find errors before they make it into production. These test cases should be automated as much as possible to reduce the effort required to run them. As new features are added to the system, additional tests should be created to exercise those features.

Change one thing at a time

When making a lot of changes, apply them incrementally. Add one change, and then test that change thoroughly before starting on the next change. This will reduce the number of possible sources of new bugs. If several different changes are applied at the same time, then it is much more difficult to identify the source of the problem. Furthermore, minor errors in different areas can interact to produce errors that never would have happened if those changes had been applied one at a time.

Back out changes that have no effect

If you make a change to fix a problem, but the program still behaves the same, back out those changes before proceeding. The fact that your changes didn't do anything indicates one of several things:

The problem is not where you think it is
The area you modified either isn't being called, or isn't being called the way you think it is
If the section you changed is not being executed, your change may have introduced new bugs that will not appear until you fix the current bug

Try another port of your application

Programs that are available on several platforms (for example, operating systems such as MS Windows, Mac OS X, or Linux, or processors such as Intel Pentium, PowerPC, or DEC Alpha) sometimes behave differently from one platform to another, especially in the errors that follow from an earlier one. Sometimes it is far easier to find the error on a different platform.

Think of similar situations

When a bug has been found, think of other places where the same mistake might have been made. Check those places and see if the same problem exists there as well.

Finding User Interface Bugs

Finding Design Bugs

Finding Coding Errors

Not every kind of program is debugged in the same way and not all techniques can be used on all types of programs.

The central tool in debugging is the debugger. This is software that runs alongside the newly written program and allows you to pause the program and inspect memory, the stack, and various other normally invisible parts of your program.

Another method of debugging is the log file. Writing out the contents of certain variables can provide valuable information on how your program performs. Writing out the name of each function when it is called can help locate the point at which an error is introduced. For finding where a program crashes, it is usually more practical to use the debugger.
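Logging the name of each function as it is called can be automated. The sketch below uses a Python decorator for this; the decorator name traced, the logger name "trace", and the example function average are assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger("trace")


def traced(function):
    """Log the function's name and arguments on entry, and its result on return."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s args=%r kwargs=%r", function.__name__, args, kwargs)
        result = function(*args, **kwargs)
        logger.debug("leave %s result=%r", function.__name__, result)
        return result
    return wrapper


@traced
def average(values):
    return sum(values) / len(values)
```

After logging.basicConfig(filename="trace.log", level=logging.DEBUG), each call to average writes an "enter" line and a "leave" line. If a call fails, for instance average([]) raising ZeroDivisionError, the log ends with an "enter" line that has no matching "leave" line, which names the function in which the error occurred.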

Large programs are hard to debug; small programs are (relatively) easy to debug. The key, then, is to turn a large program into many small programs for debugging. This is called "unit testing" and involves compiling a part of your program (a routine, a collection of related routines, a module, or even a complete subsystem) with extra code that allows it to run without the rest of the code in place.
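The "extra code" is often a stub: a small stand-in for the part of the program that the routine normally depends on. In this sketch the routine order_total, the price lookup it depends on, and the prices themselves are all hypothetical.

```python
def order_total(order, fetch_price):
    """Routine under test: total cost in cents of an order.

    order maps item names to quantities. fetch_price(item) is supplied by
    the caller; in the full program it would consult the rest of the system.
    """
    return sum(fetch_price(item) * quantity for item, quantity in order.items())


def stub_fetch_price(item):
    """Stub replacing the real price lookup with fixed, known prices."""
    prices_in_cents = {"apple": 50, "pear": 75}
    return prices_in_cents[item]


def test_order_total():
    assert order_total({}, stub_fetch_price) == 0
    assert order_total({"apple": 2, "pear": 1}, stub_fetch_price) == 175
```

Because the stub returns known values, a wrong total can only come from order_total itself, which is the point of running the routine in isolation.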

Full-screen applications (especially games) can be hard to debug, as you won't be able to see the debugger's output. One solution is to use a null-modem cable, a second computer, and a terminal program (e.g. HyperTerminal), and to send the output of the debugger through the null-modem cable to the second computer.

For example, in DOS with gdb, using a serial null-modem cable:

Configure the port with mode: mode COM2: 9600,n,8,1,none
Pipe the output to COM2 by adding >COM2 when you invoke the debugger.