Examining Undefined Behavior in C++ Programming

Programming languages such as C and C++ have their quirks, and undefined behavior is one of the nastiest. Memory safety violations, integer overflows, and unending loops can really wreck your program (and your day!). Sometimes the undefined behavior can be something as simple as accidentally dividing by zero!

According to the C99 standard, there’s a long list of undefined behaviors that can lead to crashes or incorrect program output. If you’re new to programming, rest assured you’re going to run into quite a few of these during your career. Here are three particularly noteworthy ones that you should know about:

Unknown Sequence of Events

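This is a classic example. In the snippet below, the second line both reads and modifies i, and before C++17 the C++ standard did not say which side of the assignment is evaluated first. Strictly speaking, that makes the result undefined rather than merely unspecified in those versions of the language. (C++17 fixed the order: the right-hand side is evaluated before the left-hand side.)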

i = 4;
a[i] = i++;

 

On the left-hand side, we need the value of the variable so we know which slot in the array ‘a’ will store the value. On the right-hand side, we need to calculate the value of ‘i++’, which returns 4, but then sets ‘i’ to 5. If the right-hand side is evaluated first, then a[i] will be a[5] and get the value 4. If the left-hand side is done first, then a[4] will get the value 4. We just cannot say which element of ‘a’, [4] or [5], gets the value 4.

In practice, a compiler will pick one of these two outcomes, and a given compiler will usually pick the same one every time it builds the program. Other undefined behaviors, such as trying to get a value from a pointer that hasn’t been assigned, can crash or behave differently from one run to the next.

I was interested in seeing how two well-known compilers (VC++ 17 on Windows and G++ version 7.3.0 on Ubuntu Linux) behaved with unknown sequences of events, so I wrote this short program:

   
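#include <iostream>
using namespace std;
int main()
{
	auto i = 4;
	int a[10];	// deliberately left uninitialized
	a[i] = i++;	// index and increment in the same expression
	cout << a[i] << endl;	// i is now 5, so this prints a[5]
	return 0;
}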

 

I created this in Visual Studio, then ran WSL, the Windows Subsystem for Linux, and copied it using cp from the folder into my home folder. (Never copy files from Windows using Windows utilities, as that will corrupt the WSL filesystem; you must use Linux cp from your bash session or WinSCP, which copies files via SSH.)

G++, the Linux C++ compiler, failed to compile unspec.cpp because it was a UTF-16LE file. This is the charset that Visual Studio used when saving the file. In this charset, most characters are stored in two bytes. You can see a file’s charset in Linux with the ‘file’ command.

file -i unspec.cpp

 

If you run iconv -l, it will list all available character sets in Linux. There are several hundred, but to compile, we only need the file to be UTF-8.

iconv -l

 

I ran ‘iconv’ with this command line to convert the file to UTF-8.

iconv -f utf-16le -t UTF-8 unspec.cpp -o unspec2.cpp

 

I compiled and ran it with this command:

g++ -o unspec2 unspec2.cpp ; ./unspec2

 

Visual C++ outputs -858993460, but G++ outputs 4. In other words, VC++ stored the value in a[4] and then printed the uninitialized a[5], while G++ stored the value in a[5] and printed that.

That odd value, -858993460, is used by Microsoft as a default value of a variable when a program is compiled in debug mode. It helps detect when a variable is read before it has been given a value. We’ll see that odd value again. If you compile the program in release mode, there’s no default value, and it will have whatever value is in RAM when the program runs.

Overflows

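Adding 1 to an int seems harmless enough, but int variables have a maximum value, and if you add 1 to that, it overflows. On most machines the value wraps around silently, although the standard treats signed overflow as undefined behavior, so you don’t know about it unless you do an explicit check. If that int variable was indexing an array or keeping track of a number of objects, it could crash the program.

This is one way to see it happen: the expression (x + 1 > x) is false when x + 1 wraps around, so long as x is of type int. Be aware that an optimizing compiler is allowed to assume signed overflow never happens and may treat the expression as always true, so this works only in unoptimized builds. Try the example below to see it in action.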

#include <cstdint>
#include <iostream>
using namespace std;
int main()
{
	auto i = INT32_MAX;	// largest 32-bit signed value
	if (i + 1 > i) {	// i + 1 overflows here
		cout << i + 1 << " is greater than i" << endl;
	}
	else {
		cout << i + 1 << " is not greater than i" << endl;
	}
	return 0;
}

 

Originally, I used INT_MAX, a predefined constant in C++, but only VC++ supported it unless I added #include <climits> after the #include <iostream>. Both compilers support INT32_MAX, and both output the same result.

-2147483648 is not greater than i

 

INT32_MAX is the largest value a 32-bit signed integer can hold. Adding 1 overflowed it silently to the lowest 32-bit value, and the comparison failed. There are also similar constants INT8_MAX, INT16_MAX and INT64_MAX for other int types. INT64_MAX shows the same problem with different values. INT8_MAX and INT16_MAX do not in this program, because auto makes i a plain int and the addition is carried out in int.

A Different Kind of Ordering

This example is purely C++, whereas the previous ones also applied to C.

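#include <iostream>
#include <string>
using namespace std;
class Person{
	public:
		Person(const char * _name);
	private:
		const int len;		// declared first, so initialized first
		const string name;	// declared second, so initialized second
};
Person::Person(const char * _name)
	: name(_name), len(name.length()) {	// written order is not the execution order
	cout << "Length " << name << " = " << len << endl;
}
int main()
{
	Person P("Fred");
	return 0;
}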

 

With VC++, the output is Length Fred = -858993460, which you might recognise as the undefined value seen in the first example. When compiled with the G++ compiler, the value is different each time, so I’m guessing it’s picking up whatever value lies in RAM.

This implements a Person class; a class in C++ is the definition of a type of object (in this case, it holds the name of a person). A class is a declaration of the type, and then we have to create an instance of that type. The name of that person is passed as a parameter and stored inside the instance. This line below creates an instance of Person with the name “Fred” passed in.

Person P("Fred");

 

It’s then supposed to set the private name and len members within the constructor Person::Person, using an initializer list. You use an initializer list to set const values or references before the body of the constructor is entered.

The initializer list shown below is the code between the : and the { in the Person::Person constructor. It’s used to initialize the name and len values in the class. These are declared as const, and this is the only place where you can give them a value.

 
    : name(_name), len(name.length()) {

 

Once in the body, you can no longer set those. In VC++, adding len = name.length(); after the { and before the cout will give a compiler error: l-value specifies const object.

But why the odd length output when len should be 4? It’s a bit of a “gotcha,” in that the initializer list is not executed in the order it is written (name, then len). Instead, the order depends on the order in which the len and name fields are declared in the class.

As len is declared first, len(name.length()) is executed first, but as name hasn’t been constructed at this point, an undefined value is picked up. The answer here is to reorder the definitions and declare name before len.

I’ll admit that was a bit contrived. Had I passed a C++ string, not a char *, then the initializer list could have initialized len using _name.length() instead. With a char * type, it could only do name.length() after name had been set. As a general tip, always make the ordering in the initializer list match the order of declaration.

Conclusion

Static analyzers can help catch bugs like these. A static analyzer is a program that reads the source code of your program and looks for possible errors such as uninitialized variables, functions that are never called, and many more. Some languages (e.g., Rust) have this kind of checking built into the compiler for enhanced safety. For C++, you need a separate tool such as the open-source Cppcheck, which runs on both Windows and Linux. Valgrind, for Linux and Android platforms, is a related open-source tool, but it checks the program while it runs rather than reading the source.

When you are learning C/C++, it’s possible to hit undefined behavior and get really confusing results.

One way to spot undefined behavior is comparing the program running when compiled in debug mode to the same program compiled in release mode. Results should be identical, though you’d expect the release mode version to run a bit faster. If the results are not the same, then you may have hit undefined behavior.

You should also consider using assert. This is a macro from <cassert> that checks a condition and stops your program with a message if the condition is false.

You might put one in a function to check a parameter, for example to stop the program if a value that must not be negative turns out to be negative. You might also use one in a class to catch the initializer list ordering problem by comparing the stored length len with the calculated length. Note that assert takes a single expression, so the message is attached with &&:

  assert(static_cast<size_t>(len) == name.length() && "Length of name is incorrect");

 

Good luck!
