Dead code comes in many forms, and appears in most projects at some time or another. A general definition of dead code would be “code that is unreachable”. This may be due to a function never being called, or it may be control paths within a function never being accessible. This article deals with the former.
Functions that are never called arise at two points in the code’s lifecycle: brand new code that is yet to be hooked into an existing framework, or refactoring that removes calls to other functions.
Often in the refactoring of code, functions that are no longer called are left alone as they might still be called from somewhere. This is often the case on larger projects where individuals do not know the entire codebase.
Another reason functions are not deleted is the idea that “it might be useful later”, an extension of the hoarder’s mentality.
This leads on to an interesting question: how do you know when a function is no longer called? The problem really presents itself when looking at shared object libraries. The normal assumption is that an exported function is intended to be called from some other library or executable. However, in many cases there are exported functions that are only ever called from inside the library. Just because a function is exported, does that mean it should be kept? In order to analyse a shared object library, you also need to look at all the code that uses it.
Once you have identified that your code base has dead code in it, why remove it? Probably the biggest factor is to aid the programmers in their understanding. There is no point in spending time reading and understanding code that is never called. Another major factor is having less clutter and cleaner code. This leads to identifying areas for refactoring, which often leads to better abstractions. A minor benefit is speeding up the compile and link time and reducing the size of the libraries and executables.
There are tools available that will do a coverage analysis of the source code. The tool watches the code as it is being executed and identifies parts of the code that are never reached. An advantage of this is that it can also identify unreached code paths in called functions. The disadvantage is the need for either automated or manual testing. If using automated testing, then the tests need to cover all the required use cases, which in itself is hard to do much of the time due to “fluffy”, incorrect, or outdated requirements. Automated tests are also often hard to retrofit onto a large code base. The alternative is manual testing, which means someone sitting in front of the application and exercising everything it can do. Manual testing is probably more error prone than even limited automated testing. If the tests, manual or automated, don’t fully cover the actual use cases, then it is possible that required code is incorrectly identified as unused.
The impetus behind my looking into this issue was the code base at a previous contract. There was somewhere in the vicinity of two million lines of code and of those it was estimated that somewhere between 20 and 40% was no longer used anywhere. The code was built into approximately 50 shared object libraries and 20 executables. There were only limited regression tests and no user knew everything that the system was supposed to do, which led to the idea of trying to create some tool that would analyse the libraries and executables themselves.
The general approach was to process each of the shared object libraries and extract a list of exported functions to match up with the undefined functions from the shared object libraries and executables – the theory being that whatever was exported and not called was “dead code”.
The tools that were chosen for the job were from GNU Binutils [1]: nm, c++filt, and readelf, primarily because all the code was compiled with g++.
In order to tie nm, c++filt and readelf together, some glue was needed. I chose Python.
GNU nm lists the symbols from object files. It can also extract the symbols from shared object libraries and executables. nm is capable of giving much more information than is needed for simple function usage. The parameters --defined-only and --undefined-only were used to reduce the results. These were then parsed using regular expressions to extract the mangled name.
To illustrate we have the following source for a shared object library:
shared.hpp:
#ifndef DEAD_CODE_SHARED_H
#define DEAD_CODE_SHARED_H
#include <string>
void exported_func(std::string const& param);
void unused_func(std::string const& param);
#endif
shared.cpp:
#include "shared.hpp"
#include <iostream>
void internal_func(std::string const& param)
{
    std::cout << "internal called with " << param << "\n";
}
void exported_func(std::string const& param)
{
    std::cout << "exported_called\n";
    internal_func(param);
}
void unused_func(std::string const& param)
{
    std::cout << "never called\n";
}
nm output:
g++ -shared -o libshared.so shared.cpp
tim@spike:~/accu/overload/dead-code$ nm --defined-only libshared.so
00001bd8 A __bss_start
00000740 t call_gmon_start
00001bd8 b completed.4463
00001ac4 d __CTOR_END__
00001abc d __CTOR_LIST__
00000920 t __do_global_ctors_aux
00000770 t __do_global_dtors_aux
00001bd0 d __dso_handle
00001acc d __DTOR_END__
00001ac8 d __DTOR_LIST__
00001ad4 A _DYNAMIC
00001bd8 A _edata
00001be0 A _end
00000964 T _fini
000007e0 t frame_dummy
00000ab8 r __FRAME_END__
00000900 t _GLOBAL__I__Z13internal_funcRKSs
00001bc0 a _GLOBAL_OFFSET_TABLE_
00000764 t __i686.get_pc_thunk.bx
00000700 T _init
00001ad0 d __JCR_END__
00001ad0 d __JCR_LIST__
00001bd4 d p.4462
000008a4 t __tcf_0
00000886 T _Z11unused_funcRKSs
0000085a T _Z13exported_funcRKSs
0000081c T _Z13internal_funcRKSs
000008bc t _Z41__static_initialization_and_destruction_0ii
00001bdc b _ZSt8__ioinit
The entries of interest here are the ones where the type (the letter after the hex address) is T. These are the symbols in the text (code) section.
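As a small illustration of reading the type column, the sketch below groups a few lines copied from the listing above by their type letter. The helper name symbols_by_type is made up for this example and is not part of the original script. In nm output an uppercase letter generally marks a global (external) symbol and a lowercase letter a local one, which is why only T is of interest and t is not.

```python
import re
from collections import defaultdict

# A few lines copied from the nm listing above.
SAMPLE = """\
00000740 t call_gmon_start
00000964 T _fini
00000700 T _init
000008a4 t __tcf_0
00000886 T _Z11unused_funcRKSs
0000085a T _Z13exported_funcRKSs
0000081c T _Z13internal_funcRKSs
00001bdc b _ZSt8__ioinit
"""

# Columns: hex address, one-letter type, symbol name.
NM_LINE = re.compile(r'^([0-9a-f]+)\s+(\S)\s+(\S+)$')


def symbols_by_type(nm_output):
    """Group symbol names by nm type letter (hypothetical helper)."""
    groups = defaultdict(list)
    for line in nm_output.splitlines():
        m = NM_LINE.match(line)
        if m:
            groups[m.group(2)].append(m.group(3))
    return groups


groups = symbols_by_type(SAMPLE)
print(groups['T'])  # global symbols in the text section
print(groups['t'])  # local text symbols, not visible to other binaries
```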
Here is a script that extracts the defined functions, followed by its results for libshared.so:
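#!/usr/bin/env python
import os
import re

# Matches lines such as "00000886 T _Z11unused_funcRKSs". The 8 hex digits
# assume the 32-bit addresses shown in the listing above.
exported_func = re.compile(r'[0-9a-f]{8} T (\S+)')
exported_cmd = 'nm --defined-only %s'

for line in os.popen(exported_cmd % "libshared.so").readlines():
    m = exported_func.match(line)
    if m:
        print(m.group(1))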
Results:
_fini
_init
_Z11unused_funcRKSs
_Z13exported_funcRKSs
_Z13internal_funcRKSs
Mangled names are great for identifying with regular expressions and matching, but not so good for matching with the code. This is where
c++filt
comes in.
def unmangle(name):
    # Ask c++filt for the demangled name and strip the trailing newline.
    return os.popen('c++filt ' + name).readline()[:-1]
New results:
_fini
_init
unused_func(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
exported_func(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
internal_func(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
You can see with the fully expanded names why the mangled one is easier to parse and match, so both are needed. All libraries also have the _fini and _init functions, so those can be safely ignored.
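The unmangle helper above starts one c++filt process per name. When c++filt is given no names on its command line it reads them from standard input, so a whole list can be demangled by a single process. The following is a minimal Python 3 sketch of that idea, not the script that was actually used; the function name unmangle_all and its filter_cmd parameter are assumptions made for this example.

```python
import subprocess


def unmangle_all(names, filter_cmd=("c++filt",)):
    """Demangle many names with a single child process.

    filter_cmd is an assumption of this sketch: any command that reads
    one name per line on stdin and writes one result per line on stdout.
    """
    names = list(names)
    if not names:
        return {}
    proc = subprocess.run(
        list(filter_cmd),
        input="\n".join(names) + "\n",
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # One output line per input line, in the same order.
    return dict(zip(names, proc.stdout.splitlines()))
```

Calling unmangle_all(['_Z11unused_funcRKSs', '_Z13internal_funcRKSs']) would return a dictionary from each mangled name to its demangled form, provided c++filt is on the path.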
In order to identify real usage you need to look at the libraries and executables together, so here is a program which uses the shared object library:
#include "shared.hpp"
int main()
{
    exported_func("Hello World\n");
    return 0;
}
Compile and execute:
tim@spike:~/accu/overload/dead-code$ g++ -o deadtest -l shared -L . main.cpp
tim@spike:~/accu/overload/dead-code$ ./deadtest
exported_called
internal called with Hello World
tim@spike:~/accu/overload/dead-code$
For the executables you are only interested in the undefined references, and of those ultimately only the ones that correspond to exported functions in the libraries.
tim@spike:~/accu/overload/dead-code$ nm --undefined-only deadtest
U __cxa_guard_acquire@@CXXABI_1.3
U __cxa_guard_release@@CXXABI_1.3
U getenv@@GLIBC_2.0
w __gmon_start__
U __gxx_personality_v0@@CXXABI_1.3
w _Jv_RegisterClasses
U __libc_start_main@@GLIBC_2.0
U _Unwind_Resume@@GCC_3.0
U _Z13exported_funcRKSs
U _ZNSaIcEC1Ev@@GLIBCXX_3.4
U _ZNSaIcED1Ev@@GLIBCXX_3.4
U _ZNSsC1EPKcRKSaIcE@@GLIBCXX_3.4
U _ZNSsD1Ev@@GLIBCXX_3.4
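The matching step can be sketched as follows, using a few undefined-symbol lines copied from the output above and the three function names exported by libshared.so. The helper name undefined_names and the choice to strip the @@VERSION suffix are assumptions of this sketch; the library's own exports carry no such suffix in this example.

```python
import re

# A few lines copied from the nm --undefined-only output above.
UNDEFINED = """\
U getenv@@GLIBC_2.0
w __gmon_start__
U __libc_start_main@@GLIBC_2.0
U _Z13exported_funcRKSs
U _ZNSsD1Ev@@GLIBCXX_3.4
"""

# The functions exported by libshared.so, as found earlier.
EXPORTED = {
    '_Z11unused_funcRKSs',
    '_Z13exported_funcRKSs',
    '_Z13internal_funcRKSs',
}

UNDEFINED_LINE = re.compile(r'^\s*U\s+(\S+)$')


def undefined_names(nm_output):
    """Return the undefined symbol names without any version suffix."""
    names = set()
    for line in nm_output.splitlines():
        m = UNDEFINED_LINE.match(line)
        if m:
            names.add(m.group(1).split('@')[0])
    return names


called = undefined_names(UNDEFINED) & EXPORTED
print(sorted(called))             # ['_Z13exported_funcRKSs']
print(sorted(EXPORTED - called))  # candidates for dead code
```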
Following is a simplistic script that follows the initial approach defined above.
nm2.py:
#!/usr/bin/env python
import os
import re

# The 8 hex digits assume 32-bit addresses, as in the nm output above.
exported_func = re.compile(r'[0-9a-f]{8} T (\S+)')
unknown_func = re.compile(r'\s*U (\S+)')
exported_cmd = 'nm --defined-only %s'
unknown_cmd = 'nm --undefined-only %s'
ignored_funcs = set(['_PROCEDURE_LINKAGE_TABLE_', '_fini', '_init'])


def unmangle(name):
    return os.popen('c++filt ' + name).readline()[:-1]
    # return name


class Library(object):
    def __init__(self, name):
        self.fullname = name
        self.name = os.path.basename(name)
        # Functions defined in the text section and visible to other binaries.
        self.exported = []
        for line in os.popen(exported_cmd % self.fullname).readlines():
            m = exported_func.match(line)
            if m:
                if m.group(1) not in ignored_funcs:
                    self.exported.append(m.group(1))
        # Symbols the library expects some other binary to provide.
        self.unknown = []
        for line in os.popen(unknown_cmd % self.fullname).readlines():
            m = unknown_func.match(line)
            if m:
                self.unknown.append(m.group(1))


class Binary(object):
    def __init__(self, name):
        self.fullname = name
        self.name = os.path.basename(name)
        self.unknown = []
        for line in os.popen(unknown_cmd % self.fullname).readlines():
            m = unknown_func.match(line)
            if m:
                self.unknown.append(m.group(1))


def main():
    lib = Library('libshared.so')
    bin = Binary('deadtest')
    exported = set(lib.exported)
    # Anything the executable imports from the library is in use.
    for unk in bin.unknown:
        if unk in exported:
            exported.discard(unk)
    print("Unused:")
    for func in exported:
        print("\t%s" % unmangle(func))


if __name__ == "__main__":
    main()
Executed:
tim@spike:~/accu/overload/dead-code$ ./nm2.py
Unused:
internal_func(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
unused_func(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
Here we can see that the function internal_func has been shown as not used even though it is called directly from exported_func. A tool that gave false positives like this was not going to be very useful.
Luckily it was pointed out to me that another GNU tool called readelf is able to show relocation information. There is a relocation entry for every function that is called.
The relevant lines from the results of readelf -r --wide libshared.so are shown in Figure 1.
Figure 1
More regex magic, '[0-9a-f]{8}\s+[0-9a-f]{8}\s+\S+\s+[0-9a-f]{8}\s+(\S+)', gives a way to identify the function calls. Once these are eliminated from the exported list, we are left with only one function: unused_func.
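A rough sketch of this last step is shown below. The lines in SAMPLE are made up for illustration, in the general shape of 32-bit readelf -r --wide output (offset, info, type, symbol value, symbol name); they are not the contents of Figure 1, and the helper name relocated_names is likewise an assumption of this example.

```python
import re

# Hypothetical relocation lines, NOT real output from libshared.so.
SAMPLE = """\
00001bb4  00001207 R_386_JUMP_SLOT        0000081c   _Z13internal_funcRKSs
00001bb8  00000507 R_386_JUMP_SLOT        00000000   __cxa_atexit
00001ba0  00000008 R_386_RELATIVE
"""

# offset, info, relocation type, symbol value, then the symbol name.
RELOCATION = re.compile(
    r'[0-9a-f]{8}\s+[0-9a-f]{8}\s+\S+\s+[0-9a-f]{8}\s+(\S+)')


def relocated_names(readelf_output):
    """Collect the symbol names that have a relocation entry."""
    names = set()
    for line in readelf_output.splitlines():
        m = RELOCATION.match(line)
        if m:
            names.add(m.group(1))
    return names


# The two functions the simplistic script reported as unused.
exported = {'_Z11unused_funcRKSs', '_Z13internal_funcRKSs'}
unused = exported - relocated_names(SAMPLE)
print(sorted(unused))
```

Lines without a symbol name, such as the R_386_RELATIVE entry in the sample, do not match the pattern and are skipped.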
Conclusion
The script ended up taking about 15 to 20 minutes to run (mainly due to an inefficiency in the c++filt calling that I never got around to fixing) but returned around three or four thousand functions that were no longer called. The script does still show false positives, though, as it is not able to determine when a function is called through a pointer to a function or a pointer to a member function. It did, however, give a good starting point for reducing the dead code.
Tim Penhey tim@penhey.net
Thanks
Thanks to Paul Thomas on the accu-general list for pointing out readelf to me.
References
[1] GNU Binutils: http://www.gnu.org/software/binutils/