C++ Modules: A Brief Tour

C++20’s long-awaited module system has arrived. Nathan Sidwell presents a tourist’s guide.

One of the major C++20 features is a module system. This has been a long time in coming. The idea predates C++98; it is about time C++ caught up with other languages! In this article, I’ll show 3 example programs, using progressively more advanced organization of code. There are a number of call-out boxes answering a few questions the main text might suggest. You can read those separately.

The road to standardization

ISO Working Group 21 (WG21) is responsible for C++. It holds meetings 3 times a year, to discuss new features and resolve issues with existing features. These physical meetings are on hold now, and various subcommittees hold virtual ones.

In 2016 a Technical Specification (N4592) was published, which specified a modules system. As implementors (such as me) experimented with this, a number of changes or clarifications were made during its path to incorporation into C++20.

Because of the pervasive use of header files as the way of describing interfaces, a particular difficulty is solving what may be phrased as the ‘how do we get there from here?’ problem. That took up a significant fraction of design and implementation effort.

Let’s start with a simple example showing some modular concepts. Listing 1 is a module interface file – this is the source file that provides importable entities to users of the module.

// file: ex1/hello.cc
module;
// legacy includes go here – not part of this module
#include <iostream>
#include <string_view>
export module Hello;
// the module purview starts here
// provide a function to users by exporting it
export void SayHello
  (std::string_view const &name)
{
  std::cout << "Hello " << name << "!\n";
}

Listing 1

The name of the file containing that code can be anything, but let’s put it in hello.cc. Listing 2 is a user of that module.

// file: ex1/main.cc
import Hello; // import the Hello module,
              // its exports become available
#include <string_view>
int main ()
{
  SayHello ("World");
}

Listing 2

We can compile our program using a module-aware GCC¹ with:

  > cd ex1
  > g++ -fmodules-ts -std=c++20 -c hello.cc
  > g++ -fmodules-ts -std=c++20 -c main.cc
  > g++ -o main main.o hello.o
  > ./main
  Hello World!

You’ll notice there are some differences to using header files:

You compile the module interface, just as a regular source file.
The module interface can contain non-inline function definitions.
You need to compile the interface before you compile sources that import it.

Do we need a new source suffix?

Often other module tutorials use a new source file suffix for the module interface file. This is user choice. The compiler doesn’t need a new suffix – it’s all still C++. Adding a new suffix means teaching your entire toolchain about the new suffix, which was too fiddly for me, and I control the compiler and am completely at home in an emacs config file!

As described in the build-systems box, prescanners need to scan all your sources, not just interface files, so they gain nothing from distinguished interface names. If you do want to distinguish your interfaces, for the same reasons it’s useful to distinguish header files from source files, you could augment another part of the filename – a -I.cc ending maybe? As we’ll see further down, there are variations on module interfaces – should they be distinguishable from each other? (With yet more suffixes?)

Part of the reason may be due to history. The modules-ts did not have a specific syntax to denote a module interface, as opposed to a module implementation. The compiler had to be told via command line switch. It was one of my first contributions to suggest in-file syntax should make it clear.

The interface is a regular source file. It just happens to create an additional artefact to the usual object file – a Compiled Module Interface (CMI). That CMI is read by importers of the module, and then code can refer to entities exported by the module. It is this dependency that forces the compilation ordering. In this particular case, the CMI contains information about the SayHello function’s declaration, but not (necessarily) about its body. If SayHello was an inline function, the body would also (most likely) be present in the CMI.

What is a Compiled Module Interface (CMI) and how is it used?

I’ve described a module interface as producing a CMI. That’s a common implementation technique, but the standard itself makes no mention of such things, nor does it require them (the standard says nothing about object files either, by the way). Different compilers have taken different approaches to the CMI. For instance, Clang’s CMI represents the entire source file, and is another step in the compilation sequence, from which the object file can then be generated. GCC generates the CMI as a separate artefact containing only the items required by importers. The CMI is a serialization of the compiler’s internal representation, but a more mergeable form than the usual PreCompiled Header (PCH) mechanism.

Rather than distribute source code, could one distribute a CMI? Not really. The CMI contains target CPU-specific information in addition to being very compiler-specific. Besides, users of a module will probably need source code to aid debugging. As mentioned above, the CMI may not contain all the source code information, so an object file would be needed too.

Why is the CMI not general? C++ itself requires certain architectural features to be locked down in the front end. For instance, sizeof(int) – consider instantiating a template on that value, we have to know what it is to get the correct specialization. Other pieces of the C++ language are implementation-defined, and to be portable all implementations would need to have the same behaviour. Underlying ABI decisions make themselves visible in the C++ front end, as it may or may not need to create temporaries in passing and returning. Don’t forget, different Standard Libraries are not binary compatible – you cannot link object files built against different library implementations.

Command line options also affect source. For instance -std=c++20 will allow rather different code, and enable different standard library code to -std=c++17. If you disable exceptions with -fno-exceptions, you’ll have differences in the internal representation streamed. The CMI data is probably tightly related to several command line options.

While CMIs might not be interchangeable, both GCC and Clang have extended the Itanium ABI so that their object files remain link-compatible.

The CMI is a caching artefact, recreateable on demand.
We already have a code distribution mechanism. It is source code.

You may notice that modules are not namespaces. A module can export names in any namespaces it chooses. The namespaces are common across all modules, and many modules can export names into the same namespace. An importer still has to use a qualified name (or deploy using-directives) to refer to exports that a module places in a namespace.

You’ll also have noticed that the main program had to #include <string_view>, even though the interface had already done so. The interface had done this in part of the file that precedes the module itself, and that part is not visible to importers. As the user code needs to create a std::string_view, it needs the header file itself. The header include and the import can be in any order. I’ll get more into detail about this later, as it is an important bridge from today’s code to the future’s module code.

Export

You’ll see the example used the resurrected export keyword in two places:

export module Hello; // declare the interface of a module
export void SayHello (…); // make a declaration visible to importers

The first use is a module-declaration, a new kind of declaration specifying the current source file is part of a module. You can only have at most one of them, and there are restrictions on what can appear before it. The intent is that you won’t get surprised with it buried in the middle of a file. As you might guess, there’s a variant of the module-declaration, which lacks the export keyword. I’ll get to that later.

The second use allows you to make parts of the module interface visible to importers, and most importantly its lack allows you to keep parts of the interface private to the module. Only namespace-scope nameable declarations can be exported. You can’t export (just) a member of a class, nor can you export a specific template specialization (specializations are not found by name). You cannot export things from within an anonymous namespace. You can only export things from the interface of a module (see Listing 3).

export module example;
// You can export a class.
// Both it and its members are available (usual
// access restrictions apply)
export class Widget { … };
namespace Tool {
 // export a member of a namespace
 export void Frobber ();
}
// export a using declaration (the used things must
// be exported)
export using Tool::Frobber;
// export a typedef
export using W  = Widget;
// export a template definition. Users can
// instantiate it
export template<int I> int Number () { return I;}
// you cannot explicitly export a specialization,
// but you can create them for importers to use
template<> int Number<0> () { 
  return -1; /* Evil! */ }

Listing 3

If you export something, you must export it upon its first declaration. This is like declaring something static – you have to do so on its first declaration, but a later redeclaration can omit the static. In fact, export is described in terms of linkage – it’s how you get external-linkage from inside a module. Declarations with external linkage are nameable from other modules.

Module ownership

Module ownership is a new concept. Declarations in the purview (after the module-declaration) of a module are owned by that module. No other module can declare the same entity. The module specification has been carefully designed to not require new linker technology. In general, module ownership can be added to the symbol-name of an entity, at the object-file level. You’re probably familiar with overloaded functions having mangled names, so that the linker can distinguish between int Frob (int) and int Frob (double). Module ownership can be implemented by extending that mangling, and that is just what the Itanium ABI does (used on Linux and many other systems).

However, there is a design trade-off. Should exported names be link-compatible with their non-modular equivalent? I.e. is it possible to create a header file with just the exported declarations of a module, and have that useable in module-unaware code? An alternative way of phrasing the question is whether modules exporting the ‘same’ entity should result in multiple definition errors (it is ill-formed). The Itanium ABI takes that approach, which is known as weak ownership.

The alternative strong ownership includes the module ownership in the exported symbols too, or uses new linker technology.

So, what happens if you omit the export inside a module? In that case, you get a new kind of linkage – module-linkage. Declarations with module-linkage are nameable only within the same module. As a module can consist of several source files, this is not like the internal-linkage you have with static. It does mean that two modules could both have their own int Frob (int) functions, without placing them into globally unique namespaces.

Types (including typedefs) can be exported (or not exported), in the same way as functions and variables. Types already have linkage (but typedefs do not). Usually we don’t think about that, because we use header files to convey such information and they textually include the class or typedef definition. Modules have a more rigorous formulation of the linkage of these entities, which do not themselves generate code (and hence object-level symbols).

You can also export imports (see Listing 4).

// file: ex2/hello.cc
module;
#include <iostream>
export module Hello;
export import <string_view>;
// importers get <string_view>

using namespace std; // not visible to importers
export void SayHello (string_view const &name)
{
 cout << "Hello " << name << "!\n";
}
// file: ex2/main.cc
// same contents as ex1/main.cc

Listing 4

Here I’ve imported and re-exported <string_view>, (wait, what? importing a header file!? I’ll get to that) so that users do not need to #include (or import) it themselves. To build this, you will need to process <string_view>:

  > cd ex2
  > g++ -fmodules-ts -std=c++20 -c \
    -x c++-system-header² string_view
  > g++ -fmodules-ts -std=c++20 -c hello.cc
  > g++ -fmodules-ts -std=c++20 -c main.cc
  > g++ -o main main.o hello.o
  > ./main
  Hello World!

World in transition

So, how do I write my lovely new modules, but have them depend on olde worlde header files? It’d be unfortunate if it could only use modules. Fortunately there’s not one, but two ways to do this (with different trade-offs).

New keywords

C++ already had export as a keyword: exported templates were removed in C++11, but the keyword remained reserved. However, module and import are new. Will that cause problems? There is known code that uses module and import as identifiers in their external interfaces. It would cause difficulty if those suddenly became unusable.

The C++ committee took care to specify that the lexing and parsing of module and import declarations is context sensitive. Code using those tokens as identifiers will largely be unaffected, and where it is affected there are simple formatting workarounds.

You saw the first way in the early example. We had a section of the source file before the module-declaration. That section is known as a Global Module Fragment (GMF). It’s introduced by a plain module; sequence, which must be the first tokens of the file (after preprocessing and comment stripping). If there is such a GMF, there must be a module-declaration – you can’t just have an introduced GMF (why would you need that?). The contents of the GMF must entirely consist of preprocessing directives (or comments). You can have a #include there, but you can’t have the contents of that #include directly in the top-level source. The aim of this design is to make scanning for the module-declaration simple. Both the introductory module; and the module-declaration must be in the top-level source, unobscured by macros.

In this way, modules can get access to regular header files, and not reveal them to their users – we get encapsulation that is, in general, impossible with header files. Hurrah!

There is a missed opportunity with this kind of scheme. The compiler still has to tokenize and parse all those header files, and we might be blocking the compilation of source files that depend on this module. That’s unfortunate. Another scheme to address this is header-units. Header units are header files that have been compiled in an implementation-specified mode to create their own CMIs. Unlike the named-module CMIs that we’ve met so far, all header-unit CMIs declare entities in the Global Module. You can import header-units with an import-declaration naming the header-file:

  import <iostream>;

This import can be placed in the module’s purview, without making it visible to importers.

Naturally, as header-units are built from header files, there are issues with duplicate declarations and definitions. But we can make use of the One Definition Rule, and extend it into this new domain. Thus header-units may multiply declare or define entities, and be importable into a single compilation. Unlike header files, importing a header-unit is not affected by macros already defined at the point of the import – the meaning of the header-unit is determined by the macros defined when it was compiled to a CMI.

Not all header files are convertible to header-units. The goal here is to allow most of them to be, generally the well-behaved header files. This work derives from Clang-modules, which was an effort to do this seamlessly without changing source code.

One thing header-units do, which named modules do not, is export macros. This was unfortunately unavoidable as so many header files expose parts of their interface in the form of macros. Named-modules never export macros, even from re-exported header-units.

Implementations

I know of 4 popular compiler front ends that are on the path of implementing C++20 modules support.

The Edison Design Group FE. This is a popular front end for many proprietary compilers. They implemented the original export specification and gained an awful amount of knowledge about pitfalls in combining translation units. I do not know the current state of the implementation, nor do I know implementation details.
The Microsoft Compiler. This is complete, or very nearly so. The main architect of the modules-ts, Gabriel dos Reis, is at Microsoft, and has guided that design. I believe this is the most complete implementation.
Clang. The Clang FE has provided an implicit module scheme for use with header files for some time. Much of that experience went into the header-unit design. Richard Smith of Google has guided that design.
GCC. I have been working on an implementation for GCC. This is currently on a development branch, and not in a released version. Godbolt (https://godbolt.org) provides it as an available compiler ‘x86-64 GCC (modules)’. The current status (along with build instructions and list of unimplemented features) is described at https://gcc.gnu.org/wiki/cxx-modules. I try not to regress its state, and I hope to merge it soon.¹

The latter 3 implementations (at least) have a lazy loading optimization. Importing a module does nothing beyond annotating symbol tables to note that an import contains something with a particular name. It is only when user code mentions a name that the relevant parts of the import are read in. The same is true for the macros of header-units. Thus importing is even cheaper, relative to #including, than might be expected.
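As a rough mental model of lazy loading, here is a toy sketch in plain C++. It is not how any of these compilers is actually structured: LazySymbolTable, note_import and lookup are names made up for this illustration, and a string stands in for a deserialized declaration. Importing only records that a name exists, and the (pretend) deserialization runs the first time the name is looked up.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy model only: a real compiler streams its internal
// representation, not strings.
struct LazySymbolTable {
  struct Entry {
    std::function<std::string ()> load;  // deferred deserialization
    bool loaded = false;
    std::string decl;                    // filled in on first use
  };

  std::map<std::string, Entry> entries;
  int loads = 0;  // how many entries have actually been read in

  // Importing merely notes that the import provides NAME.
  void note_import (const std::string &name,
                    std::function<std::string ()> loader)
  {
    entries[name].load = std::move (loader);
  }

  // Name lookup triggers loading of just the entity mentioned.
  // Returns nullptr when no import provides NAME.
  const std::string *lookup (const std::string &name)
  {
    auto it = entries.find (name);
    if (it == entries.end ())
      return nullptr;
    Entry &entry = it->second;
    if (!entry.loaded)
      {
        entry.decl = entry.load ();
        entry.loaded = true;
        ++loads;
      }
    return &entry.decl;
  }
};
```

Noting a thousand names costs a thousand table entries; only the names that user code goes on to mention pay for loading, and each pays once.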

Apologies to any other C++ compilers that I have failed to mention.

¹ Det er vanskeligt at spaa, især naar det gælder Fremtiden. [It is difficult to make predictions, especially about the future.] Probably a Dane other than Niels Bohr, https://quoteinvestigator.com/2013/10/20/no-predict/.

Splitting a module

So far I’ve only shown a module consisting of a single interface file. You can split a module up in two different ways.

The simplest way is to provide module-implementation files, distinct from the interface. An implementation file just has a module-declaration lacking the export keyword (it doesn’t export things). While a module must have only one interface file, it can have many implementation files (or none at all). The implementation files implicitly import the interface’s CMI, but themselves only produce an object file. If you think about modules as glorified header files, then this is the natural separation of interface and implementation (but you’re probably missing out).
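A minimal sketch of that interface/implementation split might look like the following. The ex4 directory, the file names and the Log helper are assumptions made up for this illustration; they are not part of the article’s example set. Log is deliberately not exported, so it has module-linkage: both source files of Hello can name it, but importers cannot.

```cpp
// file: ex4/hello.cc (hypothetical; the module interface)
export module Hello;
// exported: nameable by importers
export void SayHello (char const *name);
// not exported: module-linkage, nameable only inside Hello
void Log (char const *what);

// file: ex4/hello-impl.cc (hypothetical; an implementation file)
module;
#include <iostream>
module Hello; // no export keyword: implicitly imports
              // the interface's CMI
void Log (char const *what)
{
  std::clog << "[Hello] " << what << '\n';
}
void SayHello (char const *name)
{
  Log ("greeting");
  std::cout << "Hello " << name << "!\n";
}

// file: ex4/main.cc (hypothetical)
import Hello;
int main ()
{
  SayHello ("World");
  // Log ("oops"); // would be an error: not visible to importers
}
```

With a module-aware GCC this should build in the same manner as the first example: compile hello.cc first (it produces the CMI), then hello-impl.cc and main.cc, then link all three object files.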

The interface itself can be separated into module-partitions. Partitions have names containing exactly one :. These themselves can be interface or implementation partitions depending on whether their module-declaration has the export keyword or not. Interface partitions may export entities, just as the primary interface does. These interface partitions must be re-exported from the primary interface. The partitions may also be imported into any unit of the same module.

Partitions provide a way to break a large interface into smaller chunks.
Partitions are not importable into different modules. The partitions are invisible outside of their module.
Implementation partitions provide a way to make certain definitions available inside the module only, while users remain aware that the type exists (for instance).

For example we could break our original example up as shown in Listing 5.

// file: ex3/hello-inp.cc
module;
#include <string_view>
// interface partition of Hello
export module Hello:inter; 
export void SayHello
  (std::string_view const &name);

// file: ex3/hello-imp.cc
module;
#include <iostream>
// implementation partition of Hello
module Hello:impl; 
import :inter; // import the interface partition
import <string_view>; 

using namespace std;
void SayHello (string_view const &name)
// matches the interface partition’s exported
// declaration
{
  cout << "Hello " << name << "!\n";
}

// file: ex3/hello-i.cc
export module Hello;
// reexport the interface partition
export import :inter; 
import :impl; // import the implementation partition
// export the string header-unit
export import <string_view>;  
// file: ex3/main.cc
// same contents as ex1/main.cc

Listing 5

In the primary interface, the three imports can be in any order. That’s one of the design goals – import order is unimportant. You can see that the import syntax for a partition doesn’t name the module. That’s also important, so that there is no temptation to import into a different module.

Here are the build commands:

  > cd ex3
  > g++ -fmodules-ts -std=c++20 -c \
    -x c++-system-header string_view
  > g++ -fmodules-ts -std=c++20 -c hello-inp.cc
  > g++ -fmodules-ts -std=c++20 -c hello-imp.cc
  > g++ -fmodules-ts -std=c++20 -c hello-i.cc
  > g++ -fmodules-ts -std=c++20 -c main.cc
  > ar -cr libhello.a hello-{i,inp,imp}.o 
  > g++ -o main main.o -L. -lhello
  > ./main
  Hello World!

Note that in this example there was no need to import the implementation partition – it had no semantic effect.

Module ABI stability

An important part of module interface design is control of the aspects that are visible to users. Generally, the parts of the interface that can result in the importer emitting code are part of the ABI of your module. You want to control that.

The One Definition Rule

The One Definition Rule specifies that in a complete program there can only be One Definition of certain types of entities. And for those that can have multiple definitions (class, inline function, template instantiations), it places restrictions specifying how all those definitions are equivalent. It is the source of many ‘ill-formed, no diagnostic-required’ clauses in the standard – you the poor user get to figure it out.

Modules, and header-units, make it much harder to have silent ODR violations, which is good. The down side is you shouldn’t be surprised when it finds ODR violations in your existing code as you convert to modules. At least you’ll get a diagnostic rather than a land mine.

Every exported inline function’s body is visible to importers (they need to refer to the entities it names), and changing the body can change the ABI of a module. To that end, one significant change has been made to in-class function definitions. They are no longer implicitly inline in a module’s purview! Implicitly-declared member functions are still inline, as are lambdas. This means you no longer have to separate the definitions of your non-inline member functions (including template definitions) from their in-class declaration.

How will build systems be affected?

C++ build systems will need to change. The hardest build is build-from-scratch, when one does not have dependency information from a previous build.

In a header-only world, their problem was much simpler – all source files can be built in parallel. Unless of course there are generated headers, and usually build systems suck with those. But now, we have interdependencies between source files. We cannot build the importers of module Foo, until we’ve built module Foo – we have the equivalent of generated headers all over the place! To make the problem harder, there’s no defined mapping between module names and the source file name of the interface.

There are essentially two approaches to solving this.

Prescanners. A prescan stage processes all the source files. Fortunately the design is such that this scanning is relatively simple, if one’s happy with over-estimating the dependencies. The module-declarations and import-declarations must appear on lines by themselves, without the module, export or import keywords being obscured by macros. They’re pretty much like preprocessor directives without the leading #. If one ignores everything else – including #if lines, one will get the maximal set of dependencies. To further simplify, all the imports of a module must appear immediately after the module-declaration – you can’t place one later in the middle of the module. For more accurate dependencies one would have to track #ifs and macros. With this information computed, the build can launch compiles in the correct order, and inform each of the locations of the Compiled Module Interfaces (CMIs) it will require.
Dynamic build graph. The compiler could consult an oracle whenever it meets an import-declaration, and inform the same oracle whenever it meets a module-declaration and produces a CMI. If the oracle is the build-system, it can modify its dependency graph, build the needed CMI(s) and then inform the compiler of the location. Because of the requirement of import placement, this can even be parallelized somewhat!

In both cases the determined build graph can be retained for a subsequent incremental build.
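To give a feel for how little a crude prescanner needs to do, here is a sketch in plain C++. UnitDeps and scan_unit are hypothetical names, and the simplifications are considerable: it assumes each declaration sits on one line, it ignores #if (so it over-estimates), it does not understand block comments or string literals, and it returns partition imports such as :inter unresolved, leaving the caller to prefix the module name.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

struct UnitDeps {
  std::string module_name;           // empty: not a module unit
  bool is_interface = false;         // had the export keyword
  std::vector<std::string> imports;  // as written in the source
};

static std::string trim (const std::string &s)
{
  const char *ws = " \t\r";
  auto b = s.find_first_not_of (ws);
  if (b == std::string::npos)
    return "";
  return s.substr (b, s.find_last_not_of (ws) - b + 1);
}

// If TEXT begins with keyword KW (not merely an identifier that
// starts with KW), strip it and return true.
static bool eat_keyword (std::string &text, const std::string &kw)
{
  if (text.compare (0, kw.size (), kw) != 0)
    return false;
  if (text.size () > kw.size ())
    {
      unsigned char next = text[kw.size ()];
      if (std::isalnum (next) || next == '_')
        return false;
    }
  text = trim (text.substr (kw.size ()));
  return true;
}

UnitDeps scan_unit (const std::string &source)
{
  UnitDeps deps;
  std::istringstream in (source);
  std::string line;
  while (std::getline (in, line))
    {
      std::string text = trim (line);
      bool exported = eat_keyword (text, "export");
      bool is_module = eat_keyword (text, "module");
      bool is_import = !is_module && eat_keyword (text, "import");
      if (!is_module && !is_import)
        continue;
      auto semi = text.find (';');
      if (semi == std::string::npos)
        continue;  // split across lines: not handled here
      std::string name = trim (text.substr (0, semi));
      if (name.empty ())
        continue;  // a plain "module;" introducing the GMF
      unsigned char c = name[0];
      if (!(std::isalpha (c) || c == '_' || c == ':'
            || c == '<' || c == '"'))
        continue;  // e.g. "import" used as a variable name
      if (is_import)
        deps.imports.push_back (name);
      else if (c != ':')  // skip "module :private;"
        {
          deps.module_name = name;
          deps.is_interface = exported;
        }
    }
  return deps;
}
```

Fed the text of ex2/hello.cc, scan_unit should report an interface of module Hello that imports <string_view>. Everything else in the file, including the #include lines, is simply skipped.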

Note that in an unconstrained parallel build, a clean build of modular code is likely to be slower than that of #include builds – it’s constrained by the module dependency tree. However, an incremental build is likely to be much faster, because header-files do not need to be reparsed all the time. Google’s experience with Clang’s implicit modules showed this to be a significant win.
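The ordering constraint itself is an ordinary topological sort over the provides/imports information. The sketch below uses Kahn’s algorithm; Unit and build_order are hypothetical names, and it assumes partition imports have already been expanded to full names such as Hello:inter. Imports that no listed source provides (header-units, say) are treated as built elsewhere and ignored.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Unit {
  std::string file;
  std::string provides;              // module or partition name,
                                     // empty if none
  std::vector<std::string> imports;  // full names
};

// Return the files in an order where every CMI producer precedes
// its importers. An empty result signals an import cycle.
std::vector<std::string> build_order (const std::vector<Unit> &units)
{
  std::map<std::string, std::size_t> provider;
  for (std::size_t i = 0; i != units.size (); ++i)
    if (!units[i].provides.empty ())
      provider[units[i].provides] = i;

  // importers[p] lists the units waiting on unit p; pending[i]
  // counts the imports unit i is still waiting for.
  std::vector<std::vector<std::size_t>> importers (units.size ());
  std::vector<std::size_t> pending (units.size (), 0);
  for (std::size_t i = 0; i != units.size (); ++i)
    for (const std::string &name : units[i].imports)
      {
        auto it = provider.find (name);
        if (it == provider.end ())
          continue;  // not one of ours: assume built elsewhere
        importers[it->second].push_back (i);
        ++pending[i];
      }

  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i != units.size (); ++i)
    if (pending[i] == 0)
      ready.push (i);

  std::vector<std::string> order;
  while (!ready.empty ())
    {
      std::size_t done = ready.front ();
      ready.pop ();
      order.push_back (units[done].file);
      for (std::size_t waiting : importers[done])
        if (--pending[waiting] == 0)
          ready.push (waiting);
    }

  if (order.size () != units.size ())
    order.clear ();  // something was never unblocked: a cycle
  return order;
}
```

Everything sitting in the ready queue at the same moment could be compiled in parallel, so the longest chain of imports bounds how fast a clean build can go. That is the constraint described above.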

Onwards!

I hope the examples here have shown you a flavour of what is available with modules. I kept the examples simple, to show some of the core module concepts, particularly how non-modular and modular code can interact.

As mentioned elsewhere, I believe the Microsoft implementation is the most advanced, and has been used for production code. Of the other implementations, GCC’s is more complete than Clang’s (mid 2020).

Unfortunately, for GCC one must use Godbolt, which is awkward for the more advanced use, or build one’s own compiler, which is a steep cliff to climb for most users. To make things even more exciting, those that have played with GCC have fallen over bugs. As with any major new feature, ensuring it is correct is difficult, and users have imaginative ways of exercising things. Don’t let that put you off though; user bug reports are helpful.

Footnotes

¹ GCC’s main development trunk and released versions do not yet provide module support. See the ‘Implementations’ box for details.
² As string_view has no suffix, you need to tell G++ what language it is. The c++-system-header language specifies (a) searching on the system #include path and (b) with -fmodules-ts, specifies building a header-unit. Other possibilities are c++-header (automatically recognized with a variety of typical header file suffixes) and c++-user-header (which searches the user #include path).

Nathan Sidwell is a long-time developer of GCC, having discovered that Open Source is more rewarding than proprietary software, compilers are more rewarding than hardware, and hardware is more rewarding than Physics.