Musings on Python – by a C++ Developer

Overload Journal #122 - August 2014 + Programming Topics   Author: Sergey Ignatchenko
Python and C++ are very different languages. Sergey Ignatchenko walks through things in Python that can confuse a C++ programmer.

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Bunny, and do not necessarily coincide with the opinions of the translators and editors. Please also keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

During my vacation in Watership Down warren, I met a fellow rabbit developer who’s got some experience with developing in Python, after spending quite a while worshiping C++. Below is my humble attempt to express his feelings about Python in a more or less literary form. It doesn’t aim to be a comprehensive analysis of the subject, but rather a set of things the guy himself has run into (YMMV).

A philosophical approach
From a philosophical perspective, one can think of classical pre-template C++ (which relies on virtual functions) as having an ‘it matters who you are’ paradigm, while C++ templates, and Python in general, rely on a quite different ‘it matters what you can do’ approach.

The good

Ad-hoc typing

When I see a bird that walks like a duck
and swims like a duck
and quacks like a duck,
I call that bird a duck.
~ James Whitcomb Riley

One good thing about Python is its ad-hoc typing system (which is known in the Python world as ‘duck typing’). I’ve observed that it does speed up initial development quite a bit.

In any language, it is common to write something specific, and then to generalize it. In C++, it is doable, but difficulties related to generalization are quite substantial. In fact, you can either generalize via making a function virtual (relying on common base class), or making it a template. I won’t discuss the advantages and disadvantages of each of the approaches here, but in any case you’re expected to spend some time performing this generalization. If you prefer (or need, as it routinely happens with containers) the C++ template route, the necessary textual changes are massive (even when they’re mostly mechanical), and debugging of the generalized program requires quite an effort (to put it mildly); in fact, it is such a big effort that many developers won’t do it at all, and those who will, will think twice before going the template way. If going down the virtualization route, changes are not that massive (though are still substantial), but you’re introducing a common base class, which is essentially a dependency which often leads to strange problems down the road (like multiple inheritance with virtual base classes etc.); while these problems can always be solved, solving them takes time, and this is my point here.

In contrast, in Python you don’t need to do anything special to make your code generic. In fact, each and every piece of code becomes as generic as possible at the very moment it is written. For example, with code such as

  def f(x,y):
    return x*y

you don’t care about types of x and y, as long as they support multiplication. While in C++ it can be written as a template quite easily, the amount of textual changes necessary when converting a function f from int f(int x, int y) to its template counterpart will be quite substantial (and if we consider more complicated functions, the complexity will rise further).
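To make this concrete, here is a small sketch of the same f applied to several unrelated types. The particular types chosen (Fraction and str) are only illustrations and are not taken from the discussion above.

```python
from fractions import Fraction

def f(x, y):
    # Works for any pair of objects that supports the * operator
    return x * y

int_result = f(3, 4)                                  # plain integers
float_result = f(1.5, 2)                              # float times int
fraction_result = f(Fraction(1, 3), Fraction(3, 4))   # exact rationals
repeat_result = f("ab", 3)                            # str * int repeats the string
```

No change to f was needed for any of these calls; the only requirement is that the arguments support multiplication.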

It should be noted that in Python you can (and should) use classes more or less like in C++. However, in Python you have an option not to do so (in trivial cases) – and this flexibility often saves quite a lot of development time.

Overall, it is not about ‘what you can do’ in Python and in C++ (whatever you can do in Python, you can do in C++), but more of ‘what you can do faster’. This matters, because the more time you need to spend on technicalities related to your programming language, the less time you have left for the task in hand; in a sense, it is similar to an argument between assembler and C developers 40 or so years ago (I don’t want to say that C++ will follow the fate of assembler, at least not yet).

A word for those who have arguments about advantages of strong typing – I will tell a bit about these advantages too, so please keep reading until you reach ‘The bad’ section :-).

Garbage collection with RAII support

One thing which I like about Python is that while it is garbage collected, it has explicit support for Resource Acquisition Is Initialization (RAII). Garbage collection IMHO does speed development up (though contrary to common belief you still need to be careful to avoid memory leaks [Ignatchenko12]). On the other hand, some garbage-collected programming languages (notably Java, at least at the time I last saw it) have a problem that freeing resources becomes really cumbersome and error-prone.

Let’s consider the C++ class File, which opens a file in its constructor, and closes it in its destructor. It means that even if there was an exception, then when my object of class File goes out of scope, the file is closed and the resource is freed. Good, but we don’t have garbage collection in C++.

The same class File in Java won’t be able to have a real destructor (there are no destructors in Java). In Java, to guarantee that you always close all the relevant files, you have three and a half options. Option 1 is to find all places where you have instantiated File, enclose them in try-finally blocks, and close file manually in each finally block. Horrible. Option 1a is a variation of Option 1, based on the ‘execute around’ pattern. Basically, you’re declaring a function wrapper which allocates resource, then calls whatever function you need via an interface (doing it within try-finally block), and then frees allocated resource. As long as you can make sure that class File is used only within such a wrapper – it is not ‘horrible’ anymore, just ‘very cumbersome’.

Option 2 looks a bit better on the surface – in Java you can define a finalize() function, which looks like ‘almost a destructor’. Unfortunately, this ‘almost’ kills the whole idea: due to the very nature of garbage collection, Java cannot guarantee when exactly finalize() will be called; it means all kinds of trouble, including the program passing all the tests but failing in production. For example, you have file.close() in finalize(), and then re-open the same file somewhere down the road. It just so happens during the tests that finalize() is called before re-opening, and all tests pass, but in production finalize() is sometimes called later than re-opening the file, and therefore re-opening the file fails (to make things worse, it will invariably fail intermittently and at the very worst time to make debugging even more complicated). Overall, there is pretty much a consensus that finalize() should not be used for a generic resource cleanup. Ouch. In fact, this ‘how to guarantee that resources are always freed when they’re not necessary anymore’ problem has always been my biggest complaint about Java.

Option 3 (thanks to Roger Orr for pointing it out): if you’re lucky enough to run Java 7, you may implement the java.lang.AutoCloseable interface and then write code such as:

  try (MyClass x = new MyClass(/*...*/))
    //'try-with-resources' statement
    {
      x.method("this might throw");
    } // x.close() is called in any case

Not bad – and we can say that Java 7 does support both RAII and garbage collection.

In a manner which is quite similar to Java 7, Python provides a neat way of expressing RAII. In Python, you can declare your class with special functions like __enter__() and __exit__() (and many of Python’s own objects, such as the file object, implement them too). Then, you can write something like:

  with open("myfile.txt","r") as f:
    #work with f
    #more work with f
  #at this point, f.__exit__() will be called

For me, it solves all my resource allocation concerns (and Python has garbage collection too). Oh, and while we’re on the subject of garbage collection and finalizers in Python – a word of advice: never declare Python finalizers (__del__() functions) unless you really know what it means (Python __del__() causes very different behavior from the Java finalize()).
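As a minimal sketch of how a user-defined class can take part in the with statement: the class TrackedResource below is hypothetical and only records whether it is currently ‘open’, so that the cleanup can be observed even when the block throws.

```python
class TrackedResource:
    """Hypothetical resource that records whether it has been released."""

    def __init__(self, name):
        self.name = name
        self.is_open = False

    def __enter__(self):
        # Acquire the resource when the 'with' block is entered
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the resource; this runs even if the block raised
        self.is_open = False
        return False  # do not swallow the exception

resource = TrackedResource("demo")
try:
    with resource as r:
        was_open_inside = r.is_open
        raise ValueError("this might throw")
except ValueError:
    pass
```

After the try block, resource.is_open is False again, although the body of the with statement exited via an exception.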

Usable Lambdas

I didn’t think that I would ever be able to write anything good about lambda functions for any practical purpose, but here it is: lambda functions in Python are surprisingly readable and useful. They have a very simple syntax, and they’re limited, but they’re very readable. Compare:

Plain C++11:

  1 sort( myVector.begin(), myVector.end(),
  2    []( const MyClass& a, const MyClass& b )
  3         { return a.x < b.x; }
  4    );

Python:

  myList.sort( key=lambda a: a.x )

I have never been a fan of one-line expressions just for the sake of being one-line, but the Python version is not only a one-liner, it is obvious from the very first glance, while the C++ version requires quite a lot of time to parse when reading.

It should be noted that the point of the example above is not about begin() and end() in C++ line 1 or comparison in C++ line 3; as we’re discussing lambdas, the difference under consideration is about C++ line 2 (and inevitable curly brackets from line 3).

As was pointed out by Jens Auer in accu-general, the boost::lambda library (BLL) allows a much shorter way of writing it.

boost::lambda library:

    sort( myVector.begin(), myVector.end(),
          _1.x < _2.x );

Still, I’d argue that while certainly shorter than plain C++, it is not exactly readable compared to Python version – first, numbered parameters are definitely worse than Python’s named ones, and second, unless you know about BLL (and most developers don’t as of now), such code becomes extremely confusing. Honestly, for a C++ project with more developers than just me I don’t know which way I’d use – cumbersome plain C++ or a much shorter BLL with a comment for each such lambda saying /* boost lambda */, so an unaware reader knows how to Google it (with Python syntax, it is quite self-documented).

NB: Obviously, it is possible to write a non-lambda wrapper for a specific task of sorting a vector, but this won’t get us any closer to having usable and readable lambdas, which this section is about. While it is perfectly possible to write code without lambdas at all, usable and readable lambdas do simplify development (not by much, but every bit counts), and having to write a non-lambda wrapper for each scenario where lambdas are useful, defeats the whole purpose of lambdas.

NB #2: there is a caveat related to lambdas in Python; please see ‘The ugly’ section.

Standard library

The standard library in Python is huge and is very well organized. It includes 90% of the things one may want from an application-level library; overall, having pretty much everything included into the standard Python library is often referred to with the ‘Python. Batteries Included’ phrase. With all due respect to the enormous efforts of boost:: folks, matching functionality with the Python library isn’t going to happen (and probably is not aimed for) – there are just so many things in there, including cryptography, wide protocol and file format support, database interface libraries, etc. etc. Once again, it is not about ‘you cannot do it in C++’, but about ‘how long it will take to do’.

Let us now consider some of the most important parts of the Python standard library.

Collections are supported at language level, and include tuples (somewhat similar to return of C++’s std::make_tuple()), lists (similar to std::vector<>), dictionaries (similar to std::unordered_map<>), and sets (similar to std::unordered_set<>). Notably missing (well, you can use the bintrees package but it is not exactly ‘standard library’) are tree-based maps/sets which allow fast ordered iterations over large datasets.
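A brief sketch of the four built-in collections, with made-up sample data. The last line shows one way to get ordered iteration without a tree-based container: sorting the keys on demand, which costs O(n log n) for every pass.

```python
point = (1, 2)                         # tuple
squares = [n * n for n in range(5)]    # list
ages = {"bob": 25, "alice": 30}        # dictionary (hash-based)
seen = {3, 1, 2}                       # set (hash-based)

# No tree-based map here: to visit keys in order, sort them explicitly
ordered_names = sorted(ages)
```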

Assertions are first-class citizens and are recognized at language level, which is a good thing. They also allow specification of the message to report in case of assertion failure – if you feel like it.
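For illustration, here is a sketch of an assertion with a message; the function mean() is a hypothetical example. Keep in mind that assert statements are removed when Python runs with the -O option, so they should not be relied upon for validating external input.

```python
def mean(values):
    # The expression after the comma is the message reported on failure
    assert len(values) > 0, "mean() requires at least one value"
    return sum(values) / len(values)

try:
    mean([])
    message = None
except AssertionError as e:
    message = str(e)
```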

Furthermore, the packages profile and cProfile provide a rather convenient built-in means of profiling of your program.
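A minimal sketch of profiling a single call with cProfile and formatting the result with pstats; busy_work() is a hypothetical workload standing in for real code.

```python
import cProfile
import io
import pstats

def busy_work(n):
    # Hypothetical workload to be profiled
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = busy_work(100000)
profiler.disable()

# Write the collected statistics into a string, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

For a whole script, running python -m cProfile myscript.py from the command line is usually enough (myscript.py being a placeholder name).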

Regular expressions in Python are very efficient and are aided by ‘raw string’ literals. ‘Raw string’ literals are useful because the escaping rules for ‘\’ in default C++ strings and default Python strings tend to make regexps quite cumbersome and poorly readable. In Python (as well as in C++11), there is an elegant way around it: whenever you prefix string literal with ‘r’ (such as r"\([0-9]*\)"), the escaping rules for backslashes will be different, which allows you to write regular expressions in quite a natural way.
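The difference can be seen in a short sketch: the same pattern written as an ordinary literal and as a raw literal. The capturing group and the sample input are additions made for this example only.

```python
import re

# The same pattern twice: with doubled backslashes, and as a raw string
escaped_pattern = "\\(([0-9]*)\\)"
raw_pattern = r"\(([0-9]*)\)"

# Find a parenthesized run of digits and extract the digits themselves
match = re.search(raw_pattern, "call(42)")
digits = match.group(1)
```

Both literals produce exactly the same string; the raw form simply avoids having to double every backslash.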

Built-in unit testing framework

Having a built-in unit testing framework is a good thing in any language. For dynamically typed languages such as Python, where type errors only show up at run time, integrated unit testing (especially automated regression testing) becomes an absolute must. Fortunately, Python has support for it too (for details, see the unittest package).
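A minimal sketch of a unittest test case, using the function f from the duck typing example; the test names are arbitrary. The suite is run programmatically here, whereas a standalone test script would normally just call unittest.main().

```python
import unittest

def f(x, y):
    return x * y

class TestF(unittest.TestCase):
    def test_integers(self):
        self.assertEqual(f(3, 4), 12)

    def test_string_repetition(self):
        self.assertEqual(f("ab", 2), "abab")

# Collect the tests from the class and run them
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestF)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```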

Performance

For those who want to write Python off due to performance issues, I have a word of advice: don’t rush. While it is perfectly possible to find an application where Python’s performance (or as some C/C++ developers will probably say, Python’s lack of performance) will make a difference, the chances are that you won’t be able to see the difference in your program. In 99% of business applications, 99% of code is ‘glue code’, and for 99% of ‘glue code’, Python’s performance will be more than enough.

Of course, if you’re developing some non-standard computation-intensive stuff such as a video decoder, you will probably be out of luck. However, if you run into a situation where you need to write certain parts in C/C++, Python will provide a way to call your DLLs/.so's (the appropriate Python package is ctypes).

The bad

If you’ve read until this point, you may think that I’m a Python missionary on a quest to convert as many people as possible. Don’t worry, I will mention bad sides of Python too.

Ad-hoc typing

While ad-hoc typing does have its advantages (as was discussed above), it has a big problem too, and this is a lack of scalability. Let me elaborate a bit. If you’re creating an ad-hoc object such as (1,2,3) (similar to std::make_tuple(1,2,3)), it works very well for those cases where you need just to pass it from one point to another point, without going into the hassle of declaring things. However, ad-hoc typing doesn’t really scale – as soon as you’re using the same ad-hoc type in 10 places, and it does need to be the same in all 10 places, code maintenance becomes a nightmare.

Many Python developers seem to realize the problem, and several workarounds have been created. In particular, I’ve found namedtuple (from the collections module) to be quite useful (in a sense, it is a close cousin of C++ struct):

C++:

  struct X { int i; std::string s; };

Python:

  X = namedtuple('X', ['i','s'])
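A short usage sketch for the X type above, with sample values made up for the example:

```python
from collections import namedtuple

X = namedtuple('X', ['i', 's'])

x = X(i=1, s='hello')
by_name = (x.i, x.s)          # fields are accessible by name...
by_position = (x[0], x[1])    # ...and still by position, like a plain tuple
updated = x._replace(i=2)     # instances are immutable; _replace returns a copy
```

Unlike a C++ struct, the fields carry no declared types, and a misspelled field name is only detected at run time (as an AttributeError).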

On the other hand, more recent development of the Python abstract base classes (package abc) feels like a contradictio in adjecto: it is like writing in Python using C++ paradigms, which defeats the advantages of one while not providing benefits of the other one.

An ideal IMHO would be an environment where I could write ad-hoc types without declaring them (while they are still small), and then, whenever I feel that they became too large to be ad-hoc, to change them (just by adding declarations where necessary, and not changing the actual code(!)) to strict typing. I have some ideas in this regard, though it is a bit too early to describe them.

Performance

In general, Python performs surprisingly well for a scripting language. Still, if computationally intensive work is involved, one may end up with a need to rewrite big chunks of the program (or even the whole program). Also, multithreading, while technically possible, does not allow performing calculations on more than one core (see below).

Multithreading

Multithreading in Python is a joke; well, it is at least for those of us coming from a non-Pythonic world. Due to the fact that all data processing in Python is made under the so-called Global Interpreter Lock (GIL), trying to perform calculations on two cores in two threads is doomed (well, it will work, but it won’t work any faster – and probably a tad slower – than single-threaded code). It limits the usage of multithreading to the cases when a thread is blocked due to I/O wait. Technically speaking, while the default Python distribution uses CPython, which does have a GIL, the GIL is not a restriction of Python as such, so you may be able to get away with using something like Jython or IronPython (I didn’t try it myself though).

If you need to perform computationally intensive calculations in parallel while using the default CPython, there is still an option to spawn another process (which will have its own GIL so you will be able to calculate things in parallel). An appropriate Python package is multiprocessing, and it is quite convenient (in fact, it has an interface which is very similar to that of the threading package). However, you should keep in mind that under the hood it relies on marshaling/unmarshaling of all the parameters passed to the working process (and of all the values returned back), so if your parameters and/or return values are large you can easily get quite a performance hit. This in turn can be overcome (at least in theory) by using shared memory, but this has a caveat too – shared memory cannot contain anything but very simple data. Overall, you can end up with a scenario where you’ll essentially be forced to write the computational code in C/C++.

The ugly

Every programming language has its own peculiarities, and Python is not an exception. I will try to point out a few items which looked quite unusual to me after coming from a mainly C++ world.

‘Pythonic’

When speaking to Pythonic developers (whether in person or in forums) there is a big chance that you’ll run into somebody who with almost religious zeal will tell you, “You shouldn’t write it this way, because it is not ‘Pythonic’”. In fact, way too often ‘Pythonic’ becomes a synonym for “I believe that it is the only way of doing it; I cannot explain why, so I’m telling it is ‘Pythonic’”. Fortunately, in more or less populated forums (such as StackOverflow), usually there are enough people who make sure that whatever is called ‘Pythonic’ makes sense. Still, ongoing arguments about something being ‘Pythonic’ (or ‘not Pythonic’) can be rather annoying.

Python 3

Whereupon the emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs.
~ Jonathan Swift, circa 1726

With all due respect to Guido van Rossum, I strongly believe that the approach taken with Python 3 is a huge mortgage-crisis-sized mistake. What has happened with Python 3 is that developers were told that Python 3 will be incompatible with Python 2. No smooth migration, no gradual deprecation, just an ‘all or nothing’ migration path (well, with a helper ‘2to3’ tool which ‘sorta’ converts Python 2 source code to Python 3). Moreover, certain constructs which are allowed in both Python 2 and Python 3 have a subtly different meaning in Python 3 (one such example is dict.items()). This has led to enormous confusion and significant reluctance to move towards Python 3 (in fact, the adoption rate of Python 3 was reported to be as low as 2% five years after it was introduced [Hiltmon14]).

Without going into Blefuscian-Lilliputian discussions of “What is better – to suffer from imperfections of Python 2 in Python 3 or to have better but incompatible Python 3?”, I’ll try to summarize the current situation:

  • the official position of Guido and the Python core team is that all new development SHOULD be done in Python 3
  • however, if you have Python as a part of a 3rd-party application (which tend to use Python 2, as they need to support older scripts written in Python 2) – you’re pretty much doomed to Python 2
  • moreover, as there is a ‘2to3’ tool which ‘sorta’ converts your code from Python 2 to Python 3 (and there is no tool which converts code back – from Python 3 to Python 2), one way to have code which supports both Python 2 and Python 3, is to keep your codebase in Python 2. Alternatively, you may write in a dialect known as Polyglot (which works in both Python 2 and Python 3), though it has been argued that Polyglot is the worst language out of Python 2, Python 3, and Polyglot [Faassen14].

Phew, this is ugly indeed. To make it even uglier, there were even suggestions to stop supporting Python 2 to force migration to Python 3 [Faassen14]. One thing I wonder about is how those people would stop a huge Python 2 community from creating an unofficial fork with ongoing support for Python 2 (it is open source, after all)?

Semantics of whitespace

Python is a quite unusual language in that it relies on whitespace to provide semantic data (or in other words – changing whitespace can change semantics of the program in Python). For example,

  if a < b:
      x = 1
      y = 2

and

  if a < b:
      x = 1
  y = 2

are two different programs producing different results.
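To see the difference, the two fragments can be wrapped into functions and called with the same arguments. The function names and the initial values of x and y are assumptions made for this sketch.

```python
def both_indented(a, b):
    x = y = 0
    if a < b:
        x = 1
        y = 2    # part of the if body
    return x, y

def one_indented(a, b):
    x = y = 0
    if a < b:
        x = 1
    y = 2        # runs unconditionally
    return x, y

# With a >= b the condition is false, and the results differ
first = both_indented(5, 3)
second = one_indented(5, 3)
```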

The Python approach has both advantages and disadvantages. On the positive side, it enforces code readability. On the negative side, it has several issues (in practice, rather minor if you are careful):

  • you need to be careful when switching between windows using Alt-Tab: there is a substantial chance that an accidentally added tab can go unnoticed but will break your code, ouch
  • you need to make sure that the ‘diff’ tool which you’re using with your source control system does not ignore whitespace
  • instead of an endless C/C++/C#/Java debate of “Where is the right way to put curly brackets?” it leads to another endless debate of “What is the right thing to use – tabs or spaces?” As it doesn’t matter any more than which end of the egg is broken, the only thing which matters is consistency. And as such, I prefer to stick to the (widely accepted) recommendation from [PEP8]: use spaces, with 4 spaces per indentation level. Why? Just for the sake of consistency.

Lambdas within loops

With all the good things said about lambdas in Python, there is one thing to keep in mind: if you try to use a lambda which captures variables within a loop, it won’t work as you might have expected, because the lambda looks up the variable when it is called, not when it is created. For details, refer, for example, to [StackOverflow-1]. The best workaround I was able to find is a direct replacement with a function object. Instead of not-working-as-expected:

  for i in range(10):
      a[i] = lambda x: x+i
          #every a[i] function will use the same i=9

you can use, for example, an almost equivalent but working as you (or at least me) would intuitively expect:

  class MyLambda:
      def __init__(self,i):
          self.i = i
      def __call__(self,x):
          return x+self.i
  #...
  for i in range(10):
      a[i] = MyLambda(i)

There are other alternatives too, see, for example, [StackOverflow-1] and [StackOverflow-2] for details.
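Two of the commonly suggested alternatives, sketched here for comparison with the function object: binding the loop variable through a default argument, and using functools.partial. The helper add() and the list names are made up for the example.

```python
from functools import partial

# Late binding: each lambda looks up i when called, after the loop has ended
late = [lambda x: x + i for i in range(3)]

# Default-argument idiom: i is evaluated when each lambda is defined
bound = [lambda x, i=i: x + i for i in range(3)]

# functools.partial binds the first argument explicitly
def add(i, x):
    return x + i

partials = [partial(add, i) for i in range(3)]

late_results = [g(10) for g in late]
bound_results = [g(10) for g in bound]
partial_results = [g(10) for g in partials]
```

The default-argument version is the shortest, at the price of adding a second parameter which a caller could accidentally override.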

Overall, this is rather annoying, but is not that a big deal when you know about it.

Optimizing performance

Optimizing the performance of a Python program is very different from optimizing a C or C++ counterpart. For a Python program, instead of an ‘I can write it myself’ approach, one should look for highly optimized (a.k.a. ‘written in C’) functions from the Python standard library. Just one example: when we needed to read a multi-megabyte text file accounting for ‘universal line endings’ (either \r or \n), the standard Python library did rather a poor job. Rewriting it to byte-by-byte processing (which would help in C/C++) only made things worse, as more work for Python bytecode is rarely a good thing. However, when we became creative and started reading the file in chunks (each chunk being several kilobytes in size, so it usually contained multiple lines), pre-creating a regular expression pattern

  eol_pattern = re.compile( r'([^\r\n]*)([\r\n])' )

and using

  line = eol_pattern.match( chunk,
                            current_pos_within_chunk )

on the chunk to extract the next line (with an appropriate handling of double-symbol line endings), we got about a 2x speed improvement over the standard Python library universal line handling (which apparently was pure Python and was not that creative in this regard). The reason for this is quite obvious: the regular expression library is heavily optimized C code, and when we pushed most of the processing there instead of doing it in Python, we got quite an improvement.
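The following is only a sketch of how such a loop over one chunk might look, not the code that was actually used. The function split_chunk() and its return convention are assumptions; it also does not handle a ‘\r\n’ pair which happens to be split across two chunks, which a real reader would need to deal with.

```python
import re

eol_pattern = re.compile(r'([^\r\n]*)([\r\n])')

def split_chunk(chunk):
    """Return (complete_lines, leftover) for one chunk of text."""
    lines = []
    pos = 0
    while True:
        m = eol_pattern.match(chunk, pos)
        if m is None:
            break  # no further line ending in this chunk
        lines.append(m.group(1))
        pos = m.end()
        # Treat '\r\n' as a single line ending
        if m.group(2) == '\r' and chunk.startswith('\n', pos):
            pos += 1
    # The leftover is an incomplete line, to be prepended to the next chunk
    return lines, chunk[pos:]

lines, leftover = split_chunk("a\nb\r\nc\rd")
```

The Python-level loop runs once per line rather than once per character, with the character scanning done inside the regular expression engine.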

Conclusion

The guy who told me this story is of the opinion (which I may or may not share) that Python is by far the best language available for writing ‘glue’ code. Yes, it has its quirks, but for most of the business-level code it is clearly ‘good enough’, and whenever top performance is necessary, C/C++ code can be integrated rather easily.

From my perspective, I would say that as both C++ and Python are Turing-complete, you can always implement any practical program in both of them (well, assuming that the Church-Turing thesis stands). In practice, of course, there are restrictions such as, ‘Will we live long enough to write the program?’ (an argument for C++ over asm and for Python over C++) and ‘Will we live long enough for the program to execute?’ (an argument in the opposite direction). As usual, it is all about choosing the right tool for the job.

References

[Faassen14] ‘The Gravity of Python 2’, Martijn Faassen, http://blog.startifact.com/posts/python-2-gravity.html

[Hiltmon14] ‘Python: It’s a Trap’, Hilton Lipschitz, http://hiltmon.com/blog/2014/01/04/python-its-a-trap/

[Ignatchenko12] ‘Memory Leaks and Memory Leaks’, Sergey Ignatchenko, Overload 107, February 2012

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[PEP8] http://legacy.python.org/dev/peps/pep-0008

[StackOverflow-1] http://stackoverflow.com/questions/1841268/generating-functions-inside-loop-with-lambda-expression-in-python

[StackOverflow-2] http://stackoverflow.com/questions/938429/scope-of-python-lambda-functions-and-their-parameters

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.
