Code often rots over time as various people add new features. Jez Higgins shows how to refactor code that has grown organically, making it clearer and more concise.
In a recent blog post [Higgins24] about my sadness and disappointment about the candidates we were getting for interview, I talked about the refactoring exercise we give people, and the conversations we have afterwards.
I’m not able to show any of that code, but I am going to talk about some code here of the type we often see. According to the version history, it’s passed through a number of hands, and I want to be clear I know none of the people involved nor have I spoken to them. They are, though, exactly the type of person presenting themselves for interview, and so for my purposes here they are exemplars.
Here’s some Python code. It’s from a larger document processing pipeline. Documents come shoved into the system, get squished around a bit, have metadata added, some formatting fixups, then squirt out the other end as nice looking pdfs. Standard stuff.
This is not about them, though. I hold them blameless, and wish them only happiness. This is about the places that they worked, about the wider trade, about a culture that says this is fine.
To see a world in a grain of sand
Documents can have references to other documents, both within the existing corpus, and to a variety of external sources. These references have standard forms, and when we find something that looks like a document reference, we do a bit of work to make sure it’s absolutely clean and proper.
That’s where the function in Listing 1, normalise_reference
, comes in. I have obfuscated identifiers in the code sample, but its structure and behaviour are as I found it.
def canonicalise_reference(reference_type, reference_match, canonical_form): if ( (reference_type == "RefYearAbbrNum") | (reference_type == "RefYearAbbrNumTeam") | (reference_type == "YearAbbrNum") ): components = re.findall(r"\d+", reference_match) year = components[0] d1 = components[1] d2 = "" corrected_reference = canonical_form.replace("dddd", year).replace("d+", d1) elif ( (reference_type == "RefYearAbbrNumNumTeam") | (reference_type == "RefYearAbrrNumStrokeNum") | (reference_type == "RefYearNumAbbrNum") ): components = re.findall(r"\d+", reference_match) year = components[0] d1 = components[1] d2 = components[2] corrected_reference = ( canonical_form.replace("dddd", year).replace("d1", d1).replace("d2", d2) ) elif ( (reference_type == "AbbrNumAbbrNum") | (reference_type == "NumAbbrNum") | (reference_type == "EuroRefC") | (reference_type == "EuroRefT") ): components = re.findall(r"\d+", reference_match) year = "" d1 = components[0] d2 = components[1] corrected_reference = canonical_form.replace("d1", d1).replace("d2", d2) return corrected_reference, year, d1, d2 |
Listing 1 |
I’d been kind-of browsing around a git repository, looking at folder structure, getting the general picture. A chunk of the system is a Django webapp and thus has that shape, so I went digging for a bit of the meat underneath. This was almost the first thing I saw and, well, I kind of flinched. Poking around some more confirmed it’s not an anomaly. It is representative of the system.
You’ve probably had some kind of reaction of your own. This is what immediately leapt out at me:
- The length
- The width!1
- The visual repetition, both in the
if
conditions and in the bodies of the conditionals - The string literals
- The string literal with the spelling mistake
- The extraneous brackets in the second conditional body – written by someone else?
- The extra line before the return – functionally, of course, it makes no difference, but still, urgh
Straightaway I’m thinking that more than one person has worked on this over time. That’s normal, of course, but I shouldn’t be able to tell. If I can, it’s a contra-indicator.
Looking a little longer, there’s a lot of repetition – in shape, and in detail. Looking a little longer still, and I think function parameters are in the wrong order. reference_type
and canonical_form
are correlated and originate within the system. They should go together. It’s reference_match
which comes from the input document, it’s the only true variable and so, for me anyway, should be the first parameter. I suspect this function only had two parameters initially, and the third was added without a great deal of thought to the aesthetics of the change.
That’s a lot to not like in not a lot of code.
But at least there are tests
And hurrah for that! There are tests for this function, tangled up in a source file with some unrelated tests that pull in a mock database connection and some other business, but they do exist.
There are two test functions, one checking well-formed references, the other malformed references, but, in fact, each function checks multiple cases.
It’s a start, but the test code is much the same as the code it exercises – long and repetitious – which isn’t, perhaps, that surprising. A quick visual check shows they’re deficient in other, more serious ways. There are ten reference types named in canonicalise_reference
. The tests check seven of them and, in fact, there is a whole branch of the if
/else
ladder that isn’t exercised. That’s the branch I already suspect of being a later addition.
Curiously too, while canonicalise_reference
returns a 4-tuple, the tests only check the corrected reference and the year, ignoring the other two values. That sent me off looking for the canonicalise_reference
call sites, where all four elements of the tuple are used. Again, I’d suggest the 4-tuple came in after the tests were first written and were not updated to match. After all, they still passed.
I am sure these tests were written post-hoc. They did not inform the design and development of the code they support.
Miasma
If the phrase coming to mind is code smells, then I guess you’re right. This code is a stinky bouquet of bad odours, except they aren’t clues to some deeper problem with the code. We don’t need clues – it’s right out there front and centre. No, these smells emanate from with the organisation, from a failure to develop the programmers whose hands this code has passed through. The code works, let’s be clear, but there’s a clumsiness to it and a lack of care in its evolution. That’s a cultural and organisational failing.
I keep saying this is about organisations. I’m not saying these are bad places to work, where maladjusted managers delight in making their underlings squirm. Quite the contrary, I’ve worked at more than one of the organisations responsible for the code above and had a great time. But there is something wrong – an unacknowledged failure. An unknown failure even. There so much potential, and it’s just not being taken up
I came across this code because I was talking about potential work on it, going back into one of those organisations. That didn’t pan out, but had I been able I would absolutely have signed up for it. It’s fascinating stuff and right up a multiplicity of my alleys.
Let’s imagine for a moment that I was sitting down for my first day on this job, what would I do with this code? Well, at a guess, nothing. Well, nothing until I needed to, and then I’d spend a bit of time on it. But I’d absolutely be talking to my new colleagues about, well, everything.
One step at a time
The code in Listing 1 is just not great. It’s long, for a start, and it’s long because it’s repetitious. The line
components = re.findall(r"\d+", reference_match)
appears in every branch of the if
/else
. Let’s start by hoisting that up.
Clearing visual noise
The unnecessary brackets in the first elif
body just jar. They catch the eye and makes it appear that something different is happening in the middle there, when in fact it adds nothing and is just visual noise.
(The result of this change and the previous one are shown in Listing 2).
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if ( (reference_type == "RefYearAbbrNum") | (reference_type == "RefYearAbbrNumTeam") | (reference_type == "YearAbbrNum") ): year = components[0] d1 = components[1] d2 = "" corrected_reference = canonical_form.replace("dddd", year) .replace("d+", d1) elif ( (reference_type == "RefYearAbbrNumNumTeam") | (reference_type == "RefYearAbrrNumStrokeNum") | (reference_type == "RefYearNumAbbrNum") ): year = components[0] d1 = components[1] d2 = components[2] corrected_reference = canonical_form.replace("dddd", year) .replace("d1", d1).replace("d2", d2) elif ( (reference_type == "AbbrNumAbbrNum") | (reference_type == "NumAbbrNum") | (reference_type == "EuroRefC") | (reference_type == "EuroRefT") ): year = "" d1 = components[0] d2 = components[1] corrected_reference = canonical_form.replace("d1", d1) .replace("d2", d2) return corrected_reference, year, d1, d2 |
Listing 2 |
Move the action down
The if
/else
ladder sets up a load of variables, which are then used to build corrected_reference
.
The lines building corrected_reference
aren’t the same, but they are pretty similar. We can move them out of the if
/else
ladder and combine them together.
Looking up and out
This is a bit of a meta-change, because you can’t infer it from the code here, but canonical_form
is drawn from a data file elsewhere in the source tree. We control that data file.
Examining it, we can see it’s safe to replace d+
with d1
in the canonical forms. As a result, we can eliminate one of the replace
calls when constructing corrected_reference
.
This change and the previous one are shown in Listing 3. The shape of the code hasn’t wildly changed, but feels like we’re moving in a good direction.
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if ( (reference_type == "RefYearAbbrNum") | (reference_type == "RefYearAbbrNumTeam") | (reference_type == "YearAbbrNum") ): year = components[0] d1 = components[1] d2 = "" elif ( (reference_type == "RefYearAbbrNumNumTeam") | (reference_type == "RefYearAbrrNumStrokeNum") | (reference_type == "RefYearNumAbbrNum") ): year = components[0] d1 = components[1] d2 = components[2] elif ( (reference_type == "AbbrNumAbbrNum") | (reference_type == "NumAbbrNum") | (reference_type == "EuroRefC") | (reference_type == "EuroRefT") ): year = "" d1 = components[0] d2 = components[1] corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 3 |
Typos must die
The ‘typo’ in "RefYearAbrrNumStrokeNum"
is corrected – another meta-fix. That string comes from the same data file as the canonical forms. Obviously "RefYearAbrrEtcEtc"
looks like a loads of nonsense, but Abrr is so clearly a typo. It’s an abbreviation for abbreviation! It should be Abbr! Like the brackets I mentioned above, this is a piece of visual noise that needs to go.
Ok, the corrected version now says "RefYearAbbrNumStrokeNum"
, which isn’t a world changing difference, but to me it looks better and IDE agrees because there isn’t a squiggle underneath.
Constants
Those string literals give me the heebie-jeebies. I’ve replaced them with constants. (This change and the previous one are shown in Listing 4.)
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if ( (reference_type == RefYearAbbrNum) | (reference_type == RefYearAbbrNumTeam) | (reference_type == YearAbbrNum) ): year = components[0] d1 = components[1] d2 = "" elif ( (reference_type == RefYearAbbrNumNumTeam) | (reference_type == RefYearAbbrNumStrokeNum) | (reference_type == RefYearNumAbbrNum) ): year = components[0] d1 = components[1] d2 = components[2] elif ( (reference_type == AbbrNumAbbrNum) | (reference_type == NumAbbrNum) | (reference_type == EuroRefC) | (reference_type == EuroRefT) ): year = "" d1 = components[0] d2 = components[1] corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 4 |
Birds of a feather
By grouping like reference types together, we can slim down each if
condition.
YearAbbrNum_Group = [ RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum ]
Having tried it, I like that. Let’s roll it out to the rest of the types (see Listing 5.)
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if reference_type in YearNum_Group: year = components[0] d1 = components[1] d2 = "" elif reference_type in YearNumNum_Group: year = components[0] d1 = components[1] d2 = components[2] elif reference_type in NumNum_Group: year = "" d1 = components[0] d2 = components[1] corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 5 |
Love it.
Remember Python calls arrays lists, but also that it has tuples too. Tuples are immutable, so they’re a better choice for our groups. The result of swapping tuples for lists by switching []
to ()
is:
YearAbbrNum_Group = ( RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum )
Destructure FTW!
We can collapse the
year = ... d1 = ... d2 = ...
lines together into a single statement, going from three lines into a single line (see Listing 6).
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if reference_type in YearNum_Group: year, d1, d2 = components[0], components[1], "" elif reference_type in YearNumNum_Group: year, d1, d2 = components[0], components[1], components[2] elif reference_type in NumNum_Group: year, d1, d2 = "", components[0], components[1] corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 6 |
Much easier on the eye.
An extra level of indirection
Bringing the year
, d1
, d2
assignments together particular highlights the similarity across each branch of the if
ladder.
Let’s pair up a type group with a little function that pulls out the components. (See Listing 7.) Probably did a bit too much in one go here, and it’s ugly as hell. But it works, and it captures something useful.
YearNum_Group = { "Types": [ RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum ], "Parts": lambda cmpts: (cmpts[0], cmpts[1], "") } def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if reference_type in YearNum_Group.Types: year, d1, d2 = YearNum_Group.Parts(components) elif reference_type in YearNumNum_Group.Types: year, d1, d2 = YearNumNum_Group.Parts(components) elif reference_type in NumNum_Group.Types: year, d1, d2 = NumNum_Group.Parts(components) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 7 |
If we now introduce a little class to pair up the types and components lambda function, it’s more setup at the top, but it’s neater in the function body:
class TypeComponents: def __init__(self, types, parts): self.Types = types self.Parts = parts YearNum_Group = TypeComponents( ( RefYearAbbrNum, RefYearAbbrNumTeam, YearAbbrNum ), lambda cmpts: (cmpts[0], cmpts[1], "") )
That worked, and Listing 8 shows it extended across the two elif
branches.
def canonicalise_reference(reference_type, reference_match, canonical_form): components = re.findall(r"\d+", reference_match) if reference_type in YearNum_Group.Types: year, d1, d2 = YearNum_Group.Parts(components) elif reference_type in YearNumNum_Group.Types: year, d1, d2 = YearNumNum_Group.Parts(components) elif reference_type in NumNum_Group.Types: year, d1, d2 = NumNum_Group.Parts(components) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 8 |
The if
conditions and the bodies now all have the same shape. That’s pretty cool. They were similar before, but now they’re the same.
Yoink out the decision making
It’s not really clear in the code, but there are only two things really going on in this function. The first is pulling chunks out of reference_match
, and the second is putting those parts back together into canonical_reference
. Let’s make that clearer (see Listing 9).
def reference_components(reference_type, reference_match): components = re.findall(r"\d+", reference_match) if reference_type in YearNum_Group.Types: year, d1, d2 = YearNum_Group.Parts(components) elif reference_type in YearNumNum_Group.Types: year, d1, d2 = YearNumNum_Group.Parts(components) elif reference_type in NumNum_Group.Types: year, d1, d2 = NumNum_Group.Parts(components) return year, d1, d2 def canonicalise_reference(reference_type, reference_match, canonical_form): year, d1, d2 = reference_components( reference_type, reference_match) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 9 |
Say what you mean
There’s no need to assign year
, d1
, d2
in that new function. We can just return the values directly (see Listing 10).
def reference_components(reference_type, reference_match): components = re.findall(r"\d+", reference_match) if (reference_type in YearNum_Group.Types): return YearNum_Group.Parts(components) elif (reference_type in YearNumNum_Group.Types): return YearNumNum_Group.Parts(components) elif (reference_type in NumNum_Group.Types): return NumNum_Group.Parts(components) def canonicalise_reference(reference_type, reference_match, canonical_form): year, d1, d2 = reference_components(reference_type, reference_match) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 10 |
Search
I mentioned the if
conditions and the bodies now all have the same shape. We can exploit that now to eliminate the if
/else
ladder by checking each group in turn (see Listing 11).
TypeGroups = ( YearNum_Group, YearNumNum_Group, NumNum_Group ) def reference_components(reference_type, reference_match): components = re.findall(r"\d+", reference_match) for group in TypeGroups: if reference_type in group.Types: return group.Parts(components) def canonicalise_reference(reference_type, reference_match, canonical_form): year, d1, d2 = reference_components(reference_type, reference_match) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 11 |
And rest
I first wrote this on Mastodon [Higgins24] because I’m that kind of bear, and this is where I stopped. I felt the code was in a much better place – not perfect by any means, but better.
But then I thought of something else.
You wouldn’t let it lie
Now the types are grouped together, I was inclined to put the string literals back in.
We only use "RefYearAbbrNum"
, for example, as part of a TypeComponents
object. It’s not needed anywhere else, but having it as a constants in its own right floating around implies that you might and suggests that you can. In fact, it’s YearNum_Group
that is the constant, so let’s tie things down to that.
YearNum_Group = TypeComponents( ( "RefYearAbbrNum", "RefYearAbbrNumTeam", "YearAbbrNum" ), lambda cmpts: (cmpts[0], cmpts[1], ""), )
I also felt the parameters to
canonicalise_reference(reference_type, reference_match, canonical_form):
are in the wrong order.
reference_type
and canonical_form
go together. They originate in the same place in the code, from the data file I mentioned earlier, and if they were in a tuple or wrapped in a little object I certainly wouldn’t argue.
The thing we’re working on, that we take apart and reassemble is reference_match
. To me, that means it should be the first parameter we pass (see Listing 12).
def reference_components(reference_match, reference_type): components = re.findall(r"\d+", reference_match) for group in TypeGroups: if reference_type in group.Types: return group.Parts(components) def canonicalise_reference(reference_match, reference_type, canonical_form): year, d1, d2 = reference_components(reference_match, reference_type) corrected_reference = (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) return corrected_reference, year, d1, d2 |
Listing 12 |
And that I thought was that. And I went to bed.
It’s a new day
The following morning, I got a nudge from my internet fellow-traveller Barney Dellar, who said
I tend to think of for
-loops as Primitive Obsession. You aren’t looping to do something n times. You’re actually looking for the correct entry in the array to use. I would make that explicit. I’m not good at Python, but some kind of find or filter. Then invoke your method on the result of that filtering.
He was right and I knew it. Had this code been in C#, for instance, I’d probably have gone straight from the if
ladder to a LINQ expression.
He set me off. I knew Python’s list comprehensions were its LINQ-a-like, and I had half an idea I could use one here.
However, I thought list comprehensions only created new lists. If I’d done that here, it would mean I’d still have to extract the first element. That felt at least as clumsy as the for
loop.
Turns out I’d only ever half used them, though. A list comprehension actually returns an iterable. Combined with next()
, which pulls the next element off the iterable, and well, it’s more pythonic.
def reference_components(reference_type, reference_match): components = re.findall(r"\d+", reference_match) return next(group.Parts(components) for group in TypeGroups if reference_type in group.Types)
What’s kind of fascinating about this change is that the list comprehension has the exact same elements as the for version, but the intent, as Barney suggested, is very different.
At the same time, Barney came up with almost exactly the same thing, too [Dellar24]. We’d done a weird long-distance almost-synchronous little pairing session. Magic.
Reflecting
This is contrived, obviously, because it’s a single function I’ve pulled out of larger code base.
But, but, but, I do believe that now I’ve shoved it about that it’s better code.
If I was able to work to my way out from here, I’m confident I could make the whole thing better. It’d be smaller, it would be easier to read, easier to change.
The big finish
I’m sure I have made the code better, and I’m just as sure that I’d make the people I was working with better programmers too. I’d be better from working with them - I’ve learned from everyone I’ve ever worked with - but I’m old. I’ve been a lot of places, done a lot of stuff, on a lot of different code bases, with busloads of people. I know what I’m doing, and I know I could have helped.
I’m sorry I couldn’t take the job, but it needed more time than I could give. In the future, well, who knows?
PS
I think it’s important to note I didn’t know where I was heading when I started. I just knew that if I nudged things around then a right shape would emerge. When I had that shape, I could be more directed.
Barney’s little nudge was important too. He knew there was an improvement in there, even if neither of us was quite sure what it was (until we were!). That was great. A lovely cherry on the top.
PPS
I tried to do the least I could at each stage. In one place I took out two characters, in another I changed a single letter. Didn’t always succeed - some of what I did could have been split - but small is beautiful, and we should all aim for beauty.
This comes, in large part, from my man GeePaw Hill [Hill21] and his ‘Many More Much Smaller Steps’. He’s been a big influence on me over the past few years, and I’ve benefited greatly as a result.
PPPS (really, the last one, I promise)
I was proofing this article before pressing publish (which probably means there are only seven spelling and grammatical errors left), when I saw another change I’d make. (See Listing 13.)
def reference_components(reference_match, reference_type): components = re.findall(r"\d+", reference_match) for group in TypeGroups: if reference_type in group.Types: return group.Parts(components) def build_canonical_form(canonical_form, year, d1, d2): return (canonical_form.replace("dddd", year) .replace("d1", d1) .replace("d2", d2)) def canonicalise_reference(reference_match, reference_type, canonical_form): year, d1, d2 = reference_components(reference_match, reference_type) corrected_reference = build_canonical_form(canonical_form, year, d1, d2) return corrected_reference, year, d1, d2 |
Listing 13 |
Again, nothing huge but just another little clarification.
That really is it. For now!
References
[Dellar24] Barney Dellar on Mastodon: https://mastodon.scot/@BarneyDellar/112042140234945492
[Higgins24] The changes on Mastodon: https://mastodon.me.uk/@jezhiggins/112039275413895974
[Hill21] GeePaw (Michael) Hill: ‘Many More Much Smaller Steps’ (MMMSS): a series of five blog posts published from 29 September 2021 to 30 December 2021, available at: https://www.geepawhill.org/series/many-more-much-smaller-steps/
Footnote
- As this is a printed publication, in most listings the very wide lines are wrapped. Listing 1 is presented full-width, as is Listing 6. [Web editor's note: In the web version these listings will have a horizontal scroll bar if the window is too narrow to display the long lines.]
This article was published as two posts on Jez’s blog:
‘To See a World in a Grain of Sand’ (posted 24 February 2024) available from: https://www.jezuk.co.uk/blog/2024/02/to-see-a-world-in-a-grain-of-sand.html
‘If You’re So Smart’ (posted 7 March 2024) available from: https://www.jezuk.co.uk/blog/2024/03/if-youre-so-smart.html
Go to the second post to see all of the listings full-width (and some intermediate steps).
lives on the Pembrokeshire coast, largely to make return-to-office mandates impractical. Truth is, he hasn’t worked in an office for nearly 25 years, and has no intention of starting now. He’s been programming for a living that whole time and thinks he might be starting getting to get the hang of it.