Hapaxes, Singletons and Anomalies

Overload Journal #143 - February 2018 + Journal Editorial   Author: Frances Buontempo
Programmers can be odd. Frances Buontempo celebrates many manifold peculiarities.

Having failed to write an editorial to date, I wondered if I could manage to revisit my attempts at natural language processing. As ever, this distraction has led me astray and I have not written an editorial. As yet, I still cannot manage to automatically generate anything that is worth reading. Previously I searched for common words [Buontempo12], and discovered we have often had ‘test’ referred to in articles. It would be interesting to rerun this, and see if we have a new theme emerging. Write in if you get round to this.

An interesting alternative to discovering common words or phrases is hunting unique words, or hapaxes. Wikipedia [Wikipedia-a] describes a hapax legomenon as a word which appears only once in a piece of text, or in the body of an author’s works. Sometimes this is mis-ascribed as a word only used by a particular author. I’m not sure if there is a term for that. Made-up? A hapax is a word the author only uses once. By extension, you may have dis legomena, which occur only twice. The designation doesn’t appear to scale beyond two.
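As a minimal sketch of hunting hapaxes, the Python below counts words and keeps those seen exactly once (or exactly twice, for dis legomena). The function names, the sample sentence and the regular expression deciding what counts as a word are all assumptions made for illustration, not anything used to analyse Overload.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words. The regex is a simplifying assumption
    about what a 'word' is: runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def legomena(text, times=1):
    """Return, in sorted order, the words occurring exactly `times` times."""
    return sorted(w for w, n in word_counts(text).items() if n == times)


# Hypothetical sample text, purely for illustration.
sample = "the cat sat on the mat and the dog sat on the log"
hapaxes = legomena(sample)                 # words used once
dis_legomena = legomena(sample, times=2)   # words used twice
print(hapaxes)
print(dis_legomena)
```

A real corpus would need more care over hyphens, numbers and non-ASCII letters, so treat the tokenisation here as a placeholder.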

Stepping beyond word analysis, programming has many examples of things that are singular. The first idea that springs to mind is the singleton pattern. Many attempts to enforce only one instantiation of an object using variants on this pattern lead to convoluted and hard-to-follow code. Rants happen in online forums or round the water cooler. Many readers will have their own war stories to tell, I am sure. A hapax in literature will tend to come about naturally, perhaps indicating a writing style or the subject matter. I would hazard a guess that a specific unique word is infrequently singled out and used deliberately just once in a written work. In contrast, the singleton pattern is used frequently. Indeed, the idea of a programming pattern implies frequent use. If a problem workaround or neat trick only comes into play in one set of circumstances, it is not a pattern. Patterns, by their very nature, repeat. The internet [etymonline] suggests an etymology from the word patron, ‘something serving as a model’, tracing use of a pattern as a model for dress making to Jane Austen in the eighteenth century. Software patterns don’t help us make dresses, but do help us talk about contexts and situations more abstractly. Names give us power. Isis finds out the true name of Ra, the Egyptian sun god, by a trick [Wikipedia-b], and thereby gains power to put her son on the throne. In the fairy tale Rumpelstiltskin, the girl is freed from a promise when she discovers the imp’s true name. Why does Victor Frankenstein never name his creature? Perhaps he abandons his creation, running away rather than naming it. Perhaps naming it would be some claim of power over it, which he knew he did not have. A recent book [Shelley17] explores this, and many other issues raised by the story. Names matter. However, enough of patterns, patrons and names.

The patronus, of course, is an entirely different matter. This advanced magic spell from Harry Potter [Pottermore] is the only known defence against Dementors. It appears to act as a form of guardian angel. In the Harry Potter mythos, these are unique to people: each has their very own, often taking on an animal form. Straying into mythology of guardian spirits and animal totems will take us far, far away from programming! Do you have a single thing you turn to for help? Do you have something to ward off Dementors? Hold that thought for now. We’ve considered single words, names and lack of names. Let’s think about counting.

Many algorithms have more than one implementation. How many ways are there to sort a list? Can you order this list of ways to sort a list, in terms of time or memory efficiency? How many ways can you invert a matrix? Perhaps you would use a numerical library instead. Which one though? In order to count or enumerate anything you need an idea of equality or at least comparison. To claim something is singular, you need to show it is unlike anything else. For a hapax, you need to stem words. Would matriarch and matriarchy count as one word? Though they are not identically equal, natural language processing tends to trim or stem words, lopping off endings and plurals to get to the essence of a word. Your context will drive your choice or implementation of equality or comparison. As for enumeration or counting, you should see how difficult it is to define numbers. My final year dissertation for my undergraduate degree was entitled, ‘What’s a number?’ I am still not sure. You can attempt to answer this question by building on set theory, adding an element to a set to make it one bigger, thereby neatly attempting to avoid the need to define one. Or even add. Two sets are the same size if you can put them in one-to-one correspondence, which put simply means constructing a function which pairs each element of one with exactly one element of the other. Of course, that very sentence uses the words one and two, when we haven’t even managed to define one yet, let alone two. Words are difficult. Numbers are impossible? Magic? Useful? You decide.
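To see how the choice of equality changes what counts as a hapax, here is a rough sketch comparing exact matching with a deliberately naive stemmer. The suffix list and the minimum stem length are assumptions invented for this example; real stemmers, such as Porter's algorithm, apply far more careful rules.

```python
from collections import Counter

# Hypothetical, far from complete suffix list; longest suffixes first.
SUFFIXES = ("ies", "es", "s", "y")


def naive_stem(word):
    """Lop off the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def hapaxes(words):
    """Return, in sorted order, the items that occur exactly once."""
    return sorted(w for w, n in Counter(words).items() if n == 1)


words = ["matriarch", "matriarchy", "patron", "patterns", "pattern"]

exact_hapaxes = hapaxes(words)                            # equality by spelling
stemmed_hapaxes = hapaxes(naive_stem(w) for w in words)   # equality by stem
print(exact_hapaxes)
print(stemmed_hapaxes)
```

With exact matching every word in the sample is a hapax. After stemming, matriarch and matriarchy collapse into one word, as do pattern and patterns, leaving only patron.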

A hapax or a singleton might strive to be, or accidentally be, unique. An observation of such an item can be informative. An observation of several somewhat anomalous items can also be informative. Many data mining or machine learning toolkits have outlier detection capabilities. Any observation of something that appears slightly odd, unusual, or out of place can be intriguing. Understanding what is usual or normal is one thing, but finding a curiosity is quite another. If something out of the ordinary happens, can you calculate the odds of this occurring? With an a priori distribution, for example half heads and half tails for a fair coin toss, you can mathematically calculate just how surprised you should be by a run of ten heads in a row. For other situations, you have no expected distribution up front. You can use evidence to attempt to work out an underlying statistical distribution, but past performance is no guarantee of future performance. You need to try and see if experiments match your predictions.
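The fair coin case can be worked through directly. This small sketch assumes independent tosses with a fixed probability of heads; the function name is made up for the example, and measuring surprise in bits is one convention among several.

```python
import math
from fractions import Fraction


def probability_of_run(p_heads, run_length):
    """Probability of `run_length` heads in a row, assuming independent tosses."""
    return p_heads ** run_length


fair = Fraction(1, 2)
ten_heads = probability_of_run(fair, 10)

# One way to quantify surprise: the information content in bits.
surprise_bits = -math.log2(float(ten_heads))

print(ten_heads)       # 1/1024
print(surprise_bits)   # 10.0
```

A run of ten heads from a fair coin should turn up roughly once in a thousand attempts, which is rare but far from impossible.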

You can drop the attempt to fit data statistically, and try some machine learning techniques. Many of these don’t require rigorous mathematics to draw conclusions. This leads to derision from tormentors, disparaging machine learning as curve fitting. Quora [Quora] has a short discussion on this subject. Fitting a curve to data can be interesting. In fact, if you chuck out outliers you can get a better curve. Or at least a smoother one, with fewer kinks and a far simpler equation. How do you decide which points to ignore? Sometimes a decision is made up front. LIBOR, the London Interbank Offered Rate, averages the rate banks are offered when they want to borrow money. The data submitted covers a few currencies over several time periods. This is used to calculate a benchmark of the interest rate on loans. The top and bottom quartiles are discarded, so the figures are based on the middle numbers. Why? Possibly an assumption that the extremes were anomalies or even lies. It is very hard to find out now, because most online resources concerning LIBOR point to the various fixing scandals [for example, Wikipedia-c]. The London Review of Books wrote an interesting overview, in the midst of the financial crash in 2008 [LRB]. The process might involve an email of the numbers, mid-morning, but could be followed up with a phone call if the email didn’t arrive or the numbers looked wrong. Very hi-tech! There are two significant points here. First, the data was based on what bankers claimed they could borrow money at. Since the various scandals, there have been moves towards using actual transactions to find a benchmark. Using actual data rather than hearsay can give you different outcomes. Second, the extremes: the top and bottom quarter of the numbers are ditched, on the grounds they are or could be outliers.
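Ditching the top and bottom quarter before averaging can be sketched as a trimmed mean. The submissions below are invented numbers, and rounding the quartile size down is an assumption of this sketch; the real benchmark rules were more detailed than this.

```python
def interquartile_mean(values):
    """Average the middle values after discarding the top and bottom quarter.

    The quartile size is rounded down, an assumption made for this sketch.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    cut = len(ordered) // 4
    middle = ordered[cut: len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical submitted rates, with one suspiciously high and one low.
submissions = [2.10, 2.15, 2.12, 2.11, 2.14, 2.13, 2.90, 1.20]

plain_mean = sum(submissions) / len(submissions)
trimmed_mean = interquartile_mean(submissions)
print(plain_mean, trimmed_mean)
```

Here the two extreme submissions drag the plain mean around, while the trimmed figure depends only on the four middle values.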

If you have data with a Normal distribution, you can work out how likely a value is. If your data has a mean of zero and variance of one, a reading of 0.1 is reasonable. A reading of 100.1 is less likely. Much less likely. That, of course, doesn’t mean it’s wrong. You can’t reject something on the grounds it’s a bit of a surprise. Sometimes you can establish a reason for the unexpected observations. Perhaps you have a measuring instrument that needs recalibrating. Sometimes you might find your assumption of a normal distribution was incorrect. Sometimes you might need to introduce a hierarchical model, essentially introducing if-then-else to deal with some specific circumstances. Nassim Taleb has written about surprising events that have major impacts. He calls these Black Swan events. He states three criteria for such events:

  1. They are a surprise
  2. They have a major impact
  3. Once they have happened, various theories are concocted to show they were bound to happen and that you can predict when they will happen again

Taleb described these as “stemming from the use of degenerate metaprobability” [Wikipedia-d]. We are driven to explain something, assuming the world follows predictable patterns that make sense. Taleb emphasises that some events are unprecedented, and therefore cannot be reasoned about.
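To put rough numbers on the earlier Normal distribution example, with a mean of zero and a variance of one, the sketch below computes how likely a reading at least that far from the mean would be. The function names are made up for illustration, and the calculation only means anything if the Normal assumption holds, which is exactly what a Black Swan calls into question.

```python
import math


def z_score(reading, mean=0.0, variance=1.0):
    """Distance from the mean, measured in standard deviations."""
    return (reading - mean) / math.sqrt(variance)


def two_sided_tail(z):
    """P(|Z| >= |z|) for a standard Normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))


p_reasonable = two_sided_tail(z_score(0.1))    # most readings are further out than this
p_surprising = two_sided_tail(z_score(100.1))  # vanishingly small under this model

print(p_reasonable)
print(p_surprising)
```

A tiny tail probability says the reading is a surprise under the model. It does not say whether the reading or the model is at fault.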

Some one-off events have catastrophic effects. In contrast, some outcomes may be surprising, but good. If you just happen to notice a config change which explains why your production system is broken, you might be able to fix things quickly. How likely were you to spot the problem? It depends. It might be unlikely, especially if the config was for a different system entirely, but for some spaghetti entanglement reasons it had a knock-on effect. I’m sure you have your own examples. A miracle? The philosopher David Hume attempted to define a miracle in An Enquiry Concerning Human Understanding (originally published 1748, and republished many times since). He defined a miracle as a violation of a law of nature, by the intervention of a Deity. He regarded this definition as a way to stop “all kinds of superstitious delusion” [Wikipedia-e]. The thinking goes that a miracle must be a singular event, and so the weight of evidence shows any witness of such an event is unreliable. Critics suggest he has got his maths wrong. I’ll leave you to decide whether miracles are possible. Whatever conclusions you draw, some events are ordinary while some are extraordinary.

As a programmer, you are extraordinary. I struggled to find reliable data to back this up. Computerworld [Computerworld13] claimed in 2013 there were about 18.2 million developers worldwide. Of seven billion or so people, that’s not many. It’s not clear how you count programmers. By profession? Incidentally making spreadsheets from time to time? CAD designers? Teenagers teaching themselves to code, alone at home? Stack Overflow runs a survey annually to investigate developer demographics [Stackoverflow17]. However many there are, it seems most have at least 20 years’ experience, are white and are male. If you are black, female or new to programming you are extraordinary. Nonbinary, genderfluid, genderqueer, trans, agender, etc. programmers are extraordinary too. If you are a white male programmer with twenty or more years’ experience, you are also extraordinary. Being described as an anomaly or outlier sounds negative. Who wants to be normal though? Everyone is unique. That’s a good thing. If you are derided for being a geek, fear not. If you are frequently told, “Normal people don’t do that”, remember you are extraordinary. Find some like-minded people to chat to or work with, or join an organisation of like-minded programmers such as the ACCU, and enjoy yourself. If you then end up feeling like a fraud because everyone around you has more experience than you, or seems to know more details than you, don’t let the imposter syndrome take over. Even if you don’t believe in magic, or miracles or guardian angels, find a way to say “Expecto Patronum!” at the Dementors trying to undermine you.

References

[Buontempo12] Overload 110, Aug 2012, ‘Allow me to introduce myself’, https://accu.org/index.php/journals/1904

[Computerworld13] https://www.computerworld.com/article/2483690/it-careers/india-to-overtake-u-s--on-number-of-developers-by-2017.html

[etymonline] https://www.etymonline.com/word/pattern

[LRB] https://www.lrb.co.uk/v30/n18/donald-mackenzie/whats-in-a-number

[Pottermore] https://www.pottermore.com/features/what-is-a-patronus

[Quora] https://www.quora.com/Is-Machine-Learning-just-glorified-curve-fitting

[Shelley17] Frankenstein: Annotated for Scientists, Engineers, and Creators of All Kinds, Shelley, Guston, Finn, Robert, Robinson. June 2017, MIT Press.

[Stackoverflow17] https://insights.stackoverflow.com/survey/2017

[Wikipedia-a] https://en.wikipedia.org/wiki/Hapax_legomenon

[Wikipedia-b] https://en.wikipedia.org/wiki/True_name

[Wikipedia-c] https://en.wikipedia.org/wiki/Libor

[Wikipedia-d] https://en.wikipedia.org/wiki/Black_swan_theory

[Wikipedia-e] https://en.wikipedia.org/wiki/Of_Miracles
