Names are important – improving use of terms in software engineering

November 14, 2016

In our field, few things are more important than reading code, which — except for one-man army companies — involves numerous developers reading and trying to understand the same code. One code base is read way more often than written or refactored (if not, you’re doing it wrong), hence the importance of a common understanding of the terminology. Herein I want to present challenges with examples, the different types of scope applying to a term and tips to improve on your use of terminology to foster better communication within companies and elsewhere.

Examples

Terms can occur at various semantic locations in code. Here are some examples (some real, some contrived) from the financial sector in which I’m working. The language assumed in this article is C++, but the recommendations apply equally to other languages.

  • Type names (class/struct/enum/interface depending on programming language): Account

  • Function names: validateIban

  • Method names: getBankList

  • Module and namespace names: Billing::Aggregation

Terminology scopes

Even if you’re working in the same area, you might not immediately understand each of the example terms above in the way they are used in code and conversations in my company. The reason is that different levels and types of scope apply.

Global scope

Globally familiar words are often clear by themselves, and typically can be looked up in an English dictionary without having more information on the context. The dictionary should usually give no ambiguities. One example is Billing, which expresses that the topic is centered around bills in some sense. If all code had globally intuitive words only, our understanding would be perfect! Reality is that there’s almost always a context which describes (required) details.

Exceptions are truly global names. Think of public trademarks and product names known worldwide. These don’t need explanations anymore, but still the companies behind them have to ensure that the understanding does not get altered through the years. Some people wish for the term "Java" to only relate to an island… A good example is "Microsoft Office". It is familiar all around the globe even for non-tech people (keep this hint in mind).

Local scope (field of expertise, company, team, project, module, etc.)

Even if Billing is easily comprehensible, the example namespace Billing::Aggregation is probably gibberish to someone who is not from the financial sector, or even to new hires who don’t know yet how the company handles bills (namely, in this example, by aggregating some key figures). The word may therefore just as well be very specific to the company, with a different understanding in other businesses.

A local context also applies to the method name getBankList. Without extra information like a documenting comment, the signature is not enough to find out 1) what kind of list this is, 2) if "get" means downloading, retrieving via remote call or parsing from file, and so on. At best, the surrounding class/module/project is clear enough to the reader of the code to understand the concept later, or provides a clarifying unit test or sample input.

Terms may even be team-specific (or become so over time) for various reasons: access to sensitive code may be restricted, the software company grows and splits up teams, topics of teams are not interrelated and there are no intersection points for shared code/guidelines/rules, …

The worst case is when the scope gets so narrow that, earnestly, "it’s all in the code" (only). Below, I will list some recommendations for choosing names, and related tips, to not get to this point (I call it "people lock-in", as in "vendor lock-in").

Ambiguous contexts

The aforementioned "local scope" is one context for a term. If one word can be understood in different ways, it is contextually ambiguous. You don’t want to collect many such terms in your code base, or else switching to another project will leave you confused about why accountName suddenly means "human description of bank account" when elsewhere it meant "username of login credentials".

Likewise, even if terms have a unique meaning across the code base, they may be represented as different types and therefore again cause an inconsistent understanding. For example, let’s assume that "account" is always a bank account, but once there’s a variable AccountInfo account and in another spot int64_t account. "The latter is quite obviously the account ID in the database!" — oh no sorry, Mr. Original Code Author, it’s not obvious!

Another popular piece of information stored in data structures is string address (compatible with all international addresses! 🌍) being represented differently elsewhere: struct Address { string address1; string houseNo; […​] }. These are even incompatible in conversion.

The key to overcoming ambiguities is to stay consistent in naming. In the majority of cases, the main term (here: "account"), which by itself is not explanatory, can be suffixed to give a meaningful variable name: int64_t accountId and AccountInfo accountInfo (still nicely readable if the type is omitted: const auto accountInfo = […];).
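As a small illustration of the suffix convention, here is a minimal sketch. The AccountInfo fields and the findAccountInfo helper are made up for this example and are not taken from any real code base.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record type, for illustration only.
struct AccountInfo {
    std::int64_t accountId;  // database ID of the bank account
    std::string ownerName;
};

// The parameter names alone tell the reader what is expected:
// a collection of full records and the ID to look for.
[[nodiscard]] inline auto findAccountInfo(const std::vector<AccountInfo>& accountInfos,
                                          std::int64_t accountId)
    -> std::optional<AccountInfo> {
    for (const auto& accountInfo : accountInfos) {
        if (accountInfo.accountId == accountId) {
            return accountInfo;
        }
    }
    return std::nullopt;
}
```

Had both the parameter and the loop variable been called account, the comparison inside the loop would be much harder to read.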

Implementation detail scope

While validateIban very obviously validates IBANs, knowing only the function name doesn’t say anything about how the function works. It requires at least the function signature and possibly a documentation comment to grasp the semantics. A company or development team may have its own concept of how all validateXYZ functions should work, e.g. throw a specific exception or return false on an invalid value, and even if that concept is "well-known", it’s a notion that must be transferred to new hires. Such an induction to the company’s development practices is of course necessary for new developers, but too much will overload those people, resulting in small details being forgotten. Let’s say you forgot what the function validateIban returns for an empty input string. It’s a very important detail, and the sanest way would be to consider an empty value as invalid, because then the caller can decide whether an empty/optional value is allowed depending on the use case. Yet this detail is not found in the name (granted, it’s hard in this case without getting wildly over-verbose names).

Here are a few alternative function signatures (C++):

  • auto validateIban(const std::string& s) -> bool; This suggests to the reader that the function does not throw and returns whether the input is valid or not. It does not say what happens in the case of an empty string, but as stated above, this could just be left out because there’s a sane default behavior. Nevertheless, following the "verbSubject" naming principle, a better signature would be [[nodiscard]] auto isValidIban(const std::string& s) -> bool, which makes it even clearer that the function doesn’t throw but returns a boolean. Developers don’t even have to read the full signature to use it correctly, and are warned (starting with C++17) if the return value is unused by mistake.

  • auto validateIbanOrThrow(const std::string& s) -> void; Here the void result type and the "OrThrow" suffix in the name make it totally clear that the function will throw on invalid input. Whether you include the type of exception in the signature or name is a question for your style guide (e.g. template<typename TExc> … to make it explicit). Personally, I’d just throw a standard exception here (std::invalid_argument), and stay consistent in similar functions.

  • No function at all. Use strong typing to ensure that at the relevant spots, only valid IBAN arguments can be passed in (e.g. auto extractAccountNumberFromIban(const ValidIban& iban) -> std::string;). Along the same lines, introduce the practice of validating inputs at the input boundary (e.g. remote call), not just later where you could forget to call validateIban by accident. This will also improve your error handling because you will fail earlier, and can write functions that make assumptions about their inputs and thus may even become exception-free. As mentioned in the linked post, using strong types is probably overkill if done throughout your code, so this is also something for the style guide, or to decide per case.
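To make the three alternatives above a bit more tangible, here is a minimal sketch of how they could sit side by side. Mind the assumptions: the validity check is simplified (overall length bounds plus the mod-97 checksum, no country-specific formats), and the ValidIban class with its parse factory is a hypothetical design of mine, not an existing library type.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Alternative 1: boolean predicate, never throws on invalid input.
// Simplified assumption: only length and the mod-97 checksum are checked.
[[nodiscard]] inline auto isValidIban(const std::string& s) -> bool {
    if (s.size() < 15 || s.size() > 34) {
        return false;  // this also treats the empty string as invalid
    }
    // Move country code and check digits (first four characters) to the end.
    const std::string rearranged = s.substr(4) + s.substr(0, 4);
    unsigned remainder = 0;
    for (const char c : rearranged) {
        if (c >= '0' && c <= '9') {
            remainder = (remainder * 10 + static_cast<unsigned>(c - '0')) % 97;
        } else if (c >= 'A' && c <= 'Z') {
            // Letters count as two-digit numbers: A = 10, ..., Z = 35.
            remainder = (remainder * 100 + static_cast<unsigned>(c - 'A' + 10)) % 97;
        } else {
            return false;  // lowercase, spaces and symbols are rejected
        }
    }
    return remainder == 1;
}

// Alternative 2: the name announces that invalid input leads to an exception.
inline auto validateIbanOrThrow(const std::string& s) -> void {
    if (!isValidIban(s)) {
        throw std::invalid_argument("invalid IBAN");
    }
}

// Alternative 3: a strong type that can only be created from valid input.
class ValidIban {
public:
    [[nodiscard]] static auto parse(const std::string& s) -> std::optional<ValidIban> {
        if (!isValidIban(s)) {
            return std::nullopt;
        }
        return ValidIban{s};
    }

    [[nodiscard]] auto value() const -> const std::string& { return value_; }

private:
    explicit ValidIban(std::string s) : value_(std::move(s)) {}
    std::string value_;
};

// Declaration only: callers cannot pass an unchecked string here.
auto extractAccountNumberFromIban(const ValidIban& iban) -> std::string;
```

With the strong type, validation happens once at the input boundary via ValidIban::parse, and everything behind that boundary can rely on the type instead of remembering to call a validation function.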

Factors other than context

Surely the context defines to which domain a term belongs. Nevertheless other influences can help determine whether a name or term makes sense to use.

Complexity and language

The first influences I want to summarize here are complexity and (spoken) language. Internationally mixed development teams can be found in almost all companies. Language and culture barriers are the most influential factors to be aware of when it comes to creating a common understanding of technical and personal themes. Hence it’s no wonder that the English language reigns over software development in both coded and spoken words. To account for cultural differences, complex English vocabulary should be banned where reading or listening comprehension is important.

To give an example, I want to name the Unicode standard. After reading 20+ articles (incl. the famous one by Joel Spolsky) about Unicode in the last 15 years, and continuously learning about its updates, its terminology is still only partially burned into my brain. The sheer count of terms is high, but in my opinion not the issue, since the memory of a software developer is quite durable once a term is clear. Can you distinguish UTF-8, UCS-2, UCS-4, UTF-16{BE,LE}, UTF-32, (UTF-7), character, glyph, character set, code point, surrogate pair, BMP, BOM, U+1F4A9? I have no problem recalling their meaning when I see them, but what really makes my brain smoke are the non-technical things mentioned in that list: is a glyph a fully rendered code point, or a partial symbol? How did they define character again? Was it the same as a code point? Just look for a minute at their glossary and you’re going to be overwhelmed as well. In summary, the standard, the related myriad of blog posts, true/half-assed/false answers on StackOverflow and other resources are simply an overload for the software industry, and simplifying now takes a huge amount of effort. If we had used UTF-8 everywhere (great simplified glossary there!) 20 years ago, there probably wouldn’t be crazy inventions like MySQL’s UTF-8 variants (yes, your UTF-8 enabled database probably cannot store all of Unicode!):

For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it.

See, complexity and amount of terminology is like a growing company — the smart ones can handle growth easily by keeping things simple and stupid, while the typical response to growth is levels of management, performance reviews, more business, less "family" feeling, or in other words: complexity.

Ambiguous wording

Imagine you’re in one well-defined context, have chosen simple English words that need no explanation in your opinion, and developers you ask tell you they understand the meaning immediately. What could possibly go wrong? You’ve landed a set of terms to be carved in stone. They will name a dictionary after you! Well, probably not… In reality, this is long before the finish line.

One area about which I have a really strong opinion is filesystem terms. Those have been around for ages but are still forcefully en-ambiguated (my opposite of disambiguated) or highly confused in code all the time, to the point where it’s not funny anymore. The problem is that even if the words are clear, and you were given Tanenbaum’s book on operating systems in class, the terms are still used way too interchangeably. Find below some examples of ambiguous wording, including my proposals and what people also use as alternatives. I’m using lowerCamelCase examples here to also nitpick about spelling differences. This was the motivation to start writing this blog post, so sorry about the lengthy commentary! I’d like to hear comments on this admittedly very opinionated section:

  • file, f, path, p, filepath, filePath, filename: In operating system terms, a "file" can be a regular file, symlink, hard link, socket, FIFO, other special types or a directory. Often it is perfectly fine to use the terms "file" and "directory" to tell (regular) files apart from directories. Just think of the famous error message "No such file or directory". Usage depends a little on the use case, but mostly readers of code will simply understand because it is clear that a file with content is being read, or a directory is listed, for instance.
    But: file != path != filePath != filename ☝️. First of all, "filepath" is a spelling that nobody uses, so you shouldn’t either, while "filename" is funnily the typical spelling (not "fileName"), just like "filesystem" exists in some dictionaries alongside "file system" (I don’t have a preference there). A "file path" is a path that points to an (optionally existing) file, and is mostly used in code to mean a regular file (or transparently a regular file behind a symlink). The difference from a "path" is that the latter can point to any file type on the system, including a directory. Using the variable name path is therefore probably underspecified and not a good idea if the intention is specific. Using p alone as a variable name is much worse than the familiar abbreviations f (to represent a file handle) or i (for loop indices).
    Moreover, people don’t seem to get the difference between filename and file path. A "file name" is the name of a file entry (mostly within a directory, but without exposing that context), e.g. "Hello.cpp", while its path may be any path pointing to that file, e.g. "/tmp/Hello.cpp" or "C:\SuperSource\Hello.cpp" (absolute paths), or "../../private/tmp/Hello.cpp" or — equaling the filename — "Hello.cpp" (relative paths).
    Last, if I were to see a variable called file, in C++ I’m most likely to guess that it’s a file input stream, while many people use that name in place of a file name or path, which is greatly misleading and semantically wrong. This is a case for a naming guideline, since different opinions exist, and it’s also slightly dependent on the programming language — in Python, I would use f for an input file stream and out or out_file for a writing stream, while in other languages such short variable names are unusual.

  • directory, dir, dirPath, folder: In my memory, it was mostly Microsoft coining the term "folder". Wikipedia explains that a "folder" is just the graphical metaphor that represents a directory on the filesystem, and that e.g. Windows has special folders (like "Photo library") that don’t map directly to a directory on disk. Therefore in code, the correct term is almost always "directory" or an abbreviation (dir). Compared to file, the variable name dir by itself says even less about its meaning: unless you’re working with directory handles, you couldn’t infer what dir should stand for, or whether it might represent an absolute directory path or something else. So oftentimes, this had better be dirPath, or if the variable name includes the meaning (it should!), I’m tempted to omit the *Path suffix: bankStatementsDownloadDir.
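The distinctions from the list above map quite directly onto std::filesystem (C++17). The following is only a sketch of how the names could be applied; makeFilePath and readFirstLine are hypothetical helpers invented for this example.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// dirPath: path of a directory; filename: name of the entry inside it.
// The result is a file path, so the function name says so.
[[nodiscard]] inline auto makeFilePath(const fs::path& dirPath,
                                       const std::string& filename) -> fs::path {
    return dirPath / filename;
}

// filePath is the location, file is the opened input stream.
// Returns an empty string if the file cannot be opened or has no content.
[[nodiscard]] inline auto readFirstLine(const fs::path& filePath) -> std::string {
    std::ifstream file(filePath);
    std::string firstLine;
    std::getline(file, firstLine);
    return firstLine;
}
```

Given a filePath, the standard library hands back the other two pieces under the same names used in this article: filePath.filename() yields the filename and filePath.parent_path() the directory path.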

Recommendations

In no particular order:

  • Simple English: Use words that are taught internationally and resolve to one clear meaning when looked up in a dictionary. Ideally, readers should not even have to look them up. It starts with easy terms like "replace" instead of "substitute", and continues to native-level complexity (missing reasonable bad examples here, sorry), or even to words that are only understood in certain English-speaking countries.
    Code that reads like English sentences is often the best choice for later comprehension.

  • No code names. Made-up words and names, or acronyms, can carry a nice memory or story behind a project, but should not leak into the writing of its source code. Stay with clear English wording that other people can grasp.
    Also: prefer short names over abbreviations. Please stay away from stupid acronyms and be smarter than governments, research institutions and armies who use letter abbreviations everywhere. Example: STYLE = "Strategic Transitions For Youth Labour in Europe" (you gotta be kidding me!)

  • Comprehensible by non-techies: If terms are important and publicly visible to other departments or consumers, name them accordingly. "Billing aggregated information per merchant" is much better than "Merchant tx sums" (totally contrived 😉). "Microsoft Office" is much better than "Humble Write Bundle".
    I could write a whole book about this item done wrong in public-facing user interfaces and applications. Assume Google sent you an e-mail "Login from unknown IP abcd:beef:1234:::1 with device supermario". Now estimate how many of the people in your neighborhood would react to such a mail, or even know what an "IP" (or IPv6) is. In reality, Google is much smarter, and the title for an unknown login alert currently reads "Someone has your password". While this could also be a spam subject, the average tech user is much more likely to react to clickbait titles warning about a virus or stolen password than to titles they don’t understand. No technical details like IP, device name or location are shared by Google’s alert (only after the click), but instead there’s a single, fat button "REVIEW YOUR DEVICES NOW". So far, this is wonderful naming and perfectly smart design to attract people to security measures: an outstanding example.

  • Maintain a technical glossary page or a good practices project: Create a Wiki or intranet page for developers to look up commonly used terms. You could even add the recommended variable name(s) in there for important concepts. Don’t pack too many words in there and don’t grant other (non-technical) departments write access, because otherwise they might quickly pile up half-true or unrelated descriptions of things that developers don’t even need to know, or must have a much deeper technical understanding of. If you’re one of those "our Wiki is always outdated" or "our Wiki is write-only" companies, you could instead "appoint" a best practices code project, so to speak a flagship project that does most things (including naming) right and consistently. Newcomers should learn good practices from that project. In my team at work, for example, we develop implementations for many payment methods (e.g. Credit Card or PayPal are payment methods) based on the same module interface, so implementations only (need to) vary slightly in their overall logic and naming concepts. We implicitly know which projects are the ones we wrote this year, and as such are the ones where we applied the most modern practices and conventions to stay consistent or introduce better terminology. These latest projects can be seen as the starting point for any new project. In addition, we have a Wiki page outlining important points to consider for these similar implementations, much like a checklist (not related to terminology per se, just as a general hint).

  • Provide examples: If there is a core spot where a term stems from or which defines the main usage, for instance a module that parses the important company report called "monthly aggregated Blobby Volley results and player of the month", that code repository probably should contain a relevant unit test and sample file/input where reviewers can later look up what makes up this report (can be anonymized data), what its output would look like, and probably a short explanation of its meaning for the company. Alternatively, I imagine explanatory articles on the company Wiki, structured in reasonable order/hierarchy of topics, and linked in the glossary.

  • Ambiguous meanings: In many cases, code and terminology grew historically and you can’t easily change names anymore — accept the fact and try to disambiguate as far as possible. If a term "account" is ambiguous between two projects, let’s say project A ("LoginService", where it stands for login credentials) and B ("BankAccountService", here it represents bank account information), then ensure the ambiguous term doesn’t slip from project B into A, and vice versa.
    If both meanings really need to be mixed within one code repository, use namespaces, type and variable name prefixes or suffixes to overcome the ambiguity: loginAccountInfo and bankAccountInfo. Before introducing terms, check which ones already exist, or else you won’t be able to disambiguate easily. For instance, Rust’s package manager cargo uses the word "target" publicly for both "build target" (as in make <targetname>) and for "target architecture" (alias platform triple, e.g. "x86_64-unknown-linux-gnu"). This is mostly clear in the code because internally it’s most often called "platform", but the slight annoyance remains because the public configuration key is still called target and will remain so for a long time to stay backward-compatible.

  • Use one consistent name and stop the typos already to make code grep-able. This lets you search through the whole code base and see where a term or type/variable name is actually in use. If you consistently used accountInfo for all places where you store a local variable about bank account information, you can more easily rename all places to the new desired name bankAccountInfo. Side note: in reality, renames tend to be a bit less trivial, though. The same applies to sentences such as public error messages: if they are all identical, or even in one shared linked library, it’s easy to fix/amend/replace/gettext-translate them.

  • Ensure a given name is clear within the desired scope: If you have a method getBankList, you should make sure that the parent class describes what it is about — e.g. DeutscheBundesbankXmlBankListParser is a little exaggerated but clearly says it parses the XML bank list of the German federal bank. The bigger the scope is, the more important good naming is for types and items that lie within. Imagine this class was part of a shared library that you’re selling to customers!

  • Function names should be verb-followed-by-subject where a reader should be able to infer the output from the verb ("validate" in our example was not helpful).
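For the recommendation on ambiguous meanings, a sketch of what namespaces plus prefixed variable names could look like when both kinds of "account" really have to meet in one place. The namespaces Login and Bank, their fields and the describeTransfer function are assumptions made up for this illustration.

```cpp
#include <string>

// Hypothetical: "account" in the sense of login credentials.
namespace Login {
struct AccountInfo {
    std::string username;
};
}  // namespace Login

// Hypothetical: "account" in the sense of a bank account.
namespace Bank {
struct AccountInfo {
    std::string iban;
    std::string description;  // human description of the bank account
};
}  // namespace Bank

// Both meanings appear in one signature; the prefixes keep them apart
// even where the namespace is not visible, e.g. after "const auto".
[[nodiscard]] inline auto describeTransfer(const Login::AccountInfo& loginAccountInfo,
                                           const Bank::AccountInfo& bankAccountInfo)
    -> std::string {
    return loginAccountInfo.username + " pays from " + bankAccountInfo.description;
}
```

Because the two types are distinct, accidentally swapping the arguments is a compile error and not a bug that only a careful reviewer would catch.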

I hope this list proves helpful to see terminology from a different perspective and allows you to take action enhancing your practices and sweeping out old, nonsense names from your code.