Language Design for Unit Testing



As you probably know, unit testing is a process of checking whether each component works as intended in operational terms, by using some concrete example inputs and outputs.

When people discuss how unit tests should be written, they often presume features of specific programming languages or of their compiler implementations. However, languages and compilers are not absolute, universal givens; in principle, the purposes for which we use programming languages should determine how language features are designed, not vice versa.

In this article, I want to consider what kind of design programming languages should have in order to make unit testing valuable and easy to perform, and to work out some ideas about it (though we may not reach a definite conclusion). Some of what this article covers may already have been discussed in the context of software engineering without my knowing it, but I’d like to write down my rough thoughts for now and correct them afterward if necessary.

What kind of “component” is the target of unit testing?

First, I give a clearer definition of the “components” I referred to above. Targets of unit testing are modules or their members in the following sense:

  • Module: a unit for abstraction that hides the detail of implementation from the outside. Modules have a finite number of members, each of which is a function, a type, a nested module, or something like those. They can make some of their members private or provide their type members in an abstracted form.
    • Modules can be implemented depending on members provided by other modules. In many cases, the dependency among modules is required to form a DAG (directed acyclic graph), while some languages provide a mechanism of recursive modules for handling mutually dependent modules.
    • One typical formalization of modules is the one provided by ML (i.e., languages like OCaml or Standard ML). Rust, Erlang, Haskell, and Elm also have mechanisms corresponding to the modules described here and indeed call them modules. Confusing as it may sound, Golang has a similar mechanism called a package.
    • Although not covered in this article, modules often serve as units for separate compilation. In addition, modules in Erlang are units for hot code loading.
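
To make the definition above concrete, here is a minimal sketch in Rust of a module with public members, a private member, and a type member offered in abstracted form. The `stack` module and all of its names are hypothetical and chosen only for illustration.

```rust
// A hypothetical `stack` module; every name here is illustrative only.
pub mod stack {
    /// Abstracted type member: callers can name `Stack`, but the `Vec`
    /// representation is hidden because the field is private.
    pub struct Stack {
        items: Vec<i32>,
    }

    /// Public members: together they form the interface of the module.
    pub fn empty() -> Stack {
        Stack { items: Vec::new() }
    }

    pub fn push(mut s: Stack, x: i32) -> Stack {
        s.items.push(x);
        s
    }

    pub fn pop(mut s: Stack) -> Option<(i32, Stack)> {
        let top = s.items.pop()?;
        Some((top, s))
    }

    pub fn size(s: &Stack) -> usize {
        count(&s.items)
    }

    /// Private member: it cannot be called from outside `stack`.
    fn count(items: &[i32]) -> usize {
        items.len()
    }
}
```

Code outside `stack` can call `stack::push` or `stack::size`, but it can neither call `count` nor inspect `items`; this is the "shell of abstraction" that the rest of the article refers to.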

Many languages have another kind of “component,” described below. We call it a package here. Although packages provide an encapsulation mechanism just as modules do, packages and modules differ in the purpose of encapsulation:

  • Package: a unit for releasing programs. Packages handle compatibility of the interface, and consist of modules that are simultaneously modified.
    • Packages can also depend on other packages, and in many languages the dependency among packages is required to be a DAG.
    • Many languages use the term package for this kind of component. Again, Golang calls this a module.
    • Rust has another layer of units underneath packages, called crates. In Rust, crates may describe the unit of releasing better than packages do. However, because a package consists of one or more (library or binary) crates and must not contain more than one library crate, library crates and packages are virtually the same.

As a side note, packages may also have an encapsulation mechanism for handling whether to make each of their modules visible from the outside. Modules invisible from other packages are those intended to be used within the package. In such cases, it can be said that the division of the responsibility between modules and packages is a bit vague.

In any case, we probably do not have to think of packages as targets for unit testing. If there are unit tests for packages, they can be regarded as tests for the (public) modules of those packages. The rest of this article therefore deals only with modules and members in the sense defined above, not with packages.

Is “you should test only public functions” true?

In many cases, we certainly want to perform unit testing per module and feel that it is sufficient to test only public functions. Arguments like “you should test only public functions” are indeed common, as the following examples show. One day I posted a tweet that cast doubt on the grounds for such arguments, and some people offered guesses about them:

  1. Google Testing Blog: Testing on the Toilet: What Makes a Good Test?

    Tests shouldn’t refer directly to implementation details. The names of a class’s tests should say everything the class does, and the tests themselves should serve as examples of how to use the class.

  2. プライベートメソッドのテストは書かないもの? - t-wadaのブログ (abridged translation: “Shouldn’t we write tests for private methods? - t-wada’s Blog”)

    短くまとめると、プライベートなメソッドのテストを書く必要は 無い と考えています。



    Abridged translation:

    In summary, I think that you do NOT have to write tests for private methods.

    This is because most private methods can be tested through public methods. Private methods are implementation details and do not describe “behaviors observed from the outside,” which are the target of automated tests.

    Note that this argument assumes that you have written both the production code and the test code yourself. For legacy production code that lacks tests and that you can no longer work on, reflection is a powerful tool.

    • This blog post seems to be based on the premise that the target of “automated tests” written by those who have written the “production code” is the “behavior observed from the outside.”
  3. Guess 1: People often make private functions public only for testing them and thereby break the abstraction. To avoid this, one should adopt the rule that only public functions should be tested.

  4. Guess 2: Tests shouldn’t assert properties stronger than those required by the interface of modules; too strong tests will break the modifiability of the implementation.

  5. Guess 3 (a bit odd one): Tests should concentrate on the interface of modules, and the correctness of implementation details should be instead verified by more mathematically exhaustive methods like program verification or formal methods.

I will discuss the validity of each rationale above later; first, I’d like to get back to the essential question: is unit testing essentially only for functions that are made public by the abstraction mechanism of modules? I will go straight to the bottom line: perhaps not. Both public and private functions seem to have good reasons to be tested. Unit tests can be classified into two categories, both of which already have names:

  • Black-box testing: testing for each module’s behaviors observed from outside the module. The target of black-box testing is only public functions, and how to test them can be determined solely by the design of the interface.
  • White-box testing: “source-level” testing for internal implementations. Unit tests of this kind are within the shell of abstraction and should be modified when implementations are changed.

The original claim “you should test only public functions” is equivalent to “you should only write black-box tests.” That is, from this viewpoint, unit tests concern only the interface of modules. This may cause a problem: conceptually, such tests do not care how exhaustively the internal implementation is exercised. Of course, you can still measure the coverage of unit tests quantitatively. However, if you add more tests in response to insufficient coverage, then which tests you prepare depends on the internal implementation, and the concept turns out to be somewhat self-contradictory. In addition, refactoring the internal implementation generally changes the coverage of unit tests and sometimes requires us to split unit tests into more detailed ones. In essence, in such cases you are “virtually doing white-box testing in an awkward manner.”
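
The coverage problem can be illustrated with a small Rust sketch. Everything here is a hypothetical example rather than code from any real library: a public `sorted` function whose private implementation switches algorithms at an internal threshold, `SMALL`.

```rust
/// Internal threshold; not part of the interface.
const SMALL: usize = 16;

/// Public interface: returns the elements in ascending order.
pub fn sorted(xs: &[i32]) -> Vec<i32> {
    if xs.len() < SMALL {
        insertion_sort(xs)
    } else {
        merge_sort(xs)
    }
}

/// Private: used for short inputs.
fn insertion_sort(xs: &[i32]) -> Vec<i32> {
    let mut out = xs.to_vec();
    for i in 1..out.len() {
        let mut j = i;
        while j > 0 && out[j - 1] > out[j] {
            out.swap(j - 1, j);
            j -= 1;
        }
    }
    out
}

/// Private: used for long inputs.
fn merge_sort(xs: &[i32]) -> Vec<i32> {
    if xs.len() <= 1 {
        return xs.to_vec();
    }
    let (left, right) = xs.split_at(xs.len() / 2);
    let (left, right) = (merge_sort(left), merge_sort(right));
    let mut out = Vec::with_capacity(xs.len());
    let (mut i, mut j) = (0, 0);
    // Merge the two sorted halves.
    while i < left.len() && j < right.len() {
        if left[i] <= right[j] {
            out.push(left[i]);
            i += 1;
        } else {
            out.push(right[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&left[i..]);
    out.extend_from_slice(&right[j..]);
    out
}
```

A black-box test written from the interface alone, such as checking `sorted(&[3, 1, 2])`, never reaches `merge_sort`. Choosing an input with at least 16 elements to cover it is a decision that depends on the implementation, which is the awkward form of white-box testing described above.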

That said, the original claim may assume that you should separate detailed implementations into another module if you want to write white-box tests for them, but I don’t see why that assumption is valid. As described above, modules are for the abstraction of implementation. It is not apparent to me that the boundary for testing essentially coincides with the boundary for abstraction.

I return to the aforementioned rationales and raise questions about them here:

  1. “Tests shouldn’t refer directly to implementation details”
    • → I don’t understand the grounds of this claim because they are not stated explicitly.
  2. “most private methods can be tested through public methods”
    • → As I mentioned above, there is conceptually no guarantee that private functions are adequately checked by black-box tests. If we assumed such a guarantee, then replacing an internal implementation without changing the interface would in general require modifying the black-box tests, which contradicts the concept of black-box testing.
  3. “People often make private functions public only for testing them and thereby break the abstraction”
    • → I can understand this explanation as long as the language specification, especially that for the abstraction mechanism, is given and fixed; as initially stated, in principle we have to design the language specification based on the thought about how we want to write tests, not vice versa. Thus the explanation above looks to me like a “symptomatic therapy” rather than a universally applicable concept.
  4. “too strong tests will break the modifiability of the implementation”
    • → You can modify white-box tests when modifying the implementation. White-box testing is, by nature, a method that works this way.
  5. “the correctness of implementation details should be instead verified by more mathematically exhaustive methods like program verification or formal methods”
    • → It would certainly be elegant if we could put such verification methods into practice with ease, but I don’t see why that would rule out white-box tests. Because mathematical verification methods such as type checking with refinement types are neither very fast nor scalable with program size, I think we would probably still want to run white-box tests frequently.

So, my impression is that the reasons given for not writing white-box tests are not very satisfying. I therefore suggest the following principle: programming languages or their ecosystems need to offer a system in which users can write both black-box tests and white-box tests with ease.

How existing programming languages formalize the mechanism of unit testing

  • In Rust, black-box tests are conventionally written in files under the tests/ directory and white-box tests are in source files by using #[cfg(test)], while some people write black-box tests in source files as well as white-box tests.
  • In OCaml, the practice of writing unit tests might not look as prevalent as in other languages (maybe because relatively few organizations use the language in industry, or because users are so accustomed to formal methods that they tend to regard tests as mere child’s play). The language is nonetheless equipped with a system for writing unit tests as a feature of Dune, the de facto standard build system for OCaml programs. Black-box tests are written as modules separate from those of the sources. Although it is not so widely known, white-box tests can be written in source files by using ppx_inline_test. This functionality is realized through the PPX metaprogramming mechanism, and test code will not be included in the resulting binaries. I would like to write an article about writing unit tests in OCaml another day.
  • In Golang, “packages” (i.e., modules in the sense of this article) may consist of more than one source file. The source files constituting the same “package” share the “inside of the abstraction shell.” In general, unit tests that target functions in foo.go are conventionally given in foo_test.go, whether they are white-box tests or black-box ones. You may use this language design as a reference if you prefer to separate tests from sources.
  • EUnit, one of the de facto standard test tools for Erlang programs, requires unit tests to be written in files separate from those of the sources, similar to Golang. Rebar3, a build tool for Erlang programs, has a mechanism for disabling the abstraction, and this allows white-box tests to be separated from sources. For unit tests in Erlang, you can also use meck, a powerful library for mocking.

All languages mentioned here allow users to write both black-box tests and white-box ones.
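
As an illustration of the Rust convention described above, the following is a minimal sketch of a library source file. The function names `slugify` and `normalize_word`, the crate name `my_crate`, and the file name in the trailing comment are all hypothetical.

```rust
// src/lib.rs of a hypothetical crate; all names are illustrative.

/// Public function: part of the interface.
pub fn slugify(title: &str) -> String {
    title
        .split_whitespace()
        .map(normalize_word)
        .filter(|w| !w.is_empty())
        .collect::<Vec<_>>()
        .join("-")
}

/// Private helper: an implementation detail.
fn normalize_word(word: &str) -> String {
    word.chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

// White-box tests: compiled only under `cargo test`, and placed inside the
// abstraction shell so that they can call the private helper directly.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn normalize_word_strips_punctuation() {
        assert_eq!(normalize_word("Hello,"), "hello");
    }
}

// A black-box test would instead live in a file such as tests/slug.rs,
// which is compiled as a separate crate and sees only `slugify`:
//
//     use my_crate::slugify;
//
//     #[test]
//     fn joins_words_with_hyphens() {
//         assert_eq!(slugify("Hello, World!"), "hello-world");
//     }
```

Because files under tests/ are built as separate crates, any attempt to call `normalize_word` from there fails to compile, so the directory layout itself enforces the black-box discipline.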

For better or worse, black-box tests can be written as white-box tests, and thus the extent to which these two kinds of tests should be distinguished depends on the policy of the language design. It’s just my preference, but it would be better for a language to have a mechanism for separating black-box tests from white-box ones, like Rust’s tests/ convention. Ideally, it would be great if the ecosystem could give suggestions like “How about turning this white-box test into a black-box one? It checks the behavior of public functions only.”

A language design for mocking

Unit testing has the fundamental principle that tests should run entirely in memory. Some unit tests write data to files, start a DB server before executing queries, or communicate with external servers, but such effectful operations are harmful: they may cause nondeterministic failures when tests run concurrently and lock the same file, and they prevent tests from running when the external servers are down. As a result, they undermine the stability of tests. The following article points out problems caused by such flaky tests:

At present, avoiding flaky tests usually depends on the care taken by whoever writes the tests. From an engineering perspective, however, it would be much more desirable if we could assert the safety of tests and prevent flaky tests from running. Moreover, it would be quite fascinating if the language could force us to write programs that are easy to test.

Based on this idea, I have wanted to construct a system in which type checking fails if one writes unit tests that call effectful functions. Specifically, I came up with the idea of using an effect system to express whether each function performs effectful operations. For example, we can give the following type to Http.do_request, a function that sends an HTTP request to an endpoint:

module Http :> sig
  type method
  val get : method
  val put : method
  val post : method

  type header = map string (list string)

  type request
  val make_request : method -> header -> string -> request

  type response

  val do_request : request -{E}-> async response

Here, -{E}-> is the effect annotation signifying that the function performs effectful operations such as I/O or communication with another server and thus cannot be completed by in-memory operations alone. Effect annotations are ranged over by \(φ\) defined by \(φ \Coloneqq \mathrm{N}\ |\ \mathrm{E}\). The symbols \(\mathrm{N}\) and \(\mathrm{E}\) stand for purely in-memory operations and for effectful operations, respectively. Function types are generalized to have the form \(τ \stackrel{φ}{→} τ^{\prime}\); the standard function type \(τ → τ^{\prime}\) is equivalent to \(τ \stackrel{\mathrm{N}}{→} τ^{\prime}\) here. As is typical of effect systems, the form of type judgments is generalized from \(\mathit{Γ} ⊢ e : τ\) to \(\mathit{Γ} ⊢ e : τ \mathrel{/} φ\), which reads “under the type environment \(\mathit{Γ}\), the expression \(e\) has type \(τ\), and its evaluation is regarded as \(φ\).” The typing rule for function applications can be defined as follows:

\[\begin{align*} \begin{array}{c} \mathit{Γ} ⊢ e_1 : τ \stackrel{φ}{→} τ' \mathrel{/} φ_1 \qquad \mathit{Γ} ⊢ e_2 : τ \mathrel{/} φ_2 \\\hline \mathit{Γ} ⊢ e_1\ e_2 : τ' \mathrel{/} φ_1 ⊔ φ_2 ⊔ φ \end{array} \end{align*}\]

Here, \(⊔\) is defined by the following:

\[\begin{align*} \mathrm{E} ⊔ φ &\coloneqq \mathrm{E}, & φ ⊔ \mathrm{E} &\coloneqq \mathrm{E}, & \mathrm{N} ⊔ \mathrm{N} &\coloneqq \mathrm{N} \end{align*}\]

That is, \(⊔\) is the join operator on the lattice where the underlying set is \(\{\mathrm{N}, \mathrm{E}\}\) and the order is defined by \(\mathrm{N} \mathrel{⋤} \mathrm{E}\).
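
As a worked instance of the application rule, consider applying Http.do_request to a variable \(r\) of type request. The sketch below assumes that variables and module members are themselves judged \(\mathrm{N}\) (the article does not state a rule for them), and the typewriter-font type names are used purely for illustration:

```latex
\[\begin{align*} \begin{array}{c} \mathit{Γ} ⊢ \mathtt{Http.do\_request} : \mathtt{request} \stackrel{\mathrm{E}}{→} \mathtt{async\ response} \mathrel{/} \mathrm{N} \qquad \mathit{Γ} ⊢ r : \mathtt{request} \mathrel{/} \mathrm{N} \\\hline \mathit{Γ} ⊢ \mathtt{Http.do\_request}\ r : \mathtt{async\ response} \mathrel{/} \mathrm{N} ⊔ \mathrm{N} ⊔ \mathrm{E} \end{array} \end{align*}\]
```

Since \(\mathrm{N} ⊔ \mathrm{N} ⊔ \mathrm{E} = \mathrm{E}\) by the definition of \(⊔\), the call as a whole is assigned \(\mathrm{E}\), even though both subexpressions are assigned \(\mathrm{N}\).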

The effect annotation \(φ\) is not necessarily meaningful when type-checking ordinary programs, but it is when type-checking test code. Given test code \(e\) (and a suitable type environment \(\mathit{Γ}\)), we reject it as inappropriate if \(\mathit{Γ} ⊢ e : τ \mathrel{/} \mathrm{N}\) does not hold. In this fashion, unit tests that use functions with \(\mathrm{E}\), such as Http.do_request shown above, can be statically rejected, and we learn that we should mock such functions in the unit tests. In addition, functions that depend on Http.do_request will also be assigned \(\mathrm{E}\), and thus their types tell us that we should mock them in unit tests without forcing us to look into their implementation. This propagation can be realized by the following typing rule for function abstraction:

\[\begin{align*} \begin{array}{c} \mathit{Γ}, x : τ ⊢ e : τ' \mathrel{/} φ \\\hline \mathit{Γ} ⊢ (λx : τ.\ e) : τ \stackrel{φ}{→} τ' \mathrel{/} \mathrm{N} \end{array} \end{align*}\]

Note that abstractions themselves are assigned \(\mathrm{N}\), since abstractions are immediate values and their evaluation always takes place in memory. If the body \(e\) is assigned \(\mathrm{E}\), then the abstraction \((λx : τ.\ e)\) has a type of the form \(τ \stackrel{\mathrm{E}}{→} τ^{\prime}\), and its application will be assigned \(\mathrm{E}\).
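
The two typing rules can be sketched as a tiny checker in Rust. This is only an illustration under stated assumptions, not an implementation of any existing language: the names `Effect`, `Type`, `Expr`, `infer`, `check_test`, and `example_env` are hypothetical, variables are assumed to be judged \(\mathrm{N}\), and argument types are compared by plain equality.

```rust
use std::collections::HashMap;

/// Effect annotations: N (in-memory only) or E (effectful).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Effect { N, E }

impl Effect {
    /// Join on the two-point lattice: the result is E if either side is E.
    pub fn join(self, other: Effect) -> Effect {
        if self == Effect::E || other == Effect::E { Effect::E } else { Effect::N }
    }
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Type {
    Base(String),
    /// Parameter type, latent effect of the body, result type.
    Fun(Box<Type>, Effect, Box<Type>),
}

pub enum Expr {
    Var(String),
    Lam(String, Type, Box<Expr>),
    App(Box<Expr>, Box<Expr>),
}

pub type Env = HashMap<String, Type>;

/// Infers the type and the effect of evaluating `e` under `env`.
pub fn infer(env: &Env, e: &Expr) -> Result<(Type, Effect), String> {
    match e {
        // Assumption: looking up a variable is an in-memory operation.
        Expr::Var(x) => env
            .get(x)
            .cloned()
            .map(|t| (t, Effect::N))
            .ok_or_else(|| format!("unbound variable: {x}")),
        // Abstraction rule: the body's effect moves onto the arrow, and the
        // abstraction itself is a value, hence N.
        Expr::Lam(x, ty, body) => {
            let mut inner = env.clone();
            inner.insert(x.clone(), ty.clone());
            let (ret, phi) = infer(&inner, body)?;
            Ok((Type::Fun(Box::new(ty.clone()), phi, Box::new(ret)), Effect::N))
        }
        // Application rule: join the effects of both subexpressions with the
        // latent effect on the arrow.
        Expr::App(f, arg) => {
            let (tf, phi1) = infer(env, f)?;
            let (ta, phi2) = infer(env, arg)?;
            match tf {
                Type::Fun(param, phi, ret) if *param == ta => {
                    Ok((*ret, phi1.join(phi2).join(phi)))
                }
                _ => Err("ill-typed application".to_string()),
            }
        }
    }
}

/// Accepts test code only if it is judged N.
pub fn check_test(env: &Env, e: &Expr) -> Result<Type, String> {
    match infer(env, e)? {
        (ty, Effect::N) => Ok(ty),
        (_, Effect::E) => Err("test code is effectful; mock the effectful calls".to_string()),
    }
}

/// Environment with `do_request : request -{E}-> async response` and a
/// variable `req : request`.
pub fn example_env() -> Env {
    let request = Type::Base("request".into());
    let response = Type::Base("async response".into());
    let mut env = Env::new();
    env.insert("do_request".into(), Type::Fun(Box::new(request.clone()), Effect::E, Box::new(response)));
    env.insert("req".into(), request);
    env
}
```

With `example_env()`, `check_test` rejects the expression `do_request req`, while an abstraction that wraps the call is itself judged \(\mathrm{N}\) but carries \(\mathrm{E}\) on its arrow, so any test that applies the wrapper is rejected as well.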


Although only roughly, this article has sketched a general direction for how programming languages should be designed to make unit testing valuable and easy to perform. In summary, I wrote the following:

  • The targets of unit testing are modules (i.e., units of abstraction) and their members, including private ones.
  • There are two kinds of unit tests: black-box tests, which test the interfaces of modules, and white-box tests, which test internal implementations. Programming languages and their ecosystems should allow users to write both kinds of tests.
  • Ideally, automated procedures such as type checkers, rather than humans, should find which functions we should mock in unit tests. It seems that we can indeed realize such a mechanism by using an effect system.