M. Emmanuel — Field Notes

Fail fast and loudly

Oct 29th, 2025

There are two major strategies that you can follow when writing software: fail fast and loudly, or try to self-heal and manage the error.

In the context of the buy-side industry, nothing can be worse than trying to self-heal. It is the worst advice that you can give to anyone, because it lowers the quality bar of the entire chain and opens the door to unexpected events, corrupted data, and incomplete processes.

The self-healing myth reflects a poor strategy, a lack of understanding of how reliable software is built, and an implicit acceptance of low-quality software in production. While that can be manageable in certain low-quality industries (think of the business software written by large consultancy firms), it is totally unacceptable in the context of a hedge fund or the buy-side.

Software has to be designed to be so simple that it never fails, and when it does, the failure reflects an error that really needs to be fixed. Writing reliable software is not about being smart; it is about writing simple, readable code that does simple things. The trend of "belt and suspenders" or "handling extreme cases" gives a false sense of security. Software in mission-critical systems has to fail fast and loudly when things are not going as expected, because if things are not going as expected I want to stop operations and fix what is wrong. Think of it like the Emergency Stop in a factory production line: if something seems wrong, you press the button, stop the line, figure it out, and resume only when it is safe.
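As a minimal sketch of the Emergency Stop idea, assume a hypothetical step that reconciles internal positions against a custodian file. The names Position, reconcile, and ReconciliationError are illustrative only and do not come from any real system. The check halts the run at the first mismatch instead of trying to repair the data:

```python
from dataclasses import dataclass


class ReconciliationError(Exception):
    """Raised when internal and custodian positions disagree (hypothetical)."""


@dataclass(frozen=True)
class Position:
    symbol: str
    quantity: int


def reconcile(internal: list[Position], custodian: list[Position]) -> None:
    """Stop the line at the first mismatch; never patch the data."""
    expected = {p.symbol: p.quantity for p in custodian}
    for pos in internal:
        if pos.symbol not in expected:
            raise ReconciliationError(f"{pos.symbol}: unknown to custodian")
        if pos.quantity != expected[pos.symbol]:
            raise ReconciliationError(
                f"{pos.symbol}: internal={pos.quantity} "
                f"custodian={expected[pos.symbol]}"
            )
```

There is deliberately no fallback branch: the caller either gets a clean pass or an exception that names the offending symbol and both quantities.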

This might seem counterintuitive but it is not. Following that pattern simplifies the software since you no longer need to implement endless guardrails. Guardrails tend to be messy, intrusive, and full of small side effects that nobody really understands. By letting things fail fast you are actually in a better position.

The motivation to write self-healing or error-propagating code is strong in many companies. It usually comes from fear and a lack of confidence. Bugs are an integral part of software development, not something to hide. Bad leadership spreads this fear, which leads to the wrong design decisions. It leads to a culture where it is not only acceptable but encouraged to write software that corrupts data and processes rather than crashing loudly in the initial stages.

The buy-side industry has another advantage: data pipelines tend to be stable and the data models have not changed in decades, so software can and should be built to last for years without major changes. When something goes wrong, I want to know it immediately, not next week when the results are already corrupted. This implies accepting fewer and longer iterations, which goes against the usual trend.

Simplicity is always the key. There is nothing better than getting a big beautiful error message. Fear not exit(1). It is a good thing. It means you will be forced to fix the real cause instead of propagating the error with God knows what side effects. If you passed the wrong data to that function, a segmentation fault is better than a silent return 1 or, even worse, an "ignore that particular record and parse the rest" strategy.
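To make the contrast with "ignore that particular record and parse the rest" concrete, here is a small sketch of a strict loader. The record layout (SYMBOL,QUANTITY,PRICE) and the function names parse_trade, load_trades, and main are assumptions made up for illustration. A single malformed record rejects the whole file, and the only error handling lives at the top level, where it prints the message and exits with status 1:

```python
import math
import sys


def parse_trade(line: str) -> tuple[str, int, float]:
    """Parse 'SYMBOL,QUANTITY,PRICE'; raise ValueError on anything unexpected."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    symbol, quantity, price = fields
    if not symbol:
        raise ValueError(f"empty symbol: {line!r}")
    price_value = float(price)  # raises ValueError on non-numeric text
    if not math.isfinite(price_value):
        raise ValueError(f"non-finite price: {line!r}")
    return symbol, int(quantity), price_value  # int() raises ValueError too


def load_trades(lines: list[str]) -> list[tuple[str, int, float]]:
    """No try/except here: one bad record aborts the whole load."""
    return [parse_trade(line) for line in lines]


def main(lines: list[str]) -> None:
    try:
        trades = load_trades(lines)
    except ValueError as exc:
        # Fail loudly: say what is wrong, then stop with a non-zero status.
        print(f"FATAL: trade file rejected: {exc}", file=sys.stderr)
        sys.exit(1)
    print(f"loaded {len(trades)} trades")
```

The lenient alternative would wrap parse_trade in a try/except inside the loop and continue, which produces a shorter list that looks perfectly valid to everything downstream.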

Failures are part of the process of building robust, reliable, and maintainable systems. Fail fast, fail loudly, and then fix it properly. That is how good software gets built and how bad software disappears quietly over time.

Failing fast needs to be designed in from scratch, since it implies that you must be able to resume operations safely. That is a topic for a different post.

Credits: the term "Fail Fast" was popularized by Jim Shore's article of that name in the September/October 2004 issue of IEEE Software.
