A complex machine

Reflections on how a complicated technology setup can be made to work for the team instead of against it.

Intro

I’ve been writing software (and managing people, budgets and projects - all with varying degrees of success) for quite a few years now. Over that time I’ve formed opinions and prejudices, consciously and unconsciously, about what is more likely to work and what is likely to lead to a mess. I joined BetterDoc at the start of 2021, so it’s been a bit over six months at the time of writing, and there may still be clues I’m missing beyond what follows. What I’m about to share is not exactly a lesson in humility; it’s more an observation of the conditions under which a complex technological setup can be a success.

The BetterDoc setup

My past experiences taught me to be distrustful of buzzword-bingo setups; they are usually the result of some bored developer or other releasing a new piece of tech to production that has no justification for being there, except that it’s new and shiny and it made them forget their boredom for a week.

That was also my original impression a little after joining: micro-services (actually: micro-frontends!) developed in at least two different languages (Ruby & Elixir) in a poly-repo setup, each independently packaged with Docker and deployed on AWS EKS. All for a product that essentially caters to about 40 concurrent users at most (BetterDoc is presently a ~100 person company and the main product is used almost exclusively internally). Throw Redis, Elasticsearch and SQS queues into the mix, and I distinctly remember thinking of the word ‘over-engineered’.

Of course I’m not saying that any of the tech above is actually bad - I merely suggest that the accumulation of all this “high-end” tech in one place is often the result of enthusiasm rather than critical thinking. But in BetterDoc’s case, it works.

Why it works

There are multiple reasons why this works, I think. Some of them could be absent and the result would still be a functional (as opposed to a dysfunctional) setup for the most part, but having all of the prerequisites in place leads to a much better stack than it would otherwise be.

It’s all about the people…

Yes, I’m aware that using clichés oftentimes weakens one’s arguments - but it needs to be said anyway.

The developers and coding managers at BetterDoc are almost all seniors, or at the very least possess the mentality and skills of seniors. They’ve built systems before and can intuit what will work and what will not, and they’ve worked within the confines of a team long enough to know which behaviours to avoid. This Netflix engineering manager seems to echo my thoughts on the advantages of seniority.

The same goes for the non-coding managers. These are not people who were promoted into product-owner roles from the inside; they’re experienced professionals with actual qualifications (e.g. information architects) who are in charge of leading the product and have been working on software projects for large parts of their careers. They know how to cater to the needs of developers and how to ensure the success of projects.

…and the attitude

Both groups of people identified above are aware that any software platform needs regular maintenance. Technical work (i.e. the kind of work that does not involve building a new feature) does not take a backseat to more immediate “value-providing” work; instead it is planned and prioritized exactly like anything else. Technical debt is not left to grow unchecked, which in the end makes all kinds of future feature development that much faster.

Going back and fixing things serves another important purpose too: preventing a broken-window style of development. Just as someone entering a messy room thinks nothing of being sloppy and e.g. letting their breadcrumbs fall on the floor, messy code invites similar behavior from the next dev - after all, if it’s already slow, what’s the big deal in adding another N+1 query? This sort of thing sets a precedent, which in turn ratifies sloppiness - not really what one needs in their product. Fixing things leaves little room for bad examples to become the norm, and that’s another reason why maintenance work reduces chaos.

Strong conventions

I’ve worked at companies where each dev could start a new repo with a completely out-of-place naming scheme - and do so again next month, following yet another naming direction. The result, of course, was badly named repos that were hard to discover, had no clear boundaries and could contain just about anything; for companies counting their devs in the hundreds, not having clear repo guidelines really hurts.

At BetterDoc, there are established conventions for naming repositories, log statement format, data exchange between services, UI design, CI setup and many others that prevent things from becoming chaotic.

In contrast with my previous experience, even though there are close to 200 repositories, finding the correct one is more often than not a simple concatenation of where-how-what: where it is located (the logical sub-system), how it performs its duties (the type of service/library it is) and what it does (which is usually just a single thing, thanks to the microservice approach).
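To make the pattern concrete, here is a tiny sketch of the idea; the sub-system, type and responsibility names below are made up for illustration and are not actual BetterDoc repositories.

```elixir
# Illustrative only: hypothetical repo names following the where-how-what
# convention; the real sub-systems and repositories differ.
defmodule RepoNaming do
  # where: logical sub-system, how: type of service/library, what: responsibility
  def build(where, how, what), do: Enum.join([where, how, what], "-")
end

RepoNaming.build("casemgmt", "service", "document-upload")
#=> "casemgmt-service-document-upload"

RepoNaming.build("search", "library", "query-parser")
#=> "search-library-query-parser"
```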

Automation and Tooling

This is one I didn’t notice so much at first (the best stuff just works, right?), but it’s obvious in hindsight. At BetterDoc there exists internal command-line tooling for creating a new service from templates, triggering releases, connecting to production containers etc. This helps reduce uncertainty about how to go about things, and the more predictable something is, the better the tooling that can be written around it. This, in turn, further helps reduce volatility - a positive feedback loop if I’ve ever seen one. CI, e2e testing, Slack and Trello automations all play their parts in improving the quality of devs’ lives and development speed as well.
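I can’t show the internal tooling itself, but to give a flavour of the kind of thing I mean, here is a minimal sketch of a “new service from template” task; the template layout and the {{service_name}} placeholder are assumptions of mine, not BetterDoc’s actual implementation.

```elixir
# A minimal sketch of a scaffolding task, not the real internal tooling;
# template paths and the {{service_name}} placeholder are invented.
defmodule Mix.Tasks.Scaffold.Service do
  use Mix.Task

  @shortdoc "Creates a new service from an internal template"
  def run([name]) do
    template_dir = Path.join("templates", "service")
    target_dir = Path.join("services", name)

    for path <- Path.wildcard(Path.join(template_dir, "**/*")),
        File.regular?(path) do
      destination = Path.join(target_dir, Path.relative_to(path, template_dir))
      File.mkdir_p!(Path.dirname(destination))

      # Copy each template file, substituting the service name as we go.
      path
      |> File.read!()
      |> String.replace("{{service_name}}", name)
      |> then(&File.write!(destination, &1))
    end

    Mix.shell().info("Created #{name} in #{target_dir}")
  end
end
```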

Clear boundaries

There have been previous posts and podcast interviews about how exactly micro-frontends are implemented here, but the gist of it is that each section on a page is served by a different service which handles all concerns around that particular piece of UI: from HTML rendering to data storage, everything is self-contained. If you’ve read the micro-frontends guide on Martin Fowler’s website, it feels a lot more like the ‘Server-side Template Composition’ paradigm than the pure JS/frontend solutions, as there’s an orchestrator service that clients connect to and that manages page requests.
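As a rough sketch of how I picture the composition step - the service names, URLs and placeholder fetch function below are invented, not the real orchestrator:

```elixir
# A minimal sketch of server-side template composition, assuming each page
# section is owned by a separate micro-frontend service returning a
# self-contained HTML fragment. Names and URLs are invented for illustration.
defmodule Orchestrator.PageComposer do
  @fragments [
    {"header", "http://header-service.internal/fragment"},
    {"case-overview", "http://case-overview-service.internal/fragment"},
    {"footer", "http://footer-service.internal/fragment"}
  ]

  # `fetch_fragment` would normally perform an HTTP request; a placeholder is
  # injected by default so the sketch runs without any real services behind it.
  def compose_page(fetch_fragment \\ &placeholder_fetch/1) do
    @fragments
    |> Task.async_stream(fn {_name, url} -> fetch_fragment.(url) end)
    |> Enum.map(fn {:ok, html} -> html end)
    |> Enum.join("\n")
  end

  defp placeholder_fetch(url), do: "<!-- fragment fetched from #{url} -->"
end

IO.puts(Orchestrator.PageComposer.compose_page())
```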

Although perhaps complex-sounding at first, the end design is very reminiscent of the advice M. Nygard gives on splitting services vertically rather than horizontally (i.e. do not waste time creating “REST API” services but rather focus on encapsulating entire data flows within a single service), which is something this particular architecture manages to do by definition. This design also lets developers work on something in isolation while safely ignoring everything outside that particular service’s domain, which leads to faster onboarding and development cycles.

Of course there are times when a service might need data from another one, and conventions once again come to the rescue:

  • If a service needs to make a decision based on data stored by a different service, it can ask for that data using pdi URIs. These are internal URIs that go through a central authorizer service which keeps a catalog of who serves what. This avoids chaotic scenarios where services talk directly to a multitude of other services in the cluster without an explicit listing of these data exchanges; instead a service can only talk directly to its own storage system(s) or the authorizer (as far as internal services go).
  • A service can also emit events. When a service returns an HTML response, it can also return an x-event HTTP header that the orchestrator will forward to the other micro-frontend services on the page that have registered for those events (see the sketch just after this list). There’s also the more familiar kind of event that is published and consumed from queues asynchronously, but these are very limited in scope for now and no patterns have emerged about their usage yet.
  • Intents are the last way a service can exchange data with another. They work as a sort of reverse event - essentially the service asks (by way of the orchestrator) anyone who knows how to fulfil a particular request to do so on its behalf and return the result.
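To make the event mechanism a little more tangible, here is a hedged sketch of how a micro-frontend might attach such a header; it assumes a Plug-based service, and the event name and markup are made up rather than taken from the actual codebase.

```elixir
# Hypothetical sketch: a micro-frontend attaching an x-event header to its
# HTML response so the orchestrator can forward the event to the other
# micro-frontends on the page that registered for it. Assumes the Plug
# library; the event name and markup are invented.
defmodule CaseNotes.FragmentPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    html = "<section>case notes fragment</section>"

    conn
    |> put_resp_content_type("text/html")
    |> put_resp_header("x-event", "case-note-created")
    |> send_resp(200, html)
  end
end
```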

Both events and intents need to be explicitly declared by both emitters and receivers for the orchestrator to forward them, so it’s always easy for devs to figure out the data flow just by looking at configuration files. The result of all that is that data access is structured and well-defined, which makes testing inter-service communication much easier than it would otherwise be.
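For a sense of what such a declaration could look like - purely illustrative, the real configuration format and names are not shown here:

```elixir
# Purely illustrative: a made-up declaration of the events a service emits
# and listens to, and the intents it can fulfil, of the kind the orchestrator
# could read to decide what to forward where.
import Config

config :case_notes, :messaging,
  emits_events: ["case-note-created"],
  listens_to_events: ["case-closed"],
  fulfils_intents: ["render-case-summary"]
```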

Documentation

Documentation is no joke here, and it is the job of every dev to get it done. A service is not considered finished until documentation is in place about what it does and how it does it. Missing documentation is considered technical debt, to be tackled at the next opportunity. Technical decisions that affect the architecture are documented as RFCs, so things like “why was X done using Y” are always explicitly recorded in Confluence. Projects are laid out there too, along with the reasoning behind the proposed implementation - always just a search away from answering questions and offering insight into the workings of the business. Kubernetes guides, database backup/restore guides, dev machine setups and numerous other how-tos also contribute to faster on-boarding, while preventing the chaos that could ensue from e.g. duplicate implementations popping up because a new dev was not aware that what they just built already existed.

Closing remarks

Is everything perfect? Of course not - but the mess, the bugs and the problems always feel manageable, their scope limited. There is a small number of places to look for bugs, and regression tests make sure the same bugs won’t be seen again. Bad code is identified and queued for fixing as part of daily work. If a problem turns out to be complex (e.g. something broke in Kubernetes), the experience of the people here helps sort it out quickly. And if the solution is novel, it gets documented for next time.

In the end, it all works because of deliberate design decisions (instead of what at first glance looked like accidental complexity) and having senior people at hand to support those decisions. Simple and effective.


This article reads much better thanks to my colleagues Lucas M., Gar M. and Christian S., who offered their suggestions before publishing.