Hunting Ghosts

Experts needed to find out that shit happens


For one task I need to start a group of services, provided through a docker-compose.yml. It worked yesterday, and today I'll finish my task with a developer's test of the complete integration. But startup takes seconds longer than usual, and I have cultivated that feeling deep inside: where there's a delay, there are dragons ahead.

After searching through my sources I find no reliable theory why a database container could reject connections, but it does, keeping all the other containers from starting – they depend on this database. In the end, in an act of desperation, I ask a colleague for assistance. Finally we have a look at disk space, and it is nearly full: of 16 GByte, a pile of 14.5 GByte is in use. Running multiple stages of docker [network | volume | container | image] prune frees nearly the entire space. Docker never complains about this situation, and neither does the database container.
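Had we known where to look, the diagnosis would have been a one-liner. A sketch of the checks and the cleanup, assuming a Linux host with Docker's default data directory (on other setups the path differs):

```shell
# A nearly full disk can make containers fail in confusing ways,
# with no explicit "disk full" error from Docker or the database.
df -h /var/lib/docker 2>/dev/null || df -h /

# What Docker itself is holding on to (images, containers, volumes, build cache)
docker system df

# Reclaim space stage by stage (each command asks for confirmation)
docker container prune
docker image prune
docker volume prune
docker network prune
```

docker system prune --volumes does all of the above in one step, but it is worth reviewing what gets deleted first – pruning volumes in particular is irreversible.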

Another system calls outbound third-party providers, and it works in almost all cases. But sometimes the connection is reset by the outbound service; this affects 1% of all calls. The logging system is of no great help, as it only tracks inbound calls in detail. Outbound calls are logged as a one-liner with the URL and response code. The service provider simply denies what the log clearly states. Thus I can neither solve the issue nor assist in solving it; it is simply left to be negotiated between contractor and provider. In the end, an HTTP client had a (default) retry policy that was triggered by a malformed parameter. So the outbound provider's system recognized repeated attempts with a malformed parameter and reset the connection – threshold exceeded. Because of the retry policy it looked as if only one attempt had failed. In fact there were 25, but only the final state caused a log message. It would have been great if the log message had outlined the failed attempts to some degree.
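The effect is easy to reproduce in miniature. The sketch below is hypothetical – the real client, its retry count, and the provider's endpoint are not in the logs – but it shows how a default retry policy silently collapses many wire attempts into a single log line:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("outbound")

MAX_RETRIES = 25     # hypothetical client default
attempts_made = []   # records every real attempt on the wire

def call_provider(url, param):
    """Simulated outbound call: a malformed parameter is always reset."""
    attempts_made.append(param)
    if param is None:  # stands in for the malformed parameter
        raise ConnectionResetError("connection reset by peer")
    return 200

def call_with_retries(url, param):
    """One logical call that retries silently, like a default retry policy."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            status = call_provider(url, param)
            log.info("GET %s -> %s", url, status)
            return status
        except ConnectionResetError as exc:
            last_error = exc  # swallowed: no log line per attempt
    # Only the final state is logged - many wire attempts look like one failure.
    log.info("GET %s -> reset (after %d attempts)", url, len(attempts_made))
    raise last_error
```

A per-attempt log line, even at debug level, would have made the 25 attempts visible immediately – and would have matched what the provider saw on their side.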

And every day I start my work with Docker (on Windows). I need to restart the Hyper-V-Manager first, then start Docker for Windows again, and it works. If I don't do it in this order I get errors about ports already in use, failures of the virtual network, and various container start hiccups, rendering an entire service mesh inaccessible from outside and partly broken inside.

There are many settings in the Hyper-V-Manager as well as in Docker for Windows regarding startup and shutdown behavior, and my colleagues and I tried many combinations. None of them worked. So this is "have you tried turning it off and on again" taken to the next level.

In numerous other cases it is simply a matter of time, or of one or two restarts, to get things working again. Be it an application gateway, an authentication service issuing OpenID tokens, or a process that cannot be killed for no obvious reason. I'd say that today, with virtualization and container technology, I spend not just 25% of my time hunting such ghosts but already a third. Another third is spent in meetings, and the rest is shared between assistance, unit testing, and a very small amount of coding. Then working time is over, and I continue by reading in detail the technical articles on why it is useless to struggle to tame technology with more technology. It only serves as a counterweight to keep a healthy balance.

I'd really like to satisfy the customer's requirements on paper but implement them the way that actually works. As long as my savings aren't too high and I stay only slightly over budget, I'd say nobody will come asking. Maybe that is why nobody talks about the true nature of a working backend: it would render all academic books pointless.