One of the biggest challenges in our business is communicating to non-technical stakeholders exactly why something went wrong in an integration deployment.
The trouble stems from the fact that there are so many components and most of them are outside your control. For instance, if an external data source becomes inaccessible, your code won’t work, no matter how well you tested it during the development lifecycle. Every developer knows that.
Only, in an integration solution, it’s not just one data source that can go down; it’s seven. And then there are other components, like a SOAP Web Service which, in turn, has its own database. Of course, said database is not indexed, so the service often takes 5 minutes or more to serve a single response. Then there’s the legacy ‘system’ that built an API to expose internal functionality but which actually runs on a spreadsheet so bloated with macros that it barely opens. Not to mention the FTP server plus two mounted network drives which the network team have been struggling to get stable for months.
Or what about the Mail Server that rejects notifications at crucial points in a given flow because someone sent their holiday photos to the team and filled the mailbox to capacity? And, finally, there is the app on the five-year-old mobile phone belonging to Fred, the team supervisor, which dies at lunchtime every Monday because that is the day he takes the train to the office across town and he always leaves his charger in the car.
When non-technical stakeholders (i.e. business users) are trying to do their job and something suddenly goes wrong, they automatically shoot the messenger. In our world, that is always the Integration Software.
‘My web application called your API and got an error. Data is not coming through. Why can’t you make your code work? Sort it out!’ is an all-too-common complaint from the application developers and end users who consume integration APIs.
Dealing with these challenges is not something they teach you at university, or in the textbooks. And they certainly won’t cover this in the documentation and tutorials of your favourite integration technology (ahem, MuleSoft).
Of course, there are ways to manage these challenges, and there are things you can do to protect your system from external failures over which you have zero control. There are also steps you can take to educate your users / API consumers.
In broad strokes, you need to consider the following:
- Aim for Reliability over Speed
- Use Async processing wherever possible
- Plan for failures and cater for them
- Be strategic / specific with Error Messaging
- Use notifications to get ahead of the curve
- Write a Strategic Test Plan
- Isolate 3rd Party Product testing
- Mock 3rd Party responses in Unit tests
- Control the message to stakeholders
Reliability over Speed
It is impossible to control the behaviour of external systems, but you can control how your system responds to bad behaviour. Use retry / until successful scopes to give your system every chance of success. When users complain that your system is sometimes slow, explain that it can only be as fast as the external products it relies on, and that you have focused on reliability over speed: they can rely on your system to get the job done, even if it is sometimes slower than expected.
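To make the idea concrete, here is a minimal retry sketch in plain Python rather than in any particular integration product; the endpoint URL, retry count and back-off interval are placeholders invented for illustration:

```python
import time

import requests


def call_with_retries(url, max_retries=5, backoff_seconds=2):
    """Keep trying an unreliable external endpoint instead of failing on the first error."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: let the normal error handling take over
            # wait a little longer after each failure before trying again
            time.sleep(backoff_seconds * attempt)


# Placeholder URL, not a real system
orders = call_with_retries("https://erp.example.internal/api/orders")
```

The numbers matter far less than the shape of the logic: each retry trades a little speed for a much better chance of the flow eventually succeeding.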
Use Async Processing
In the above example, one obvious fix would be to process those pesky email notifications asynchronously. There can be no sillier reason for your overall flow to fail than because an email did not arrive at its intended destination when this has zero impact on your ability to deliver a response to the original requester. Your API consumers don’t care about the email notifications; they just want their data. By processing the notification asynchronously, even if that email is rejected for some reason, at least you can still serve the requested content to your API consumers. You can deal with the email issues separately. There is no reason to hang your solution’s destiny on a bloated mailbox and some holiday pics sent by Dan from Accounting.
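As a rough sketch of the idea in plain Python (the worker pool, `send_notification` and the recipient address are all invented placeholders), the notification is handed to a background worker so the response to the consumer never waits on the mail server:

```python
from concurrent.futures import ThreadPoolExecutor

# A single background worker handles fire-and-forget notifications
notification_pool = ThreadPoolExecutor(max_workers=1)


def send_notification(recipient, message):
    """Stand-in for the real email / messaging call."""
    print(f"notify {recipient}: {message}")


def handle_request(payload):
    result = {"status": "ok", "echo": payload}  # the data the consumer actually wants
    # Fire and forget: if the mail server rejects this, the response below is unaffected
    notification_pool.submit(send_notification, "support@example.com", "New request processed")
    return result


print(handle_request({"order": 42}))
```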
Plan for Failures
If you expect all of your external systems to work as advertised, you are going to be disappointed. Rather, you should expect failures and plan for them. There are any number of ways you can manage external failures: a retry policy as outlined above, caching responses where appropriate, and plenty of others. Which options make sense will be determined by your project’s unique requirements and constraints.
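Caching deserves a quick sketch of its own. The idea, shown here in Python with an in-memory dictionary standing in for whatever caching facility your platform actually provides, is to serve the last known-good response while the source system is unreachable:

```python
import requests

# Last known-good responses, keyed by URL. In-memory purely for illustration;
# a real solution would use a durable or platform-provided cache.
_cache = {}


def fetch_with_fallback(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
        _cache[url] = data      # refresh the cache on every successful call
        return data
    except requests.RequestException:
        if url in _cache:
            return _cache[url]  # serve stale-but-useful data while the source is down
        raise                   # nothing cached yet, so surface the failure
```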
Strategic Error Messaging
It goes without saying that a clear error message sent to a consumer can save you and your support team a world of pain.
A ‘500 INTERNAL SERVER ERROR: Unexpected Error Occurred’ is likely to unleash the wrath of your organisation on you and your team. API consumers (more specifically, their administrators) will tell business users that your API broke. And with good reason; what other recourse have you offered them with that error message?
Consider how any of the following error messages might refocus the attention not on your Integration solution but on the true culprit in a given scenario:
- 500 INTERNAL SERVER ERROR: Operations System Data Inaccessible
- 401 UNAUTHORIZED: ERP System Credentials Invalid
- 408 REQUEST TIMEOUT: HR Web Service (SOAP) Did Not Respond
- 500 INTERNAL SERVER ERROR: Network File Location (/xxx/xxxx) Unreachable
- 500 INTERNAL SERVER ERROR: Response Delayed, Approval Pending (i.e. Fred’s phone has died again)
The outcome (i.e. an error) is the same, but this time the API admin is more likely to contact the correct support team and let stakeholders know that it was a specific system in the network that failed, rather than simply blaming your API.
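One way to get there, sketched in Python with invented system names and endpoints, is to map each upstream dependency to a consumer-facing message that names the failing system instead of hiding behind a generic error:

```python
import requests

# Each upstream dependency gets a label that will appear in the error message.
# The URLs and labels are placeholders; adjust them to your own landscape.
UPSTREAMS = {
    "erp": {"url": "https://erp.example.internal/api", "label": "ERP System"},
    "hr":  {"url": "https://hr.example.internal/service", "label": "HR Web Service (SOAP)"},
}


def call_upstream(name, path):
    """Returns (status_code, body), with error messages that name the failing system."""
    system = UPSTREAMS[name]
    try:
        response = requests.get(system["url"] + path, timeout=30)
        response.raise_for_status()
        return 200, response.json()
    except requests.Timeout:
        return 408, {"error": f"{system['label']} Did Not Respond"}
    except requests.HTTPError as err:
        if err.response.status_code == 401:
            return 401, {"error": f"{system['label']} Credentials Invalid"}
        return 500, {"error": f"{system['label']} Data Inaccessible"}
    except requests.ConnectionError:
        return 500, {"error": f"{system['label']} Unreachable"}
```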
Use Notifications to Get Ahead of the Curve
One of the easiest ways to keep the wolves from the door is to make sure your support team is the first to know that there is an issue somewhere in your API landscape, rather than waiting for end users to alert them to the fact.
By setting up notifications to trigger when specific events occur (e.g. the network drive goes down yet again), your support team can work with the network team to resolve the issue before the business starts to feel the pain. When you do this, everybody wins.
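As a sketch, with an invented webhook URL, health-check endpoint and polling interval: a small watcher that tells the support channel the moment a dependency stops responding, well before the first user complaint lands:

```python
import time

import requests

SUPPORT_WEBHOOK = "https://chat.example.internal/hooks/integration-support"  # placeholder
NETWORK_DRIVE_CHECK = "https://files.example.internal/health"                # placeholder


def alert_support(message):
    # Post to the support team's channel; email or a ticketing API would work just as well
    requests.post(SUPPORT_WEBHOOK, json={"text": message}, timeout=10)


def watch_dependency():
    while True:
        try:
            requests.get(NETWORK_DRIVE_CHECK, timeout=10).raise_for_status()
        except requests.RequestException as err:
            alert_support(f"Network file location unreachable: {err}")
        time.sleep(300)  # check every five minutes
```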
Strategic Test Plan
A strategic test plan should aim to test the components of your solution in isolation. This means:
- Creating a comprehensive set of tests for each of your system’s third-party components. Simple request / response. That way, if the system starts acting silly, you can run your tests against it and determine whether the external system has changed in some way or whether the error only occurs after your component gets hold of it (there is a minimal sketch of such a test after this list).
- Mock 3rd Party responses in Unit tests. In the same way that you need to test external systems in isolation, so you need to test your own Integration logic and flows in isolation. The last thing you need is for your tests to fail because the DBAs decided to restore an old version of their Test database which everybody knows has data quality issues. Your tests will fail, but this won’t be due to a problem with your flow logic. In essence, you want to make sure your unit tests are testing your flows and business logic, not the quality of external data (see the second sketch after this list).
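First, a minimal example of a standalone request / response test aimed directly at a third-party system; the endpoint, timeout and expected field are assumptions for illustration. Because none of your own integration code is involved, a failure here points squarely at the external system:

```python
import unittest

import requests

OPERATIONS_API_URL = "https://ops.example.internal/api/orders"  # placeholder endpoint


class OperationsSystemSmokeTest(unittest.TestCase):
    """Talks to the external system directly; no integration code in the loop."""

    def test_service_responds_within_timeout(self):
        response = requests.get(OPERATIONS_API_URL, timeout=60)
        self.assertEqual(response.status_code, 200)

    def test_response_contains_expected_fields(self):
        payload = requests.get(OPERATIONS_API_URL, timeout=60).json()
        self.assertIn("orders", payload)


if __name__ == "__main__":
    unittest.main()
```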
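Second, a sketch of a unit test that mocks the third-party response so the assertion exercises only your own logic; `enrich_order`, the injected client and the canned customer record are invented for the example:

```python
import unittest
from unittest.mock import MagicMock


# --- flow logic under test (hypothetical, for illustration) ---
def enrich_order(order, customer_client):
    """Adds the customer name to an order using an injected external lookup client."""
    customer = customer_client.get_customer(order["customer_id"])
    return {**order, "customer_name": customer["name"]}


# --- unit test with the third-party client mocked out ---
class EnrichOrderTest(unittest.TestCase):
    def test_enrich_order_adds_customer_name(self):
        fake_client = MagicMock()
        fake_client.get_customer.return_value = {"id": 7, "name": "Acme Ltd"}
        order = enrich_order({"order_id": 1, "customer_id": 7}, fake_client)
        # The assertion exercises our mapping logic, not the external system's data quality
        self.assertEqual(order["customer_name"], "Acme Ltd")


if __name__ == "__main__":
    unittest.main()
```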
Control the Message to Stakeholders
You need a clear way of explaining why things went wrong when dealing with non-technical stakeholders. This is not about assigning blame. Rather, we need to do this to ensure that all stakeholders understand where the business needs to focus its attention to resolve issues quickly. The worst outcome is to have business stakeholders lose faith in a perfectly good solution and even consider replacing it with an inferior one — at huge cost to the business, I might add — when, in fact, the true culprit may be a legacy system that should have been deprecated years ago.