If you are hoping that this blog will give you the magic bullet for exception handling, I have to disappoint you right away. In my opinion there is no single magic bullet, because there are so many different scenarios that require different types of exception handling. Exception handling does, however, have one thing in common across all organizations: it is one of the last things that is thought about and implemented in your process. In general, the implementation focuses primarily on the happy path, which is understandable as this is the scenario that will give (the most) business value. But if you think about exception handling at the start of the project, you can save a lot of work and time. This leads to a shorter time-to-market and lower costs.

In this blog I will share some thoughts that you can take into consideration when thinking about exception handling.

First you have to identify the exception scenario(s) you want to handle. For each exception scenario you want to deal with, you will have to think about who (this can be multiple parties) needs to be informed. Once you know whom you need to inform, you also have to think about how you want to inform these parties. The answers to these questions will have an impact on the design of your process.

Service interface

The points mentioned in the introduction can have an impact on the interface of the service when your process is started as a (web) service.

There are two options for the interface when starting your process as a service.

  • The process is started using a Fire-And-Forget message or as Request-Response where the response message is only an acknowledgement for receiving the request. In both cases no information is returned to the party starting the process about the status of the process.
  • The process is started as Request-Response and (part of) the validations are performed so the party starting the process can be asked to correct/complete the information needed during the process.

Service availability

The choice between these options depends on the availability of the services needed to validate (and process) the information passed as part of the request message. When the necessary services are only available during a service window, you can also choose to use a combination of the two options. During the service window you can do the validations and processing immediately, and outside the service window you can send a response stating that the request will be processed as soon as possible.
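The combination of the two options could look roughly like the sketch below. It is only an illustration: the names `Response`, `in_service_window` and `handle_request`, the status values and the `validate` callback are all assumptions made for this example, not part of any specific product.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, List


@dataclass
class Response:
    status: str          # "ACCEPTED", "REJECTED" or "QUEUED" (hypothetical values)
    messages: List[str]


def in_service_window(now: time, start: time, end: time) -> bool:
    """True when `now` falls inside the window; also handles windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def handle_request(request: dict, now: time, start: time, end: time,
                   validate: Callable[[dict], List[str]]) -> Response:
    """Validate immediately inside the service window, only acknowledge outside it."""
    if in_service_window(now, start, end):
        errors = validate(request)
        if errors:
            # The caller gets the chance to correct or complete the request
            return Response("REJECTED", errors)
        return Response("ACCEPTED", [])
    # Backend services are unavailable: acknowledge receipt and process later
    return Response("QUEUED", ["Request will be processed as soon as possible."])
```

Keep in mind that a request that is only acknowledged outside the service window can still fail validation later, so the party starting the process may need to be informed through another channel in that case.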

Retry mechanism

While validating and processing the ‘information’ in the request message, it is likely that several different services are called in the process. Each of these service calls can fail. So, instead of immediately informing a party that a service call has failed, you might want to retry the same service call first (maybe after an interval period). This will, of course, depend on the type of error. For example, if the service doesn’t return an answer in time (time-out), you could retry it after a minute. But if the service returns an error stating that the object you are looking for doesn’t exist, or that the update request doesn’t meet certain requirements, it would be ineffective to retry after a minute, because you will receive the same error over and over as long as you don’t change anything in the data.

If you decide to retry certain service calls when they are failing (with a certain response), you don’t want to do this indefinitely. So you will have to provide a retry mechanism that counts the number of retries and informs a party after the last retry has failed.
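A minimal sketch of such a retry mechanism is shown below. The exception classes `RetryableError` and `PermanentError`, the `notify` callback and the default of three attempts with a 60-second interval are assumptions chosen for illustration; in a real process you would map them to the errors your services actually return.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical: a transient failure such as a time-out."""


class PermanentError(Exception):
    """Hypothetical: a failure a retry cannot fix, e.g. 'object does not exist'."""


def call_with_retry(call: Callable[[], T], notify: Callable[[str], None],
                    max_attempts: int = 3, interval_seconds: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures a limited number of times, then inform a party."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except PermanentError as exc:
            # Retrying would return the same error, so inform a party right away
            notify(f"permanent failure: {exc}")
            raise
        except RetryableError as exc:
            if attempt == max_attempts:
                notify(f"giving up after {attempt} attempts: {exc}")
                raise
            sleep(interval_seconds)  # wait before the next attempt
    raise AssertionError("unreachable")
```

The `sleep` parameter is only there so the waiting can be replaced in tests; in a process engine the interval would more likely be a timer event than a blocking sleep.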

Switch mechanism

Aside from a retry mechanism, you can also think about a switch mechanism. You would want to use this functionality when you know a service is failing constantly and you don’t want to call the service until the error has been corrected (e.g. with time-outs or database failures). This functionality can save you money when you have to pay per service call or for processing time used by the called service. The switch will have to be reset once the problem is solved. This can be done automatically (e.g. when the service supports ping functionality to see if it is responding again after time-outs) or manually by a maintenance person.
Moreover, you will have to think about what to do with the service calls that were not sent while the switch was active. Are you able to store these requests and process them one by one once the switch is reset? Or do you have to inform a party that the backend service is not available and that the process will therefore not be able to start at all?
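The sketch below illustrates one possible switch that stores unsent requests and replays them on reset. This idea is commonly known as the circuit breaker pattern. The class name `ServiceSwitch`, the `send` callback and the threshold of three consecutive failures are assumptions for this example only.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class ServiceSwitch:
    """Blocks calls to a failing service and stores requests until reset."""

    def __init__(self, send: Callable[[Any], Any], failure_threshold: int = 3):
        self._send = send
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0
        self.blocked = False                 # True = service is not called
        self.pending: Deque[Any] = deque()   # requests stored while blocked

    def call(self, request: Any) -> Any:
        if self.blocked:
            self.pending.append(request)     # store instead of calling
            return None
        try:
            result = self._send(request)
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                self.blocked = True
            raise                            # the caller still sees this failure
        self._consecutive_failures = 0
        return result

    def reset(self) -> List[Any]:
        """Re-enable the service and replay stored requests one by one."""
        self.blocked = False
        self._consecutive_failures = 0
        results = []
        while self.pending:
            try:
                results.append(self._send(self.pending[0]))
            except Exception:
                self.blocked = True          # still broken: keep the rest stored
                break
            self.pending.popleft()
        return results
```

`reset` could be triggered manually or by a periodic ping. Note that the calls that failed before the switch was activated are not stored here; whether to retry those is a separate design decision.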

Service Level Agreements

In addition to services that are failing, you can also have (sub) processes and/or tasks that fail to be processed within a time period defined up front. These time periods are called Service Level Agreements (SLAs). When an SLA is not met, you will want to act on this situation. This can be done in several different ways, depending among other things on the type of user (if any) currently working on the process and tasks.

When it takes too long for a task to be picked up you can increase the priority of the task so it will appear higher or with a different look-and-feel in the task list of the employee.

Also, managers (of the employee) can be informed when processes or tasks are taking too long. They can decide themselves what to do with the process and/or task. They could for example increase the priority manually or reroute a task to somebody else who has a lower workload than the current employee working on the task.

Another example is when information is requested from a party outside the (regular) process and the process is on hold until the party provides the requested information. This could be, for example, a missing form from a customer. The (sub) process or task waiting for the information can have an SLA defined on this request. If the information is not sent before the SLA expires, you can send reminder messages. If that doesn’t result in the desired information, the next step can be a cancellation of the request. At this step, you could inform the requestor that the request has been cancelled.
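The reminder-then-cancel escalation can be reduced to a small decision function, sketched below. The function name `next_action`, the action names, the maximum of two reminders and the rule that each reminder grants one more SLA period are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def next_action(requested_at: datetime, now: datetime, sla: timedelta,
                reminders_sent: int, max_reminders: int = 2) -> str:
    """Decide what to do with a request that is waiting for external information."""
    # Assumption: each reminder gives the party one more SLA period to respond
    deadline = requested_at + sla * (reminders_sent + 1)
    if now < deadline:
        return "WAIT"
    if reminders_sent < max_reminders:
        return "REMIND"
    return "CANCEL"   # also the moment to inform the requestor
```

For example, with an SLA of two days and no reminders sent yet, the function returns "WAIT" after one day and "REMIND" once two days have passed.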

SLAs can be defined on different levels in your process: on process level (from beginning to end), sub process level or task level. Of course these SLAs have to be in sync with each other, because it is usually useless to define an SLA of one week on a task when the process containing this task has an SLA of one day.
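A check like this can be automated. The sketch below walks a process definition and reports every step whose SLA is longer than that of the level above it. The `Step` structure and the function `find_sla_conflicts` are hypothetical; a real process engine will have its own model.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class Step:
    """A process, sub process or task with its own SLA (hypothetical model)."""
    name: str
    sla: timedelta
    children: List["Step"] = field(default_factory=list)


def find_sla_conflicts(step: Step) -> List[str]:
    """Return a message for every child whose SLA exceeds its parent's SLA."""
    conflicts = []
    for child in step.children:
        if child.sla > step.sla:
            conflicts.append(
                f"{child.name} ({child.sla}) exceeds {step.name} ({step.sla})")
        conflicts.extend(find_sla_conflicts(child))  # check deeper levels too
    return conflicts
```

This only compares each step with its direct parent. It does not check whether the SLAs of sequential tasks add up to more than the SLA of the process that contains them.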

I hope this blog has given you enough food for thought. Hopefully it has inspired you to think about exception handling in ways you haven’t thought about before, and especially to think about exception handling at the beginning of your project. This way you can create your own magic bullet by combining the ideas in this blog and applying them to your specific situation.


 About René

René van Seuren Senior BPM Test BPMCompany I am a Senior BPM Consultant and I like to work on complex integration issues combining processes on one side and service integration on the other side. I aim for short and long term solutions to satisfy both functional (business) and technical (IT/Ops and platform) needs. These solutions need to be understandable for all relevant parties (business, architects, developers and maintenance).