Thursday, May 27, 2010

Why Not To Use Distributed Supervision

Take distributed supervision as an example. Say we have one supervisor on node A that supervises children on node B and node C. What happens if the connection from node A to node B slows to a crawl or drops? The supervisor on node A cannot see what is actually going on. What should it do in this case? It has no contact with node B, so it can't tell whether the child is still running. Even if the child isn't running, the supervisor can't start a new child on node B. Should it start a child on node C (assuming it can still communicate with node C)? And what should it do if node B comes back and the original child is still running there? This is the simplest of the possible distributed supervision problems. It gets more subtle from here on out.
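To make the ambiguity concrete, here is a minimal sketch of what a process on node A actually learns when it monitors a child on another node. The module name `remote_watch` and the functions `classify/1` and `watch/1` are made up for illustration; only `erlang:monitor/2` and the shape of the `'DOWN'` message come from Erlang itself. The point is the `noconnection` reason: it says the connection to the remote node was lost, not that the child died.

```erlang
-module(remote_watch).
-export([watch/1, classify/1]).

%% Map the reason carried by a 'DOWN' message to what the watcher
%% can honestly conclude about the monitored process.
%% (Hypothetical helper, not part of OTP.)
classify(noconnection) -> unknown;      % link to the node lost; child may still be alive
classify(noproc)       -> not_running;  % no such process when the monitor was set up
classify(_Reason)      -> exited.       % the process really terminated

%% Monitor Pid (which may live on another node) and block until
%% the monitor fires, returning the classification and raw reason.
watch(Pid) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {classify(Reason), Reason}
    end.
```

In the `unknown` case nothing the watcher does is obviously right: restarting elsewhere risks running two copies of the child, and waiting risks running none.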

The fact that this kind of distributed supervision is even built into Erlang has more to do with Erlang's historical platforms than with its general usefulness. In the early days, Erlang was designed to run on big telecom machines. These machines contained a bunch of separate computers all in the same cabinet, all directly and tightly connected via a hardware backplane. The TCP pipe was integrated into this backplane. If the backplane went down, the entire system was screwed. The designers didn't have to worry about network partitioning, slowdowns, etc., so they didn't. Given that context, the approach to distribution that they took makes sense, but it also means that these things don't really work in more realistic, real-world network scenarios.

This does not mean that you shouldn't use the distribution primitives that are provided. Not at all. In many situations it's perfectly acceptable to base a distributed system on the built-in node-to-node message passing. You just need to be aware of and handle the types of network failure cases that your distributed application is likely to encounter.
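As one rough illustration of handling such failures explicitly, the sketch below subscribes to node status changes with `net_kernel:monitor_nodes/1` and keeps its own list of currently reachable nodes. The module name `node_watch`, the `handle/2` helper, and the `{reachable, From}` query message are assumptions made up for this example, and it assumes the node was started as a distributed node. What to do on `nodedown` is application-specific; this only shows where that decision would hook in.

```erlang
-module(node_watch).
-export([start/0, loop/1, handle/2]).

%% Spawn a tracker that asks net_kernel for nodeup/nodedown messages
%% and starts from the nodes that are connected right now.
start() ->
    spawn(fun() ->
        net_kernel:monitor_nodes(true),
        loop(nodes())
    end).

%% Pure bookkeeping: update the reachable list for one event.
%% (Hypothetical helper; a real application would also react here.)
handle({nodeup, Node}, Reachable)   -> lists:usort([Node | Reachable]);
handle({nodedown, Node}, Reachable) -> lists:delete(Node, Reachable).

loop(Reachable) ->
    receive
        {nodeup, _} = Event ->
            loop(handle(Event, Reachable));
        {nodedown, _} = Event ->
            %% A node went away or became unreachable; we cannot tell which.
            loop(handle(Event, Reachable));
        {reachable, From} ->
            From ! {reachable, Reachable},
            loop(Reachable)
    end.
```

Keeping `handle/2` pure makes the bookkeeping easy to check in isolation, separate from whatever recovery policy the application settles on.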
