Thursday, March 25, 2010

Erlang Factory: Andy Gross: Distributed Erlang Systems In Operation: Patterns and Pitfalls

Andy is the VP of engineering and social media liability at Basho Technologies. He has been doing work on distributed systems ever since he worked at akami for 7 years. He really wished he knew about Erlang back then. After moving to apple he wrote distributed systems in objective c and so on. After getting sick of that he went to work at mochi media and fell in love with Erlang.

Now at basho he is continuing his love of Erlang. This talk will be Andy's view of how to write distributed systems in Erlang.

Goals:

Decentralized - no single point of failure, no special nodes, no masters and so on.
Distributed - embrace the truly asynchronous nature of of these systems, Networks are faulty, things can go away.
Homogenous - all nodes are capable of doing anything. No node is special. You should be able to take down any node and not worry about having backups - everything is a backup. This is very nice operationally.
Fault Tolerant (emergent goal) - if you strive to achieve the previous goals you are a long way toward a very fault tolerant system.
Observable (a bit of a disjoint goal) - when something goes wrong you need a way to examine the system.

Anti Goals:

Global state -
* the use of PG2 or hot data in mnesia. When using mnesia to serve important parts of the system. This can really hurt you in network partition systems. This is not very asynchronous.
* globally registered names. This is really the same thing. This is replicated data. If you see a split, things can get very funny. This is fragile and you will feel it in production.

Distributed transactions - this again gets back to global state. This is how systems gain global state. This involves temporary consensus on how to change values. The algorithms used for this are flaky in complex production situations (unreliable networks and so on)

Reliance on physical time - reliance on wall clock time is another form of reliance on global state. Rely on logical time, vector clocks, is much better. This type of clock can establish a happened before relationship and help events be ordered.

Compromise your goals

* decentralized (no masters)
You may need to communicate with a central auth or ldap server. This forces a compromise.

* Distributed (nodes use only local data)
you may just have to provide a synchronous interface for clients based on requirements.

* Homogeneous (no node is special)
There may be performance reasons why you need to centralize some calculation on a few nodes.

* Distributed transactions on global state
The whole purpose of your system may be to provide global state. You could be writing mnesia :)

* No reliance on physical time
Vector clocks imply a performance hit. You may need to use physical time. There are also complexity constraints using vector clocks. Vector clocks are not always perfect as well. You may need to use physical time when things can't be resolved.

System Design

Here we talk about the things that distributed systems have in common.

Cluster membership. You need to know what nodes are members of the system. Option 1 is to use a configuration file. This is not very elastic. The use case of your system may be to have the option to add and remove nodes very rapidly. Another option is that you can contact a seed node to join a cluster and then use a gossip based protocol to propagate that membership.

Make sure you don't add any of this complex stuff like vector clocks and gossip protocols just for the sake of it. Don't be an architecture astronaught.

Load Balancing and resource allocation. You need to figure out how you are going to dole work out to all the nodes in your system. Options are:

* static assignment. If you have knowledge of your data set this may work.
* round robin or random assignment may work. Implicit in this is that all units of work are created equal. This can result in overloaded nodes at times.
* static hashing. kind of a sharding based approach. This can work well except it does not behave well in the case of nodes coming and leaving the cluster.
* Consistent hashing. This is used in dynamo, riak, cassandra, voldemort. This is basically a hashing mechanism that behaves much better in the face of nodes coming and going within a cluster.

Liveness checking

This is similar to the notion of cluster membership.

nodes() and net_adm:ping operations can be too low level. They are mechanisms to know about though. When you deploy an app at scale sometimes they can fail. Sometimes you don't want traffic to come to a node that enters a cluster. You want to just run a test or so on, nodes() is way to granular.

Sometines you wnat to divert trafic away from a node for inspection purposes.

Soft state/gossip protocols

This can be acheived by using a gossip protocol. The way this works is, consider the cluster membership situation. You contact a seed node, and that node makes a change to its local state, and then passes the information on, the node it passes on to checks to see if it already has that update and makes a decision to pass on the data and update it's state. This can get complicated because you do need to be able to deal with slightly out of date. Consistent hashing and vector clocks can help here.

Running your system

Don't rely on working Erlang on end user machines.

Ship code with an embedded runtime and libraries. This is an OTP doctrine.

Put version/build info in the code. This seems simple and stupid but very easy to overlook.

Upgrading Code

Hot code loading for small emergency fixes.

For new releases, reboot the node. Why not use the hot code upload? Sometimes systems evolve so quickly that it is hard to use this functionality. I fully intend to use this once riak reaches 1.0 but now it is difficult. Reboots are nice tests of resiliency anyhow.

Debugging Running systems

remote erlang shells work great except when distributed Erlang dies (it happens)

run_erl or even screen gives you a backdoor for when runsh fails.

rebar makes this easy

what if you don't have access to the box?


OPS - Other peoples systems
and other peoples ops teams, using the Erlang shell is a scary thing to them. It should be because it is quite powerful.

use your shell one liners and put them in a module called debug.

Get data out via http/smtp/snmp. Use this to access your debug one liners.

Use disk_log/report_browser. Report browser rb is really nice for combing long sasl logs.

Another nice idea is that you set up your own system that can send http posts back to your web servers for logging so that you can be really proactive about supporting your customers. Make sure this data is ananomized and safe for customers.







1 comment:

Jens said...

Really interesting! Thanks!