When Edge becomes the Center - A shift to coordination-free database architectures

Nearly all of the software we use today is part of a distributed system in the cloud. Apps on your phone work in concert with hosted services; even legacy desktop operating systems and applications like spreadsheets and word processors are tightly integrated with distributed backend services in the cloud.

With the growth of mobile technologies such as IoT and 5G, more and more application services will operate at the edge, making the edge increasingly relevant relative to centralized clouds. Multiple consortia and platforms already exist to help move computation out of the cloud and toward the edge, such as OpenFog, Microsoft Azure IoT, and AWS IoT Greengrass. But this is just the beginning: these platforms are far from exhausting the new opportunities created by the expansion of the edge. New capabilities will appear at the edge that were never possible with the cloud alone.

It is also clear that the locus of control is moving from large centralized data centers managed by a small group of organizations to a logical infrastructure that is much more dispersed. As the Internet continues to evolve toward improved reliability and connectivity, increased bandwidth, and reduced latency, it will settle into a new equilibrium in which the role of the edge is much more important. One could say that the edge is becoming the center.

The infrastructure at the edge is made up of decentralized points of presence (PoPs), micro data centers in 5G towers, and edge gateways spread around the globe, connected by high-latency, unreliable wide-area networks. The dispersed nature of edge infrastructure creates some interesting challenges for distributed systems.

For example, at the edge, an app or device needs to do its job on its own, without asking for permission from someone else: what if that other app or device is busy? What if it is itself waiting on something? What if the network is slow or disconnected?

Another important ingredient at the edge is sharing; as explained in this post, edge computing has a distributed data problem. Take the example of a multiplayer game: I want immediate feedback from my game device, but I also need to observe the game universe around me.

Cloud databases are not suitable for the Edge

Today’s databases are cloud-native and built on the idea that “truth” is at the center, in the cloud. This is straightforward for simple client/server systems, but becomes much more complex when the center is small and the edge is vastly bigger and ever-growing. These cloud databases rely on coordination protocols that enable autonomous, loosely coupled nodes to jointly control basic behaviors such as leader election and distributed locking.

Unfortunately, the expense of coordination protocols can make them “forbidden fruit” for programmers. When it comes to high-performing, scalable distributed systems, coordination is a killer. It is the dominant term in the Universal Scalability Law. When we can avoid or reduce the need for coordination, things tend to get simpler and faster. See, for example, Coordination Avoidance in Database Systems.
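The Universal Scalability Law models throughput as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ captures contention and κ captures pairwise coordination (crosstalk). The κ term grows quadratically with the number of nodes N, which is why coordination dominates at scale. A minimal Python sketch, with illustrative (not measured) coefficients:

    # Universal Scalability Law: the kappa (coordination) term grows
    # quadratically with N, eventually dominating throughput.
    def usl_throughput(n, lam=1000.0, sigma=0.05, kappa=0.001):
        """Throughput of an n-node system under the USL.

        lam   -- throughput of a single node (requests/sec)
        sigma -- contention penalty (serialized fraction of work)
        kappa -- coordination/crosstalk penalty (pairwise communication)
        """
        return (lam * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1))

    for n in (1, 8, 32, 128, 512):
        print(f"{n:>4} nodes -> {usl_throughput(n):>10.0f} req/s")

With κ > 0, throughput peaks (here around 30 nodes) and then declines as more nodes are added; with κ = 0, i.e., no coordination, it merely flattens out.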

The following is a snapshot of round-trip latencies between different parts of the world. Observed latencies will likely be much higher due to load balancers, routers, and gateways along the way.

[Figure: round-trip network latencies between regions around the world]

Coordination requires cross-communication between multiple regions, incurring latencies like those above before a response is returned to the user.

The issue is not that coordination is tricky to implement, though that is true. The main problem is that coordination dramatically slows down computation, or stops it altogether. 

James Hamilton of Amazon Web Services summarizes this point well (he uses the phrase “consistency mechanisms” where we say coordination): “The first principle of high scalability is to batter the coordination mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.”

So when the edge becomes the center, we need databases that are coordination-free, always available, and low-latency, and that are causally consistent with convergent conflict resolution, enabling them to scale to a large number of edge locations.

This simplifies application architectures for the edge and removes the bottlenecks created by databases designed for the cloud.

Coordination-free databases for the Edge

With the database at the edge, the data is at the edge. There is never a need for the user to wait for a round trip across the globe to complete. All operations can be handled by reading from and writing to the database in the local edge PoP, while data synchronization and convergence happen in the background. This enables applications to respond instantaneously to user and device inputs, never needing to show a spinner while you wait.
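As a rough sketch of this pattern (the names EdgeStore and Peer are illustrative, not any particular product's API), writes are acknowledged from the local PoP immediately, while replication to peer PoPs happens asynchronously in the background:

    # Hypothetical local-first store: local reads/writes, background sync.
    import queue
    import threading

    class Peer:
        """Stand-in for a remote PoP; a real system ships changes over the network."""
        def apply(self, change):
            pass

    class EdgeStore:
        def __init__(self, peers):
            self.data = {}                  # local state: key -> value
            self.outbox = queue.Queue()     # changes awaiting replication
            self.peers = peers
            threading.Thread(target=self._replicate, daemon=True).start()

        def write(self, key, value):
            self.data[key] = value          # local write: no cross-region round trip
            self.outbox.put((key, value))   # queue the change for background sync
            return "ok"                     # acknowledge the user immediately

        def read(self, key):
            return self.data.get(key)       # served entirely from the local PoP

        def _replicate(self):
            while True:                     # background synchronization loop
                change = self.outbox.get()
                for peer in self.peers:
                    peer.apply(change)

The write path never leaves the PoP; making the background convergence safe is the hard part, and that is what the rest of this post is about.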

For example, consider the analogy of driving on a highway during rush hour. If each car drove forward independently in its lane at the speed limit, everything would be fine: the capacity of the highway could be fully exploited. But unfortunately, you will always find drivers who have somewhere to go other than straight ahead!

To prevent two cars from being in the same place at the same time, we drivers engage in various forms of coordination when entering traffic, changing lanes, coming to intersections, and so on. We stick to formal protocols like traffic lights and stop signs, and we frequently engage in ad hoc coordination with neighboring cars using turn signals, eye contact, and the familiar but subtle dance of driving more or less aggressively.

All these mechanisms have one thing in common: they slow us down when traffic is crowded. Worse, these slowdowns propagate back to the drivers behind us, and queuing effects amplify the problems. In the end, rush hour on the highway is a nightmare, operating far below the highway's capacity.

The analogy applies fairly directly to edge infrastructure. In principle, the database at each edge PoP could proceed autonomously through its ordered list of requests, making progress as quickly as possible. But to avoid conflicts on shared state (akin to two cars being in the same place at the same time), cloud-native databases employ coordination protocols to stay “safe”. The effect of these protocols is to make one or more databases at edge PoPs wait idly until some other edge location signals that it is done.

In many cases, coordination is not a necessity; it is the side effect of a design decision made in the underlying database.

For example, consider stop lights: they allow drivers to mediate access to a shared intersection by following a waiting protocol. Stop light delays can be avoided entirely by another approach, i.e., an overpass or tunnel that removes the intersection. In short, there is no need to employ coordination via stop lights; they are just one engineering solution to a problem, with a particular tradeoff between the cost of initial implementation and the resulting throughput.

So the question becomes: what family of problems can be computed consistently in a distributed fashion without coordination, and what problems lie outside that family?


So when do we need coordination?

To answer this question, we need to take a small detour into the concept of monotonicity. Monotony means dull, boring, never a surprise or variation. Monotonic systems are similar: they never reverse course. With monotonicity, once we learn that something is true, no information arriving later can refute that fact.

Monotonic programs are “safe” in the face of missing information and can proceed without coordination. A good example is the question “Does there exist a deadlock?” A single positive example gives us an answer in the affirmative, and additional information never changes that fact.
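As a concrete sketch, here is deadlock detection phrased monotonically over a grow-only set of waits-for edges; once a cycle appears, no later edge can make the answer false:

    # Monotonic query over a grow-only waits-for graph: "does a deadlock exist?"
    edges = set()   # grow-only set of (waiter, holder) edges

    def add_edge(waiter, holder):
        edges.add((waiter, holder))

    def deadlock_exists():
        # Transitive closure of the waits-for graph, itself a monotonic computation.
        reach = set(edges)
        changed = True
        while changed:
            changed = False
            for (a, b) in list(reach):
                for (c, d) in list(reach):
                    if b == c and (a, d) not in reach:
                        reach.add((a, d))
                        changed = True
        return any(a == b for (a, b) in reach)  # is anyone transitively waiting on itself?

    add_edge("t1", "t2")
    add_edge("t2", "t1")
    print(deadlock_exists())  # True, and no future edge can flip it back to False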

Non-monotonic programs, on the other hand, must be concerned that the truth of a property could change in the face of new information. A good example is garbage collection: the system cannot proceed until a node knows that all information has arrived, which requires the nodes to coordinate. Additionally, non-monotonic programs are order-sensitive, i.e., the order in which they receive information determines how they toggle state back and forth, which in turn determines their final state.
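A tiny sketch makes this order sensitivity visible: two replicas apply the same two operations in different orders and end in different states, so without coordination they diverge.

    # Non-monotonic program: retraction makes the outcome order-dependent.
    def apply_ops(ops):
        friends = set()
        for op, name in ops:
            if op == "friend":
                friends.add(name)       # monotonic: adds knowledge
            else:
                friends.discard(name)   # non-monotonic: retracts earlier knowledge
        return friends

    replica_a = apply_ops([("friend", "bob"), ("unfriend", "bob")])
    replica_b = apply_ops([("unfriend", "bob"), ("friend", "bob")])
    print(replica_a, replica_b)  # set() vs {'bob'}: same inputs, divergent states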

By contrast, monotonic programs simply accumulate beliefs; their output depends only on the content of their input, not the order in which it arrives. In short:

A program has a consistent, coordination-free distributed implementation if and only if it is monotonic. Otherwise the program requires coordination.

You can read more about this here - Keeping CALM: when distributed consistency is easy.


Consistency, Availability & Convergence

An edge-native database has to support many PoPs (locations), characterized by wide-area deployments across tens of data centers. Such a geo-distributed database needs to satisfy multiple competing goals:

  • availability to ensure all operations issued to the database complete successfully
  • low latency to ensure near-instantaneous responses
  • partition tolerance to provide an “always on” user experience 
  • scalability to adapt to increasing load and storage demands
  • and a sufficiently strong consistency model to simplify programming and provide users with the system behavior that they expect.

An ideal database would provide linearizability, i.e., strong consistency: as soon as a client completes a write to an object in one data center, reads of the same object in all other data centers reflect the newly written state. Linearizability simplifies programming, and users experience the database behavior they expect.

But per the CAP theorem, when the network partitions (which is unavoidable), it is impossible to provide both strong consistency (everybody sees the same updates in the same order) and availability (the application can always read and write its data). Either the service is unavailable at times and users are unhappy, or data diverges at times and developers are in trouble. So here's the conundrum:

We need availability, but we need consistency. Is there a way out?

We all know that Eventual Consistency is often too weak: it can be unintuitive and hard to work with. On the other hand, most problems do not need Linearizability (which is what the CAP theorem argument is really all about).

There is a big grey zone between these two extremes of Linearizability and Eventual Consistency, and it is in this grey zone that most of the interesting use cases and applications reside.

Given that we must sacrifice strong consistency (i.e., Linearizability) for availability, the strongest consistency model achievable without sacrificing availability is Causal+ Consistency.

Causal+ consistency ensures that the database respects the causal dependencies between operations, with convergent conflict resolution.

Consider a social network where you can view photos of your friends only. Alice has a photo that she doesn't want Bob to see, so she first unfriends Bob, then posts the photo. However, in many databases, the photo may arrive at Bob's replica before the unfriend, and Bob sees the photo even though the security policy says otherwise. To avoid this anomaly, the database should guarantee causal consistency. Under causal consistency, if a process can observe an update, it can also observe all preceding updates. Thus, any replica that receives the photo would also observe that Alice and Bob are not friends.

In short, programmers never have to deal with the situation where they can see the photo but not the unfriend request, unlike in systems with weaker guarantees such as eventual consistency.
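One common way to enforce this (a sketch for illustration, not any particular database's wire format) is to tag each update with a vector clock and have a replica buffer an incoming update until everything that causally precedes it has been applied:

    # Causal delivery check: apply an update only if this replica has already
    # applied every update the sender had seen when it issued this one.
    def ready(update_clock, origin, local_clock):
        for node, count in update_clock.items():
            expected = count - 1 if node == origin else count
            if local_clock.get(node, 0) < expected:
                return False
        return True

    # Alice's replica: op 1 = unfriend Bob, op 2 = post the photo.
    photo_update = {"clock": {"alice": 2, "bob": 0}, "origin": "alice"}

    replica = {"alice": 1, "bob": 0}   # has already applied the unfriend
    print(ready(photo_update["clock"], photo_update["origin"], replica))  # True

    replica = {"alice": 0, "bob": 0}   # has NOT yet seen the unfriend
    print(ready(photo_update["clock"], photo_update["origin"], replica))  # False: buffer it

Under this rule, Bob's replica holds the photo back until the unfriend arrives, which is exactly the guarantee described above.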

Another competing goal edge-native databases have to address is always-on availability in the presence of partitions.

Executing operations without coordination between replicas under partitions can create conflicts that must be resolved. CRDTs, or Conflict-free Replicated Data Types, tackle that problem with a systematic, theoretically proven approach based on simple mathematical rules, providing automatic reconciliation. A CRDT is a specially designed data structure used to achieve monotonicity (absence of rollbacks) and strong eventual consistency (SEC). You can read more about CRDTs here - A comprehensive study of CRDTs.

Given two or more replicas of a CRDT, you can merge them in any order, with any precedence, any number of times, and the results will always converge to the same state. This is a very useful trick for managing consistency in distributed systems.
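The grow-only counter (G-Counter) is the classic minimal example: merge takes a per-node maximum, which is commutative, associative, and idempotent, so replicas converge regardless of merge order or repetition. A small sketch:

    # State-based G-Counter CRDT: merge is a per-node max, so it is
    # commutative, associative, and idempotent.
    class GCounter:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}  # node_id -> that node's local increment count

        def increment(self):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

        def merge(self, other):
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)

        def value(self):
            return sum(self.counts.values())

    a, b = GCounter("pop-a"), GCounter("pop-b")
    a.increment(); a.increment()          # two writes at PoP A
    b.increment()                         # one concurrent write at PoP B

    a.merge(b); b.merge(a); a.merge(b)    # any order, any number of times
    print(a.value(), b.value())           # 3 3: both replicas converge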

By combining causal consistency and convergent conflict handling, i.e., causal+ consistency, the database can ensure that clients see a causally correct, conflict-free, and always-progressing database at each edge location. This simplifies the programmer's life: they experience the database behavior they expect, whereas eventually consistent systems may expose versions out of order.

We have barely scratched the surface, but this post is already longer than I hoped. We will probably need a few more blog posts down the line to cover other aspects I have in mind, like scenarios that require linearizability and why databases should natively support hybrid consistency levels.

Another area is garbage collection (GC) of operation logs in CRDTs, and doing GC in a coordinated vs. coordination-free manner. Yet another consideration is that workloads at the edge are mostly event-driven, so I believe edge database architectures should support streams and event processing natively, providing a single paradigm rather than requiring yet another product in the mix.

Anyway, the main takeaway of this post is:

Cloud-native databases that rely on coordination protocols are not a good fit for the infrastructure edge, which is characterized by wide-area deployments across tens of data centers. Instead, the infrastructure edge is best served by coordination-free database architectures providing just-right consistency: as fast as possible, and consistent when necessary.


Macrometa Fast Data Platform

And finally, a sales pitch :-) The Macrometa Fast Data Platform addresses the above (with some variation during failover), plus a few additional things not covered in this post, like support for hybrid consistency models, multiple data models, real-time change feeds, global and local streams to handle event-driven workloads, Pub/Sub, and intelligent global routing.

Give it a try with a free developer account to explore and decide for yourself. 


LOVE IT, HATE IT? LET US KNOW

Durga