We talked about how data loss will always happen, but it’s not always clear to everyone when and where. Let’s take a fairly simple scenario as an example. Below is a minimal service: a set of web servers running behind a load balancer. Every web server ships its logs via FileBeat. The logs are passed to a Logstash service, which applies filtering. The filtered logs are then written to ElasticSearch for log search, visualization, and monitoring. Since all web servers are configured as stateless, this architecture provides both high availability and scalability. The topology is illustrated below.
+----------+ +-----------------------+
| +----+ Web Server (FileBeat) +----+
| | +-----------------------+ |
| | |
| | +-----------------------+ | +----------+ +---------------+
| Load +----+ Web Server (FileBeat) +----+--+ Logstash +---+ ElasticSearch |
| Balancer | +-----------------------+ | +----------+ +---------------+
| | |
| | +-----------------------+ |
| +----+ Web Server (FileBeat) +----+
+----------+ +-----------------------+
Readers familiar with FileBeat, Logstash, and ElasticSearch probably know that these services already provide mechanisms to prevent data loss. For example, FileBeat remembers the offset of its last read. When a machine needs to reboot for security updates, FileBeat can resume reading from the last recorded offset after the restart. Services like ElasticSearch go further, with more powerful approaches such as multiple replicas to avoid data loss. Shouldn’t that be sufficient to ensure no data is lost? Unfortunately, there are still many challenges.
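To make the offset idea concrete, here is a minimal sketch of checkpoint-and-resume log shipping, in the spirit of what FileBeat does with its registry. The file names and logic are hypothetical illustrations, not FileBeat’s actual implementation.

```python
import os

OFFSET_FILE = "shipper.offset"  # hypothetical checkpoint file


def load_offset():
    """Return the last persisted offset, or 0 on first run."""
    if os.path.exists(OFFSET_FILE):
        with open(OFFSET_FILE) as f:
            return int(f.read().strip() or 0)
    return 0


def save_offset(offset):
    """Persist the offset so a restart can resume from here."""
    with open(OFFSET_FILE, "w") as f:
        f.write(str(offset))


def ship_new_lines(log_path):
    """Read and 'ship' only the lines appended since the last run."""
    offset = load_offset()
    shipped = []
    with open(log_path) as f:
        f.seek(offset)
        for line in f:
            shipped.append(line.rstrip("\n"))  # stand-in for sending downstream
        save_offset(f.tell())
    return shipped
```

After a restart, `ship_new_lines` picks up exactly where the previous run stopped. Note that this only protects against process restarts; if the disk holding `shipper.offset` or the log itself fails, the checkpoint cannot help, which is exactly the first challenge below.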
The first challenge is hardware failure. Most of the approaches above implicitly assume that local disk storage is reliable. However, in many cases storage can fail, due to either hardware or software problems. For example, an old hard disk can become completely unreadable after a write failure, or a machine may fail to power on again because an unexpected power outage burned its mainboard. These failures can happen to both virtual and physical machines. In such cases, the data stored on those machines is gone, and it is almost impossible to get back. Or, if you are lucky, your data center may (rarely) have hardware engineers on hand to attempt a rescue.
The second challenge is the inability to recover. One important fact in the data world is that we should never expect data to be recoverable if it depends on user actions. Suppose an online advertising service introduces a bug that returns advertiser A’s ads in response to requests for advertiser B’s. Some users then view those ads and click on them. The buggy build runs in production for 12 hours on a business day morning before developers notice it is incorrect. Although they quickly engage and fix the bug, the clicks made during those 12 hours were attributed to advertiser A anyway. We cannot bring back revenue for advertiser B by asking end users to come back and click the corrected ads again, so there is no way for us to compute a correct click count for advertiser B.
A third challenge, though technically it shouldn’t be called “loss”, is delay. In a time-sensitive system, some data may be considered ineligible simply because it arrives too late. For example, an online advertising system may treat a click that arrives too late as “non-billable”, which waives the cost for the advertiser. In the real world, we sometimes see a server go down and then fail to start again for a long time. If the business it serves is time-sensitive, the data it eventually delivers may be considered useless by the time the server is back.
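The “non-billable” rule above amounts to a simple window check at ingestion time. Here is a hedged sketch; the 24-hour billable window is a hypothetical value, not taken from any real advertising system.

```python
from datetime import datetime, timedelta

# Hypothetical policy: a click is billable only if it is received
# within 24 hours of when it happened.
BILLABLE_WINDOW = timedelta(hours=24)


def is_billable(click_time: datetime, received_time: datetime) -> bool:
    """Return True if the click arrived inside the billable window."""
    return received_time - click_time <= BILLABLE_WINDOW
```

A click delayed by a long server outage fails this check: the data eventually arrives, but it no longer counts, which is why delay behaves like loss from the business’s point of view.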
As you might expect, developers face far more complicated pipeline topologies in many real-world scenarios. Data loss can happen in many places in a system and cannot always be avoided. Please do not get me wrong: I’m not saying such loss is acceptable. The point is to discuss how a practical system should be designed so that data loss is discoverable and measurable.
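One simple way to make loss measurable is to compare a record count taken at an upstream stage against the count observed downstream. The sketch below illustrates the idea; the stage names and the 99.9% completeness threshold are hypothetical examples, not a recommendation.

```python
# Hypothetical threshold: flag a stage when fewer than 99.9% of the
# records sent upstream are observed downstream.
COMPLETENESS_THRESHOLD = 0.999


def completeness(sent: int, received: int) -> float:
    """Fraction of upstream records that made it downstream."""
    if sent == 0:
        return 1.0
    return received / sent


def check_stage(name: str, sent: int, received: int) -> str:
    """Summarize one hop of the pipeline as OK or DATA LOSS."""
    ratio = completeness(sent, received)
    status = "OK" if ratio >= COMPLETENESS_THRESHOLD else "DATA LOSS"
    return f"{name}: {ratio:.4%} complete ({status})"
```

Counting on both sides of every hop does not prevent loss, but it turns an invisible failure into a number you can alert on, which is the design goal the rest of this series works toward.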
Next
Now we should have enough preparation: we know the concept of data completeness and the challenges around it. In the next few sections, let’s talk about approaches and techniques that help an online service system meet our data-friendly design goal.