Understand three kinds of data


It has been a while since I first tried to summarize the scattered notes from my five-year data engineering career. I was not a data scientist or an analyst; instead, I built data pipelines, from ingestion and transformation to cleansing. My goal was to make sure data could be transferred safely from the frontend to the offline store, converted from a raw format (usually text logs) into a self-explanatory data structure with reasonable data quality. It was painful when I started: I was bitten by pitfalls in the existing system and by my own mistakes. I also was not lucky enough to find books or articles about how to design a data system that avoids such pitfalls. So writing it down seems like a good idea.

Business Facts, Dimensions, and Join Keys

In my previous projects, I always categorized data into three kinds: Business Facts, Dimensions, and Join Keys. Let’s use a typical online advertising serving stack as an example. In a pay-per-click advertising stack, one of the key pieces of data is the click. A typical click event can be represented as a data structure with fields (or columns, if you prefer SQL terminology). The click object itself is a Business Fact. Once we know the amount to be paid for a click at a given location on a web page, the backend billing system can compute each advertiser’s bill with a simple formula: SUM(click.clickedAd.amount). The click is clearly business critical, because every click that goes missing, or arrives without its amount field, means lost revenue.
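To make the formula concrete, here is a minimal Python sketch of a click as a Business Fact. Only the idea of summing `click.clickedAd.amount` comes from the text above; the other field names (`click_id`, `account_id`, `ip_address`, `ad_id`) and the sample values are assumptions made up for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ClickedAd:
    ad_id: str       # hypothetical identifier of the ad that was clicked
    amount: Decimal  # what the advertiser pays for this click


@dataclass(frozen=True)
class Click:
    click_id: str     # hypothetical unique ID of the click event
    account_id: str   # advertiser account (a Join Key)
    ip_address: str   # source of geolocation (a Dimension)
    clicked_ad: ClickedAd


def total_bill(clicks):
    """Equivalent of SUM(click.clickedAd.amount) over a list of clicks."""
    return sum((c.clicked_ad.amount for c in clicks), Decimal("0"))


# Sample data, invented for this sketch.
clicks = [
    Click("c1", "acct-1", "203.0.113.7", ClickedAd("ad-9", Decimal("0.25"))),
    Click("c2", "acct-1", "198.51.100.4", ClickedAd("ad-9", Decimal("0.40"))),
]
print(total_bill(clicks))  # 0.65
```

Decimal is used instead of float here because the amounts are money; that is a choice of this sketch, not something the billing formula requires.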

Dimensions, on the other hand, are used to understand the data from different angles. A good example is the IP address of each click, which indicates the location of the person who clicked the ad. A backend advertising stack typically has a subsystem that converts the IP address into human-readable geolocation data such as country, state/province, and city. With click amounts and geolocation, a data analyst can see the distribution of clicks with a GROUP BY operation. The purpose of Dimensions is to enrich insights, which helps data scientists understand the behavior of people or of the system.
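The group-by step can be sketched in a few lines of Python. The lookup table `GEO_BY_IP` is a made-up stand-in for a real IP-to-geolocation subsystem, and the sample clicks and the "UNKNOWN" bucket are assumptions of this sketch.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical lookup table standing in for a real geolocation subsystem.
GEO_BY_IP = {
    "203.0.113.7": "US",
    "198.51.100.4": "CA",
    "192.0.2.55": "US",
}

# Sample clicks as (ip_address, amount) pairs, invented for this sketch.
clicks = [
    ("203.0.113.7", Decimal("0.25")),
    ("198.51.100.4", Decimal("0.40")),
    ("192.0.2.55", Decimal("0.10")),
    ("192.0.2.99", Decimal("0.30")),  # IP with no known geolocation
]


def amount_by_country(clicks, geo_by_ip):
    """Group click amounts by country, like GROUP BY country in SQL."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for ip, amount in clicks:
        # A missing Dimension does not drop the click; it lands in "UNKNOWN".
        country = geo_by_ip.get(ip, "UNKNOWN")
        totals[country] += amount
    return dict(totals)


print(amount_by_country(clicks, GEO_BY_IP))
```

Note that the click with an unresolvable IP still counts toward the total. Losing a Dimension costs some insight, but the Business Fact survives.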

Join Keys, as the name suggests, are used to bring together facts from different data sources. Take the advertiser’s account ID as an example. As a data analyst, I want to understand how the top advertisers, the ones who earn the most clicks, spend money on their bidding strategy around Valentine’s Day. Do they put most of their budget on Feb. 13, or do they increase their investment slightly day by day during the week before Feb. 14? Which approach is more successful? Usually, the bidding history is stored in a database tied to the advertiser administration panel, which is well protected in the backend so that it is not visible from the browser. In this case, the account ID acts as a bridge between the top clicks and the advertiser database: we perform an inner join on account ID. Like Dimensions, Join Keys do not directly impact the business. Their value is to bring more insight by putting multiple facts together.
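As a rough illustration of the inner join on account ID, here is a small Python sketch. The two data sets, their shapes, and all the values are invented for this example; a real pipeline would run the join in SQL or a data processing framework.

```python
# Click counts per advertiser account, e.g. aggregated from click logs.
# All values are made up for this sketch.
clicks_by_account = {"acct-1": 1200, "acct-2": 950, "acct-3": 15}

# Bidding history from the advertiser database: (account_id, day, budget).
bid_history = [
    ("acct-1", "02-13", 500),
    ("acct-1", "02-14", 100),
    ("acct-2", "02-13", 200),
    ("acct-4", "02-13", 50),  # account with no clicks on record
]


def inner_join(clicks_by_account, bid_history):
    """Keep only rows whose account ID appears in both data sources."""
    joined = []
    for account_id, day, budget in bid_history:
        if account_id in clicks_by_account:
            joined.append((account_id, clicks_by_account[account_id], day, budget))
    return joined


for row in inner_join(clicks_by_account, bid_history):
    print(row)
```

Accounts that appear on only one side (acct-3 and acct-4 here) drop out of the result, which is exactly why a corrupted or missing Join Key silently removes rows from the analysis.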

Why distinguish these kinds?

From a data engineering point of view, the three kinds of data have different requirements, which lead to different designs, including but not limited to value range, validation, and debuggability support. The purpose is to make sure we can still interpret our data back to facts when some of it is corrupted or lost.

People may not agree with me. I have heard arguments that I am heading in the wrong direction; instead, they want to talk about how to reach 100% data delivery with no data loss or corruption. Unfortunately, that is not what I have seen. Data can be lost in any place that is out of our control: an unexpected power outage in a data center may write bad data to disk; a developer’s bug can cause wrong data to be written to the log. We can, and we should, keep improving our engineering processes and systems toward a bug-free state. However, we always need fallback protection for the worst case.

Next

Designing a data pipeline requires a mix of data knowledge and system knowledge. Here I described three different kinds of data, but there are two more knowledge areas to cover before we talk about design principles. I will cover them in the next few posts.