For a while now, I have been trying to summarize the scattered notes from my five-year data engineering career. I was not a data scientist or analyst; instead, I built data pipelines, from ingestion and transformation to cleansing. My goal was to make sure data could be safely transferred from the frontend to the offline store, converted from a raw format (usually text logs) into a self-explanatory data structure with reasonable data quality. It was painful when I started: I was bitten by pitfalls in existing systems and by my own mistakes. I was also not lucky enough to find books or articles on how to design a data system that avoids such pitfalls. So writing it down seemed like a good idea.
Business Facts, Dimensions, and Join Keys
In my previous projects, I always categorized data into three kinds. I call them Business Facts, Dimensions, and Join Keys.
Let’s use a typical online advertising serving stack as an example. In a pay-per-click advertising stack, one of the key pieces of data is the click. A typical click event can be represented as a data structure with fields (or columns, if you prefer terminology from the SQL world). A click object itself is considered a Business Fact. Once we know the payment amount for a given location on the web page, the backend billing system can compute our advertisers' bills with a simple formula: SUM(click.clickedAd.amount). The click is certainly business critical, because every missing value of such a field means a loss of revenue.
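As a rough sketch of the click structure and the billing formula above: the field names below (`account_id`, `ip_address`, `ad_id`) are hypothetical, and only the nested ad amount comes from the formula itself, written here as `clicked_ad.amount` in Python naming style.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ClickedAd:
    ad_id: str       # hypothetical identifier of the clicked ad
    amount: Decimal  # price charged for this click (the Business Fact)


@dataclass(frozen=True)
class Click:
    account_id: str  # hypothetical: advertiser account that owns the ad
    ip_address: str  # hypothetical: address the click came from
    clicked_ad: ClickedAd


def bill(clicks):
    """Equivalent of SUM(click.clickedAd.amount) over a batch of clicks."""
    return sum((c.clicked_ad.amount for c in clicks), Decimal("0"))


clicks = [
    Click("acct-1", "203.0.113.7", ClickedAd("ad-9", Decimal("0.25"))),
    Click("acct-1", "198.51.100.4", ClickedAd("ad-9", Decimal("0.40"))),
]
total = bill(clicks)
```

Decimal is used instead of float so that summed amounts do not pick up rounding error; whether a real billing system stores amounts this way is an assumption.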
Dimensions, on the other hand, are used to understand data from different aspects. A good example is the IP address of each click, which indicates the location of the person who clicked the ad. A backend advertising stack typically has a sub-system to convert the IP address into human-readable geolocation data such as country, state/province, and city. When we have click amounts and geolocation, a data analyst can see the distribution of clicks with a group-by operation. The purpose of Dimensions, then, is to enrich insights, which usually helps data scientists understand the behavior of people or systems.
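A minimal sketch of that group-by, assuming a hypothetical in-memory lookup table (`GEO_BY_IP`) in place of the real IP-to-geolocation sub-system:

```python
from collections import Counter

# Hypothetical lookup table standing in for the IP-to-geolocation sub-system.
GEO_BY_IP = {
    "203.0.113.7": "US",
    "198.51.100.4": "CA",
    "192.0.2.10": "US",
}


def country_of(ip_address):
    """Return the country for an IP, or "UNKNOWN" when the lookup fails."""
    return GEO_BY_IP.get(ip_address, "UNKNOWN")


def clicks_by_country(click_ips):
    """Roughly SELECT country, COUNT(*) ... GROUP BY country."""
    return Counter(country_of(ip) for ip in click_ips)


click_ips = ["203.0.113.7", "198.51.100.4", "192.0.2.10", "10.0.0.1"]
distribution = clicks_by_country(click_ips)
```

Note that a click whose IP cannot be resolved still counts as a click; only the Dimension degrades to "UNKNOWN".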
Join Keys, as the name suggests, are used to bring together facts from different data sources. Take the advertiser’s account ID as an example. As a data analyst, I want to understand how the top advertiser, the one who earns the most clicks, spends money on their bidding strategy around Valentine’s Day. Do they put most of their budget on Feb. 13, or do they slightly increase their investment day by day during the week before Feb. 14? Which approach is more successful? Usually, the bidding history is stored in a database associated with the advertiser administrative panel, which is deeply protected in the backend so that it is invisible from the browser. In this case, the account ID acts as a bridge between the top clicks and the advertiser database, through an inner join on account ID. Like Dimensions, Join Keys do not directly impact the business. Their value is to bring more insight by putting multiple facts together.
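The join described above might look like the following sketch. The account IDs, dates, and budget figures are made up for illustration, and `bidding_history` is a hypothetical stand-in for the advertiser database.

```python
from collections import Counter


def top_accounts(click_account_ids, n):
    """Account IDs with the most clicks, most clicked first."""
    return [acct for acct, _ in Counter(click_account_ids).most_common(n)]


def inner_join_on_account(account_ids, bidding_history):
    """Keep only accounts present on both sides, as an inner join would."""
    return {a: bidding_history[a] for a in account_ids if a in bidding_history}


# One account ID per click event (illustrative data).
click_account_ids = ["acct-1", "acct-2", "acct-1", "acct-3", "acct-1", "acct-2"]

# Hypothetical bidding history: account ID -> list of (date, daily budget).
bidding_history = {
    "acct-1": [("02-12", 100), ("02-13", 500)],
    "acct-3": [("02-13", 50)],
}

joined = inner_join_on_account(top_accounts(click_account_ids, 2), bidding_history)
```

Here `acct-2` is a top account with no bidding history, so the inner join drops it. That silent drop is exactly what happens when a Join Key is missing or corrupted on either side.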
What’s the point of distinguishing these kinds?
From a data engineering point of view, the three kinds of data have different requirements, which lead to different designs, including but not limited to value range, validation, and debuggability support. The purpose is to keep our data interpretable as facts even when some of it is corrupted or lost.
People may not agree with me. I have heard arguments that I am heading in the wrong direction, and that we should instead talk about how to reach 100% data delivery without data loss or corruption. Unfortunately, that is not what I see. Data can be lost anywhere that is out of our control: an unexpected power outage in a data center may write bad data to the hard disk; a developer's bug can cause wrong data to be written to the log. We can, and we should, keep improving our engineering processes and systems to get closer to bug-free. However, we always need fallback protection for the worst case.
Next
Designing a data pipeline requires a mix of data knowledge and system knowledge. Here I described three different kinds of data, yet there are two more knowledge areas we need to cover before we talk about design principles. I will cover them in the next few posts.