Building A Data-Driven Culture: Technical Foundations

Data-driven culture, like any culture, is a set of shared and reinforced behaviors. The principal behavior of a data-driven culture is making decisions based on data. The prerequisite is a technical infrastructure that provides access to relevant data for every role and at every level of an organization. A successful infrastructure allows people to access the data they need, understand what they're looking at, and trust what they find. Without this foundation, there is little hope of creating a culture in which data is a primary guide.

The development of a data-driven culture has both technical and cultural components. These are related, of course, but they come alive through different means. The rest of this article focuses on my experience building the technical foundations required to support data-driven cultures. In a later article, I'll discuss the behavioral components of data-driven culture, and where I've succeeded and struggled with these. 

A data-driven culture requires data that is both trustworthy and accessible. There's nothing complicated about these concepts. The difficulty is entirely in the execution. Datasets, like the organizations that create them, are complex and messy and constantly evolving. 

Trustworthy Data 

Whenever I think about culture, I inevitably find myself returning to the importance of trust. Data-driven culture is no exception. In this case, it's the data itself that must earn the trust of its consumers. I've sat through so many meetings kicked off with a chart—a market breakdown, maybe, or a burndown, or the results of an A/B test—that yield nothing but arguments about whether the chart is correct, rather than discussions about what its message is. This is the consequence of data that is not perceived as trustworthy. A shared belief in the facts is critical for using data meaningfully in decision making. Without trustworthy data, discussions will remain speculation-driven as people argue about whether and why the data can or should be believed. (See: #fakenews.) 

What makes data trustworthy? In my experience, trustworthy data is clean, documented, transparent, consistent, and timely.


Clean data is data that a newcomer can understand and process without help. This means that tables and columns have meaningful names like "Customer" and "DateOfBirth." No "Test1" or "NewTitle" allowed! Irrelevant columns have been deleted. Arbitrarily-chosen placeholders used to indicate maximum or unknown values, like –1 or 999 in an "Age" field, have been converted into proper nulls. (Every engineer knows that such placeholder values are a terrible idea, and yet some version of these have appeared in every dataset I've ever explored. Theory, meet practice.) Importantly, a clean data set can—and generally will—still be incomplete, skewed, or otherwise infuriating. Cleanliness refers only to the fact that the layout, labels, and values can be easily interpreted. 
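As a minimal sketch of the placeholder cleanup described above, the following Python replaces sentinel values with proper nulls. The field names, the sentinel table, and the `clean_record` helper are all hypothetical, chosen only for illustration; a real dataset will have its own list of offenders.

```python
from typing import Any, Optional

# Hypothetical sentinel values per field; real datasets need their own list.
SENTINELS: dict[str, set[Any]] = {
    "Age": {-1, 999},
}


def clean_record(record: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Return a copy of the record with sentinel placeholders replaced by None."""
    cleaned = {}
    for field, value in record.items():
        if value in SENTINELS.get(field, set()):
            cleaned[field] = None  # a proper null instead of a magic number
        else:
            cleaned[field] = value
    return cleaned


rows = [
    {"Customer": "Ada", "Age": 36},
    {"Customer": "Grace", "Age": 999},
    {"Customer": "Linus", "Age": -1},
]
cleaned_rows = [clean_record(row) for row in rows]
```

Keeping the sentinel list in one declared place means that the "filter out 999" rule lives in the cleanup step and not in every analyst's head.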


Data cleaning requires empathy. When I was a data-focused software engineer, straight out of school, I was appalled at the level of disorder in the databases I worked with. My colleagues were very sharp, but clearly they were cave people when it came to data. Fast-forward to a year later, when enough time had passed for me to have contributed meaningfully to the evolution of our databases. That "TmpName" column? Yeah, that was mine, and it was being used in production. And, just as I could explain the unfortunate but independently-reasonable series of decisions that led us to "TmpName," so there was an unfortunate but piecewise-reasonable story behind every other dirty quirk. Now, when I struggle with unclean datasets, I remind myself that disorder comes far more often from the pressure of scarce resources than from stupidity or malice. This helps with the delicate work of enlisting data owners in the cleanup of the datasets they've created. 


No matter how clean, data is never entirely self-explanatory. Time-valued fields require time zone annotations (GMT? Time zone of company HQ? Local time of a mobile transaction?). Fields like "Price" need to be documented with units and currency. Broad labels like "Price" also need some context: Is it the list price? The price that the customer actually paid? Which related field lists the number of units or months of service that were purchased for this price, and the type(s) of product sold? Documentation for fields that are computed using business logic, such as a binary "IsActive" flag on a user account, should include a description of the business logic used to determine the flag's value. And so on.
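One lightweight way to capture this kind of documentation is a data dictionary that lives next to the data. The sketch below is only an illustration: the `FieldDoc` structure, the field names, and the "30 days" business rule are assumptions of mine, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDoc:
    """Documentation for one column; every attribute here is illustrative."""
    name: str
    description: str
    unit: Optional[str] = None            # e.g. a currency code
    time_zone: Optional[str] = None       # only for time-valued fields
    business_logic: Optional[str] = None  # how a computed field is derived


DATA_DICTIONARY = {
    doc.name: doc
    for doc in [
        FieldDoc("Price", "Price the customer actually paid, per unit", unit="USD"),
        FieldDoc("PurchasedAt", "When the transaction completed", time_zone="UTC"),
        FieldDoc(
            "IsActive",
            "Whether the account counts as active",
            business_logic="Hypothetical rule: logged in within the last 30 days",
        ),
    ]
}


def describe(field_name: str) -> str:
    """Render one field's documentation as a single readable line."""
    doc = DATA_DICTIONARY[field_name]
    details = [doc.description]
    if doc.unit:
        details.append(f"unit: {doc.unit}")
    if doc.time_zone:
        details.append(f"time zone: {doc.time_zone}")
    if doc.business_logic:
        details.append(f"logic: {doc.business_logic}")
    return f"{doc.name}: " + "; ".join(details)
```

Calling `describe("Price")` returns a one-line answer to the question a stakeholder would otherwise have to ask the data owner.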


Everybody hates documentation. Few people enjoy writing it, and even fewer people enjoy reading it. Nobody at all enjoys maintaining it. But, without documentation, the best-case scenario is that stakeholders will constantly barrage the raw data owners with questions, effectively turning them into gatekeepers. In the worst case, stakeholders will skip the questions and make assumptions. These assumptions will be made in earnest, but they will inevitably be incorrect. Nobody wants to read documentation, but they will do it if it leads them to the answers they need. 


Stakeholders must be able to understand where each data item comes from. Any transforms applied between the raw data and the reported numbers must be straightforward and documented. Salespeople should understand how the sales they record in their CRM flow into databases and affect the KPIs on dashboards. Product managers should understand how product usage is captured and how each user action rolls up to feature- or product-level metrics.


Transparency is the key to building stakeholder trust. There is no replacement for letting stakeholders exercise their own judgement. One of the first things I do to understand a dataset is see whether/how a truth that I already know is accurately described in the data. If the data is product instrumentation, for example, I'll use the product and watch the instrumentation stream until I understand exactly how each of my actions in the product is reflected in the data. If I can't understand the connection, I'm not going to be able to pull any meaning out of the data. Similarly, when I build analytics solutions, I love seeing my data-savvy stakeholders perform similar tests. Not only is this a fast track for me to earn their trust, but I've found that it also builds excitement and creates advocates.  


Two people asking the same question, with the same intent, should get the same answer. This is obvious, but practical challenges arise from siloed and/or replicated datasets and from different or evolving organizational definitions of the same concepts. For example, sales and engineering teams might differ in their definitions of an "active" customer. The sales team may refine its definition over time, complicating year-over-year comparisons of active customer counts. Or, customer data may be reflected in multiple databases (one owned by finance and one owned by the engineering team, for example), which can lead to inconsistencies because each database was designed for a different purpose.


Consistency can be a tricky problem to address. No data landscape on an interesting scale will ever be 100% self-consistent. The right data organization can help minimize inconsistencies, though. For example, commonly-used concepts like "IsActive" can be codified directly in the database. This leaves no room for misinterpretation, and it also eliminates the possibility of human error. When definitions change, the old versions should persist in the data, clearly marked with the time at which they were deprecated to facilitate historical analyses. Tableau-like exploration tools, too, can help enforce consistency by codifying and exposing a single interpretation of a dataset. Sometimes less or limited data is the right answer. 
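To make the idea of codified, versioned definitions concrete, here is a minimal Python sketch. The `ActiveDefinition` structure, the dates, and the login-based thresholds are all hypothetical; in practice this logic would more likely live in the database itself, as described above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class ActiveDefinition:
    """One version of the hypothetical 'active customer' rule."""
    version: int
    max_days_since_login: int
    effective_from: date
    deprecated_on: Optional[date] = None  # None means still current


# Hypothetical history: the rule was tightened from 90 days to 30 days.
DEFINITIONS = [
    ActiveDefinition(1, 90, date(2020, 1, 1), deprecated_on=date(2022, 1, 1)),
    ActiveDefinition(2, 30, date(2022, 1, 1)),
]


def definition_as_of(day: date) -> ActiveDefinition:
    """Find the definition that was in force on the given day."""
    for rule in DEFINITIONS:
        ended = rule.deprecated_on is not None and day >= rule.deprecated_on
        if rule.effective_from <= day and not ended:
            return rule
    raise ValueError(f"no definition of 'active' covers {day}")


def is_active(last_login: date, as_of: date) -> bool:
    """Apply whichever definition was current on the as_of date."""
    rule = definition_as_of(as_of)
    return (as_of - last_login) <= timedelta(days=rule.max_days_since_login)
```

Because deprecated versions stay in the list with their dates, a historical analysis can apply the rule that was in force at the time, or deliberately reapply the current rule to old data.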


The most perfectly-organized, transparent, and consistent data is still useless if it's out of date. If stakeholders can't get timely information—the threshold for freshness varies with the type of data—then they'll go right back to asking developers to pull the freshest data on an ad-hoc basis. Developers will become gatekeepers. Data-centric discussions will be undermined by a lack of trust in the relevance of the facts. 
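Since the threshold for freshness varies with the type of data, one option is to declare it per dataset and check it automatically. This is a rough sketch under assumed names: the dataset labels and threshold values are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness thresholds; the right values vary with the type of data.
FRESHNESS_THRESHOLDS = {
    "user_accounts": timedelta(minutes=15),
    "monthly_revenue": timedelta(days=1),
}


def is_stale(dataset: str, last_loaded_at: datetime, now: datetime) -> bool:
    """Return True when the newest load is older than the dataset's threshold."""
    return now - last_loaded_at > FRESHNESS_THRESHOLDS[dataset]


# Example: data that was last loaded at midnight, checked at 9am the same day.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
midnight_load = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)

stale_accounts = is_stale("user_accounts", midnight_load, now)
stale_revenue = is_stale("monthly_revenue", midnight_load, now)
```

Under these assumed thresholds, the nightly load is too old for account data but perfectly acceptable for a monthly revenue figure.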


I've experienced this firsthand on a product whose user data didn't flow into a place that I could access until midnight of each day (there were some unfortunate org-chart boundaries in the way). Any changes to a user account, whether made at 10am or 3pm or 11:59pm, were invisible to me until I arrived at work the following morning. When I undertook a product update that automatically changed a user setting, I had to do it in relative blindness. During the testing and deployment of those updates, I constantly badgered the developers who had access to the live data that I needed. I took up a lot of their time. They hated it. I hated it. But we all kept at it, because we understood that timely data was the only way to get trusted feedback about the effects of the product updates. Later, we drew on this experience to eliminate the impediments related to the org chart and broaden access to the live data. 


Creating trustworthy datasets out of complex and nuanced raw data is one of the key responsibilities of anyone building a data-driven culture. It is an ongoing process, since solutions must evolve with the underlying data and systems. Sometimes the process starts and ends with the relatively straightforward task of cleaning up a few databases. Other times, as has more often been the case in my experience, the answer includes some kind of curated reporting layer. The best solution depends entirely on the context. This is what makes data design so much fun for data nerds! 

Self-Service Data  

Trustworthy data is useless unless the people who need it can get to it. Who needs access? Everyone. 

A self-service data access model is one in which everyone in an organization has direct access to the data that they need for common tasks or decisions. This doesn't mean that everyone has access to everything, and it doesn't mean that security is compromised. It does mean that employees are trusted with access to the data they each need to do their jobs (I told you: for me, all roads lead back to trust). To execute this effectively, security policies must be flexible and easily configured. Granting or revoking access should happen within minutes, not days. Access should be applied at the level of individual employees on the people side, and at the level of databases or even tables on the data side.
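The granularity described above, individual employees on one side and individual tables on the other, can be sketched in a few lines of Python. The `AccessPolicy` class and the employee and table names are hypothetical; a real deployment would rely on the database's or warehouse's own permission system instead of application code like this.

```python
class AccessPolicy:
    """Tracks read grants per (employee, table) pair; all names are illustrative."""

    def __init__(self) -> None:
        self._grants: set[tuple[str, str]] = set()

    def grant(self, employee: str, table: str) -> None:
        self._grants.add((employee, table))

    def revoke(self, employee: str, table: str) -> None:
        # discard() makes revoking a grant that doesn't exist a harmless no-op
        self._grants.discard((employee, table))

    def can_read(self, employee: str, table: str) -> bool:
        return (employee, table) in self._grants


policy = AccessPolicy()
policy.grant("pat", "sales.orders")
policy.grant("sam", "product.events")
policy.revoke("sam", "product.events")
```

The point of the sketch is that a grant or a revocation is a single small operation, which is what makes "minutes, not days" achievable.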

There are no gatekeepers in a self-service model, though there will likely be an individual or even a team tasked with enabling this access. The difference? A gatekeeper provides data or answers. An enabler provides tools that allow stakeholders to explore data and construct answers on their own. You'll know you're on the right track when people start asking your enablers, "Where can I find this data?" instead of, "Who can get this answered for me?" or even, "Who can get this data for me?"   

A data-driven culture is only possible when everyone—not just technical stakeholders—can access the data they need to inform their decisions. It is unreasonable and indeed undesirable to expect people in non-technical roles to learn SQL or an equivalent. Instead, it is up to the enablers to create access solutions that meet stakeholders where they are.  

In some cases, appropriate access does mean providing direct database access via a SQL terminal. I, for example, always prefer to get data straight from the source. In other cases, appropriate access may mean providing pre-configured visual exploration tools like Power BI, Tableau, or even Excel's Power Query. I've found these to be particularly effective tools for people who are data-savvy, but not necessarily technical. These could be sales or product managers, for example, who have a few questions in mind but who welcome the freedom to explore new hypotheses as they arise. In still other cases, appropriate access means simply answering a few pre-defined questions. At the executive level, for example, live dashboards like Geckoboard or Databox are great for keeping a pulse on top-line KPIs without diluting focus by exposing too much detail.

How To Get There? 

I believe the most important step to creating a solid data infrastructure is to treat that infrastructure like a product. This means assigning it an advocate, most likely a product manager. This person's role is to work across the organization to identify the data needs of various stakeholders, and then work with engineers to design and build an environment that provides self-service access to this data in a trusted way. The data advocate should maintain a roadmap and backlog, and be accountable for demonstrating concrete progress through—you guessed it—metrics. 

Lack of an accountable advocate is, in my opinion, one of the two major reasons that the technical components of data and BI initiatives fail. Without an advocate, discussions about data consistency or reporting tools will inevitably be postponed in favor of planning the next product sprint, or fixing a few additional bugs. Data infrastructure will only evolve toward a goal if someone consistently prioritizes progress in the appropriate direction.  

The other reason that technical foundations fail to evolve, even when they have an advocate, is that they have the wrong advocate. The people who are generally most technically qualified to fill the enabler role (that is, developers or data engineers) are often blinded by their own expertise. They (we) are used to all the quirks of the data as it sits in production systems. They (we) work with it in its raw, messy state every day. This intimate knowledge can get in the way of designing accessible solutions.

You'll know you have this problem if someone in an enabler role is saying things like, "Oh, sure, that data you're looking for is in the 'Test1' column. Just remember to filter out all values above 999 before you take the average!" An effective advocate must have the customer empathy and product management skills to envision a long-term path to trustworthy data access that works for all stakeholders. They must also have the interpersonal skills required to communicate and influence effectively across organizational boundaries, with both technical and non-technical audiences.

Treating data infrastructure like a product also means that agile methodologies can and should be applied. It's a terrible idea to overhaul an entire data infrastructure all at once. Having a long-term strategic vision is great, but the path to get there should be taken one step at a time. Pick one important stakeholder and give them a dashboard. Pick one database to clean up. Not only does this let your enablers learn as they work, but it yields small and concrete successes. These signal commitment and competence to stakeholders and skeptics alike. 

Trustworthy and self-service data is a necessary, but not sufficient, condition for a data-driven culture. Data alone doesn't influence culture. Behaviors influence culture. In my next article, I'll discuss the behaviors that lead to data-driven decisions and ultimately a data-driven culture.