Data & AI Strategy: Four Horsemen of Data Apocalypse
Part 2: Building a robust foundation for your AI investment
This is Part 2 of the Data & AI Strategy series; check out Part 1 for the previous entry.
A data leadership position at a startup/scaleup comes with two distinct hats:
Data & Analytics
This covers building the data infrastructure, establishing data governance (hah!), and supporting and enabling business functions (Marketing, RevOps, etc.) and cross-functional product squads
Data Science/Machine Learning/AI
Which is essentially the hype flavour of the day
OK, I exaggerate—these activities can be very beneficial to the company, but historically only a handful of companies have really benefitted greatly from them. The rest maybe get a 3–5% efficiency bump overall. That's not bad—it probably pays for the investment in terms of headcount and then some.
The issue now (as of writing) is the inflection-point narrative (perceived, and for some probably somewhat real) brought on by the surge in LLM development and research. This is similar to the data science hype of the 2010s, with its inflated investor and C-suite expectations of how a company would benefit from being tech-enabled. We know that the vast majority of small to mid-market companies failed to realise any actual ROI from data initiatives.
This is likely to play out again in this ‘AI’ manifestation of the same cycle.
There are various reasons why your company will fail to realise any actual returns from AI:
Your data is crap
Your data infrastructure is crap
Your data governance is crap
Your data culture is crap
Your AI strategy is crap
Your company has no real AI use case (your company strategy is crap?)
You and the powers-that-be might be fixated on your AI strategy being crap. However, that is likely the least of your worries, by far.
The first four—your lack of reliable data/infrastructure/governance/culture—fall under the GIGO principle: any downstream task, especially ML/AI, will only exacerbate data quality issues (e.g. inaccuracy, bias). You will also find that algorithmic bias reduction, if you haven't dealt with it before, can be quite tricky.
Take the canonical gender bias example in language models (i.e. he is a boss, she is a secretary). In a language with gendered pronouns, like English, you might think of the following:
Identify a list of scenarios you want to de-bias
Assign equal probability to the associated nouns, verbs, etc. (i.e. make the gendered pronoun contextually uninformative)
This, however crude, solves your problem on paper. But now you have a model that does not reflect the structural inequalities of the society it was trained on. The primary reason you deploy a model in production is that your company achieves some business value using it. You managed to address gender discrimination, but now your model is not accurate anymore—it doesn't work in real life because real life is still rife with inequality.
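The two-step recipe above can be sketched as a crude counterfactual data augmentation pass. Everything here is illustrative—the toy pronoun list, the whitespace tokenisation, the sample sentences—and nothing like production-grade de-biasing:

```python
# Crude counterfactual augmentation sketch: for every training sentence
# containing a gendered pronoun, emit a pronoun-swapped copy so the model
# sees each role with both genders equally often.
# Note: "her" is ambiguous (him/his); a real implementation needs POS tagging.

SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "him": "her", "himself": "herself", "herself": "himself"}

def swap_gender(sentence: str) -> str:
    """Return a copy of the sentence with gendered pronouns swapped."""
    return " ".join(SWAPS.get(token, token) for token in sentence.split())

def augment(corpus: list[str]) -> list[str]:
    """Pair each sentence with its counterfactual twin, if one exists."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = swap_gender(sentence)
        if swapped != sentence:  # only add if a pronoun was actually swapped
            augmented.append(swapped)
    return augmented
```

Even this toy version shows where the trouble starts: the augmented corpus now describes a world with no gendered division of labour, which is not the world the model will be evaluated against.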
‘Tis but a single issue. Let’s take a deeper look at each of the four horsemen (horsepersons?) of the data apocalypse to understand the ways you can be screwed over in your quest to get value out of AI.
Data Quality—Death
The quality of your data is everything, no exaggeration. Startups are known to have especially low-quality data—fast pace/hack mindset, hastily made arbitrary backend engineering choices, the pains of hyper-growth, lack of continuity in leadership, data swamps, etc.
You need to capture everything your company generates, but not in a haphazard way. This can prove to be agonisingly difficult.
The domains you need to capture are many: marketing, sales, customer ops/support/enablement, product. For subscription-based SaaS businesses, this is a nice closed loop where at the end you also capture the decision to churn/renew.
But also: this is literally hell. You need to cajole, harass, coax and otherwise align with dozens of functional decision-makers. These people, like you, have a million things to do other than making your life easier.
It feels like an easy sell in the beginning—after all, everyone’s life would be easier if they had constant access to reliable, high-quality data. The challenge is in getting your stakeholders—middle and senior management—to pull their weight when it comes to trickle-down accountability. You do not have the time to police every engineer or PM.
What you can do is highlight the cost of doing nothing. Every day without access to high-quality data, your company loses out on unrealised gains. Put some numbers on it (back-of-envelope calculations are fine) and keep banging on about it like Martin Luther in 1517.
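A back-of-envelope version of that "cost of doing nothing" number might look like the sketch below. Every figure is a made-up assumption you would replace with your own company's numbers:

```python
# Back-of-envelope cost of doing nothing. All inputs are illustrative
# assumptions, not real benchmarks.

analysts = 12                 # people who touch data daily
hours_lost_per_week = 5       # per person, reconciling/cleaning bad data
loaded_hourly_cost = 75       # EUR, fully loaded cost per analyst hour
weeks_per_year = 46           # working weeks

# Direct waste: time spent fighting bad data instead of using it
wasted_cost = analysts * hours_lost_per_week * loaded_hourly_cost * weeks_per_year

# Indirect waste: decisions made on wrong numbers
bad_decisions_per_year = 4
avg_cost_per_bad_decision = 50_000  # EUR

total_annual_cost = wasted_cost + bad_decisions_per_year * avg_cost_per_bad_decision
print(f"Annual cost of doing nothing: ~EUR {total_annual_cost:,.0f}")
```

The point is not precision; it is that even conservative inputs tend to produce a number large enough to justify the investment conversation.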
Data Infrastructure—Famine
The machinery of how you handle your data puts an upper bound on what you can achieve with it. On one hand, the data you capture could be low quality at the source—not much to do other than asking for the issue to be fixed or finding an alternative. OTOH, you might mess up the data while it is travelling through your system—this is bad, and it should not happen.
The whole journey of ETL/ELT and reverse ETL (e.g. pushing enriched data back to Salesforce) must be designed to be robust and resilient. Brittle data pipelines will suck the life out of your DE teams while greatly lowering your internal stakeholders’ trust in the central data team.
A classic scenario is where the data at the source shows one thing, and the supposedly cleaned/transformed/enriched data shows something else. This is a great way to destroy the foundations of the bridges you are hoping to build.
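One cheap defence against that classic scenario is an automated reconciliation check that compares the source and transformed tables on a couple of invariants. The sketch below assumes simple list-of-dicts extracts and an illustrative `amount` column; in practice this would run as a pipeline test in your warehouse:

```python
# Minimal source-vs-transformed reconciliation: row counts and a key
# aggregate must agree, or the pipeline run is flagged. Table shapes and
# column names are hypothetical stand-ins for your warehouse extracts.

def reconcile(source_rows, transformed_rows, amount_key="amount", tolerance=0.01):
    """Return a list of discrepancies between two table extracts (empty = OK)."""
    issues = []
    if len(source_rows) != len(transformed_rows):
        issues.append(f"row count mismatch: {len(source_rows)} vs {len(transformed_rows)}")
    src_total = sum(r[amount_key] for r in source_rows)
    dst_total = sum(r[amount_key] for r in transformed_rows)
    if abs(src_total - dst_total) > tolerance:
        issues.append(f"amount mismatch: {src_total} vs {dst_total}")
    return issues

# Example: the transform silently dropped a row
source = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 50.0}]
transformed = [{"id": 1, "amount": 100.0}]
print(reconcile(source, transformed))
```

Catching the mismatch in the pipeline, before a stakeholder does, is the difference between a bug report and a trust crisis.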
One piece of advice: others are likely to resist technical solutions to their problems (fair, TBH). That is, while some BI tool like Looker might solve your problem, it may not solve theirs.
You can find a middle ground by guaranteeing what you can while giving some leeway. For example, you can adopt the following strategy:
All upstream data must conform to [standards], which the central data function guarantees to meet (SLAs)
Any downstream task—analysis, visualisation, spreadsheet-as-a-database solutions—is undertaken at its own risk (no guarantees)
Let people use spreadsheets if they are so inclined—they are probably good at it! Your job is to make sure the upstream data is trustworthy, not to ban Excel.
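The "guarantee upstream, leeway downstream" split can be made concrete as a data contract check the central team runs against its own output. The column names, types, and freshness SLA below are all illustrative assumptions, not a real schema:

```python
# Sketch of an upstream data contract: the central data team guarantees
# these columns, types, and a freshness SLA; everything downstream of this
# table is at the consumer's own risk. Field names are hypothetical.

from datetime import datetime, timedelta, timezone

CONTRACT = {
    "required_columns": {"account_id": str, "mrr": float, "updated_at": datetime},
    "max_staleness": timedelta(hours=24),  # SLA: refreshed at least daily
}

def check_contract(rows: list[dict]) -> list[str]:
    """Return a list of contract violations for a table extract (empty = OK)."""
    violations = []
    now = datetime.now(timezone.utc)
    for i, row in enumerate(rows):
        for col, typ in CONTRACT["required_columns"].items():
            if col not in row:
                violations.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], typ):
                violations.append(f"row {i}: {col!r} is not {typ.__name__}")
        ts = row.get("updated_at")
        if isinstance(ts, datetime) and now - ts > CONTRACT["max_staleness"]:
            violations.append(f"row {i}: stale (updated {ts.isoformat()})")
    return violations
```

Publishing the contract (and its violations) is what turns "the data team guarantees X" from a promise into something stakeholders can verify themselves.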
Data Governance—War
The authority of the data function to impose standards and policy across the company is criminally undersold.
You might build a robust infrastructure and, at a given point in time, achieve high data quality. We are talking about Finance and RevOps ARR numbers matching, PMs having access to accurate product domain analytics, and Marketing folks doing attribution analysis across channels and the like.
OK, maybe I went too far in saying Finance and RevOps numbers will match, but you get the gist.
Well done! Enjoy your brief moment of happiness, as it will be merely a blip in the dark if you don’t have governance in place. Things take so long to build, but need only a moment to destroy. And they (read: everyone else in the company) will destroy your beautiful infrastructure and extinguish the quality of your data, unless you defend it day after day, week after week. This is the ongoing battle you will have to wage.
Others do not take the aforementioned actions out of malice or any other negative motivation. They are merely at the mercy of their intrinsic motivations: people resist whatever appears to make their life more difficult. It is your job to show up with the resolve to keep enforcing the agreed standards and policies.
Data Culture—Conquest
‘Build it, and they will come.’
Stop it.
If that were true, then all technical data leaders would rejoice, and >80% of all high-level mis-hirings in data in the last n years would be retroactively pardoned.
The shared understanding of the nature of data dictates how your colleagues perceive the data initiatives.
If others at your company think data is an enabler (meaning it aligns with their intrinsic motivations), they will be your natural allies.
Conversely, if they have been burned by data and data initiatives before, and worse yet—established their own siloed shadow data sources—they will resist. They will give you a vote of no confidence, and will try to maintain the status quo—benefitting locally because they are solving their own problem, but losing globally (i.e. company wide), because they block cultural transformation and change management.
You need to address the naysayers ASAP; otherwise, the narrative will fester. Again, no one is acting out of malice: they simply cannot get their job done. Jobs-to-be-done is a good framework for conceptualising your place in the company: assume every other business function has jobs to be done, and none of those jobs is about working with the data team. They just want to go from A→B, and you are merely a necessary evil they have to bargain with. This makes you focus on what your stakeholders really want, rather than conceptualising their needs as technical problems (e.g. if only we could deploy a churn prediction model).
You have several options going forward, depending on the power configuration. If you have strong exec sponsorship, you can leverage it to buy yourself some time while you come up with a solution to their data woes.
If you cannot expect a higher-up intervention, then you need to start slow. Identify several of the stakeholder’s pain points (ask, don’t assume) and deliver some quick wins where possible. This will buy you some time and, if you are lucky, some goodwill as well. Then try to capitalise on a string of quick wins to kick the doors wide open for long-term collaboration.
Here, one temptation is to get caught up in the success of the quick wins and assume you can keep delivering. However, a succession of quick wins is not a strategy; you would just be winging it and getting lucky. Strategy is how you achieve long-term success, which is the topic of Part 3. Catch you then!