Data as Software
The best practices that apply to the software development lifecycle also apply to the data analytics development lifecycle.
There is a greater tendency in data analytics to use low-code tools than there is in software development. This makes sense: data is a constrained software problem. The constraints of working with only known rows and columns allow for low-code tools that drag and drop columns to aggregate rows in known ways.
However, excellent data analytics needs to reflect the business logic of the organization, and the software development industry has deep experience encoding business logic into software. The best practices used in software development also apply in data, even when low-code tools are used.
These best practices include both cultural best practices and technical best practices.
Cultural Practices
There are two common failure patterns for data teams:
Overly Reactive Data Teams
Data teams often tackle problems and requests as they come up, bouncing from one data task to another without ever stepping back to understand the underlying user needs driving the requests, or investing in tooling that makes common tasks easier to solve.
Reactive data teams often feel overworked. They are constantly running against deadlines. There’s little advance notice and every request is urgent.
Organizations often perceive overly reactive data teams as heroic because they see the effort being put in, but they may also:
- Wonder why seemingly simple requests take so long
- Become frustrated with the number of revisions it takes to get something right, because the team didn’t have an intake process that helped them understand the original request
- Unknowingly use data that is wrong, because the team isn’t verifying data quality and/or providing leadership on the business insights revealed by the data
Boil the Ocean Data Teams
The other extreme is data teams that are not responsive to day-to-day requests, instead focusing almost entirely on one massive infrastructure project that attempts to solve all of the data problems at once.
These teams may have become frustrated with the drawbacks of the overly reactive approach and decided to finally tackle what they see as the root cause: the need for fancier technical infrastructure.
Or, they are natural engineers who love the technical challenges of building data architecture more than they love the business impact of data.
The people on boil-the-ocean data teams often feel proud of their technical achievements. They continually advocate for more data resources and may get frustrated when asked to spend time on immediate user needs.
Organizations often perceive boil-the-ocean data folks as geniuses, because they are tackling the most technically complex work in the organization. However, they may also:
- Become frustrated by how long it takes to get anything done
- Question the return on investment of the time spent on a tool that never seems to meet end users’ needs
- Develop ways to track their own data, because the data team is unresponsive to their needs
- Worry about succession planning — the overly complex solution makes sense only to the people on the data team. What if they left?
Synthesis
These challenges are not unique to data. Any organization that is doing custom software development encounters them.
The routines of software development are deeply helpful here, including:
- Dedicating time to understand and document end user needs, aligning on a clear scope for each effort, and assigning that scope to short sprint cycles
- Focusing on “Minimum Viable Product” deployments and iterating in response to feedback
- Building a culture of frequent production deployments
These are cultural / adaptive approaches to running a data team, not technical solutions. In the rush to hire technical data talent and stand up data infrastructure, they are often forgotten. But ultimately, high-performance data teams have a high-performance culture centered on principles that deeply overlap with the software development lifecycle.
Tools
In addition to the cultural adoption of the SDLC, the software development industry has developed outstanding tooling that applies equally to data teams.
Schema Changes Go Through Version Control
Version control systems allow teams to collaborate effectively on the same codebase.
All schema changes should go through version control. In practice, that means writing schema changes in code, committing them using Git, and opening a pull request on a platform like GitHub.
Schema changes impact enough people that there should be a second set of eyes on them through a pull request review. Having the version history helps identify precisely when things changed, and makes rolling back changes easier. Schema changes can be easily tested using a local development branch, a staging database and a production database.
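As a minimal sketch, a schema change written in code is just a SQL file that lives in the repository and flows through the normal pull request process (the file, table, and column names here are hypothetical):

```sql
-- migrations/2025_07_01_add_order_channel.sql
-- Proposed in a pull request, reviewed, then applied to the
-- staging database before it ever touches production.
ALTER TABLE orders
    ADD COLUMN order_channel VARCHAR(20) NOT NULL DEFAULT 'web';
```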
The pull request review process is a great opportunity for senior engineers to give feedback on the approach junior engineers take. The senior engineers’ code should be reviewed too, giving junior engineers the opportunity to see in depth what quality code looks like and making the feedback relationship reciprocal.
If your organization’s use of data is at all successful, there will be enough people depending on the data that it is worth taking a structured approach to deploying changes through version control.
Test for Uniqueness
How many times have you joined two tables and accidentally duplicated rows?
More times than you realize, surely.
Adopting a software development mindset for data includes creating “unit tests” that verify that the data is as it is expected to be.
These tests help create trust from end users: I know the data is what it says it is, because we tested it.
At a minimum, every view / table should have a test verifying that what you think uniquely identifies a row actually does.
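A minimal sketch of such a test in plain SQL, assuming a hypothetical orders table that should be unique on order_id; the query must return zero rows to pass:

```sql
-- Any row returned is a duplicate: order_id is not a unique key after all.
SELECT
    order_id,
    COUNT(*) AS duplicate_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```

In dbt, the built-in unique test generates essentially this query for you.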
Run Tests on Every Schema Change Through CI/CD
It’s not enough to run tests interactively via a SQL query. Creating a trusted data product means ensuring that every time you deploy a new schema change via your version control system, you are automatically executing your test suite to verify that there are no breaking changes.
This is called CI/CD in software development (continuous integration / continuous deployment).
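In dbt, for example, a singular test is just a SQL file in the tests/ directory that selects failing rows; wiring `dbt test` into the CI pipeline means every pull request runs the whole suite before anything merges (the file and column names below are hypothetical):

```sql
-- tests/assert_no_negative_order_totals.sql
-- dbt treats any returned row as a test failure, and the CI job
-- running `dbt test` on each pull request blocks the merge.
SELECT
    order_id,
    order_total
FROM {{ ref('orders') }}
WHERE order_total < 0;
```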
Notice When Things Break
Believe me, you don’t want your CEO to be the one who notices when the executive dashboard hasn’t been updated in a week because a data pipeline failed.
Monitoring and observability is an entire field within software development, and data-specific tools have emerged that make it easier to monitor the batch jobs so common in data work.
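Even without a dedicated observability tool, a basic freshness check captures the idea. As a sketch in Postgres-flavored SQL, assuming a hypothetical pipeline_load_log table:

```sql
-- Returns a row (which a scheduler can turn into an alert) if the
-- orders pipeline has not loaded successfully in the last 24 hours.
SELECT
    MAX(loaded_at) AS last_successful_load
FROM pipeline_load_log
WHERE pipeline_name = 'orders'
HAVING MAX(loaded_at) < NOW() - INTERVAL '24 hours';
```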
Documentation in Place
How many person-years have been spent maintaining data dictionaries that are out of date the moment a schema change is deployed?
Tools like dbt allow for documentation-in-place, where tables and columns are documented in the same place they are defined (and tested).
Documentation-in-place allows for your CI/CD process to automatically reject new code that doesn’t meet the minimum documentation standards, ensuring that your data dictionaries are always up to date.
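dbt keeps these descriptions in YAML files alongside the model definitions. As a plain-SQL sketch of the same principle, in databases that support SQL comments (Postgres syntax shown), documentation can ship in the very migration that defines the column (names here are hypothetical):

```sql
-- The documentation travels with the schema change itself, so the
-- data dictionary cannot lag behind the deployment.
COMMENT ON TABLE orders IS
    'One row per customer order, loaded nightly from the order system.';
COMMENT ON COLUMN orders.order_channel IS
    'Sales channel the order came through: web, phone, or in_store.';
```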
What about AI?
Would this be a blog post without mentioning AI?
One of the primary benefits of treating data as software is that AI is very good at developing software.
If columns are documented in the same codebase where new data products are being defined, tools like Roo Code can load the documentation into context and understand your data schema, making query generation as simple as asking for what you want in plain English.
Compare that to point-and-click tools like Azure Data Factory, where every change needs multiple clicks to execute.
Azure Data Factory-like tools will continue to improve and will continue to integrate AI.
But AI is already amazingly good at writing SQL.
Perhaps the most persuasive reason to treat data as software is to unlock the productivity of your data analysts by letting them use AI-first software development tools. Find out more by contacting Insource’s Data & AI team.