Data gravity is the notion that large datasets tend to attract relevant applications, services, and smaller datasets. The term, an IT metaphor, was coined in a 2010 article by software engineer Dave McCrory.
Data Gravity is not a new notion
While Data Gravity may appear as a challenge to some, especially in the distributed IT world, it is not a new notion to mainframe shops. Back in the 1980s, banks and other financial institutions had few choices as to where to store their customer data: on physical paper, or in datasets on a computer (at the time, a mainframe). Given the architecture of these systems, nowadays called “System Z”, it was logical to keep applications close to the data (in any case, there was no other place to move data to!), so slowly but surely the data grew “in place” over time. Back then, when a company needed another business application, it was built on top of the one database, the one version of the truth. Applications were built to coexist and operate efficiently, querying and manipulating data against that same data source.
Over the past decade or so, a trend emerged of moving data to public clouds to save on storage costs and to build applications faster and more easily using Agile methodologies. Enterprises started to copy their business data into data lakes for Machine Learning and analytical processing.
Recent technical breakthroughs for on-prem analytics
Off-platform data replication challenges / Data Gravity benefits
Corporations typically allocate budget on a per-project basis (for example, building a new application for a new service) and demand a quick time to market, usually on the promise of Agile Software Development. Project managers and IT software engineers know that integrating their new application into the ecosystem of business applications gravitating around the production database will introduce operational challenges that slow down delivery. Moving data over to a cloud storage environment is “in budget” and avoids those operational challenges; however, it introduces new challenges for
data integrity – the one version of the truth remains the production database, so data changes need to be replicated to the cloud environment. Data replication, a technical challenge in itself, is also an expensive way of processing large amounts of data, especially if you need consistency across the various data sources.
data security – it is easier to secure business and other PII data in a single on-prem database than when the data is replicated to various cloud storage providers multiple times (effectively one new database for each new application being built).
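To make the data integrity point concrete, here is a minimal sketch of a consistency check between the source of truth and a replica. The table name, schema, and use of SQLite are hypothetical, chosen only to keep the example self-contained; real replication tooling uses far more sophisticated mechanisms, but the underlying question is the same: do the two copies still agree?

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table):
    """Hash every row (ordered by primary key) into one digest.
    If the source and replica digests differ, the copies have drifted."""
    h = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY id"):
        h.update(repr(row).encode())
    return h.hexdigest()

# Toy "source of truth" and "replica" databases (illustrative only).
source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for db in (source, replica):
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    db.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 250.0)")

# An update that has reached the source but not (yet) the replica:
source.execute("UPDATE accounts SET balance = 75.0 WHERE id = 1")

print(table_fingerprint(source, "accounts") ==
      table_fingerprint(replica, "accounts"))
# → False: the replica no longer matches the one version of the truth
```

Every extra replica means another fingerprint to compare, another lag window to monitor, and another place where the comparison can fail.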
Finally, leveraging Data Gravity (i.e. keeping data in one place) can also reduce storage costs. Granted, cloud storage is less expensive per terabyte than on-prem storage, but the existing critical business applications already rely on the performance and availability of a DBMS capable of processing huge transactional workloads. If a new replica of the source database is created for each project, the cloud storage cost per application or service looks low and fits the project’s budget. However, the cumulative cost of all these duplicate off-prem datasets can be significant enough to concern the corporation as a whole.
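The budget arithmetic can be sketched in a few lines. All the figures below are hypothetical, chosen only for illustration; real prices vary widely by provider, tier, and region. The point is the shape of the calculation, not the numbers:

```python
# Hypothetical figures for illustration only -- real prices vary by provider.
source_size_tb = 10               # size of the production dataset, in TB
cloud_cost_per_tb_month = 20.0    # assumed cloud replica price, USD/TB/month
on_prem_cost_per_tb_month = 50.0  # assumed on-prem price, USD/TB/month
projects = 8                      # each project spins up its own replica

per_project = source_size_tb * cloud_cost_per_tb_month     # fits one project's budget
cumulative = per_project * projects                        # what the corporation pays
on_prem_once = source_size_tb * on_prem_cost_per_tb_month  # single copy, in place

print(f"per project:  ${per_project:,.0f}/month")   # $200/month
print(f"all replicas: ${cumulative:,.0f}/month")    # $1,600/month
print(f"one on-prem:  ${on_prem_once:,.0f}/month")  # $500/month
```

Each project sees only its own $200/month line item, while the corporation as a whole pays more for eight cloud replicas than it would for the single on-prem copy the applications already gravitate around.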
How to harvest benefits from Data Gravity in today’s world
The need for business agility and quicker time to market drove adoption of Agile methodologies across the software industry. Inspired by the typical architecture of distributed applications, data duplication often appears to be a natural solution, avoiding the operational challenge of integrating new applications into an already complex production IT workload: the new application interacts with its own database, in a silo. Delivery is quick and the new application is a success… but:
Making sure the data from Production is replicated fast enough for the application not to operate on stale data is a challenge for the operational team, and
Ensuring that yet another copy of the Production data is safe is a nightmare for the security administrators.
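The first challenge above, replication lag, is often monitored with a simple watermark comparison. The sketch below is hypothetical (the 30-second freshness budget and the `is_stale` helper are made up for illustration): each copy records when it last saw a committed change, and the replica is flagged as stale once it lags the source by more than the agreed budget.

```python
from datetime import datetime, timedelta, timezone

# Assumed freshness SLA: the replica may lag the source by at most 30 seconds.
STALENESS_BUDGET = timedelta(seconds=30)

def is_stale(source_watermark, replica_watermark, budget=STALENESS_BUDGET):
    """A replica is stale when its last-applied-change watermark
    trails the source's watermark by more than the budget."""
    return (source_watermark - replica_watermark) > budget

now = datetime.now(timezone.utc)
print(is_stale(now, now - timedelta(seconds=5)))   # → False: within budget
print(is_stale(now, now - timedelta(minutes=10)))  # → True: the app reads stale data
```

With one on-prem source of truth there is no watermark to compare at all; every additional replica adds another lag window that the operational team has to watch.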
A better way of building business applications is to leverage DevOps processes and tooling adapted to the existing environment. This allows you to build applications iteratively on top of the source data, the one version of the truth. Continuous testing throughout the DevOps pipeline, along with regular engagement with the operational team, ensures that the new application can coexist efficiently within the ecosystem of existing business applications. And the business data stays safe and secure in the Production database.