How Dataplex helps to reduce the redundancy of data in Google Cloud!

Discover how Dataplex helps implement a data mesh across your whole data landscape.

A data mesh is a new paradigm of data management and a data sharing strategy that helps to decentralize data ownership, allowing domain data owners to manage their own data. Distributing datasets across various locations enhances data accessibility and operational efficiency.

Dataplex assists in logically organizing your data and related artifacts into a Dataplex Lake or data domain. This approach enables you to unify distributed data and organize it according to business contexts.
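To make the idea of logical organization concrete, here is a minimal sketch in plain Python. It does not call the Dataplex API; it only models the lake, zone, and asset hierarchy that Dataplex uses. The class names, the `sales` domain, and the bucket and dataset paths are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A pointer to data that stays where it already lives."""
    name: str
    resource: str  # hypothetical path to a bucket or a BigQuery dataset


@dataclass
class Zone:
    """A grouping inside a lake, for example raw or curated data."""
    name: str
    kind: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    """A logical data domain, such as 'sales'."""
    name: str
    zones: List[Zone] = field(default_factory=list)

    def resources(self) -> List[str]:
        """List every underlying resource the lake references."""
        return [a.resource for z in self.zones for a in z.assets]


# Hypothetical domain: the data lives in different places,
# yet it appears under one business-aligned lake.
sales_lake = Lake(
    name="sales",
    zones=[
        Zone("landing", "raw",
             [Asset("orders-files", "gs://example-sales-raw/orders")]),
        Zone("reporting", "curated",
             [Asset("orders-tables", "example-analytics-project.sales_curated")]),
    ],
)

print(sales_lake.resources())
```

The point of the sketch is that the lake only holds references. Nothing is copied or moved; the grouping is purely logical.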

Teams often access datasets from various Google projects for different objectives in business contexts.

Source: Google Cloud Documentation: https://cloud.google.com/

Why Do We Need Data Mesh?

Data-driven decision-making is essential for an organization’s success. The primary challenge is that business needs evolve faster than processes can adapt, leading to a growing gap between data and its value.

Your organization uses various data sources for decision-making. How easy is it to access these sources, and can you trust their reports? Who owns and manages these data? Should a centralized team oversee all data, or is it time to decentralize ownership and allow teams with the best context to manage the data?

Data platforms can effectively support organizational goals by eliminating past concerns about capacity and engineering time for integrating new data sources. With reduced data processing, networking, and storage barriers, organizations can now more affordably manage larger volumes of data from various sources.

However, despite advancements in data platforms, the organizational model for generating and accessing analytics data has not evolved. Many companies still rely on a central team to compile and make data assets accessible, which can slow down their ability to harness the full value of their data.

Let’s take a few real-world problems:

The first major issue is the data bottleneck, where only a single data architecture team, or even one individual called a “Data Architect”, can access the data.

Consequently, all data requests must go through them, and they are expected to interpret use cases and determine data needs without adequate domain knowledge.

This frustrates data analysts, data scientists, and business users who need data for decisions. Over time, many individuals start making decisions without data due to the delays.

Data bottlenecks often lead to data chaos. Frustrated by slow access, individuals may copy data without verifying its quality or source in their search for relevant information.

This duplication can lead to confusion about the original data’s accuracy and meaning.

Besides creating governance challenges, it results in wasted resources and increased complexity, ultimately slowing productivity and eroding trust in the data.

Layered Data Architecture:

In many current architectures, data is moved from one layer to the next, and each layer holds its own copy of the data. Using existing datasets directly from the source, instead of creating layers and moving data between them, can save time and money.
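The difference between a copied layer and direct access can be shown with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, not BigQuery or Dataplex, and the table names and values are made up for illustration. A copied table goes stale as soon as the source changes, while a view that reads from the source stays current and stores no rows of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?)",
    [(1, 100.0), (2, 250.0)],
)

# Layered approach: the next layer materializes its own copy of the data.
conn.execute("CREATE TABLE layer_orders AS SELECT * FROM source_orders")

# Direct approach: a view reads from the source and duplicates nothing.
conn.execute("CREATE VIEW direct_orders AS SELECT * FROM source_orders")

# A new order arrives at the source after the copy was taken.
conn.execute("INSERT INTO source_orders VALUES (3, 75.0)")


def total(table: str) -> float:
    """Sum the amount column of the given table or view."""
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


copied_total = total("layer_orders")   # the copy misses the new order
direct_total = total("direct_orders")  # the view sees it

print(copied_total, direct_total)
```

The copy reports a total that no longer matches the source until someone reruns the pipeline, which is the redundancy and trust problem described above.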

Decentralized Architecture:

Decentralizing data ownership greatly reduces the engineering effort needed to build data assets at every stage of the data lifecycle. Implementing a data mesh with Dataplex can also help maintain data quality, data profiling, and secure data access.

Organizations should enable business domains to autonomously create, analyze, and share data products as long as a valid use case exists. Each domain will manage its data products throughout its lifecycle.

A central data team is necessary in this model but does not own the data.

They aim to help users derive value from data by enabling them to build, share, and use data products independently. They establish standards and best practices for secure and interoperable data products, implement governance policies to build trust, and provide tools to ensure compliance. Additionally, they maintain a common platform for self-service discovery and usage of data products across domains, supported by a self-service, serverless data platform.

In 2019, Zhamak Dehghani introduced the data mesh concept, applying a DevOps approach to data management. Google uses this concept as well, decentralizing data platforms with BigQuery. Instead of consolidating data from various domains into a central data lake, each domain can host and serve its datasets in an easily accessible way.

Data Mesh is an architectural approach that decentralizes data ownership to teams with the most relevant business context. These teams are responsible for keeping data fresh, trustworthy, and easily discoverable within the organization.

Data is treated as a product, managed by those who know it best. Effective governance is also decentralized, allowing data owners to customize management and access within set boundaries.

The concept of a Data Mesh is appealing because it aligns business needs with technology, helping to overcome organizational barriers to data value.

To implement a Data Mesh, companies must adopt four principles: Discoverability, Accessibility, Ownership, and Federated Governance, which require collaboration between technical teams and business leaders.

Implementing a data mesh greatly helps data management and data governance, because many data warehouses and data marts can become obsolete and be retired.

Teams managing data domains in a decentralized organization may need to form hybrid groups of data professionals responsible for data curation, management, engineering, and governance. This transition will impact daily operations and employee evaluations, necessitating buy-in from stakeholders and leadership across the organization.

Data Mesh offers a decentralized approach to data ownership, where each domain creates and consumes data, allowing for faster scalability of data sources and use cases.

You can join data across domains without duplication by utilizing federated computation and access layers with BigQuery and BigLake.
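As a rough illustration of joining across domains without duplicating data, the sketch below again uses `sqlite3` as a stand-in. Each attached database plays the role of a separate domain, and a single query joins them in place. The `sales` and `marketing` domains, their tables, and the rows are hypothetical. In BigQuery, the comparable step is referencing each table by its fully qualified `project.dataset.table` name in one query.

```python
import sqlite3

# Each attached database stands in for a separate domain's project.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS marketing")

# Each domain owns and populates its own table.
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE marketing.campaigns (customer_id INTEGER, campaign TEXT)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (1, 50.0)],
)
conn.executemany(
    "INSERT INTO marketing.campaigns VALUES (?, ?)",
    [(1, "spring"), (2, "summer")],
)

# One query joins both domains where the data already lives;
# neither table is copied into the other domain.
rows = conn.execute(
    """
    SELECT c.campaign, SUM(o.amount) AS revenue
    FROM sales.orders AS o
    JOIN marketing.campaigns AS c ON c.customer_id = o.customer_id
    GROUP BY c.campaign
    ORDER BY c.campaign
    """
).fetchall()

print(rows)
```

Each domain keeps ownership of its own table, and the consumer only needs read access to both to combine them.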

Analytics Hub and Dataplex support data discovery and centralized governance. At the same time, Looker provides a unified semantic model for easy data access by data scientists, analysts, and business users, streamlining data consumption and permissions.

Reference Materials:

Google Cloud Documentation: https://cloud.google.com/

We have reached the end of the story. Next time, we will meet with more interesting topics. Stay tuned!

