Mastering Medallion Architecture for managing Bronze, Silver and Gold
What is medallion architecture?
A medallion architecture is a data design pattern used to logically organize data in a lake house, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer/zone of the architecture.
This architecture guarantees ACID (Atomicity, Consistency, Isolation, and Durability) as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics.
❑ Bronze Zone
The bronze layer is usually a reservoir that stores data in its natural and original state
The main purpose of this layer is to provide a historical archive of the source system. It is also useful for reprocessing, if needed without rereading the data from the source system.
Data that is stored in bronze has usually the following characteristics:
- Maintains the raw state of the data source in the structure “as-is”.
- Data is immutable (read-only).
- Managed using interval partitioned tables, for example, using a YYYYMMDD or datetime folder structure.
- Retains the full (unprocessed) history of each dataset in an efficient storage format, for example, Parquet or Delta.
- For transactional data: can be appended incrementally and grow over time.
- Provides the ability to recreate any state of a given data system.
- Can be any combination of streaming and batch transactions.
Raw data is CDC from the production database in AWS and stored in GCS in a hive partition formatted file. Then, https://cloud.google.com/biglake?hl=en is applied to provide a unified interface for analytics
Data ingested in the bronze layer typically:
- Maintains the raw state of the data source.
- Is appended incrementally and grows over time.
- Can be any combination of streaming and batch transactions.
In conclusion, Bronze serves as a staging layer and a source for other layers, primarily accessed by technical accounts.
❑ Silver Zone
The Silver layer provides a refined structure over data that has been ingested. It represents a validated, enriched version of our data that can be trusted for downstream workloads, both operational and analytical
This is the place where the transformation — applying functional business rules happens. These functional business rules modify incoming data to fit the business requirements
- Common tasks within this layer include defining schemas and data types, removing unnecessary columns, deduplicating raw data, and applying cleaning (data quality) rules for validation and standardization, modifying or removing personally identifiable information (PII) from datasets to protect individuals’ privacy.
- Additionally, enrichment processes may involve merging datasets to enhance the overall value of insights.
Silver may have the following characteristics:
- Uses data quality rules for validating and processing data.
- Typically contains only functional data. So, technical data or irrelevant data from Bronze is filtered out.
- Historization is usually applied by merging all data. Data is processed using slowly changing dimensions (SCD), either type 2 or type 4. This means additional columns are added, such as start, end and current columns.
- Handles missing data, and standardizes clean or empty fields.
- Data is usually enriched with reference and/or master data.
- Data is often cluttered around certain subject areas.
- Data is often still source-system aligned and organized. Thus, it has not been integrated with other domain data yet.
If you’re handling Silver data that requires querying, it’s advisable to utilize a denormalized data model. This approach eliminates the necessity for extensive joins and aligns better with the distributed column-based storage architecture.
❑ Gold Zone
the Gold layer houses data that is structured in “project-specific” databases, making it readily available for consumption.
consists of meticulously curated and aggregated data. Data from your Silver layer is transformed into high-value data products with a structure that is served to your data consumers.
Various project-specific presentation layers, such as Customer Analytics, Product Quality Analytics, Inventory Analytics, Customer Segmentation, Product Recommendations, Marketing/Sales Analytics, etc., are housed within this layer. Kimball-style star schema-based data models or Inmon-style Data marts are frequently integrated into the Gold Layer of the Lakehouse.
→ utilizing a denormalized and read-optimized data model with fewer joins, such as a Kimball-style star schema
The gold layer is expected to have the following characteristics:
- Gold tables represent data that has been transformed for consumption or use cases.
- Historization is applied only for the set of use cases or consumers. So, Gold can be a selection or aggregation of data that’s found in Silver.
- In Gold, you apply complex business rules. So, it uses many post-processing activities, calculations, enrichments, use-case-specific optimizations, etc.
- Data is highly governed and well-documented.
Reference
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion