As the data in an organization continues to grow, the complexity and processing demands of its data pipelines grow with it. Databricks Delta is designed to handle both batch and stream processing while reducing system complexity, with the aim of providing high-performing, reliable, and simple data pipelines.
What is Databricks Delta?
Databricks Delta, a component of the Databricks Unified Analytics Platform, is an analytics engine that provides a powerful transactional storage layer built on top of Apache Spark. It helps users build robust production data pipelines at scale and provides a consistent view of the data to end users.
Benefits of Databricks Delta
Organizations that want meaningful information out of their data need a data pipeline that serves their end users well. Some of the issues that data engineers face when developing data pipelines are query performance, data reliability, and system complexity. Here’s how Delta tackles these challenges.
As data grows exponentially in size, getting meaningful information out of it quickly becomes crucial. Using several techniques, Delta boasts query performance 10 to 100 times faster than Apache Spark on Parquet.
- Data Indexing – Delta creates and maintains indexes on the tables.
- Data Skipping – Delta maintains file statistics on the data subset so that only the relevant portions of the data are read in a query.
- Compaction – Delta manages file sizes of the underlying Parquet files for the most efficient use.
- Data Caching – Delta automatically caches highly accessed data to improve run times for commonly run queries.
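To make the data skipping idea more concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta's actual implementation or file format: the `FileStats` class, the file names, and the order ID ranges are all made up for illustration. The point is only that keeping a minimum and maximum value per file lets a query ignore files that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column (not Delta's real format)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(stats, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [
        s.path
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Three made-up data files, each covering a range of order IDs.
stats = [
    FileStats("part-0.parquet", 1, 1000),
    FileStats("part-1.parquet", 1001, 2000),
    FileStats("part-2.parquet", 2001, 3000),
]

# A query for order IDs 1500..1600 only needs to open one of the three files.
selected = files_to_read(stats, 1500, 1600)
print(selected)
```

The fewer files a query has to open, the less data it scans, which is where much of the speedup from skipping would come from.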
The end users of the data must be able to rely on the accuracy of the data. Delta uses various techniques to achieve data reliability.
- ACID transactions – Delta employs an all or nothing approach for data consistency.
- Snapshot isolation – Ensures that multiple writers can write to a dataset simultaneously without interfering with jobs that are reading the dataset.
- Schema enforcement – Improves data integrity by checking incoming data against the table schema.
- Checkpoints – Ensure data is delivered and read only once, even if there are multiple incoming and outgoing streams.
- Upsert and delete support – Handles late-arriving and changing records, as well as cases where records should be deleted.
- Data versioning – Allows organizations to roll back and reprocess data as necessary.
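Several of the reliability ideas above (all-or-nothing commits, schema enforcement, and versioned snapshots) can be illustrated with a small toy model in plain Python. This is only a sketch under simplifying assumptions: the `ToyTable` class and its methods are hypothetical, everything lives in memory, and it says nothing about how Delta actually stores its transaction log.

```python
class ToyTable:
    """A toy append-only table; every successful commit creates a new version."""

    def __init__(self, schema):
        self.schema = schema      # column name -> expected Python type
        self.versions = [[]]      # versions[i] is the full row list at version i

    def commit(self, rows):
        """Validate every row first, then publish all of them as one new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"bad type for column {column!r}")
        # Nothing was published before this point, so a failure above leaves
        # the table exactly as it was (all or nothing).
        self.versions.append(self.versions[-1] + list(rows))
        return len(self.versions) - 1

    def snapshot(self, version=None):
        """Return the rows as of a version; readers never see a partial commit."""
        if version is None:
            version = len(self.versions) - 1
        return list(self.versions[version])


table = ToyTable({"id": int, "city": str})
v1 = table.commit([{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}])

try:
    # The second row has the wrong type, so the whole batch is rejected.
    table.commit([{"id": 3, "city": "Rome"}, {"id": "four", "city": "Kyiv"}])
except ValueError as err:
    print("rejected:", err)

print(len(table.snapshot()))
```

Because the rejected batch never becomes a version, a reader holding an older snapshot keeps seeing consistent data, and any earlier version can still be read back for reprocessing.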
An overly complex system makes it difficult to respond to change. Delta provides a data analytics architecture that is flexible and responsive to change.
- Delta can write batch and streaming data into the same table, allowing a simpler architecture and a shorter path from data ingestion to query result.
- Delta can infer the schema of the input data, which reduces the effort required to manage schema changes.
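As a rough sketch of what writing batch and streaming data into the same table might look like in PySpark, the two helpers below append to one Delta path. The function names and the paths passed to them are hypothetical; the calls themselves use the standard Spark writer methods with the `delta` format, and the snippet assumes it runs in an environment where Delta is available (for example, a Databricks cluster).

```python
def append_batch(df, table_path):
    """Append a batch DataFrame to the Delta table stored at `table_path`."""
    df.write.format("delta").mode("append").save(table_path)


def append_stream(stream_df, table_path, checkpoint_path):
    """Continuously append a streaming DataFrame to the same Delta table.

    The checkpoint location is where Spark tracks streaming progress so the
    stream can resume where it left off after a restart.
    """
    return (
        stream_df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

Both writers target the same path, so downstream queries read one table regardless of how each record arrived.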
Additional Databricks Delta functionalities
Delta has some additional capabilities that your organization may find useful. The highlights below can help you make an informed decision.
Simple transition from Spark to Delta
Switching from Parquet to Delta can be as simple as replacing references to “Parquet” with “Delta” in your code. Because Delta tables contain metadata and additional functions such as upserts, less custom coding is required.
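A minimal PySpark sketch of that switch might look like the following. The helper names and any paths you pass in are hypothetical; the only thing that changes between a Parquet write and a Delta write is the format string. It assumes an environment where the `delta` format is available.

```python
def save_table(df, path, fmt="delta"):
    """Write a Spark DataFrame to `path`.

    Only the format string differs between a plain Parquet write
    (fmt="parquet") and a Delta write (fmt="delta").
    """
    df.write.format(fmt).mode("overwrite").save(path)


def load_table(spark, path, fmt="delta"):
    """Read the table back using the same format string."""
    return spark.read.format(fmt).load(path)
```

Existing code that called the equivalent of `save_table(df, path, fmt="parquet")` would only need the format argument changed, although data already written as plain Parquet would still have to be rewritten or converted before it can be read as Delta.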
Generally, when you update a table backed by Parquet files, you overwrite entire files. Delta allows you to update records at the table or partition level. In addition, it provides deduplication capabilities and simplifies updates with a merge (upsert) capability. Users can also track and handle slowly changing dimension (SCD) Type 2 changes.
As with updates, users can delete the records that match a condition instead of rewriting the entire file.
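To show what merge (upsert) and conditional delete mean in terms of rows, here is a small sketch in plain Python. It models only the semantics on in-memory lists of dictionaries; the function names and sample records are made up, and in Delta itself these operations are expressed with statements such as MERGE INTO and DELETE FROM rather than code like this.

```python
def merge_upsert(target, updates, key):
    """Return a new row list: rows from `updates` replace target rows with the
    same key, and rows with unseen keys are appended."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update if matched, insert if not
    return list(merged.values())


def delete_where(rows, predicate):
    """Return the rows that do NOT satisfy the delete condition."""
    return [row for row in rows if not predicate(row)]


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
changes = [
    {"id": 2, "email": "b.new@example.com"},  # late-arriving change
    {"id": 3, "email": "c@example.com"},      # brand-new record
]

customers = merge_upsert(customers, changes, key="id")
customers = delete_where(customers, lambda row: row["id"] == 1)
print(customers)
```

Keying the merge on `id` also deduplicates: if the same key appears more than once in the incoming changes, only the last row for that key survives.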
Query your data lake directly
Delta provides consistent reads during appends, updates, and deletes on your data lake. In a self-deployed Spark setup, an organization must work out its own way to guarantee consistency.
The ability to reference previous versions can be useful for a data scientist who needs to run models on datasets as of a specific time. Users can query Delta tables as of a certain timestamp. Table versions are created whenever there is a change to the Delta table and can be referenced in a query. This time travel capability also allows users to roll back after a bad write.
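As an illustration, a time travel read in PySpark could be wrapped in a small helper like the one below. The helper name and its arguments are hypothetical; it relies on the `versionAsOf` and `timestampAsOf` reader options of the `delta` format and assumes a runtime where Delta time travel is supported.

```python
def read_as_of(spark, path, version=None, timestamp=None):
    """Read a Delta table as it was at a version number or at a timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    else:
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)
```

A data scientist could, for example, call `read_as_of(spark, path, timestamp="2019-01-01")` to train on the table exactly as it stood on that date, where `path` is wherever the Delta table happens to live.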
Optimize data layout for better performance
Delta table sizes can be optimized with a built-in “optimize” command. Users can also optimize only the most relevant portion of a Delta table, for example the last 3 months, instead of the entire table, which can help reduce the overhead cost of storing metadata and speed up queries. Data Skipping and ZORDER Clustering techniques allow for consistent and fast reads on the Delta tables.
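The sketch below builds such an optimize statement as a string in Python. The table name, the `date` filter, and the `event_type` column are hypothetical, and the example assumes `date` is a partition column of the table; the resulting statement would be run on Databricks with something like `spark.sql(statement)`.

```python
def optimize_statement(table, where=None, zorder_by=()):
    """Build an OPTIMIZE statement, optionally limited by a partition filter
    and clustered with ZORDER BY."""
    sql = f"OPTIMIZE {table}"
    if where:
        sql += f" WHERE {where}"
    if zorder_by:
        sql += f" ZORDER BY ({', '.join(zorder_by)})"
    return sql


# Hypothetical table and columns: compact only recent partitions and
# co-locate rows by a frequently filtered column.
statement = optimize_statement(
    "events", where="date >= '2019-01-01'", zorder_by=["event_type"]
)
print(statement)
```

Restricting the command to recent partitions keeps each run short, while Z-ordering on a commonly filtered column gives data skipping more to work with.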
Drawbacks of Databricks Delta
While Delta has many benefits, there are some points to consider prior to using it as your data pipeline solution:
- Delta is only available as part of the Databricks ecosystem. You may want to consider whether the other tools on Databricks would fit your organization’s data architecture prior to moving forward with Delta.
- While Databricks is available on AWS and Azure, it is not currently available on GCP.
- There are some challenges if the data in the Delta tables needs to be regularly migrated to another system upstream. See the “Can I access Delta tables outside of Databricks Runtime?” section of the Databricks FAQ.
- Consider the cost of a Databricks solution.
Who should use Databricks Delta?
Databricks Delta could be a useful platform for organizations:
- That are already using Databricks components and need a data pipeline solution to build out their data lake.
- That are currently using the Hadoop/Spark stack and would like to simplify their data pipeline architecture while improving performance.
- With large datasets that require high-volume processing capabilities and that want to avoid the hassle of managing metadata, backups, upserts, and data consistency.
Databricks Delta might not be the right platform for organizations:
- With small datasets that a traditional database solution can handle.
- Where data consistency is less important than getting the data to the data lake quickly.
- That are using technology stacks other than Databricks that have a similar tool to Delta.
- That will require constant extracts of data from Delta tables into a separate upstream system.
- With GCP as their cloud provider.
Moving forward with a new data engineering solution is a big decision for an organization. Hopefully some of the points discussed in this post help you and your organization make an informed decision on whether Databricks Delta is right for your circumstances.
The introduction of Databricks Delta is exciting and fills in a gap in the data analytics space. As a Data Engineer who has built many data pipelines in the past for clients, I can see how useful Delta can be. As Databricks continues to build out Delta, it will be interesting to see how the product continues to evolve.
For more information on Databricks Delta, check out these resources: