This article is contributed. See the original author and article here.
Azure Synapse now enables you to query data stored in Apache Delta Lake format. This is one of the top feedback requests and we are happy to announce that this feature is now available in public preview.
In this article you will learn how to run the T-SQL queries on a Delta Lake storage from your Synapse workspace.
What is Delta Lake?
Delta Lake is an open-source data format that enables you to update your big data sets with guaranteed ACID transaction behavior. Delta Lake is a layer placed on top of your existing Azure Data Lake data that can be fully managed using Apache Spark APIs available in both Azure Synapse and Azure Databricks.
Delta Lake is one of the most popular updateable big data formats in big data solutions, and frequently used by many data engineers who need to prepare, clean, or update data stored in data lake, or apply machine learning experiments.
Why would you query Delta Lake with Azure Synapse?
Azure Synapse provides a serverless endpoint with every Synapse workspace. This provides a few key benefits when it comes to querying Delta Lake data:
- Easy data sharing between Azure Synapse and Azure Databricks without the need to copy and transform data.
- Connecting a large ecosystem of reporting and analysis tools with your data stored in Delta Lake format.
- Favorable pay-per-use consumption model where you don’t need to pre-provision resources. You are paying only for the queries that you are executing.
You can use Azure Synapse and Azure Databricks to prepare and modify your Delta Lake data sets placed in the Azure Data Lake storage. Once your data engineers have prepared the data, your data analysts can create reports using the tools such as Power BI.
Using the serverless query endpoint in Azure Synapse, you can create a relational layer on top of your Delta Lake files that directly references the location where Azure Synapse and Azure Databricks are used to modify data. This way, you can get the real-time analytics on top of the Delta Lake data set without any need to wait for a pipeline to copy and prepare data.
Data sharing without copy, load, or transformation of Delta Lake files is the main benefit of serverless SQL pools. The serverless endpoint in Azure Synapse represents a bridge between a reporting/analytics layer where you use Power BI or Azure Analysis Services, and your data stored in Delta Lake format. This enables a variety of tools that work on T-SQL endpoints to access Delta Lake data.
In this solution, every role in your organization that works with big data can use the preferred tools to complete their tasks:
- Data engineers can keep using the standard tools for data preparation and transformation (for example the notebooks in Azure Synapse or Jupyter environment).
- Data analysts can keep using their favorite reporting tools such as Power BI to analyze data and present the reports to the end users.
Azure Synapse enables your teams to implement end-to-end solutions on top of Delta Lake files with no friction or a need to change their standard tools.
How to query Delta Lake in Azure Synapse?
The serverless endpoint in Azure Synapse (serverless SQL pool) enables you to easily query data stored in Delta Lake format. You just need to provide a URI of the Delta Lake folder to the OPENROWSET function and specify that the format is DELTA. If you have plain parquet files, you can easily convert them to Delta Lake format using Apache Spark.
This query enables you to explore data in your Delta Lake data sets. The OPENROWSET function will automatically determine the columns in Delta Lake format and their data types by inspecting Delta Lake folder.
When you complete exploration, you can create views or external tables on top of your Delta Lake folder. The partitioned views are preferred approach if you have partitioned Delta Lake structure because they can more optimally execute your queries and eliminate the partitions that do not contain the data that should be returned in the queries.
Tools and applications such as Power BI or Azure Analysis Service can read data from these views or external tables, without need to know that the underlying data is stored in Delta Lake format.
Learn more about Delta Lake query capabilities in Synapse documentation.
Feedback
This feature is currently in public preview, and the Azure Synapse team is happy to hear your feedback.
Some features such as temporal/time-travel queries, automatic synchronization of Delta Lake tables created in Spark pools, and updates of Delta Lake data are still not available in the public preview. We would be happy to get your feedback related to the new features in this scenario, so you can post your ideas in Azure Feeback site.
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.
Recent Comments