Private preview of Data Virtualization in Azure SQL Database

This article is contributed. See the original author and article here.

We are very happy to announce the private preview of Data Virtualization in Azure SQL Database. Data Virtualization in Azure SQL Database enables working with CSV, Parquet, and Delta files stored on Azure Storage Account v2 (Azure Blob Storage) and Azure Data Lake Storage Gen2. Azure SQL Database will now support: CREATE EXTERNAL TABLE (CET), CREATE EXTERNAL TABLE AS SELECT (CETAS) as well as enhanced OPENROWSET capabilities to work with the new file formats.

The list of capabilities available in private preview are:

CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL) – SQL Server | Microsoft Learn with Shared Access Signature (SAS Token) support;

CREATE EXTERNAL DATA SOURCE (Transact-SQL) – SQL Server | Microsoft Learn with support for Azure Storage Account v2 (ABS), and Azure Data Lake Gen2 (ADLS);

CREATE EXTERNAL FILE FORMAT (Transact-SQL) – SQL Server | Microsoft Learn with support for CSV, and Parquet;

CREATE EXTERNAL TABLE (Transact-SQL) – SQL Server | Microsoft Learn with support for CSV, Parquet, and Delta;

CREATE EXTERNAL TABLE AS SELECT (CETAS) (Transact-SQL) – SQL Server | Microsoft Learn with support for CSV and Parquet;

OPENROWSET (Transact-SQL) – SQL Server | Microsoft Learn with support for CSV, Parquet, and Delta.

Metadata functions:

Just like in SQL Server 2022 (Data Virtualization with PolyBase for SQL Server 2022 – Microsoft SQL Server Blog) and Azure SQL Managed Instance (Data virtualization now generally available in Azure SQL Managed Instance – Microsoft Community Hub), Data Virtualization in Azure SQL Database also supports updated metadata functions, wildcard search mechanism, and procedures that enables the users to query across different folders and leverage partition pruning, commands like:

Filename();

Filepath();

sp_describe_first_result_set.

Benefits:

Major benefits of Data Virtualization in Azure SQL Database are:

No data movement: Access real-time data where it is.

T-SQL language: Ability to leverage all the benefits of the T-SQL language, its commands, enhancements, and familiarity.

One source for all your data: Users and applications can use Azure SQL Database as a data hub, accessing all the required data in a single environment.

Security: Leverage SQL security capabilities to simplify permissions, credential management, and control

Export: Easily export data as CSV or Parquet to any Azure Storage location, either to empower other applications or reduce cost.

Getting started:

For simplicity, we are going to use publicly available NYC Taxi dataset (NYC Taxi and Limousine yellow dataset – Azure Open Datasets | Microsoft Learn) that allows anonymous access.

-- Create data source for NYC public dataset:
CREATE EXTERNAL DATA SOURCE NYCTaxiExternalDataSource
WITH (LOCATION = 'abs://nyctlc@azureopendatastorage.blob.core.windows.net');

-- Query all files with .parquet extension in folders matching name pattern:
SELECT TOP 1000 *
FROM OPENROWSET(
 BULK 'yellow/puYear=*/puMonth=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiExternalDataSource',
 FORMAT = 'parquet'
) AS filerows;

-- Schema discovery:
EXEC sp_describe_first_result_set N'
 SELECT
 vendorID, tpepPickupDateTime, passengerCount
 FROM
 OPENROWSET(
  BULK ''yellow/*/*/*.parquet'',
  DATA_SOURCE = ''NYCTaxiExternalDataSource'',
  FORMAT=''parquet''
 ) AS nyc';

-- Query top 100 files and project file path and file name information for each row:
SELECT TOP 100 filerows.filepath(1) as [Year_Folder], 
filerows.filepath(2) as [Month_Folder],
filerows.filename() as [File_name], 
filerows.filepath() as [Full_Path]
FROM OPENROWSET(
 BULK 'yellow/puYear=*/puMonth=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiExternalDataSource',
 FORMAT = 'parquet') AS filerows;

-- Create external file format for Parquet:
CREATE EXTERNAL FILE FORMAT DemoFileFormat
WITH ( FORMAT_TYPE=PARQUET );

-- Create external table:
CREATE EXTERNAL TABLE tbl_TaxiRides(
 vendorID VARCHAR(100) COLLATE Latin1_General_BIN2,
 tpepPickupDateTime DATETIME2,
 tpepDropoffDateTime DATETIME2,
 passengerCount INT,
 tripDistance FLOAT,
 puLocationId VARCHAR(8000),
 doLocationId VARCHAR(8000),
 startLon FLOAT,
 startLat FLOAT,
 endLon FLOAT,
 endLat FLOAT,
 rateCodeId SMALLINT,
 storeAndFwdFlag VARCHAR(8000),
 paymentType VARCHAR(8000),
 fareAmount FLOAT,
 extra FLOAT,
 mtaTax FLOAT,
 improvementSurcharge VARCHAR(8000),
 tipAmount FLOAT,
 tollsAmount FLOAT,
 totalAmount FLOAT
)
WITH (
 LOCATION = 'yellow/puYear=*/puMonth=*/*.parquet',
 DATA_SOURCE = NYCTaxiExternalDataSource,
 FILE_FORMAT = DemoFileFormat
);

-- Query the external table:
SELECT TOP 1000 * FROM tbl_TaxiRides;

Private Preview Sign-up form:

Data Virtualization in Azure SQL Database is in active development, Private Preview users will help shape the future of the feature, with regular interactions with Data Virtualization product team. If you want to be part of the private preview a sign-up form is required and can be found here.

Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.

Private preview of Data Virtualization in Azure SQL Database

Submit a Comment Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

We look forward to meeting you