This article is contributed. See the original author and article here.
Azure Data Lake Storage Gen2 (ADLS Gen2) is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. It therefore supports the Azure Blob Storage API and also has its own File System API.
Blob Storage API: https://docs.microsoft.com/en-us/rest/api/storageservices/operations-on-blobs
File System API: https://docs.microsoft.com/en-us/rest/api/storageservices/data-lake-storage-gen2
These interfaces allow you to create and manage file systems, as well as to create and manage directories and files in a file system. The Azure Data Lake Storage Gen2 APIs support Azure Active Directory (Azure AD), Shared Key, and shared access signature (SAS) authorization.
In this blog, we show how to use an Azure AD service principal to upload a file to ADLS Gen2 through the File System API with a PowerShell script.
Part 1: Register an application with the Microsoft identity platform and apply the valid role assignment for access. https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app
1. Register a new application in Azure AD.
2. Select account type based on your business requirements.
3. Assign the Storage Blob Data Owner role to the service principal, which grants the service principal full access to blob data. You may assign another blob data role according to your business requirements. For details of the permissions in each built-in role, please refer to https://docs.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#storage-blob-data-owner.
Part 2: Generate an access token of the service principal for the REST API calls. https://docs.microsoft.com/en-us/rest/api/azure/#client-credentials-grant-non-interactive-clients
1. On the application's Overview page in the Azure portal, note the Application (client) ID and the Directory (tenant) ID.
2. Under Certificates & secrets, create a client secret with an expiration time.
3. To generate an access token for storage, specify the storage resource endpoint, https://storage.azure.com/. With the v2.0 token endpoint it is passed as the scope https://storage.azure.com/.default.
The document https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-client-creds-grant-flow#get-a-token shows how the token endpoint works in a common scenario.
PowerShell function example:
function Get-StorageAADAccessToken()
{
    param($TENANT_ID, $client_id, $client_secret)
    $URI="https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token" # OAuth 2.0 (v2.0) token endpoint
    $CONTENT_TYPE="application/x-www-form-urlencoded"
    # Passing the body as a hashtable lets Invoke-RestMethod URL-encode each value on POST,
    # which matters when the client secret contains characters such as + or &.
    $BODY = @{
        grant_type="client_credentials"
        client_id=$client_id
        client_secret=$client_secret
        scope="https://storage.azure.com/.default" # the v2.0 endpoint takes a scope, not a resource
    }
    $ACCESS_TOKEN = (Invoke-RestMethod -Method POST -Uri $URI -ContentType $CONTENT_TYPE -Body $BODY).access_token
    return $ACCESS_TOKEN
}
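Client secrets often contain characters such as +, & or =, which change meaning inside a form-encoded body if they are sent unescaped. As a minimal sketch of what the encoded token request body looks like, the helper below builds it by hand. The name ConvertTo-FormBody and all values are hypothetical placeholders, not part of the original script.

```powershell
# Hypothetical helper (not from the original article): builds an
# application/x-www-form-urlencoded body from an ordered dictionary.
function ConvertTo-FormBody
{
    param([System.Collections.IDictionary]$Fields)
    $pairs = foreach ($key in $Fields.Keys) {
        # EscapeDataString percent-encodes reserved characters such as + & = / :
        $name  = [System.Uri]::EscapeDataString([string]$key)
        $value = [System.Uri]::EscapeDataString([string]$Fields[$key])
        "$name=$value"
    }
    return ($pairs -join "&")
}

# Placeholder values for illustration only.
$fields = [ordered]@{
    grant_type    = "client_credentials"
    client_id     = "00000000-0000-0000-0000-000000000000"
    client_secret = "abc+def&ghi=="
    scope         = "https://storage.azure.com/.default"
}
$body = ConvertTo-FormBody -Fields $fields
```

Here the secret is sent as abc%2Bdef%26ghi%3D%3D, so the token endpoint receives it unchanged instead of treating & as a field separator.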
Part 3: Upload the file using the File System interface.
Uploading a file through the File System interface uses three APIs: Create File, Append Data, and Flush Data. All of them use the *.dfs.core.windows.net endpoint instead of the *.blob.core.windows.net endpoint.
- Create: https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create
- Update (Append & Flush): https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/update
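To make the endpoint difference concrete, here is a small sketch that composes the dfs URI shared by the three calls. The helper name Get-AdlsPathUri and the sample path "folder 1/file1" are assumptions for illustration. Each path segment is percent-encoded separately so that a space or other reserved character in a file name does not break the request.

```powershell
# Hypothetical helper (not from the original article): composes the
# *.dfs.core.windows.net URI used by Create File, Append Data, and Flush Data.
function Get-AdlsPathUri
{
    param(
        [string]$StorageAccountName,
        [string]$FileSystem,
        [string]$Path,
        [string]$Query
    )
    # Encode each segment on its own so "/" keeps its meaning as a separator.
    $segments = $Path.Trim("/").Split("/") | ForEach-Object { [System.Uri]::EscapeDataString($_) }
    $encodedPath = $segments -join "/"
    return "https://${StorageAccountName}.dfs.core.windows.net/${FileSystem}/${encodedPath}?${Query}"
}

$createUri = Get-AdlsPathUri -StorageAccountName "frankpanadls2" -FileSystem "test" -Path "folder 1/file1" -Query "resource=file"
$appendUri = Get-AdlsPathUri -StorageAccountName "frankpanadls2" -FileSystem "test" -Path "folder 1/file1" -Query "action=append&position=0"
```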
Here is the logic flow for uploading a large file in several append requests:
- The first position is 0.
- Each following position is the previous position plus the previous content length in bytes.
- We can send multiple append requests at the same time, but the position of each one has to be calculated in advance.
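The position rule above can be sketched as a small function that walks over the chunks and accumulates byte lengths. The name Get-AppendPlan is hypothetical, and the sketch assumes the request body is sent as UTF-8, so positions are counted in UTF-8 bytes rather than characters.

```powershell
# Hypothetical helper (not from the original article): computes the append
# position and byte length of each chunk before any request is sent.
function Get-AppendPlan
{
    param([string[]]$Chunks)
    $position = 0
    foreach ($chunk in $Chunks) {
        # Positions are byte offsets, so measure the encoded length, not the character count.
        $length = [System.Text.Encoding]::UTF8.GetByteCount($chunk)
        [pscustomobject]@{ Position = $position; Length = $length }
        $position += $length
    }
}

# Three short rows, each terminated by a single line feed.
$plan = @(Get-AppendPlan -Chunks "data row 1`n", "data row 22`n", "data row 333`n")

# The position to flush at is the total length: last position plus last length.
$flushPosition = $plan[-1].Position + $plan[-1].Length
```

For these three rows the plan is positions 0, 11, and 23, and the flush position is 36.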
PowerShell function examples:
1. Create File uses the Create API of the file system. By default, the destination is overwritten if the file already exists and has a broken lease.
function Create-AzureADLS2File()
{
    # STORAGE_ACCOUNT_NAME is the ADLS Gen2 account name, ROOT is the file system (container),
    # and PREFIX is the path and file name inside that file system.
    param($STORAGE_ACCOUNT_NAME, $ROOT, $PREFIX)
    $URI="https://$STORAGE_ACCOUNT_NAME.dfs.core.windows.net/"+$ROOT+"/"+$PREFIX+"?resource=file"
    $DATE = [System.DateTime]::UtcNow.ToString("R")
    # $TENANT_ID, $CLIENT_ID, and $CLIENT_SECRET are expected to be set in the calling session.
    $ACCESS_TOKEN=Get-StorageAADAccessToken -TENANT_ID $TENANT_ID -client_id $CLIENT_ID -client_secret $CLIENT_SECRET
    $HEADERS = @{
        "x-ms-date"=$DATE
        "x-ms-version"="2019-12-12"
        "authorization"="Bearer $ACCESS_TOKEN"
    }
    Invoke-RestMethod -Method PUT -Uri $URI -Headers $HEADERS
}
After creating a file with the custom PowerShell function above, as in the call below, you get a zero-size file.
Create-AzureADLS2File -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1
2. Append Data is part of the Update API of the file system. The “append” action uploads data by appending it to a file.
function Upload-AzureADLS2File()
{
    # POS is the byte offset at which BODY is appended.
    param($STORAGE_ACCOUNT_NAME, $ROOT, $PREFIX, $POS, $BODY)
    $URI="https://$STORAGE_ACCOUNT_NAME.dfs.core.windows.net/"+$ROOT+"/"+$PREFIX+"?action=append&position=$POS"
    $DATE = [System.DateTime]::UtcNow.ToString("R")
    $ACCESS_TOKEN= Get-StorageAADAccessToken -TENANT_ID $TENANT_ID -client_id $CLIENT_ID -client_secret $CLIENT_SECRET
    # Content-Length must be the length of the body in bytes; it is not set here
    # because Invoke-RestMethod derives it from -Body.
    $HEADERS = @{
        "x-ms-date"=$DATE
        "x-ms-version"="2019-12-12"
        "authorization"="Bearer $ACCESS_TOKEN"
    }
    Invoke-RestMethod -Method PATCH -Uri $URI -Headers $HEADERS -Body $BODY
}
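For the content below, we can work out the position and content length of each append request. Each row ends with a line feed, so the lengths are 11, 12, and 13 bytes and the positions are 0, 11, and 23.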
data row 1
data row 22
data row 333
Upload-AzureADLS2File -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1 -POS 0 -BODY "data row 1`n"
Upload-AzureADLS2File -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1 -POS 11 -BODY "data row 22`n"
Upload-AzureADLS2File -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1 -POS 23 -BODY "data row 333`n"
There will be no data in the file until you flush all content in the file.
3. Flush Data is part of the Update API of the file system. The “flush” action commits previously appended data to the file. This request is similar to Put Block List in the Blob Storage API, but it requires a position, which must equal the total length of the file after all data has been appended.
function Flush-AzureADLS2File()
{
    # POS is the total length of the file in bytes after all appends.
    param($STORAGE_ACCOUNT_NAME, $ROOT, $PREFIX, $POS)
    $URI="https://$STORAGE_ACCOUNT_NAME.dfs.core.windows.net/"+$ROOT+"/"+$PREFIX+"?action=flush&position=$POS"
    $DATE = [System.DateTime]::UtcNow.ToString("R")
    $ACCESS_TOKEN= Get-StorageAADAccessToken -TENANT_ID $TENANT_ID -client_id $CLIENT_ID -client_secret $CLIENT_SECRET
    $HEADERS = @{
        "x-ms-date"=$DATE
        "x-ms-version"="2019-12-12"
        "authorization"="Bearer $ACCESS_TOKEN"
        "content-length"=0 # Flush sends no request body, so Content-Length is 0
    }
    Invoke-RestMethod -Method PATCH -Uri $URI -Headers $HEADERS
}
Flush-AzureADLS2File -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1 -POS 36
After the flush, the file contains all three rows.
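The three steps can be combined into one wrapper that uploads a byte array in fixed-size chunks. This is only a sketch under stated assumptions: the name Send-AzureADLS2Bytes and the default chunk size of 4 MB are hypothetical choices, the whole content is held in memory, appends are sent one after another rather than in parallel, and the functions Create-AzureADLS2File, Upload-AzureADLS2File, and Flush-AzureADLS2File are the ones defined earlier in this post.

```powershell
# Hypothetical wrapper (not from the original article): create, append in
# chunks, then flush, using the three functions defined earlier in this post.
function Send-AzureADLS2Bytes
{
    param($STORAGE_ACCOUNT_NAME, $ROOT, $PREFIX, [byte[]]$Bytes, [int]$ChunkSize = 4MB)

    Create-AzureADLS2File -STORAGE_ACCOUNT_NAME $STORAGE_ACCOUNT_NAME -ROOT $ROOT -PREFIX $PREFIX | Out-Null

    $pos = 0
    while ($pos -lt $Bytes.Length) {
        # The last chunk may be shorter than ChunkSize.
        $count = [Math]::Min($ChunkSize, $Bytes.Length - $pos)
        $chunk = New-Object byte[] $count
        [Array]::Copy($Bytes, $pos, $chunk, 0, $count)

        Upload-AzureADLS2File -STORAGE_ACCOUNT_NAME $STORAGE_ACCOUNT_NAME -ROOT $ROOT -PREFIX $PREFIX -POS $pos -BODY $chunk | Out-Null

        # Next position = previous position + previous content length.
        $pos += $count
    }

    # Flush at the total length so that all appended data becomes visible.
    Flush-AzureADLS2File -STORAGE_ACCOUNT_NAME $STORAGE_ACCOUNT_NAME -ROOT $ROOT -PREFIX $PREFIX -POS $pos | Out-Null
    return $pos
}
```

A call might look like Send-AzureADLS2Bytes -STORAGE_ACCOUNT_NAME frankpanadls2 -ROOT test -PREFIX file1 -Bytes ([System.IO.File]::ReadAllBytes("C:\temp\file1.txt")), where the local path is a placeholder. Sending bytes rather than strings keeps the positions exact regardless of text encoding.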