Data Validation at Scale with Azure Synapse


Overview


In the world of Artificial Intelligence and Machine Learning, data quality is paramount to ensuring models and algorithms perform correctly. By leveraging the power of Spark on Azure Synapse, you can perform detailed data validation at tremendous scale for your data science workloads.


high-level-arch-hddata.jpg


 


What is Azure Synapse?


Azure Synapse is a data analytics service that provides tools for end-to-end processing of data within Azure. The Azure Synapse Studio provides an interface for developing and deploying data extraction, transformation, and loading workflows within your environment. All of these workflows are built on scalable cloud infrastructure and can handle tremendous amounts of data if needed. For data validation within Azure Synapse, we will be using Apache Spark as the processing engine. Apache Spark is an industry-standard tool that has been integrated into Azure Synapse in the form of a Spark pool: an on-demand Spark engine that can be used to perform complex processing of your data.


 


Pre-requisites


Without getting into too much detail, the main requirements for running this code are an Azure Synapse workspace and a data set, loaded into Azure Storage, on which you would like to perform some validations. The technique shown here provides a starting point for performing these types of data validations within your own use case.


 


For demonstration purposes, I have loaded a data set of hard drive sensor data (link below) to an Azure Storage account and linked the storage account in Synapse (https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started).


 


hddata.jpg


I have also set up a Spark pool (link) for developing PySpark notebooks within Azure Synapse. I can read the data by connecting to the data source and using some simple PySpark commands:

from pyspark.sql.functions import input_file_name

# Read every CSV file under the path and tag each row with the file it came from
df = spark.read.load(
    'abfss://[container]@[storage_account].dfs.core.windows.net/[path]/*.csv',
    format='csv', header=True
).withColumn('file', input_file_name())

display(df)

hddata1.jpg


 


One key part to take note of is the command .withColumn('file', input_file_name()), which adds a column named file containing the name of the source data file. This is very helpful when trying to trace malformed rows back to the file they came from.
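For example, here is a minimal sketch (assuming a required column such as serial_number, which exists in this hard drive data set) of using that file column to see which files contain rows with missing values:

from pyspark.sql.functions import col

# Hypothetical check: find rows missing a required value and count them per source file
bad_rows = df.filter(col('serial_number').isNull())
display(bad_rows.groupBy('file').count())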


 


Let’s start validating


After loading the data, you can begin by calculating some high-level statistics for simple validations.

# Compute count, mean, stddev, min, and max for every column
df_desc = df.describe()
display(df_desc)

hddata2.jpg


 


The describe() function calculates simple statistics (count, mean, standard deviation, min, max) that can be compared across data sets to make sure values are in the expected range. This is a built-in function that can be used on any DataFrame. You can save these data sets back to your data lake for downstream processes using:

# Persist the summary statistics to the data lake for downstream processes
df_desc.write.parquet("[path][file].parquet")

But we can get much more complex by using Spark queries to further validate the data.

We can count how many rows were contained in each file using this code:

# Count the rows contributed by each source file (this data set contains one file per date)
validation_count_by_date = df.groupBy('file', 'date').count()
display(validation_count_by_date)

hddata3.jpg


 


This count can be useful in ensuring each file contains a complete dataset. If the row counts are outside of the expected range, it may mean that a file is either incomplete or contains excess data. A minimal sketch of such a check is shown below.
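In this sketch, the expected_min and expected_max bounds are hypothetical values; in practice you would derive them from your business rules or historical row counts.

from pyspark.sql.functions import col

# Hypothetical bounds for how many rows a complete daily file should contain
expected_min, expected_max = 100000, 200000

# Flag files whose row counts fall outside the expected range
suspect_files = validation_count_by_date.filter(
    (col('count') < expected_min) | (col('count') > expected_max)
)
display(suspect_files)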


 


Furthermore, we can even look at the specific values within the data at a row-level granularity. The code below uses the named columns "file" and "date" to group missing values across all files in your data set. Missing values are defined as null or NaN values in the dataset.

from pyspark.sql.functions import isnan, when, count, col

# Count NaN/null values per data column; for the grouping columns ('file', 'date'), count total rows instead
cols = [count(when(isnan(c) | col(c).isNull(), c)).alias(c) if c not in ['file', 'date']
        else count(col(c)).alias('count_{}'.format(c)) for c in df.columns]
missing_by_file = df.groupBy('file', 'date').agg(*cols)

display(missing_by_file)

hddata4.jpg


 


 


Don't be intimidated by the above code. I am essentially using a Python list comprehension to generate a list of columns, cols, and then using the * (star) operator to pass those columns to the agg function. I encourage you to try running these commands on your data to become familiar with the power of Python.
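To make that expansion concrete, here is a hypothetical, hand-unrolled equivalent for just two of the columns (model as a data column and file as a grouping column); the generated code simply produces one such expression per column:

from pyspark.sql.functions import isnan, when, count, col

# Hand-written equivalent of the generated aggregation for two example columns
missing_two_cols = df.groupBy('file', 'date').agg(
    count(when(isnan('model') | col('model').isNull(), 'model')).alias('model'),
    count(col('file')).alias('count_file'),
)
display(missing_two_cols)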


 


We can go a step further by calculating the ratio of missing values for each column within a file:

# Divide each column's missing-value count by the total row count for that file
missing_by_file_pct = missing_by_file.select(
    ['date', 'file', *[(col(c) / col('count_file')).alias('{}_pct'.format(c)) for c in df.columns]]
)
display(missing_by_file_pct)

hddata5.jpg


 


Now, instead of looking at total values, we can identify incomplete data by looking at the percentage of missing values in each file. Again, this technique, accompanied by business rules, can be very powerful in validating raw data sources and ensuring data quality in a scalable way. A sketch of one such business rule follows.
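This is a minimal sketch of applying a threshold rule to the missing-value percentages; the 10% threshold is a hypothetical value that you would replace with your own business rule.

from functools import reduce
from pyspark.sql.functions import col

# Hypothetical rule: flag any file where more than 10% of any column's values are missing
threshold = 0.10
pct_cols = [c for c in missing_by_file_pct.columns if c.endswith('_pct')]
over_threshold = reduce(lambda a, b: a | b, [col(c) > threshold for c in pct_cols])

flagged_files = missing_by_file_pct.filter(over_threshold)
display(flagged_files)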


 


Up until this point, I have only explicitly named the columns "file" and "date"; all other columns have been derived from the source file header/schema, so there is no need to maintain column lists as part of the validation rules. If you do require checking for specific values within a file, you can easily extend these examples, for instance:

from pyspark.sql.functions import countDistinct

# Count how many distinct hard drive models appear in each file
validation_modelcount_by_date = df.groupBy('file', 'date').agg(countDistinct('model'))
display(validation_modelcount_by_date)

hddata6.jpg


The query above counts the distinct values of the column "model" that are present in each file. As you can see, the count of distinct models varies slightly from file to file, so we can establish an acceptable range fairly easily by looking at this data; a sketch of such a range check is shown below.
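In this sketch, the min_models and max_models bounds are hypothetical; in practice you would derive them from the historical counts observed above.

from pyspark.sql.functions import countDistinct, col

# Hypothetical acceptable range for the number of distinct models per daily file
min_models, max_models = 50, 80

model_counts = df.groupBy('file', 'date').agg(countDistinct('model').alias('distinct_models'))
out_of_range = model_counts.filter(
    (col('distinct_models') < min_models) | (col('distinct_models') > max_models)
)
display(out_of_range)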


 


How can I use this in my processes?


Going back to the high-level architecture diagram shown at the beginning of the blog, this technique can be applied during the ingestion of your data to certify whether the raw files are qualified to be in the data lake. By using Synapse Spark, we can perform the row-level checks quickly and efficiently and output the results back into the data lake. Downstream processes such as machine learning models and/or business applications can then read the validation data to determine whether or not to use the raw data, without having to re-validate it. With Azure Synapse and Spark, you can perform powerful validations on very large data sources with minimal coding effort. As a final illustration, the sketch below writes a simple pass/fail validation result back to the data lake.
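This sketch reuses the hypothetical expected_min/expected_max row-count bounds from the earlier sketch; the output path is a placeholder, and the pass/fail rule should be replaced with your own business logic.

from pyspark.sql.functions import col, when

# Hypothetical pass/fail flag based on the earlier row-count check
validated = validation_count_by_date.withColumn(
    'is_valid',
    when((col('count') >= expected_min) & (col('count') <= expected_max), True).otherwise(False)
)

# Persist the validation results so downstream consumers can check them before using the raw files
validated.write.mode('overwrite').parquet(
    'abfss://[container]@[storage_account].dfs.core.windows.net/[path]/validation/row_counts'
)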


 


Data reference: Backblaze Hard Drive Stats


Github Link: AzureSynapseDataValidation/Validation-Sample.ipynb

The January 15th Weekly Roundup is Posted!


Surface Pro 7+ brings next level performance and versatility to the enterprise


 


It's now easier for you to store, sync, and share large files in Microsoft 365. This latest improvement increases the upload file size limit for Microsoft 365 from 100 GB to 250 GB, which includes uploads of files into SharePoint, Teams, and OneDrive.


 


News for Frontline workers
Push tasks from corporate to frontline workers with task publishing in Teams
Empower your frontline workers with these Azure AD capabilities that just went GA
Approvals in Microsoft Teams, Now Generally Available


 


Riny van Eekelen is the Member of the Week. Riny is very active in the Excel discussion space, starting topics and replying to members with help.


 


View the Weekly Roundup for Jan 11-15th in Sway and in the attached PDF document.


 


https://sway.office.com/s/E2obAPPpUN0nAblc/embed

NSA Releases Guidance on Encrypted DNS in Enterprise Environments  


The National Security Agency (NSA) has released an information sheet with guidance on adopting encrypted Domain Name System (DNS) over Hypertext Transfer Protocol over Transport Layer Security (HTTPS), referred to as DNS over HTTPS (DoH). When configured appropriately, strong enterprise DNS controls can help prevent many initial access, command and control, and exfiltration techniques used by threat actors.

CISA encourages enterprise owners and administrators to review the NSA Info Sheet: Adopting Encrypted DNS in Enterprise Environments and consider implementing the recommendations to enhance DNS security.

Azure Governance for ISVs


permalink: https://aka.ms/FTAISVGovernance


 


Microsoft Azure includes many governance features that can help an independent software vendor (ISV) with resource and subscription management. In this set of three video presentations, I walk through the available options and features.


 


Although these videos do not cover any ISV-only features, they are ISV-focused; while the content would certainly be helpful even if you are not an ISV, be aware that the ISV perspective is the lens through which I present it.


 


These videos include an English closed-caption file, and are best viewed full-screen.


 


Please let us know in the comments what you think, and if you have questions!  Also, this is the first of a set of ISV-oriented Azure videos.  If you have an ISV-specific topic that you feel would be good for a 15-20 minute video or a small set of them, let us know… we already have videos on logging and monitoring and on multi-tenant databases under development…


 


Session 01: Organizing Resources


 



[Embedded video]


 


Session 02: Policies, Initiatives, and Blueprints


 



[Embedded video]


 


Session 03: ISV Summary


 



[Embedded video]


 

Configuration Manager – Collect Client Logs


 


Hi folks! I am Naveen Kanneganti, and welcome to my blog post.


You can now send a client notification action from the Configuration Manager console to trigger a client device to upload its client logs to the site server. You have several options for accessing the collected log files; the following options are available in Resource Explorer:


 



  • Open Support Center: Launches Support Center. From Support Center you can open the logs in the log viewer.

  • View file: Opens the folder where the zip file is located with File Explorer.

  • Save: Opens a Save File dialog box.


 


Prerequisites


  The administrative user needs the Notify Resource permission on the collection.



  • The following built-in roles have this permission by default:

    • Full Administrator

    • Infrastructure Administrator




Collect client logs



  • Launch the ConfigMgr console. In the Assets and Compliance workspace, go to either the Devices or Device Collections node.


         Naveen_Kanneganti_0-1610737735456.png


 



  • Right-click a device or a device collection.


         Naveen_Kanneganti_1-1610737772112.png


 


 



  • Choose Client Diagnostics and then click Collect Client Logs.


          Naveen_Kanneganti_2-1610737804727.png


 



  • Click OK on the prompt.


         Naveen_Kanneganti_3-1610737832362.png


 



  • On the client computer, you can check Diagnostics.log under C:\Windows\CCM\Logs.


         Naveen_Kanneganti_17-1610739225459.png


 



  • On the management point, you can monitor MP_SinvCollFile.log.


         Naveen_Kanneganti_5-1610737225380.png


 



  • On the site server, you can monitor Sinvproc.log to confirm the SIC file was processed successfully, as shown below.


         Naveen_Kanneganti_6-1610737334789.png


 



  • Capture the ResID and time stamp from Sinvproc.log. You can check whether the zip file has been collected to the location: \\[site server]\SMS_CS1\inboxes\sinv.box\FileCol\[ResID]


          (ex: \\svrcm01.css.int\SMS_CS1\inboxes\sinv.box\FileCol\16777228)


          Naveen_Kanneganti_4-1610738023462.png


 


View collected Client Logs



  1.  Open Support Center


      Follow the steps below to view the collected logs using this option:



  • Launch the ConfigMgr console. In the Assets and Compliance workspace, go to Devices and right-click the device.


         Naveen_Kanneganti_5-1610738088526.png


 



  • Choose Start and then click Resource Explorer.


        Naveen_Kanneganti_6-1610738117696.png


 



  • In Resource Explorer, choose Diagnostic Files and then right-click the latest file. Click Open Support Center.


        Naveen_Kanneganti_7-1610738165891.png


 



  • Support Center opens the logs automatically, as shown below.


         Naveen_Kanneganti_8-1610738228791.png


 



  • Click any log you are interested in and click Open from the top ribbon.


         Naveen_Kanneganti_9-1610738321035.png


 



  • The log viewer automatically opens and displays the log information.


         Naveen_Kanneganti_10-1610738372352.png


 


2. View Files


     Follow the steps below to view the collected logs using this option:



  • In Resource Explorer, choose Diagnostic Files and then right-click the latest file. Click View File.


          Naveen_Kanneganti_11-1610738631103.png


 



  • The zip file opens in File Explorer.


        Naveen_Kanneganti_12-1610738661473.png


 



  • You can browse to the ccm\LogDataCollector\logs folder and view the collected logs.


         Naveen_Kanneganti_13-1610738697051.png


 


3. Save


      Follow the steps below to view the collected logs using this option:



  • In Resource Explorer, choose Diagnostic Files and then right-click the latest file. Click Save.


         Naveen_Kanneganti_14-1610738725701.png


 



  • A Save dialog box opens automatically so you can save the files to your desired location.


         Naveen_Kanneganti_15-1610738766286.png


 



  • After the logs are saved to your desired destination folder, you can browse to .\Ccm\LogDataCollector\logs and view the collected log files.


         Naveen_Kanneganti_16-1610738802591.png


 


I hope this step-by-step process helps you start using Collect Client Logs!


If you are looking to collect an extensive list of logs, registry settings, and diagnostic data from a device or collection of devices, please check https://github.com/russrimm/CollectAllLogs.


 


Best Regards


Naveen Kanneganti
Customer Engineer 
 


Microsoft Services