This article is contributed. See the original author and article here.
In a historic day, Microsoft today announced it has transitioned Azure HDInsight to the Microsoft engineered distribution of Apache Hadoop and Spark, specifically built to drastically improve the performance, improved release cadence of powerful Open Source data analytics frameworks and optimized to natively run at cloud scale in Azure. This transition will further help customers by establishing a common Open Source Analytics distribution across various Azure data services such as Azure Synapse & SQL Server Big Data Clusters.
Starting this week, customers creating Azure HDInsight clusters such as Apache Spark, Hadoop, Kafka & HBase in Azure HDInsight 4.0 will be created using Microsoft distribution of Hadoop and Spark.
As part of today’s release, we are adding following new capabilities to HDInsight 4.0
SparkCruise: Queries in production workloads and interactive data analytics are often overlapping, i.e., multiple queries share parts of the computation. These redundancies increase the processing time and total cost for the users. To reuse computations, many big data processing systems support materialized views. However, it is challenging to manually select common computations in the workload given the size and evolving nature of the query workloads. In this release, we are introducing a new Spark capability called “SparkCruise” that will significantly improve the performance of Spark SQL.
SparkCruise is an automatic computation reuse system that selects the most useful common subexpressions to materialize based on the past query workload. SparkCruise materializes these subexpressions as part of query processing, so you can continue with their query processing just as before and computation reuse is automatically applied in the background — all without any modifications to the Spark code. SparkCruise has shown to improve the overall runtime of a benchmark derived from TPC-DS benchmark queries by 30%.
SparkCruise: Automatic Computation Reuse in Apache Spark
Hive View: Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable
data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets that are stored in Hadoop. Hive View is designed to help you to author, optimize, and execute Hive queries. We are bringing Hive View natively to HDInsight 4.0 as part of this release.
Tez View: Tez is a framework for building high-performance batch and interactive data processing applications. When you run a job such as a Hive query Tez, you can use Tez View to track and debug the execution of that job. Tez View is now available in HDInsight 4.0
Frequently asked questions
What is Microsoft distribution of Hadoop & Spark (MDH)?
Microsoft engineered distribution of Apache Hadoop and Spark. Please read the motivation behind this step here
• Apache analytics projects built, delivered, and supported completely by Microsoft
• Apache projects enhanced with Microsoft’s years of experience with Cloud-Scale Big Data analytics
• Innovations by Microsoft offered back to the community
What can I do with HDInsight with MDH?
Easily run popular open-source frameworks—including Apache Hadoop, Spark, and Kafka—using Azure HDInsight, cost-effective, enterprise-grade service for open-source analytics. Effortlessly process massive amounts of data and get all the benefits of the broad open-source ecosystem with the global scale of Azure.
What versions of Apache frameworks available as part of MDH?
Component |
HDInsight 4.0 |
Apache Hadoop and YARN |
3.1.1 |
Apache Tez |
0.9.1 |
Apache Pig |
0.16.0 |
Apache Hive |
3.1.0 |
Apache Ranger |
1.1.0 |
Apache HBase |
2.1.6 |
Apache Sqoop |
1.4.7 |
Apache Oozie |
4.3.1 |
Apache Zookeeper |
3.4.6 |
Apache Phoenix |
5 |
Apache Spark |
2.4.4 |
Apache Livy |
0.5 |
Apache Kafka |
2.1.1 |
Apache Ambari |
2.7.0 |
Apache Zeppelin |
0.8.0 |
In which region Azure HDInsight with MDH is available?
HDInsight with MDH is available in all HDInsight supported regions
What version of HDInsight with MDH will map to?
HDInsight with MDH maps to HDInsight 4.0. We expect 100% compatibility with HDInsight 4.0
Do you support Azure Data Lake Store Gen 2? How about Azure Data Lake Store Gen 1?
Yes, we support storage services such as ADLS Gen 2, ADLS Gen1 and BLOB store
What happens to the existing running cluster created with the HDP distribution?
Existing clusters created with HDP distribution runs without any change.
How can I verify if my cluster is leveraging MDH?
You can verify the stack version (HDInsight 4.1) in Ambari (Ambari–>User–>Versions)
How do I get support?
Support mechanisms are not changing, customers continue to engage support channels such as Microsoft support
Is there a cost difference?
There is no cost or billing change with HDInsight with Microsoft supported distribution of Hadoop & Spark.
Is Microsoft trying to benefit from the open-source community without contributing back?
No. Azure customers are demanding the ability to build and operate analytics applications based on the most innovative open-source analytics projects on the state-of-the-art enterprise features in the Azure platform. Microsoft is committed to meeting the requirements of such customers in the best and fastest way possible. Microsoft is investing deeply in the most popular open-source projects and driving innovations into the same. Microsoft will work with the open-source community to contribute the relevant changes back to the Apache project itself.
Will customers who use the Microsoft distribution get locked into Azure or other Microsoft offerings?
The Microsoft distribution of Apache projects is optimized for the cloud and will be tested extensively to ensure that they work best on Azure. Where needed, the changes will be based on open or industry-standard APIs.
How will Microsoft decide which open-source projects to include in its distribution?
To start with, the Microsoft distribution contains the open-source projects supported in the latest version of Azure HDInsight. Additional projects will be included in the distribution based on feedback from customers, partners, and the open-source community.
Get started
- Release Notes: https://docs.microsoft.com/azure/hdinsight/hdinsight-release-notes
- SparkCruise documentation: https://docs.microsoft.com/en-us/azure/hdinsight/spark/spark-cruise
- Read this developer guide and follow the quick start guide to learn more about implementing open-source analytics pipelines on Azure HDInsight.
- Stay up-to-date on the latest Azure HDInsight news and features coming up in the near future by following us on Twitter #HDInsight and @AzureHDInsight.
- For questions and feedback, please reach out to AskHDInsight@microsoft.com.
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.
Recent Comments