Five Benefits of Azure Data Factory


With so many data integration solutions on the market today, why should your organization consider Azure Data Factory?

There are many great reasons, including:

    • Fully managed Azure service

    • Code-free and low-code transformations

    • Combination of GUI and scripting -based interfaces

    • Easy migration from SSIS

    • Consumption-based pricing

Let us explore each of these in more detail!

Fully managed Azure service

One significant challenge with traditional ETL/ELT tools is complexity of deployment. Organizations require teams of experts to carefully install, configure and maintain data integration environments. These experts could be in-house or external 3rd parties; either way, it's an expensive proposition.

Azure Data Factory is fully managed by Microsoft as part of its Azure platform. Microsoft takes care of the Azure Integration Runtime (IR) that handles data movement, it managed the Spark clusters that handle Mapping Data Flow transformations, it regularly updates the ADF developer tools and APIs and it monitors the platform 24/7 across more than 25 regions to ensure peak performance.

All these activities take place without any involvement from us as customers. This allows us to focus on delivering value to our customers through the data integration workflows themselves, instead of getting side-tracked into time-consuming platform administration and maintenance.

screenshot create data factory

Code-free and low-code transformations

The "T" in ETL and ELT stands for Transformation. Transformation is commonly the most challenging aspect of a data integration process. Many enterprises have been developing custom scripts to deal with this step. The scripts could be written in SQL, Python, C# and a multitude of other industry standard and proprietary general-purpose programming languages, or their data-centric variants.

Regardless of the language chosen, the challenge with all code-based approaches is maintenance complexity and the ease of inadvertently introducing bugs.

The former is due to developers having to be intimately familiar with the language to write efficient transformations. New developers must undergo extensive training. The onboarding process is further complicated by each developer having his/her unique coding style, a style others have to adjust to, or a style the organization can try to standardize via coding conventions. The high risk of introducing bugs is there because real-world data transformation scripts often involve modules with tens of thousands or more lines of code. Every line represents a possible source of hidden, hard-to-find and hard-to-fix bugs.

ADF has two internal transformation technologies that can be used to create transformations without a single line of code. Mapping Data Flows, based on the industry-standard Apache Spark platform, is one of these technologies. The second, currently in public preview and based on the proven Power Query / M engine, is Wrangling Data Flows,.

With either technology the ADF developer can rapidly create 100% code-free transformation or transformations that are primarily without code (aka "low-code"). Even developers who are expert programmers typically find it faster to use Mapping Data Flows or Wrangling Data Flows than to write code. Code-free and low-code transformations have been meticulously optimized by Microsoft and thus perform the same or even better than manually crafted code that accomplished the same result. Thus, the benefit here is much higher developer productivity, with corresponding bottom-line benefits to the organization.

screenshot adding data flow

Combination of GUI and scripting -based interfaces

The other challenge common with enterprise ETL/ELT platforms is that they locked their users into specific and often very proprietary tools. These tools were typically either UI-based or scripting-based, with no effective ability for the user to switch between different tools during the DevOps lifecycle. Also, the UI-based tools were typically implemented as platform-native applications (ex. Windows, Linux, Mac OS, etc.) and brought in complex installation and upgrade requirements.

With ADF, the Microsoft team took a different approach. It created a standard HTML5 development environment at https://adf.azure.com. This environment needs only a current web browser to access. Yet it maintains the look-and-feel of a traditional platform-native GUI application. For example, you can easily drag-and-drop activities and arrange them into a data integration pipeline. The interface is so impressively implemented that developers often forget they are interacting with a web browser.

A variety of other interface types are available for scripting and automation purposes. These include: PowerShell, REST API - direct and via SDKs, and Azure Resource Manager (ARM) Templates. PowerShell has been around for many years and is used on Windows, Mac OS and Linux for all types of automation. Its Azure version is even available via Azure Cloud Shell. REST API opens up many additional integration possibilities. .NET and Python can call higher-level SDKs that these APIs expose. Other platforms can consume the APIs directly. And finally, ARM Templates are perfect for declarative integration, such as for replicating consistent configurations and thus avoiding configuration drift.

GUI logos

Easy migration from SSIS

SQL Server Integration Services (SSIS) has been introduced into the SQL Server stack back in 2005. The technology has proven hugely popular over the years. Organizations have invested millions of dollars into creating SSIS data integration packages for their specific requirements.

Microsoft has taken this into account and provided a way to migrate SSIS packages into ADF, while still taking full advantage of Azure and ADF's Platform as a Service (PaaS) paradigm. In other words, with ADF one no longer needs to create and manage a SQL Server VM just to run SSIS.

How does this work? The magic ingredient is the Azure-SSIS Integration Runtime (IR).

Just like the main Azure IR, the Azure-SSIS runtime is fully operated by the Azure team. However, instead of using the new ADF technology for the data integration pipelines and activities, it runs existing SSIS packages. These packages can use both the newer SSISDB catalog deployment option (introduced in SQL Server 2012) and the original SSIS package-level deployment option. The packages can typically remain 100% the same as before, or at least stay remarkably close to this as-is state. Additionally, your existing tools such as SSDT, SSMS and dtutil.

Taking advantage of Azure-SSIS IR gives you the best of both worlds: a fully managed runtime environment on Azure, plus the ability to retain your existing SSIS packages, tools and processes.

screenshot integration runtime setup

Consumption-based pricing

The ETL/ELT tools have historically had high licensing costs. These costs usually had to be paid upfront. This might have been fine on large, fully predictable, multi-year efforts. The reality today is that change is often unpredictable, and organizations must adapt rapidly, in an agile manner, run short POCs, "fail-fast", etc.

These business and IT goals are inconsistent with large up-front license purchases. The other cost-related challenge with traditional licensing approaches is that organizations would almost always be either over-provisioned or under-provisioned. With over-provisioning they would be paying too much for licenses, just in case - ex. to handle workload peaks. With under-provisioned they would be saving money, but experiencing slow performance during workload peaks.

The utility-type consumption-based pricing in Azure is much better suited to this new business and IT operating environment. With Azure Data Factory there are zero upfront costs, while performance is always optimal.

Ongoing costs with ADF are segmented into two main areas: development and pipeline execution / runtime.

The development costs primarily consist of data factory storage. If you use Mapping Data Flows you would also be paying for debugging Spark clusters. Both ingredients are typically fairly minor, although Spark clusters costs do need a little bit of care to minimize (ADF provides controls such as TTL to facilitate this). Importantly, there is zero cost to using any ADF development tools, unlike with say Visual Studio, which has paid Professional and Enterprise editions.

Pipeline execution / runtime costs are tied to complexity and volume of data integration workflows being run. Pricing is elastic and automatically adjusts to workloads. There is no need to pre-pay, predict, estimate or otherwise tune licensing.

The ADF team created a metric that simplifies pricing calculations called Data Integration Unit (DUI). Execution costs are calculated based on the number of DUIs consumed by the pipeline(s) at runtime. Azure-SSIS runtime uses different pricing, though it also ultimately ties to complexity/unit-of-time concepts.

The result is transparent, easy to understand pricing that minimizes up-front license purchase risk and dynamically adapts to your workload requirements.

screenshot azure data factory

Written by Aaron Kraus