Azure Functions vs Azure Data Factory
I won’t be dwelling extensively here on all the different services offered by Azure; anyone curious enough can find them with a simple web search. The full list of available Microsoft Azure services can be found at - Directory of Azure Cloud Services
Being a Data Scientist, my day job frequently requires extracting loads of data from multiple sources with long, never-ending Python scripts, while implementing extremely tiring and time-consuming ETL processes.
In this reading, the two services we will explore and compare extensively are:
● Azure Data Factory
● Azure Functions
Both Azure Functions and Data Factory let the user run serverless, scalable data operations against multiple sources and perform various customised data extraction tasks.
Most of the time, peers and potential clients ask me which service to use for a no-frills data ETL process that is both efficient and scalable.
The answer is not a simple one: each of these services has knacks the other lacks. Now let us dive into the descriptions and major utilities of both services.
To get started with either service, go to the links at - Azure Functions Portal, Data Factory - and follow the instructions.
Many tutorials and reading materials are available on the web on how to create and configure Azure Functions or Data Factory pipelines.
Here I won’t be discussing the setup and configuration of Azure Functions or Azure Data Factory, but rather the utilities of both.
Azure Functions is an event-driven, compute-on-demand Platform as a Service (PaaS) offering that gives the Azure platform the capability to run code or tasks triggered by events occurring in Azure, in third-party services, or in on-premises systems. Azure Functions helps developers by connecting to various data sources and messaging solutions to retrieve the status of tasks, and to process and react to such events or triggers.
Functions run on a serverless framework, so the user works with platform-provided, pre-configured operating system images, which lessens the burden of patching and maintenance. Being serverless, Functions can be auto-scaled based on the workload. Functions are sometimes the fastest way to turn an idea into a working application.
Anyone familiar with AWS will point out that Azure Functions sound and work a lot like AWS Lambda combined with CloudWatch.
Input: The input of a function can be almost anything, such as Azure Blob Storage, Cosmos DB, Microsoft Graph events, Microsoft OneDrive files, mobile apps, etc.
Output: The output of a function can be any data storage service or web application, depending on the result of the executed function, such as Azure SQL Database or Power BI.
Trigger: A trigger is an event or action that wakes up the function based on a defined set of rules, such as a pre-defined schedule using CRON expressions, changes to storage containers, messages from web apps, or HTTP requests.
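Conceptually, a trigger binds an event source to a function: when the event fires, the platform invokes every function registered for it. A minimal sketch of that binding in plain Python (the decorator and registry here are illustrative, not the Azure SDK):

```python
# Illustrative sketch of trigger -> function binding; not the Azure SDK.
from typing import Callable, Dict, List

_handlers: Dict[str, List[Callable]] = {}

def on_trigger(event_type: str):
    """Register a function to run when an event of this type arrives."""
    def register(func: Callable):
        _handlers.setdefault(event_type, []).append(func)
        return func
    return register

def fire(event_type: str, payload):
    """Simulate the platform invoking every function bound to the trigger."""
    return [handler(payload) for handler in _handlers.get(event_type, [])]

@on_trigger("blob_created")
def process_blob(blob_name: str) -> str:
    return f"processed {blob_name}"

print(fire("blob_created", "sales.csv"))  # -> ['processed sales.csv']
```

In the real service, the registration is done through bindings in the function's configuration rather than in application code.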
● They are scalable: Being serverless, a function is auto-scaled depending on the workload, which helps lower usage costs by paying only for what the user consumes.
● Wide range of triggers and connectors for third-party integrations: Functions can be integrated to work with most third-party applications and APIs.
● Open source: Being an open-source tech stack, the Functions runtime can be hosted anywhere - on Azure, in your own datacenter, or in another cloud.
● Other productivity features include deployment slots, Easy Auth, etc.
1. Scheduling the serverless sending of emails using Python or any other supported language, with CRON expressions.
2. Running SQL queries to perform table operations in databases.
These are only a few examples off the bat, but the possibilities of what we can do with Azure Functions are endless.
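As a sketch of the first example, the standard-library `sched` module can simulate a fixed-interval schedule locally and deterministically; in Azure Functions proper, the schedule would be an NCRONTAB expression on a timer trigger, and `send_email` here is a hypothetical stand-in:

```python
import sched

# Simulated clock so the schedule runs instantly and deterministically.
now = [0.0]
sent = []

def send_email(recipient: str):
    # Hypothetical stand-in for real email-sending code.
    sent.append((now[0], recipient))

scheduler = sched.scheduler(
    timefunc=lambda: now[0],
    delayfunc=lambda d: now.__setitem__(0, now[0] + d),
)

# Queue three runs, 300 simulated seconds apart (like an every-5-minutes CRON).
for i in range(3):
    scheduler.enter(300 * (i + 1), 1, send_email, argument=("team@example.com",))

scheduler.run()
print(sent)  # three runs, at simulated t = 300, 600 and 900
```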
There are basically three plans available: the Consumption Plan, the Premium Plan, and the Azure App Service Plan, with the Consumption Plan being the cheapest, since you pay only for the duration the function runs. A detailed pricing overview can be found at: Pricing - Functions
● Users are required to be proficient in coding to create and configure Functions to suit their needs.
● Azure Functions have a limit on execution time, depending on the user's subscription plan.
Azure Data Factory, the data migration service offered by Microsoft Azure, helps users build scalable, automated ETL or ELT pipelines.
To begin with, what is an ETL Pipeline?
An ETL pipeline refers to the series of processes implemented to Extract data from various data storage systems, Transform the extracted data into desired formats, and Load the resultant transformed datasets into output destinations like databases or data warehouses for analytics, reporting, etc.
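As a toy illustration of the three stages in plain Python (the data, column names, and thresholds here are all made up):

```python
import csv, io, sqlite3

# Extract: read raw records (an inline CSV string stands in for a real source).
raw = "name,revenue\nacme,1200\nglobex,950\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and derive a new column.
for row in rows:
    row["revenue"] = int(row["revenue"])
    row["high_value"] = row["revenue"] > 1000

# Load: write the transformed rows into a destination database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, revenue INTEGER, high_value INTEGER)")
db.executemany("INSERT INTO sales VALUES (:name, :revenue, :high_value)", rows)
print(db.execute("SELECT name FROM sales WHERE high_value").fetchall())  # -> [('acme',)]
```

Real pipelines differ mainly in scale: many sources, many schemas, and orchestration around failures and scheduling - which is exactly the tedium ADF aims to remove.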
While there are many ETL solutions that can run on any infrastructure, this is very much a native Azure service and easily ties into the other services Microsoft offers.
Azure Data Factory follows more of an Extract-Load (EL) and Transform-Load (TL) pattern, unlike traditional ETL processes.
Native ETL methods require the user to write code from scratch in scripting languages like Python, Perl or Bash to perform the ETL processes. These processes can become tedious, especially when dealing with large databases consisting of multiple schemas and tables.
Azure Data Factory, or ADF, lets users set up such complicated and time-consuming ETL processes while eliminating the hurdle of writing long lines of code.
The following list makes up the major structural blocks of ADF that work together to define and execute an ETL Workflow.
Connectors or Linked Services: Linked services hold the configuration settings used to connect to, and read from or write to, numerous data sources. Depending on the job, each data flow can have multiple linked services or connectors.
Datasets: Datasets contain the data source configuration at a more granular level, which can include a database schema, a table or file name, structure, etc. Each dataset refers to a certain linked service, and that linked service in turn determines the list of all possible properties of the input and output datasets.
Activities: Activities are the actions employed by the user to facilitate data movement, transformations or control flow actions.
Pipelines: Pipelines consist of multiple activities bundled together to perform a desired task. A data factory can have multiple functioning pipelines. Using pipelines makes it much easier to schedule and monitor multiple logically related activities.
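These building blocks surface directly in the JSON that ADF stores for a pipeline. A stripped-down sketch of that shape, written as a Python dict (the names are invented, and real definitions carry many more fields):

```python
# Stripped-down shape of an ADF pipeline definition with one Copy activity.
# All names are illustrative; real pipelines include more properties.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",  # a data-movement activity
                "inputs": [{"referenceName": "SourceSalesDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SinkSalesDataset",
                             "type": "DatasetReference"}],
            }
        ]
    },
}

# Each dataset reference would itself point at a linked service
# (the connection settings) in its own definition.
activity = pipeline["properties"]["activities"][0]
print(activity["inputs"][0]["referenceName"])  # -> SourceSalesDataset
```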
Different Types of Triggers Include:
● Schedule Trigger: A trigger that invokes a pipeline on a time-based schedule.
● Tumbling Window Trigger: A trigger that executes pipelines periodically over fixed-size, non-overlapping time intervals.
● Event-based Trigger: An event-based trigger executes the pipelines in response to an event, such as the arrival of a file, or the deletion of a file, in Azure Blob Storage.
The Integration Runtime (IR) infrastructure used by ADF provides data movement and compute capabilities across different network environments. The runtime types available are:
● Azure IR: Azure IR provides a fully managed, serverless compute environment in Azure and plays a major role in data movement activities in the cloud.
● Self-hosted IR: This runtime manages copy activities between data stores in public cloud networks and data stores in private networks.
● Azure-SSIS IR: The SSIS IR is required to natively execute SSIS packages.
The image below shows the relationships among the different components of ADF:
● Serverless Infrastructure: ADF runs completely within Azure as a native serverless solution. This eliminates the mundane tasks of maintaining and updating software and packages: the definitions and schedules are simply set up, and the execution is handled for you.
● Connectors: ADF supports integration with multiple third-party services through the connectors offered by Azure.
● Minimal Coding Experience: Data Factory provides the user with a UI to perform all the data mapping and transformation tasks on datasets from various inputs, in turn enabling even users with no coding background to set up complicated ETL data flows.
● Scalability: ADF allows the use of parallelism while keeping your costs to only what is used. This scaling benefits the user when time is of the essence: one server for six hours and six servers for one hour cost the same, but the latter accomplishes the task in a sixth of the time.
● Programmability: ADF has many programming functionalities, such as loops, waits, and parameters for the whole pipeline.
● Utility Computing: The user pays only for what is used; there are no idle servers costing money without producing anything, and capacity can be scaled up when or if needed.
● Supports long-running, time-consuming (>10 min) queries.
● Pipeline orchestration and execution.
● Data flow execution and debugging.
● Number of Data Factory operations, such as creating pipelines and pipeline monitoring.
More details on pricing and how to calculate costs can be found at - Pricing – Data Factory
● ADF allows programmability such as loops, waits and parameters in the pipelines, but this is no match for the flexibility offered by native ETL code written in, for example, Python.
● Even though the UI Azure offers for designing pipelines eliminates the task of coding, it takes time to become familiar with the UI.
After this extensive look at both services, let us now circle back to our main question:
What to use - Azure Functions or Azure Data Factory for Data Migration or ETL Processes?
Let’s briefly list some basic differences in the resources and computation offered by both the services:
Azure Functions:
1. Customisable with the help of code.
2. Better for short-lived tasks (< 10 min) in the case of a basic Consumption Plan.
Azure Data Factory:
1. Customisable, but not as flexible as the customisation offered by coding in a scripting language.
2. Supports long running queries or tasks.
3. Supports multiple data sources and tasks in a single pipeline.
If the data load is low or the task doesn’t consume much time, the better service to choose is Azure Functions, as the cost will be lower than setting up a pipeline.
Note that a single Azure Function running multiple times a day will incur extra costs. Otherwise, if the data to be extracted is large and time-consuming, and/or the user is not so proficient in scripting, I would suggest opting for Azure Data Factory, as it is clearly the better option of the two in that scenario. But the final choice always lies with the user, whether to choose Azure Functions or ADF, after considering one’s own requirements.
To strengthen the ADF service and overcome its limitations in customisation, Microsoft has introduced a feature that integrates Azure Functions as part of an ADF pipeline, essentially making Azure Functions a building block within ADF.
So code written in any of the supported languages can be bundled as an Azure Function and included in an ADF pipeline to perform the desired data operations. More details about the integration can be found at - Azure Data Factory with Azure Functions.
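In pipeline JSON terms, this integration appears as an Azure Function activity. A hedged sketch of its shape, again as a Python dict (the linked service and function names are invented, and the real schema carries additional fields):

```python
# Approximate shape of an Azure Function activity inside an ADF pipeline.
# All names are hypothetical; consult the ADF documentation for the full schema.
azure_function_activity = {
    "name": "TransformWithFunction",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "MyFunctionAppLinkedService",  # hypothetical linked service
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "functionName": "transform_sales",  # hypothetical function in the Function App
        "method": "POST",
        "body": {"table": "sales"},
    },
}

print(azure_function_activity["typeProperties"]["method"])  # -> POST
```

The activity calls the function over HTTP, so the function's output can feed subsequent activities in the same pipeline.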
Data Scientist - NeenOpal Analytics @Indra Teja Sadem