Airflow as a Powerful Automation Tool: 7 Steps to Deploy It on the Cloud

In today’s fast-paced business world, automation is essential for increasing efficiency and ensuring that workflows are handled seamlessly. Apache Airflow has become one of the most popular tools for orchestrating and automating workflows. Whether you’re managing data pipelines, handling ETL jobs, or scheduling complex tasks across systems, Airflow provides a reliable and scalable solution.

In this blog, we’ll cover what Apache Airflow is, its origins, how to install it on Ubuntu, and how to deploy your first Airflow task on AWS.

What is Airflow?

Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It allows you to automate complex tasks, manage dependencies, and execute workflows at scale. Airflow uses Directed Acyclic Graphs (DAGs) to represent workflows as code, giving you flexibility and control over how your tasks are executed.
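
To make the idea of workflows as code concrete, here is a minimal sketch of a DAG with a single task (the DAG id, schedule, and command are illustrative placeholders, not part of any standard setup):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG: one task that prints a greeting once a day.
with DAG(
    'hello_airflow',                  # illustrative DAG id
    start_date=datetime(2024, 9, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id='say_hello',
        bash_command="echo 'Hello from Airflow!'",
    )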

Who Developed Airflow?

Airflow was originally developed at Airbnb in 2014 to manage the company’s increasingly complex workflows and data pipelines. It was open-sourced shortly afterward and joined the Apache Software Foundation’s Incubator in 2016, and it has since become one of the leading tools in data engineering and task automation.

Key Features of Airflow

  • Modular and Extensible: Easily extend Airflow with custom operators, executors, and plugins.
  • Scalable: Airflow can scale to handle thousands of tasks per day across distributed systems.
  • Open-source: Free and backed by an active community of developers.
  • Integration-friendly: Airflow integrates well with a variety of systems like Hadoop, AWS, Google Cloud, and more.

Installing Airflow on Ubuntu (Step-by-Step)

Let’s walk through installing Airflow on an Ubuntu system. This is a beginner-friendly guide that will help you get up and running quickly.

Step 1: Update Ubuntu

First, update your system to ensure all your software packages are up to date. Open your terminal and run the following commands:

sudo apt-get update
sudo apt-get upgrade

Step 2: Install Python and Pip

Airflow requires Python 3 (version 3.8 or newer for current releases) and pip. Run the following commands to install them:

sudo apt-get install python3 python3-pip

Verify the installation by checking the Python and pip versions:

python3 --version
pip3 --version

Step 3: Set Up Virtual Environment

It’s recommended to install Airflow inside a virtual environment to keep it isolated from other system dependencies.

sudo apt-get install python3-venv
python3 -m venv airflow_env
source airflow_env/bin/activate

Step 4: Install Apache Airflow

Once the virtual environment is activated, you can install Apache Airflow using pip:

pip install apache-airflow

If you need extra integrations, such as the PostgreSQL backend or the Amazon (AWS) provider, install them as pip extras. Quote the argument so your shell does not expand the square brackets:

pip install 'apache-airflow[postgres,amazon]'
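
The Airflow project also recommends pinning the install with a version-specific constraints file so that transitive dependencies stay compatible. A sketch, assuming Airflow 2.9.3 on Python 3.10 (substitute the versions you actually use):

AIRFLOW_VERSION=2.9.3
PYTHON_VERSION=3.10
pip install "apache-airflow[postgres,amazon]==${AIRFLOW_VERSION}" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"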

Step 5: Initialize Airflow

Airflow uses a metadata database to keep track of DAG definitions, runs, and task history. Initialize the database using (on Airflow 2.7 and newer, airflow db migrate is the preferred equivalent):

airflow db init
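
By default Airflow stores this metadata in a local SQLite file, which is fine for experimenting but not recommended for production. If you installed the postgres extra, you can point Airflow at PostgreSQL before initializing, for example via an environment variable (the credentials and database name below are placeholders; on Airflow versions older than 2.3 the configuration section is core instead of database):

# Assumes a PostgreSQL database named "airflow" and a matching user already exist
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"
airflow db init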

Step 6: Create a User

You’ll need to create an admin user to log in to the Airflow web UI. The command below will prompt you for a password (or you can supply one with the --password flag):

airflow users create \
    --username admin \
    --firstname Firstname \
    --lastname Lastname \
    --role Admin \
    --email admin@example.com

Step 7: Start the Airflow Web Server and Scheduler

Now start the Airflow web server and scheduler, each in its own terminal session (with the virtual environment activated in both):

airflow webserver --port 8080
airflow scheduler

You can now access the Airflow UI by navigating to http://localhost:8080 in your browser.
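
If you would rather not keep two terminals open, both commands accept a -D flag to run as background daemons, and recent Airflow releases (2.2+) also offer a single all-in-one command for local experiments:

# Run the web server and scheduler in the background
airflow webserver --port 8080 -D
airflow scheduler -D

# Or, for local development only: initializes the database, creates a user,
# and starts the web server and scheduler in one process
airflow standalone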

How to Deploy an Airflow Task on AWS

Once Airflow is installed and running locally, you can deploy your first task to AWS. In this section, we’ll set up a simple DAG that starts an EC2 instance.

Step 1: Set Up AWS Credentials in Airflow

To interact with AWS services, you need to set up your AWS credentials. You can do this by configuring the Airflow connection for AWS:

  1. Go to the Airflow web UI (http://localhost:8080).
  2. Click on Admin -> Connections.
  3. Create a new connection:
    • Conn Id: aws_default
    • Conn Type: Amazon Web Services
    • Login: Your AWS Access Key ID
    • Password: Your AWS Secret Access Key
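
If you prefer the command line, the same connection can be created with the Airflow CLI (replace the placeholder keys with your own; the region in the extras field is optional):

airflow connections add aws_default \
    --conn-type aws \
    --conn-login YOUR_AWS_ACCESS_KEY_ID \
    --conn-password YOUR_AWS_SECRET_ACCESS_KEY \
    --conn-extra '{"region_name": "us-west-2"}'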

Step 2: Create an Airflow DAG

Now create a DAG that starts an EC2 instance. Below is an example written in Python (it relies on the Amazon provider installed as an extra in Step 4):

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator

with DAG(
    'ec2_task_dag',
    default_args={'owner': 'airflow', 'start_date': datetime(2024, 9, 1)},
    schedule_interval='@daily',   # run once a day
    catchup=False,                # do not backfill runs for past dates
) as dag:

    # Start the target EC2 instance using the AWS connection configured above
    start_ec2_instance = EC2StartInstanceOperator(
        task_id='start_ec2',
        instance_id='i-xxxxxxxxxxxxxx',   # replace with your instance ID
        aws_conn_id='aws_default',
        region_name='us-west-2',
    )

Save this file (for example, as ec2_task_dag.py) inside the dags/ folder of your Airflow home directory (~/airflow/dags by default).
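
After a short delay the scheduler will pick up the new file. You can confirm the DAG was parsed and trigger a manual run from the command line (using the dag id defined above):

# The DAG should appear in the list once it has been parsed
airflow dags list

# Unpause the DAG and start a manual run
airflow dags unpause ec2_task_dag
airflow dags trigger ec2_task_dag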

Step 3: Monitor Your DAG

Once the DAG is deployed, you can monitor its execution from the Airflow UI. On each scheduled run, the task will start the specified EC2 instance in your chosen region.
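
For debugging, you can also run the single task in isolation without involving the scheduler. Note that this still calls the AWS API and will actually start the instance (the date argument is just the logical run date):

# Execute the start_ec2 task once for the given logical date;
# the result is not recorded in the metadata database
airflow tasks test ec2_task_dag start_ec2 2024-09-01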


Conclusion

Apache Airflow is a powerful tool for automating tasks and orchestrating workflows at scale. Whether you’re managing complex data pipelines or scheduling cloud-based tasks, Airflow provides the flexibility and scalability you need. By following the steps above, you’ve learned how to install Airflow on Ubuntu, create your first DAG, and deploy a task to AWS. With these skills, you’re ready to streamline your workflows and take advantage of Airflow’s robust capabilities.

Are you ready to explore data analytics and visualization
to uncover hidden insights?

Get in touch with our experts to learn how a modern tech stack can unearth hidden insights with our state-of-the-art analytics platform. Use a sophisticated, customized cloud-based dashboard to upgrade your company's technical and analytical strength.
