Apache Airflow connector

Pull your Apache Airflow scheduler state into the same warehouse as the pipelines it runs.

Data Panda lifts DAG runs, task instances, SLA misses, connection inventory and operator logs out of Airflow into your warehouse next to the data those pipelines load. From one place we feed dashboards, automations, AI workflows and internal apps that finally answer which DAGs your team relies on day to day.

About Apache Airflow

Where data teams schedule, run and monitor their pipelines.

Apache Airflow is the open-source platform data teams use to schedule, run and monitor their pipelines. Engineers write each pipeline as a Python DAG (a graph of tasks with dependencies, retries and timing rules), Airflow schedules it, runs it, and keeps a record of every run, every task and every retry. The community maintains a long list of providers for the warehouses, SaaS systems and cloud services those tasks touch, so the same scheduler can load Snowflake, fire a dbt run, and ping Slack on failure.
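The paragraph above can be made concrete with a minimal sketch of such a DAG. The DAG id, schedule, tags and task bodies are illustrative, not taken from any real pipeline; this is pipeline configuration, so it only runs inside an Airflow deployment:

```python
# Illustrative sketch of a small Airflow DAG (Airflow 2.x style).
# All names here are made up for the example.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from a source system

def load():
    ...  # load it into the warehouse

with DAG(
    dag_id="nightly_elt",                # hypothetical pipeline name
    schedule="0 2 * * *",                # 02:00 nightly (Airflow 2.4+; older versions use schedule_interval)
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    tags=["team-analytics"],             # tags are what owner-based failure routing keys on
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(
        task_id="load",
        python_callable=load,
        sla=timedelta(hours=1),          # SLA set on the task that actually gates the report
    )
    t1 >> t2                             # dependency: extract before load
```

Every run, retry and SLA miss of a DAG like this is recorded in the metastore tables this page talks about.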

What teams build on Airflow ranges from nightly ELT into the data warehouse to ML training jobs, data-quality checks, report generation and customer-facing exports. With Airflow's run history pulled into your warehouse, the question of which DAG is fragile, which team owns it, and what each run is costing you becomes a dashboard, not a trawl through the web UI.

What your Apache Airflow data is for

What you get once Apache Airflow is connected.

DAG reliability you can show a steering committee

DAG runs, task instances and SLA misses land in the warehouse alongside the data the pipelines load downstream.

  • Failure rate per DAG and per owning team in one chart
  • SLA misses joined to the downstream report that needed the data
  • Top ten longest-running tasks ranked by week, not by gut feel

Failure routing that names the owner

Task failures and SLA misses fan out to the team that owns the DAG, not to a generic data-eng inbox.

  • Slack ping on failure routed by DAG tag, not by global webhook
  • Repeated failures escalated to the owning team's CRM record
  • Stale connection alerts before a credential silently expires

Pattern detection on the run history

AI features sit on the warehouse history of the dag_run and task_instance tables, not on the live web UI.

  • Anomaly detection on task duration trends per operator
  • Natural-language questions on which DAGs slipped this month
  • Run-cost forecasts based on actual scheduler and worker history
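The first bullet needs nothing more exotic than a z-score over historical task durations. A minimal sketch; the 3-sigma threshold and the sample durations are illustrative, not a fixed rule:

```python
# Minimal duration-anomaly check over task_instance run times (seconds).
# The data and the 3-sigma threshold are illustrative.
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than z_threshold standard
    deviations above the mean of the historical durations."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return (latest - mu) / sigma > z_threshold

# A task that usually takes ~5 minutes suddenly takes 25:
durations = [290.0, 310.0, 305.0, 295.0, 300.0]
print(is_anomalous(durations, 1500.0))  # → True
print(is_anomalous(durations, 315.0))   # → False
```

In practice the history would come from the warehouse copy of task_instance, grouped per task and per operator.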

An ownership view nobody else built

Internal apps map every DAG to a team, an SLA and a downstream consumer.

  • DAG catalogue with owner, schedule, SLA and last successful run
  • Connection and variable inventory with last-used timestamp
  • On-call dashboard pulled from real Airflow state, not a wiki
Use cases

Use cases we deliver with Apache Airflow data.

A list of concrete reports, automations and AI features we have built on Apache Airflow data. Pick the one that matches your situation.

  • DAG reliability board: failure rate, retry count and SLA misses per DAG and per owning team.
  • Owner routing: route task failures to the right Slack channel and CRM owner by DAG tag.
  • SLA-miss to downstream: tie every SLA miss to the report or app that needed the data on time.
  • Connection inventory: surface stale connections and credentials before they expire silently.
  • Cost per DAG run: worker time and cloud cost attributed to the DAG that consumed it.
  • Long-task hotlist: weekly ranking of the longest task instances per operator.
  • Self-hosted to Astro: move from self-hosted Airflow to Astronomer Astro on the same warehouse.
  • MWAA or Cloud Composer: same observability layer whether the scheduler runs on AWS or GCP.
  • ELT into Snowflake: Snowflake loads scheduled in Airflow, monitored from the same warehouse.
  • ELT into BigQuery: BigQuery loads orchestrated in Airflow alongside GA4 and Ads exports.
  • On-call rotation: on-call dashboard pulled from live scheduler state instead of a wiki.
  • DAG ownership audit: every DAG mapped to a team, an SLA and a downstream consumer.
Real business questions

Answers you will finally get.

Which of our Airflow DAGs broke more than twice this quarter, and who owns them?

The dag_run and task_instance tables hold the answer, but nobody on the team has time to query the metastore directly. Pull both into the warehouse, join on DAG tag and the team-ownership map you maintain in your CRM or HR tool, and the next quarterly review opens with a one-page list. Most BE/NL teams find that ten DAGs account for half the failures, and that two of them are not even run by the team listed in the wiki.
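That one-page list is a single query once dag_run is in the warehouse. A sketch against an in-memory SQLite stand-in; the rows and the owner map are made up, only the column names follow Airflow's metastore:

```python
# Sketch: count failed runs per DAG this quarter and attach an owner.
# In-memory SQLite stands in for the warehouse copy of dag_run.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, state TEXT, start_date TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("nightly_elt", "failed", "2025-01-03"),
        ("nightly_elt", "failed", "2025-02-11"),
        ("nightly_elt", "failed", "2025-03-02"),
        ("customer_360", "success", "2025-01-05"),
        ("customer_360", "failed", "2025-02-20"),
    ],
)

# Team-ownership map maintained outside Airflow (CRM, HR tool, wiki export).
owners = {"nightly_elt": "analytics", "customer_360": "data-platform"}

rows = conn.execute(
    """
    SELECT dag_id, COUNT(*) AS failures
    FROM dag_run
    WHERE state = 'failed'
      AND start_date >= '2025-01-01' AND start_date < '2025-04-01'
    GROUP BY dag_id
    HAVING COUNT(*) > 2        -- "broke more than twice this quarter"
    ORDER BY failures DESC
    """
).fetchall()

for dag_id, failures in rows:
    print(dag_id, failures, owners.get(dag_id, "unowned"))
# → nightly_elt 3 analytics
```

The same shape of query runs unchanged once the tables land in Snowflake or BigQuery.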

Why does our morning report still arrive late even though Airflow says every DAG succeeded?

Almost always because the SLA is set on the wrong task in a long DAG, or because a sensor is waiting on an upstream that succeeded after the report's window. Land the sla_miss table in the warehouse and join it to the downstream report or app that consumed the data. The conversation moves from Airflow being the suspect to the actual upstream owner being named.

Should we move from self-hosted Airflow to Astro, MWAA or Cloud Composer?

Self-hosted on Kubernetes is the cheapest sticker price and the most expensive in operations time once the Airflow version drifts more than two minor releases behind upstream. Astronomer Astro removes the upgrade and scaling work and is run by the company that ships most of the upstream commits. MWAA and Cloud Composer are good fits if you are already on AWS or GCP and want one bill. The warehouse view we build works the same on all four paths, so the choice becomes about operations time, not about your data.

Value for everyone in the organisation

Where each function gets value.

For finance leaders

Finance gets a cost-per-DAG-run view that ties Airflow worker time and warehouse credit cost back to the team that scheduled the pipeline. Cloud bills stop being a single line on the IT P&L and become a number you can attribute to a product line.

For sales leaders

Sales and CS leads get notified when a pipeline that feeds the customer 360 view fails or slips an SLA, before they walk into the QBR with stale numbers. The notification names the owning data team and the downstream report instead of pasting a generic Airflow link.

For operations

Data and platform leads get a DAG ownership board, a connection inventory with last-used timestamps, and an on-call dashboard pulled from the live scheduler. The hand-off when an engineer leaves stops being a wiki page that nobody has updated since 2023.

Ideas

What you can automate with Apache Airflow.

Pair with Snowflake

Airflow DAG and Snowflake load logs side by side

DAG runs, task durations and Snowflake query history land in the same warehouse so you can finally see whether a slow ELT was Airflow waiting or Snowflake scanning. The warehouse-cost-per-DAG view replaces a monthly argument between the data team and finance.

Pair with BigQuery

Airflow run history joined to BigQuery slot usage

When BigQuery slot consumption spikes on a Tuesday morning, the warehouse view shows which DAG fired which query, on which dataset, with how much scan. Slot-cost attribution moves from a guess to a column on the DAG ownership table.

Pair with Slack

DAG-failure pings routed to the owning team channel

Task failures and SLA misses fan out to the right Slack channel by DAG tag instead of into a generic data-engineering firehose. The on-call engineer sees their failures, the analytics team sees theirs, and the wrong-team noise stops.
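The routing rule itself is small. A sketch with an illustrative tag-to-channel map; in a real deployment this logic would run inside an Airflow on_failure_callback that posts to the chosen webhook:

```python
# Sketch: pick a Slack channel from a DAG's tags instead of one global
# webhook. The tag-to-channel map and the fallback channel are illustrative.
def route_failure(dag_tags: list[str], channel_map: dict[str, str],
                  fallback: str = "#data-eng-firehose") -> str:
    """Return the first matching team channel, else the fallback."""
    for tag in dag_tags:
        if tag in channel_map:
            return channel_map[tag]
    return fallback

channels = {
    "team-analytics": "#analytics-oncall",
    "team-platform": "#platform-oncall",
}

print(route_failure(["nightly", "team-analytics"], channels))  # → #analytics-oncall
print(route_failure(["adhoc"], channels))                      # → #data-eng-firehose
```

The fallback channel is what keeps an untagged DAG from failing silently while the tagging convention rolls out.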

Pair with Salesforce

Repeated DAG failures escalated to the data-product owner

When the same DAG fails three times in a quarter on a customer-facing data product, the warehouse view opens or updates a case on the data-product owner in Salesforce. The conversation about that pipeline becomes a tracked record instead of a Slack thread that scrolls away.

Your existing tools

Your data lands in a warehouse. Your BI tools read from it.

You keep the reporting tool you already have. We connect it to the warehouse where your Apache Airflow data lives.

  • Power BI (Microsoft)
  • Microsoft Fabric (Microsoft)
  • Snowflake (data warehouse)
  • BigQuery (Google)
  • Tableau (visualisation)
  • Excel (sheets & pivots)
Three steps

From Apache Airflow to answers in three steps.

01

Connect securely

OAuth authentication. Read-only by default. We sign a DPA and your admin keeps the keys.

02

Land in your warehouse

Data flows into your warehouse on your schedule. Near real time or nightly, your call. You own the data.

03

Reporting, automation, AI

We build the first dashboard, workflow or AI feature with you, then hand over the keys. Or we stay on for ongoing delivery.

Two ways to work with us

Pick the track that fits how you work.

Track 01

Self-serve

We set up the foundation. Your team builds on top.

  • Apache Airflow connector configured and running
  • Warehouse set up in your cloud account
  • Clean access for your Power BI, Fabric or Tableau team
  • Documentation on what's in the data model
  • Sync monitoring so you're warned before reports break

Best fit Teams that already have a BI analyst or data engineer and want to own the build.

Track 02

Done for you

We build the whole thing, end to end.

  • Everything in Self-serve
  • Dashboards built to the questions your team actually asks
  • Automations between your systems
  • AI workflows scoped to real tasks your team runs
  • Custom apps where a dashboard does not cut it
  • Ongoing delivery at a pace that fits your team

Best fit Teams without in-house BI or dev capacity. You tell us what you need and we deliver it.

Before you book

Frequently asked questions.

Who owns the data?

You do. It lands in your warehouse, on your cloud account. We don't resell or aggregate it. If you stop working with us, the warehouse stays yours and keeps running.

How fresh is the data?

Near real time for most operational systems. For heavier sources we schedule hourly or nightly. You pick based on what the reports need.

Do I need a warehouse already?

No. If you don't have one, we help you pick one and set it up as part of the first delivery. Common starting points are Snowflake, Microsoft Fabric, or a small Postgres instance.

Does this work on self-hosted Airflow or do we need Astronomer Astro, MWAA or Cloud Composer?

All four. The Airflow metastore exposes the same dag_run, task_instance, sla_miss, connections and variables tables whether the scheduler runs on your own Kubernetes, on Astronomer Astro, on Amazon MWAA or on Google Cloud Composer. The warehouse view we build reads from that metastore, so the observability layer stays identical. The choice between self-hosted and managed becomes a question about who carries the upgrade and scaling work, not about your data.

Our self-hosted Airflow is two minor versions behind. Does that matter for what you build?

For the warehouse view, the schema we read from has been stable across the Airflow 2.x line and into Airflow 3, so a version-skewed instance still feeds the same dashboards. For the team itself it matters: a self-hosted Airflow more than two minor versions behind upstream is the most common reason BE/NL data teams move to Astronomer Astro or to a managed cloud flavour. The warehouse view is the same on either path.

Connections and variables can hold credentials. How do you handle that?

We pull the connection and variable inventory with the secret payloads excluded. The warehouse view sees connection identifiers, the connection type, the host, the schema and the last-used timestamp, but not the password or token. That is enough to flag stale or unused connections without copying secrets out of the metastore or out of your secrets backend.
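A sketch of that projection; field names mirror Airflow's connection table, the row is made up:

```python
# Sketch: copy the connection inventory while excluding secret payloads.
# Field names mirror Airflow's connection table; the row is illustrative.
SAFE_FIELDS = ("conn_id", "conn_type", "host", "schema")  # no password, no extra

def strip_secrets(connection: dict) -> dict:
    """Keep only the non-secret identifying fields of a connection row."""
    return {k: connection[k] for k in SAFE_FIELDS if k in connection}

raw = {
    "conn_id": "warehouse_prod",
    "conn_type": "postgres",
    "host": "db.internal",
    "schema": "analytics",
    "password": "s3cret",         # never leaves the metastore
    "extra": '{"token": "abc"}',  # may hold secrets too, so also excluded
}

print(strip_secrets(raw))
# → {'conn_id': 'warehouse_prod', 'conn_type': 'postgres', 'host': 'db.internal', 'schema': 'analytics'}
```

The last-used timestamp mentioned above is derived from run history rather than stored on the connection row itself.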

GDPR-compliant
Data stays in the EU
You own the warehouse

A first deliverable live in four to six weeks.

We review your Apache Airflow setup and the systems around it. Together we pick the first thing worth building.