CASE-002 / DATA · MACHINE LEARNING

UAUBox

Choosing the data-pipeline workflow manager (Airflow vs. Luigi) that powered the subscription club's recommendation system and retention analytics.

PYTHON · ML · DATA PIPELINES · RECOMMENDATION

Role: Data Scientist
Year: 2020
Company: UAUBox
Domain: Subscription club
Stack: Python · SQL · Luigi
Published on: Medium / UAUBox

// 01_CONTEXT

Why a WMS?

UAUBox is a monthly beauty-products subscription club. Product decisions depended on answering three questions quickly: who is likely to cancel, which product combinations drive retention, and what is the next ideal box for each profile?

To scale those answers we needed a Workflow Management System: a layer that orchestrated data collection, transformation and delivery in an automated, repeatable and auditable way. Step one was picking the right tool.

Automate recurring ETL to free up analysis time.
Make reprocessing reliable when a source schema changed.
Serve as the foundation for the recommendation system to come.

Figure. The context: a Data Science pipeline for the subscription club.

// 02_FUNDAMENTALS

What a pipeline is

A pipeline is a sequence of connected steps where one task's output feeds the next. In data terms: extract, transform, load, until information turns into something actionable (a dashboard, a model, a recommendation).

When those steps grow to dozens, all depending on each other, loose scripts don't scale. That's where the DAG (Directed Acyclic Graph) comes in: a directed, cycle-free graph describing who depends on whom.

Figure. ETL: extract from the source database, transform, load into the destination database.
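The extract, transform, load sequence can be sketched with nothing but Python's standard library. This is a minimal illustration, not UAUBox's pipeline: the `subscriptions` and `subscriber_features` tables, their columns, the six-month threshold and the helper `make_demo_source` are all hypothetical, and both databases are in-memory SQLite.

```python
import sqlite3


def make_demo_source():
    """Build a throwaway source database with made-up subscription rows."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE subscriptions (customer_id INTEGER, plan TEXT, months_active INTEGER)"
    )
    source.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)",
        [(1, " Monthly ", 2), (2, "ANNUAL", 12)],
    )
    source.commit()
    return source


def extract(source):
    """Extract: read raw rows from the source database."""
    return source.execute(
        "SELECT customer_id, plan, months_active FROM subscriptions"
    ).fetchall()


def transform(rows):
    """Transform: normalise plan names and flag customers active 6+ months."""
    return [
        (customer_id, plan.strip().lower(), months, int(months >= 6))
        for customer_id, plan, months in rows
    ]


def load(dest, rows):
    """Load: upsert the cleaned rows so that reruns do not duplicate data."""
    dest.execute(
        "CREATE TABLE IF NOT EXISTS subscriber_features ("
        "customer_id INTEGER PRIMARY KEY, plan TEXT, "
        "months_active INTEGER, is_long_term INTEGER)"
    )
    dest.executemany(
        "INSERT OR REPLACE INTO subscriber_features VALUES (?, ?, ?, ?)", rows
    )
    dest.commit()


def run_etl(source, dest):
    """Chain the three steps: each one's output feeds the next."""
    load(dest, transform(extract(source)))
```

Calling `run_etl(make_demo_source(), sqlite3.connect(":memory:"))` fills the destination table. Because the load step upserts on the primary key, running it twice leaves the same rows, which is the property that makes reprocessing safe.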

Figure. DAG, pizzeria analogy: each node is a task, each arrow a dependency.
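The pizzeria analogy can be written down as a dependency mapping and ordered with a topological sort. The task names below are illustrative guesses rather than the exact nodes of the original figure, and `graphlib` is a Python standard-library module (3.9+), not part of Airflow or Luigi.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
pizza_dag = {
    "make_dough": set(),
    "make_sauce": set(),
    "grate_cheese": set(),
    "assemble": {"make_dough", "make_sauce", "grate_cheese"},
    "bake": {"assemble"},
    "serve": {"bake"},
}


def execution_order(dag):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def has_cycle(dag):
    """A graph with a cycle has no valid order, so it is not a DAG."""
    try:
        execution_order(dag)
        return False
    except CycleError:
        return True
```

The three preparation tasks may come out in any order because nothing links them to each other, but `assemble`, `bake` and `serve` always come last, in that sequence. A workflow manager uses that same information to decide what can run in parallel and what must wait.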

// 03_TOOLS

Airflow and Luigi, side by side

Two of the most mature Python-based pipeline tools entered the comparison.

Airflow was born at Airbnb. It uses operators (BashOperator, PythonOperator, etc.) and orchestrates flow by task completion status. It is strong on UI, on its native scheduler and on ecosystem.
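To make "orchestration by task completion status" concrete, here is a toy scheduler in plain Python. It is not Airflow code and does not use Airflow's API: `run_by_status`, the task names and the state strings are hypothetical stand-ins for the idea that a task starts only once every upstream task has reported success.

```python
def run_by_status(tasks, upstream):
    """Run tasks in dependency order, tracking a status per task.

    tasks:    dict of name -> zero-argument callable
    upstream: dict of name -> list of task names that must succeed first
    """
    status = {name: "pending" for name in tasks}
    progressed = True
    while progressed:
        progressed = False
        for name, fn in tasks.items():
            if status[name] != "pending":
                continue
            states = [status[u] for u in upstream.get(name, [])]
            if any(s in ("failed", "upstream_failed") for s in states):
                # An upstream task failed, so this one is never attempted.
                status[name] = "upstream_failed"
                progressed = True
            elif all(s == "success" for s in states):
                # Every upstream task succeeded: run and record the outcome.
                try:
                    fn()
                    status[name] = "success"
                except Exception:
                    status[name] = "failed"
                progressed = True
    return status
```

Nothing here inspects what a task produced. The scheduler only looks at the recorded states, which is the contrast with a data-dependency model.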

Luigi was born at Spotify, to power their massive music-recommendation pipelines. It uses classes that inherit from `luigi.Task` with three core methods: `requires()`, `run()` and `output()`. Here the dependency is data, not status: task B needs the file produced by A.
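The data-dependency idea can be sketched without installing Luigi. The classes below are a toy model, not Luigi's implementation: `ToyTask`, `ExtractOrders`, `CountOrders` and `build` are hypothetical names, the file contents are made up, and only the method names `requires()`, `run()` and `output()` mirror the ones described above. A task counts as done when its output file exists, so a rerun skips work that is already on disk.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a task defined by what it needs and what it produces."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks whose outputs this task reads

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # logic that writes output()

    def complete(self):
        return os.path.exists(self.output())


class ExtractOrders(ToyTask):
    def output(self):
        return os.path.join(self.workdir, "orders.csv")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("customer_id,box\n1,spring\n2,summer\n")


class CountOrders(ToyTask):
    def requires(self):
        return [ExtractOrders(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "order_count.txt")

    def run(self):
        # Task B reads the file produced by task A.
        with open(self.requires()[0].output()) as f:
            n_orders = sum(1 for _ in f) - 1  # subtract the header line
        with open(self.output(), "w") as f:
            f.write(str(n_orders))


def build(task, executed=None):
    """Run `task` and any missing dependencies; return the names actually run."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(CountOrders(workdir)))  # both tasks run
        print(build(CountOrders(workdir)))  # nothing runs: outputs exist
```

Deleting only `order_count.txt` and building again reruns `CountOrders` alone, because the upstream file is still there. That is the behaviour that makes a pipeline readable as "which file comes from where".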

Airflow. Status-based orchestration, built-in scheduler, rich UI.
Luigi. Data-dependency orchestration (input/output), code reads more linearly.
Both open source, both Python, both with active communities.

Luigi. Born at Spotify for recommendation pipelines.

// 04_COMPARISON

What weighed in the decision

The comparison was hands-on. I implemented the same simple flow in both tools to feel the learning curve, code readability and maintenance cost.

External API integration. Mainly with the club's data sources; Luigi got there with less ceremony.
Readability. Luigi's input/output model makes the pipeline obvious to anyone who didn't write the code.
Reuse. Luigi tasks become pluggable building blocks naturally.
Scheduler and UI. Airflow wins here. But for our team size, cron + logs covered it.
Fit to use case. Luigi was built to feed a recommendation system. That was exactly where our pipeline was headed.
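For the "cron + logs" setup mentioned in the list above, a cron-invoked entry point might look roughly like this. It is a sketch under assumptions, not UAUBox's actual script: `main`, the `run_pipeline` callable, the log format and the crontab line in the comment are all hypothetical.

```python
import logging
import sys

logger = logging.getLogger("pipeline")


def main(run_pipeline, log_path=None):
    """Run the pipeline once and return an exit code that cron can act on."""
    # With log_path=None, records go to stderr; pass a path to write a file.
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    logger.info("pipeline started")
    try:
        run_pipeline()
    except Exception:
        # logger.exception records the full traceback alongside the message.
        logger.exception("pipeline failed")
        return 1
    logger.info("pipeline finished")
    return 0


# Hypothetical crontab entry, running daily at 03:00:
# 0 3 * * * /usr/bin/python3 /path/to/run_pipeline.py
#
# In that script, the last line would be something like:
# sys.exit(main(my_pipeline_function, log_path="/path/to/pipeline.log"))
```

A non-zero exit code plus a traceback in the log is what replaces a scheduler UI here: small enough for a small team, at the cost of no retries or visual history.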

Figure. Summary table: Airflow vs. Luigi, criterion by criterion.

// 05_CHOICE

Why UAUBox went with Luigi

We chose Luigi. Three reasons, in order of weight:

1. The data-dependency model made the pipeline more readable for the team and made task reuse in future projects easier.

2. Integration with our APIs ended up more direct.

3. The use case mirrored Spotify's: feeding a recommendation system. Adopting the tool built for that context lowered friction.

// 06_OUTCOME

What this unlocked

With the pipeline orchestrated by Luigi, the data team stopped spending time on manual execution and started investing it in analysis and modeling.

On top of that base we built the club's recommendation system in Python with Pandas, NumPy and SQL, using supervised learning to suggest the next ideal box per customer profile. The WMS choice was a small decision in scope, but a decisive one for the team's pace.

Reproducible pipelines with explicit dependencies.
Analysis time freed up for retention and churn questions.
A foundation ready for the recommendation system that followed.

// 07_WRITE-UP

The full article

I wrote the full comparison for UAUBox's Medium publication, with Airflow and Luigi code samples, the DAG analogy and the rationale behind the choice. This case is the executive cut; the technical walkthrough lives in the original article.

// STACK

Language

Python · SQL

Libraries

Pandas · NumPy · Scikit-learn

Pipelines

Luigi · Airflow (evaluated)

Workflow

Jupyter · Git · Medium
