UAUBox
Choosing the data-pipeline workflow manager (Airflow vs. Luigi) that powered the subscription club's recommendation system and retention analytics.
- Role: Data Scientist
- Year: 2020
- Company: UAUBox
- Domain: Subscription club
- Stack: Python · SQL · Luigi
- Published on: Medium / UAUBox
Why a WMS?
UAUBox is a monthly beauty-products subscription club. Product decisions depended on quick answers to three questions: who is likely to cancel, which product combinations drive retention, and what is the next ideal box for each profile?
To scale those answers we needed a Workflow Management System (WMS): a layer that orchestrates data collection, transformation and delivery in an automated, repeatable and auditable way. Step one was picking the right tool.
- Automate recurring ETL to free up analysis time.
- Make reprocessing reliable when a source schema changed.
- Serve as the foundation for the recommendation system to come.
What a pipeline is
A pipeline is a sequence of connected steps where one task's output feeds the next. In data terms: extract, transform, load, until information turns into something actionable (a dashboard, a model, a recommendation).
When those steps grow to dozens, all depending on each other, loose scripts don't scale. That's where the DAG (Directed Acyclic Graph) comes in: a directed, cycle-free graph describing who depends on whom.
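To make the DAG idea concrete, here is a minimal sketch using only Python's standard library. The task names are hypothetical and are not UAUBox's actual pipeline; the point is only that declaring who depends on whom is enough to derive a valid execution order, and that a cycle makes the graph unschedulable.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "churn_features": {"join_orders_customers"},
    "retention_dashboard": {"churn_features"},
}

# static_order() yields each task only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)


def has_cycle(graph):
    """Return True when the graph cannot be ordered because of a cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


# "a" waits for "b" and "b" waits for "a": nothing can run first.
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

The two extract tasks have no dependencies, so their relative order is not fixed; everything downstream is guaranteed to come after them.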
Airflow and Luigi, side by side
Two of the most mature Python-based pipeline tools entered the comparison.
Airflow was born at Airbnb. It uses operators (BashOperator, PythonOperator, etc.) and orchestrates the flow based on each task's completion status. Its strengths are the UI, the native scheduler and the ecosystem.
Luigi was born at Spotify, to power their massive music-recommendation pipelines. It uses classes that inherit from `luigi.Task` with three core methods: `requires()`, `run()` and `output()`. Here the dependency is data, not status: task B needs the file produced by A.
- Airflow. Status-based orchestration, built-in scheduler, rich UI.
- Luigi. Data-dependency orchestration (input/output), code reads more linearly.
- Both open source, both Python, both with active communities.
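The data-dependency idea can be illustrated without installing either tool. The sketch below is a toy written with the standard library only: it borrows the `requires()`, `run()` and `output()` names from Luigi, but the `Task` base class, the `build` helper, the task classes and the sample data are all assumptions made for illustration, not Luigi's implementation and not the code from the original comparison. It shows the key consequence of the model: a task counts as done when its output file exists, so a second run skips work that is already on disk.

```python
import tempfile
from pathlib import Path


class Task:
    """Toy stand-in for a data-dependency task (hypothetical, not Luigi itself)."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # upstream tasks whose output this task reads

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # writes the file returned by output()

    def complete(self):
        # "Done" means the output file exists, not that a status flag was set.
        return self.output().exists()


class ExtractOrders(Task):
    def output(self):
        return self.workdir / "orders.csv"

    def run(self):
        # Made-up sample rows standing in for a real data source.
        self.output().write_text("customer_id,box\n1,skincare\n2,makeup\n")


class CountOrders(Task):
    def requires(self):
        return [ExtractOrders(self.workdir)]

    def output(self):
        return self.workdir / "order_count.txt"

    def run(self):
        # Read the upstream file, drop the header row, count what is left.
        rows = self.requires()[0].output().read_text().splitlines()[1:]
        self.output().write_text(str(len(rows)))


def build(task, executed):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, executed)
    task.run()
    executed.append(type(task).__name__)


with tempfile.TemporaryDirectory() as workdir:
    first_run, second_run = [], []
    build(CountOrders(workdir), first_run)
    build(CountOrders(workdir), second_run)  # outputs exist, so nothing reruns
    order_count = (Path(workdir) / "order_count.txt").read_text()

print(first_run, second_run, order_count)
```

In a status-based orchestrator the equivalent question is whether the upstream task finished successfully in this run; here the question is whether the file is there, which is what makes reprocessing after a failure cheap to reason about.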
What weighed in the decision
The comparison was hands-on. I implemented the same simple flow in both tools to feel the learning curve, code readability and maintenance cost.
- External API integration, mainly the club's data sources. Luigi got there with less ceremony.
- Readability. Luigi's input/output model makes the pipeline obvious to anyone who didn't write the code.
- Reuse. Luigi tasks become pluggable building blocks naturally.
- Scheduler and UI. Airflow wins here. But for our team size, cron + logs covered it.
- Fit to use case. Luigi was built to feed a recommendation system. That was exactly where our pipeline was headed.
Why UAUBox went with Luigi
We chose Luigi. Three reasons, in order of weight:
1. The data-dependency model made the pipeline more readable for the team and made task reuse in future projects easier.
2. Integration with our APIs ended up more direct.
3. The use case mirrored Spotify's: feeding a recommendation system. Adopting the tool built for that context lowered friction.
What this unlocked
With the pipeline orchestrated by Luigi, the data team stopped spending time on manual execution and started investing it in analysis and modeling.
On top of that base we built the club's recommendation system in Python with Pandas, NumPy and SQL, using supervised learning to suggest the next ideal box per customer profile. The WMS choice was a small decision in scope, but a decisive one for the team's pace.
- Reproducible pipelines with explicit dependencies.
- Analysis time freed up for retention and churn questions.
- A foundation ready for the recommendation system that followed.
The full article
I wrote the full comparison on the UAUBox Medium publication, with Airflow and Luigi code samples, the DAG analogy and the rationale behind the choice. This case study is the executive cut; the technical walkthrough lives in the original article.