What it is
A cost-optimized, event-driven system that reliably captures complete Spotify listening history, enriches metadata once per entity, auto-generates weekly playlists, and provides long-term analytics via a static dashboard.
Why it matters
- Demonstrates production-ready serverless architecture patterns at personal scale.
- Implements comprehensive idempotency guarantees for retry-safe data pipelines.
- Achieves near-zero operational cost through strategic hot/cold storage design.
- Solves the real data-loss problem inherent in Spotify's recently-played endpoint, which returns only the 50 most recent plays.
How it works
- Cursor-based ingestion: Hourly Lambda fetches play history with overlap windows to prevent gaps, using conditional DynamoDB writes for deduplication.
- Hot/cold storage pattern: Recent data (7-30 days) in DynamoDB for fast queries, complete history in S3 partitioned JSONL for analytics.
- Metadata caching: One-time fetch per track/artist to minimize API calls and DynamoDB costs.
- Weekly automation: EventBridge-triggered playlist creation filters out recently played tracks using configurable lookback windows.
- Static dashboard: Nightly aggregation precomputes all analytics, serving via S3/CloudFront for zero backend query cost.
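The ingestion loop described above can be sketched in pure Python. Here, `fetch_recent`, the field names, and the in-memory `store` are illustrative stand-ins for the Spotify client and DynamoDB conditional writes, not the project's actual API:

```python
FETCH_LIMIT = 50        # Spotify's maximum page size for recently-played
OVERLAP_WINDOW = 5      # extra events re-fetched each run to guard against gaps

def ingest_once(fetch_recent, store, cursor_ms):
    """One ingestion pass: over-fetch, dedupe at write time, advance the cursor.

    fetch_recent(limit=...) returns plays oldest-first as dicts with
    'played_at_ms' and 'track_id'; `store` is a mapping standing in for
    DynamoDB conditional writes, so re-running the pass is a no-op.
    """
    written = 0
    for play in fetch_recent(limit=FETCH_LIMIT + OVERLAP_WINDOW):
        key = (play["played_at_ms"], play["track_id"])
        if key in store:
            continue  # duplicate from the overlap window; skip silently
        store[key] = play
        written += 1
        cursor_ms = max(cursor_ms, play["played_at_ms"])
    return written, cursor_ms
```

Because duplicates are filtered at write time, running the same pass twice writes nothing new, which is the retry-safety property the pipeline relies on.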
Tech
- Language: Python 3.11+
- Cloud: AWS Lambda, DynamoDB, S3, EventBridge, CloudFront
- IaC: Terraform
- Testing & quality: pytest, ruff, black
- APIs: Spotify Web API (spotipy), boto3
My role & links
- Architected complete serverless pipeline with idempotent retry guarantees.
- Implemented cursor-based ingestion with overlap-dedup strategy to prevent data loss.
- Designed hot/cold storage pattern optimizing for both query speed and long-term cost.
- Built infrastructure-as-code with Terraform including CloudWatch alarms and budgets.
- Code: Repository private (available upon request)
Overview
A production-grade personal data engineering project demonstrating serverless architecture, event-driven design, and cost-conscious cloud resource management. The system solves Spotify’s API limitation of only providing the 50 most recent plays by implementing reliable continuous ingestion with gap detection.
Technical Details
Architecture Pattern
Event-driven pipeline with four decoupled stages:
- Ingest (hourly): Fetch plays from the Spotify API with an overlap window (`DEFAULT_FETCH_LIMIT + OVERLAP_WINDOW_SIZE`), write to DynamoDB (TTL = 7-30 days) and S3 (partitioned by `dt=YYYY-MM-DD`).
- Enrich (on-demand): Cache track/artist metadata in DynamoDB; conditional writes prevent duplicate API calls.
- Aggregate (nightly): Precompute all dashboard analytics (top tracks, artists, hourly distributions, 30/7/1-day windows), write single JSON to S3.
- Playlist (weekly): Create “unplayed in last N days” playlist via set difference between source playlist and recent plays.
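The weekly playlist step reduces to a set difference. The function and argument names below are hypothetical, and the Spotify API calls that would supply the two ID lists are omitted:

```python
def unplayed_selection(source_track_ids, recent_track_ids):
    """Tracks from the source playlist not played inside the lookback window,
    preserving the source playlist's order."""
    recent = set(recent_track_ids)
    return [tid for tid in source_track_ids if tid not in recent]
```

Converting the recent plays to a set keeps the filter linear in the size of the source playlist rather than quadratic.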
Idempotency Guarantees
- Deterministic keys: `hash((played_at, track_id, context))` ensures the same event always produces the same identifier.
- Conditional writes: DynamoDB `PutItem` with an `attribute_not_exists` condition prevents duplicates on retry.
- Overlap-dedup strategy: Always fetch more data than needed and filter duplicates at write time; this eliminates gaps from stale cursors.
- State cursors: Track `last_seen_played_at` in a DynamoDB state table; atomic updates prevent race conditions.
Cost Optimization
- Hot store: DynamoDB on-demand pricing (~$1.25/million writes) with TTL auto-expiration for recent plays.
- Cold store: S3 Standard-IA for historical data (partition pruning reduces scan costs).
- No live queries: All dashboard data precomputed nightly, served as static JSON via CloudFront.
- Metadata caching: One-time fetch per entity, conditional writes prevent redundant API calls.
- Projected monthly cost: <$5 for 30,000 monthly plays (mostly storage, minimal compute).
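The under-$5 projection holds up on the write side. Only the $1.25/million figure comes from the text above; the one-write-per-play assumption and the free-tier remark are back-of-envelope estimates:

```python
def monthly_ddb_write_cost(plays: int, price_per_million_writes: float = 1.25) -> float:
    """Hot-store write cost, assuming one on-demand write request unit per play."""
    return plays * price_per_million_writes / 1_000_000

# 30,000 plays/month is roughly four cents of DynamoDB writes; at this scale
# Lambda, S3, and CloudFront sit mostly inside free tiers, leaving growing
# S3 storage as the dominant long-term line item.
```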
Infrastructure
Terraform-managed resources:
- 4 Lambda functions (128-256MB memory, 30-60s timeout)
- 4 DynamoDB tables (on-demand billing, point-in-time recovery)
- 2 S3 buckets (raw events + dashboard, versioning enabled)
- EventBridge schedules (hourly ingest, nightly aggregate, weekly playlist)
- CloudWatch alarms (error rate, duration, throttling)
- AWS Budgets ($10 threshold with email alerts)
- IAM least-privilege roles (separate per function)
- SSM Parameter Store (secure OAuth token storage)
Data Models
Pydantic-validated schemas:
- `PlayEvent`: Minimal ingestion payload (played_at, track_id, context)
- `TrackMetadata`: Cached track details (name, artists, album, duration_ms)
- `ArtistMetadata`: Cached artist details (name, genres, popularity)
- `DashboardData`: Precomputed analytics (top tracks/artists, trends, hourly distribution)
- `IngestionState`: Cursor tracking (last_cursor_unix_ms, last_run_timestamp)
Pipeline Safety Features
- Gap detection: The overlap window (5 events) raises an alert when a fetched batch shares no events with stored history, indicating plays may have been missed.
- Retry logic: Exponential backoff for Spotify API rate limits (429 responses).
- Validation gates: Schema validation via Pydantic before writes, reject malformed events early.
- Monitoring: CloudWatch alarms on error rate >5%, Lambda duration >80% timeout, DynamoDB throttles.
- Backfill capability: Scripts to recover missing daily summaries from S3 cold storage.
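The 429 retry logic might look like the following sketch. The `request` callable, its `(status, body)` return shape, and the delay values are assumptions; production code would also honor Spotify's `Retry-After` header, which this sketch omits for brevity:

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry an API call on 429 rate-limit responses with exponential backoff."""
    for attempt in range(max_retries):
        status, body = request()
        if status != 429:
            return body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limited after %d retries" % max_retries)
```

Injecting `sleep` keeps the function deterministic under test, matching the "no randomness, frozen timestamps" testing strategy described below.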
Dashboard Implementation
Zero-backend static site:
- Fetches precomputed `dashboard_data.json` once on load.
- Chart.js for visualizations (top tracks, artists, listening trends).
- Hosted on S3 with CloudFront distribution (HTTPS + caching).
- No credentials exposed (all data pre-aggregated server-side).
Testing Strategy
- Unit tests: Mock Spotify API and boto3 clients, test deduplication logic and idempotency.
- Integration tests: Validate DynamoDB conditional writes, S3 partition structure, cursor advancement.
- Deterministic fixtures: Time-based tests use frozen timestamps, no randomness.
- CI checks: `pytest -q && ruff check . && black .` runs on all commits.
Development Workflow
- Environment: `uv` for dependency management (no pip); virtual environments isolated per project.
- Code quality: ruff (fast linter), black (formatter), pytest (testing); zero tolerance for failures.
- Safe experimentation: All terminal Python uses `tmp/` scripts (deleted after use), never long inline commands.
- Git discipline: No commits without passing tests, no secrets in repository, pre-commit hooks enforce standards.
Learning Outcomes
This project demonstrates practical experience with:
- Serverless architecture: Designing stateless, event-driven pipelines that scale to zero.
- Idempotency patterns: Implementing deterministic keys, conditional writes, and overlap-dedup strategies for retry safety.
- Cost engineering: Strategic hot/cold storage, precomputed aggregates, and conditional caching to minimize cloud spend.
- Infrastructure-as-code: Managing complete AWS stack via Terraform with proper tagging, budgets, and monitoring.
- Data pipeline reliability: Gap detection, cursor management, and backfill capabilities for production-grade ingestion.
- API integration: OAuth flows, rate limit handling, and efficient metadata caching for third-party APIs.