
Chess2FEN

Production computer vision system converting chessboard images to FEN notation using lightweight CNN classifiers deployed on Google Cloud Run.

December 2025 – January 2026 · Solo Developer & ML Engineer · Active
Tags: python · computer-vision · deep-learning · onnx · fastapi · google-cloud · production-ml

Technology

  • Python
  • PyTorch
  • ONNX Runtime
  • FastAPI
  • React
  • TypeScript
  • Docker
  • Google Cloud Run
  • GitHub Actions

What it is

A production-ready computer vision API that converts top-down chessboard images into FEN (Forsyth-Edwards Notation) strings using per-square CNN classification. The system achieves 100% exact-match accuracy on clean images with 7-15ms inference latency on CPU, deployed as a serverless REST API with a React frontend.

Why it matters

  • Demonstrates full ML lifecycle from data generation to production deployment at personal scale.
  • Solves real accuracy vs. efficiency tradeoffs: 9 model architectures ranging from 5.8K to 63K parameters.
  • Achieves production-grade robustness (>95% accuracy) under realistic distortions: blur, JPEG artifacts, perspective warps, lighting variations.
  • Implements proper ML systems engineering: model registry, versioned artifacts with SHA256 checksums, ONNX export for cross-platform inference, comprehensive CI/CD with 126 automated tests.

How it works

  • Preprocessing: Tiles input image into 8×8 grid with configurable margin (default: 2% crop per square to eliminate borders), resizes each to 64×64.
  • Batched inference: Processes all 64 squares simultaneously via ONNX Runtime (10x faster than per-square loops), outputs 13-class logits per square (empty, 6 white pieces, 6 black pieces).
  • Sanity checks: Validates board state (exactly one king per side, no pawns on ranks 1/8), repairs low-confidence predictions only when invariants violated.
  • Model registry: JSON-based model index tracks 5 FP32 models with metrics (accuracy, latency, size), supports precision variants (INT8 for edge deployment).
  • Deployment: Docker container (~600MB) on Cloud Run with auto-scaling (0-10 instances), rate limiting (60 req/min), comprehensive monitoring.
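As a concrete illustration of the sanity-check stage, a minimal invariant checker might look like the sketch below. The `grid` representation and function name are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of the board-state invariants (not the project's code).
# `grid` is an 8x8 list of piece symbols with rank 8 first; "" = empty square.

def violates_invariants(grid):
    """Return a list of violated invariants for a predicted board."""
    flat = [p for rank in grid for p in rank]
    problems = []
    if flat.count("K") != 1:
        problems.append("white must have exactly one king")
    if flat.count("k") != 1:
        problems.append("black must have exactly one king")
    # Ranks 8 and 1 are grid rows 0 and 7 when rank 8 is listed first.
    for row in (0, 7):
        if any(p in ("P", "p") for p in grid[row]):
            problems.append("no pawns allowed on ranks 1/8")
            break
    return problems
```

Only when this list is non-empty does the repair step consider alternative classes for low-confidence squares.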

Tech

  • ML: PyTorch for training, ONNX Runtime for inference (CPU-optimized, no CUDA dependency at runtime)
  • API: FastAPI with Gunicorn/Uvicorn workers, CORS middleware, rate limiting via in-memory IP tracking
  • Frontend: React 18 + TypeScript + Vite, Tailwind CSS, Framer Motion animations, Lighthouse performance score of 93/100
  • Infrastructure: Google Cloud Run (us-west2), Artifact Registry, Cloud Monitoring with error rate and latency alerts
  • CI/CD: GitHub Actions for tests (pytest, ruff, black), Docker build/push, automated deployment with smoke tests
  • Testing: pytest (126 tests), Playwright (15 UI tests), 100% endpoint coverage

Highlights

  • Designed and implemented 9 CNN architectures (depthwise separable, cascade, multitask) with squeeze-excite attention.
  • Built complete training infrastructure: synthetic dataset generation (10K images), augmentation pipeline (mixup, cutout, JPEG compression), early stopping on full-board FEN exact match.
  • Engineered inference pipeline with batched ONNX execution, conservative sanity repair (max 4 squares, confidence threshold 0.60), deterministic preprocessing.
  • Deployed production API on Cloud Run with monitoring, rollback procedures, cost optimization.
  • Authored 18+ technical docs covering architecture, model registry, deployment, cost reduction, rollback procedures.

Links

  • Live Demo: chess2fen-api-2qkqblvvma-wl.a.run.app
  • Source: github.com/kklike32/chess2fen

Overview

A personal ML engineering project demonstrating end-to-end ownership of a production computer vision system. Started as a proof-of-concept in December 2024 (v1.0), evolved through production deployment in December 2025 (v2.0), and currently active in UI/UX enhancement phase (v3.0). The system processes chessboard images through a per-square classification pipeline, converting visual board state to machine-readable FEN strings used by chess engines and analysis tools.

The project solves a real problem in chess digitization: accurately recognizing piece positions from photographs or screenshots without requiring specialized hardware or manual annotation. Unlike board detection approaches that require complex perspective correction, this system assumes top-down orthogonal views (common in online chess diagrams and screenshot tools) and focuses on high-accuracy piece classification with minimal inference latency.

Technical Details

Architecture Pattern

Per-square classification pipeline with four decoupled stages:

  1. Preprocessing (12ms): Load image, tile into 8×8 grid using exact square boundaries, apply 2% margin crop to exclude borders, resize each tile to 64×64 RGB. Uses PIL for image ops, NumPy for batching.

  2. Inference (7ms): Batch all 64 crops into [64,3,64,64] tensor, run through ONNX session (CPUExecutionProvider default), output [64,13] logits. Apply softmax to get per-square class probabilities. Class mapping: 0=empty, 1-6=PNBRQK (white), 7-12=pnbrqk (black).

  3. Sanity validation (0.1ms): Check invariants (exactly one K and one k, no pawns on ranks 1/8). If violated and confidence below 0.60 threshold, try top-2 class alternatives for low-confidence squares (max 4 repairs per board to prevent wild rewrites). Prefer fail-loudly over aggressive guessing.

  4. FEN generation (0.1ms): Convert 8×8 grid to FEN piece-placement string via rank-by-rank serialization, compress consecutive empty squares (e.g., “3” for three blanks).
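Stage 4 can be sketched in a few lines of pure Python; the function name and grid representation here are illustrative, not the project's actual code.

```python
# Sketch of stage 4: serialize an 8x8 grid (rank 8 first, "" = empty)
# into a FEN piece-placement string with empty-square run compression.

def grid_to_fen(grid):
    ranks = []
    for row in grid:
        fen_rank, empties = "", 0
        for piece in row:
            if piece == "":
                empties += 1                  # accumulate a run of blanks
            else:
                if empties:
                    fen_rank += str(empties)  # compress the run, e.g. "3"
                    empties = 0
                fen_rank += piece
        if empties:
            fen_rank += str(empties)
        ranks.append(fen_rank)
    return "/".join(ranks)
```

For the starting position this yields `rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR`.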

Model Zoo

9 Trained Architectures (v2.0); the 5 models in the production registry are shown below:

Architecture              Parameters   Accuracy   Robustness†   Latency   Size
dwsep_se_a075 (default)   14.7K        100.00%    99.99%        7.31ms    162KB
multitask_dw_a075         17.4K        100.00%    99.92%        7.08ms    146KB
cascade_tiny              39.2K        100.00%    99.78%        7.81ms    211KB
dwsep_se_a050             7.3K         100.00%    97.56%        6.69ms    126KB
nanoconv_d02              22K          92.69%     99.43%        6.45ms    96KB

†Robustness: Mean accuracy across 7 distortion types (GaussianBlur, JPEGCompression, GaussianNoise, BrightnessContrast, Perspective, Rotate, Clean)

Design Choices:

  • Depthwise separable convolutions: Reduces parameters vs. standard conv by 8-10x (e.g., 3×3 depthwise + 1×1 pointwise replaces 3×3 full conv).
  • Squeeze-Excite blocks: Channel attention mechanism improves accuracy +2-3% with minimal overhead (2 conv layers, <5% parameter increase).
  • Width multiplier: Scales channel counts (α=0.50, 0.75, 1.00) to explore accuracy/speed frontier. α=0.75 selected as default for best robustness-to-size ratio.
  • No residual connections: Input resolution (64×64) too small to benefit from skip connections, direct paths simpler and faster.
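The parameter savings from depthwise separable convolutions are easy to verify with back-of-the-envelope arithmetic. The channel counts below are illustrative, not the project's actual layer sizes.

```python
# Parameter-count arithmetic behind the depthwise-separable design choice.
# Channel counts are illustrative examples, not the project's layer sizes.

def standard_conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k             # one k x k filter per (in, out) pair

def dw_separable_params(c_in, c_out, k=3):
    depthwise = c_in * k * k                # one k x k filter per input channel
    pointwise = c_in * c_out                # 1x1 conv mixes channels
    return depthwise + pointwise

c_in, c_out = 32, 64
ratio = standard_conv_params(c_in, c_out) / dw_separable_params(c_in, c_out)
# standard: 18432, separable: 2336 -> roughly 8x fewer parameters
```

The ratio grows with channel width, which is consistent with the 8-10x reduction cited above.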

Training Pipeline

Synthetic Dataset Generation:

  • Base positions sourced from chess game databases (PGN files), converted to FEN.
  • Render each position via python-chess SVG export, rasterize to 512×512 PNG.
  • Generate 10,000 boards (9,000 train, 1,000 val) with diverse piece configurations.
  • Split recorded in JSON manifests (splits/train.json, splits/val.json) mapping image paths to FEN strings.
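The split-manifest step can be sketched as below; the helper name and seed are illustrative assumptions, and in the real project the resulting dicts are written to splits/train.json and splits/val.json.

```python
import random

# Sketch of building the 9,000/1,000 train/val split manifests described
# above. Helper name and seed are illustrative, not the project's code.

def make_splits(samples, val_fraction=0.1, seed=0):
    """samples: dict mapping image path -> FEN string."""
    paths = sorted(samples)
    random.Random(seed).shuffle(paths)      # deterministic, reproducible split
    n_val = int(len(paths) * val_fraction)
    val = {p: samples[p] for p in paths[:n_val]}
    train = {p: samples[p] for p in paths[n_val:]}
    return train, val

# Each manifest would then be serialized with json.dump(...) to
# splits/train.json and splits/val.json.
```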

Augmentation Strategy:

  • Geometric: RandomResizedCrop (scale 0.8-1.0), small rotation (±5°).
  • Color: ColorJitter (brightness ±0.1, contrast ±0.1, saturation ±0.1, hue ±0.05).
  • Corruption: GaussianBlur (σ=0.5-2.0, p=0.3), JPEGCompression (quality 70-95, p=0.3).
  • Regularization: Mixup (α=0.2 for smooth label mixing), label smoothing (ε=0.1).
  • Normalization: ImageNet mean/std (transfer learning priors, though models trained from scratch).
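Of these, mixup is the least self-explanatory; a minimal NumPy sketch (with α=0.2 as above, batch shapes illustrative) looks like this:

```python
import numpy as np

# Minimal mixup sketch (alpha=0.2 as in the training config). Labels are
# assumed one-hot so they can be mixed linearly; shapes are illustrative.

def mixup(x, y, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # pair each sample with another
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```

Each training example becomes a convex blend of two crops and their labels, which smooths decision boundaries between visually similar pieces.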

Training Configuration:

  • Optimizer: AdamW (lr=1e-3 → 1e-5 cosine decay, weight decay=1e-4).
  • Batch size: 128 (all 64 squares from 2 boards).
  • Loss: CrossEntropyLoss with label smoothing.
  • Early stopping: Patience=5 epochs on FEN exact match (not per-square accuracy; the product requires full-board correctness).
  • Device: MPS (Apple Silicon) with autocast FP16, fallback to CPU.
  • Typical runtime: 8-12 hours per model (50 epochs with early stop).
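The cosine decay schedule above maps cleanly to a closed form; this sketch assumes per-epoch updates, which is an illustrative simplification.

```python
import math

# Cosine decay from lr_max=1e-3 to lr_min=1e-5, matching the schedule
# described above. Per-epoch indexing is an illustrative assumption.

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-5):
    t = epoch / max(total_epochs - 1, 1)    # training progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

At epoch 0 this returns lr_max, and it glides monotonically down to lr_min at the final epoch.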

Evaluation Metrics:

  • square_acc_clean: Per-square accuracy on clean validation set (no distortions).
  • fen_exact_clean: Full-board FEN exact match (primary metric for early stopping).
  • acc_mean_dist: Mean accuracy across 7 distortion types (robustness indicator).
  • latency_cpu_ms: Batched 64-crop inference time on M4 Mac CPU (1000 runs, warmup excluded).
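The latency protocol (many timed runs, warmup excluded) can be sketched as follows; `run_inference` is a hypothetical stand-in for the batched ONNX call.

```python
import time

# Sketch of the latency measurement protocol: warm up first, then time
# many runs and report mean milliseconds. `run_inference` is a stand-in.

def bench(run_inference, runs=1000, warmup=50):
    for _ in range(warmup):
        run_inference()                     # warmup runs are not timed
    start = time.perf_counter()
    for _ in range(runs):
        run_inference()
    return (time.perf_counter() - start) / runs * 1e3   # ms per run
```

Excluding warmup matters here because the first ONNX Runtime calls pay one-time graph-optimization and allocation costs.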

ONNX Export & Optimization

Export Pipeline:

  • Train in PyTorch, export via torch.onnx.export(opset_version=17).
  • Models output either [batch,13] or [batch,13,1,1] (architecture-dependent), inference code normalizes to [batch,13].
  • SHA256 checksums computed for all artifacts, stored in model cards for reproducibility.
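The checksum step is standard-library territory; a chunked version like this sketch (function name illustrative) keeps memory flat even for large artifacts.

```python
import hashlib

# Sketch of the artifact checksum step: hash a model file in chunks so
# large files never need to fit in memory. Function name is illustrative.

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex digest is what gets recorded in the model card and re-verified before serving.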

Quantization (INT8):

  • Static quantization via ONNX Runtime: collect calibration data (512 samples from train set), quantize weights and activations to INT8.
  • Current status: INT8 models disabled (loading issues in production, 12/2025). Planned fix in v3.0 with updated ONNX Runtime.
  • Expected gains: 50-75% size reduction (162KB → 50KB for dwsep_se_a075), minimal accuracy loss (<0.5%).

Production API

FastAPI Application (api/app.py):

Endpoints:

  • GET /health: Returns status, loaded model count, runtime version, git commit.
  • GET /models: Lists all models from registry with metadata.
  • POST /infer: Accepts multipart image file + optional model parameter, returns FEN + confidence + timing.

Middleware:

  • CORS: Configurable origins (env: CHESS2FEN_ALLOWED_ORIGINS), allows credentials, all methods/headers.
  • Rate limiting: In-memory IP-based tracker (60 req/min per IP, 60s window). Returns 429 on exceed. Works with Cloud Run’s X-Forwarded-For header.
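An in-memory, per-IP sliding-window limiter like the one described can be sketched as below; this is an illustrative implementation, not the project's actual middleware.

```python
import time
from collections import defaultdict, deque

# Illustrative sketch of a per-IP sliding-window limiter (60 req/min as
# described above). Not the project's actual middleware code.

class RateLimiter:
    def __init__(self, limit=60, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)      # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] >= self.window:
            q.popleft()                     # evict hits outside the window
        if len(q) >= self.limit:
            return False                    # caller responds with HTTP 429
        q.append(now)
        return True
```

On Cloud Run the `ip` key would come from the X-Forwarded-For header, since the container sits behind a proxy.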

Input Validation:

  • Max upload size: 10MB (env: CHESS2FEN_MAX_UPLOAD_MB).
  • Max image dimensions: 8192×8192 pixels (env: CHESS2FEN_MAX_IMAGE_PIXELS).
  • Allowed formats: JPEG, PNG (validated via PIL).
  • EXIF stripping: Automatically removes metadata to prevent exploits.

Error Handling:

  • 400: Invalid file format, oversized image, corrupted file.
  • 404: Model not found in registry.
  • 422: Missing required parameters (file).
  • 429: Rate limit exceeded.
  • 500: Internal server error (model loading failure, ONNX runtime crash).
  • 503: Service unavailable (models not loaded at startup).

Lifespan Management:

  • Startup: Load model registry, validate index.json, cache default model session, log config.
  • Shutdown: Clear session cache, log graceful exit.

Cloud Run Deployment

Container Optimization:

  • Base image: python:3.11-slim (minimal Debian).
  • Production dependencies: onnxruntime, fastapi, uvicorn, gunicorn, pillow, numpy, python-chess (total ~200MB).
  • Final image size: ~600MB.

Container Configuration:

  • Non-root user: chess2fen (security best practice).
  • Health check: HTTP GET /health every 30s, 10s timeout.
  • Entry point: gunicorn api.app:app --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --workers 2.

Cloud Run Service:

  • Region: us-west2 (Los Angeles, low latency for US West Coast users).
  • CPU: 2 vCPU.
  • Memory: 2 GiB.
  • Concurrency: 80 requests/container.
  • Auto-scaling: 0-10 instances (scale-to-zero enabled, cold start ~200ms).
  • Timeout: 30s per request.
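The service settings above map onto gcloud flags roughly as follows; this is an illustrative command, not the project's exact deployment script.

```shell
# Illustrative Cloud Run deployment matching the settings listed above.
gcloud run deploy chess2fen-api \
  --image us-west2-docker.pkg.dev/chess2fen/chess2fen/api:latest \
  --region us-west2 \
  --cpu 2 --memory 2Gi \
  --concurrency 80 \
  --min-instances 0 --max-instances 10 \
  --timeout 30
```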

Monitoring & Operations

Cloud Monitoring Dashboards:

  • Request count by status code (200, 4xx, 5xx).
  • Latency distribution (P50, P95, P99).
  • Container CPU and memory utilization.
  • Model inference timing (preprocess, inference, postprocess breakdowns).

Alerts:

  • Error rate >5% sustained for 5 minutes → email notification.
  • P95 latency >500ms sustained for 5 minutes → email notification.

Rollback Procedures:

  • Automated: GitHub Actions tracks previous image tag, script scripts/rollback_deployment.sh reverts to last known good.
  • Manual: gcloud run services update-traffic --to-revisions=<previous_revision>=100.
  • Validation: Smoke tests after every deployment (health check, inference on test image, FEN validation).

React Frontend (v3.0)

Stack:

  • React 18 + TypeScript (strict mode) for type safety.
  • Vite 7.3 for fast HMR and optimized production builds.
  • Tailwind CSS with custom design system (8px grid, blue/purple gradients, dark/light themes).
  • Framer Motion for 60fps animations (drag-drop feedback, confetti on success).
  • react-chessboard for interactive board visualization (FEN string → rendered board).

Features:

  • Drag-and-drop file upload with preview thumbnail and size display.
  • Model selector with stats cards (accuracy, speed, size badges).
  • Progress steps indicator (Upload → Detect → Process → Results).
  • FEN output with copy-to-clipboard (animated feedback).
  • Confidence heatmap (8×8 grid, color-coded by softmax probability).
  • Performance metrics display (preprocessing, inference, postprocessing timing).
  • Responsive design (mobile-first, tablet and desktop breakpoints).

Performance:

  • Lighthouse score: 93/100 performance, 100/100 accessibility.
  • Bundle size: 125KB (gzipped), code splitting via React.lazy.
  • First Contentful Paint: <1s.
  • Playwright tests: 15/15 passing (theme toggle, file upload, responsive layout).

CI/CD Pipeline

GitHub Actions Workflows:

Tests (ci.yml):

  • Trigger: Every push, every PR.
  • Steps: Install deps (uv pip), run pytest (126 tests), ruff linting, black formatting check.
  • Fail conditions: Any test failure, linting errors, formatting violations.

Deployment (deploy-cloudrun.yml):

  • Trigger: Push to main branch.
  • Steps:
    1. Checkout code.
    2. Build Docker image: gcloud builds submit --tag us-west2-docker.pkg.dev/chess2fen/chess2fen/api:latest.
    3. Deploy to Cloud Run: gcloud run deploy chess2fen-api --image <tag> --region us-west2.
    4. Run smoke tests: health check, inference on test fixture, FEN validation.
    5. Cleanup old images: Keep only latest image to minimize storage costs.
    6. Send Slack notification (success/failure).

Pre-commit Hooks:

  • black (format code).
  • ruff (lint and auto-fix).
  • pytest (run tests locally before push).

Testing Strategy

Unit Tests (pytest, 126 tests):

  • test_infer_api.py (18 tests): Core inference, preprocessing, ONNX session loading.
  • test_model_registry.py (8 tests): Registry loading, validation, model lookup.
  • test_fen_utils.py (15 tests): FEN parsing, grid conversion, validation.
  • test_sanity.py (12 tests): Invariant checks, repair logic, confidence thresholding.
  • test_calibration.py (12 tests): Confidence calibration, ECE (Expected Calibration Error).

Integration Tests (pytest, 32 tests):

  • test_api_integration.py (32 tests): All endpoints, error handling, CORS headers, response schemas.
  • test_random_inference.py (13 tests): End-to-end with real images from test fixtures (8 boards, 256KB dataset).

UI Tests (Playwright, 15 tests):

  • Theme toggle (dark/light mode).
  • File upload (drag-drop, preview, remove).
  • Responsive design (mobile, tablet, desktop).
  • Navigation (GitHub link opens in new tab).
  • Error states (API unavailable, invalid file).

Test Fixtures:

  • Small dataset committed to git: data/test_fixtures/ (8 JPEG images, 256KB).
  • Used by CI when full data/train/ unavailable (prevents 10GB dataset download).
  • Representative board positions (starting position, endgame, complex middlegame).

Code Quality

Tooling:

  • black: Opinionated code formatter (88 char line length, PEP 8 compliance).
  • ruff: Fast linter combining flake8, isort, pyupgrade (selects: E, W, F, I, B, C4).
  • mypy: Static type checker (optional, not enforced in CI due to untyped third-party deps).

Configuration (pyproject.toml):

  • [tool.black]: line-length=88, target-version=py311.
  • [tool.ruff]: line-length=88, select=[E,W,F,I,B,C4], ignore=[E501] (black handles line length).

Pre-commit Checklist (Enforced):

  1. Format code: black src/ tests/ scripts/ api/.
  2. Lint and fix: ruff check --fix src/ tests/ scripts/ api/.
  3. Run tests: pytest tests/ -v.
  4. Verify imports: No unused imports, correct ordering.

Security Hardening

Input Validation:

  • File size limit: 10MB (prevents DoS via large uploads).
  • Pixel limit: 8192×8192 (prevents memory exhaustion).
  • Content-type validation: Reject non-image MIME types.
  • EXIF stripping: Remove metadata to prevent exploits (e.g., embedded scripts).

Container Security:

  • Non-root user: chess2fen (UID/GID 999).
  • No shell access in container (CMD runs Python directly).
  • Minimal attack surface (no SSH, no unnecessary packages).

API Security:

  • Rate limiting: 60 req/min per IP (prevents brute force, scraping).
  • CORS restrictions: Whitelist only trusted origins (env: CHESS2FEN_ALLOWED_ORIGINS).
  • No authentication: Public API, but rate-limited to prevent abuse.
  • Inference timeout: 30s per request (prevents hanging on malicious inputs).

Learning Outcomes

This project demonstrates practical experience with:

  • End-to-end ML systems: Data generation, model training, hyperparameter tuning, evaluation, deployment, monitoring. Owned entire lifecycle.

  • Computer vision architecture design: Implemented 9 CNN variants (depthwise separable, cascade, multitask), explored squeeze-excite attention, width multipliers, and per-square classification strategies.

  • ONNX optimization: Cross-platform inference via ONNX Runtime, static quantization (INT8), session caching, batched execution for 10x speedup vs per-square loops.

  • Production ML deployment: Dockerized FastAPI on Google Cloud Run with auto-scaling, rate limiting, comprehensive monitoring, rollback procedures, cost optimization.

  • Model registry systems: JSON-based registry with versioned artifacts, SHA256 checksums, model cards tracking metrics and provenance, API discovery endpoint.

  • Testing rigor: 126 pytest unit/integration tests, 15 Playwright UI tests, 100% endpoint coverage, deterministic fixtures, CI/CD with pre-commit hooks.

  • Cost engineering: Reduced the container image from 2.5GB → 600MB by removing PyTorch (a training-only dependency) and implemented automated Docker image cleanup to cut storage costs.

  • Frontend development: React 18 + TypeScript SPA with Tailwind CSS, Framer Motion animations, responsive design, Lighthouse 93/100 performance, Playwright-tested.

  • Documentation discipline: 18+ technical docs covering architecture, API, model registry, training, deployment, cost reduction, rollback procedures, version history. Maintained through 3 major versions (v1.0 → v2.0 → v3.0).

  • Production operational patterns: Lifespan management (startup/shutdown hooks), structured logging, error handling with appropriate HTTP status codes, CORS configuration, health checks, graceful degradation.