
My Projects
RoveMiles — ML Engineering + Financial Modeling Work
Goal:
Build an end-to-end predictive analytics and financial modeling system for RoveMiles.
Problem Statement:
RoveMiles needed scalable ML pipelines on AWS to process large transactional datasets, covering anomaly detection, feature engineering, and dashboards for key business metrics.
Result:
Delivered a financial model, identified high-impact churn features, and built metrics dashboards for KPIs and revenue models.
Approach / Methodology:
- Developed churn prediction and retention models using Scikit-Learn to identify at-risk customers.
- Implemented anomaly detection on booking and transaction logs to surface irregular patterns in real time.
- Built a comprehensive financial forecasting model in Excel covering GMV, blended take rate, revenue curves, partner payouts, and profit margins.
- Conducted sensitivity analysis on mile issuance and redemption behavior to evaluate profitability under varying assumptions.
- Designed a complete ML pipeline lifecycle, from data ingestion and validation to feature engineering, model training, and monitoring.
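The sensitivity-analysis step above can be sketched as a toy profit model; all figures (GMV, take rate, payout per mile) are illustrative placeholders, not actual RoveMiles numbers.

```python
def monthly_profit(gmv, take_rate, redemption_rate, payout_per_mile, miles_issued):
    """Toy profit model: revenue from a blended take rate minus partner
    payouts on redeemed miles. All inputs are illustrative assumptions."""
    revenue = gmv * take_rate
    payout_cost = miles_issued * redemption_rate * payout_per_mile
    return revenue - payout_cost

# Sweep redemption behaviour to see how profit moves under varying assumptions.
base = dict(gmv=1_000_000, take_rate=0.08, payout_per_mile=0.01, miles_issued=5_000_000)
scenarios = {r: monthly_profit(redemption_rate=r, **base) for r in (0.2, 0.5, 0.8)}
```

Each scenario holds everything fixed except the redemption rate, which is the same one-variable-at-a-time pattern the Excel model uses.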
Key Learnings:
- Business-focused machine learning
- Working with ambiguous real-world data
- Building financial models and ML systems together
Tech Stack:
Python, Pandas, Excel, Scikit-Learn, AWS S3, Jupyter
GitHub | View Report
Real-Time Customer Behavior Analytics Pipeline (AWS)
Goal:
Turning live data into instant business insight.
Problem Statement:
Businesses need immediate visibility into customer interactions, but traditional batch-processing systems cause delays in detecting anomalies, trends, and engagement patterns.
Results / Visuals:
- Cut event-to-dashboard latency to under 2 minutes
- Diagnosed and fixed a ~$1.3k Redshift cost spike and optimized compute use
- Enabled the BI team to visualize customer trends in near real time
- Built a fully fault-tolerant streaming architecture
Approach / Methodology:
- Designed the full architecture: Kinesis → Lambda → S3 → Glue Crawler → Glue ETL → Glue DQ → Redshift → QuickSight
- Built PySpark ETL scripts for cleaning, partitioning, and transforming event data
- Set up Glue Data Catalog tables, schema inference, and automated crawlers
- Configured Redshift clusters, IAM roles, networking, VPC, and subnets
- Implemented AWS Budgets and cost controls after diagnosing Redshift overages
- Created QuickSight dashboards for funnels, anomalies, and activity timelines
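The Lambda stage of the architecture above can be sketched as a handler that decodes Kinesis records and derives date-partitioned S3 keys; the `handler` name, the `timestamp` field, and the `events/` prefix are illustrative assumptions, not the project's actual code.

```python
import base64
import json
from datetime import datetime, timezone

def handler(event, context=None):
    """Sketch of a Kinesis-triggered Lambda: decode each base64 record and
    derive a Hive-style partition key for S3 (assumed event schema)."""
    out = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        ts = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)
        # year=/month=/day= partitioning keeps Glue crawlers and Athena fast.
        key = f"events/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        out.append((key, payload))
    return out
```

Partitioning at write time is what lets the downstream Glue ETL and Redshift loads prune by date instead of scanning everything.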
Key Learnings:
- Deep practical understanding of cloud data engineering architecture
- Schema evolution, Glue DQ, IAM policy debugging, and Redshift performance tuning
- How real-time ETL differs from batch pipelines
Tech Stack:
AWS Kinesis, AWS Lambda, AWS Glue, Glue DQ, Redshift, S3, Athena, CloudWatch, PySpark, VPC, IAM
GitHub | View Report
AI-Powered Retrieval-Augmented Generation System
Goal:
Offline AI for secure, collaborative insight.
Problem Statement:
Teams managing sensitive or confidential data require a secure, locally hosted LLM solution that operates offline while maintaining high response quality, accurate knowledge retrieval, and support for multi-agent collaboration.
Results / Visuals:
- Reduced API cost by 100% via fully offline inference
- Achieved faster hybrid retrieval through combined graph + vector search
- Delivered a modern UI suitable for internal research workflows
Approach / Methodology:
- Built a full RAG architecture using Microsoft GraphRAG for structured retrieval
- Integrated AutoGen agents for multi-agent reasoning
- Used Ollama to run local models (Mistral, Nomic embeddings)
- Added a Lite-LLM proxy to enable OpenAI-style function calling with local models
- Designed an interactive Chainlit UI supporting multi-threaded chats
- Generated embeddings and stored them in local vector stores
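The dense side of the hybrid retrieval described above can be sketched as cosine-similarity ranking over stored embeddings; this is a minimal NumPy stand-in, not the FAISS/GraphRAG implementation, and the toy vectors are assumptions.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query (dense-retrieval sketch).

    query_vec: (d,) embedding of the query
    doc_vecs:  (n, d) matrix of document embeddings
    Returns the indices of the k most similar documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k]      # highest similarity first
```

A graph-based retriever would instead expand from entities linked to the query, which is why combining the two surfaces both semantically similar and structurally related context.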
Key Learnings:
- Differences between graph-based and dense retrieval
- Offline inference tradeoffs and local model memory/latency handling
- Multi-agent orchestration patterns
Tech Stack:
Python, Chainlit, Microsoft GraphRAG, AutoGen, Lite-LLM, Ollama, Mistral, Nomic Embeddings, FAISS
GitHub | View Report
ChurnGuard — Customer Churn Prediction ML Platform
Goal:
ML-based churn prediction with +15% accuracy improvement.
Problem Statement:
Businesses need tools to forecast customer churn and proactively address retention weaknesses.
Results / Visuals:
- 15% accuracy increase after tuning
- Clear ranking of the features driving churn risk
- Delivered a dashboard-ready dataset of churn probabilities
Approach / Methodology:
- Cleaned and normalized customer datasets
- Performed feature engineering and outlier detection
- Trained Logistic Regression, Random Forest, and XGBoost models
- Tuned hyperparameters to maximize predictive score
- Evaluated models with accuracy, ROC-AUC, and precision-recall
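The model-comparison step above can be sketched with Scikit-Learn; synthetic data stands in for the real customer dataset, and the model settings shown are illustrative defaults rather than the tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, normalized customer dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC-AUC on held-out data, using churn probabilities rather than labels.
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Comparing ROC-AUC on the same held-out split is what makes the "+15% after tuning" claim measurable: each tuning round is scored against a fixed baseline.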
Key Learnings:
- ML model experimentation workflows
- Importance of preprocessing and data quality
- Tradeoffs between different classification models
Tech Stack:
Python, Pandas, Scikit-Learn, Matplotlib, Jupyter
GitHub | View Report
Cancer Gene Classifier
Machine learning model for tumor subtype prediction using TCGA gene expression data.
Approach: Random Forest + TCGA dataset
Result: 91% accuracy, ROC-AUC 0.94
Tech Stack: Python, Pandas, Scikit-learn
GitHub | View Report
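The Random Forest approach above can be sketched as a multiclass classifier scored with one-vs-rest ROC-AUC; synthetic features stand in for the TCGA expression matrix, and the sample counts and class count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for TCGA gene-expression data: the real matrix has
# thousands of gene features and labelled tumor subtypes.
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# One-vs-rest ROC-AUC generalizes the binary metric to subtype prediction.
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
```

Stratifying the split keeps subtype proportions equal in train and test, which matters when some tumor subtypes are rare.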