
My Projects
RoveMiles — ML Engineering + Financial Modeling Work
Goal:
Build an end-to-end predictive analytics and financial modeling system for RoveMiles.
Problem Statement:
RoveMiles needed scalable ML pipelines on AWS to process large transactional datasets, covering anomaly detection, feature engineering, and dashboards for key business metrics.
Result:
Delivered a financial model, identified high-impact churn features, and built metrics dashboards for KPIs and revenue models.
Approach / Methodology:
- Developed churn prediction and retention models using Scikit-Learn to identify at-risk customers.
- Implemented anomaly detection on booking and transaction logs to surface irregular patterns in real time.
- Built a comprehensive financial forecasting model in Excel covering GMV, blended take rate, revenue curves, partner payouts, and profit margins.
- Conducted sensitivity analysis on mile issuance and redemption behavior to evaluate profitability under varying assumptions.
- Designed a complete ML pipeline lifecycle, from data ingestion and validation to feature engineering, model training, and monitoring.
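The sensitivity-analysis step above can be sketched as a toy profit model; all figures (GMV, take rate, payout per mile) are illustrative placeholders, not actual RoveMiles numbers.

```python
def monthly_profit(gmv, take_rate, redemption_rate, payout_per_mile, miles_issued):
    """Toy profit model: revenue from a blended take rate minus partner
    payouts on redeemed miles. All inputs are illustrative assumptions."""
    revenue = gmv * take_rate
    payout_cost = miles_issued * redemption_rate * payout_per_mile
    return revenue - payout_cost

# Sweep redemption behaviour to see how profit moves under varying assumptions.
base = dict(gmv=1_000_000, take_rate=0.08, payout_per_mile=0.01, miles_issued=5_000_000)
scenarios = {r: monthly_profit(redemption_rate=r, **base) for r in (0.2, 0.5, 0.8)}
```

Each scenario holds everything fixed except the redemption rate, which is the same one-variable-at-a-time pattern the Excel model uses.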
Key Learnings:
- Business-focused machine learning
- Working with ambiguous real-world data
- Building financial models and ML systems together
Tech Stack:
Python, Pandas, Excel, Scikit-Learn, AWS S3, Jupyter
GitHub | View Report
Real-Time Customer Behavior Analytics Pipeline (AWS)
Goal:
Turning live data into instant business insight.
Problem Statement:
Businesses need immediate visibility into customer interactions, but traditional batch-processing systems cause delays in detecting anomalies, trends, and engagement patterns.
Results / Visuals:
- Cut event-to-dashboard latency to under 2 minutes
- Diagnosed and fixed a ~$1.3k Redshift cost spike and optimized compute use
- Enabled the BI team to visualize customer trends in near real time
- Built a fully fault-tolerant streaming architecture
Approach / Methodology:
- Designed the full architecture: Kinesis → Lambda → S3 → Glue Crawler → Glue ETL → Glue DQ → Redshift → QuickSight
- Built PySpark ETL scripts for cleaning, partitioning, and transforming event data
- Set up Glue Data Catalog tables, schema inference, and automated crawlers
- Configured Redshift clusters, IAM roles, networking, VPC, and subnets
- Implemented AWS Budgets and cost controls after diagnosing Redshift overages
- Created QuickSight dashboards for funnels, anomalies, and activity timelines
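The Lambda stage of the architecture above can be sketched as a handler that decodes Kinesis records and derives date-partitioned S3 keys; the `handler` name, the `timestamp` field, and the `events/` prefix are illustrative assumptions, not the project's actual code.

```python
import base64
import json
from datetime import datetime, timezone

def handler(event, context=None):
    """Sketch of a Kinesis-triggered Lambda: decode each base64 record and
    derive a Hive-style partition key for S3 (assumed event schema)."""
    out = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        ts = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)
        # year=/month=/day= partitioning keeps Glue crawlers and Athena fast.
        key = f"events/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        out.append((key, payload))
    return out
```

Partitioning at write time is what lets the downstream Glue ETL and Redshift loads prune by date instead of scanning everything.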
Key Learnings:
- Deep practical understanding of cloud data engineering architecture
- Schema evolution, Glue DQ, IAM policy debugging, and Redshift performance tuning
- How real-time ETL differs from batch pipelines
Tech Stack:
AWS Kinesis, AWS Lambda, AWS Glue, Glue DQ, Redshift, S3, Athena, CloudWatch, PySpark, VPC, IAM
GitHub | View Report
AI-Powered Retrieval-Augmented Generation System
Goal:
Offline AI for secure, collaborative insight.
Problem Statement:
Teams managing sensitive or confidential data require a secure, locally hosted LLM solution that operates offline while maintaining high response quality, accurate knowledge retrieval, and support for multi-agent collaboration.
Results / Visuals:
- Reduced API cost by 100% via fully offline inference
- Achieved faster hybrid retrieval through combined graph + vector search
- Delivered a modern UI suitable for internal research workflows
Approach / Methodology:
- Built a full RAG architecture using Microsoft GraphRAG for structured retrieval
- Integrated AutoGen agents for multi-agent reasoning
- Used Ollama to run local models (Mistral, Nomic embeddings)
- Added a Lite-LLM proxy to enable OpenAI-style function calling with local models
- Designed an interactive Chainlit UI supporting multi-threaded chats
- Generated embeddings and stored them in local vector stores
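The dense side of the hybrid retrieval described above can be sketched as cosine-similarity ranking over stored embeddings; this is a minimal NumPy stand-in, not the FAISS/GraphRAG implementation, and the toy vectors are assumptions.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query (dense-retrieval sketch).

    query_vec: (d,) embedding of the query
    doc_vecs:  (n, d) matrix of document embeddings
    Returns the indices of the k most similar documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k]      # highest similarity first
```

A graph-based retriever would instead expand from entities linked to the query, which is why combining the two surfaces both semantically similar and structurally related context.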
Key Learnings:
- Differences between graph-based and dense retrieval
- Offline inference tradeoffs and local model memory/latency handling
- Multi-agent orchestration patterns
Tech Stack:
Python, Chainlit, Microsoft GraphRAG, AutoGen, Lite-LLM, Ollama, Mistral, Nomic Embeddings, FAISS
GitHub | View Report
ChurnGuard — Customer Churn Prediction ML Platform
Goal:
ML-based churn prediction with +15% accuracy improvement.
Problem Statement:
Businesses need tools to forecast customer churn and proactively address retention weaknesses.
Results / Visuals:
- 15% accuracy increase after tuning
- Clear ranking of the features driving churn risk
- Delivered a dashboard-ready dataset of churn probabilities
Approach / Methodology:
- Cleaned and normalized customer datasets
- Performed feature engineering and outlier detection
- Trained Logistic Regression, Random Forest, and XGBoost models
- Tuned hyperparameters to maximize predictive score
- Evaluated models with accuracy, ROC-AUC, and precision-recall
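The model-comparison step above can be sketched with Scikit-Learn; synthetic data stands in for the real customer dataset, and the model settings shown are illustrative defaults rather than the tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, normalized customer dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC-AUC on held-out data, using churn probabilities rather than labels.
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Comparing ROC-AUC on the same held-out split is what makes the "+15% after tuning" claim measurable: each tuning round is scored against a fixed baseline.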
Key Learnings:
- ML model experimentation workflows
- Importance of preprocessing and data quality
- Tradeoffs between different classification models
Tech Stack:
Python, Pandas, Scikit-Learn, Matplotlib, Jupyter
GitHub | View Report
Cancer Gene Classifier
Machine learning model for tumor subtype prediction using TCGA gene expression data.
Approach: Random Forest + TCGA dataset
Result: 91% accuracy, ROC-AUC 0.94
Tech Stack: Python, Pandas, Scikit-learn
GitHub | View Report
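The Random Forest approach above can be sketched as a multiclass classifier scored with one-vs-rest ROC-AUC; synthetic features stand in for the TCGA expression matrix, and the sample counts and class count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for TCGA gene-expression data: the real matrix has
# thousands of gene features and labelled tumor subtypes.
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# One-vs-rest ROC-AUC generalizes the binary metric to subtype prediction.
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
```

Stratifying the split keeps subtype proportions equal in train and test, which matters when some tumor subtypes are rare.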