
My Projects
RoveMiles — Data Analysis + Financial Modeling Work
Goal:
Build an end-to-end predictive analytics and financial modeling system for RoveMiles.
Problem Statement:
RoveMiles needed its large transactional datasets analyzed, financial metrics engineered, and dashboards built to model revenue and track key business KPIs.
Result:
Delivered a comprehensive financial model, identified high-impact churn and profitability drivers, and built dashboards for revenue and unit-economics analysis.
Approach / Methodology:
- Built a multi-sheet financial forecasting model covering GMV, take rate, redemption cost, partner payouts, revenue curves, and profit margins.
- Performed scenario and sensitivity analysis on mile issuance/redemption behavior to evaluate profitability (see the sketch after this list).
- Analyzed partner, redemption, and customer behavior datasets to identify churn factors and optimize pricing.
- Created dashboards and tools for KPI tracking and revenue modeling.
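The core of the forecast is simple unit-economics arithmetic. A minimal Python restatement of that logic, with made-up rates and figures standing in for the workbook's actual inputs:

```python
# Illustrative restatement of the workbook's per-period unit economics;
# all rates and amounts below are invented, not RoveMiles data.
import pandas as pd

def unit_economics(gmv, take_rate, redemption_cost_pct, partner_payout_pct):
    """Return one period's revenue, costs, and contribution margin."""
    revenue = gmv * take_rate                      # platform revenue on GMV
    redemption_cost = gmv * redemption_cost_pct    # cost of miles redeemed
    partner_payout = revenue * partner_payout_pct  # share passed to partners
    margin = revenue - redemption_cost - partner_payout
    return {"revenue": revenue, "redemption_cost": redemption_cost,
            "partner_payout": partner_payout, "margin": margin}

# Sensitivity analysis: sweep redemption cost to see where margin turns negative.
scenarios = pd.DataFrame(
    {pct: unit_economics(gmv=1_000_000, take_rate=0.12,
                         redemption_cost_pct=pct, partner_payout_pct=0.25)
     for pct in (0.02, 0.05, 0.08, 0.11)}
).T
print(scenarios)
```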
Key Learnings:
- Business-focused machine learning
- Working with ambiguous real-world datasets
- Integrating financial models with predictive analytics
Skills:
Excel financial modeling (GMV, take rate, unit economics), data analysis, scenario forecasting, partner-pricing modeling, revenue/cost analysis, dashboard creation
Real-Time Customer Behavior Analytics Pipeline (AWS)
Goal:
Turning live data into instant business insight.
Problem Statement:
Businesses need immediate visibility into customer interactions, but traditional batch-processing systems cause delays in detecting anomalies, trends, and engagement patterns.
Results / Visuals:
- Cut event-to-dashboard latency to under 2 minutes
- Fixed a ~$1.3k Redshift cost spike and optimized compute use
- Enabled the BI team to visualize customer trends in near-real-time
- Fully fault-tolerant streaming architecture
Approach / Methodology:
- Designed the full architecture: Kinesis → Lambda → S3 → Glue Crawler → Glue ETL → Glue DQ → Redshift → QuickSight
- Built PySpark ETL scripts for cleaning, partitioning, and transforming event data (see the sketch after this list)
- Set up Glue Data Catalog tables, schema inference, and automated crawlers
- Configured Redshift clusters, IAM roles, networking, VPC, and subnets
- Implemented AWS Budgets and cost controls after diagnosing the Redshift overages
- Created QuickSight dashboards for funnels, anomalies, and activity timelines
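A minimal sketch of what the Glue ETL step might look like in PySpark, assuming hypothetical S3 paths and an event schema with event_id/event_type/ts fields; it runs inside a Glue job, and the production script is not reproduced here:

```python
# Glue PySpark job sketch: read raw JSON events landed by the
# Kinesis -> Lambda -> S3 stage, clean them, and write partitioned Parquet.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical bucket layout; substitute real paths.
raw = glue_context.spark_session.read.json("s3://example-bucket/raw/events/")

clean = (
    raw.dropDuplicates(["event_id"])              # idempotent on re-runs
       .filter(F.col("event_type").isNotNull())
       .withColumn("ts", F.to_timestamp("ts"))
       .withColumn("dt", F.to_date("ts"))         # partition key
)

# Partitioned Parquet feeds the crawler and the Redshift load stage.
clean.write.mode("append").partitionBy("dt").parquet(
    "s3://example-bucket/curated/events/"
)
job.commit()
```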
Key Learnings:
- Deep practical understanding of cloud data engineering architecture
- Schema evolution, Glue DQ, IAM policy debugging, Redshift performance tuning
- How real-time ETL differs from batch pipelines
Skills:
AWS Kinesis, AWS Lambda, AWS Glue, Glue DQ, Redshift, S3, Athena, CloudWatch, PySpark, VPC, IAM
GitHub | View Report
AI-Powered Retrieval-Augmented Generation System
Goal:
Offline AI for secure, collaborative insight.
Problem Statement:
Teams handling sensitive or confidential data need a locally hosted LLM solution that runs fully offline while providing accurate retrieval, high-quality responses, and support for multi-agent collaboration.
Results / Visuals:
- Eliminated API costs entirely via fully offline inference
- Achieved faster hybrid retrieval by combining graph and vector search
- Delivered a modern UI suitable for internal research workflows
Approach / Methodology:
- Built a full RAG architecture using Microsoft GraphRAG for structured retrieval
- Integrated AutoGen agents for multi-agent reasoning
- Used Ollama to run local models (Mistral, Nomic embeddings)
- Added a LiteLLM proxy to enable OpenAI-style function calling with local models
- Designed an interactive Chainlit UI supporting multi-threaded chats
- Generated embeddings and stored them in local vector stores (see the sketch after this list)
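A minimal sketch of the dense-retrieval half of the stack, assuming the ollama Python client, the nomic-embed-text embedding model, and FAISS as the vector store; GraphRAG's graph index and the AutoGen agents sit alongside this in the full system:

```python
# Dense retrieval sketch: embed documents locally, index in FAISS,
# answer with a local Mistral model. Documents here are placeholders.
import faiss
import numpy as np
import ollama

docs = ["Quarterly report text ...", "Meeting notes ...", "Design doc ..."]

def embed(texts):
    vecs = [ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
            for t in texts]
    return np.array(vecs, dtype="float32")

doc_vecs = embed(docs)
faiss.normalize_L2(doc_vecs)                  # so inner product == cosine
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

def answer(question, k=2):
    q = embed([question])
    faiss.normalize_L2(q)
    _, hits = index.search(q, k)
    context = "\n\n".join(docs[i] for i in hits[0])
    reply = ollama.chat(model="mistral", messages=[
        {"role": "user",
         "content": f"Answer using only this context:\n{context}\n\nQ: {question}"}
    ])
    return reply["message"]["content"]

print(answer("What did the quarterly report say?"))
```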
Key Learnings:
- Differences between graph-based and dense retrieval
- Offline inference tradeoffs, local model memory/latency handling
- Multi-agent orchestration patterns
Skills:
Python, Chainlit, Microsoft GraphRAG, AutoGen, LiteLLM, Ollama, Mistral, Nomic Embeddings, FAISS
ChurnGuard — Customer Churn Prediction ML Platform
Goal:
ML-based churn prediction with a 15% accuracy improvement after tuning.
Problem Statement:
Businesses need tools to forecast customer churn and proactively address retention weaknesses.
Results / Visuals:
- 15% accuracy increase after hyperparameter tuning
- Clear ranking of features affecting churn risk
- Delivered a dashboard-ready dataset of churn probabilities
Approach / Methodology:
- Cleaned and normalized customer datasets
- Performed feature engineering and outlier detection
- Trained Logistic Regression, Random Forest, Neural Network, and Gradient Boosting models
- Tuned hyperparameters to maximize predictive score (see the sketch after this list)
- Evaluated with accuracy, ROC-AUC, and precision-recall
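A minimal sketch of the model-comparison and tuning loop, shown here on synthetic data; the feature set and parameter grid are illustrative, not the project's actual configuration:

```python
# Compare churn classifiers by ROC-AUC, then tune the strongest family.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the cleaned, feature-engineered customer table.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "neural_net": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC-AUC = {auc:.3f}")

# Hyperparameter tuning on the strongest family (grid is illustrative).
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```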
Key Learnings:
- ML model experimentation workflows
- Importance of preprocessing and data quality
- Tradeoffs between different classification models
Skills:
Python, Pandas, NumPy, Scikit-Learn, Matplotlib, Jupyter
Gene Expression Visualizer — RNA-seq & Microarray Analysis
Goal:
Simplify gene-expression visualization for high-dimensional datasets.
Problem Statement:
RNA-seq/microarray data contain thousands of genes, requiring clustering, dimensionality reduction, QC checks, and publication-quality plots — tasks that otherwise need extensive coding.
Approach / Methodology:
- Loaded & normalized expression matrices
- Performed PCA & hierarchical clustering (see the sketch after this list)
- Generated correlation and QC plots
- Automated report creation via a unified API
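A minimal sketch of the PCA and clustered-heatmap core on a synthetic genes-by-samples matrix; the matrix shape, labels, and gene count are illustrative:

```python
# PCA of samples plus a clustered heatmap of the most variable genes.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.lognormal(mean=2, sigma=1, size=(500, 12)),
                    index=[f"gene_{i}" for i in range(500)],
                    columns=[f"sample_{j}" for j in range(12)])

# Log-normalize, then run PCA with samples as rows.
log_expr = np.log2(expr + 1)
scaled = StandardScaler().fit_transform(log_expr.T)
pcs = PCA(n_components=2).fit_transform(scaled)

fig, ax = plt.subplots()
ax.scatter(pcs[:, 0], pcs[:, 1])
for label, (x_, y_) in zip(log_expr.columns, pcs):
    ax.annotate(label, (x_, y_), fontsize=7)
ax.set(xlabel="PC1", ylabel="PC2", title="Sample relationships")

# Clustered heatmap of the 50 most variable genes (rows z-scored).
top = log_expr.loc[log_expr.var(axis=1).nlargest(50).index]
sns.clustermap(top, z_score=0, cmap="vlag")
plt.show()
```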
Results / Visuals:
- Clustered heatmaps
- PCA sample-relationship plots
- Correlation matrices
- Expression distribution box plots
- One-command automated reporting
Key Learnings:
- Handling high-dimensional biological data
- Normalization & QC strategies
- Accessible visualization design
Skills:
Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
DNA Analysis Toolkit — Comprehensive Sequence Analysis
Goal:
Provide an all-in-one DNA analysis tool for students and researchers.
Problem Statement:
DNA analysis typically requires multiple tools for GC content, ORFs, codon usage, translation, restriction sites, and molecular properties — making workflows slow and fragmented.
Approach / Methodology:
- FASTA parsing and standardized structure
- GC, ORF, codon usage, and translation pipelines (see the sketch after this list)
- Restriction-site scanning
- Visualization with Matplotlib/Seaborn
- Memory-efficient batch workflows
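A minimal sketch of the GC-content and ORF-scanning path with Biopython; the FASTA filename and the 300 nt ORF threshold are illustrative assumptions, and the scanner is deliberately simple (no nested ORFs):

```python
# Per-record GC content and naive six-frame ORF scan (ATG .. stop, in frame).
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

def find_orfs(seq, min_len=300):
    """Return ORF strings of >= min_len nt found in all six reading frames."""
    orfs = []
    for strand in (seq, seq.reverse_complement()):
        for frame in range(3):
            i = frame
            while i < len(strand) - 2:
                if strand[i:i + 3] == "ATG":
                    for j in range(i + 3, len(strand) - 2, 3):
                        if strand[j:j + 3] in ("TAA", "TAG", "TGA"):
                            if j + 3 - i >= min_len:
                                orfs.append(str(strand[i:j + 3]))
                            i = j  # resume scanning after the stop codon
                            break
                i += 3
    return orfs

for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id,
          f"GC={gc_fraction(record.seq):.1%}",
          f"ORFs(>=300nt)={len(find_orfs(record.seq))}")
```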
Results / Visuals:
- GC-content plots, ORF maps
- Codon-usage heatmaps
- Restriction-site visualizations
- Molecular weight/Tm calculations
- Batch processing for 1000+ FASTA sequences
Key Learnings:
- DNA analysis fundamentals
- Biopython workflows
- Scientific visualization best practices
Skills:
Python, Biopython, Pandas, Matplotlib, Seaborn
BLAST Result Parser — Automated Sequence Alignment Analysis
Goal:
Automate BLAST alignment parsing and filtering for faster biological insight.
Problem Statement:
BLAST outputs are large and difficult to interpret manually. Researchers need simpler tools to filter hits by e-value, identity, and bit score, extract top matches, and export results for downstream analysis.
Approach / Methodology:
- Parsed XML/tabular BLAST output with Biopython
- Standardized alignment fields
- Built a chainable filtering pipeline (see the sketch after this list)
- Computed statistics via pandas
- Provided multi-format export
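A minimal sketch of the XML-parsing and filtering path using Biopython's NCBIXML; the thresholds, file names, and top-hit count are illustrative:

```python
# Flatten BLAST XML into a DataFrame, then filter and export top hits.
import pandas as pd
from Bio.Blast import NCBIXML

rows = []
with open("blast_results.xml") as handle:
    for record in NCBIXML.parse(handle):
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                rows.append({
                    "query": record.query,
                    "hit": alignment.hit_def,
                    "evalue": hsp.expect,
                    "identity_pct": 100 * hsp.identities / hsp.align_length,
                    "bit_score": hsp.bits,
                })

hits = pd.DataFrame(rows)

# Chainable filtering: e-value, identity %, bit score; top 5 hits per query.
top = (hits.query("evalue < 1e-5 and identity_pct >= 90 and bit_score >= 50")
           .sort_values("bit_score", ascending=False)
           .groupby("query", as_index=False)
           .head(5))

top.to_csv("filtered_hits.csv", index=False)
print(top.describe())
```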
Results / Visuals:
- Clean filtering by e-value, identity %, and bit score
- Summary statistics for alignment quality
- Query-grouped and top-hit reports
- CSV/JSON exports for further analysis
Key Learnings:
- BLAST metric interpretation
- Scientific data-pipeline design
- Usable API design for researchers
Skills:
Python, Biopython, Pandas, JSON/CSV