Keyush

Keyush Shah

Resume | Google Scholar | Github | LinkedIn | Email ID | Portfolio of Projects

Hello, I work as an AI Researcher at the Computational Social Listening Lab, where I am advised by Prof. Lyle Ungar and Prof. Sharath Guntuku.

My projects span high-impact domains including healthcare, finance, and social media, with a core emphasis on creating models that are not only accurate, but also explainable and deployable at scale. I’m passionate about leveraging ML, AI, and software engineering to deliver results that drive business value and societal impact.

🚀 I am actively seeking full-time opportunities in applied ML, AI-driven product development, or software engineering, especially within mission-driven and tech-forward teams.

A Sneak Peek into my Value Proposition

My academic journey began with a strong foundation in Statistics and Mathematics, where I developed acumen in analytics, inference, and data-driven modeling; this laid the groundwork for my foray into Data Science. I am now focused on applying Generative AI and Machine Learning to real-world challenges and leveraging data to drive impactful insights.

  • I bring a strong track record of delivering real-world impact through data science and analytics across diverse industries. From optimizing marketing performance with media mix models to driving digital transformation through predictive modeling, my work has consistently translated data into strategic value.
  • I've built scalable data pipelines, deployed machine learning solutions in production, and developed dashboards that inform key business decisions. Whether it's enhancing campaign ROI, streamlining operations, or personalizing customer experiences, I focus on using data to solve high-impact problems and create measurable outcomes.
  • I also have experience with data engineering tools and cloud platforms, particularly the Azure suite and AWS, using services such as Azure Data Factory, Azure Databricks, and AWS SageMaker to build scalable data pipelines and ML models for AI-driven applications.
  • Kindly take a look at my Portfolio of Projects.

    Professional Experience

    1. Penn Medicine

      AI Researcher 2024
    2. Universal Media

      Data Science Intern 2024
    3. IIFL Finance Ltd

      Assistant Manager 2022 - 2023

    Research

    Click the link below to see my research interests and some of the questions that motivate my work.

    Research Interests


    Check out my ongoing projects in the section below.

    Current Research Projects

    Publications: Pre-prints

    (* denotes equal contribution)
    1. Enhancing Retrieval in QA Systems with Derived Feature Association
      Keyush Shah, Abhishek Goyal*, Isaac Wasserman*
      arXiv preprint, 2024
      paper | arxiv | code

    Portfolio of Selected Projects

    Machine Learning Systems/MLOps

    AI Engineering/LLMs

    Deep Learning

    Computer Systems & Data Engineering

    Probability & Statistical Modeling

    Project Descriptions

    1. Particle Agent

      particle pic

      Backend Architecture & Technologies

      Session Memory & Entity Resolution

      Data Ingestion & Privacy

      Frontend Development

      | code
    2. Ride Duration Prediction

      taxi pic

      🧱 Model Development

      🧱 Pipeline Orchestration

      🧱 Experiment Management

      🧱 Deployment

      | code
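
      As an illustration of the experiment-management stage above, here is a minimal sketch assuming an MLflow-style tracking setup; MLflow itself, the experiment name, and the random-forest model are assumptions for the sketch, not necessarily what the repository uses.

      ```python
      # Hypothetical sketch of experiment tracking with MLflow; the data,
      # features, and model below are stand-ins, not the project's own.
      import mlflow
      import mlflow.sklearn
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_squared_error

      rng = np.random.default_rng(0)
      X = rng.normal(size=(400, 4))                    # stand-in trip features
      y = X @ np.array([5.0, 2.0, -1.0, 0.5]) + rng.normal(0, 1, 400)
      X_train, X_val, y_train, y_val = X[:300], X[300:], y[:300], y[300:]

      mlflow.set_experiment("ride-duration-prediction")  # logs to ./mlruns by default
      with mlflow.start_run():
          params = {"n_estimators": 100, "max_depth": 10}
          mlflow.log_params(params)
          model = RandomForestRegressor(**params).fit(X_train, y_train)
          rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
          mlflow.log_metric("rmse", rmse)              # compared across runs in the UI
          mlflow.sklearn.log_model(model, "model")
      ```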
    3. Image Reconstruction using Diffusion Transformers

      DiT pic
      I developed a PatchVAE model to encode facial features from the CelebA dataset, then trained a diffusion model on the VAE's latent representations. This approach reconstructed and generated realistic human face images, achieving an FID score of 14.2; a minimal training sketch follows the links below.
      | code | Reference Paper
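
      As a rough illustration of the approach (not the project's exact code), here is a minimal PyTorch sketch of one latent-diffusion training step: images are encoded to VAE latents, noised at a random timestep, and a denoiser learns to predict that noise. The `vae_encoder` and `denoiser` modules are hypothetical stand-ins, and timestep conditioning is omitted for brevity.

      ```python
      import torch
      import torch.nn as nn

      # Hypothetical stand-ins for the trained PatchVAE encoder and the denoiser.
      vae_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)   # image -> latent grid
      denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)     # predicts added noise

      T = 1000
      betas = torch.linspace(1e-4, 0.02, T)                    # linear noise schedule
      alpha_bars = torch.cumprod(1.0 - betas, dim=0)
      opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

      images = torch.randn(8, 3, 64, 64)                       # stand-in CelebA batch
      with torch.no_grad():
          z0 = vae_encoder(images)                             # clean latents

      t = torch.randint(0, T, (z0.shape[0],))
      a = alpha_bars[t].view(-1, 1, 1, 1)
      noise = torch.randn_like(z0)
      zt = a.sqrt() * z0 + (1 - a).sqrt() * noise              # forward process q(z_t | z_0)

      loss = nn.functional.mse_loss(denoiser(zt), noise)       # epsilon-prediction objective
      opt.zero_grad()
      loss.backward()
      opt.step()
      ```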
    4. Instance Segmentation by Location

      Instance Segmentation Pic
      I implemented an advanced instance segmentation framework inspired by the SOLO (Segmenting Objects by Locations) model. It features a ResNet backbone for robust feature extraction and a Feature Pyramid Network (FPN) to handle multi-scale object representations efficiently. The architecture consists of two main branches. The Category Prediction Branch assigns pixels to grid-based instance categories, leveraging spatial information to effectively localize and distinguish objects of varying sizes.

      Meanwhile, the Mask Segmentation Branch generates accurate binary masks using a spatially sensitive, fully convolutional network, eliminating the need for traditional bounding boxes or complex post-processing. This end-to-end trainable system simplifies the segmentation pipeline, learning directly from mask annotations to enhance efficiency and deliver high performance across diverse object segmentation tasks.


      | code | Reference Paper
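
      A toy sketch of the two-branch head described above, with a hypothetical grid size and class count; the full SOLO implementation adds FPN levels, CoordConv inputs, and matrix NMS.

      ```python
      import torch
      import torch.nn as nn

      class SOLOHead(nn.Module):
          """Toy SOLO-style head: an S x S category grid plus S^2 instance masks."""
          def __init__(self, in_ch=256, num_classes=80, grid=12):
              super().__init__()
              self.grid = grid
              # Category branch: one class score per grid cell.
              self.cate = nn.Conv2d(in_ch, num_classes, 3, padding=1)
              # Mask branch: channel k holds the mask for grid cell k = i * S + j.
              self.mask = nn.Conv2d(in_ch, grid * grid, 1)

          def forward(self, feat):
              cate = self.cate(nn.functional.interpolate(
                  feat, size=(self.grid, self.grid),
                  mode="bilinear", align_corners=False))
              masks = self.mask(feat).sigmoid()    # per-cell binary masks
              return cate, masks

      head = SOLOHead()
      feat = torch.randn(1, 256, 64, 64)           # stand-in for one FPN level
      cate_scores, instance_masks = head(feat)
      print(cate_scores.shape, instance_masks.shape)  # (1, 80, 12, 12) (1, 144, 64, 64)
      ```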
    5. Improving Depth Estimation of DinoV2

      depth estimation pic

      This project explored methods to improve depth estimation in DINOv2 using iterative strategies, demonstrating that combining temporal information across frames reduces per-frame errors and improves the scaling accuracy of depth maps. Initially, ORB features and phase correlation were applied to explicitly align the depth maps of consecutive frames. This approach leveraged the spatial shifts between frames to average depth maps and achieved modest reductions in MSE with minimal latency. However, limitations in alignment accuracy due to perspective distortions motivated further refinement.

      Subsequently, a CNN-based adapter was integrated between the DINOv2 encoder and depth adapter. This adapter utilized phase-correlation-derived pixel shifts as additional input to adjust and combine DINO features from consecutive frames. The CNN introduced an inductive bias for local feature alignment, leading to a significant 23.8% reduction in MSE, outperforming vanilla DINOv2-base while being faster. Regularization techniques were later introduced to preserve high-resolution details in depth maps, improving output quality while maintaining accuracy.


      | code | Reference Paper
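
      As an illustration of the explicit alignment step, here is a generic NumPy sketch of phase correlation (not the project's code): the normalized cross-power spectrum of two frames peaks at their translational offset.

      ```python
      import numpy as np

      def phase_correlation_shift(a, b):
          """Estimate the (dy, dx) translation that maps frame b onto frame a."""
          Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
          cross = Fa * np.conj(Fb)
          cross /= np.abs(cross) + 1e-8            # keep phase only
          corr = np.fft.ifft2(cross).real
          dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
          # Wrap shifts larger than half the image back to negative offsets.
          if dy > a.shape[0] // 2: dy -= a.shape[0]
          if dx > a.shape[1] // 2: dx -= a.shape[1]
          return dy, dx

      frame = np.random.rand(128, 128)
      shifted = np.roll(frame, (5, -3), axis=(0, 1))   # synthetic 5-down, 3-left motion
      print(phase_correlation_shift(shifted, frame))   # approximately (5, -3)
      ```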
    6. FitBit

      healthcare bot
      Designed and developed a Django-based AI chatbot for health-related conversations, integrating PostgreSQL for robust patient data management and LangChain for LLM-agnostic model orchestration. Implemented dynamic entity extraction to capture key details like medications and appointment preferences, optimized memory usage for long conversations, and enabled automated escalation of appointment and treatment requests.
      | Github code
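
      A minimal, LLM-agnostic sketch of the dynamic entity-extraction idea; the function names and JSON schema are hypothetical, and in the actual project LangChain routes the call to whichever model backend is configured.

      ```python
      import json

      EXTRACTION_PROMPT = """Extract entities from the patient message below.
      Return JSON with keys: medications (list), appointment_preference (string or null).
      Message: {message}"""

      def extract_entities(message, call_llm):
          """call_llm is any text-in/text-out model function (hypothetical stand-in)."""
          raw = call_llm(EXTRACTION_PROMPT.format(message=message))
          try:
              return json.loads(raw)
          except json.JSONDecodeError:
              # Fall back to empty entities if the model returns malformed JSON.
              return {"medications": [], "appointment_preference": None}

      # Stand-in model so the sketch runs without any API key.
      fake_llm = lambda prompt: (
          '{"medications": ["ibuprofen"], "appointment_preference": "Friday morning"}')
      print(extract_entities("I take ibuprofen; can I come in Friday morning?", fake_llm))
      ```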

    7. Multithreaded Image Blurring with POSIX Threads

      threading img
      | Github code
    8. Scalable ETL Pipelines with Microsoft Azure

      data eng img

      In this project, I implemented a robust ETL (Extract, Transform, Load) pipeline using Azure cloud services to process and analyze data efficiently. The pipeline began with data ingestion from HTTP sources and SQL databases. Using Azure Data Factory (ADF), I created linked services to connect to these data sources and developed data pipelines to automate the extraction of raw data. The ingested data was stored in Azure Data Lake Storage Gen2 (ADLS Gen2) as the raw data layer.

      For data transformation, I utilized Azure Databricks, where the raw data was processed through data cleansing, aggregation, and feature engineering to prepare it for downstream analysis. The transformed data was then stored back in ADLS Gen2 in a structured format. Finally, the processed data was imported into Azure Synapse Analytics for further analysis. The result was an automated, end-to-end pipeline for data movement and transformation built on Azure's ecosystem.

      | Github code
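
      A simplified PySpark sketch of the Databricks transformation layer; the container paths, columns, and aggregation are hypothetical stand-ins for the real pipeline's raw and curated ADLS Gen2 layers.

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("etl-transform").getOrCreate()

      # Raw layer in from ADLS Gen2 (hypothetical path, schema, and credentials).
      raw = spark.read.option("header", True).csv(
          "abfss://raw@storageacct.dfs.core.windows.net/sales/")

      curated = (
          raw.dropna(subset=["order_id"])                      # cleansing
             .withColumn("amount", F.col("amount").cast("double"))
             .groupBy("region", "order_date")                  # aggregation
             .agg(F.sum("amount").alias("daily_revenue"))
      )

      # Write the structured layer back to the lake for Synapse to consume.
      curated.write.mode("overwrite").parquet(
          "abfss://curated@storageacct.dfs.core.windows.net/sales_daily/")
      ```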

    9. Analyzing Consumer Behavior in Mobile Plan Selection Using Statistical Modeling

      nbd picture

      Aim

      The objective of this project is to analyze the distribution of the number of lines included in a customer's primary mobile phone plan. By leveraging statistical modeling, particularly Negative Binomial Distribution (NBD) variations, we aim to understand consumer behavior in selecting family or individual mobile plans. The analysis also seeks to determine underlying patterns in purchasing decisions, quantify market potential, and explore the heterogeneity within the dataset.

      Method

      To analyze the dataset, we employed statistical modeling techniques for count data, fitting several variants of the Negative Binomial Distribution (NBD) to the observed line counts.


      Various model evaluation techniques, including Q-Q plots, chi-square likelihood ratio tests, and p-value assessments, were used to determine the fit and effectiveness of the models.

      Conclusion

      The analysis confirmed that most customers opt for one to four lines per mobile plan, with a significant number of users preferring single-line plans for convenience. The right-censored dataset and high homogeneity suggested that purchasing additional lines is influenced by external factors, such as social contagion, promotions, or family needs.

      | Project Link
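
      A small sketch of the core fitting step: maximum-likelihood estimation of NBD parameters over observed line counts (synthetic counts stand in for the survey data, and the shifted and right-censored variants examined in the project are omitted).

      ```python
      import numpy as np
      from scipy.optimize import minimize
      from scipy.stats import nbinom

      counts = np.random.negative_binomial(n=2.0, p=0.5, size=1000)  # stand-in data

      def neg_log_lik(params):
          """Negative log-likelihood of the NBD with shape r and probability p."""
          r, p = params
          if r <= 0 or not (0 < p < 1):
              return np.inf
          return -nbinom.logpmf(counts, r, p).sum()

      fit = minimize(neg_log_lik, x0=[1.0, 0.5], method="Nelder-Mead")
      r_hat, p_hat = fit.x
      print(f"r = {r_hat:.3f}, p = {p_hat:.3f}")
      ```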

    10. Deepfake Detection

      deepfake pic
      Deepfake detection research focuses on identifying and analyzing manipulated video content generated using advanced generative models. By curating extensive video datasets and developing innovative annotation frameworks, researchers aim to refine the detection of both visual and temporal artifacts. State-of-the-art Video Vision-Language Models (VLMs) such as Video-LLaMA, BLIP, and LLaVA are evaluated by integrating synthetic data and annotated explanations to enhance the categorization of artifacts and improve detection accuracy. Additionally, methodologies like Kendall's tau correlation and reliability analysis are employed to verify data and align annotations. These techniques help assess inter-annotator agreement on deformation labels, providing robust benchmarks for evaluating the performance of current state-of-the-art detection models.
      | Github code
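
      For the inter-annotator agreement step, a minimal example of Kendall's tau on two annotators' ratings (toy data):

      ```python
      from scipy.stats import kendalltau

      # Toy severity ratings of the same 8 clips by two annotators.
      annotator_a = [3, 1, 4, 2, 5, 2, 3, 1]
      annotator_b = [2, 1, 4, 3, 5, 1, 3, 2]

      tau, p_value = kendalltau(annotator_a, annotator_b)
      print(f"tau = {tau:.3f}, p = {p_value:.4f}")  # tau near 1 => strong agreement
      ```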

    11. Traversability Estimation


      Terrain Image
      Developed a terrain classification model using semantic segmentation and an attention-enhanced Fully Convolutional Network, achieving a 2% improvement in IoU. Enhanced off-road navigation for autonomous vehicles by optimizing path planning and terrain adaptability.
      paper | code
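
      A toy sketch of one way to add attention to an FCN: a squeeze-and-excitation style channel gate reweighting feature maps before the segmentation head. The exact attention mechanism used in the project may differ.

      ```python
      import torch
      import torch.nn as nn

      class ChannelAttention(nn.Module):
          """Squeeze-and-excitation style gate over FCN feature channels."""
          def __init__(self, ch, reduction=8):
              super().__init__()
              self.fc = nn.Sequential(
                  nn.Linear(ch, ch // reduction), nn.ReLU(),
                  nn.Linear(ch // reduction, ch), nn.Sigmoid())

          def forward(self, x):
              w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> weights
              return x * w.view(x.shape[0], -1, 1, 1)  # reweight channels

      feat = torch.randn(1, 64, 32, 32)                # FCN features for a terrain patch
      print(ChannelAttention(64)(feat).shape)          # torch.Size([1, 64, 32, 32])
      ```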

    12. Bangalore House Prediction


      house Image
      In this project, a machine learning model was developed to predict house prices in Bangalore using the sklearn library. The dataset was sourced from Kaggle and preprocessed using NumPy and Pandas for data cleaning, including outlier detection, feature engineering, and dimensionality reduction. The model was built using linear regression, with hyperparameter tuning implemented through GridSearchCV and performance evaluated using k-fold cross-validation.

      The application was powered by a Flask server, which acted as the backend to handle HTTP requests. The Flask server loaded the saved machine learning model and processed user inputs like square footage and number of bedrooms to return predicted prices. A front-end interface was built using HTML, CSS, and JavaScript, allowing users to input property details and retrieve price predictions dynamically via calls to the Flask API. The entire workflow integrated Python-based tools and libraries, offering a robust and user-friendly application for real estate price prediction.


      | code
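
      A condensed sketch of the model-selection step described above, with synthetic data standing in for the Kaggle set; Ridge regression and its alpha grid are illustrative stand-ins for the hyperparameters actually tuned.

      ```python
      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import GridSearchCV, KFold
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler

      # Synthetic stand-in: sqft, bedrooms, baths -> price.
      rng = np.random.default_rng(0)
      X = rng.uniform([300, 1, 1], [5000, 6, 6], size=(500, 3))
      y = X @ np.array([90.0, 4e5, 2e5]) + rng.normal(0, 1e5, 500)

      pipe = Pipeline([("scale", StandardScaler()), ("reg", Ridge())])
      search = GridSearchCV(
          pipe,
          param_grid={"reg__alpha": [0.1, 1.0, 10.0]},       # illustrative grid
          cv=KFold(n_splits=5, shuffle=True, random_state=0),  # k-fold validation
          scoring="r2")
      search.fit(X, y)
      print(search.best_params_, round(search.best_score_, 3))
      ```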

    13. Topic Modelling with Latent Dirichlet Allocation


      lda Image
      This project focuses on implementing Latent Dirichlet Allocation (LDA) for topic modeling, a natural language processing technique that classifies text into topics based on the corpus's underlying word distributions. Using libraries like NLTK, Gensim, and SpaCy, text preprocessing involved cleaning data with RegEx and preparing it for analysis. Key steps included generating word clouds to visualize word frequency distributions, computing coherence and perplexity scores to optimize the number of topics, and identifying the dominant topic for each document along with its percentage contribution. The LDA model revealed the proportion of documents belonging to the top six dominant topics, facilitating deeper insights into the dataset's structure. Interactive visualizations further enhanced the interpretability of the results, making the analysis accessible and insightful. This project demonstrated a robust approach to uncovering hidden thematic patterns in text data.
      | code
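
      A compact Gensim sketch of the coherence-scoring step used to pick the number of topics, on a toy corpus; the real pipeline adds NLTK/SpaCy preprocessing and perplexity checks.

      ```python
      from gensim.corpora import Dictionary
      from gensim.models import CoherenceModel, LdaModel

      # Toy tokenized corpus; real documents are cleaned with RegEx/NLTK/SpaCy first.
      texts = [
          ["model", "training", "data", "loss"],
          ["market", "price", "stock", "trade"],
          ["model", "data", "feature", "label"],
          ["price", "trade", "market", "risk"],
      ]
      dictionary = Dictionary(texts)
      corpus = [dictionary.doc2bow(t) for t in texts]

      # Compare coherence across candidate topic counts.
      for k in (2, 3):
          lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                         passes=10, random_state=0)
          cm = CoherenceModel(model=lda, texts=texts,
                              dictionary=dictionary, coherence="c_v")
          print(k, round(cm.get_coherence(), 3))
      ```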