aicompletedJan 2024 – Jun 2024

LLM Incident Chatbot

Reducing mean incident query resolution time by 60%

Role: Software Engineering Program InternDuration: 6 months

Overview

When production incidents hit, every minute counts. Engineers were spending too long searching through runbooks and Slack threads for answers. I built an LLM-powered chatbot that understood incident context, searched historical data, and surfaced relevant solutions instantly. The 60% reduction in resolution time meant fewer late nights and faster recovery.

The Problem

When production incidents occurred, engineers spent significant time searching through runbooks, Slack threads, and documentation to find relevant solutions. Mean query resolution time was high, and institutional knowledge was scattered across multiple systems.

The Approach

I designed an LLM-powered chatbot using OpenAI APIs that understood incident context and searched historical incident data to surface relevant solutions. The system combined semantic search over past incidents with LLM-generated summaries and recommendations.

Built with Python for the backend inference pipeline, React and TypeScript for the frontend interface, and PostgreSQL for storing and indexing historical incident data. The architecture allowed for real-time streaming of responses, giving engineers immediate feedback while the system searched through historical records.

Key Decisions

Using OpenAI APIs for the LLM layer allowed rapid iteration on prompt engineering while maintaining high response quality. The system prompt was carefully crafted to prioritize accuracy and cite specific past incidents rather than generate speculative solutions.

I implemented a retrieval-augmented generation (RAG) pattern - the LLM always grounded its responses in actual historical incident data, reducing hallucination risk in a high-stakes operational context.

Impact

The chatbot reduced mean incident query resolution time by 60%. Engineers could get relevant context and suggested solutions within seconds instead of manually searching through multiple systems. The tool became a standard part of the incident response workflow.

Technology Stack

OpenAI APIPythonReactTypeScriptPostgreSQL