Verdigris LLM Chatbot: Usability Study
Qualitative Research
Usability Testing
Thematic analysis
Validated an LLM-based energy management chatbot for data center technicians through mixed-methods usability testing, identifying 14 prioritized usability issues and achieving 100% user interest in adoption.
  • Context: A sponsored research project by Verdigris at the University of Washington.
  • Role: Lead UX Researcher (Team of 5)
  • Timeline: Jan 2025 – Mar 2025
  • Collaboration: Spearheaded the research design and managed a team of 4 researchers to deliver prioritized friction points and trust calibration strategies to the engineering team.
100%
User Interest
All external participants expressed desire to integrate the chatbot into their workflow.
14
Issues Identified
Usability barriers discovered across interface, performance, and data categories.
7
Participants
Data center technicians and facility managers tested the prototype.
The Challenge: Validating AI Before Launch
Business Context
Before product launch, Verdigris needed to validate whether their LLM-powered chatbot prototype could effectively help data center technicians optimize energy management. Existing customers struggled to analyze energy data efficiently and needed faster ways to monitor consumption and address maintenance issues.
The Stakes
Without proper validation, launching could mean wasted development resources, poor user adoption, and missed market opportunities in the competitive energy management space.
Research Goal 1
Validate the value proposition by determining if the chatbot meets customer needs.
Research Goal 2
Identify usability issues hindering users from addressing their pain points.
Research Goal 3
Explore additional use cases beyond energy anomaly detection, along with market entry opportunities.
Approach: Heuristics + User Testing
Given the tight product development cycle before launch, we designed a four-phase mixed-methods study combining heuristic evaluation, internal testing, moderated remote usability testing with external users, and post-test interviews. This approach allowed us to catch obvious issues early with internal testing before validating with real users.
1
Heuristic Evaluation
Team assessed prototype against usability principles, identifying early issues before user testing
2
Internal Testing
4 Verdigris employees role-played as data center technicians to catch technical and functional issues
3
External Validation
3 real data center professionals provided authentic insights on workflow integration and data needs
4
Post-Test Interviews
Structured debriefs captured reflections on experience, expectations, and integration opportunities
Why this combination? Heuristic evaluation caught interface issues quickly and cheaply. Internal testing validated fixes before external sessions. Real users revealed workflow context that role-playing couldn't capture, like what data they actually need to identify root causes.
Research Analysis Framework
Our evaluation strategy combined qualitative and quantitative methods to validate the chatbot prototype's value proposition, identify usability barriers, and guide product development. The framework employed thematic analysis, sentiment analysis, and descriptive statistics across three primary research goals.
Goal 1: Validate Value Proposition
Methods: Thematic Analysis, Descriptive Statistics
Key Metrics:
  • Categorization of user needs and expectations
  • Number of users reporting efficiency improvements
  • Adoption willingness percentage
  • Information usefulness ratings
Goal 2: Identify Usability Issues
Methods: Sentiment Analysis, Thematic Analysis, Descriptive Statistics
Key Metrics:
  • List of frustration points during navigation
  • Types of usability barriers encountered
  • Expectation gaps in chatbot responses
Goal 3: Guide Product Direction
Methods: Thematic Analysis, Comparative Analysis
Key Metrics:
  • Additional use cases identified
  • User group adoption patterns
  • Specific improvement recommendations
Data Collection & Analysis Pipeline
The UW team aggregated responses in Excel for synthesis, focusing on qualitative methods alongside descriptive statistics. Data collection combined observation and debrief interviews to capture both behavioral patterns and user perceptions; a minimal sketch of the tally step follows the technique list below.
Analysis Techniques:
  • Thematic Analysis: Identifies recurring themes and patterns in user responses
  • Sentiment Analysis: Assesses positive, neutral, or negative feedback sentiment
  • Comparative Analysis: Segments insights based on workflows and user needs
  • Descriptive Statistics: Measures frequency trends and user struggle percentages
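To make the pipeline concrete, here is a minimal sketch of the tally step, assuming coded session notes are exported from the Excel workbook as (participant, theme, sentiment) rows. The field layout and sample values are hypothetical, not the actual workbook structure.

# A hypothetical tally over coded rows; names and sample values are illustrative.
from collections import Counter

coded_rows = [
    ("P01", "entry-point confusion", "negative"),
    ("P02", "entry-point confusion", "negative"),
    ("P02", "workflow integration", "positive"),
    # ... remaining coded rows from observation and debrief notes
]

participants = {p for p, _, _ in coded_rows}

# Thematic analysis: how often each recurring theme was coded.
theme_counts = Counter(theme for _, theme, _ in coded_rows)

# Descriptive statistics: share of participants who hit each theme at least once.
for theme in theme_counts:
    hit = len({p for p, t, _ in coded_rows if t == theme})
    print(f"{theme}: {hit}/{len(participants)} participants ({hit / len(participants):.0%})")

# Sentiment analysis: balance of positive / neutral / negative coded feedback.
print(Counter(s for _, _, s in coded_rows))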
Participant Recruitment
Participant Demographics & Criteria
Industry
Data centers, with a focus on single-tenant facilities.
Geography
North America
Role
Decision-makers in energy management, operations, and IT, such as facility managers, energy managers, and data center operators.
Recruitment
Reaching the right participants within a limited timeline was challenging. We contacted 24 candidates in total: 4 internal employees, 10 in-network outreaches by Verdigris (20% conversion), and 10 cold LinkedIn outreaches by our team (10% conversion), yielding 4 internal and 3 external participants.
Participant 01
Service Line Manager, 15 years industry experience
Participant 02
Director of Facility Management, 14 years in role
Participant 03
Data Center Operation Engineer, 2.5 years industry experience
Top 3 Critical Insights That Changed Everything
1
Users Can't Find Where to Continue Conversations
7 out of 7 participants struggled to locate the entry point for follow-up queries. The chatbot's single-query design conflicted with modern chat expectations, forcing users to scroll back up—breaking their flow and causing frustration.
"I keep looking for where to type my next question... why do I have to scroll all the way back up?"
— Participant during testing
2
Relevant Answers Still Disappoint Users
While 89% of AI responses were technically relevant, 43% failed to meet user expectations. Even relevant answers lacked the contextual data needed for troubleshooting—like ambient conditions, electrical information, and root cause analysis.
"The chatbot tells me there's short cycling, but it doesn't give me the building information or ambient conditions I need to actually fix the problem." — External participant
3
Performance Kills Trust Before Content Can Build It
27 out of 55 queries took over 15 seconds to respond (average: 40 seconds). Combined with technical errors affecting 4 of 7 users, slow performance undermined confidence in the tool before users could evaluate answer quality.
"I'm sitting here waiting... is it working? Did it break? This is too slow for my workflow."
— Facility manager during testing
Three Categories of Usability Issues
Interface Usability
Issues related to the chatbot's user interface, such as unclear navigation,
confusing layouts, and inefficient flows
that hinder user interaction.
Performance Usability
Issues affecting the chatbot's responsiveness and reliability, including load times, technical errors, and system stability concerns.
Data Usability
Issues affecting data retention and trust, including scannability, validity, and retrieval of responses that impact user confidence.
  • Our testing framework categorizes findings into these three areas to provide actionable insights for improvement. Each category addresses specific pain points that emerged during user testing sessions.
  • Our findings are categorized by severity levels: High (observed with almost all participants, large impact), Medium (40-70% of participants, creates inconvenience), and Low (few participants, minimal impact). A rule-of-thumb encoding of these incidence bands is sketched below.
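The 40-70% band below comes directly from the rubric; reading "almost all" as above 70% and "few" as below 40% is our interpretation, and the rubric's impact dimension still requires researcher judgment.

def severity(users_affected: int, total_users: int) -> str:
    # Incidence-only encoding of the rubric; the outer cutoffs are assumptions.
    share = users_affected / total_users
    if share > 0.7:
        return "High"    # observed with almost all participants
    if share >= 0.4:
        return "Medium"  # 40-70% of participants
    return "Low"         # few participants

assert severity(7, 7) == "High"    # e.g., the unclear entry point (7 of 7 users)
assert severity(4, 7) == "Medium"  # ~57% sits inside the 40-70% band
assert severity(1, 7) == "Low"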
Micro-Level Cross-Analysis of AI Reliability
"Even when AI provides accurate facts, factors like response formatting or latency can destabilize user trust.
I conducted this deep-dive to identify the precise inflection points of Trust Calibration."
The Approach: Multidimensional Correlation Mapping
To uncover the nuanced relationship between AI Relevance and User Satisfaction, I performed a comprehensive re-analysis of all moderated sessions. Moving beyond binary success/failure metrics, I mapped Answer Load Time (latency) against Emotional Cues and real-time Workflow Expectancy to visualize the user’s cognitive journey.
The Rigor: Decoding Implicit Feedback
  • Timestamp-Based Latency Analysis: By analyzing waiting periods and immediate user reactions on a second-by-second basis, I identified the "Threshold of Patience" and the exact moments where trust began to erode due to system delays. A minimal version of this pass is sketched after this list.
  • Emotional & Verbal Coding: I meticulously coded implicit feedback—such as micro-expressions of uncertainty when a user verbally said "Helpful," or skeptical reactions to overly detailed responses—to capture the gap between what users say and what they actually feel.
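A minimal sketch of the timestamp pass, assuming each logged query carries session-relative submit and first-response timestamps plus the emotional cue coded for that waiting period. Field names and sample rows are hypothetical.

from dataclasses import dataclass

@dataclass
class QueryLog:
    participant: str
    submitted_s: float   # session-relative time the query was sent
    answered_s: float    # time the answer started rendering
    cue: str             # emotional cue coded for the wait, e.g. "neutral", "impatient"

logs = [
    QueryLog("P05", 312.0, 322.4, "neutral"),
    QueryLog("P05", 401.0, 443.8, "impatient"),
    QueryLog("P06", 128.0, 171.2, "skeptical"),
    # ... remaining logged queries from the 55 trials
]

PATIENCE_THRESHOLD_S = 15.0  # waits beyond this drew explicit dissatisfaction

# Flag moments where latency and a negative cue coincide: candidate trust inflection points.
for log in logs:
    wait = log.answered_s - log.submitted_s
    if wait > PATIENCE_THRESHOLD_S and log.cue != "neutral":
        print(f"{log.participant}: waited {wait:.0f}s, cue={log.cue}")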
Critical Interface & Performance Issues
Unclear Entry Point
7 out of 7 users struggled to find where to type follow-up queries after the first interaction. Users had to scroll up to locate the input field.
Recommendation: Place entry point after each generated query to match modern chatbot conventions.
Return Key Confusion
20 out of 55 interactions used the "return" key expecting to submit, but it created a new line instead.
Recommendation: Make "Enter" submit and "⌘+Enter" create new lines, mirroring standard chat interfaces.
51%
Fast Responses
28 of 55 answers averaged 10 seconds
49%
Slow Responses
27 of 55 answers averaged 40 seconds

4 out of 7 users expressed explicit dissatisfaction with response times exceeding 15 seconds. Recommendation: Optimize model performance and enable progressive response display.
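Progressive display does not shorten generation, but it replaces a blank 40-second wait with visible progress. A minimal sketch, assuming the model output can be consumed as a chunk stream; generate_stream is a hypothetical stand-in, not the actual Verdigris API.

import sys
import time

def generate_stream(query: str):
    # Hypothetical stand-in for a streaming model call; yields chunks as produced.
    for chunk in ["Short cycling detected ", "on CRAC unit 3 ", "between 02:10 and 02:45."]:
        time.sleep(0.5)  # simulated generation delay
        yield chunk

def render_progressively(query: str) -> None:
    # Print partial output as soon as it arrives instead of waiting for the full answer.
    for chunk in generate_stream(query):
        sys.stdout.write(chunk)
        sys.stdout.flush()
    print()

render_progressively("Why did rack power spike last night?")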
Data Usability Challenges
1
Data Not Always Useful
6 of 6 users found that information relevant to fixing issues was not always present, and the data did not use the same measurement units that building operators work with.
Fix: Identify and present information useful to technicians: timestamps, root causes, prescriptive actions, and contact information. Allow customization of measurement units.
2
Data Not Scannable
4 of 6 users took up to 40 seconds just to scan responses for relevance. Dense tables and unstructured text hindered quick comprehension.
Fix: Meaningfully group data, present chronologically, and add visualization and graphics for easier scanning.
3
Doubts About Validity
4 of 6 users questioned whether they could trust the generated responses. One participant stated: "All of these need to be taken with a grain of salt."
Fix: Show source data to build user confidence and trust in the system's outputs.
4
Difficult Retrieval
5 of 7 users had difficulty retrieving previous query responses. History stored only one query at a time and did not maintain conversational context.
Fix: Maintain context for queries within each session, and group chat history by date or issue, or let users name sessions (see the sketch below).
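One way to realize that fix is to persist whole sessions rather than single queries. A minimal sketch under that assumption; the class and field names are hypothetical, not Verdigris's data model.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Turn:
    query: str
    answer: str

@dataclass
class Session:
    name: str    # user-nameable, e.g. "Oct 2 outage"
    day: date
    turns: list = field(default_factory=list)

    def record(self, query: str, answer: str) -> None:
        # Every turn stays in the session, so follow-ups keep their context.
        self.turns.append(Turn(query, answer))

history = []
s = Session("Oct 2 outage", date(2024, 10, 2))
s.record("What tripped breaker B4?", "Overcurrent at 02:13 ...")
s.record("What was the ambient temperature then?", "24.1 C at the CRAC intake ...")
history.append(s)

# Group sessions by date for a retrievable history sidebar.
by_day = {}
for sess in history:
    by_day.setdefault(sess.day, []).append(sess.name)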
User Satisfaction & Technical Stability
Answer Relevance vs. User Satisfaction
Despite 89% of answers being clearly relevant to their queries, users were dissatisfied with 36% of those relevant answers. Combined with the 11% of irrelevant answers, 43% of all answers fell short of expectations (0.89 × 0.36 + 0.11 ≈ 0.43), and 6 of 7 users experienced answers not meeting their needs.
This gap between relevance and satisfaction shows that accurate information alone is not enough: it must be presented in a way that aligns with user expectations and workflows.
89%
Relevant Answers
64%
Satisfied with Relevant Answers
11%
Irrelevant Answers
Technical Errors Encountered
Attribute Error
3 out of 7 users experienced attribute errors during answer generation. This occurred in 3 out of 55 trials.
AttributeError: 'str' object has no attribute 'get'
Fatal Error
1 out of 7 users experienced a fatal error causing session termination due to context length exceeding 128,000 tokens.
Critical: Enhance system stability through comprehensive QA and user-friendly error messages.
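For context on the fixes: the AttributeError above is the classic symptom of treating a payload that is still a JSON string as a parsed dict, and the fatal error points to unmanaged context growth. A minimal defensive sketch under those assumptions; the helpers are hypothetical, not Verdigris's actual code.

import json

MAX_CONTEXT_TOKENS = 128_000  # the context limit that terminated one session

def coerce_payload(payload):
    # Guard against "'str' object has no attribute 'get'": parse strings before .get() calls.
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError:
            return {"error": "Sorry, something went wrong. Please try your question again."}
    return payload

def trim_history(messages, estimate_tokens):
    # Drop the oldest turns before the context limit aborts the whole session.
    while messages and sum(estimate_tokens(m) for m in messages) > MAX_CONTEXT_TOKENS:
        messages.pop(0)
    return messages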
The Outcome: Validated Direction, Prevented Costly Post-Launch Fixes
100% User Interest
All 3 external participants expressed desire to integrate the chatbot into their workflows and continue involvement in future studies
14 Issues Prioritized
Categorized by severity (4 high, 7 medium, 3 low) across interface, performance, and data usability
Product Roadmap Informed
Research directly shaped pre-launch improvements, preventing costly post-launch fixes and user churn
"I have been absolutely begging our consultants to help us develop something like this!"
— Blair Richardson, Service Line Manager
The study validated strong market demand while identifying critical barriers that would have undermined adoption. By catching these issues pre-launch, Verdigris could refine the product with confidence, knowing exactly what users needed for successful integration.
Integration Clarity
6 of 7 participants wanted the chatbot integrated with existing fault detection systems as a virtual assistant
Market Expansion
Identified additional use cases: energy penalties, cost analysis, consumption prediction, and proactive issue detection
Trust Requirements
Discovered need for source data transparency and contextual information to build user confidence
Key Learnings: Balancing Business and Research
Three Takeaways
Research Velocity
Mixed methods allowed us to move fast without sacrificing quality. Heuristic evaluation and internal testing caught 60% of issues before expensive external recruitment.
Stakeholder Communication
Categorizing issues by severity (high/medium/low) helped Verdigris prioritize fixes within their launch timeline, demonstrating research's direct business impact.
User Empathy
External participants revealed that "relevant" answers aren't enough. Users need contextual data and actionable insights to actually solve problems in their workflow.
What I'd Do Differently
Provide participants with concrete troubleshooting scenarios instead of free exploration. Users spent too much time learning the dataset rather than evaluating the chatbot's effectiveness. A realistic prompt like "On Oct 2nd, there was an outage—find out what went wrong" would have yielded deeper insights faster.
The PM-Researcher Balance
This project taught me how to balance business urgency with research rigor. Verdigris needed quick validation before launch, but rushing would have missed critical issues. By starting with heuristic evaluation and internal testing, we caught obvious problems early, allowing external sessions to focus on deeper workflow and integration questions.
Looking Forward: Recommendations for Next Phase
1
Test After Integration
Validate the next iteration after integrating with the fault detection platform to assess real workflow impact.
2
Expand Recruitment
Reach out to 50+ candidates to achieve the target sample size, given the 10% cold outreach conversion rate.
3
Investigate Trust
Include explicit trust-related questions to understand why users question AI-generated data accuracy.
4
Scenario-Based Tasks
Provide realistic troubleshooting scenarios instead of free exploration to evaluate chatbot effectiveness.
This study demonstrated the value of validating AI products before launch. By identifying critical usability barriers early, Verdigris can refine their chatbot with confidence, knowing they're building something users actually want and will adopt.
Impact Beyond This Project: The research framework and severity categorization system we developed for Verdigris have been adopted for their ongoing FDD Copilot development, ensuring user-centered design continues throughout the product lifecycle.