Can AI help product management summarize customer feedback?

Summarizing customer feedback is one of the most common “you’ve got to try this” AI-for-product-managers cases I’ve seen, so I did an experiment, and while it’s a potentially good tool, you need to keep reading the feedback yourself.

Reading feedback builds character, and I’d argue it’s a crucial part of any good product manager’s quest to better understand their customers. You’re looking for sentiment, yes, and also patterns of complaints, but the truly great finds are in strange outliers when you discover someone’s using your product in a new way, or there’s one complaint that makes the hair on your neck stand up.

I was concerned going in that LLMs are good at generating sentences word by most probable word, they’re about what the consensus is, and often a past, broader consensus. In my own experience, if you ask about an outdated but extremely popular and long-held fitness belief, the answers you’ll get will reflect the outdated knowledge. And I’ve run into problems with text summarization also resulting in plausible confabulations where re-writing a description of a project suddenly includes a dramatic stakeholder conflict of the type that often does occur.

So given a huge set of user comments, will summarization find unique insights, or sand off the edges and miss the very things a good product manager will spot? Is it going to make up some feedback to fill a gap, or add some feedback that fits in?

Let’s go!

The test

I imagined a subscription-based family recipe and meal planning tool, ToolX. It’s is generally pretty good but the Android client doesn’t have all the features, and with functional but ugly design that doesn’t handle metric units well.

I wrote just under 40 one-line comments you’d get in a common “thumbs up/thumbs down & ask for a sentence” dialogue. I tried to make them as like actual comments I’ve seen from users before, a couple feature suggestions, some people just typing “idk” in the text box… and then threw in a couple things I’d want a good product manager to catch.

  1. POISON. Actual poison! Snuck in after a positive comment opening: “Works great for recipe storage, AI suggestions for alterations are sometimes unhealthy or poisonous which could be better.“ You should drop everything and see what this is about. Do not stop and see if poisoning results in increased engagement from social media. This should be “you’re reaching out to this person however you can while the quest to find a repro case kicks off” level.
  2. Specific UX issue: there’s one complaint about needing color blind mode. If you’ve missed accessibility in your design, that should be a big deal, you should also put this on the list (below the poison issue)
  3. Irrelevant commentary: I have someone complaining about coming into a sandwich shop and they can’t get served because the shop is closing. (Who knows where these come from – bots? People copy and pasting or typing into the wrong window, or otherwise being confused?) You just gotta toss these.
  4. Interesting threads to pull on: someone’s using this family tool for themselves and it makes them feel alone. Someone’s using it for drink mixing. Someone thinks it’s not for the whole family if it doesn’t do recipes that are pet-friendly.
    The prompt was “I’m going to upload a file containing user feedback for an app, every line is a separate piece of feedback. Can you summarize the major areas of feedback for me?”

(yes, it’s bare-bones and inelegant, patches welcome)

What happened

ChatGPT 4o won the coin toss to go first (link to the thread)

This looks immediately useful. You could turn this into an exec and probably get away with it (look forward to that “sometimes meets expectations” rating in a year!)

Organization’s fine:

  1. General Sentiment
  2. Features People Like
  3. Feature Requests and Suggestions
  4. Technical & Pricing Issues
  5. Outliers

As you scan, they seem filled with useful points. A little unorganized and the weighting of what to emphasize is off (calling out ‘drink mixing’ as a feature someone likes, when that’s not a feature and it’s only mentioned once), but generally:

The good

  • almost everything in the sample set that could be considered a complaint or request is captured in either feature requests or issues
  • the summaries and grouping of those are decent in each category
  • the mention of someone using it solo and feeling lonely is caught (“One user mentioned the app working well but feeling lonely using it solo—potentially valuable feedback if targeting more than just families.”)

The bad

  • Misses poison! POISON!!! Does not bring up the poison at all. Does not surface that someone is telling you there’s poison in the AI substitution — the closest it gets is “People want help with substitutions when ingredients are unavailable” which is a different piece of feedback
  • It represents one phone complaint as “doesn’t work on some devices” when it’s one device. So “Device compatibility” is a bullet point in technical & pricing issues for one mention, at the same level of consideration as other, more-prevelant comments. This is going to be a persistent issue.

I’d wonder if the poison is being ignored because the prompt said “major areas of feedback” and it’s just one thing — but then why are other one-offs being surfaced?

(If I was of a slightly more paranoid mind, I might wonder if it’s becaus it’s a complaint about AI, so it’s downplaying the potentially fatal screw-up. It’d be interesting to test this by feeding complaints about AI and humans together and seeing if there’s bias in what’s surfaced.)

Trying with other models

They did about the same overall. Some of them caught the poison!

Running this again, specifying ChatGPT 4o explicitly in Perplexity: this time 4o did call out the AI substition (“AI suggestions for recipe alterations are sometimes unhealthy or inappropriate”) but again did not mention poisoning. Did the same turning one comment into “users want…”. Did not note it was throwing out the irrelevant one. (link)

Gemni 2.5 Pro did note the poison in a way that reads almost dismissively to me (“AI-driven recipe alterations were sometimes seen as unhealthy or potentially unsafe (“poisonous”).”) Yeah! Stupid “humans” with their complaints about “poisons.” Otherwise same generally good-with-overstating-single-comments. Did note the irrelevant comment. (link)

Claude 3.7 Sonnet. Does bring up the poison, also softened significantly (“Concerns about AI-suggested recipe alterations being unhealthy or even dangerous”). Same major beats, different bullet point organization, same issue making one piece of feedback seem like it’s a wide problem (“performance problems on specific devices” when there’s only one device-specific). Noted the review it tossed, noted the chunk of “very brief, non-specific feedback”.

Interestingly, one piece of feedback “Why use this when a refridgerator note is seen by everyone and free? $10 way too high per month for family plan” is lumped into pricing/subscription elsewhere, and here Claude brings this up as “Questions about value compared to free alternatives” which made me laugh. (link)

Grok-2 treated the poison seriously! Organized into positive/Areas of Improvement/Neutral/Suggestions for Development, the first item in Areas for Improvement was “Health and Safety: There are concerns about AI suggestions for recipe alterations being potentially unhealthy or even poisonous.” Woo! Subjectively, I felt like this did the best summary of the neutral comments just be noting there (“Some users find the app decent or pretty good but not exceptional, suggesting it’s adequate for their needs but not outstanding.”) (link)

Commonalities

If I shuffled these, I think I’d only be able to identify ChatGPT because of the poison — they all read the same in terms of generic organization, detail, level of insight offered, effectiveness in summarization. (If you’ve got a clear favorite, please, I’d love to hear why). And they all essentially made the same points, sometimes grouped a little differently, or in different sections.

None of them had confabulation (that I caught) in any of the answers, which was great, especially after yesterday’s debacle.

None of them took the sandwich shop complaints seriously. I found it interesting some would note that they saw that irrelevant comment, others elided it entirely.

Useful, but don’t give up reading it yourself

I can see where a good product manager could do a reading pass where they’re noting the really interesting stuff that pops out to them, leaving the bulk group-and-summarize to a tool, saving themselves the grind of per-comment categorizing or tagging, returning to validate the summary against their own reading, and re-writing to suit. I wouldn’t suggest it as a first pass, as it would be difficult to the bias it’ll introduce when you approach the actual feedback.

(Or I can see with additional follow-up questions that you could probably whip any of these into better shape, and as you saw from the prompt, that is intentionally bare bones, you could also just start off better.)

If I had a junior product manager turn in any of those summaries to me, and I’d also done the reading, I’d be disappointed at the misses and the superficial level of insight. What if I hadn’t, though? Would I sense that they hadn’t done the legwork? I worry I might not.

My concern is it’s so tempting, and if you only threw your feedback into one of the tools and called it a day, you’d be doing the customers, your team, and yourself a disservice. I don’t know a good product manager who isn’t forever time-crunched, and it’s going to be easy to justify not investing in doing the reading first, and then leaving it for later in-depth follow-up that doesn’t happen, and never building those empathy muscles, the connection, and meanwhile your customers are all dying from AI ingredient substitutions and the team can’t figure out why your most active and satisfied customers aren’t using the app as much.

So please: do the reading, whatever tools you’re employing.

One thought on “Can AI help product management summarize customer feedback?

  1. E.H.

    I appreciate the care and consideration you bring to the topic. I remain pretty firmly in the f*AI camp, primarily due to fairly clear association with potential to erode human cognitive functions over time with regular use, and encroachment on legitimate creative endeavors, but I get where it may have limited value in workload management in some use-cases, and I value your thoughtful (and skillful) analysis in practical scenarios. Thanks for that.

Leave a Reply

Your email address will not be published. Required fields are marked *