Instant Messaging and Socialized Evaluation
This is a short post about human-agent collaboration in instant messaging, and why we need socialized evaluation more than yet another benchmark.
We are currently raising bub in group chats and watching skills transfer between bub and another agent friend, kapy. That experience triggered this piece. In one multi-person group chat, we collaborate with three or four agents at the same time. Tasks are concurrent and fragmented, and we always need to be ready to switch. Some people care most about video and image analysis. Some only want a bedtime story. Some are stuck on how to use a live CD to repair a Linux desktop environment. And all of that is mixed with holiday greetings and local culture roundups. These agents, by the way, were built from scratch, not adapted from existing mainstream implementations.
The "Honey Pot" of IM
The popularity of bots like OpenClaw is probably not because IM is inherently better at task dispatch. Agent systems have long presented themselves as chat interfaces, but I prefer to call this a victory of entry points. Humans are most familiar with input boxes and chat boxes; they are the most convenient and most frequently used UI we have.
IM goes one step further, because it is already part of modern life. Interaction is naturally asynchronous, and expectations are lower as a result. After all, there is now a partner that can respond 24/7 and help complete tasks. In an interface you already know, native notifications and the "just ask casually" mindset make everything feel reachable. Sometimes one short message covers a task that previously required opening web pages and editors.
But IM is not a natural collaboration model. It is made of fragmented, multi-turn asynchronous conversations, and a single thread can hardly carry a complete collaboration and its relationship network. When you suddenly get excited about meaningful progress on an old task, an agent may cheerfully jump in on a thread that no longer matters. It got lost in the context long ago.
"We need a new IM" is a tempting misunderstanding. IM is only part of the channel. It is a lower-cost and friendlier entry point. A more talkative bot, or an IM that looks more like a collaboration tool, still cannot give you everything you want. Context should not be historical baggage we must carry all the time. It should be a working set constructed on demand, over and over. Collaboration objects and states need clearer representations and interaction patterns, just like how we still keep creating boards and todo lists.
The world of agents should not be trapped inside chat logs.
Toward Socialized Evaluation
Every time a new foundation model is released, we are pushed to celebrate or to feel disappointed. The model beats humans again on difficult tasks like coding and math, and keeps setting new records.
I am not trying to reject that. Benchmarks clearly have value: reproducibility, comparability, standardization, and scalability. We should recognize and respect these results. Throughout human history we have relied on similar mechanisms as references for capability evaluation, much like the exams we must pass before we earn the freedom to doodle on the paper.
But when agents enter IM, group chats, and human society, they face "real tasks." They are no longer dealing with clean questions but with messy life. Benchmarks rarely cover this: incomplete context, vague intent, parallel topics, and key coordination that may be happening somewhere else. An agent cannot know everything, and does not need to answer everything, but it must still earn people's preference. Capability then becomes a coexistence problem.
Socialized evaluation is ordinary and grounded. Put agents into daily human life. Let them face interruptions, awkward silences, digressions, emotional expression, and asymmetric information. Watch how they struggle through uncertain context while still pushing most tasks forward, just as we do in human collaboration. Socialized evaluation is hard to summarize with a score and hard to turn into rankings, but it provides direct insight into human-agent collaboration. You will find out how willing you really are to let an agent into your life.
Form factors like OpenClaw move agents into a new position. They are no longer tools you switch to; they become part of communication itself. Problems will still get solved, but whether it was "worth it" becomes a new criterion. Even the smartest models must keep shedding assumptions about background context. Socialized evaluation will force agents to acknowledge the shape of the real world, and then survive in it.