AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Hyunjong Ok1,2*, Suho Yoo2,3*, Hyeonjun Kim1, Jaeho Lee1
1POSTECH 2HJ AILAB 3KAIST
ICASSP 2026 Submission

*Indicates Equal Contribution
[Figure 1: AuditoryBench++ overview]

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, which limits their effectiveness in multimodal interactions. As an initial step toward addressing this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks ranging from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both off-the-shelf models and those augmented with auditory knowledge. We believe our work provides a strong foundation for building language models that can imagine auditory information without direct audio input, ultimately enabling more natural and human-like multimodal reasoning.

Dataset Pipeline

Details of the Pipeline

The construction pipeline is presented below, with tasks grouped by their original resource.

  • AuditoryBench
    • Pitch Comparison: Derived from the AuditoryBench wiki set on sound pitch comparison. This subset primarily consists of instrument-based pairs, allowing for objective and unambiguous evaluations of relative pitch differences (see Range (music)).
    • Animal Sound Recognition: Constructed from the AuditoryBench wiki and test sets. Three authors independently scored each sample on a 0–2 scale, and any item receiving a score of 0 from at least one annotator was removed. Inter-rater agreement was substantial (Kendall’s W = 0.666; ICC(2,k) = 0.75; Cronbach’s α = 0.75), ensuring dataset consistency and reliability.
  • AudioTime
    • Duration Comparison: Built from segment-level annotations. Classes with fewer than 30 samples were excluded, outliers were removed using the IQR rule, and only statistically significant contrasts (p < 0.01) were retained (a filtering sketch follows this list).
    • Loudness Comparison: Derived from the same source, with loudness measured by peak decibel levels in each segment to ensure reliable intensity distinctions. The same statistical filtering process as in the duration task was then applied.
    • Final filtering was applied by the authors to remove classes deemed unsuitable for either task.
  • MMAU
    • Auditory Context Reasoning: Adapted from the open MMAU set. Audio clips were first captioned using Qwen2-Audio to capture salient auditory cues. The captions, together with the original questions, were reformulated by GPT-4o into text-only problems while preserving the reasoning objectives. Human verification and refinement were applied to discard incoherent items and ensure naturalness in a purely text-based setting (a reformulation sketch also follows this list).
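The duration and loudness filtering described above can be sketched as follows. This is a minimal reconstruction under stated assumptions: measurements are given per class (segment durations, or peak dB for loudness), and Welch's t-test is assumed for the pairwise significance check, since only the 30-sample minimum, the IQR rule, and the p < 0.01 cutoff are specified above.

from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

MIN_SAMPLES = 30   # classes with fewer samples are dropped
ALPHA = 0.01       # significance cutoff stated above

def iqr_filter(values):
    # Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

def significant_pairs(class_values):
    # class_values: dict mapping a sound class to its list of measurements
    # (segment durations here; peak dB per segment for the loudness task).
    cleaned = {
        c: iqr_filter(v)
        for c, v in class_values.items()
        if len(v) >= MIN_SAMPLES
    }
    pairs = []
    for a, b in combinations(cleaned, 2):
        _, p = ttest_ind(cleaned[a], cleaned[b], equal_var=False)  # assumed Welch's t-test
        if p < ALPHA:
            pairs.append((a, b))
    return pairs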
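The MMAU reformulation step can likewise be sketched with a hypothetical helper. Captioning with Qwen2-Audio is assumed to have been done separately, and the prompt wording below is illustrative rather than the authors' actual prompt.

# Hypothetical sketch of the GPT-4o reformulation step (prompt and helper names are illustrative).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reformulate(question: str, options: list[str], caption: str) -> str:
    # Rewrite an audio-grounded MMAU question as a text-only problem,
    # given a Qwen2-Audio caption of the clip (captioning not shown here).
    prompt = (
        "Rewrite the following audio-based question as a self-contained, "
        "text-only question that preserves the original reasoning objective.\n"
        f"Audio caption: {caption}\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content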

Task Definition

AuditoryBench++ comprises five tasks evaluating a spectrum of auditory knowledge, from fundamental comparisons to complex, contextually grounded reasoning (illustrative sample formats follow the list):

  1. Pitch Comparison: The model selects which of two sounds has a higher pitch, formulated as a binary decision task.
  2. Duration Comparison: The model compares two described sounds and identifies the one with longer duration.
  3. Loudness Comparison: This task asks the model to select the louder sound between two options based on prompts.
  4. Animal Sound Recognition: This task requires predicting the correct animal corresponding to a given onomatopoeic expression (e.g., 'meow'). Each sample is presented as a multiple-choice question with four options.
  5. Auditory Context Reasoning: This component evaluates a model’s ability to perform contextual auditory reasoning, focusing on interpreting nuanced auditory cues and situational contexts in a multiple-choice format.
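To make these formats concrete, the snippet below shows how individual items could be represented; the field names and values are purely illustrative and do not reflect the released data schema.

# Hypothetical sample layouts (field names are illustrative, not the official schema).
pitch_comparison_sample = {
    "task": "pitch_comparison",
    "question": "Which sound has a higher pitch: a violin or a double bass?",
    "options": ["violin", "double bass"],      # binary decision
    "answer": "violin",
}

animal_sound_sample = {
    "task": "animal_sound_recognition",
    "question": "Which animal makes the sound 'meow'?",
    "options": ["dog", "cat", "cow", "duck"],  # four-option multiple choice
    "answer": "cat",
}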

Dataset Statistics

Answer Distribution

[Figure: Answer distribution across tasks]

#Data

[Figure: Number of data samples]

Dataset Samples

Method

[Figure: Method comparison]

Pipeline of the proposed AIR-CoT:

  1. Data Preparation: Training data is augmented with [imagine] and [/imagine] tokens that mark spans requiring auditory reasoning.
  2. Stage 1 – Span Detection: The model is fine-tuned to detect these spans by generating the special tokens during decoding.
  3. Stage 2 – Knowledge Injection: Upon generating the [/imagine] token, the model pauses, produces an auditory embedding with CLAP for the detected span, and injects it to support auditory reasoning (a sketch follows below).
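Below is a minimal sketch of how this inference loop could be wired together, assuming a HuggingFace causal LM fine-tuned with the added [imagine]/[/imagine] special tokens, the CLAP text encoder from transformers, and a linear projection (learned in Stage 2) from the CLAP embedding space to the LM hidden size. The checkpoint path, helper names, and the choice to append the projected embedding as an extra soft token are illustrative assumptions, not the released implementation.

# Illustrative sketch of AIR-CoT inference (assumptions noted above; not the official code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, ClapModel, ClapProcessor

tok = AutoTokenizer.from_pretrained("path/to/air-cot-lm")        # hypothetical checkpoint
lm = AutoModelForCausalLM.from_pretrained("path/to/air-cot-lm")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
proj = torch.nn.Linear(clap.config.projection_dim, lm.config.hidden_size)  # learned in Stage 2

IMG_OPEN = tok.convert_tokens_to_ids("[imagine]")
IMG_CLOSE = tok.convert_tokens_to_ids("[/imagine]")

def last_span(ids, open_id, close_id):
    # Decode the text between the most recent [imagine] ... [/imagine] pair.
    seq = ids[0].tolist()
    start = len(seq) - 1 - seq[::-1].index(open_id)
    end = len(seq) - 1 - seq[::-1].index(close_id)
    return tok.decode(seq[start + 1:end], skip_special_tokens=True)

@torch.no_grad()
def air_cot_generate(prompt, max_new_tokens=256):
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = lm.get_input_embeddings()(ids)
    for _ in range(max_new_tokens):
        logits = lm(inputs_embeds=embeds).logits
        next_id = logits[:, -1].argmax(dim=-1)                   # greedy decoding
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
        embeds = torch.cat([embeds, lm.get_input_embeddings()(next_id)[:, None]], dim=1)
        if next_id.item() == IMG_CLOSE:
            # Stage 2: encode the detected span with CLAP, project it to the LM
            # hidden size, and append it as an extra soft-token embedding.
            clap_in = clap_proc(text=[last_span(ids, IMG_OPEN, IMG_CLOSE)], return_tensors="pt")
            audio_emb = proj(clap.get_text_features(**clap_in))  # shape (1, hidden_size)
            embeds = torch.cat([embeds, audio_emb[:, None]], dim=1)
        elif next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=False)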

BibTeX

TBD