AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Hyunjong Ok1,2*, Suho Yoo2,3*, Hyeonjun Kim1, Jaeho Lee1
1POSTECH 2HJ AILAB 3KAIST
ICASSP 2026 Submission

*Indicates Equal Contribution
[Figure 1: AuditoryBench++ overview]

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, which limits their effectiveness in multimodal interactions. As an initial step toward addressing this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks ranging from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both off-the-shelf models and those augmented with auditory knowledge. We believe our work provides a strong foundation for building language models that can imagine auditory information without direct audio input, ultimately enabling more natural and human-like multimodal reasoning.

Dataset Pipeline

Details of the Pipeline

The construction pipeline is presented below, with tasks grouped by their original resource.

  • AuditoryBench
    • Pitch Comparison: Derived from the AuditoryBench wiki set on sound pitch comparison. This subset primarily consists of instrument-based pairs, allowing for objective and unambiguous evaluations of relative pitch differences (see Range (music)).
    • Animal Sound Recognition: Constructed from the AuditoryBench wiki and test sets. Three authors independently scored each sample on a 0–2 scale, and any item receiving a score of 0 from at least one annotator was removed. Inter-rater agreement was substantial (Kendall’s W = 0.666; ICC(2,k) = 0.75; Cronbach’s α = 0.75), ensuring dataset consistency and reliability.
  • AudioTime
    • Duration Comparison: Built from segment-level annotations. Classes with fewer than 30 samples were excluded, outliers were removed using the IQR rule, and only statistically significant contrasts (p < 0.01) were retained (a filtering sketch follows this list).
    • Loudness Comparison: Derived from the same source, with loudness measured by peak decibel levels in each segment to ensure reliable intensity distinctions. The same statistical filtering process as in the duration task was then applied.
    • Final filtering was applied by the authors to remove classes deemed unsuitable for either task.
  • MMAU
    • Auditory Context Reasoning: Adapted from the open MMAU set. Audio clips were first captioned using Qwen2-Audio to capture salient auditory cues. The captions, together with the original questions, were reformulated by GPT-4o into text-only problems while preserving the reasoning objectives. Human verification and refinement were applied to discard incoherent items and ensure naturalness in a purely text-based setting (a reformulation sketch also follows this list).
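The duration and loudness filtering described above can be sketched as follows. This is a minimal reconstruction under stated assumptions: measurements are given per class (segment durations, or peak dB for loudness), and Welch's t-test is assumed for the pairwise significance check, since only the 30-sample minimum, the IQR rule, and the p < 0.01 cutoff are specified above.

from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

MIN_SAMPLES = 30   # classes with fewer samples are dropped
ALPHA = 0.01       # significance cutoff stated above

def iqr_filter(values):
    # Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

def significant_pairs(class_values):
    # class_values: dict mapping a sound class to its list of measurements
    # (segment durations here; peak dB per segment for the loudness task).
    cleaned = {
        c: iqr_filter(v)
        for c, v in class_values.items()
        if len(v) >= MIN_SAMPLES
    }
    pairs = []
    for a, b in combinations(cleaned, 2):
        _, p = ttest_ind(cleaned[a], cleaned[b], equal_var=False)  # assumed Welch's t-test
        if p < ALPHA:
            pairs.append((a, b))
    return pairs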
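The MMAU reformulation step can likewise be sketched with a hypothetical helper. Captioning with Qwen2-Audio is assumed to have been done separately, and the prompt wording below is illustrative rather than the authors' actual prompt.

# Hypothetical sketch of the GPT-4o reformulation step (prompt and helper names are illustrative).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reformulate(question: str, options: list[str], caption: str) -> str:
    # Rewrite an audio-grounded MMAU question as a text-only problem,
    # given a Qwen2-Audio caption of the clip (captioning not shown here).
    prompt = (
        "Rewrite the following audio-based question as a self-contained, "
        "text-only question that preserves the original reasoning objective.\n"
        f"Audio caption: {caption}\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content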

Task Definition

AuditoryBench++ comprises five tasks evaluating a spectrum of auditory knowledge, from fundamental comparisons to complex, contextually grounded reasoning (illustrative sample formats follow the list):

  1. Pitch Comparison: The model selects which of two sounds has a higher pitch, formulated as a binary decision task.
  2. Duration Comparison: The model compares two described sounds and identifies the one with longer duration.
  3. Loudness Comparison: This task asks the model to select the louder sound between two options based on prompts.
  4. Animal Sound Recognition: This task requires predicting the correct animal corresponding to a given onomatopoeic expression (e.g., 'meow'). Each sample is presented as a multiple-choice question with four options.
  5. Auditory Context Reasoning: This component evaluates a model’s ability to perform contextual auditory reasoning, focusing on interpreting nuanced auditory cues and situational contexts in a multiple-choice format.
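To make these formats concrete, the snippet below shows how individual items could be represented; the field names and values are purely illustrative and do not reflect the released data schema.

# Hypothetical sample layouts (field names are illustrative, not the official schema).
pitch_comparison_sample = {
    "task": "pitch_comparison",
    "question": "Which sound has a higher pitch: a violin or a double bass?",
    "options": ["violin", "double bass"],      # binary decision
    "answer": "violin",
}

animal_sound_sample = {
    "task": "animal_sound_recognition",
    "question": "Which animal makes the sound 'meow'?",
    "options": ["dog", "cat", "cow", "duck"],  # four-option multiple choice
    "answer": "cat",
}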

Dataset Statistics

Answer Distribution

[Figure: Answer distribution across tasks]

#Data

[Figure: Number of data samples]

Dataset Samples

Method

[Figure: Method comparison]

Pipeline of the proposed AIR-CoT:

  1. Data Preparation: Training data is augmented with [imagine] and [/imagine] tokens that mark spans requiring auditory reasoning.
  2. Stage 1 – Span Detection: The model is fine-tuned to detect these spans by generating the special tokens during decoding.
  3. Stage 2 – Knowledge Injection: Upon generating the [/imagine] token, the model pauses, produces an auditory embedding with CLAP for the detected span, and injects it to support auditory reasoning (a sketch follows below).
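Below is a minimal sketch of how this inference loop could be wired together, assuming a HuggingFace causal LM fine-tuned with the added [imagine]/[/imagine] special tokens, the CLAP text encoder from transformers, and a linear projection (learned in Stage 2) from the CLAP embedding space to the LM hidden size. The checkpoint path, helper names, and the choice to append the projected embedding as an extra soft token are illustrative assumptions, not the released implementation.

# Illustrative sketch of AIR-CoT inference (assumptions noted above; not the official code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, ClapModel, ClapProcessor

tok = AutoTokenizer.from_pretrained("path/to/air-cot-lm")        # hypothetical checkpoint
lm = AutoModelForCausalLM.from_pretrained("path/to/air-cot-lm")
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
proj = torch.nn.Linear(clap.config.projection_dim, lm.config.hidden_size)  # learned in Stage 2

IMG_OPEN = tok.convert_tokens_to_ids("[imagine]")
IMG_CLOSE = tok.convert_tokens_to_ids("[/imagine]")

def last_span(ids, open_id, close_id):
    # Decode the text between the most recent [imagine] ... [/imagine] pair.
    seq = ids[0].tolist()
    start = len(seq) - 1 - seq[::-1].index(open_id)
    end = len(seq) - 1 - seq[::-1].index(close_id)
    return tok.decode(seq[start + 1:end], skip_special_tokens=True)

@torch.no_grad()
def air_cot_generate(prompt, max_new_tokens=256):
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = lm.get_input_embeddings()(ids)
    for _ in range(max_new_tokens):
        logits = lm(inputs_embeds=embeds).logits
        next_id = logits[:, -1].argmax(dim=-1)                   # greedy decoding
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
        embeds = torch.cat([embeds, lm.get_input_embeddings()(next_id)[:, None]], dim=1)
        if next_id.item() == IMG_CLOSE:
            # Stage 2: encode the detected span with CLAP, project it to the LM
            # hidden size, and append it as an extra soft-token embedding.
            clap_in = clap_proc(text=[last_span(ids, IMG_OPEN, IMG_CLOSE)], return_tensors="pt")
            audio_emb = proj(clap.get_text_features(**clap_in))  # shape (1, hidden_size)
            embeds = torch.cat([embeds, audio_emb[:, None]], dim=1)
        elif next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=False)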

BibTeX

TBD