Edward Choi - New Related Research
Recent Research in Language Models and Vision-Language Models
Multidimensional Consistency Improves Reasoning in Language Models
H Lai, X Zhang, M Nissim - arXiv preprint arXiv:2503.02670, 2025
While large language models (LLMs) have proved able to address some complex reasoning tasks, they are also highly sensitive to input variation, which can lead to different solution paths and final answers. Because answer consistency across input variations is crucial for reliable reasoning, we propose a novel approach that improves LLM consistency along multiple dimensions: the model is trained to produce consistent answers across multiple variations of an input, which in turn improves reasoning performance. On several benchmark datasets, our approach outperforms state-of-the-art methods in both answer consistency and reasoning accuracy.
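The consistency signal at the heart of this abstract lends itself to a small illustration. Below is a minimal sketch, assuming a hypothetical generate_answer() call, of how answer agreement across input variations can be measured by majority vote; the paper's actual training procedure goes further, but the underlying signal is the same.

```python
# Illustrative sketch (not the paper's exact method): measure answer
# consistency across paraphrased variants of the same problem, and take
# the majority answer. `generate_answer` is a hypothetical stand-in for
# any LLM call that returns a final answer string.
from collections import Counter

def generate_answer(model, prompt: str) -> str:
    raise NotImplementedError  # e.g., an API call or model.generate(...)

def consistent_answer(model, variants: list[str]) -> tuple[str, float]:
    """Return the majority answer over input variants and its agreement rate."""
    answers = [generate_answer(model, v) for v in variants]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Usage: variants of one problem along different dimensions
# (rewording, changed surface values, etc.):
# answer, agreement = consistent_answer(model, [
#     "If 3 pens cost $6, how much do 5 pens cost?",
#     "Three pens cost six dollars. What is the price of five pens?",
# ])
```

A low agreement rate flags exactly the input sensitivity the abstract describes.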
Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization
P Pattnayak, HL Patel, A Agarwal, B Kumar, S Panda… - arXiv preprint arXiv:2502.13108, 2025
Clinical Question Answering (CQA) plays a crucial role in medical decision-making, enabling physicians to extract relevant information from Electronic Medical Records (EMRs). While transformer-based models such as BERT, BioBERT, and RoBERTa have achieved state-of-the-art performance in CQA, they often require large amounts of labeled data and can be computationally expensive. In this paper, we propose a multi-task learning approach that jointly trains a single model for answer extraction and medical categorization, using a shared encoder for both tasks and a task-specific decoder for each. On several benchmark datasets, our approach outperforms state-of-the-art methods on both answer-extraction and medical-categorization accuracy.
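The shared-encoder, per-task-head design the abstract describes follows a standard multi-task pattern. The sketch below shows one plausible instantiation with Hugging Face Transformers; the encoder choice, head shapes, and loss weighting are assumptions, not the paper's exact architecture.

```python
# Minimal shared-encoder multi-task model: one span-extraction head,
# one categorization head. All sizes here are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskCQA(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased",
                 num_categories: int = 10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared
        hidden = self.encoder.config.hidden_size
        self.span_head = nn.Linear(hidden, 2)             # start/end logits
        self.category_head = nn.Linear(hidden, num_categories)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.span_head(h).split(1, dim=-1)
        category_logits = self.category_head(h[:, 0])     # [CLS] token
        return start_logits.squeeze(-1), end_logits.squeeze(-1), category_logits

# Joint training sums (or weights) the two task losses, e.g.:
# loss = ce(start_logits, start) + ce(end_logits, end) + ce(category_logits, cat)
```

The shared encoder is what makes the setup data-efficient: both tasks update the same representation.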
Mapping 1,000+ Language Models via the Log-Likelihood Vector
M Oyama, H Yamagiwa, Y Takase, H Shimodaira - arXiv preprint arXiv:2502.16173, 2025
To compare autoregressive language models at scale, we propose using log-likelihood vectors computed on a predefined text set as model features. This approach has a solid theoretical basis: treated as coordinates, the log-likelihood values place each model as a point in a common space that can be visualized. Applied to a collection of 1,000+ language models, these vectors capture the structure of the model space and reveal clusters of similar models, demonstrating their potential as a tool for model comparison and selection.
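Concretely, building the feature matrix is straightforward. The sketch below assumes a hypothetical avg_log_likelihood() helper and uses PCA for the 2-D view; the paper may score or project differently.

```python
# Sketch of the log-likelihood-vector idea: score every model on the same
# fixed text set, stack the scores into feature vectors, and project for
# visualization. `avg_log_likelihood` is a hypothetical scoring helper.
import numpy as np
from sklearn.decomposition import PCA

def avg_log_likelihood(model, text: str) -> float:
    raise NotImplementedError  # mean token log-prob of `text` under `model`

def model_coordinates(models: list, texts: list[str]) -> np.ndarray:
    """One row per model; column j is that model's log-likelihood on texts[j]."""
    return np.array([[avg_log_likelihood(m, t) for t in texts] for m in models])

# features = model_coordinates(models, texts)  # shape: (n_models, n_texts)
# coords_2d = PCA(n_components=2).fit_transform(features)
# Nearby points are models with similar likelihood profiles.
```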
CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers
AD Le, T Vu, NL Hai, NTN Diep, LN Van, T Le… - arXiv preprint arXiv:2502.16806, 2025
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from a large teacher model to a smaller student model. However, existing KD methods often require a shared tokenizer between teacher and student, which limits their applicability. In this paper, we propose a KD approach that handles different tokenizers: optimal transport alignment matches the token distributions of the two models, enabling effective knowledge transfer even without a shared vocabulary. On several benchmark datasets, our approach outperforms state-of-the-art methods in both accuracy and efficiency.
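Entropy-regularized (Sinkhorn) optimal transport is one standard way to realize such an alignment across mismatched token sequences. The sketch below, with an assumed cosine cost over token hidden states projected to a shared dimension, illustrates the mechanism; the paper's exact objective and regularization may differ.

```python
# Sinkhorn OT between a teacher token sequence (n tokens) and a student
# token sequence (m tokens) from different tokenizers. Cost choice and
# uniform marginals are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropy-regularized OT plan between two uniform marginals (log domain)."""
    n, m = cost.shape
    log_K = -cost / eps
    log_a = torch.full((n,), -math.log(n))
    log_b = torch.full((m,), -math.log(m))
    u = torch.zeros(n)
    v = torch.zeros(m)
    for _ in range(iters):
        u = log_a - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_b - torch.logsumexp(log_K + u[:, None], dim=0)
    return torch.exp(log_K + u[:, None] + v[None, :])

def ot_align_loss(teacher_h: torch.Tensor, student_h: torch.Tensor) -> torch.Tensor:
    """teacher_h: (n, d), student_h: (m, d), already in a shared dimension."""
    t = F.normalize(teacher_h, dim=-1)
    s = F.normalize(student_h, dim=-1)
    cost = 1.0 - t @ s.T                  # cosine distance between token states
    plan = sinkhorn(cost.detach())        # soft token-to-token correspondence
    return (plan * cost).sum()            # expected transport cost as the loss
```

The transport plan gives a soft many-to-many token correspondence, which is what makes cross-tokenizer distillation possible.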
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
A Deng, T Cao, Z Chen, B Hooi - arXiv preprint arXiv:2503.02199, 2025
Vision-Language Models (VLMs) excel at integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when the two sources conflict. Our results show that VLMs tend to rely heavily on textual information even when visual information is available, suggesting a "blind faith" in text that can cause errors on visually grounded tasks. These findings highlight the need for VLMs that integrate visual and textual information more robustly.
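A probe of this kind can be sketched simply: pair each image with a caption that contradicts it and count how often the model sides with the text. The vlm_answer() call and the case format below are hypothetical, not the paper's harness.

```python
# Hedged sketch of a modality-conflict probe. Each case pairs an image
# with a contradicting textual claim and a question whose answer differs
# depending on which modality the model trusts.
def vlm_answer(model, image, prompt: str) -> str:
    raise NotImplementedError  # hypothetical VLM call

def text_preference_rate(model, cases: list[dict]) -> float:
    """cases: [{'image': ..., 'claim': text contradicting the image,
               'question': ..., 'text_answer': ..., 'visual_answer': ...}]"""
    follows_text = 0
    for c in cases:
        prompt = f"{c['claim']}\n{c['question']}"
        if vlm_answer(model, c['image'], prompt) == c['text_answer']:
            follows_text += 1
    return follows_text / len(cases)

# A rate near 1.0 indicates "blind faith" in text over the image.
```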
Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling
G Wu, H Song, Y Wang, Q Yan, Y Tian, LL Cheong… - arXiv preprint arXiv:2503.01754, 2025
Multi-modal large language models have advanced rapidly alongside large language models. However, while language models can effectively leverage chain-of-thought prompting for zero- and few-shot learning, similar prompting strategies are far less effective for vision-language models. In this paper, we propose to enhance multi-hop reasoning in vision-language models via self-distillation with multi-prompt ensembling: a self-distillation framework refines the model's reasoning capabilities, while a multi-prompt ensembling strategy improves its robustness. On several benchmark datasets, our approach outperforms state-of-the-art methods in multi-hop reasoning accuracy.
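One plausible reading of this pipeline, with an assumed vlm_generate() call and threshold: query the model under several prompt templates, keep consensus answers, and reuse them as self-distillation targets. The paper's actual procedure may differ in how it aggregates and filters.

```python
# Multi-prompt ensembling for pseudo-labels: the same (image, question)
# is asked under several templates; confident consensus answers become
# self-distillation data. All details here are illustrative assumptions.
from collections import Counter

def vlm_generate(model, image, prompt: str) -> str:
    raise NotImplementedError  # hypothetical VLM call

def ensemble_pseudo_label(model, image, question: str, templates: list[str]):
    answers = [vlm_generate(model, image, t.format(q=question)) for t in templates]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(templates) >= 0.5 else None  # keep confident cases

# Example (hypothetical) templates:
# templates = ["Question: {q}\nReason step by step, then answer.",
#              "{q}\nExplain your reasoning before answering.",
#              "Look at the image carefully. {q}"]
# Kept (image, question, answer) triples then fine-tune the same model,
# distilling the ensemble's multi-hop answers into single prompts.
```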
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
S Wu, FY Sun, K Wen, N Haber - arXiv preprint arXiv:2502.13928, 2025
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors on visually grounded tasks and in hallucinations. We hypothesize that this issue arises from a lack of effective alignment between the visual and language modalities. In this paper, we propose to align VLMs using minimally different contrastive images: a symmetrical visual contrastive optimization framework ties the visual and language modalities together. On several benchmark datasets, our approach outperforms state-of-the-art methods in both accuracy and robustness.
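A symmetric contrastive objective of this flavor can be written compactly. The sketch below uses a CLIP-style formulation in which each matched image/text pair is contrasted against the other (minimally different) pairs in the batch; the temperature and similarity function are assumptions, not the paper's exact loss.

```python
# Symmetric contrastive loss: each image should score highest with its own
# caption and vice versa, with both directions weighted equally.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim); row i of each side is a matched pair,
    other rows act as hard (minimally different) negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0))
    # Symmetric: image-to-text and text-to-image directions averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Using minimally different images as in-batch negatives is what forces the model to attend to fine-grained visual detail rather than language priors.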
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models
Y Meng, K Li, C Huang, C Gao, X Chen, Y Li, X Zhang - arXiv preprint arXiv:2502.14504, 2025
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this, we propose pruning visual tokens with a per-layer, per-head strategy that selectively removes tokens, enabling efficient inference. On several benchmark datasets, our approach outperforms state-of-the-art methods in both inference efficiency and accuracy.
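The per-layer, per-head idea can be illustrated with attention-based token importance: each head keeps only the visual tokens that receive the most attention mass. The keep-ratio schedule and selection criterion below are illustrative assumptions, not PLPHP's exact method.

```python
# Per-head visual-token selection for one layer: rank visual key tokens
# by the attention mass they receive, keep the top fraction per head.
import torch

def prune_visual_tokens(attn: torch.Tensor, vis_idx: torch.Tensor,
                        keep_ratio: float) -> torch.Tensor:
    """attn: (heads, q_len, k_len) attention weights for one layer.
    vis_idx: 1-D indices of visual tokens among the keys.
    Returns a (heads, n_keep) tensor of visual token ids kept per head."""
    # Importance of each visual key = attention mass it receives, per head.
    importance = attn[:, :, vis_idx].sum(dim=1)        # (heads, n_visual)
    n_keep = max(1, int(keep_ratio * vis_idx.numel()))
    top = importance.topk(n_keep, dim=-1).indices      # per-head selection
    return vis_idx[top]                                # map back to token ids

# Varying keep_ratio across layers (e.g., pruning deeper layers harder)
# turns this per-head routine into a per-layer per-head schedule.
```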
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
A Albalak, D Phung, N Lile, R Rafailov, K Gandhi… - arXiv preprint arXiv:2502.17387, 2025
Increasing interest in reasoning models has made mathematics a prominent testing ground for algorithmic and methodological improvements. However, existing open math datasets either contain only a small collection of high-quality, human-written problems or are limited to specific math domains. In this paper, we present Big-Math, a large-scale, high-quality math dataset for reinforcement learning in language models, containing a diverse collection of problems across domains and difficulty levels. Experiments with several benchmark models show that Big-Math effectively supports reinforcement learning in language models.
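Datasets like this typically feed RL through a verifiable reward: each problem carries a checkable final answer, so correctness becomes the training signal. The answer-extraction heuristic below is an assumption for illustration, not Big-Math's prescribed recipe.

```python
# Verifiable-reward sketch for RL on math problems: reward 1.0 iff the
# model's extracted final answer matches the gold answer.
import re

def extract_final_answer(generation: str) -> str:
    """Heuristic: take the last number in the generation as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation)
    return numbers[-1] if numbers else ""

def math_reward(generation: str, gold_answer: str) -> float:
    return 1.0 if extract_final_answer(generation) == gold_answer.strip() else 0.0

# This binary reward is what policy-gradient methods optimize when
# training on large verifiable problem pools such as Big-Math.
```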
Compositional Causal Reasoning Evaluation in Language Models
JRMA Maasch, A Hüyük, X Xu, AV Nori, J Gonzalez - arXiv preprint arXiv:2503.04556, 2025
Causal reasoning and compositional reasoning are two core aspirations for generative AI, and measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors together and propose an approach for evaluating compositional causal reasoning in language models, combining automated and human evaluation to assess a model's ability to reason about complex causal relationships. Experiments on several benchmark models show that the approach effectively captures their compositional causal reasoning capabilities.
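One way to make "compositional" concrete, sketched under strong assumptions (a hypothetical ask_effect() query and linear effects along a chain): check whether a model's end-to-end causal estimate matches the composition of its part-wise estimates. The paper's actual protocol may differ substantially.

```python
# Illustrative composition probe for a chain X -> M -> Y with linear
# effects: the total effect of X on Y should equal the product of the
# X -> M and M -> Y effects. `ask_effect` is a hypothetical LLM query
# returning a numeric causal effect.
def ask_effect(model, cause: str, outcome: str) -> float:
    raise NotImplementedError

def composition_gap(model, x: str, m: str, y: str) -> float:
    """Zero iff the model's part-wise and end-to-end estimates compose."""
    direct = ask_effect(model, x, y)
    composed = ask_effect(model, x, m) * ask_effect(model, m, y)
    return abs(direct - composed)
```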
Q&A: Recent Research in Language Models and Vision-Language Models
Q: What is the main focus of the recent research in language models and vision-language models?
A: Improving the performance, reliability, and efficiency of these models across tasks such as reasoning, question answering, and multimodal understanding.
Q: What is the significance of multidimensional consistency in language models?
A: Multidimensional consistency ensures that the model gives the same answer across different variations of an input, which is a prerequisite for reliable reasoning.
Q: How does the proposed approach in "Multidimensional Consistency Improves Reasoning in Language Models" improve the model's performance?
A: It trains the model to produce the same answer across multiple variations of an input. This improves reasoning performance, and the method outperforms state-of-the-art approaches in both answer consistency and reasoning accuracy.
Q: What is the main challenge in improving clinical question answering with multi-task learning?
A: Labeled clinical data is scarce and transformer-based CQA models are computationally expensive, so the challenge is training a single model that performs both answer extraction and medical categorization well without separate, data-hungry pipelines.
Q: How does the proposed approach in "Improving Clinical Question Answering with Multi-Task Learning" address this challenge?
A: It shares one encoder across both tasks and attaches a task-specific decoder to each, enabling joint training that improves performance on both answer extraction and medical categorization.
Q: What is the significance of log-likelihood vectors in comparing language models?
A: Computed on a fixed text set, log-likelihood vectors give each model well-defined coordinates on a solid theoretical footing, so the model space can be visualized and clusters of similar models identified.
Q: How does the proposed approach in "Mapping 1,000+ Language Models via the Log-Likelihood Vector" improve the model comparison process?
A: By mapping 1,000+ models into this shared coordinate space, the approach captures the structure of the model space and surfaces clusters of similar models, making comparison and selection more systematic.
Q: What is the main challenge in enhancing multi-hop reasoning in vision-language models?
A: Chain-of-thought prompting, which works well for language models in zero- and few-shot settings, is far less effective for vision-language models, so multi-hop reasoning over images remains difficult.
Q: How does the proposed approach in "Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling" address this challenge?
A: It combines a self-distillation framework that refines the model's reasoning capabilities with a multi-prompt ensembling strategy that improves robustness, and together these enable effective multi-hop reasoning.
Q: What is the significance of symmetrical visual contrastive optimization in aligning vision-language models?
A: VLMs tend to neglect image content and over-rely on language-model priors; symmetrical visual contrastive optimization counteracts this by explicitly aligning the visual and language modalities, improving accuracy and robustness on visually grounded tasks.
Q: How does the proposed approach in "Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images" improve the model's performance?
A: It applies a symmetric contrastive objective over minimally different images, tying the visual and language modalities together so that predictions stay grounded in image content; this yields gains in both accuracy and robustness over state-of-the-art methods.
Q: What is the main challenge in pruning visual tokens in large vision-language models?
A: Visual tokens dominate decoding cost in LVLMs, so the challenge is removing redundant tokens without hurting accuracy, particularly since different layers and attention heads rely on different tokens.
Q: How does the proposed approach in "PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models" address this challenge?
A: It prunes with a per-layer, per-head strategy, so each attention head keeps only the visual tokens it actually uses; this yields efficient inference while outperforming state-of-the-art methods on both efficiency and accuracy.
Q: What is the significance of Big-Math in supporting reinforcement learning in language models?
A: Big-Math is a large-scale, high-quality dataset with a diverse collection of math problems across domains and difficulty levels, filling the gap left by open datasets that are either small or confined to a single domain.
Q: How does the proposed approach in "Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models" improve the model's performance?
A: Training on Big-Math exposes a model to a broad range of domains and difficulty levels, which the paper shows effectively supports reinforcement learning and improves performance on complex mathematics.
Q: What is the main challenge in evaluating compositional causal reasoning in language models?
A: The difficulty lies in measuring, in a principled way, whether a model can compose individual causal relationships into correct conclusions about complex, multi-step causal structure.
Q: How does the proposed approach in "Compositional Causal Reasoning Evaluation in Language Models" address this challenge?
A: It takes a unified view of causal and compositional reasoning and combines automated and human evaluation to assess how well a model reasons about complex causal relationships.