Title: “It’s not like Jarvis, but it’s pretty close!” – Examining ChatGPT’s Usage among Undergraduate Students in Computer Science
Authors: Ishika Joshi, Ritvik Budhiraja, Harshal D Akolekar
Abstract: Large language models (LLMs) such as ChatGPT and Google Bard have garnered significant attention in the academic community. Previous research has evaluated these LLMs for various applications such as generating programming exercises and solutions. However, these evaluations have predominantly been conducted by instructors and researchers, without considering the actual usage of LLMs by students. This study adopts a student-first approach to comprehensively understand how undergraduate computer science students utilize ChatGPT, a popular LLM released by OpenAI. We employ a combination of student surveys and interviews to obtain valuable insights into the benefits, challenges, and suggested improvements related to ChatGPT. Our findings suggest that a majority of students (over 57%) have a decidedly positive outlook towards adopting ChatGPT as an aid in coursework-related tasks. However, our research also highlights various challenges that must be resolved for long-term acceptance of ChatGPT amongst students. The findings from this investigation have broader implications and may be applicable to other LLMs and their role in computing education.
Title: PromptBench: A Unified Library for Evaluation of Large Language Models
Authors: Kaijie Zhu, Qinlin Zhao, Hao Chen
Abstract: The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.
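For orientation, here is an evaluation-loop sketch reconstructed from memory of the project's README; the class names (DatasetLoader, LLMModel, Prompt, InputProcess, OutputProcess, Eval) and their exact signatures are assumptions and should be checked against the repository.

```python
# Sketch of a PromptBench classification-evaluation loop; API names are
# recalled from the README and may differ in the current release.
import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")            # dataset loading
model = pb.LLMModel(model="google/flan-t5-large",          # model loading
                    max_new_tokens=10, temperature=0.0001)
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])

for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        raw = model(pb.InputProcess.basic_format(prompt, data))  # prompt construction
        preds.append(pb.OutputProcess.cls(raw))                  # normalize raw output
        labels.append(data["label"])
    print(pb.Eval.compute_cls_accuracy(preds, labels))           # analysis tools
```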
Title: PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering
Authors: Jinlong He, Pengfei Li, Gang Liu
Abstract: Multimodal large language models (MLLMs) represent an evolutionary expansion in the capabilities of traditional large language models, enabling them to tackle challenges that surpass the scope of purely text-based applications. They leverage the knowledge previously encoded within these language models, thereby enhancing their applicability and functionality in the realm of multimodal contexts. Recent works investigate the adaptation of MLLMs to predict free-form answers as a generative task to solve medical visual question answering (Med-VQA) tasks. In this paper, we propose a parameter-efficient framework for fine-tuning an MLLM specifically tailored to Med-VQA applications, and empirically validate it on a public benchmark dataset. To accurately measure the performance, we employ human evaluation, and the results reveal that our model achieves an overall accuracy of 81.9%, and outperforms the GPT-4v model by a significant margin of 26% absolute accuracy on closed-ended questions. The code will be available here: https://github.com/jinlHe/PeFoMed.
Title: Subjective and Objective Analysis of Indian Social Media Video Quality
Authors: Sandeep Mishra, Mukul Jha, Alan C. Bovik
Abstract: We conducted a large-scale subjective study of the perceptual quality of User-Generated Mobile Video Content on a set of mobile-originated videos obtained from the Indian social media platform ShareChat. The content viewed by volunteer human subjects under controlled laboratory conditions has the benefit of culturally diversifying the existing corpus of User-Generated Content (UGC) video quality datasets. There is a great need for large and diverse UGC-VQA datasets, given the explosive global growth of the visual internet and social media platforms. This is particularly true in regard to videos obtained by smartphones, especially in rapidly emerging economies like India. ShareChat provides a safe and cultural community oriented space for users to generate and share content in their preferred Indian languages and dialects. Our subjective quality study, which is based on this data, offers a boost of cultural, visual, and language diversification to the video quality research community. We expect that this new data resource will also allow for the development of systems that can predict the perceived visual quality of Indian social media videos, to control scaling and compression protocols for streaming, provide better user recommendations, and guide content analysis and processing. We demonstrate the value of the new data resource by conducting a study of leading blind video quality models on it, including a new model, called MoEVA, which deploys a mixture of experts to predict video quality. Both the new LIVE-ShareChat dataset and sample source code for MoEVA are being made freely available to the research community at https://github.com/sandeep-sm/LIVE-SC.
Title: Code-Style In-Context Learning for Knowledge-Based Question Answering
Authors: Zhijie Nie, Richong Zhang, Zhongyuan Wang
Abstract: Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logical form for a new question. However, current powerful LLMs have little exposure to logical forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation of unfamiliar logical forms into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigates the formatting error problem while achieving a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder.
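The conversion is easy to picture with a sketch; the program-style format below is hypothetical, chosen to illustrate the idea rather than KB-Coder's actual template.

```python
# Hypothetical sketch of code-style in-context learning for KBQA:
# logical-form generation is recast as the code generation LLMs
# already know well. find_entity / find_relation are illustrative.

DEMO = '''
# Question: what movies did Nolan direct?
def query():
    director = find_entity("Christopher Nolan")
    movies = find_relation(director, "film.director.film")
    return movies
'''

def build_prompt(question: str) -> str:
    """Prepend code-style demonstrations, then ask the LLM to
    complete an analogous function for the new question."""
    return (
        "Translate each question into a Python-style query program.\n"
        f"{DEMO}\n"
        f"# Question: {question}\n"
        "def query():\n"
    )

print(build_prompt("who wrote the screenplay for Inception?"))
```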
Title: AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets
Authors: Ernest Perkowski, Rui Pan, Tuan Dung Nguyen
Abstract: We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora – comprising abstracts, introductions, and conclusions – we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.
Title: DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Authors: DeepSeek-AI, Xiao Bi
Abstract: The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large-scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
Title: Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Authors: Kevin Everson, Yile Gu, Huck Yang
Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM’s resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM’s robustness to varying ASR performance conditions and scrutinize the aspects of in-context learning which prove the most influential.
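A minimal sketch of the input side (not the paper's exact serialization): each slot of the word confusion network lists competing ASR hypotheses with posteriors, flattened into the prompt so the LLM can weigh the alternatives.

```python
# Flatten a word confusion network (WCN) into an LLM prompt; the
# format and probabilities below are illustrative only.

wcn = [  # one list of (word, posterior) alternatives per time slot
    [("flights", 0.7), ("lights", 0.3)],
    [("to", 1.0)],
    [("boston", 0.6), ("austin", 0.4)],
]

def wcn_to_prompt(wcn):
    slots = [" | ".join(f"{w} ({p:.1f})" for w, p in slot) for slot in wcn]
    return ("The user said (alternatives with ASR confidence): "
            + " ; ".join(slots)
            + "\nWhat is the user's intent?")

print(wcn_to_prompt(wcn))
```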
Title: Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task
Authors: Gabriel Lino Garcia, Pedro Henrique Paiola, Luis Henrique Morelli
Abstract: Large Language Models (LLMs) are increasingly bringing advances to Natural Language Processing. However, low-resource languages such as Portuguese (those lacking extensive representation in datasets for various NLP tasks, or whose existing datasets are not as substantial) benefit from LLMs, but not to the same extent. LLMs trained on multilingual datasets normally struggle to respond to prompts in Portuguese satisfactorily, presenting, for example, code-switching in their responses. This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts, named Bode, in two versions: 7B and 13B. We evaluate the performance of this model on classification tasks using a zero-shot approach with in-context learning, and compare it with other LLMs. Our main contribution is to provide an LLM with satisfactory results in the Portuguese language, as well as a model that is free for research or commercial purposes.
Title: MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
Authors: Renjie Pi, Tianyang Han, Yueqi Xie
Abstract: The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. We delve into the novel challenge of defending MLLMs against such attacks. We discovered that images act as a “foreign language” that is not considered during alignment, which can make MLLMs prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, making it difficult to thoroughly cover the possible scenarios. This vulnerability is exacerbated by the fact that open-source MLLMs are predominantly fine-tuned on limited image-text pairs, far fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. The harm detector’s role is to identify potentially harmful outputs from the MLLM, while the detoxifier corrects these outputs to ensure the response conforms to safety standards. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the model’s overall performance. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
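The plug-and-play structure is a short pipeline; a schematic with both trained modules stubbed out (the threshold and stub logic are placeholders, not the paper's models).

```python
# Schematic of the MLLM-Protector pipeline: a lightweight harm detector
# screens the MLLM's output, and a detoxifier rewrites flagged responses.

THRESHOLD = 0.5

def harm_score(response: str) -> float:
    # Placeholder: the real detector is a small trained classifier.
    return 1.0 if "harmful" in response.lower() else 0.0

def detoxify(response: str) -> str:
    # Placeholder: the real detoxifier is a trained corrector model.
    return "I can't help with that request."

def protected_generate(mllm, image, prompt):
    response = mllm(image, prompt)          # base MLLM, untouched
    if harm_score(response) > THRESHOLD:    # screen the output, not the model
        response = detoxify(response)
    return response
```

Because the check happens after generation, the base model's weights and benign-task performance are left untouched, which is the point of the plug-and-play design.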
Title: AFSPP: Agent Framework for Shaping Preference and Personality with Large Language Models
Authors: Zihong He, Changwang Zhang
Abstract: The evolution of Large Language Models (LLMs) has introduced a new paradigm for investigating human behavior emulation. Recent research has employed LLM-based agents to create a sociological research environment, in which agents exhibit behavior based on the unfiltered characteristics of large language models. However, these studies overlook iterative development within a human-like setting: human preferences and personalities are complex, shaped by various factors, and subject to ongoing change as a result of environmental and subjective influences. In light of this observation, we propose the Agent Framework for Shaping Preference and Personality (AFSPP), exploring the multifaceted impact of social networks and subjective consciousness on LLM-based agents’ preference and personality formation. With AFSPP, we have, for the first time, successfully replicated several key findings from human personality experiments. Further AFSPP-based experiments indicate that plan making, sensory perception, and social networking with subjective information wield the most pronounced influence on preference shaping. AFSPP can significantly enhance the efficiency and scope of psychological experiments, while yielding valuable insights for Trustworthy Artificial Intelligence research on strategies to prevent undesirable preference and personality development.
Title: Generative Large Language Models are autonomous practitioners of evidence-based medicine
Authors: Akhil Vaid, Joshua Lampert, Juhee Lee
Abstract: Background: Evidence-based medicine (EBM) is fundamental to modern clinical practice, requiring clinicians to continually update their knowledge and apply the best clinical evidence in patient care. The practice of EBM faces challenges due to rapid advancements in medical research, leading to information overload for clinicians. The integration of artificial intelligence (AI), specifically Generative Large Language Models (LLMs), offers a promising solution towards managing this complexity. Methods: This study involved the curation of real-world clinical cases across various specialties, converting them into .json files for analysis. LLMs, including proprietary models like ChatGPT 3.5 and 4, Gemini Pro, and open-source models like LLaMA v2 and Mixtral-8x7B, were employed. These models were equipped with tools to retrieve information from case files and make clinical decisions similar to how clinicians must operate in the real world. Model performance was evaluated based on correctness of final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations. Results: GPT-4 was most capable of autonomous operation in a clinical setting, being generally more effective in ordering relevant investigations and conforming to clinical guidelines. Limitations were observed in terms of model ability to handle complex guidelines and diagnostic nuances. Retrieval Augmented Generation made recommendations more tailored to patients and healthcare systems. Conclusions: LLMs can be made to function as autonomous practitioners of evidence-based medicine. Their ability to utilize tooling can be harnessed to interact with the infrastructure of a real-world healthcare system and perform the tasks of patient management in a guideline directed manner. Prompt engineering may help to further enhance this potential and transform healthcare for the clinician and the patient.
Title: Thousands of AI Authors on the Future of AI
Authors: Katja Grace, Harlan Stewart, Julia Fabienne Sandkühler
Abstract: In the largest survey of its kind, 2,778 researchers who had published in top-tier artificial intelligence (AI) venues gave predictions on the pace of AI progress and the nature and impacts of advanced AI systems. The aggregate forecasts give at least a 50% chance of AI systems achieving several milestones by 2028, including autonomously constructing a payment processing site from scratch, creating a song indistinguishable from a new song by a popular musician, and autonomously downloading and fine-tuning a large language model. If science continues undisrupted, the chance of unaided machines outperforming humans in every possible task was estimated at 10% by 2027, and 50% by 2047. The latter estimate is 13 years earlier than that reached in a similar survey we conducted only one year earlier [Grace et al., 2022]. However, the chance of all human occupations becoming fully automatable was forecast to reach 10% by 2037, and 50% as late as 2116 (compared to 2164 in the 2022 survey). Most respondents expressed substantial uncertainty about the long-term value of AI progress: While 68.3% thought good outcomes from superhuman AI are more likely than bad, of these net optimists 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, and 59% of net pessimists gave 5% or more to extremely good outcomes. Between 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction. More than half suggested that “substantial” or “extreme” concern is warranted about six different AI-related scenarios, including misinformation, authoritarian control, and inequality. There was disagreement about whether faster or slower AI progress would be better for the future of humanity. However, there was broad agreement that research aimed at minimizing potential risks from AI systems ought to be prioritized more.
[Paper download:] http://arxiv.org/abs/2401.02843v1
Title: Pheme: Efficient and Conversational Speech Generation
Authors: Paweł Budzianowski, Taras Sereda, Tomasz Cichy
Abstract: In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. In order to mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series, which 1) offers compact yet high-performing models, 2) allows parallel generation of natural conversational speech, and 3) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that, through simple teacher-student distillation, we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.
Title: Retrieval-Augmented Text-to-Audio Generation
Abstract: Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.
Title: Object-Centric Instruction Augmentation for Robotic Manipulation
Authors: Junjie Wen, Yichen Zhu, Minjie Zhu
Abstract: Humans interpret scenes by recognizing both the identities and positions of objects in their observations. For a robot to perform tasks such as “pick and place”, understanding both what the objects are and where they are located is crucial. While the former has been extensively discussed in the literature that uses large language models to enrich text descriptions, the latter remains underexplored. In this work, we introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues. We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism to integrate the vision-language features from an off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
Title: Large Language Models in Plant Biology
Authors: Hilbert Yuen In Lam, Xing Er Ong, Marek Mutwil
Abstract: Large Language Models (LLMs), such as ChatGPT, have taken the world by storm and have passed certain forms of the Turing test. However, LLMs are not limited to human language and can analyze other sequential data, such as DNA, protein, and gene expression. The resulting foundation models can be repurposed to identify the complex patterns within the data, resulting in powerful, multi-purpose prediction tools able to explain cellular systems. This review outlines the different types of LLMs and showcases their recent uses in biology. Since LLMs have not yet been embraced by the plant community, we also cover how these models can be deployed for the plant kingdom.
Title: AI Control: Improving Safety Despite Intentional Subversion
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan
Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques (“protocols”) that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
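The trusted-editing protocol is simple enough to sketch directly from the description above; a minimal Python rendering with the three model calls stubbed out (in the paper they are GPT-4 as the untrusted writer and GPT-3.5 as the trusted rater and editor; the threshold value here is an assumption).

```python
# Sketch of the "trusted editing" protocol: submit untrusted code as-is
# only when the trusted model rates it sufficiently unsuspicious.

SUSPICION_THRESHOLD = 0.3   # assumed value; in practice tuned to a budget

def untrusted_write(problem: str) -> str:
    return f"# candidate solution for: {problem}"   # stub for GPT-4

def trusted_rate(problem: str, code: str) -> float:
    return 0.1                                      # stub for GPT-3.5, in [0, 1]

def trusted_edit(problem: str, code: str) -> str:
    return code                                     # stub for GPT-3.5's rewrite

def trusted_editing(problem: str) -> str:
    candidate = untrusted_write(problem)
    suspicion = trusted_rate(problem, candidate)
    if suspicion <= SUSPICION_THRESHOLD:
        return candidate                            # low suspicion: submit as-is
    return trusted_edit(problem, candidate)         # else submit the trusted edit

print(trusted_editing("reverse a linked list"))
```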
Title: FlashDecoding++: Faster Large Language Model Inference on GPUs
Authors: Ke Hong, Guohao Dai, Jiaming Xu
Abstract: Large Language Models (LLMs) are becoming increasingly important in various domains, yet the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ~20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM inference depends on varied input data features, hardware configurations, etc. A single, static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks; techniques like double buffering are then introduced. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow using different hardware resources, considering input dynamics. Due to the versatility of its optimizations, FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on NVIDIA and AMD GPUs, respectively, compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
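The first point is easy to see in isolation: if every partial softmax subtracts one shared value instead of its local maximum, the partial results merge by plain summation, with no synchronized rescaling. A small NumPy check of that identity (the unified max phi here is an assumed constant; the paper derives it from logit statistics):

```python
import numpy as np

# Standard "safe" softmax subtracts each chunk's own maximum, so partial
# results must be rescaled when maxima differ. With one shared phi, the
# per-chunk numerators and denominators simply add up.

x = np.random.randn(1024).astype(np.float64)
phi = 10.0  # unified max; assumes logits stay safely below this bound

chunks = np.split(x, 4)                          # e.g. 4 independent thread blocks
num = [np.exp(c - phi) for c in chunks]          # per-chunk numerators
den = sum(n.sum() for n in num)                  # merged without rescaling
softmax_async = np.concatenate(num) / den

safe = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(softmax_async, safe)          # identical result, no sync step
```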
Title: From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
Authors: Na Liu, Liangyu Chen, Xiaoyu Tian
Abstract: This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. It entails a comprehensive agent construction scenario, including phases like Conversation Selection, Scene Extraction, CoT Completion, and Scene Augmentation, leading to the LLMs Training phase. This approach appears to enhance agent controllability and adaptability in complex, multi-turn dialogues. Our preliminary evaluations in a real estate sales context suggest that RAISE has some advantages over traditional agents, indicating its potential for broader applications. This work contributes to the AI field by providing a robust framework for developing more context-aware and versatile conversational agents.
[Paper download:] http://arxiv.org/abs/2401.02777v1
Title: mFACE: Multilingual Summarization with Factual Consistency Evaluation
Authors: Roee Aharoni, Shashi Narayan, Joshua Maynez
Abstract: Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world applications. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine-generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results on the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
Title: Detection and Classification of Diabetic Retinopathy using Deep Learning Algorithms for Segmentation to Facilitate Referral Recommendation for Test and Treatment Prediction
Authors: Manoj S H, Arya A Bosale
Abstract: This research paper addresses the critical challenge of diabetic retinopathy (DR), a severe complication of diabetes leading to potential blindness. The proposed methodology leverages transfer learning with convolutional neural networks (CNNs) for automatic DR detection using a single fundus photograph, demonstrating high effectiveness with a quadratic weighted kappa score of 0.92546 in the APTOS 2019 Blindness Detection Competition. The paper reviews existing literature on DR detection, spanning classical computer vision methods to deep learning approaches, particularly focusing on CNNs. It identifies gaps in the research, emphasizing the lack of exploration in integrating pretrained large language models with segmented image inputs for generating recommendations and understanding dynamic interactions within a web application context. Objectives include developing a comprehensive DR detection methodology, exploring model integration, evaluating performance through competition ranking, contributing significantly to DR detection methodologies, and identifying research gaps. The methodology involves data preprocessing, data augmentation, and the use of a U-Net neural network architecture for segmentation. The U-Net model efficiently segments retinal structures, including blood vessels, hard and soft exudates, haemorrhages, microaneurysms, and the optic disc. High evaluation scores in Jaccard, F1, recall, precision, and accuracy underscore the model’s potential for enhancing diagnostic capabilities in retinal pathology assessment. The outcomes of this research hold promise for improving patient outcomes through timely diagnosis and intervention in the fight against diabetic retinopathy, marking a significant contribution to the field of medical image analysis.
Title: Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
Authors: Haoyuan Wu, Haisheng Zheng, Bei Yu
Abstract: Large Language Models (LLMs) have demonstrated considerable proficiency in general natural language processing (NLP) tasks. Instruction tuning, a successful paradigm, enhances the ability of LLMs to follow natural language instructions and exhibit robust generalization across a wide range of tasks. However, these models often encounter performance limitations across multiple tasks due to constrained model capacity. Expanding this capacity during the instruction tuning phase poses significant challenges. To address this issue, we introduce a novel approach, Parameter-Efficient Sparsity Crafting (PESC), which transitions dense models to sparse models using a Mixture-of-Experts (MoE) architecture. PESC integrates adapters into the MoE layers of sparse models, differentiating experts without altering the individual weights within these layers. This method significantly reduces computational costs and GPU memory requirements, facilitating model capacity expansion through a minimal increase in parameters via the inserted adapters. Our empirical evaluation demonstrates the effectiveness of the PESC method: using PESC during instruction tuning, our sparse models, dubbed Camelidae, outperform all other open-source sparse models and exhibit superior general capabilities compared to GPT-3.5.
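As a rough PyTorch illustration of the crafting step (shapes, routing details, and all names are ours, not the paper's code): each expert reuses the frozen dense FFN and is differentiated only by a small trainable adapter, so the expert count grows while the added parameters stay minimal.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter; its parameters are the only new weights."""
    def __init__(self, d_model=512, r=16):
        super().__init__()
        self.down, self.up = nn.Linear(d_model, r), nn.Linear(r, d_model)
    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class PESCMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4, top_k=2):
        super().__init__()
        self.shared_ffn = nn.Sequential(      # frozen, copied from the dense model
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for p in self.shared_ffn.parameters():
            p.requires_grad = False
        self.adapters = nn.ModuleList([Adapter(d_model) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                     # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, adapter in enumerate(self.adapters):
                mask = idx[:, k] == e
                if mask.any():
                    h = self.shared_ffn(x[mask])         # shared expert body
                    out[mask] += weights[mask, k, None] * adapter(h)
        return out

y = PESCMoE()(torch.randn(8, 512))            # smoke test: (8, 512) in and out
```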
Title: XUAT-Copilot: Multi-Agent Collaborative System for Automated User Acceptance Testing with Large Language Model
Authors: Zhitao Wang, Wei Wang, Zirao Li
Abstract: In past years, we have been dedicated to automating the user acceptance testing (UAT) process of WeChat Pay, one of the most influential mobile payment applications in China. A system titled XUAT has been developed for this purpose. However, there is still a human-labor-intensive stage in the current system, namely test script generation. Therefore, in this paper, we concentrate on methods of boosting the automation level of the current system, particularly the stage of test script generation. With recent notable successes, large language models (LLMs) demonstrate significant potential in attaining human-like intelligence, and there has been a growing research area that employs LLMs as autonomous agents to obtain human-like decision-making capabilities. Inspired by these works, we propose an LLM-powered multi-agent collaborative system, named XUAT-Copilot, for automated UAT. The proposed system mainly consists of three LLM-based agents responsible for action planning, state checking and parameter selecting, respectively, and two additional modules for state sensing and case rewriting. The agents interact with the testing device, make human-like decisions and generate action commands in a collaborative way. The proposed multi-agent system achieves effectiveness close to that of human testers in our experimental studies and gains a significant improvement in Pass@1 accuracy compared with a single-agent architecture. More importantly, the proposed system has been launched in the formal testing environment of the WeChat Pay mobile app, saving a considerable amount of manpower in daily development work.
Title: VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model
Authors: Pengying Wu, Yao Mu, Bingxian Wu
Abstract: In the realm of household robotics, the Zero-Shot Object Navigation (ZSON) task empowers agents to adeptly traverse unfamiliar environments and locate objects from novel categories without prior explicit training. This paper introduces VoroNav, a novel semantic exploration framework that proposes the Reduced Voronoi Graph to extract exploratory paths and planning nodes from a semantic map constructed in real time. By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model (LLM). Our approach presents a synergy of path and farsight descriptions to represent the environmental context, enabling the LLM to apply commonsense reasoning to ascertain the optimal waypoints for navigation. Extensive evaluation on the HM3D and HSSD datasets validates that VoroNav surpasses existing ZSON benchmarks in both success rate and exploration efficiency (+2.8% Success and +3.7% SPL on HM3D, +2.6% Success and +3.8% SPL on HSSD). Additional metrics that evaluate obstacle avoidance proficiency and perceptual efficiency further corroborate the enhancements achieved by our method in ZSON planning.
Title: Training Diffusion Models with Reinforcement Learning
Authors: Kevin Black, Michael Janner, Yilun Du
Abstract: Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project’s website can be found at http://rl-diffusion.github.io.
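A toy rendering of the policy-gradient view (log-probabilities and rewards are stubbed with random tensors; DDPO's importance-sampling variant and per-step details are omitted): the denoising trajectory is the episode, and the reward on the final image weights the log-probability of every denoising action.

```python
import torch

# REINFORCE-style DDPO objective sketch: each of the T denoising steps is
# an action; the episode reward (e.g. an aesthetic score on the final
# image) is normalized and applied to all step log-probs.

T, B = 10, 4                                        # denoising steps, batch size
logprobs = torch.randn(B, T, requires_grad=True)    # stub: log p(x_{t-1} | x_t, prompt)
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])        # stub: reward-model scores

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -(advantages[:, None] * logprobs).sum(dim=1).mean()
loss.backward()          # in the real method, gradients flow into the diffusion model
print(loss.item())
```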
Title: Learning to Prompt with Text Only Supervision for Vision-Language Models
Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer
Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and they often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, and they incur higher costs by generating LLM descriptions for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial due to the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
Title: SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations
Authors: Qitian Wu, Wentao Zhao, Chenxiao Yang
Abstract: Learning representations on large-sized graphs is a long-standing challenge due to the interdependence among massive numbers of data points. Transformers, as an emerging class of foundation encoders for graph-structured data, have shown promising performance on small graphs due to their global attention, which is capable of capturing all-pair influence beyond neighboring nodes. Even so, existing approaches tend to inherit the spirit of Transformers in language and vision tasks, and embrace complicated models by stacking deep multi-head attentions. In this paper, we critically demonstrate that even a one-layer attention can bring surprisingly competitive performance across node property prediction benchmarks where node numbers range from thousand-level to billion-level. This encourages us to rethink the design philosophy for Transformers on large graphs, where global attention is a computational overhead hindering scalability. We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model that can efficiently propagate information among arbitrary nodes in one layer. SGFormer requires no positional encodings, feature/graph pre-processing, or augmented losses. Empirically, SGFormer successfully scales to the web-scale graph ogbn-papers100M and yields up to 141x inference acceleration over SOTA Transformers on medium-sized graphs. Beyond current results, we believe the proposed methodology alone illuminates a new technical path of independent interest for building Transformers on large graphs.
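To see why one layer can scale, consider a generic linearized all-pairs attention (our illustration, not SGFormer's exact formulation): aggregating keys and values once reduces the cost from O(N^2 d) to O(N d^2), so a single layer can propagate information among arbitrary nodes without ever forming an N x N matrix.

```python
import torch

def linear_global_attention(x: torch.Tensor) -> torch.Tensor:
    """One layer of linearized all-pair attention over node features x: (N, d).
    Shared projections are omitted for brevity."""
    q, k, v = x, x, x
    q = q / q.norm(dim=-1, keepdim=True)
    k = k / k.norm(dim=-1, keepdim=True)
    kv = k.t() @ v                                  # (d, d): one pass over all nodes
    denom = (q @ k.sum(0, keepdim=True).t()).clamp(min=1e-6)  # (N, 1) normalizer
    return (q @ kv) / denom                         # each node attends to all N nodes

x = torch.rand(100_000, 64)                         # 100k nodes; memory stays O(N d)
print(linear_global_attention(x).shape)             # torch.Size([100000, 64])
```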
Title: Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Authors: Zhenhailong Wang, Shaoguang Mao, Wenshan Wu
Abstract: Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds’ strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination, and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git.
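The mechanism can be sketched as a single prompt template; the wording below is illustrative, not the paper's exact template.

```python
# Minimal Solo Performance Prompting sketch: one prompt asks a single LLM
# to propose task-relevant personas, run a multi-turn collaboration among
# them, and only then commit to a final answer.

def spp_prompt(task: str) -> str:
    return (
        "When faced with a task, begin by identifying the participants who\n"
        "will contribute to solving it. Then initiate a multi-turn\n"
        "collaboration until a final solution is reached.\n\n"
        f"Task: {task}\n\n"
        "Participants: (identify the personas needed)\n"
        "Collaboration: (each persona contributes and critiques in turn)\n"
        "Final answer:"
    )

print(spp_prompt("Write a trivia-rich short story about the 1969 moon landing."))
```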
Title: Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench
Authors: Jen-tse Huang, Man Ho Lam, Eric John Li
Abstract: Evaluating Large Language Models’ (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse. Utilizing the emotion appraisal theory from psychology, we propose to evaluate the empathy ability of LLMs, i.e., how their feelings change when presented with specific situations. After a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. Categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. With the human evaluation results as references, our evaluation includes five LLMs, covering both commercial and open-source models, including variations in model sizes, featuring the latest iterations, such as GPT-4 and LLaMA-2. We find that, despite several misalignments, LLMs can generally respond appropriately to certain situations. Nevertheless, they fall short in alignment with the emotional behaviors of human beings and cannot establish connections between similar situations. Our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed EmotionBench, are made openly accessible via https://github.com/CUHK-ARISE/EmotionBench. We aspire to contribute to the advancement of LLMs regarding better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.
Title: Location Aware Modular Biencoder for Tourism Question Answering
Authors: Haonan Li, Martin Tomko, Timothy Baldwin
Abstract: Answering real-world tourism questions that seek Point-of-Interest (POI) recommendations is challenging, as it requires both spatial and non-spatial reasoning, over a large candidate pool. The traditional method of encoding each pair of question and POI becomes inefficient when the number of candidates increases, making it infeasible for real-world applications. To overcome this, we propose treating the QA task as a dense vector retrieval problem, where we encode questions and POIs separately and retrieve the most relevant POIs for a question by utilizing embedding space similarity. We use pretrained language models (PLMs) to encode textual information, and train a location encoder to capture spatial information of POIs. Experiments on a real-world tourism QA dataset demonstrate that our approach is effective, efficient, and outperforms previous methods across all metrics. Enabled by the dense retrieval architecture, we further build a global evaluation baseline, expanding the search space by 20 times compared to previous work. We also explore several factors that impact the model’s performance through follow-up experiments. Our code and model are publicly available at https://github.com/haonan-li/LAMB.
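The biencoder formulation is straightforward to sketch (encoders are stubbed with random projections here; LAMB uses pretrained LMs plus a learned location encoder): POIs are embedded offline once, and each query only needs one encoding plus a similarity search.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

def encode_question(q: str) -> np.ndarray:
    return rng.standard_normal(D)                    # stub for the PLM question encoder

def encode_poi(text: str, lat: float, lon: float) -> np.ndarray:
    text_vec = rng.standard_normal(D)                # stub for the PLM text encoder
    loc_vec = np.tile([lat, lon], D // 2) / 180.0    # stub for the location encoder
    return text_vec + loc_vec                        # fused text + spatial embedding

pois = [("Flinders Street Station", -37.82, 144.97),
        ("Queen Victoria Market", -37.81, 144.96)]
index = np.stack([encode_poi(*p) for p in pois])     # precomputed offline, reused per query

q = encode_question("Where can I catch a train near the Yarra?")
scores = index @ q                                   # embedding-space similarity
print(pois[int(scores.argmax())][0])                 # top-1 POI (random under the stubs)
```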
Title: Self-supervised Pretraining for Decision Foundation Model: Formulation, Pipeline and Challenges
Authors: Xiaoqian Liu, Jianbin Jiao, Junge Zhang
Abstract: Decision-making is a dynamic process requiring perception, memory, and reasoning to make choices and find optimal policies. Traditional approaches to decision-making suffer from poor sample efficiency and generalization, while large-scale self-supervised pretraining has enabled fast adaptation with fine-tuning or few-shot learning in language and vision. We thus argue for integrating knowledge acquired from generic large-scale self-supervised pretraining into downstream decision-making problems. We propose a Pretrain-Then-Adapt pipeline and survey recent work on data collection, pretraining objectives, and adaptation strategies for decision-making pretraining and downstream inference. Finally, we identify critical challenges and future directions for developing a decision foundation model with the help of generic and flexible self-supervised pretraining.
Title: LMaaS: Exploring Pricing Strategy of Large Model as a Service for Communication
Authors: Panlong Wu, Qi Liu, Yanjie Dong
Abstract: The next generation of communication is envisioned to be intelligent communication that can replace traditional symbolic communication, where highly condensed semantic information considering both source and channel will be extracted and transmitted with high efficiency. Recent popular large models such as GPT-4, together with booming learning techniques, lay a solid foundation for intelligent communication and prompt its practical deployment in the near future. Given the “training once and widely use” characteristic of those multimodal large language models, we argue that a pay-as-you-go service mode will be suitable in this context, referred to as Large Model as a Service (LMaaS). However, the trading and pricing problem is quite complex, with heterogeneous and dynamic customer environments, making the pricing optimization problem challenging and without ready-made solutions. In this paper, we aim to fill this gap and formulate the LMaaS market trading as a Stackelberg game with two steps. In the first step, we optimize the seller’s pricing decision and propose an Iterative Model Pricing (IMP) algorithm that optimizes the prices of large models iteratively by reasoning about customers’ future rental decisions, which is able to achieve a near-optimal pricing solution. In the second step, we optimize customers’ selection decisions by designing a robust selecting and renting (RSR) algorithm, which is guaranteed to be optimal with rigorous theoretical proof. Extensive experiments confirm the effectiveness and robustness of our algorithms.
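A toy example of the Stackelberg structure (the utilities and price grid are invented for illustration): the seller (leader) searches over prices while simulating customers' (followers') best responses, which is the essence of reasoning about future rental decisions.

```python
# Leader-follower pricing toy: a customer rents iff utility >= price;
# the seller picks the revenue-maximizing price given that response.

customer_utilities = [3.0, 5.0, 8.0, 12.0]   # hypothetical per-customer value

def revenue(price: float) -> float:
    buyers = sum(1 for u in customer_utilities if u >= price)  # followers' best response
    return price * buyers

best_price = max((p / 10 for p in range(1, 151)), key=revenue)
print(best_price, revenue(best_price))        # 8.0 16.0 for this toy market
```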
Title: Subjectivity in Unsupervised Machine Learning Model Selection
Authors: Wanyi Chen, Mary L. Cummings
Abstract: Model selection is a necessary step in unsupervised machine learning. Despite numerous criteria and metrics, model selection remains subjective. A high degree of subjectivity may lead to questions about repeatability and reproducibility of various machine learning studies and doubts about the robustness of models deployed in the real world. Yet, the impact of modelers’ preferences on model selection outcomes remains largely unexplored. This study uses the Hidden Markov Model as an example to investigate the subjectivity involved in model selection. We asked 33 participants and three Large Language Models (LLMs) to make model selections in three scenarios. Results revealed variability and inconsistencies in both the participants’ and the LLMs’ choices, especially when different criteria and metrics disagree. Sources of subjectivity include varying opinions on the importance of different criteria and metrics, differing views on how parsimonious a model should be, and how the size of a dataset should influence model selection. The results underscore the importance of developing a more standardized way to document subjective choices made in model selection processes.
Title: Training and Serving System of Foundation Models: A Comprehensive Survey
Authors: Jiahang Zhou, Yanyu Chen, Zicong Hong
Abstract: Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ) have demonstrated extraordinary performance in key technological areas, such as natural language processing and visual recognition, and have become the mainstream trend of artificial general intelligence. This has led more and more major technology giants to dedicate significant human and financial resources to actively develop their foundation model systems, which drives continuous growth of these models’ parameters. As a result, the training and serving of these models have posed significant challenges, including substantial computing power, memory consumption, bandwidth demands, etc. Therefore, employing efficient training and serving strategies becomes particularly crucial. Many researchers have actively explored and proposed effective methods, so a comprehensive survey of them is essential for system developers and researchers. This paper extensively explores the methods employed in training and serving foundation models from various perspectives. It provides a detailed categorization of these state-of-the-art methods, including finer aspects such as network, computing, and storage. Additionally, the paper summarizes the challenges and presents a perspective on the future development direction of foundation model systems. Through comprehensive discussion and analysis, it hopes to provide a solid theoretical basis and practical guidance for future research and applications, promoting continuous innovation and development in foundation model systems.
Title: KwaiAgents: Generalized Information-seeking Agent System with Large Language Models
Authors: Haojie Pan, Zepeng Zhai, Hao Yuan
Abstract: Driven by curiosity, humans have continually sought to explore and understand the world around them, leading to the invention of various tools to satiate this inquisitiveness. Despite not having the capacity to process and memorize vast amounts of information in their brains, humans excel in critical thinking, planning, reflection, and harnessing available tools to interact with and interpret the world, enabling them to find answers efficiently. The recent advancements in large language models (LLMs) suggest that machines might also possess the aforementioned human-like capabilities, allowing them to exhibit powerful abilities even with a constrained parameter count. In this paper, we introduce KwaiAgents, a generalized information-seeking agent system based on LLMs. Within KwaiAgents, we propose an agent system that employs LLMs as its cognitive core, which is capable of understanding a user’s query, behavior guidelines, and referencing external documents. The agent can also update and retrieve information from its internal memory, plan and execute actions using a time-aware search-browse toolkit, and ultimately provide a comprehensive response. We further investigate the system’s performance when powered by LLMs less advanced than GPT-4, and introduce the Meta-Agent Tuning (MAT) framework, designed to ensure even an open-sourced 7B or 13B model performs well among many agent systems. We exploit both benchmark and human evaluations to systematically validate these capabilities. Extensive experiments show the superiority of our agent system compared to other autonomous agents and highlight the enhanced generalized agent-abilities of our fine-tuned LLMs.
Title: Applications of Large Scale Foundation Models for Autonomous Driving
Authors: Yu Huang, Yue Chen, Zhu Li
Abstract: Since the DARPA Grand Challenges (rural) in 2004/05 and the Urban Challenge in 2007, autonomous driving has been the most active field of AI applications. Recently, powered by large language models (LLMs), chat systems such as ChatGPT and PaLM have emerged and rapidly become a promising direction for achieving artificial general intelligence (AGI) in natural language processing (NLP). It is natural to consider employing these abilities to reformulate autonomous driving. By combining LLMs with foundation models, it is possible to utilize human knowledge, commonsense, and reasoning to rebuild autonomous driving systems out of the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized as simulation, world models, data annotation, and planning or E2E solutions, etc.
Title: Large Language Models for Social Networks: Applications, Challenges, and Solutions
Authors: Jingying Zeng, Richard Huang, Waleed Malik
Abstract: Large Language Models (LLMs) are transforming the way people generate, explore, and engage with content. We study how we can develop LLM applications for online social networks. Despite LLMs’ successes in other domains, it is challenging to develop LLM-based products for social networks for numerous reasons, and it has been relatively under-reported in the research community. We categorize LLM applications for social networks into three categories. First is knowledge tasks, where users want to find new knowledge and information, such as search and question-answering. Second is entertainment tasks, where users want to consume interesting content, such as getting entertaining notification content. Third is foundational tasks that need to be done to moderate and operate the social networks, such as content annotation and LLM monitoring. For each task, we share the challenges we found, solutions we developed, and lessons we learned. To the best of our knowledge, this is the first comprehensive paper about developing LLM applications for social networks.
Title: LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Authors: Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko
Abstract: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
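A schematic of the windowing idea as we understand it (the eviction rule and the activation sets below are illustrative, not the paper's implementation): only neurons activated within the last W tokens stay resident in DRAM, so each step loads just the incremental set from flash.

```python
from collections import deque

W = 4                       # sliding window of recent tokens
window = deque(maxlen=W)    # per-token sets of active neuron ids
resident = set()            # neurons currently held in DRAM

def step(active_neurons: set[int]) -> int:
    """Process one token; return how many neurons had to be read from flash."""
    to_load = active_neurons - resident          # incremental flash reads
    if len(window) == W:                         # evict neurons used only by
        expired = window[0]                      # the token leaving the window
        still_needed = set().union(*list(window)[1:], active_neurons)
        resident.difference_update(expired - still_needed)
    window.append(set(active_neurons))
    resident.update(active_neurons)
    return len(to_load)

for t, active in enumerate([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 3, 6}, {2, 3, 7}]):
    print(f"token {t}: loaded {step(active)} neurons from flash")
```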
Title: Memory, Consciousness and Large Language Model
Authors: Jitang Li, Jinzheng Li
Abstract: With developments in cognitive science and Large Language Models (LLMs), increasing connections have come to light between these two distinct fields. Building upon these connections, we propose a conjecture suggesting the existence of a duality between LLMs and Tulving’s theory of memory. We identify a potential correspondence between Tulving’s synergistic ecphory model (SEM) of retrieval and the emergent abilities observed in LLMs, serving as supporting evidence for our conjecture. Furthermore, we speculate that consciousness may be considered a form of emergent ability based on this duality. We also discuss how other theories of consciousness intersect with our research.
Abstract: Automated Planning and Scheduling is among the growing areas in Artificial Intelligence (AI) in which the application of LLMs has gained popularity. Based on a comprehensive review of 126 papers, this paper investigates eight categories based on the unique applications of LLMs in addressing various aspects of planning problems: language translation, plan generation, model construction, multi-agent planning, interactive planning, heuristics optimization, tool integration, and brain-inspired planning. For each category, we articulate the issues considered and the existing gaps. A critical insight resulting from our review is that the true potential of LLMs unfolds when they are integrated with traditional symbolic planners, pointing towards a promising neuro-symbolic approach. This approach effectively combines the generative aspects of LLMs with the precision of classical planning methods. By synthesizing insights from the existing literature, we underline the potential of this integration to address complex planning challenges. Our goal is to encourage the ICAPS community to recognize the complementary strengths of LLMs and symbolic planners, advocating for a direction in automated planning that leverages these synergistic capabilities to develop more advanced and intelligent planning systems.
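The neuro-symbolic integration highlighted here is easy to picture as a pipeline. The sketch below illustrates only the "language translation" category: an LLM drafts a PDDL problem file and a classical planner supplies the precision. The `call_llm` stub and the planner command are hypothetical placeholders, not any surveyed system's interface.

```python
# Hedged sketch of an LLM-to-symbolic-planner loop (assumed interfaces).
import pathlib
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def plan_with_llm(task_description: str, domain_pddl: str,
                  planner_cmd: str = "my-classical-planner") -> str:
    # The LLM handles the loose natural-language-to-PDDL translation...
    problem_pddl = call_llm(
        "Translate this task into a PDDL problem file for the given domain.\n"
        f"Domain:\n{domain_pddl}\nTask:\n{task_description}"
    )
    with tempfile.TemporaryDirectory() as tmp:
        dom = pathlib.Path(tmp, "domain.pddl")
        dom.write_text(domain_pddl)
        prob = pathlib.Path(tmp, "problem.pddl")
        prob.write_text(problem_pddl)
        # ...while the symbolic planner guarantees a sound, executable plan.
        result = subprocess.run([planner_cmd, str(dom), str(prob)],
                                capture_output=True, text=True)
    return result.stdout
```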
Title: LLaMA Pro: Progressive LLaMA with Block Expansion
Authors: Chengyue Wu, Yukang Gan, Yixiao Ge
Abstract: Humans generally acquire new skills without compromising old ones; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only the new corpus, efficiently and effectively improving the model’s knowledge without catastrophic forgetting. In this paper, we experiment on corpora of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B that excels in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance across various benchmarks, demonstrating superiority over existing open models in the LLaMA family and their immense potential for reasoning and addressing diverse tasks as intelligent agents. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.
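The block-expansion recipe can be sketched briefly. The following assumes a generic `nn.ModuleList` of Transformer blocks and inserts a trainable copy after every few frozen originals; the abstract does not detail the initialization, though zero-initialising each new block's output projection (omitted here) would let it start as an identity mapping.

```python
# Minimal sketch of block expansion: freeze original Transformer blocks and
# train only newly interleaved copies, so old knowledge is never overwritten.
import copy

import torch.nn as nn

def expand_blocks(layers: nn.ModuleList, every: int = 4) -> nn.ModuleList:
    expanded = []
    for i, block in enumerate(layers):
        block.requires_grad_(False)          # original weights stay intact
        expanded.append(block)
        if (i + 1) % every == 0:
            new_block = copy.deepcopy(block)
            new_block.requires_grad_(True)   # only the new blocks are tuned
            expanded.append(new_block)
    return nn.ModuleList(expanded)

# Usage (hypothetical attribute name): model.layers = expand_blocks(model.layers)
```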
Abstract: Foundation models with billions of parameters that have been trained on large corpora of data have demonstrated non-trivial skills in a variety of domains. However, due to their monolithic structure, it is challenging and expensive to augment them or impart new skills. On the other hand, due to their adaptation abilities, several new instances of these models are being trained towards new domains and tasks. In this work, we study the problem of efficient and practical composition of existing foundation models with more specific models to enable newer capabilities. To this end, we propose CALM – Composition to Augment Language Models – which introduces cross-attention between models to compose their representations and enable new capabilities. Salient features of CALM are: (i) it scales up LLMs on new tasks by ‘re-using’ existing LLMs along with a few additional parameters and data, (ii) existing model weights are kept intact, which preserves existing capabilities, and (iii) it applies to diverse domains and settings. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning for low-resource languages. Similarly, when PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks – on par with fully fine-tuned counterparts.
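The cross-attention composition CALM describes can be sketched as a small trainable bridge between two frozen models. Hidden sizes and the single-layer design below are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of CALM-style composition: anchor-model states attend over
# (projected) augmenting-model states; only the bridge is trained.
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    def __init__(self, d_anchor: int, d_aug: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_aug, d_anchor)   # map augmenting states over
        self.attn = nn.MultiheadAttention(d_anchor, n_heads, batch_first=True)

    def forward(self, h_anchor: torch.Tensor, h_aug: torch.Tensor) -> torch.Tensor:
        h_aug = self.proj(h_aug)
        attended, _ = self.attn(h_anchor, h_aug, h_aug)
        # Residual addition keeps both base models' behaviour intact when the
        # bridge contributes nothing, matching feature (ii) in the abstract.
        return h_anchor + attended
```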
Title: 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
Authors: Zihao Xiao, Longlong Jing, Shangxuan Wu
Abstract: 3D panoptic segmentation is a challenging perception task that aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing to novel categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results by relying solely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not achieve good performance due to poor per-mask classification quality on novel categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: an object-level distillation loss and a voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms strong baselines by a large margin.
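As a rough illustration of distilling learnable features toward frozen CLIP embeddings, in the spirit of the voxel-level loss named above (the paper's exact object- and voxel-level formulations may differ), one might write:

```python
# Minimal sketch of CLIP feature distillation via cosine distance.
import torch
import torch.nn.functional as F

def distillation_loss(pred_feats: torch.Tensor,
                      clip_feats: torch.Tensor) -> torch.Tensor:
    """pred_feats: (N, D) learnable per-voxel (or per-object) features.
       clip_feats: (N, D) frozen CLIP features for the same units."""
    pred = F.normalize(pred_feats, dim=-1)
    target = F.normalize(clip_feats.detach(), dim=-1)   # CLIP stays frozen
    return (1.0 - (pred * target).sum(dim=-1)).mean()   # mean cosine distance
```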
Title: One Shot Learning as Instruction Data Prospector for Large Language Models
Authors: Yunshui Li, Binyuan Hui, Xiaobo Xia
Abstract: Aligning large language models (LLMs) with humans is a critical step in effectively utilizing their pre-trained capabilities across a wide array of language tasks. Current instruction tuning practices often rely on expanding dataset size without a clear strategy for ensuring data quality, which can inadvertently introduce noise and degrade model performance. To address this challenge, we introduce Nuggets, a novel and efficient methodology that employs one-shot learning to select high-quality instruction data from expansive datasets. Nuggets assesses the potential of individual instruction examples to act as effective one-shot examples, thereby identifying those that can significantly enhance performance across diverse tasks. Nuggets utilizes a scoring system based on the impact of candidate examples on the perplexity of a diverse anchor set, facilitating the selection of the most beneficial data for instruction tuning. Through rigorous testing on two benchmarks, including MT-Bench and Alpaca-Eval, we demonstrate that instruction tuning with the top 1% of Nuggets-curated examples substantially outperforms conventional methods that use the full dataset. These findings advocate for a data selection paradigm that prioritizes quality, offering a more efficient pathway to align LLMs with humans.
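The scoring loop can be sketched under the assumption of a `perplexity` helper wrapping any causal LM; this is an illustration of the idea, not the authors' code.

```python
# Hedged sketch of Nuggets-style scoring: a candidate is valuable if using it
# as a one-shot example lowers perplexity across a diverse anchor set.
def perplexity(prompt: str, target: str) -> float:
    raise NotImplementedError("score `target` under a causal LM given `prompt`")

def nuggets_score(candidate: str, anchors: list[tuple[str, str]]) -> float:
    gains = []
    for instruction, answer in anchors:
        base = perplexity(instruction, answer)
        shot = perplexity(candidate + "\n\n" + instruction, answer)
        gains.append(base - shot)            # positive gain = helpful example
    return sum(gains) / len(gains)

# Rank all candidates by nuggets_score and keep roughly the top 1% for tuning.
```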
Title: Vietnamese Poem Generation & The Prospect Of Cross-Language Poem-To-Poem Translation
Authors: Triet Minh Huynh, Quan Le Bao
Abstract: Poetry generation has been a challenging task in the field of Natural Language Processing, as it requires the model to understand the nuances of language, sentiment, and style. In this paper, we propose using Large Language Models to generate Vietnamese poems of various genres from natural language prompts, thereby facilitating an intuitive process with enhanced content control. Our most effective model, the GPT-3 Babbage variant, achieves a custom evaluation score of 0.8, specifically tailored to the “luc bat” genre of Vietnamese poetry. Furthermore, we explore the idea of paraphrasing poems into normal text prompts, yielding a relatively high score of 0.781 in the “luc bat” genre. This experiment presents the potential for cross-language poem-to-poem translation, with translated poems as the inputs, while concurrently maintaining complete control over the generated content.
Title: SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval
Authors: Griffin Adams, Jason Zucker, Noémie Elhadad
Abstract: Clinicians must write a lengthy summary each time a patient is discharged from the hospital. This task is time-consuming due to the sheer number of unique clinical concepts covered in the admission. Identifying and covering salient entities is vital for the summary to be clinically useful. We fine-tune open-source LLMs (Mistral-7B-Instruct and Zephyr-7B-β) on the task and find that they generate incomplete and unfaithful summaries. To increase entity coverage, we train a smaller, encoder-only model to predict salient entities, which are treated as content plans to guide the LLM. To encourage the LLM to focus on specific mentions in the source notes, we propose SPEER: Sentence-level Planning via Embedded Entity Retrieval. Specifically, we mark each salient entity span with special “{{ }}” boundary tags and instruct the LLM to retrieve marked spans before generating each sentence. Sentence-level planning acts as a form of state tracking in that the model explicitly records the entities it uses. We fine-tune Mistral and Zephyr variants on a large-scale, diverse dataset of ~167k in-patient hospital admissions and evaluate on 3 datasets. SPEER shows gains in both coverage and faithfulness metrics over non-guided and guided baselines.
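The “{{ }}” marking step lends itself to a short sketch. The naive string matcher below is an assumption for illustration; in the paper, the spans come from the smaller salient-entity classifier.

```python
# Minimal sketch of wrapping salient entity mentions in {{ }} boundary tags
# so the LLM can be told to retrieve marked spans before each sentence.
import re

def mark_entities(note: str, salient_entities: list[str]) -> str:
    # Longest entities first, with lookaround guards, so mentions inside an
    # already-marked span are not re-wrapped (a sketch, not the paper's
    # exact span-alignment procedure).
    for ent in sorted(set(salient_entities), key=len, reverse=True):
        pattern = r"(?<!\{)" + re.escape(ent) + r"(?!\})"
        note = re.sub(pattern, "{{" + ent + "}}", note)
    return note

# mark_entities("Started aspirin for chest pain.", ["aspirin", "chest pain"])
# -> "Started {{aspirin}} for {{chest pain}}."
```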
Title: Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences
Authors: Piotr Skalski, David Sutton, Stuart Burrell
Abstract: Machine learning models underpin many modern financial systems for use cases such as fraud detection and churn prediction. Most are based on supervised learning with hand-engineered features, which relies heavily on the availability of labelled data. Large self-supervised generative models have shown tremendous success in natural language processing and computer vision, yet so far they have not been adapted to multivariate time series of financial transactions. In this paper, we present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions. Benchmarks on public datasets demonstrate that it outperforms state-of-the-art self-supervised methods on a range of downstream tasks. We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions and apply it to the card fraud detection problem on hold-out datasets. The embedding model significantly improves the value detection rate at high precision thresholds and transfers well to out-of-domain distributions.
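One plausible reading of "generative pretraining on transaction sequences" is a causal Transformer over field-tokenized transactions; the sketch below makes that assumption, with illustrative model sizes rather than the paper's configuration.

```python
# Hedged sketch: next-token pretraining over tokenized transaction fields
# (merchant, amount bucket, time bucket, ...), yielding reusable embeddings.
import torch
import torch.nn as nn

class TransactionLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(self.embed(tokens), mask=causal.to(tokens.device))
        return self.head(h)  # next-token logits; pool h for sequence embeddings
```

After pretraining, the hidden states (e.g. mean-pooled) serve as contextualised transaction embeddings for downstream tasks such as fraud detection.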
Title: Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models
Authors: Uday Allu, Biddwan Ahmed, Vishesh Tripathi
Abstract: The conventional use of the Retrieval-Augmented Generation (RAG) architecture has proven effective for retrieving information from diverse documents. However, challenges arise in handling complex table queries, especially within PDF documents containing intricate tabular structures. This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems. Our methodology involves storing PDFs in the retrieval database and extracting tabular content separately. The extracted tables undergo a process of context enrichment, concatenating headers with their corresponding values. To ensure a comprehensive understanding of the enriched data, we employ a fine-tuned version of the Llama-2-chat language model for summarisation within the RAG architecture. Furthermore, we augment the tabular data with contextual meaning using the ChatGPT 3.5 API through a one-shot prompt. This enriched data is then fed into the retrieval database alongside the other PDFs. Our approach aims to significantly improve the precision of complex table queries, offering a promising solution to a longstanding challenge in information retrieval.
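The header-value concatenation step can be illustrated directly; the function below is a minimal sketch of that enrichment, not the authors' pipeline.

```python
# Minimal sketch of header-value "context enrichment": each extracted table
# row becomes a self-contained string before being embedded for retrieval.
def enrich_table(headers: list[str], rows: list[list[str]]) -> list[str]:
    enriched = []
    for row in rows:
        pairs = [f"{h}: {v}" for h, v in zip(headers, row)]
        enriched.append("; ".join(pairs))
    return enriched

# enrich_table(["Drug", "Dose"], [["Aspirin", "75 mg"]])
# -> ["Drug: Aspirin; Dose: 75 mg"]
```

Flattening rows this way keeps each value tied to its header, so a retriever can match a query like "dose of Aspirin" even after the table's 2D layout is lost.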
Title: DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models
Authors: Songbo Hu, Xiaobin Wang, Zhangdie Yuan
Abstract: We present DIALIGHT, a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems. It facilitates systematic evaluations of, and comparisons between, ToD systems built by fine-tuning Pretrained Language Models (PLMs) and those utilising the zero-shot and in-context learning capabilities of Large Language Models (LLMs). In addition to automatic evaluation, the toolkit features (i) a secure, user-friendly web interface for fine-grained human evaluation at both the local utterance level and the global dialogue level, and (ii) a microservice-based backend, improving efficiency and scalability. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel at producing diverse and likeable responses. However, we also identify significant challenges for LLMs in adhering to task-specific instructions and generating outputs in multiple languages, highlighting areas for future research. We hope this open-sourced toolkit will serve as a valuable resource for researchers aiming to develop and properly evaluate multilingual ToD systems, and will lower the currently still high entry barriers in the field.