A Comprehensive Overview of Large Language Models

Humza Naveed (a), Asad Ullah Khan (a,∗), Shi Qiu (b,∗), Muhammad Saqib (c,d,∗), Saeed Anwar (e,f), Muhammad Usman (e,f), Naveed Akhtar (g,i),

Nick Barnes (h), Ajmal Mian (i)

(a) University of Engineering and Technology (UET), Lahore, Pakistan

(b) The Chinese University of Hong Kong (CUHK), HKSAR, China

(c) University of Technology Sydney (UTS), Sydney, Australia

(d) Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia

(e) King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia

(f) SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia

(g) The University of Melbourne (UoM), Melbourne, Australia

(h) Australian National University (ANU), Canberra, Australia

(i) The University of Western Australia (UWA), Perth, Australia

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and

beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse

topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs,

robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in

LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering

the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise

yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature

on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background

concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only

provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from

extensive informative summaries of the existing works to advance the LLM research.

Keywords:

Large Language Models, LLMs, ChatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking

1. Introduction

Language plays a fundamental role in facilitating commu-

nication and self-expression for humans, and their interaction

with machines. The need for generalized models stems from

the growing demand for machines to handle complex language

tasks, including translation, summarization, information re-

trieval, conversational interactions, etc. Recently, significant

breakthroughs have been witnessed in language models, pri-

marily attributed to transformers [1], increased computational

capabilities, and the availability of large-scale training data.

These developments have brought about a revolutionary trans-

formation by enabling the creation of LLMs that can approxi-

mate human-level performance on various tasks [2, 3]. Large

∗Equal contribution

Email addresses: humza_naveed@yahoo.com (Humza Naveed),

aukhanee@gmail.com (Asad Ullah Khan), shiqiu@cse.cuhk.edu.hk (Shi

Qiu), muhammad.saqib@data61.csiro.au (Muhammad Saqib),

saeed.anwar@kfupm.edu.sa (Saeed Anwar),

muhammad.usman@kfupm.edu.sa (Muhammad Usman),

naveed.akhtar1@unimelb.edu.au (Naveed Akhtar),

nick.barnes@anu.edu.au (Nick Barnes), ajmal.mian@uwa.edu.au

(Ajmal Mian)

Figure 1: The trend of papers released over years containing keywords "Large

Language Model", "Large Language Model + Fine-Tuning", and "Large Lan-

guage Model + Alignment".

Preprint submitted to Elsevier April 11, 2024

arXiv:2307.06435v9 [cs.CL] 9 Apr 2024

[Figure 2 timeline, horizontal axis 2019 to 2024. Models shown: T5 (Oct), mT5 (Oct), GPT-3 (May), WebGPT (Dec), OPT-IML, TK-Instruct (May), mT0 (Dec), Wizard-LM, Vicuna, Alpaca (Mar), HuaTuo (Apr), Koala (May), Wizard-Coder (Jun), Goat, PanGu-α (Apr), CPM-2 (Jun), GPT-NeoX-20B (Apr), CodeGen (Mar), Galactica (Nov), GLM (Oct), OPT, UL2 (May), LLaMA (Feb), LLaMA 2 (Jul), MPT (Jun), CodeT5+, Code Llama (Aug), StarCoder, Xuan Yuan 2.0 (May), HyperCLOVA (Sep), ERNIE 3.0, Codex (Jul), Jurassic-1 (Aug), Yuan 1.0 (Oct), Gopher (Dec), ERNIE 3.0 Titan, GLaM, LaMDA, T0 (Oct), ChatGPT (Nov), Sparrow (Sep), FLAN-U-PaLM (Oct), Bard (Oct), MT-NLG (Jan), AlphaCode (Feb), Chinchilla (Mar), PaLM (Apr), U-PALM (Oct), BLOOM (Nov), AlexaTM (Aug), PaLM2 (May), GPT-4, PanGu-Σ (Mar), BloombergGPT, Claude, Gemini (Dec).]

Figure 2: Chronological display of LLM releases: blue cards represent ‘pre-trained’ models, while orange cards correspond to ‘instruction-tuned’ models. Models

on the upper half signify open-source availability, whereas those on the bottom half are closed-source. The chart illustrates the increasing trend towards instruction-

tuned models and open-source models, highlighting the evolving landscape and trends in natural language processing research.

Language Models (LLMs) have emerged as cutting-edge arti-

ficial intelligence systems that can process and generate text

with coherent communication [4], and generalize to multiple

tasks [5, 6].

The historical progress in natural language processing (NLP)

evolved from statistical to neural language modeling and then

from pre-trained language models (PLMs) to LLMs. While

conventional language modeling (LM) trains task-specific mod-

els in supervised settings, PLMs are trained in a self-supervised

setting on a large corpus of text [7, 8, 9] with the aim of learning

a generic representation that is shareable among various NLP

tasks. After fine-tuning for downstream tasks, PLMs surpass

the performance of traditional language modeling (LM).

The larger PLMs bring more performance gains, which has led

to the transitioning of PLMs to LLMs by significantly increas-

ing model parameters (tens to hundreds of billions) [10] and

training dataset (many GBs and TBs) [10, 11]. Following this

development, numerous LLMs have been proposed in the lit-

erature [10, 11, 12, 6, 13, 14, 15]. An increasing trend in the

number of released LLMs and names of a few significant LLMs

proposed over the years are shown in Fig 1 and Fig 2, respec-

tively.

The early work on LLMs, such as T5 [10] and mT5 [11],

employed transfer learning until GPT-3 [6] showed that LLMs are

zero-shot transferable to downstream tasks without fine-tuning.

LLMs accurately respond to task queries when prompted with

task descriptions and examples. However, pre-trained LLMs

fail to follow user intent and perform worse in zero-shot set-

tings than in few-shot. Fine-tuning them with task instruc-

tions data [16, 17, 18, 19] and aligning with human prefer-

ences [20, 21] enhances generalization to unseen tasks, im-

proving zero-shot performance significantly and reducing mis-

aligned behavior.

In addition to better generalization and domain adaptation,

LLMs appear to have emergent abilities, such as reasoning,

planning, decision-making, in-context learning, answering in

zero-shot settings, etc. These abilities are attributed to the

gigantic scale of LLMs, emerging even though the pre-trained

LLMs are not trained specifically to possess these

attributes [22, 23, 24]. Such abilities have led LLMs to be widely

adopted in diverse settings, including multi-modal, robotics,

tool manipulation, question answering, autonomous agents, etc.

Various improvements have also been suggested in these areas

either by task-specific training [25, 26, 27, 28, 29, 30, 31] or

better prompting [32].

The ability of LLMs to solve diverse tasks with human-level

performance comes at the cost of slow training and inference,

extensive hardware requirements, and higher running costs.

Such requirements have limited their adoption and opened up

opportunities to devise better architectures [15, 33, 34, 35]

and training strategies [36, 37, 21, 38, 39, 40, 41]. Param-

eter efficient tuning [38, 41, 40], pruning [42, 43], quantiza-

tion [44, 45], knowledge distillation, and context length inter-

,

to answer queries beyond the capacity ac-

quired during training [6, 55]. These emergent abilities allow

for adapting the model without fine-tuning—a costly process.

Aside from this, hallucination, i.e., producing inaccurate, unsafe,

or factually incorrect responses, is common for LLMs and can be

mitigated by augmenting the input with contextual data. While the

user can provide in-context samples in the query [54, 32], here we

specifically refer to the methods that access external storage

programmatically, calling them augmented LLMs.

The literature suggests various external memory designs to aug-

ment LLMs, long-term [181, 182, 183, 184], short-term [185],

symbolic [186], and non-symbolic [187, 188]. The memory

can be maintained in different formats such as documents, vec-

tors, or databases. A few systems maintain intermediate mem-

ory representations to retain information across multiple iter-

ations [184, 182], while others extract important information

from the datasets and save it in memory for recall [189]. The

memory read and write operations are performed either with

or without the LLM's cooperation [182, 190, 184, 191], with the

memory acting as a feedback signal in [185]. We discuss different

types of augmented LLMs below.

3.4.1. Retrieval Augmented LLMs

LLMs may have limited memory and outdated information,

leading to inaccurate responses. Retrieving relevant information

from external up-to-date storage enables LLMs to

answer accurately with references and to utilize more information.

With retrieval augmentation, smaller models have been

shown to perform on par with larger models. For instance, an

11B model becomes competitive with the 540B PaLM in [25], and a

7.5B model with the 280B Gopher in [183]. Retrieval augmented

language modeling (RALM) has two major components, shown in

Figure 12, namely: 1) retriever and 2) language model. In

RALM, the retriever plays a crucial role in driving the LLM

response, where incorrect information can steer LLMs to false

behavior. This has led to the development of various methods to

retrieve accurate information and fuse it with the query for better

performance.

Figure 12: A flow diagram of Retrieval Augmented LLMs. The retriever

extracts a similar context to the input and forwards it to the LLM either in simple

language or encoded through Fusion-in-Decoder (FiD). Depending on the task,

retrieval and generation may repeat multiple times.

Zero-Shot Retrieval Augmentation: This kind of augmentation

keeps the original LLM architecture and weights

unchanged and uses BM25 [192], nearest neighbors, or frozen

pre-trained models like BERT [7] as a retriever. The retrieved

information is provided as input to the model for response

generation, which has been shown to improve performance over

LLMs without retrieval [188, 193]. In some scenarios, multiple retrieval

iterations are required to complete the task. The output

generated in the first iteration is forwarded to the retriever

to fetch similar documents. Forward-looking active retrieval

(FLARE) [187] initially generates the response and corrects

the output by retrieving relevant documents if the response

contains low-confidence tokens. Similarly, RepoCoder [194]

fetches code snippets recursively for code completion.
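
The zero-shot setting above can be illustrated with a small sketch: a BM25 retriever ranks passages, and the best ones are placed in the prompt of an unchanged LLM. This is only a toy under stated assumptions: the corpus, the prompt template, the `build_prompt` helper, and the `k1`/`b` defaults are illustrative choices, not taken from any of the cited systems, and the LLM call itself is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a real system would use a proper tokenizer."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.tokens = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many documents each term occurs.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query, index):
        tf = Counter(self.tokens[index])
        dl = len(self.tokens[index])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            denom = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / denom
        return total

    def top_k(self, query, k=1):
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in ranked[:k]]


def build_prompt(query, retrieved):
    """Hypothetical template: retrieved passages are prepended to the query."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = [
    "Gopher is a 280B parameter language model.",
    "BM25 is a bag-of-words ranking function.",
    "RETRO retrieves chunks from a large text database.",
]
query = "which ranking function is bag-of-words"
retriever = BM25(corpus)
prompt = build_prompt(query, retriever.top_k(query, k=1))
```

The LLM weights are never touched here; only the input changes, which is what distinguishes this setting from the trained variants discussed next.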

Training with Retrieval Augmentation: To reduce failures in

retrieval-augmented generation (RAG), researchers train or

fine-tune retrievers and LLMs with a retrieval augmentation

pipeline. We discuss the literature below based on their focus

on the respective training processes of the pipeline.

Training LLM: Retrieval-enhanced transformer (RETRO) [183]

shows that pre-training smaller LLMs with a RAG pipeline

outperforms larger LLMs, such as GPT-3, trained without RAG.

RETRO uses a 2-trillion token subset of MassiveText as

a database. The retrieval pipeline divides the input query

into subsets and retrieves relevant chunks from the database

for each subset, which are encoded together with the intermediate

representations of the input for generating tokens. It uses chunked

cross-attention to attend to previous chunks auto-regressively. A

study on RETRO [195] shows models pre-trained without RAG

but fine-tuned using RAG lack the performance gains obtained

by pre-training with RAG.

Training Retriever: Quality of responses generated by LLMs

is highly dependent on the in-context examples. There-

fore, [196, 197, 198, 199] train retrievers to retrieve accurate

few-shot samples while keeping the LLM frozen for gener-

ation. Retrieved samples are ranked to build ground-truth

data to train retrievers with contrastive learning in [196, 198].

RoBERTa is trained for downstream tasks in [197] for ICL

samples retrieval. REPLUG [199] trains the retriever with

supervised signals from the frozen LLM-generated outputs.

Training Retriever and LLM: Further benefits are achieved by

training both the retriever and the model in [25, 200, 201]. In

this case, the error propagates back to the retriever, updating

both the language model and the retriever. While masked

language modeling (MLM) is a common pre-training objec-

tive [25, 201], retrieval pre-trained transformer (RPT) [200]

used document chunk prediction as a pre-training objective for

long text modeling.

Encoded Context Augmentation: Concatenating retrieved

documents with the query becomes infeasible as the sequence

length and sample size grow. Encoding the context and fusing

it with the decoder (Fusion-in-Decoder) using cross-attention

makes it possible to augment more samples without increasing

computation costs significantly [202, 183, 200, 25].
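
A rough way to see why encoding passages separately scales better is to count pairwise attention interactions. The sketch below assumes self-attention cost grows with the square of the sequence length and ignores constants, the decoder, and feed-forward layers; the function names and the token counts are illustrative, not measurements from the cited works.

```python
def concat_attention_cost(num_passages, passage_len, query_len):
    """Self-attention pairs when all passages are concatenated with the query."""
    total = query_len + num_passages * passage_len
    return total * total


def fid_encoder_attention_cost(num_passages, passage_len, query_len):
    """Self-attention pairs when each (query, passage) pair is encoded on its own,
    as in Fusion-in-Decoder; the decoder then cross-attends to all encodings."""
    per_passage = query_len + passage_len
    return num_passages * per_passage * per_passage


for k in (1, 10, 100):
    concat = concat_attention_cost(k, passage_len=250, query_len=20)
    fid = fid_encoder_attention_cost(k, passage_len=250, query_len=20)
    print(f"passages={k:3d} concat={concat:,} fid_encoder={fid:,}")
```

Under these assumptions the concatenated cost grows quadratically in the number of passages, while the separately encoded cost grows linearly.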

Web Augmented: Locally stored memory, but external to

LLM, has limited information. However, a large amount of

information is available on the internet, which gets updated

regularly. Rather than storing information locally, various

methods retrieve query-related context through a web search

and forward it to LLMs [203, 204, 156].

3.4.2. Tool Augmented LLMs

While RAG relies on the retriever to provide context to the

LLM to answer queries, tool augmented LLMs capitalize on the

reasoning abilities of LLMs to iteratively plan by dividing tasks

into sub-tasks, selecting necessary tools, and taking actions to

complete the task [205, 206, 207, 27]. A generic pipeline of

tool-augmented LLMs is shown in Figure 13, where different

modules are selected in a loop until task

completion.

Zero-Shot Tool Augmentation: LLMs' in-context learning

and reasoning abilities enable them to interact with tools

without training. Automatic reasoning and tool-use (ART) [207]

builds a task library with demonstrations of reasoning steps and

calling external tools. It retrieves similar task examples and

provides the context to the LLM for inference. Aside from

this, [208] shows tool documentation is enough to teach LLMs

to use tools without demonstrations. RestGPT [209] integrates

LLMs with RESTful APIs by decomposing tasks into planning

and API selection steps. The API selector understands the API

documentation to select a suitable API for the task and plan the

execution. ToolkenGPT [210] uses tools as tokens by concate-

nating tool embeddings with other token embeddings. During

inference, the LLM generates the tool tokens representing the

tool call, stops text generation, and restarts using the tool exe-

cution output.
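
The generate, pause, execute, resume loop described for tool tokens can be sketched as follows. Everything here is a stand-in: `stub_llm` is a scripted function rather than a real model, `<calc>` and `<eos>` are hypothetical token names, and the calculator is a toy tool. The sketch shows only the control flow, not how ToolkenGPT learns tool embeddings.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def calculator(expression):
    """Toy tool: evaluates 'a op b' on integers."""
    left, op, right = expression.split()
    return OPS[op](int(left), int(right))


TOOLS = {"<calc>": calculator}  # hypothetical tool token -> tool function


def stub_llm(context):
    """Scripted stand-in for an LLM: returns the next token for a context."""
    last = context[-1]
    if last == "<calc>":
        return "12 * 7"   # arguments for the tool call
    if last == "?":
        return "It"
    if last == "It":
        return "is"
    if last == "is":
        return "<calc>"   # the model decides a tool is needed
    return "<eos>"


def generate(llm, prompt_tokens, tools, max_steps=20):
    context = list(prompt_tokens)
    for _ in range(max_steps):
        token = llm(context)
        if token == "<eos>":
            break
        if token in tools:
            # Tool token: stop text generation, obtain arguments, run the tool,
            # then resume generation with the tool output in the context.
            args = llm(context + [token])
            context.append(str(tools[token](args)))
        else:
            context.append(token)
    return context[len(prompt_tokens):]


answer = generate(stub_llm, ["What", "is", "12", "times", "7", "?"], TOOLS)
```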

Training with Tool Augmentation: LLMs are trained to inter-

act with diverse tools, enhancing planning abilities to overcome

the limitations of zero-shot tool augmentation [211, 27, 212,

213]. Gorilla [211] instruction-tunes LLaMA with information

retrieval from API documentation. It uses the self-instruct [19]

data generation pipeline with GPT-4 by providing in-context

examples retrieved from API documentation. Tool augmented

language model (TALM) [27] fine-tunes T5 [10] for tool use

with a self-play approach, where it iteratively completes tool

manipulation tasks and includes them back in the training set.

ToolLLM [213] collects 16k APIs from RapidAPI. It samples

APIs from the list to generate an instruction-tuning dataset

using ChatGPT in single-tool and multi-tool scenarios. For

high-quality datasets, ToolLLM suggested a depth-first search-based

decision tree (DFSDT) method to generate ground-truths with

diverse reasoning and planning.

Figure 13: A basic flow diagram of tool augmented LLMs. Given an input and

a set of available tools, the model generates a plan to complete the task. The

tool augmented LLMs utilize different modules iteratively, such as retriever,

tool execution, read-write to memory, feedback, etc., depending on the task.

Multimodal Tool Augmentation: The compositional reasoning

capacity of LLMs allows them to manipulate tools in multi-

modal settings [205, 206, 214]. Following the pipeline shown

in Figure 13, the LLM outlines a plan, generally executing in a

sequence: Plan → Tool selection → Execute → Inspect →

Generate, to respond to the user query. Here, the database of

tools is rich in modalities, including text, images, etc. Many of

the multimodal tool augmentation systems employ multimodal

LLMs [31, 215, 214, 206], while others utilize single modality

LLMs and generate a plan on using different modality tools to

solve multimodal queries [216].

3.5. LLMs-Powered Agents

AI agents are autonomous entities, capable of planning,

decision-making, and performing actions to achieve complex

goals. In the early days, AI agents were rule-based, de-

signed for narrow tasks, and had limited capabilities, such

as Clippy [217] and Deep Blue [218]. In contrast to this,

LLMs' abilities to respond to dynamic scenarios have made it

possible to incorporate them in diverse applications, including

LLMs-powered agents [214, 206], where LLMs behave

as the brain of agents. LLMs have been incorporated in web

agents [156, 157], coding agents [219], tool agents [27, 213],

embodied agents [26], and conversational agents [185], requiring

minimal to no fine-tuning. Below we summarize the

research in LLMs-based autonomous agents. For a more detailed

discussion, please refer to [220, 221].

LLMs Steering Autonomous Agents: LLMs are the cognitive

controllers of the autonomous agents. They generate plans, rea-

son about tasks, incorporate memory to complete tasks, and

adapt the outline depending on the feedback from the environ-

ment. Depending on the acquired capabilities of LLMs, many

methods fine-tune, propose a better prompting approach, or uti-

lize different modules to enhance agents’ performance. Mod-

ules and strategies employed in autonomous agents are briefly

discussed below.

Planning and Reasoning: Completing a complex task requires

human-like logical thinking, planning the necessary steps, and

reasoning about current and future directions. Prompting methods

like chain-of-thought [103], tree-of-thoughts [105], and

self-consistency [104] are central to agents, eliciting LLMs to

reason about their actions and choose among different paths for task

completion. When LLMs are prompted with a task description and

a sequence of actions, they can accurately generate action

plans without any fine-tuning [222]. Reasoning via planning

(RAP) [223] incorporates a re-purposed LLM as a world model

to reason about future outcomes and explore alternative paths

for task completion. Retroformer [224] uses a retrospective

LLM to improve main LLM planning and reasoning capabil-

ities by providing helpful task cues.
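
Of the prompting methods named above, self-consistency is the simplest to illustrate: several reasoning paths are sampled and the most frequent final answer is kept. The sketch below assumes the final answers have already been extracted from the sampled outputs; the sample values are made up for illustration.

```python
from collections import Counter


def self_consistency(sampled_answers):
    """Majority vote over final answers from independently sampled reasoning paths.

    Returns the winning answer and its share of the votes.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["18", "18", "26", "18", "17"]
best_answer, agreement = self_consistency(samples)
```

The agreement ratio can also serve as a rough confidence signal for the agent, although that use is an assumption of this sketch rather than a claim of the cited work.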

Feedback: LLMs in open-loop systems generate plans and as-

sume that the agent will complete them successfully. However,

the actual scenario is different with failures and variable re-

sponses from the environment. To correctly complete tasks,

many methods use LLMs in a closed-loop where the action re-

sponse is provided as feedback to the LLMs to re-assess and

update the plan as required [225, 226, 227, 185]. Another di-

rection of research exploits LLMs as reward functions to train

reinforcement learning (RL) policies instead of humans [228].
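
The difference between open-loop and closed-loop use can be made concrete with a toy loop in which the response to each action is fed back to the planner. The planner below is a hard-coded stand-in for an LLM, and the door environment, action strings, and function names are all hypothetical.

```python
def make_environment():
    """Toy environment: a door that must be unlocked before it can be opened."""
    state = {"unlocked": False}

    def step(action):
        if action == "unlock door":
            state["unlocked"] = True
            return False, "door is unlocked"
        if action == "open door":
            if state["unlocked"]:
                return True, "door is open"
            return False, "door is locked"
        return False, "unknown action"

    return step


def stub_planner(goal, feedback):
    """Stand-in for an LLM planner that revises its next action from feedback."""
    if feedback == "door is locked":
        return "unlock door"
    return "open door"


def run_closed_loop(planner, environment, goal, max_steps=5):
    feedback, history = None, []
    for _ in range(max_steps):
        action = planner(goal, feedback)
        # The environment response becomes the feedback for the next decision.
        success, feedback = environment(action)
        history.append((action, success))
        if success:
            break
    return history


history = run_closed_loop(stub_planner, make_environment(), goal="enter the room")
```

An open-loop planner would emit "open door" once and assume success; the closed loop recovers after the first failure.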

Memory: LLMs can learn from the context provided in the

prompt. In addition to internal memory, various systems em-

ploy external memory to save the response history. Reflex-

ion [185] maintains an episodic memory to use previous re-

sponses as feedback to improve future decision-making. Retro-

former [224] improves its responses by employing short-term

and long-term memory, where short-term memory contains re-

cent responses and long-term memory keeps summarized failed

attempts to add in the prompt as reflection.

Multi-Agents Systems: LLMs can play user-defined roles and

behave like a specific domain expert. In multi-agent systems,

each LLM is assigned a unique role, simulating human behav-

ior and collaborating with other agents to complete a complex

task [219, 229].

LLMs in Physical Environment: LLMs are good at

instruction-following; however, utilizing them for physically

grounded tasks requires adaptation, as they lack real-world

knowledge. This could lead to generating illogical responses

for a particular physical situation [230, 26]. SayCan [230]

makes LLMs aware of the available low-level task operations.

LLM (Say) builds a high-level plan to complete the task and

a learned affordance function (Can) explores the possibility of

executing the plan in the real world. SayCan uses RL to train

the language-conditioned affordance function. PaLM-E enables

the LLM to solve grounded tasks by training a multi-modal LLM

that takes inputs directly from the sensors.
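
The Say and Can split can be sketched as a scoring rule: the language model rates how useful each low-level skill is for the instruction, the affordance function rates how likely the skill is to succeed in the current state, and the two scores are multiplied. The product rule follows our reading of SayCan; the skills and all numbers below are invented for illustration.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill with the highest (usefulness x feasibility) score."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get), combined


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {"pick up sponge": 0.6, "go to sink": 0.3, "pick up apple": 0.1}
# Hypothetical affordances: the sponge is not within reach right now.
affordances = {"pick up sponge": 0.1, "go to sink": 0.9, "pick up apple": 0.8}

best_skill, combined_scores = select_skill(llm_scores, affordances)
```

The skill the language model prefers ("pick up sponge") loses because it is not currently feasible, which is the grounding effect the paragraph describes.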

Manipulation: In the area of manipulation [226, 231], LLMs

enhance a robot’s dexterity and adaptability, excelling in tasks

like object recognition, grasping, and collaboration. They ana-

lyze visual and spatial information to determine the most effec-

tive approach to interact with objects.

Navigation: LLMs enhance a robot’s ability to navigate com-

plex environments with precision and adaptability [232, 233,

234, 235]. They generate feasible paths and trajectories for

robots, accounting for intricate environmental details [236].

This ability is valuable in scenarios requiring precise and

dynamically adaptable navigation in environments like ware-

houses, transport, healthcare facilities, and residences.

3.6. Efficient LLMs

Deploying LLMs in production is expensive. Reducing their

running costs while preserving performance is an appealing

area of research. This section summarizes the approaches sug-

gested to enhance LLMs’ efficiency.

3.6.1. Parameter Efficient Fine-Tuning

Fine-tuning LLMs with tens or hundreds of billions of parameters,

such as GPT-3 (175B), BLOOM (176B), MT-NLG

(530B), etc., is computationally intensive and time-consuming.

To avoid complete model fine-tuning, numerous parameter-

efficient fine-tuning (PEFT) techniques [40, 237, 41, 38, 39] try

to achieve acceptable model fine-tuning performance at reduced

costs. As compared to full fine-tuning [238], PEFT performs

better in low-resource setups, achieves comparable perfor-

mance on medium-resource scenarios, and performs worse than

full fine-tuning under high-resource availability. An overview

of different PEFT approaches is shown in Figure 14.

Adapter Tuning: Adds a few trainable parameters within the

transformer block. The adapter layer is a sequence of feature

downscaling, non-linearity, and upscaling [106]. Variants of

adapter tuning inject adapter layers sequentially [106] and in

parallel [38], whereas the mixture of adapter (AdaMix) [239]

employs multiple adapter modules in a single layer. AdaMix

routes input instances randomly to one of the multiple down-

scale and upscale modules. The mixture of adapters is averaged

out for inference to avoid additional latency. Low-Rank Adaptation

(LoRA) [240] learns low-rank decomposed matrices while keeping the

original weights frozen. The learned weights are fused with the

original weights for inference, avoiding latency.
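
A small numeric sketch may help with the LoRA description: the frozen weight W is augmented by a low-rank product B·A during training, and the product can be folded into W afterwards so that inference uses a single matrix. The matrices, the `scaling` value, and the helper names are illustrative assumptions; plain Python lists are used to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[s * x for x in row] for row in a]


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]   # frozen weight, d_out=2, d_in=3
B = [[0.5], [-1.0]]                      # trainable, d_out x rank (rank=1)
A = [[1.0, 0.0, 2.0]]                    # trainable, rank x d_in
scaling = 1.0                            # stands in for alpha / rank
x = [[1.0], [2.0], [3.0]]                # input as a column vector

# Training-time forward pass: frozen path plus low-rank update path.
h_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Inference: fuse the update into the weight once, then use one matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
h_merged = matmul(W_merged, x)
```

For a 4096 x 4096 layer with rank 8, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix.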

Prompt Tuning: Prompting is an effective way to adapt a

pre-trained LLM for the downstream task. However, manual

prompts bring uncertainty in the model’s prediction, where a

change in a single word drops the performance [237]. Prompt

tuning alleviates this problem by fine-tuning only 0.001%-3%

additional parameters [241]. It concatenates trainable prompt

parameters with the model embeddings [237, 40, 241].

Task-specific fixed discrete prompts are concatenated with input

embeddings in [40]. As discrete prompts bring instability, prompts

are encoded through a learnable mapping in P-Tuning [237],

yielding continuous prompts, which are appended to the

discrete prompts. Only the prompt encoder is trainable in the

model. In an extension of P-Tuning, continuous prompts are

concatenated with each layer of the network in [241]. Progres-

sive prompts [242] avoid catastrophic forgetting and transfer

previously learned knowledge by sequentially adding trainable

prompt embeddings to the previously frozen task embeddings.

Prefix Tuning: A set of trainable task-specific prefix vectors

is prepended to the frozen transformer layers in prefix

tuning [41]. The prefix vectors are virtual tokens attended by the

context tokens on the right. In addition, adaptive prefix

tuning [243] applies a gating mechanism to control the information

from the prefix and actual tokens.

Bias Tuning: Fine-tuning only the bias terms on small to medium

training data has been found effective in BitFit [244]. This

method matches full fine-tuning performance for tasks with less

training data and achieves comparable performance with more training

data.

3.6.2. Quantization

LLMs require extensive computing and memory for infer-

ence. Deploying a 175B parameter GPT-3 model needs at

least 5x80GB A100 GPUs and 350GB of memory to store in

FP16 format [44]. Such demanding requirements for deploying

LLMs make it harder for smaller organizations to utilize them.

Model compression is an effective solution but comes at the cost

of degraded performance, especially at large scales greater than

6B. These models exhibit very large magnitude outliers that do

not exist in smaller models [245], making it challenging and re-

quiring specialized methods for quantizing LLMs [44, 246].
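
The 350GB and 5-GPU figures quoted above follow from simple arithmetic on the weights alone, which the sketch below reproduces. It assumes decimal gigabytes and counts only parameter storage; activations, the KV cache, and framework overhead are deliberately left out, so real deployments need more. The function names are illustrative.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus(memory_gb, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)


fp16_gb = weight_memory_gb(175e9, bytes_per_param=2)  # FP16: 2 bytes per weight
int8_gb = weight_memory_gb(175e9, bytes_per_param=1)  # INT8: 1 byte per weight

fp16_gpus = min_gpus(fp16_gb)
int8_gpus = min_gpus(int8_gb)
```

Halving the bytes per weight halves the storage, which is the basic motivation for the quantization methods that follow.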

Post-Training Quantization: Minimal or no training is

required in this type of quantization, without significantly

compromising the model performance. LLM-8-bit [245] uses

full-precision matrix multiplication for weights associated with

outlier features and 8-bit for remaining features. The lower

precision multiplication outputs are converted to FP16 and

concatenated with others. The quantized models have homogeneous

word embeddings, which may degrade their performance. To

fix this, token-level knowledge distillation is employed in [45]

along with independent quantization scaling factors for each

module due to varying weight distribution. Feature distributions

are asymmetric and appear in different channels; outlier

suppression [247] shifts and scales per-channel activation

distributions for effective quantization. SmoothQuant [44]

quantizes activations and weights to INT8 format by smoothing

activations and migrating the quantization difficulty toward

weights. It multiplies the inverse of the smoothing factor with

weights, which introduces a few outliers in the weights but is

easier to quantize than unsmoothed activations. OPTQ [246]

uses the optimal brain compression (OBC) [248] algorithm to

quantize the model layer-by-layer and update weights to

compensate for quantization error. To improve speed and

performance, OPTQ updates weights in arbitrary order, employs

lazy updates, and uses better Cholesky kernels. Outlier-aware

weight quantization (OWQ) [249] uses the OPTQ algorithm for

quantization but assigns higher precision to vulnerable weights

that cause outliers and lower precision to the others.

Figure 14: Illustration of parameter-efficient fine-tuning paradigms, where x is input and h is hidden state, figure courtesy [38]. Parallel adapter and LoRA fall in

the adapter tuning category.
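
The idea of migrating quantization difficulty from activations to weights, as described for SmoothQuant, can be sketched with a per-channel rescaling that leaves the layer output unchanged. The sketch assumes the convention of dividing activation channel j by a factor s_j and multiplying the matching weight row by s_j, with s_j = max|X_j|^alpha / max|W_j|^(1-alpha); the matrices and alpha = 0.5 are illustrative, all channels are assumed non-zero, and no actual INT8 rounding is performed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_abs(m):
    return max(abs(v) for row in m for v in row)


def smooth(X, W, alpha=0.5):
    """Rescale activations X (tokens x channels) and weights W (channels x out).

    Each activation channel j is divided by s[j] and weight row j is multiplied
    by s[j], so X @ W is mathematically unchanged.
    """
    channels = len(W)
    s = []
    for j in range(channels):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(channels)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(channels)]
    return X_s, W_s, s


X = [[1.0, 64.0], [-2.0, 32.0]]    # channel 1 carries activation outliers
W = [[0.5, -1.0], [0.25, 1.0]]

X_s, W_s, s = smooth(X, W)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
```

In this toy case the largest activation magnitude drops from 64 to 8 while the largest weight magnitude rises from 1 to 8, so both tensors end up with a similar range to quantize.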

Quantization-Aware Training: To compensate for perfor-

mance degradation, a quantized model is fine-tuned in

quantization-aware training (QAT) [250, 251, 252]. Al-

pha Tuning quantizes the model using binary coding quan-

tization (BCQ) [253] and fine-tunes only quantization scal-

ing factors. This approach improves performance over

parameter-efficient fine-tuning of the pre-trained model. Sim-

ilarly, parameter-efficient and quantization-aware adaptation

(PEQA) [254] reduces the precision of fully-connected layers

and fine-tunes only quantization scaling parameters. LLM-

QAT [252] generates training data from the pre-trained network

and trains a quantized student model with knowledge distilla-

tion. QLoRA [251] fine-tunes 4-bit quantized pre-trained LLM

with LoRA [240] using a 4-bit normal float, which shows better

performance over a 4-bit integer and float.

3.6.3. Pruning

Pruning is an alternative approach to quantization to compress

model size, thereby reducing LLM deployment costs

significantly. Compared to task-agnostic pruning, task-specific

pruning is easily achievable with good performance, where a

model is fine-tuned on the downstream task and pruned for

faster inference. It is possible to prune LLMs for individual

tasks, but the cost of pruning and deploying task-specific mod-

els is high. To overcome this, many structured and unstructured

pruning methods for LLMs have been proposed to maintain rea-

sonable performance across all tasks while shrinking the model

size [255, 42, 256].

Unstructured Pruning: This kind of pruning removes less im-

portant weights without maintaining any structure. Existing

LLM pruning methods take advantage of the unique charac-

teristics of LLMs, uncommon for smaller models, where a

small subset of hidden states are activated with large magni-

tude [245]. Pruning by weights and activations (Wanda) [255]

prunes weights in every row based on importance, calculated

by multiplying the weights with the norm of input. The pruned

model does not require fine-tuning, thereby saving computational

costs. Outlier weighted layerwise sparsity (OWL) [257]

extends Wanda with non-uniform layer pruning. It shows that

the number of outliers varies across layers; therefore,

each layer should have its own pruning ratio for better

performance. Contrastive pruning (CAP) [43] iteratively

prunes the model by training the sparse model using

contrastive loss between pre-trained, fine-tuned, and snapshots of

previous sparse models to learn task-specific and task-agnostic

knowledge.
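
The Wanda importance score described above, weight magnitude multiplied by the norm of the corresponding input feature, can be sketched in a few lines. The sketch assumes a weight matrix of shape (outputs x inputs), a small batch of calibration inputs, the L2 norm, and a fixed fraction pruned per row; the numbers are made up to show how the result differs from plain magnitude pruning.

```python
import math


def wanda_prune(W, X, sparsity=0.5):
    """Zero the lowest-scoring weights in every row of W.

    W: outputs x inputs weight matrix. X: samples x inputs calibration data.
    Score of W[i][j] is |W[i][j]| times the L2 norm of input feature j.
    """
    n_in = len(W[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for w_row in W:
        scores = [abs(w) * norms[j] for j, w in enumerate(w_row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


W = [[0.9, 0.1, -0.5, 0.2]]
# Feature 1 has large activations, features 0 and 2 have tiny ones.
X = [[0.1, 10.0, 0.1, 1.0],
     [0.1, 10.0, 0.1, 1.0]]

W_pruned = wanda_prune(W, X, sparsity=0.5)
```

Magnitude-only pruning would remove the two smallest weights (0.1 and 0.2); weighting by input norm instead keeps them, because they multiply large activations, and removes 0.9 and -0.5.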

Structured Pruning: Here, the parameters are removed in

groups, rows, columns, or matrices, which speeds up the

inference because of effective hardware tensor core utiliza-

tion [255]. LLM-Pruner [42] employs a 3-stage structured

pruning strategy, identifying the groups of hidden states caus-

ing each other to activate during the forward-pass, keeping im-

portant groups and removing less important ones, and fine-

tuning the pruned model with LoRA. Sparsity-induced mask

learning (SIMPLE) [258] prunes the network using learnable

masks. Similarly, another method prunes LLMs by learning

masks and removing unimportant rank-1 components of the

factorized weight matrix [256].

3.7. Multimodal LLMs

Inspired by the success of LLMs in natural language process-

ing applications, an increasing number of research works are

now facilitating LLMs to perceive different modalities of infor-

mation like image [259, 260, 261], video [262, 263, 264], au-

dio [265, 264, 266], etc. Multimodal LLMs (MLLMs) present

substantial benefits compared to standard LLMs that process

only text. By incorporating information from various modal-

ities, MLLMs can achieve a deeper understanding of context,

leading to more intelligent responses infused with a variety of

expressions. Importantly, MLLMs align closely with human

perceptual experiences, leveraging the synergistic nature of our

multisensory inputs to form a comprehensive understanding of

the world [266, 26]. Coupled with a user-friendly interface,

MLLMs can offer intuitive, flexible, and adaptable interactions,

allowing users to engage with intelligent assistants through a

spectrum of input methods. According to the ways of construct-

ing models, current MLLMs can be

generally divided into three

streams: pre-training, fine-tuning, and prompting. In this sec-

tion, we will discuss more details of these main streams, as well

as the important application of MLLMs in visual reasoning.

Pre-training: This stream of MLLMs intends to support differ-

ent modalities using unified end-to-end models. For instance,

Flamingo [259] applies gated cross-attention to fuse vision and

language modalities, which are collected from pre-trained and

frozen visual encoder and LLM, respectively. Moreover, BLIP-

2 [260] proposes a two-stage strategy to pre-train a Querying

Transformer (Q-Former) for the alignment between vision and

language modalities: in the first stage, vision-language repre-

sentation learning is bootstrapped from a frozen visual encoder;

and in the second stage, a frozen LLM bootstraps vision-to-

language generative learning for zero-shot image-to-text gen-

eration. Similarly, MiniGPT-4 [267] deploys pre-trained and

frozen ViT [268], Q-Former and Vicuna LLM [149], only train-

ing the linear projection layer for vision and language modali-

ties alignment.

Fine-tuning: Derived from instruction tuning [16] for NLP

tasks [20, 16, 97], researchers fine-tune pre-trained LLMs

using multimodal instructions. Following this method, LLMs

can be easily and effectively extended as multimodal chat-

bots [267, 261, 29] and multimodal task solvers [269, 30, 270].

The key issue of this stream of MLLMs is to collect multi-

modal instruction-following data for fine-tuning [58]. To ad-

dress this issue, the solutions of benchmark adaptation [269,

271, 272], self-instruction [19, 31, 273], and hybrid composi-

tion [274, 270] are employed, respectively. To mitigate the gap

between the original language modality and additional modal-

ities, the learnable interface is introduced to connect differ-

ent modalities from frozen pre-trained models. Particularly,

the learnable interface is expected to work in a parameter-

efficient tuning manner: e.g., LLaMA-Adapter [275] applies

an efficient transformer-based adapter module for training,

and LaVIN [274] dynamically learns the multimodal feature

weights using a mixture-of-modality adapter. Different from

the learnable interface, the expert models can directly convert

multimodalities into language: e.g., VideoChat-Text [262] in-

corporates Whisper [276], a speech recognition expert model,

to generate captions of the given videos, which the subsequent

LLM then takes as input.

Prompting: Different from the fine-tuning technique that

directly updates the model parameters given task-specific

datasets, the prompting technique provides certain context, ex-

amples, or instructions to the model, fulfilling specialized tasks

without changing the model parameters. Since prompting can

significantly reduce the need for large-scale multimodal data,

this technique is widely used to construct MLLMs. Particularly,

to solve multimodal Chain of Thought (CoT) problems [103],

LLMs are prompted to generate both the reasoning process and

the answer given multimodal inputs [277]. On this front, differ-

ent learning paradigms are exploited in practice: for example,

Multimodal-CoT [277] involves two stages of rationale genera-

tion and answer inference, where the input of the second stage

is a combination of the original input and the output of the first

stage; and CoT-PT [278] applies both prompt tuning and spe-

cific visual bias to generate a chain of reasoning implicitly. In

addition to CoT problems, LLMs can also be prompted with

multimodal descriptions and tools, effectively dividing complex

tasks into sub-tasks [279, 280].
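The two-stage flow described above for Multimodal-CoT (rationale generation, then answer inference on the original input combined with the rationale) can be sketched as follows. The function `multimodal_cot` and the two toy model stand-ins are hypothetical; they only illustrate the data flow and are not the models or interfaces used in [277].

```python
def multimodal_cot(question, image_features, rationale_model, answer_model):
    """Two-stage inference: generate a rationale, then infer the answer."""
    # Stage 1: rationale generation from the original multimodal input.
    rationale = rationale_model(question, image_features)
    # Stage 2: the original input is combined with the stage-1 output.
    augmented_question = f"{question}\nRationale: {rationale}"
    answer = answer_model(augmented_question, image_features)
    return rationale, answer

# Hypothetical stand-ins for the two models, used only to show the data flow.
def toy_rationale_model(text, image_features):
    return f"The image shows {image_features['object']}."

def toy_answer_model(text, image_features):
    grounded = "Rationale:" in text and image_features["object"] in text
    return "yes" if grounded else "unknown"

rationale, answer = multimodal_cot(
    "Is there a cat?", {"object": "a cat"}, toy_rationale_model, toy_answer_model
)
```

Neither stage updates model parameters here; a fine-tuned variant would train both stages, while a prompting-only variant would call a frozen model twice.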

Visual Reasoning Application: Recent visual reasoning sys-

tems [281, 282, 206, 283] tend to apply LLMs for better visual

information analysis and visual-language integration. Differ-

ent from previous works [284, 285] that rely on limited VQA

datasets and small-scale neural networks, current LLM-aided

methods offer benefits of stronger generalization ability, emer-

gent ability, and interactivity [58]. To realize visual reasoning

with the help of LLMs, prompting and fine-tuning techniques

can also be utilized: for example, PointClip V2 [282] applies

LLMs to generate 3D-specific prompts, which are encoded as

textual features and then combined with visual features for

3D recognition; and GPT4Tools [31] employs LoRA [240] to

fine-tune LLMs following tool-related instructions. Serving

as a controller [283], decision maker [286], or semantics refiner

[281, 287], LLMs significantly facilitate the progress of

visual reasoning research.

3.8. Summary and Discussion

3.8.1. Architecture

Due to the gigantic scale of LLMs, minor changes in archi-

tecture and training strategies have a big impact on performance

and stability. Here, we summarize key architectural modules

used in various LLMs, leading to better performance, reduced

training time and memory, and better training stability.

Layer Normalization: The performance and training stability

of LLMs are affected significantly by layer normalization.

Pre-norm, that is, normalizing inputs rather than outputs, is more

common among LLMs because it stabilizes training [6, 127, 108].

BLOOM [13] and AlexaTM [122] utilize an additional layer

normalization after the embedding layer to stabilize the training

of large-scale models, while the model’s zero-shot generaliza-

tion ability can be negatively impacted [13]. However, another

study [33] finds that pre-norm degrades fine-tuned model per-

formance as compared to post-norm, and there are no stability

benefits of pre-norm beyond the 100B scale. Therefore, GLM-

130B [33] used deep-norm which is a variant of post-norm for

better downstream task performance after fine-tuning.
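The difference between the normalization placements discussed above can be made concrete with a small sketch. It is illustrative only: the helper names are hypothetical, the sublayer is a toy function, learnable gain and bias are omitted, and the deep-norm form (up-weighting the residual by a constant `alpha` before a post-norm) is an assumption about the variant referred to in the text.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learnable scale/shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, sublayer):
    # Normalize the sublayer input; the residual path itself stays un-normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x, sublayer):
    # Normalize the output of the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def deep_norm_block(x, sublayer, alpha):
    # Post-norm variant: the residual is scaled by alpha before normalization.
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer(x))])

double = lambda v: [2.0 * t for t in v]  # toy sublayer
x = [1.0, 2.0, 3.0]
pre = pre_norm_block(x, double)
post = post_norm_block(x, double)
deep = deep_norm_block(x, double, alpha=2.0)
```

With pre-norm, the identity path from input to output is untouched, which is one common explanation for its training stability; with post-norm, every block output is re-normalized.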

Positional Encoding: Like other building blocks of the model,

positional encoding also affects the performance and training

stability of LLMs. BLOOM [13] finds ALiBi outperforms

learned and rotary positional encodings. Contrary to this,

GLM-130B [33] identifies rotary positional encoding as being

better than ALiBi. So, there is no conclusion in the literature

about positional encodings yet.

Parallel Attention: In this type of attention, feed-forward and

attention layers are parallel to each other rather than sequen-

tial in a transformer block. It has been shown to reduce train-

ing time by 15%. There is no evidence of performance drop

due to this change in the literature and it is used by the models

PaLM [15], GPT-NeoX [118], and CodeGen [130].
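A minimal sketch of the two block layouts, assuming a pre-norm transformer block. The function names and the toy `attn`, `mlp`, and `norm` callables are hypothetical and stand in for real layers.

```python
def sequential_block(x, attn, mlp, norm):
    # Standard layout: the feed-forward layer consumes the attention output.
    h = [a + b for a, b in zip(x, attn(norm(x)))]
    return [a + b for a, b in zip(h, mlp(norm(h)))]

def parallel_block(x, attn, mlp, norm):
    # Parallel layout: both branches read the same normalized input.
    n = norm(x)  # computed once and shared by both branches
    return [a + b + c for a, b, c in zip(x, attn(n), mlp(n))]

# Toy stand-ins (assumed) to make the data flow visible.
identity = lambda v: list(v)
toy_attn = lambda v: [0.5 * t for t in v]
toy_mlp = lambda v: [t + 1.0 for t in v]

x = [1.0, 2.0]
seq_out = sequential_block(x, toy_attn, toy_mlp, identity)
par_out = parallel_block(x, toy_attn, toy_mlp, identity)
```

Because the two branches of the parallel layout do not depend on each other, their input projections can be fused into a single matrix multiplication, which is a plausible source of the reported training speed-up.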

Multi-Query Attention: It has shared key and value attention

heads in a transformer block while query attention heads are

projected as usual. This reduces memory usage and speeds up

sampling in autoregressive decoding. No performance degrada-

tion has been observed with this change and it makes the train-

ing efficient allowing larger batch sizes. Multi-query attention

is used in [15, 132].
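The memory saving during autoregressive decoding comes from the key-value cache, whose size scales with the number of key/value heads. The arithmetic can be sketched as below; the function name `kv_cache_bytes` and the configuration values are hypothetical and do not correspond to any specific model in this survey.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # Two tensors (keys and values) are cached per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical configuration, chosen only for illustration (16-bit values).
cfg = dict(n_layers=32, head_dim=128, seq_len=2048, batch=1)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one K/V head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # a single shared K/V head
reduction = mha // mqa
```

Under these assumptions, the multi-head cache occupies 1 GiB per sequence, and sharing one key/value head across all query heads shrinks it by a factor equal to the number of heads.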

Mixture of Experts: This type of architecture enables eas-

ily scaling models to trillions of parameters [92, 91]. Only a

few experts are activated during the computation making them

compute-efficient. The performance of MoE models is better

than dense models for the same amount of data and requires less

computation during fine-tuning to achieve performance similar

to dense models as discussed in [91]. MoE architectures are

less prone to catastrophic forgetting, therefore are more suited

for continual learning [92]. Extracting smaller sub-models for

downstream tasks is possible without losing any performance,

making MoE architecture hardware-friendly [92].
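The statement that only a few experts are activated per computation can be illustrated with top-k routing for a single token. This is a generic sketch, not the routing used by any particular model cited above; `moe_forward`, the toy experts, and the router logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = [0.0] * len(x)
    for i in top:  # only k experts are evaluated; the rest cost nothing
        y = experts[i](x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out, top

# Four toy experts (assumed); expert s multiplies its input by s.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0, 1.0], [0.0, 2.0, 0.0, 2.0], experts, k=2)
```

The parameter count grows with the number of experts, while the per-token compute grows only with `k`, which is why such models can scale in size without a proportional increase in computation.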

Sparse vs Dense Activated: GPT-3 [6] uses sparse transformers [67],

whereas GLaM [91] and PanGu-Σ [92] use MoE [121]

architectures to lower computational costs and increase the

model size and capacity. According to the literature, sparse

modules do not degrade the model’s performance [67]. How-

ever, more experiments are required to verify this statement.

3.8.2. Training Strategies

Training models at a huge scale requires tricks to reduce training

costs, avoid loss divergence, and achieve better performance.

We summarize and discuss some of these key tricks

used in different LLMs.

Mixed Precision: It is a widely used method for LLMs to reduce

memory usage and improve training efficiency. In mixed pre-

cision, forward and backward passes are performed in FP16

format whereas optimizer states and master weights are kept

in FP32 format [120]. A drawback associated with this for-

mat change is training instability due to a smaller value range

resulting in loss spikes [33]. An alternative to FP16 is BF16

which has a comparatively larger range and performs precision-

sensitive operations like gradient accumulation and softmax in

FP32 [13]. BF16 has better performance and training stability

but uses more memory and is supported on specific hardware,

for example, A100 GPUs. Therefore, its adoption in LLMs is

limited.
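The range and precision trade-off between FP16 and BF16 can be checked with the standard library. In this sketch, FP16 uses the `e` (IEEE half precision) format of Python's `struct` module, while BF16 is simulated by keeping the upper 16 bits of the float32 encoding. The truncation is an assumption made for simplicity (hardware typically rounds), and the helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round-trip through IEEE half precision; out-of-range values overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")

def to_bf16(x):
    """Simulate bfloat16 by truncating the float32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

large = 70000.0  # exceeds the largest finite FP16 value (65504)
small_step = 1.001  # needs more mantissa bits than BF16 provides

fp16_large, bf16_large = to_fp16(large), to_bf16(large)
fp16_step, bf16_step = to_fp16(small_step), to_bf16(small_step)
```

FP16 overflows on the large value but resolves the small increment, whereas BF16 represents the large value approximately and loses the small increment. This is consistent with the text: the smaller range of FP16 is what causes instability, and precision-sensitive operations are kept in FP32 when BF16 is used.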

Training Instability: Loss divergence or spiking is a common

issue in LLMs that occurs multiple times during training. This

happens even in the presence of gradient clipping [15]. To mitigate

this problem, many approaches suggest restarting training from

an earlier checkpoint [15, 33, 91], skipping 200-500 earlier

data batches at the point of divergence in [15] and re-shuffling

batches in [91]. The embedding layer gradient shrink proves to

further stabilize the training as its gradient norm is significantly

larger than the other layers [33]. Another suggestion to improve

training stability for larger models is not to use biases in dense

and norm layers as in [15].
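Two of the stabilization measures mentioned above, gradient clipping and embedding-layer gradient shrink, amount to simple rescaling of gradients. The sketch below operates on plain lists; the function names, the clipping threshold of 1.0, and the shrink factor `alpha=0.1` are assumptions chosen for illustration rather than values taken from the cited works.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

def shrink_embedding_grad(embedding_grads, alpha=0.1):
    # Scale down only the embedding-layer gradient, whose norm tends to be
    # much larger than that of the other layers.
    return [alpha * g for g in embedding_grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
shrunk = shrink_embedding_grad([10.0, -20.0], alpha=0.1)
```

Clipping bounds the size of a single update but does not prevent divergence on its own, which is why checkpoint restarts and batch skipping are used in addition.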

Weight Initialization: It plays a significant role in model con-

vergence and training stability. GPT-NeoX [118] initializes

feed-forward layers before residuals with $\frac{2}{L\sqrt{d}}$ as in [143] and

other layers with the small initialization scheme [288]. This

avoids activations growing exponentially with increasing depth.

MT-NLG [117] found higher variance for weight initialization

leads to unstable training, hence validating small initialization

scheme [288]. Various models perform random weight initialization,

which can cause bad initialization; Galactica [138] suggests

a longer warmup to negate the effect.
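The two initialization scales can be compared numerically. The sketch assumes that the scheme of [143] uses a standard deviation of 2/(L·sqrt(d)), with L the number of layers and d the hidden size, and that the small initialization scheme of [288] uses sqrt(2/(5d)); both forms and the function names are assumptions stated here for illustration.

```python
import math

def residual_output_init_std(n_layers, d_model):
    # Assumed form 2 / (L * sqrt(d)): shrinks with depth so that the sum of
    # residual contributions does not grow with the number of layers.
    return 2.0 / (n_layers * math.sqrt(d_model))

def small_init_std(d_model):
    # Assumed form sqrt(2 / (5 * d)) for the remaining layers.
    return math.sqrt(2.0 / (5.0 * d_model))

# Hypothetical model shape, chosen so the results are easy to verify by hand.
std_residual = residual_output_init_std(n_layers=4, d_model=16)
std_other = small_init_std(d_model=10)
```

The depth-dependent term is what keeps activations from growing as more layers are stacked, which is the effect the paragraph above attributes to this scheme.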

Learning Rate: A suitable learning rate is important for sta-

ble training. It is suggested to use a lower value [13, 15, 124]

with warmup and decay (cosine or linear). Usually, the learn-

ing rate is within the range 1e−4 to 8e−4. Moreover, MT-NLG

(530B) [117] and GPT-NeoX (20B) [118] suggest interpolat-

ing learning rates based on the model size using the GPT-3 [6]

models ranging between 13B and 175B. This avoids tuning the

learning rate hyperparameter.
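A warmup-then-decay schedule of the kind recommended above can be written as a single function. The peak value of 3e-4 lies within the range quoted in the text; the warmup length, total step count, floor ratio, and the function name `lr_at` are assumptions for illustration.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

lr_start = lr_at(0)
lr_peak = lr_at(1999)
lr_end = lr_at(100_000)
```

Replacing the cosine term with `1.0 - progress` gives the linear-decay alternative mentioned in the text.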

Training Parallelism: 3D parallelism, a combination of data,

pipeline, and tensor parallelism, is the most utilized training

parallelism approach in LLMs [33, 15, 14, 13, 117, 115, 112].

In addition to 3D parallelism, BLOOM [13] uses the ZeRO

optimizer [37] to shard optimizer states. PanGu-α [108] and

PanGu-Σ [92] go beyond 3D parallelism and apply 5D paral-

lelism which additionally contains optimizer parallelism and

rematerialization.

Mode Switching: It adds task-related tokens at the beginning

of the text during training. These tokens refer to the natural

language understanding and natural language generation tasks

which are shown to improve downstream task performance

in [125, 124, 122]. During fine-tuning and inference, tokens

are appended based on the downstream tasks.

Controllable Text Generation: Generating credible and con-

trolled text from a pre-trained model is challenging. GPT-3 [6]

and other LLMs use in-context learning to control generated

text. While in-context learning helps in controlling the gener-

ated text, ERNIE 3.0 Titan [35] suggests using adversarial loss

to rank its generated text for credibility and soft prompts such as

genre, topic, keywords, sentiment, and length for better control

on generated text.

3.8.3. Supervised Models vs Generalized Models

Although generalized models are capable of performing di-

verse tasks with good performance they have not yet outper-

formed models trained in supervised settings. The supervised

trained models are still state-of-the-art in various NLP tasks by

a large margin as shown in [6, 15, 18].

3.8.4. Zero-Shot vs Few-Shot

LLMs perform well in zero-shot and few-shot settings. But

the performance difference between zero-shot and few-shot is

large for pre-trained models [6, 15], which has led to LLMs being

described as meta-learners [6]. Zero-shot evaluations of LLMs

underperform unsupervised methods in neural machine translation [6].

The literature shows that pre-training alone is not enough for good

zero-shot performance [15, 16]. To improve zero-shot performance, the

literature suggests using instruction fine-tuning that improves

the zero-shot performance significantly and outperforms base-

lines. Instruction fine-tuning has also been shown to improve

zero-shot generalization to unseen tasks. Another model, Flan-

PaLM [16], unlocks zero-shot reasoning with CoT training.

3.8.5. Encoder vs Decoder vs Encoder-Decoder

Traditionally, these architectures perform well for different

tasks, for example, encoder-only for NLU tasks, decoder-only

for NLG, and encoder-decoder for sequence-to-sequence modeling.

Encoder-only architectures are common among smaller models such

as BERT [7], RoBERTa [289], etc., whereas LLMs are either

decoder-only [6, 118, 13] or encoder-decoder [10, 11, 122].

While decoder-only models are good at NLG tasks, various

LLMs, PaLM [15], OPT [14], GPT-3 [6], BLOOM [13],

LLaMA [146], are decoder-only models with significant performance

gains on both NLU and NLG tasks. In contrast, T5 [10] and UL2 [125]

find that encoder-decoder models outperform decoder-only models.

In another study,

PaLM [15] finds increasing the size of decoder-only models

can reduce the performance gap between decoder-only and

encoder-decoder architectures.

Although decoder-only architectures have become a trend for

LLMs, many recently proposed approaches [125, 122] use

mode-switching tokens in text with encoder-decoder architec-

tures to enable task-specific modes. Similarly, CodeT5+ [34]

uses an encoder-decoder architecture with multiple training ob-

jectives for different tasks, activating the encoder, decoder, or

both according to the tasks. These variations in architecture

and training objectives allow a model to perform well in differ-

ent settings. Because of this dynamic configuration, the future

of LLMs may lie with encoder-decoder architectures.

4. Model Configurations

We provide different statistics of pre-trained and instruction-

tuned models in this section. This includes information such as

publication venue, license type, model creators, steps trained,

parallelism, etc., in Table 3 and Table 4. Architecture details

of pre-trained LLMs are available in Table 5. Providing these

details for instruction-tuned models is unnecessary because they

are pre-trained models fine-tuned on instruction datasets. Hence,

architectural details are the same as the baselines. Moreover,

optimization settings for various LLMs are available in Table 6

and Table 7. We do not include details on precision, warmup,

and weight decay in Table 7. These details are not as important

as others to mention for instruction-tuned models, and are not

provided by the papers.

5. Datasets and Evaluation

Generating training and evaluation datasets is expensive be-

cause of the large-scale data demand of LLMs. Hence, datasets

for training and benchmarking these models are topics of key

importance. A summary of datasets commonly used by LLMs

is provided next.

5.1. Training Datasets

The performance of LLMs largely depends on the training

data’s quality, size, and diversity. Preparing training datasets

of high quality at a large scale is laborious. Researchers have

suggested various pre-training and fine-tuning datasets to enhance

LLMs' capabilities. We summarize these efforts in Table 8.

While numerous training datasets are available in the

literature, we cover the most widely used ones in our summary.

5.2. Evaluation Datasets and Tasks

The evaluation of LLMs is important

in gauging their profi-

ciency and limitations. This process measures the model’s abil-

ity to comprehend, generate, and interact with human language

across a spectrum of tasks. Evaluating a language model (LM)

is divided into two broader categories: 1) natural language un-

derstanding (NLU) and 2) natural language generation (NLG).

It is emphasized that tasks in NLU and NLG are softly catego-

rized and are often used interchangeably in the literature.

Natural Language Understanding: This task measures the lan-

guage understanding capacity of LMs. It encompasses multiple

tasks, including sentiment analysis, text classification, natural

language inference (NLI), question answering (QA), common-

sense reasoning (CR), mathematical reasoning (MR), reading

comprehension (RC), etc.

Natural Language Generation: This task assesses the language

generation capabilities of LLMs by understanding the provided

input context. It includes tasks such as summarization, sen-

tence completion, machine translation (MT), dialogue genera-

tion, etc.

Numerous datasets are proposed for each task, evaluating

LLMs against different characteristics. To provide an overview

of evaluation datasets, we briefly discuss a few well-known datasets

within each category and offer a comprehensive list of datasets

in Table 9. Moreover, we show a detailed overview of the train-

ing datasets and evaluation tasks and benchmarks used by vari-

ous pre-trained LLMs in Table 10 and fine-tuned LLMs in Ta-

ble 11. We also compare the top-performing LLMs in various

NLP tasks in Table 12.

5.2.1. Multi-task

MMLU [297]: A benchmark that measures the knowledge

acquired by models during pretraining and evaluates models in

zero-shot and few-shot settings across 57 subjects, testing both

world knowledge and problem-solving ability.

SuperGLUE [2]: A more challenging and diverse successor

to the GLUE [299] benchmark, SuperGLUE includes a variety

of language understanding tasks, such as question answering,

natural language inference, and co-reference resolution. It is

designed to provide a rigorous test of language understanding

and requires significant progress in areas like sample-efficient,

transfer, multi-task, and unsupervised or self-supervised learn-

ing.

BIG-bench [298]: The BIG-bench (Beyond the Imitation Game

Benchmark) is a large-scale benchmark designed

to test the abilities of LLMs across a wide range of

tasks, including reasoning, creativity, ethics, and understanding

of specific domains.

GLUE [299]: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. It includes a variety of tasks that test a wide range of linguistic phenomena, making it a comprehensive tool for evaluating language understanding in AI.

Table 3: Summary of pre-trained LLMs (>10B). Only the LLMs discussed individually in the previous sections are summarized. “Data/Tokens” is the model’s pre-training data, which is either the number of tokens or data size. “Data Cleaning” indicates whether data cleaning is performed or not. This includes heuristics (Heur), deduplication (Dedup), quality filtering (QF), and privacy filtering (PF). “Cost” is the calculated training cost obtained by multiplying the GPUs/TPUs hourly rate with the number of GPUs and the training time. The actual cost may vary due to many reasons such as using in-house GPUs or getting a discounted rate, re-training, number of employees working on the problem, etc. “Training Parallelism” indicates distributed training using data parallelism (D), tensor parallelism (T), pipeline parallelism (P), model parallelism (M), optimizer parallelism (OP), and rematerialization (R). In the “Library” column, “DS” is a short form for DeepSpeed. In the “Commercial Use” column, we assumed a model is for non-commercial purposes if its license is unavailable.

| Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Steps Trained | Data/Tokens | Data Cleaning | No. of Processing Units | Processing Unit Type | Training Time | Calculated Train. Cost | Training Parallelism | Library |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T5 [10] | JMLR'20 | Apache-2.0 | Google | General | 11B | ✓ | 1M | 1T | Heur+Dedup | 1024 | TPU v3 | - | - | D+M | Mesh TensorFlow |
| GPT-3 [6] | NeurIPS'20 | - | OpenAI | General | 175B | × | - | 300B | Dedup+QF | - | V100 | - | - | M | - |
| mT5 [11] | NAACL'21 | Apache-2.0 | Google | General | 13B | ✓ | 1M | 1T | - | - | - | - | - | - | - |
| PanGu-α [108] | arXiv'21 | Apache-2.0 | Huawei | General | 200B | ✓ | 260k | 1.1TB | Heur+Dedup | 2048 | Ascend 910 | - | - | D+OP+P+O+R | MindSpore |
| CPM-2 [12] | AI Open'21 | MIT | Tsinghua | General | 198B | ✓ | 1M | 2.6TB | Dedup | - | - | - | - | D+M | JAXFormer |
| Codex [131] | arXiv'21 | - | OpenAI | Coding | 12B | × | - | 100B | Heur | - | - | - | - | - | - |
| ERNIE 3.0 [110] | arXiv'21 | - | Baidu | General | 10B | × | 120k∗ | 375B | Heur+Dedup | 384 | V100 | - | - | M∗ | PaddlePaddle |
| Jurassic-1 [112] | White-Paper'21 | Apache-2.0 | AI21 | General | 178B | ✓ | - | 300B | - | 800 | GPU | - | - | D+M+P | Megatron+DS |
| HyperCLOVA [114] | EMNLP'21 | - | Naver | General | 82B | × | - | 300B | Clf+Dedup+PF | 1024 | A100 | 321h | 1.32 Mil | M | Megatron |
| Yuan 1.0 [115] | arXiv'21 | Apache-2.0 | - | General | 245B | ✓ | 26k∗ | 180B | Heur+Clf+Dedup | 2128 | GPU | - | - | D+T+P | - |
| Gopher [116] | arXiv'21 | - | Google | General | 280B | × | - | 300B | QF+Dedup | 4096 | TPU v3 | 920h | 13.19 Mil | D+M | JAX+Haiku |
| ERNIE 3.0 Titan [35] | arXiv'21 | - | Baidu | General | 260B | × | - | 300B | Heur+Dedup | - | Ascend 910 | - | - | D+M+P+D* | PaddlePaddle |
| GPT-NeoX-20B [118] | BigScience'22 | Apache-2.0 | EleutherAI | General | 20B | ✓ | 150k | 825GB | None | 96 | 40G A100 | - | - | M | Megatron+DS+PyTorch |
| OPT [14] | arXiv'22 | MIT | Meta | General | 175B | ✓ | 150k | 180B | Dedup | 992 | 80G A100 | - | - | D+T | Megatron |
| BLOOM [13] | arXiv'22 | RAIL-1.0 | BigScience | General | 176B | ✓ | - | 366B | Dedup+PR | 384 | 80G A100 | 2520h | 3.87 Mil | D+T+P | Megatron+DS |
| Galactica [138] | arXiv'22 | Apache-2.0 | Meta | Science | 120B | × | 225k | 106B | Dedup | 128 | 80GB A100 | - | - | - | Metaseq |
| GLaM [91] | ICML'22 | - | Google | General | 1.2T | × | 600k∗ | 600B | Clf | 1024 | TPU v4 | - | - | M | GSPMD |
| LaMDA [140] | arXiv'22 | - | Google | Dialog | 137B | × | 3M | 2.81T | Filtered | 1024 | TPU v3 | 1384h | 4.96 Mil | D+M | Lingvo |
| MT-NLG [117] | arXiv'22 | Apache-v2.0 | MS.+Nvidia | General | 530B | × | - | 270B | - | 4480 | 80G A100 | - | - | D+T+P | Megatron+DS |
| AlphaCode [132] | Science'22 | Apache-v2.0 | Google | Coding | 41B | ✓ | 205k | 967B | Heur+Dedup | - | TPU v4 | - | - | M | JAX+Haiku |
| Chinchilla [96] | arXiv'22 | - | Google | General | 70B | × | - | 1.4T | QF+Dedup | - | TPU v4 | - | - | - | JAX+Haiku |
| PaLM [15] | arXiv'22 | - | Google | General | 540B | × | 255k | 780B | Heur | 6144 | TPU v4 | - | - | D+M | JAX+T5X |
| AlexaTM [122] | arXiv'22 | Apache v2.0 | Amazon | General | 20B | × | 500k | 1.1T | Filtered | 128 | A100 | 2880h | 1.47 Mil | M | DS |
| U-PaLM [124] | arXiv'22 | - | Google | General | 540B | × | 20k | - | - | 512 | TPU v4 | 120h | 0.25 Mil | - | - |
| UL2 [125] | ICLR'23 | Apache-2.0 | Google | General | 20B | ✓ | 2M | 1T | - | 512 | TPU v4 | - | - | M | JAX+T5X |
| GLM [33] | ICLR'23 | Apache-2.0 | Multiple | General | 130B | × | - | 400B | - | 768 | 40G A100 | 1440h | 3.37 Mil | M | - |
| CodeGen [130] | ICLR'23 | Apache-2.0 | Salesforce | Coding | 16B | ✓ | 650k | 577B | Heur+Dedup | - | TPU v4 | - | - | D+M | JAXFormer |
| LLaMA [127] | arXiv'23 | - | Meta | General | 65B | × | 350k | 1.4T | Clf+Heur+Dedup | 2048 | 80G A100 | 504h | 4.12 Mil | D+M | xFormers |
| PanGu-Σ [92] | arXiv'23 | - | Huawei | General | 1.085T | × | - | 329B | - | 512 | Ascend 910 | 2400h | - | D+OP+P+O+R | MindSpore |
| BloombergGPT [141] | arXiv'23 | - | Bloomberg | Finance | 50B | × | 139k | 569B | Dedup | 512 | 40G A100 | 1272h | 1.97 Mil | M | PyTorch |
| Xuan Yuan 2.0 [142] | arXiv'23 | RAIL-1.0 | Du Xiaoman | Finance | 176B | ✓ | - | 366B | Filtered | | 80GB A100 | - | - | P | DS |
| CodeT5+ [34] | arXiv'23 | BSD-3 | Salesforce | Coding | 16B | ✓ | 110k | 51.5B | Dedup | 16 | 40G A100 | - | - | - | DS |
| StarCoder [137] | arXiv'23 | OpenRAIL-M | BigCode | Coding | 15.5B | ✓ | 250k | 1T | Dedup+QF+PF | 512 | 80G A100 | 624h | 1.28 Mil | D+T+P | Megatron-LM |
| LLaMA-2 [21] | arXiv'23 | LLaMA-2.0 | Meta | General | 70B | ✓ | 500k | 2T | Minimal Filtering | - | 80G A100 | 1.7Mh | - | - | - |
| PaLM-2 [123] | arXiv'23 | - | Google | General | - | × | - | - | Dedup+PF+QF | - | - | - | - | - | - |

Table 4: Summary of instruction-tuned LLMs (>10B). All abbreviations are the same as Table 3. Entries in “Data/Tokens” starting with “S-” represent the number of training samples.

| Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Pre-trained Models | Steps Trained | Data/Tokens | No. of Processing Units | Processing Unit Type | Train. Time | Calculated Train. Cost | Train. Parallelism | Library |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WebGPT [156] | arXiv'21 | - | OpenAI | General | 175B | × | GPT-3 | - | - | - | - | - | - | - | - |
| T0 [17] | ICLR'22 | Apache-2.0 | BigScience | General | 11B | ✓ | T5 | - | 250B | 512 | TPU v3 | 270h | 0.48 Mil | - | - |
| Tk-Instruct [18] | EMNLP'22 | MIT | AI2+ | General | 11B | ✓ | T5 | 1000 | - | 256 | TPU v3 | 4h | 0.0036 Mil | - | Google T5 |
| OPT-IML [97] | arXiv'22 | - | Meta | General | 175B | × | OPT | 8k | 2B | 128 | 40G A100 | - | - | D+T | Megatron |
| Flan-U-PaLM [16] | ICLR'22 | Apache-2.0 | Google | General | 540B | ✓ | U-PaLM | 30k | - | 512 | TPU v4 | - | - | - | JAX+T5X |
| mT0 [144] | ACL'23 | Apache-2.0 | HuggingFace+ | General | 13B | ✓ | mT5 | - | - | - | - | - | - | - | - |
| Sparrow [157] | arXiv'22 | - | Google | Dialog | 70B | × | Chinchilla | - | - | 64 | TPU v3 | - | - | M | - |
| WizardCoder [154] | arXiv'23 | Apache-2.0 | HK Bapt. | Coding | 15B | × | StarCoder | 200 | S-78k | - | - | - | - | - | - |
| Alpaca [148] | Github'23 | Apache-2.0 | Stanford | General | 13B | ✓ | LLaMA | 3-Epoch | S-52k | 8 | 80G A100 | 3h | 600 | FSDP | PyTorch |
| Vicuna [149] | Github'23 | Apache-2.0 | LMSYS | General | 13B | ✓ | LLaMA | 3-Epoch | S-125k | - | - | - | - | FSDP | PyTorch |
| LIMA [175] | arXiv'23 | - | Meta+ | General | 65B | - | LLaMA | 15-Epoch | S-1000 | - | - | - | - | - | - |
| Koala [290] | Github'23 | Apache-2.0 | UC-Berkeley | General | 13B | × | LLaMA | 2-Epoch | S-472k | 8 | A100 | 6h | 100 | - | JAX/FLAX |
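The “Calculated Train. Cost” column of Table 3 is defined as the hourly rate multiplied by the number of processing units and the training time. A short sketch of that arithmetic is given below. The unit counts and training hours are taken from the BLOOM and LLaMA rows of Table 3, whereas the hourly rate of 4.0 USD per unit is an assumption: the rate is not stated in this section, and this value was chosen only because it approximately reproduces the tabulated costs.

```python
def training_cost_usd(n_units, hours, hourly_rate):
    # Cost = hourly rate x number of processing units x training time.
    return n_units * hours * hourly_rate

ASSUMED_RATE = 4.0  # USD per unit-hour; hypothetical, not given in the text

bloom_cost = training_cost_usd(384, 2520, ASSUMED_RATE)   # Table 3 lists 3.87 Mil
llama_cost = training_cost_usd(2048, 504, ASSUMED_RATE)   # Table 3 lists 4.12 Mil
```

As the caption of Table 3 notes, the actual cost may differ considerably, for instance when in-house hardware or discounted rates are used.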

5.2.2. Language Understanding

WinoGrande [344]: A large-scale dataset inspired by the original

Winograd Schema Challenge [347]; it tests models on their

ability to resolve pronoun ambiguity and encourages the devel-

opment of models that understand the broad context in natural

language text.

CoQA [306]: A conversational question-answering dataset,

CoQA challenges models with questions that rely on conver-

sation history and require free-form text answers. Its diverse

content from seven domains makes it a rigorous test for mod-

els’ ability to handle a wide range of topics and conversational

contexts.

WiC [307]: This dataset assesses a model’s ability to dis-

cern word meanings based on context, aiding in tasks related

to Word Sense Disambiguation.

Wikitext103 [308]: With over 100 million tokens from

Wikipedia’s top articles, this dataset is a rich resource for tasks

that require understanding long-term dependencies, such as lan-

guage modeling and translation.

PG19 [309]: This is a digital library of diverse books from Project Gutenberg. It is specifically designed to facilitate research in unsupervised learning and language modeling, with a special focus on long-form content.

Table 5: Architecture details of LLMs. Here, “PE” is the positional embedding, “nL” is the number of layers, “nH” is the number of attention heads, “HS” is the size of hidden states.

| Models | Type | Training Objective | Attention | Vocab | Tokenizer | Norm | PE | Activation | Bias | nL | nH | HS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T5 (11B) | Enc-Dec | Span Corruption | Standard | 32k | SentencePiece | Pre-RMS | Relative | ReLU | × | 24 | 128 | 1024 |
| GPT-3 (175B) | Causal-Dec | Next Token | Dense+Sparse | - | - | Layer | Learned | GeLU | ✓ | 96 | 96 | 12288 |
| mT5 (13B) | Enc-Dec | Span Corruption | Standard | 250k | SentencePiece | Pre-RMS | Relative | ReLU | - | - | - | - |
| PanGu-α (200B) | Causal-Dec | Next Token | Standard | 40k | BPE | Layer | - | - | - | 64 | 128 | 16384 |
| CPM-2 (198B) | Enc-Dec | Span Corruption | Standard | 250k | SentencePiece | Pre-RMS | Relative | ReLU | - | 24 | 64 | - |
| Codex (12B) | Causal-Dec | Next Token | Standard | - | BPE+ | Pre-Layer | Learned | GeLU | - | 96 | 96 | 12288 |
| ERNIE 3.0 (10B) | Causal-Dec | Next Token | Standard | - | WordPiece | Post-Layer | Relative | GeLU | - | 48 | 64 | 4096 |
| Jurassic-1 (178B) | Causal-Dec | Next Token | Standard | 256k | SentencePiece∗ | Pre-Layer | Learned | GeLU | ✓ | 76 | 96 | 13824 |
| HyperCLOVA (82B) | Causal-Dec | Next Token | Dense+Sparse | - | BPE* | Pre-Layer | Learned | GeLU | - | 64 | 80 | 10240 |
| Yuan 1.0 (245B) | Causal-Dec | Next Token | Standard | - | - | - | - | - | - | 76 | - | 16384 |
| Gopher (280B) | Causal-Dec | Next Token | Standard | 32k | SentencePiece | Pre-RMS | Relative | GeLU | ✓ | 80 | 128 | 16384 |
| ERNIE 3.0 Titan (260B) | Causal-Dec | Next Token | Standard | - | WordPiece | Post-Layer | Relative | GeLU | - | 48 | 192 | 12288 |
| GPT-NeoX-20B | Causal-Dec | Next Token | Parallel | 50k | BPE | Layer | Rotary | GeLU | ✓ | 44 | 64 | - |
| OPT (175B) | Causal-Dec | Next Token | Standard | - | BPE | - | - | ReLU | ✓ | 96 | 96 | - |
| BLOOM (176B) | Causal-Dec | Next Token | Standard | 250k | BPE | Layer | ALiBi | GeLU | ✓ | 70 | 112 | 14336 |
| Galactica (120B) | Causal-Dec | Next Token | Standard | 50k | BPE+custom | Layer | Learned | GeLU | × | 96 | 80 | 10240 |
| GLaM (1.2T) | MoE-Dec | Next Token | Standard | 256k | SentencePiece | Layer | Relative | GeLU | ✓ | 64 | 128 | 32768 |
| LaMDA (137B) | Causal-Dec | Next Token | Standard | 32k | BPE | Layer | Relative | GeGLU | - | 64 | 128 | 8192 |
| MT-NLG (530B) | Causal-Dec | Next Token | Standard | 50k | BPE | Pre-Layer | Learned | GeLU | ✓ | 105 | 128 | 20480 |
| AlphaCode (41B) | Enc-Dec | Next Token | Multi-query | 8k | SentencePiece | - | - | - | - | 64 | 128 | 6144 |
| Chinchilla (70B) | Causal-Dec | Next Token | Standard | 32k | SentencePiece-NFKC | Pre-RMS | Relative | GeLU | ✓ | 80 | 64 | 8192 |
| PaLM (540B) | Causal-Dec | Next Token | Parallel+Multi-query | 256k | SentencePiece | Layer | RoPE | SwiGLU | × | 118 | 48 | 18432 |
| AlexaTM (20B) | Enc-Dec | Denoising | Standard | 150k | SentencePiece | Pre-Layer | Learned | GeLU | ✓ | 78 | 32 | 4096 |
| Sparrow (70B) | Causal-Dec | Pref.&Rule RM | - | 32k | SentencePiece-NFKC | Pre-RMS | Relative | GeLU | ✓ | 16∗ | 64 | 8192 |
| U-PaLM (540B) | Non-Causal-Dec | MoD | Parallel+Multi-query | 256k | SentencePiece | Layer | RoPE | SwiGLU | × | 118 | 48 | 18432 |
| UL2 (20B) | Enc-Dec | MoD | Standard | 32k | SentencePiece | - | - | - | - | 64 | 16 | 4096 |
| GLM (130B) | Non-Causal-Dec | AR Blank Infilling | Standard | 130k | SentencePiece | Deep | RoPE | GeGLU | ✓ | 70 | 96 | 12288 |
| CodeGen (16B) | Causal-Dec | Next Token | Parallel | - | BPE | Layer | RoPE | - | - | 34 | 24 | - |
| LLaMA (65B) | Causal-Dec | Next Token | Standard | 32k | BPE | Pre-RMS | RoPE | SwiGLU | - | 80 | 64 | 8192 |
| PanGu-Σ (1085B) | Causal-Dec | Next Token | Standard | - | BPE | Fused Layer | - | FastGeLU | - | 40 | 40 | 5120 |
| BloombergGPT (50B) | Causal-Dec | Next Token | Standard | 131k | Unigram | Layer | ALiBi | GeLU | ✓ | 70 | 40 | 7680 |
| Xuan Yuan 2.0 (176B) | Causal-Dec | Next Token | Self | 250k | BPE | Layer | ALiBi | GeLU | ✓ | 70 | 112 | 14336 |
| CodeT5+ (16B) | Enc-Dec | SC+NT+Cont.+Match | Standard | - | Code-Specific | - | - | - | - | - | - | - |
| StarCoder (15.5B) | Causal-Dec | FIM | Multi-query | 49k | BPE | - | Learned | - | - | 40 | 48 | 6144 |
| LLaMA-2 (70B) | Causal-Dec | Next Token | Grouped-query | 32k | BPE | Pre-RMS | RoPE | SwiGLU | - | - | - | - |
| PaLM-2 | - | MoD | Parallel | - | - | - | - | - | - | - | - | - |

C4 [10]: A cleaned English dataset, C4 offers billions of tokens

from web-crawled data. It is a comprehensive resource for

training advanced Transformer models.

LCQMC [310]: The Large-scale Chinese Question Matching

Corpus (LCQMC) is a dataset for evaluating the performance

of models in semantic matching tasks. It contains pairs of ques-

tions in Chinese and their matching status, making it a valuable

resource for research in Chinese language understanding.

5.2.3. Story Cloze and Sentence Completion

StoryCloze [324]: It introduces a new “StoryCloze Test”, a

commonsense reasoning framework for evaluating story under-

standing, generation, and script learning. It considers a model’s

ability to understand and generate coherent and sensible stories.

LAMBADA [325]: This dataset evaluates contextual text un-

derstanding through a word prediction task. Models must pre-

dict the last word of a passage, which is easy for humans when

given the whole passage, but not when given only the last sen-

tence.

5.2.4. Physical Knowledge and World Understanding

PIQA [330]: A dataset that probes the physical knowledge of

models, aiming to understand how well they are learning about

the real world.

TriviaQA [331]: A dataset that tests models on reading com-

prehension and open domain question answering (QA) tasks,

with a focus on Information Retrieval (IR)-style QA.

ARC [332]: A larger version of the ARC-Challenge, this

dataset contains both easy and challenging grade-school level,

multiple-choice science questions. It is a comprehensive test of

a model’s ability to understand and answer complex questions.

ARC-Easy [332]: A subset of the ARC dataset, ARC-

Easy, contains questions that are answered correctly by either

a retrieval-based algorithm or a word co-occurrence algorithm.

It is a great starting point for models beginning to explore ad-

vanced question-answering.

ARC-Challenge [332]: A rigorous question-answering

dataset, ARC-Challenge includes complex, grade-school level

questions that demand reasoning beyond simple retrieval, test-

ing the true comprehension capabilities of models.

5.2.5. Contextual Language Understanding

RACE [337]: The RACE dataset is a reading comprehension dataset collected from English examinations in China, which benchmarks AI models for understanding and answering questions on long and complex passages, simulating the challenge of a real-world examination.

RACE-Middle [337]: Another subset of the RACE [337]

dataset, RACE-Middle, contains middle school-level English

exam questions. It offers a slightly less challenging but academ-

ically oriented evaluation of a model’s comprehension skills.

RACE-High [337]: A subset of the RACE [337] dataset,

RACE-High consists of high school-level English exam ques-

tions. It is designed to evaluate the comprehension ability of

models in a more academic and challenging context.


Table 6: Summary of optimization settings used for pre-trained LLMs. The values for weight decay, gradient clipping, and dropout are 0.1, 1.0, and 0.1, respectively,

for most of the LLMs.

Columns: Models; Batch Size; Sequence Length; LR; Warmup; LR Decay; Optimizers (AdaFactor, Adam, AdamW); Precision (FP16, BF16, Mixed); Weight Decay; Grad Clip; Dropout

T5 (11B) 2^11 512 0.01 × inverse square root ✓ - - - - - ✓

GPT3 (175B) 32K - 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -

mT5 (13B) 1024 1024 0.01 - inverse square root ✓ - - - - - ✓

PanGu-α (200B) - 1024 2e-5 - - - - - - ✓ - - - -

CPM-2 (198B) 1024 1024 0.001 - - ✓ - - - - - ✓

Codex (12B) - - 6e-5 ✓ cosine ✓ ✓ ✓ - -

ERNIE 3.0 (12B) 6144 512 1e-4 ✓ linear ✓ - - - ✓ - -

Jurassic-1 (178B) 3.2M 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -

HyperCLOVA (82B) 1024 - 6e-5 - cosine ✓ - - - ✓ - -

Yuan 1.0 (245B) <10M 2048 1.6e-4 ✓ cosine decay to 10% ✓ - - - ✓ - -

Gopher (280B) 3M 2048 4e-5 ✓ cosine decay to 10% ✓ ✓ - ✓ -

ERNIE 3.0 Titan (260B) - 512 1e-4 ✓ linear ✓ ✓ ✓ ✓ -

GPT-NeoX-20B 1538 2048 0.97e-5 ✓ cosine ✓ ✓ ✓ ✓ ×

OPT (175B) 2M 2048 1.2e-4 - linear ✓ ✓ ✓ ✓ ✓

BLOOM (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×

Galactica (120B) 2M 2048 7e-6 ✓ linear decay to 10% ✓ - - - ✓ ✓ ✓

GLaM (1.2T) 1M 1024 0.01 - inverse square root ✓ FP32 + ✓ - ✓ ×

LaMDA (137B) 256K - - - - - - - - - - - - -

MT-NLG (530B) 1920 2048 5e-5 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -

AlphaCode (41B) 2048 1536+768 1e-4 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -

Chinchilla (70B) 1.5M 2048 1e-4 ✓ cosine decay to 10% ✓ ✓ - - -

PaLM (540B) 2048 2048 0.01 - inverse square root ✓ - - - ✓ ✓ ×

AlexaTM (20B) 2M 1024 1e-4 - linear decay to 5% ✓ ✓ ✓ - ✓

U-PaLM (540B) 32 2048 1e-4 - cosine ✓ - - - - - -

UL2 (20B) 1024 1024 - - inverse square root - - - - - - × - -

GLM (130B) 4224 2048 8e-5 ✓ cosine ✓ ✓ ✓ ✓ ✓

CodeGen (16B) 2M 2048 5e-5 ✓ cosine ✓ - - - ✓ ✓ -

LLaMA (65B) 4M Tokens 2048 1.5e-4 ✓ cosine decay to 10% ✓ - - - ✓ ✓ -

PanGu-Σ (1.085T) 512 1024 2e-5 ✓ - ✓ ✓ - - -

BloombergGPT (50B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×

Xuan Yuan 2.0 (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -

CodeT5+ (16B) 2048 1024 2e-4 - linear ✓ ✓ ✓ - -

StarCoder (15.5B) 512 8k 3e-4 ✓ cosine ✓ ✓ ✓ - -

LLaMA-2 (70B) 4M Tokens 4k 1.5e-4 ✓ cosine ✓ ✓ ✓ ✓ -

Table 7: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are the same as the pre-trained models, while

no model uses weight decay for instruction tuning.

Sequence Optimizers Grad

Models Batch Size Length LR Warmup LR_Decay AdaFactor Adam AdamW Clip Dropout

WebGPT (175B) BC:512, RM:32 - 6e-5 - - ✓ - -

T0 (11B) 1024 1280 1e-3 - - ✓ - ✓

Tk-Instruct (11B) 1024 - 1e-5 - constant - - - - -

OPT-IML (175B) 128 2048 5e-5 × linear ✓ ✓ ✓

Flan-U-PaLM (540B) 32 - 1e-3 - constant ✓ - ✓

Sparrow (70B) RM: 8+16, RL:16 - 2e-6 ✓ cosine decay to 10% ✓ ✓ ×

WizardCoder (15B) 512 2048 2e-5 ✓ cosine - - - - -

Alpaca (13B) 128 512 1e-5 ✓ cosine - - ✓ ✓ ×

Vicuna (13B) 128 -2048 2e-5 ✓ cosine ✓ - ×

LIMA (65B) 32 2048 1e-5 × linear ✓ - ✓

QuAC [338]: This dataset simulates an information-seeking

dialog between students and teachers using hidden Wikipedia

text. It introduces unique challenges not found in machine com-

prehension datasets, making it a valuable resource for advanc-

ing dialog systems.

5.2.6. Commonsense Reasoning

HellaSwag [345]: A dataset that challenges models to pick the best ending to a context. It uses Adversarial Filtering to create a ‘Goldilocks’ zone of complexity, where generated text is absurd to humans but often misclassified by models.

COPA [391]: This dataset evaluates a model’s progress in

open-domain commonsense causal reasoning. Each question

comprises a premise and two alternatives, and the model must

select the more plausible alternative, testing a model’s ability to

understand and reason about cause and effect.

WSC [347]: The Winograd Schema Challenge (WSC) is a

reading comprehension task in which a system must resolve

references in a text, often requiring world knowledge and rea-

soning about the text.

CSQA [348]: CommonsenseQA is a question-answering dataset whose questions require commonsense knowledge, evaluating the ability of AI models to understand and answer such questions.


Table 8: Details of various well-known pre-training and fine-tuning datasets. Here, alignment means aligning with human preferences.

Dataset | Type | Size/Samples | Tasks | Source | Creation | Comments
C4 [10] | Pretrain | 806GB | - | Common Crawl | Automated | A clean, English-language dataset with billions of tokens
mC4 [11] | Pretrain | 38.49TB | - | Common Crawl | Automated | A multilingual extension of the C4 dataset, mC4 identifies over 100 languages using cld3 from 71 monthly web scrapes of Common Crawl.
PILE [291] | Pretrain | 825GB | - | Common Crawl, PubMed Central, OpenWebText2, ArXiv, GitHub, Books3, and others | Automated | A massive dataset comprised of 22 constituent sub-datasets
ROOTs [292] | Pretrain | 1.61TB | - | 498 Hugging Face datasets | Automated | 46 natural and 13 programming languages
MassiveText [116] | Pretrain | 10.5TB | - | MassiveWeb, Books, News, Wikipedia, Github, C4 | Automated | 99% of the data is in English
Wikipedia [293] | Pretrain | - | - | Wikipedia | Automated | Dump of Wikipedia
RedPajama [294] | Pretrain | 5TB | - | CommonCrawl, C4, Wikipedia, Github, Books, StackExchange | Automated | Open-source replica of LLaMA dataset
PushShift.io Reddit | Pretrain | 21.1GB | - | Reddit | Automated | Submissions and comments on Reddit from 2005 to 2019
BigPython [130] | Pretrain | 5.5TB | Coding | GitHub | Automated | -
Pool of Prompt (P3) [17] | Instructions | 12M | 62 | PromptSource | Manual | A subset of PromptSource, created from 177 datasets including summarization, QA, classification, etc.
xP3 [144] | Instructions | 81M | 71 | P3+Multilingual datasets | Manual | Extending P3 to total 46 languages
Super-NaturalInstructions (SNI) [18] | Instructions | 12.4M | 1616 | Multiple datasets | Manual | Extending P3 with additional multilingual datasets, total 46 languages
Flan [16] | Instructions | 15M | 1836 | Muffin+T0-SF+NIV2 | Manual | Total 60 languages
OPT-IML [97] | Instructions | 18.1M | 1667 | - | Manual | -
Self-Instruct [19] | Instructions | 82k | 175 | - | Automated | Generated 52k instructions with 82k samples from 175 seed tasks using GPT-3
Alpaca [148] | Instructions | 52k | - | - | Automated | Employed self-instruct method to generate data from text-davinci-003
Vicuna [149] | Instructions | 125k | - | ShareGPT | Automated | Conversations shared by users on ShareGPT using public APIs
LLaMA-GPT-4 [150] | Instructions | 52k | - | Alpaca | Automated | Recreated Alpaca dataset with GPT-4 in English and Chinese
Unnatural Instructions [295] | Instructions | 68k | - | 15-Seeds (SNI) | Automated | -
LIMA [175] | Instructions | 1k | - | Multiple datasets | Manual | Carefully created samples to test performance with fine-tuning on less data
Anthropic-HH-RLHF [296] | Alignment | 142k | - | - | Manual |
Anthropic-HH-RLHF-2 [168] | Alignment | 39k | - | - | Manual |

5.2.7. Reading Comprehension

BoolQ [353]: A dataset derived from Google search queries,

BoolQ challenges models to answer binary (yes/no) questions.

The questions are naturally occurring and are paired with a

paragraph from a Wikipedia article containing the answer. It

is a test of reading comprehension and reasoning.

SQuADv2 [354]: The Stanford Question Answering Dataset (SQuAD) [352] is a collection of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuADv2 combines the original SQuAD1.1 dataset with over 50,000 unanswerable questions. The aim is to evaluate a model’s ability to understand and answer questions based on a given context and to determine when a question is unanswerable.

DROP [355]: DROP, or Discrete Reasoning Over the con-

tent of Paragraphs, is designed to test a model’s ability to un-

derstand a wide variety of reading phenomena. It encourages

comprehensive and reliable evaluation of reading comprehen-

sion capabilities.

RTE [356]: The Recognizing Textual Entailment (RTE)

datasets come from a series of annual competitions on textual

entailment, predicting whether a given sentence logically fol-

lows from another and evaluating a model’s understanding of

logical relationships in a text.

WebQA [357]: A dataset for open-domain question answering,

WebQA offers a large collection of web-based question-answer

pairs. It is designed to assess the ability of AI models to under-

stand and answer questions based on web content.

CMRC2018 [359]: This dataset is a test of Chinese language

models’ ability to reason comprehensively and is designed with

a challenging span-extraction format that pushes the boundaries

of machine performance.

5.2.8. Mathematical Reasoning

MATH [372]: This dataset is a platform for evaluating the

mathematical problem-solving abilities of AI models. It con-

tains a diverse set of math problems, ranging from arithmetic

to calculus, and is designed to test the model’s ability to under-

stand and solve complex mathematical problems.

Math23k [373]: This one challenges a model’s ability to un-

derstand and solve mathematical word problems. It contains

23,000 Chinese arithmetic word problems that require models

to perform reasoning and computation based on the problem description.

Table 9: Categorized evaluation datasets used in evaluating LLMs.

Type | Datasets/Benchmarks
Multi-Task | MMLU [297], SuperGLUE [2], BIG-bench [298], GLUE [299], BBH [298], CUGE [300], ZeroCLUE [301], FewCLUE [302], Blended Skill Talk [303], HELM [304], KLUE-STS [305]
Language Understanding | CoQA [306], WiC [307], Wikitext103 [308], PG19 [309], LCQMC [310], QQP [311], WinoGender [312], CB [313], FinRE [314], SanWen [315], AFQMC [301], BQ Corpus [316], CNSS [317], CKBQA 13 [318], CLUENER [301], Weibo [319], AQuA [320], OntoNotes [321], HeadQA [322], Twitter Dataset [323]
Story Cloze and Sentence Completion | StoryCloze [324], LAMBADA [325], LCSTS [326], AdGen [327], E2E [328], CHID [329], CHID-FC [302]
Physical Knowledge and World Understanding | PIQA [330], TriviaQA [331], ARC [332], ARC-Easy [332], ARC-Challenge [332], PROST [333], OpenBookQA [334], WebNLG [335], DogWhistle Insider & Outsider [336]
Contextual Language Understanding | RACE [337], RACE-Middle [337], RACE-High [337], QuAC [338], StrategyQA [339], Quiz Bowl [340], cMedQA [341], cMedQA2 [342], MATINF-QA [343]
Commonsense Reasoning | WinoGrande [344], HellaSwag [345], COPA [346], WSC [347], CSQA [348], SIQA [349], C3 [350], CLUEWSC2020 [301], CLUEWSC [301], CLUEWSC-FC [302], ReCoRD [351]
Reading Comprehension | SQuAD [352], BoolQ [353], SQuADv2 [354], DROP [355], RTE [356], WebQA [357], CMRC2017 [358], CMRC2018 [359], CMRC2019 [360], COTE-BD [361], COTE-DP [361], COTE-MFW [361], MultiRC [362], Natural Questions [363], CNSE [317], DRCD [364], DuReader [365], DuReader_robust [366], DuReader-QG [365], SciQ [367], Sogou-log [368], DuReader_robust-QG [366], QA4MRE [369], KorQuAD 1.0 [370], CAIL2018-Task1 & Task2 [371]
Mathematical Reasoning | MATH [372], Math23k [373], GSM8K [374], MathQA [375], MGSM [376], MultiArith [377], ASDiv [378], MAWPS [379], SVAMP [380]
Problem Solving | HumanEval [131], DS-1000 [381], MBPP [382], APPS [372], CodeContests [132]
Natural Language Inference & Logical Reasoning | ANLI [383], MNLI-m [384], MNLI-mm [384], QNLI [352], WNLI [347], OCNLI [301], CMNLI [301], ANLI R1 [383], ANLI R2 [383], ANLI R3 [383], HANS [385], OCNLI-FC [302], LogiQA [386], StrategyQA [339]
Cross-Lingual Understanding | MLQA [387], XNLI [388], PAWS-X [389], XSum [390], XCOPA [391], XWinograd [392], TyDiQA-GoldP [393], MLSum [394]
Truthfulness and Fact Checking | TruthfulQA [395], MultiFC [396], Fact Checking on Fever [397]
Biases and Ethics in AI | ETHOS [398], StereoSet [399], BBQ [400], Winobias [401], CrowS-Pairs [402]
Toxicity | RealToxicityPrompts [403], CivilComments toxicity classification [404]
Language Translation | WMT [405], WMT20 [406], WMT20-enzh [406], EPRSTMT [302], CCPM [407]
Scientific Knowledge | AminoProbe [138], BioLAMA [138], Chemical Reactions [138], Galaxy Clusters [138], Mineral Groups [138]
Dialogue | Wizard of Wikipedia [408], Empathetic Dialogues [409], DPC-generated [96] dialogues, ConvAI2 [410], KdConv [411]
Topic Classification | TNEWS-FC [302], YNAT [305], KLUE-TC [305], CSL [301], CSL-FC [302], IFLYTEK [412]

GSM8K [374]: A dataset of diverse grade school math word

problems, testing a model’s ability to perform multi-step math-

ematical reasoning.

5.2.9. Problem Solving and Logical Reasoning

ANLI [383]: A large-scale dataset designed to test the robustness of machine learning models in Natural Language Inference (NLI). It is created through an iterative, adversarial process where humans try to generate examples that models cannot correctly classify.

HumanEval [131]: A dataset for evaluating the problem-solving ability of AI models through code generation. It consists of hand-written programming problems, each specified by a function signature and docstring and checked with unit tests, so a generated solution counts as correct only when it passes the tests.

StrategyQA [339]: A question-answering dataset that re-

quires reasoning over multiple pieces of evidence to evaluate

the strategic reasoning ability of AI models, pushing the bound-

aries of what machines can understand and answer.

5.2.10. Cross-Lingual Understanding

XNLI [388]: A cross-lingual benchmark, XNLI extends the

MultiNLI [419] corpus to 15 languages, including low-resource

ones like Urdu. It tests models on cross-lingual sentence under-

standing, with 112,500 annotated pairs across three categories:

entailment, contradiction, and neutral.

PAWS-X [389]: PAWS-X, or Cross-lingual Paraphrase Adver-

saries from Word Scrambling, is a multilingual version of the

PAWS [420] dataset for paraphrase identification. It includes

examples in seven languages and is designed to evaluate the

performance of cross-lingual paraphrase identification models.

5.2.11. Truthfulness

Truthful-QA [395]: A unique benchmark that measures a

language model’s truthfulness when generating answers. The

dataset includes questions across various categories like health,

law, and politics, some designed to test the model against com-

mon human misconceptions.


Table 10: An illustration of training datasets and evaluation tasks employed by pre-trained LLMs. Here, “QA” is question-answering, “Clf” is classification, “NLI”

is natural language inference, “MT” is machine translation, “RC” is reading comprehension, “CR” is commonsense reasoning, “MR” is mathematical reasoning,

“Mem.” is memorization.

Columns: Models; Training Dataset; Benchmark (BIG-bench, MMLU, SuperGLUE, QA, Clf, NLI, MT, Cloze/Completion, RC, CR, MR, Coding, Truthful/Bias/Toxicity/Mem.)

T5 C4 [10] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

GPT-3 Common Crawl, WebText, Books Cor-

pora, Wikipedia

✓ ✓ ✓ ✓ ✓ ✓

mT5 mC4 [11] ✓ ✓ ✓

PanGu-α 1.1TB Chinese Text Corpus ✓ ✓ ✓ ✓ ✓

CPM-2 WuDaoCorpus [109] ✓ ✓

Codex 54 million public repositories from Github ✓

ERNIE-3.0 Chinese text corpora, Baidu Search, Web

text, QA-long, QA-short, Poetry and Cou-

plet Domain-specific data from medical,

law, and financial area Baidu knowledge

graph with more than 50 million facts

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Jurassic-1 Wikipedia, OWT, Books, C4, Pile [291],

arXiv, GitHub

✓ ✓ ✓ ✓

HyperCLOVA Korean blogs, Community sites, News,

KiN Korean Wikipedia, Wikipedia (En-

glish and Japanese), Modu-Corpus: Mes-

senger, News, Spoken and written lan-

guage corpus, Web corpus

Yuan 1.0 Common Crawl, SogouT, Sogou News,

Baidu Baike, Wikipedia, Books

✓ ✓ ✓ ✓

Gopher subsets of MassiveWeb Books, C4, News,

GitHub and Wikipedia samples from Mas-

siveText

✓ ✓ ✓ ✓ ✓ ✓ ✓

ERNIE-3.0 TITAN Same as ERNIE 3.0 and ERNIE 3.0 adversarial dataset, ERNIE 3.0 controllable dataset

✓ ✓ ✓ ✓ ✓

GPT-NeoX-20B Pile [291] ✓ ✓ ✓ ✓ ✓ ✓

OPT RoBERTa [289], Pile [291], PushShift.io

Reddit [413]

✓ ✓ ✓ ✓

BLOOM ROOTs [13] ✓ ✓ ✓ ✓ ✓ ✓

Galactica arXiv, PMC, Semantic Scholar, Wikipedia,

StackExchange, LibreText, Open Text-

books, RefSeq Genome, OEIS, LIPID

MAPS, NASAExoplanet, Common Crawl,

ScientificCC, AcademicCC, GitHub repos-

itories Khan Problems, GSM8K, OneS-

mallStep

✓ ✓ ✓ ✓ ✓

GLaM Filtered Webpages, Social media conversa-

tions Wikipedia, Forums, Books, News

✓ ✓ ✓ ✓ ✓

LaMDA Infiniset : Public documents, Dialogs, Ut-

terances

MT-NLG Two snapshots of Common Crawl and

Books3, OpenWebText2, Stack Exchange,

PubMed Abstracts, Wikipedia, PG-19

[242], BookCorpus2, NIH ExPorter, Pile,

CC-Stories, RealNews

✓ ✓ ✓ ✓ ✓

AlphaCode Selected GitHub repositories, CodeCon-

tests: Codeforces, Description2Code, Co-

deNet

Chinchilla MassiveWeb, MassiveText Books, C4,

News, GitHub, Wikipedia

✓ ✓ ✓ ✓ ✓ ✓

PaLM webpages, books, Wikipedia, news, arti-

cles, source code, social media conversa-

tions

✓ ✓ ✓ ✓ ✓ ✓

AlexaTM Wikipedia, mC4 ✓ ✓ ✓ ✓ ✓

U-PaLM Same as PaLM ✓ ✓ ✓ ✓ ✓ ✓ ✓

UL2 - ✓ ✓ ✓ ✓ ✓ ✓

GLM-130B - ✓ ✓ ✓

CodeGen Pile, BigQuery, BigPython ✓

LLaMA CommonCrawl, C4, Github, Wikipedia,

Books, arXiv, StackExchange

✓ ✓ ✓ ✓ ✓ ✓ ✓

PanGu-Σ WuDaoCorpora, CLUE, Pile, C4, Python

code

✓ ✓ ✓ ✓ ✓ ✓

BloombergGPT inPile, Pile, C4, Wikipedia ✓ ✓ ✓ ✓ ✓ ✓ ✓

CodeT5+ CodeSearchNet, Github Code ✓ ✓

StarCoder The Stack v1.2 ✓ ✓ ✓ ✓

LLaMA-2 ✓ ✓ ✓ ✓ ✓ ✓ ✓

PaLM-2 Web documents, Code, Books, Maths,

Conversation

✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓


Table 11: An illustration of training datasets and evaluation benchmarks used in fine-tuned LLMs. “SNI” is short for Super-NaturalInstructions.

Columns: Models; Training Dataset; BIG-bench; MMLU; BBH; RAFT; FLAN; SNI; PromptSource; TyDiQA; HumanEval; MBPP; Truthful/Bias/Toxicity

T0 Pool of Prompts ✓

WebGPT ELI5 [414], ELI5 fact-

check [156], TriviaQA [331],

ARC-Challenge [332], ARC-

Easy [332], Hand-written data,

Demonstrations of humans, Com-

parisons between model-generated

answers

Tk-INSTRUCT SNI [18] ✓

mT0 xP3 [144]

OPT-IML PromptSource [17], FLAN [16],

SNI [415], UnifiedSKG [416],

CrossFit [417], ExMix [418],

T5 [10], Reasoning

✓ ✓ ✓ ✓ ✓ ✓

Flan Muffin, T0-SF, NIv2, CoT ✓ ✓ ✓

WizardCoder Code Alpaca ✓ ✓

5.2.12. Biases and Ethics in AI

ETHOS [398]: ETHOS is a hate speech detection dataset

built from YouTube and Reddit comments. It is a tool in the

fight against online hate speech, offering binary and multi-label

variants for robust content moderation.

StereoSet [399]: StereoSet is a comprehensive dataset de-

signed to measure and evaluate the presence of stereotypical

biases in language models. It focuses on four key domains:

gender, profession, race, and religion. Contrasting stereotypi-

cal bias against language modeling ability provides a valuable

tool for understanding and mitigating biases in large language

models.

6. Applications

Applying Large Language Models (LLMs) to a variety of

downstream tasks has become a popular trend in both AI-

related research communities and industries, with many emerg-

ing uses being discovered and explored daily. LLMs, which are

capable of understanding and generating human-like text, have

found meaningful applications across a variety of fields. This

section provides an overview of LLM applications in medicine,

education, science, mathematics, law, finance, robotics, and

coding. While each of these domains poses different challenges,

LLMs open up opportunities to make significant contributions

to these domains through their generalizability.

General Purpose: LLMs are being widely considered as

general-purpose tools for a wide variety of tasks [421]. This

is due to their inherent ability to understand, generate, and

manipulate human-like text in a contextually relevant man-

ner. This allows them to perform tasks ranging from simple

language translation and question-answering to more complex

tasks like summarization, text generation, and even program-

ming help [422]. The utility of LLMs is further enhanced by

their ability to adapt to the specific style and tone of the text

they are processing, making the outputs more user-friendly and

context-aware. In everyday applications, LLMs can be used as

personal assistants, helping users draft emails or schedule ap-

pointments [423]; they can also be deployed in customer ser-

vice to handle common questions; or applied to generate con-

tent for digital platforms like websites, by creating human-like

text based on given prompts [424]. Moreover, LLMs play a cru-

cial role in data analysis, where they can filter large volumes of

text data, summarize key points, and find patterns that would

take humans much longer to identify [425]. Despite their wide-

ranging applications, it is essential to remember that LLMs,

similar to any AI system, are only as good as the data they have

been trained on.

Medicine: The application of LLMs in the field of medicine is

reshaping healthcare delivery and research. For example, LLMs

are increasingly used in clinical decision support systems to

provide physicians with evidence-based treatment recommen-

dations [426, 427, 428]. By analyzing patient data and medical

literature, they can help identify potential diagnoses, suggest

appropriate tests, and recommend optimal treatment strategies.

Moreover, LLMs can also enhance patient interactions with

healthcare systems; e.g., they can be used in chatbot applica-

tions [429, 430, 431] to answer patient queries about symptoms

or medications, schedule appointments, and even provide es-

sential health advice. For medical research, LLMs are used to

extract and filter information from a considerable amount of

medical literature, identify relevant studies, summarize find-

ings, and even predict future research trends [432, 433, 434].

For medical education, LLMs can help create training mate-

rials, generate exam questions, provide detailed explanations

of complex medical topics, and offer personalized feedback to

students [435, 436, 437, 438]. They can also simulate patient

interactions, enabling students to practice and improve their

clinical skills. At a broader level, LLMs can assist in public

health initiatives by analyzing media data to detect disease out-

breaks, monitor public sentiment towards health policies, and

disseminate health information in a clear and understandable

manner [439]. When LLMs are employed to support public health initiatives, related issues such as data privacy, the necessity for explainability, and the potential risk of propagating biases must also be addressed [440, 441].

Education: The integration of LLMs into the educational sec-

tor offers opportunities to enhance learning experiences, teacher


Table 12: Performance comparison of top performing LLMs across various NLU and NLG tasks. Here, “N-shots” indicates the number of example prompts provided to the model during the evaluation, representing its capability in few-shot or zero-shot learning settings, “f” represents the fine-tuned version, and “B” represents the benchmark.

Task Dataset/Benchmark Top-1 Top-2 Top-3

Model (Size) Score (N-shots) Model (Size) Score (N-shots) Model (Size) Score (N-shots)

Multi-Task BIG-bench (B) Chinchilla (70B) 65.1 (5-shot) Gopher (280B) 53.97 (5-shot) PaLM (540B) 53.7 (5-shot)

MMLU (B) GPT-4 (-) 86.4 (5-shot) Gemini (Ultra) 83.7 (5-shot) Flan-PaLM-2( f ) (Large) 81.2 (5-shot)

Language Understanding SuperGLUE (B) ERNIE 3.0 (12B) 90.6 (-) PaLM( f ) (540B) 90.4 (-) T5 (11B) 88.9 (-)

Story Comprehension and

Generation

HellaSwag GPT-4 (-) 95.3 (10-shot) Gemini (Ultra) 87.8 (10-shot) PaLM-2 (Large) 86.8 (one shot)

StoryCloze GPT3 (175B) 87.7 (few shot) PaLM-2 (Large) 87.4 (one shot) OPT (175B) 79.82 (-)

Physical Knowledge and

World Understanding

PIQA PaLM-2 (Large) 85.0 (one shot) LLaMa (65B) 82.8 (zero shot) MT-NLG (530B) 81.99 (zero shot)

TriviaQA PaLM-2 (Large) 86.1 (one shot) LLaMA-2 (70B) 85.0 (one shot) PaLM (540B) 81.4 (one shot)

Contextual Language


polation [46, 47, 48, 49] among others are some of the methods

widely studied for efficient LLM utilization.

Due to the success of LLMs on a wide variety of tasks, the

research literature has recently experienced a large influx of

LLM-related contributions. Researchers have organized the LLM literature in general surveys [50, 51, 52, 53] and in topic-specific surveys [54, 55, 56, 57, 58]. In contrast to these surveys, our

contribution focuses on providing a comprehensive yet concise

overview of the general direction of LLM research. This arti-

cle summarizes architectural and training details of pre-trained

LLMs and delves deeper into the details of concepts like fine-

tuning, multi-modal LLMs, augmented LLMs, datasets, eval-

uation, applications, challenges, and others to provide a self-

contained comprehensive overview. Our key contributions are

summarized as follows.

• We present a survey on the developments in LLM research

providing a concise comprehensive overview of the direc-

tion.

• We present extensive summaries of pre-trained models that include fine-grained architecture and training details.

• We summarize major findings of the popular contributions

and provide a detailed discussion on the key design and

development aspects of LLMs to help practitioners effec-

tively leverage this technology.

• In this self-contained article, we cover a range of concepts to present the general direction of LLMs comprehensively, including background, pre-training, fine-tuning, multi-modal LLMs, augmented LLMs, LLMs-powered agents, datasets, evaluation, etc.

Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications 7. Challenges

We loosely follow the existing terminology to ensure a stan-

dardized outlook of this research direction. For instance, fol-

lowing [50], our survey discusses pre-trained LLMs with 10B

parameters or more. We refer the readers interested in smaller

pre-trained models to [51, 52, 53].

The organization of this paper is as follows. Section 2 discusses

the background of LLMs. Section 3 focuses on LLMs overview,

architectures, training pipelines and strategies, fine-tuning, and

utilization in different domains. Section 4 highlights the config-

uration and parameters that play a crucial role in the function-

ing of these models. Summary and discussions are presented

in section 3.8. The LLM training and evaluation, datasets, and

benchmarks are discussed in section 5, followed by challenges

and future directions, and conclusion in sections 7 and 8, re-

spectively.


2. Background

We provide the relevant background to understand the fun-

damentals related to LLMs in this section. We briefly discuss

necessary components in LLMs and refer the readers interested

in details to the original works.

2.1. Tokenization

Tokenization [59] is an essential pre-processing step in

LLM training that parses the text into non-decomposable units

called tokens. Tokens can be characters, subwords [60], sym-

bols [61], or words, depending on the tokenization process.

Some of the commonly used tokenization schemes in LLMs

include wordpiece [62], byte pair encoding (BPE) [61], and un-

igramLM [60]. Readers are encouraged to refer to [63] for a

detailed survey.
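To make the idea of subword tokens concrete, the following is a minimal sketch of how byte pair encoding style merges can be learned from a toy word list. It is an illustration under simplifying assumptions, not the tokenizer of any particular LLM: the function names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, words are split into characters rather than bytes, and ties between equally frequent pairs are broken lexicographically.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(encode("lowest", merges))
```

With this toy corpus the frequent fragment "low" becomes a single token, while rarer suffixes stay split into characters, which is the trade-off subword schemes aim for.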

2.2. Encoding Positions

The transformer processes input sequences in parallel and

independently of each other. Moreover, the attention mod-

ule in the transformer does not capture positional information.

As a result, positional encodings were introduced in trans-

former [64], where a positional embedding vector is added to

the token embedding. Variants of positional embedding include

absolute, relative, or learned positional encodings. Within relative encoding, ALiBi and RoPE are two widely used positional embeddings in LLMs.
ALiBi [65]: It subtracts a scalar bias from the attention score

that increases with the distance between token positions. This

favors using recent tokens for attention.
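The distance-dependent bias described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the per-head slope schedule (a geometric sequence starting at 2^(-8/n) for n heads) comes from the ALiBi paper for power-of-two head counts and is not stated in this survey, and a causal (decoder-style) mask is assumed.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=3, slope=slopes[0])
for row in bias:
    print(row)
```

The bias grows in magnitude linearly with the query-key distance, so after the softmax nearby tokens receive more weight than distant ones.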

RoPE [66]: It rotates query and key representations by an angle proportional to the token’s absolute position in the input sequence, resulting in a relative positional encoding scheme which decays with the distance between the tokens.
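A small sketch may help show why rotating by the absolute position yields a relative encoding: the dot product of a rotated query and a rotated key depends only on the offset between their positions. The helper names `rope` and `dot` are hypothetical, the frequency base of 10000 is the commonly used default and an assumption here, and the vectors are plain Python lists of even length.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        # Each pair uses its own frequency; the angle grows linearly with position.
        theta = position * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (2) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(score_a, score_b)
```

Because a rotation preserves vector length, the scheme changes only the angle between query and key, never their norms.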

2.3. Attention in LLMs

Attention assigns weights to input tokens based on impor-

tance so that the model gives more emphasis to relevant tokens.

Attention in transformers [64] calculates query, key, and value

mappings for input sequences, where the attention score is

obtained by multiplying the query and key, and later used to

weight values. We discuss different attention strategies used in

LLMs below.

Self-Attention [64]: Calculates attention using queries, keys,

and values from the same block (encoder or decoder).
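The query-key-value computation can be sketched in a few lines. This is a minimal, dependency-free illustration: the scaling by the square root of the key dimension and the softmax follow the standard scaled dot-product formulation of [64], while the learned projection matrices that normally produce queries, keys, and values are omitted (an assumption made to keep the example short). The names `softmax` and `attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score each key against the query, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Self-attention: queries, keys, and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0]]
out = attention(x, x, x)
print(out)
```

Passing the same sequence for all three arguments gives self-attention; passing queries from one sequence and key-value pairs from another gives the cross-attention setting with the same function.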

Cross Attention: It is used in encoder-decoder architectures, where the queries come from the decoder, and the key-value pairs come from the encoder outputs.

Sparse Attention [67]: Self-attention has O(n^2) time complexity, which becomes infeasible for long sequences. To speed up the computation, sparse attention [67] restricts each token to a subset of positions, for example by calculating attention in sliding windows.
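As a rough illustration of the saving, the sketch below builds a causal sliding-window mask and counts how many query-key pairs need to be scored. A fixed local window is only one simple sparsity pattern and is not necessarily the exact pattern used in [67]; the function names and the window size are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` tokens)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def count_scored_pairs(mask):
    """Number of query-key pairs for which an attention score is computed."""
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(seq_len=6, window=2)
local_pairs = count_scored_pairs(mask)
causal_pairs = count_scored_pairs(sliding_window_mask(seq_len=6, window=6))
print(local_pairs, causal_pairs)
```

With a fixed window the number of scored pairs grows linearly with the sequence length instead of quadratically.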

Flash Attention [68]: Memory access is the major bottleneck

in calculating attention using GPUs. To speed up, flash

attention employs input tiling to minimize the memory reads

and writes between the GPU high bandwidth memory (HBM)

and the on-chip SRAM.

2.4. Activation Functions

The activation functions serve a crucial role in the curve-

fitting abilities of neural networks [69]. We discuss activation

functions used in LLMs in this section.

ReLU [70]: The Rectified linear unit (ReLU) is defined as:

ReLU(x) = max(0, x) (1)

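GeLU [71]: The Gaussian Error Linear Unit (GeLU) combines properties of ReLU, dropout [72], and zoneout [73]. It weights the input by the standard Gaussian cumulative distribution function Φ, i.e., GeLU(x) = xΦ(x).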

GLU variants [74]: The Gated Linear Unit [75] is a neural

network layer that is an element-wise product (⊗) of a linear

transformation and a sigmoid transformed (σ) linear projection

of the input given as:

GLU(x,W,V, b, c) = (xW + b) ⊗ σ(xV + c), (2)

where x is the input of the layer and W, b, V, and c are learned
parameters. Other GLU variants [74] used in LLMs are:

ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),

GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),

SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).
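The activation functions and gated units in this section can be written compactly as follows. This is a minimal sketch with random placeholder parameters. It assumes the standard definitions GELU(x) = x·Φ(x), with Φ the standard normal CDF, and Swish_β(x) = x·σ(βx), and it uses β = 1 by default.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def glu(x, W, V, b, c):
    # Eq. (2): the sigmoid gate is applied to the second projection.
    return (x @ W + b) * sigmoid(x @ V + c)

def gated_variant(activation):
    # ReGLU / GEGLU / SwiGLU: activation on the first projection, linear second projection.
    def unit(x, W, V, b, c):
        return activation(x @ W + b) * (x @ V + c)
    return unit

reglu, geglu, swiglu = gated_variant(relu), gated_variant(gelu), gated_variant(swish)

# Placeholder input and parameters, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, c = np.zeros(3), np.zeros(3)
outputs = {name: fn(x, W, V, b, c) for name, fn in
           [("GLU", glu), ("ReGLU", reglu), ("GEGLU", geglu), ("SwiGLU", swiglu)]}
```

All four units map an input of width 4 to an output of width 3 here; they differ only in which nonlinearity gates the element-wise product.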

2.5. Layer Normalization

Layer normalization leads to faster convergence and is an in-

tegrated component of transformers [64]. In addition to Layer-

Norm [76] and RMSNorm [77], LLMs use pre-layer normal-

ization [78], applying it before multi-head attention (MHA).

Pre-norm is shown to provide training stability in LLMs. An-

other normalization variant, DeepNorm [79] fixes the issue with

larger gradients in pre-norm.
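The difference between LayerNorm and RMSNorm, and between post-norm and pre-norm placement, can be sketched as below. This is a simplified illustration: `sublayer` stands in for attention or a feed-forward block, the epsilon value is an assumed default, and DeepNorm is not covered.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Center and scale each token vector, then apply learned gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # Scale by the root mean square only; no centering and no bias.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

def post_norm_block(x, sublayer, norm):
    # Original transformer ordering: normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Pre-layer normalization: normalize the sublayer input, keep the residual path untouched.
    return x + sublayer(norm(x))

# Placeholder activations with non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 16))
gain, bias = np.ones(16), np.zeros(16)
ln = layer_norm(x, gain, bias)
rn = rms_norm(x, gain)
```

In the pre-norm block the residual path carries `x` through unchanged, which is commonly given as the intuition for its training stability.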

2.6. Distributed LLM Training

This section describes distributed LLM training approaches

briefly. More details are available in [13, 37, 80, 81].

Data Parallelism: Data parallelism replicates the model on
multiple devices, and the data in each batch is divided across
them. At the end of each training iteration, weights are
synchronized across all devices.
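A single-process simulation can illustrate why replicas stay in sync. In practice, synchronization is commonly realized by averaging per-device gradients (an all-reduce) before the weight update. The linear model, loss, and device count below are illustrative assumptions, and `np.mean` stands in for the communication step.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)^2) with respect to w."""
    return x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)
num_devices = 4                                   # hypothetical device count

# Each "device" holds a full copy of w and computes a gradient on its shard of the batch.
shard_grads = [mse_gradient(w, xs, ys)
               for xs, ys in zip(np.split(x, num_devices), np.split(y, num_devices))]

# Averaging the shard gradients stands in for an all-reduce across devices.
synced_grad = np.mean(shard_grads, axis=0)
full_batch_grad = mse_gradient(w, x, y)
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so every replica applies the same update.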

Tensor Parallelism: Tensor parallelism shards a tensor compu-

tation across devices. It is also known as horizontal parallelism

or intra-layer model parallelism.
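As a sketch of how one layer's computation can be sharded, a matrix multiplication can be split either by the columns of the weight matrix, with outputs concatenated, or by its rows, with partial outputs summed. The device count and shapes are placeholders, and list entries stand in for per-device results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))          # activations
w = rng.normal(size=(6, 4))          # weight matrix of one linear layer
num_devices = 2                      # hypothetical device count

# Column parallelism: each device holds a slice of the output columns.
col_parts = [x @ w_shard for w_shard in np.split(w, num_devices, axis=1)]
y_col = np.concatenate(col_parts, axis=1)          # gather along the feature axis

# Row parallelism: each device holds a slice of the input rows of w and
# the matching slice of the activation features; partial results are summed.
row_parts = [x_shard @ w_shard
             for x_shard, w_shard in zip(np.split(x, num_devices, axis=1),
                                         np.split(w, num_devices, axis=0))]
y_row = sum(row_parts)                             # stands in for an all-reduce

y_full = x @ w
```

Both sharded results reproduce the unsharded product, while each device stores only a fraction of `w`.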

Pipeline Parallelism: Pipeline parallelism shards model layers

across different devices. This is also known as vertical paral-

lelism.
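A toy sketch of the layer partitioning follows. The helper names are hypothetical, and the loop runs stages sequentially in one process; real pipeline schedules overlap micro-batches across devices, which is not modeled here.

```python
def partition_layers(layers, num_stages):
    """Split a list of layers into contiguous, near-equal stages."""
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

def run_pipeline(stages, micro_batches):
    """Pass each micro-batch through the stages in order."""
    outputs = []
    for h in micro_batches:
        for stage in stages:          # each stage would live on its own device
            for layer in stage:
                h = layer(h)
        outputs.append(h)
    return outputs

# Four toy "layers" acting on numbers, split across two stages.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
stages = partition_layers(layers, num_stages=2)
outputs = run_pipeline(stages, micro_batches=[1, 2])
```

Only the activations at stage boundaries would need to cross devices, which is what distinguishes this scheme from sharding within a layer.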

Model Parallelism: A combination of tensor and pipeline par-

allelism is known as model parallelism.

3D Parallelism: A combination of data, tensor, and pipeline
parallelism is known as 3D parallelism.

Optimizer Parallelism: Optimizer parallelism, also known as
the zero redundancy optimizer (ZeRO) [37], implements optimizer
state partitioning, gradient partitioning, and parameter partitioning
across devices to reduce memory consumption while keeping
the communication costs as low as possible.
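The memory effect of the three partitioning stages can be estimated with simple arithmetic. The sketch below assumes the byte accounting used for mixed-precision Adam in [37]: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states. It ignores activations and buffers, and the function name and the 7.5B/64-device setting are illustrative.

```python
def zero_memory_gb(num_params, num_devices, stage):
    """Approximate per-device model-state memory in GB under mixed-precision Adam.

    stage 0: no partitioning; 1: optimizer states; 2: + gradients; 3: + parameters.
    Byte counts per parameter are assumptions taken from the accounting in [37].
    """
    params, grads, opt_states = 2.0, 2.0, 12.0
    if stage >= 1:
        opt_states /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return num_params * (params + grads + opt_states) / 1e9

# Illustrative setting: a 7.5B-parameter model on 64 devices.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the per-device footprint falls from 120 GB without partitioning to about 1.9 GB when optimizer states, gradients, and parameters are all partitioned.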

2.7. Libraries

Some commonly used libraries for LLMs training are:

Transformers [82]: The library provides access to various pre-

trained transformer models

| Understanding | LAMBADA | PaLM (540B) 89.7 (few shot) | MT-NLG (530B) 87.15 (few shot) | PaLM-2 (Large) 86.9 (one shot) |
| Commonsense Reasoning | WinoGrande | GPT-4 (-) 87.5 (5-shot) | PaLM-2 (Large) 83.0 (one shot) | PaLM (540B) 81.1 (zero shot) |
| | SIQA | LLaMA (65B) 52.3 (zero shot) | Chinchilla (70B) 51.3 (zero shot) | Gopher (280B) 50.6 (zero shot) |
| Reading Comprehension | BoolQ | PaLM(f) (540B) 92.2 (-) | T5 (11B) 91.2 (-) | PaLM-2 (Large) 90.9 (one shot) |
| Truthfulness | Truthful-QA | LLaMA (65B) 57 (-) | | |
| Mathematical Reasoning | MATH | Gemini (Ultra) 53.2 (4-shot) | PaLM-2 (Large) 34.3 (4-shot) | LLaMa-2 (65B) 13.5 (4-shot) |
| | GSM8K | GPT-4 (-) 92.0 (5-shot) | PaLM-2 (Large) 80.7 (8-shot) | U-PaLM (540B) 58.5 (-) |
| Problem Solving and Logical Reasoning | HumanEval | Gemini(f) (Ultra) 74.4 (zero shot) | GPT-4 (-) 67.0 (zero shot) | Code Llama (34B) 48.8 (zero shot) |

support, and educational content development. For students, by

analyzing their learning styles, performance, and preferences,

LLMs can provide customized study materials and practice

questions to develop personalized learning experiences [442].

For teachers, LLMs can help create lesson plans, grade
assignments, and generate diverse and inclusive educational
content, freeing up significant time for teaching and student
interaction [443, 444]. In language learning, LLMs serve as
advanced conversational partners capable of simulating conversations
in multiple languages, correcting grammar, enhancing
vocabulary, and aiding pronunciation to build practical
fluency [445]. Furthermore, LLMs improve accessibility
in education by providing support for students with disabilities.
They can generate real-time transcriptions for the hearing
impaired, offer reading assistance for the visually impaired,
and simplify complex texts for those with learning disabilities [441].
As LLMs continue to evolve, their applications in
education can benefit a growing range of students and teachers.

Science: Similar to medical applications, LLMs can expedite

the research process by quickly analyzing and summarizing scientific
literature. By producing comprehensible and accessible research
summaries, LLMs can assist researchers in staying up-to-date
with the latest findings, even in fields outside their area

of expertise [446, 447]. In addition, LLMs can aid scientists

in formulating new hypotheses and research questions since

their ability to process large-scale datasets allows them to un-

veil insights that might not be immediately apparent to human

researchers [448]. Moreover, for scientific writing, LLMs can

help researchers draft documents, suggest improvements, and

ensure adherence to specific formatting guidelines [449, 450].

This not only saves time but also improves the clarity of scien-

tific communication, enabling interdisciplinary teams to work

together more effectively.

Maths: In addition to providing mathematical research and

education support, LLMs can assist in solving mathematical

problems by giving step-by-step explanations and guiding users

through complex proofs and calculations. They can help iden-

tify errors in reasoning or computation and suggest corrections,

serving as an invaluable tool for both learning and verification

purposes [451, 452]. LLMs can be employed to check the valid-

ity of mathematical proofs, offering a preliminary filter before

human review. While they are not a substitute for the meticu-

lous work of mathematicians, they can help simplify the process

of proof verification [453, 454]. Moreover, LLMs enhance accessibility
to mathematics by translating complex concepts and
findings into understandable language for non-specialists [455],
helping to bridge the gap between theoretical mathematics and
applied contexts such as physics, engineering, and economics.

Law: LLMs can assist with the thematic analysis of legal doc-

uments, including generating initial coding for datasets, iden-

tifying themes, and classifying data according to these themes.

This collaborative effort between legal experts and LLMs has

proved to be effective in analyzing legal texts such as court

opinions on theft, improving both the efficiency and quality of

the research [456]. Additionally, LLMs have been evaluated for

their ability to generate explanations of legal terms, focusing

on improving factual accuracy and relevance by incorporating

sentences from case law. By feeding relevant case law into the

LLM, the augmented models can generate higher-quality expla-

nations with less factually incorrect information [457]. More-

over, LLMs can be trained with specialized domain knowledge

to perform legal reasoning tasks [458] and answer legal ques-

tions [459].

Finance: LLMs like BloombergGPT [141], trained on exten-

sive proprietary financial datasets, exhibit superior performance

on financial tasks. This indicates the value of domain-specific

training in creating LLMs that can more accurately understand

and process industry-specific language and concepts. The intro-

duction of FinGPT [460] as an open-source model offers trans-

parent and accessible resources to develop novel applications

such as robo-advising, algorithmic trading, and low-code so-

lutions, ultimately expanding the capabilities of financial ser-

vices. Both BloombergGPT and FinGPT show the adaptabil-

ity of LLMs to the financial domain, with the former showing

the power of custom datasets and the latter emphasizing a data-

centric approach and low-rank adaptation techniques for cus-

tomization. Moreover, LLMs demonstrate an ability to break

down complex financial tasks into actionable plans, enabling

end-to-end solutions that were previously unfeasible with a sin-

gle model [461].

Robotics: In robotics research, LLMs have promising appli-

cations, such as enhancing human-robot interaction [28, 462,

463, 464], task planning [227], motion planning [236], nav-

igation [236, 465], object manipulation [226], personalized

robots [466], etc. LLMs enable robots to understand the en-

vironment effectively and generate plans to complete tasks col-

laboratively [230, 26]. They can facilitate continuous learning

by allowing robots to access and integrate information from a

wide range of sources, helping robots acquire new skills, adapt

to changes, and refine their paths [214, 223, 224].

7. Challenges and Future Directions

LLMs such as GPT-4 and its predecessors have significantly

advanced natural language processing. Nevertheless, they also

bring along a set of challenges. The computational cost, ad-

versarial robustness, and interpretability are among the tech-

nical challenges that are intrinsic to these models. Further-

more, as these models are scaled up to handle more complex

tasks or to operate in more complex or dynamic environments,

new challenges in scalability, privacy, and real-time processing

emerge. On the frontier of foundational research, integrating

multi-modality and the effectiveness of transfer learning are be-

ing keenly explored. Additionally, the continuous learning as-

pect of these models, which aims to have models that can adapt

to new information over time, presents a fresh set of challenges.

These challenges not only underscore the technical intricacies

involved but also highlight the broader impact and the future

trajectory of LLMs in real-world applications. The following

sections delve into these challenges, shedding light on the on-

going and potential efforts to address them.

Computational Cost: Training LLMs requires extensive com-

putational resources, which increases production costs and

raises environmental concerns due to substantial energy con-

sumption during large-scale training. Improved performance

occurs as computational resources increase, but the rate of

improvement gradually decreases when both the model and

dataset size remain fixed, following the power law of dimin-

ishing returns [467].
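The diminishing-returns behavior can be made concrete with a toy power law in which loss falls as a negative power of compute. The functional form L(C) = a·C^(-b) and both constants below are hypothetical placeholders chosen for illustration; they are not fitted values from [467].

```python
def power_law_loss(compute, scale=10.0, exponent=0.05):
    """Hypothetical loss L(C) = scale * C^(-exponent); both constants are placeholders."""
    return scale * compute ** (-exponent)

# Each budget is 10x the previous one (arbitrary units of compute).
budgets = [1e18, 1e19, 1e20, 1e21]
losses = [power_law_loss(c) for c in budgets]

# Absolute improvement bought by each successive 10x increase in compute.
gains = [prev - nxt for prev, nxt in zip(losses, losses[1:])]
```

Every tenfold increase multiplies the loss by the same factor, so the absolute improvement in `gains` shrinks at each step even though the cost grows by an order of magnitude.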

Bias and Fairness: LLMs can inherit and amplify societal biases
in their training data. These biases can manifest in the
model’s outputs, leading to potential ethical and fairness
issues [468].

Overfitting: Although LLMs possess substantial learning ca-

pabilities, they are susceptible to overfitting noisy and peculiar

patterns within their extensive training data. Consequently, this

may cause them to generate illogical responses [469]. The de-

bate about Memorization vs. Generalization in LLMs is about

finding the right balance. Memorization allows the model to

remember specific details from its training data, ensuring it can

provide accurate answers to precise questions. However, gen-

eralization enables the model to make inferences and produce

responses for inputs it has not seen before, which is essential

for handling various real-world tasks. Striking the right balance
is the challenge: too much memorization can lead to overfitting,
making the model inflexible and causing it to struggle with new
inputs [470].

Economic and Research Inequality: The high cost of train-

ing and deploying LLMs may make their development concen-

trated within well-funded organizations, potentially worsening

economic and research inequalities in AI [471].

Reasoning and Planning: Some reasoning and planning tasks,

even as seemingly simple as common-sense planning, which

humans find easy, remain well beyond the current capabilities

of LLMs evaluated using an assessment framework. This is not

entirely unexpected, considering that LLMs primarily generate

text completions based on likelihood and offer no solid guaran-

tees in terms of reasoning abilities [472].

Hallucinations: LLMs exhibit “hallucinations”, where they
generate responses that, while sounding plausible, are incorrect
or do not align with the provided information [473].
Hallucinations can be grouped into three categories:

• Input-conflicting hallucination, where LLMs produce
content that diverges from the input given by users.

• Context-conflicting hallucination, where LLMs generate
content that contradicts information they have generated
earlier.

• Fact-conflicting hallucination, where LLMs generate
content that does not align with established world
knowledge.

Prompt Engineering: Prompts serve as inputs to LLMs, and

their syntax and semantics play a crucial role in determining

the model’s output. Prompt variations, sometimes counter-intuitive
to humans, can result in significant changes in model
output and are addressed through prompt engineering, which
involves designing natural language queries to guide LLMs’
responses effectively [474, 32].

Limited Knowledge: Information acquired during pretraining
is limited and may become obsolete after some time. Retraining
the model using updated data is costly. To generate
factually accurate responses, a retrieval augmentation
pipeline is commonly used [188]. However, pre-trained models are not
trained with retrieval-augmented generation (RAG) [6, 21];
hence, adapting the training pipeline is necessary [183, 25].
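The retrieval augmentation idea can be sketched as a two-step toy pipeline: relevant text is retrieved from an external store and placed in the prompt ahead of the question. The word-overlap scoring, the prompt template, and the sample documents are all illustrative assumptions. Practical systems typically use learned dense retrievers, and the call to the LLM itself is omitted.

```python
def tokenize(text):
    # Lowercase and strip basic punctuation; a deliberately crude tokenizer.
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def retrieve(query, documents, top_k=1):
    """Rank documents by the number of words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_tokens & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Prepend the retrieved passages to the question (hypothetical template)."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# A tiny stand-in for an external, updatable knowledge store.
documents = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "Python is a programming language.",
]
query = "Where is the Eiffel Tower?"
prompt = build_prompt(query, documents)
```

Because the document store can be updated without touching the model weights, this design addresses obsolete knowledge without retraining.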

Safety and Controllability: Using LLMs comes with the risk

of generating harmful, misleading, or inappropriate content,

whether by accident or when given specific prompts. Ensuring

these models are safely utilized is a significant concern [475].

Multi-Modality: Multi-modal learning, where LLMs are

trained on diverse data like text, images, and videos, aims to

create models with richer understanding but faces challenges

in data alignment, fusion strategies, and higher computational

demands.

Catastrophic Forgetting: LLMs are often pre-trained on large

datasets and then fine-tuned on domain-specific data, reducing

training resources but facing issues like domain adaptation and

catastrophic forgetting, which hinders the retention of original

knowledge when learning new tasks.

Adversarial Robustness: Large Language Models (LLMs)

have shown great capabilities in various tasks but are vul-

nerable to adversarial attacks, where slight, deliberate input

alterations can mislead them. Especially with models like

BERT, adversarial fine-tuning can enhance robustness, al-

though it sometimes compromises generalization [476]. As

LLMs integrate more into complex systems, examining their

security properties becomes crucial, given the emerging field

of adversarial attacks on LLMs within trustworthy ML [477].

This vulnerability is notable in safety-critical domains, ne-

cessitating robust adversarial evaluation tools to ensure LLM

reliability [478].

Interpretability and Explainability: The "black-box" nature

of LLMs poses challenges in understanding their decision-

making, which is crucial for broader acceptance and trust,

especially in sensitive domains. Despite their advanced

capabilities, the lack of insight into their operation limits their

effectiveness and trustworthiness [479, 480]. Efforts are being

made to make LLMs more explainable to promote user trust

and to ensure responsible AI usage. Understanding the logic

behind LLMs’ responses is essential for fostering trust and

ensuring they align with human values and legal standards.

Privacy Concerns: Privacy concerns in Large Language

Models (LLMs) have escalated with their growth in complexity

and size, particularly around data sharing and potential misuse.

There is a risk of malicious content creation, filter bypass,

and data privacy issues, especially in e-commerce, where

protecting customer privacy is crucial. If models are trained

on private data, additional concerns arise if such models are

made publicly available. LLMs tend to memorize phrases from

their training sets, which an adversary could exploit to extract

sensitive data, posing a threat to personal privacy [481, 482].

Real-Time Processing: Real-time processing in Large Lan-

guage Models (LLMs) is pivotal for various applications,

especially with the rising popularity of mobile AI applications

and concerns regarding information security and privacy.

However, LLMs often have hundreds of layers and billions

of parameters, which impede real-time processing due to the

high computational demands and limited weight storage on

hardware platforms, particularly in edge computing environ-

ments [483]. While certain efforts like MobileBERT aim

to reduce memory requirements, they still face substantial

execution overhead due to the large number of model layers,

leading to high inference latency.

Long-Term Dependencies: Large Language Models (LLMs)

have shown considerable progress in understanding and

generating text, yet they often struggle with preserving context

and handling long-term dependencies, particularly in complex,

multi-turn conversations or long documents. This limitation

can lead to incoherent or irrelevant responses.

Hardware Acceleration: The growth of LLMs presents signif-

icant hardware challenges due to the increasing computational

and memory demands associated with training and deploying

these models. GPUs have played a crucial role in meeting the

hardware requirements for training LLMs, with the networking

industry also evolving to optimize hardware for training

workloads. However, the growing size of LLMs, which has

been outpacing hardware progress, makes model inference in-

creasingly costly. Model quantization is a promising approach

to bridge the widening gap between LLM size and hardware

capacity [484]. Although specialized hardware acceleration

like GPUs or TPUs can significantly reduce the computational

cost, making real-time applications more feasible, they may not

fully resolve all limitations, necessitating further advancements

in hardware technology.
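As a minimal illustration of model quantization, the sketch below applies symmetric per-tensor 8-bit quantization to a weight matrix. This is one simple scheme chosen for illustration and is not the specific method of [484]. The random matrix stands in for real weights, and the function assumes at least one non-zero weight.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization; assumes at least one non-zero weight."""
    scale = float(np.abs(weights).max()) / 127.0       # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float32 weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)  # placeholder weight matrix
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.abs(weights - restored).max())
```

The int8 tensor occupies a quarter of the float32 storage, at the cost of a rounding error of at most about half the quantization step per weight.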

Regulatory and Ethical Frameworks: The rapid advancements

in artificial intelligence have given rise to sophisticated Large

Language Models (LLMs) like OpenAI’s GPT-4 [147] and

Google’s Bard. These developments underscore the imperative

for regulatory oversight to manage the ethical and social

challenges accompanying LLMs’ widespread use [485]. For

instance, LLMs can generate content that can be used positively
or negatively, emphasizing the need for proactive ethical
frameworks and policy measures to guide their responsible
use and assign accountability for their outputs [486]. Auditing
is identified as a promising governance mechanism to ensure
that AI systems, including LLMs, are designed and deployed
in ways that are ethically, legally, and technically robust [487].

8. Conclusion

This article has comprehensively reviewed the developments in LLMs.
It summarizes significant findings on LLMs in the existing
literature and provides a detailed analysis of the design aspects,
including architectures, datasets, and training pipelines. We
identified crucial architectural components and training strategies
employed by different LLMs. These aspects are presented as
summaries and discussions throughout the article. Moreover, we have
discussed the performance differences of LLMs in zero-shot and
few-shot settings, explored the impact of fine-tuning, and compared
supervised and generalized models and encoder vs. decoder vs.
encoder-decoder architectures. A comprehensive review of
multi-modal LLMs, retrieval-augmented LLMs, LLM-powered agents,
efficient LLMs, datasets, evaluation, applications, and challenges
is also provided. This article is anticipated to serve as a valuable
resource for researchers, offering insights into the recent
advancements in LLMs and providing fundamental concepts and
details to develop better LLMs.

References

[1] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers:“the end of his-

tory” for natural language processing?, in: Machine Learning and

Knowledge Discovery in Databases. Research Track: European Con-

ference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021,

Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1

[2] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill,

O. Levy, S. Bowman, Superglue: A stickier benchmark for general-

purpose language understanding systems, Advances in neural informa-

tion processing systems 32 (2019). 1, 24, 29

[3] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan,

Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al., Towards a human-

like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1

[4] B. A. y Arcas, Do large language models understand us?, Daedalus

151 (2) (2022) 183–197. 2

[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al.,

Language models are unsupervised multitask learners, OpenAI blog

1 (8) (2019) 9. 2, 7

[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,

A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models

are few-shot learners, Advances in neural information processing sys-

tems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 17, 22, 23, 24, 25, 33

[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training

of deep bidirectional transformers for language understanding, arXiv

preprint arXiv:1810.04805 (2018). 2, 18, 24

[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,

L. Zettlemoyer, Deep contextualized word representations, in: NAACL-

HLT, Association for Computational Linguistics, 2018, pp. 2227–2237.

2

[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy,

V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-

training for natural language generation, translation, and comprehen-

sion, arXiv preprint arXiv:1910.13461 (2019). 2

[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,

Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with

a unified text-to-text transformer, The Journal of Machine Learning Re-

search 21 (1) (2020) 5485–5551. 2, 7, 8, 17, 19, 24, 25, 26, 28, 30,

31

[11] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant,

A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-

text transformer, arXiv preprint arXiv:2010.11934 (2020). 2, 7, 8, 24,

25, 28, 30

[12] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi,

J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-effective pre-trained lan-

guage models, AI Open 2 (2021) 216–224. 2, 8, 25

[13] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow,

R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b-

parameter open-access multilingual language model, arXiv preprint

arXiv:2211.05100 (2022). 2, 4, 9, 11, 22, 23, 24, 25, 30

[14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan,

M. Diab, X. Li, X. V. Lin, et al., Opt: Open pre-trained transformer

language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 23,

24, 25

[15] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts,

P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scal-

ing language modeling with pathways, arXiv preprint arXiv:2204.02311

(2022). 2, 6, 9, 10, 22, 23, 24, 25

[16] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li,

X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned

language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 17,

22, 23, 25, 28, 31

[17] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai,

A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask

prompted training enables zero-shot task generalization, arXiv preprint

arXiv:2110.08207 (2021). 2, 11, 25, 28, 31

[18] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei,

A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al.,

Super-naturalinstructions: Generalization via declarative instructions on

1600+ nlp tasks, in: Proceedings of the 2022 Conference on Empirical

Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7,

11, 17, 23, 25, 28, 31

[19] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Ha-

jishirzi, Self-instruct: Aligning language model with self generated in-

structions, arXiv preprint arXiv:2212.10560 (2022). 2, 11, 18, 22, 28

[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin,

C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language mod-

els to follow instructions with human feedback, Advances in Neural In-

formation Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16,

22

[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,

N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open

foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288

(2023). 2, 7, 10, 16, 25, 33

[22] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yo-

gatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of

large language models, arXiv preprint arXiv:2206.07682 (2022). 2

[23] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large

language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2

[24] D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous sci-

entific research capabilities of large language models, arXiv preprint

arXiv:2304.05332 (2023). 2

[25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick,

J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with

retrieval augmented language models, arXiv preprint arXiv:2208.03299

(2022). 2, 17, 18, 33

[26] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter,

A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied

multimodal language model, arXiv preprint arXiv:2303.03378 (2023).

2, 19, 21, 33

[27] A. Parisi, Y. Zhao, N. Fiedel, Talm: Tool augmented language models,

arXiv preprint arXiv:2205.12255 (2022). 2, 18, 19

[28] B. Zhang, H. Soh, Large language models as zero-shot human models

for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2,

33

[29] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi,

Y. Shi, et al., mplug-owl: Modularization empowers large language

models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2,

22

[30] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo,

T. Lu, J. Zhou, Y. Qiao, et al., Visionllm: Large language model

is also an open-ended decoder for vision-centric tasks, arXiv preprint

arXiv:2305.11175 (2023).

2, 22

[31] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, Y. Shan, Gpt4tools:

Teaching large language model to use tools via self-instruction, arXiv

preprint arXiv:2305.18752 (2023). 2, 19, 22

[32] E. Saravia, Prompt Engineering Guide, https://github.com/dair-

ai/Prompt-Engineering-Guide (12 2022). 2, 7, 17, 33

[33] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu,

W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained

model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 22, 23, 25

[34] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5+:

Open code large language models for code understanding and genera-

tion, arXiv preprint arXiv:2305.07922 (2023). 2, 10, 24, 25

[35] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang,

Y. Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowl-

edge enhanced pre-training for language understanding and generation,

arXiv preprint arXiv:2112.12731 (2021). 2, 8, 23, 25

[36] J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, Deepspeed: System op-

timizations enable training deep learning models with over 100 billion

parameters, in: Proceedings of the 26th ACM SIGKDD International

Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–

3506. 2, 5

[37] S. Rajbhandari, J. Rasley, O. Ruwase, Y. He, Zero: Memory optimiza-

tions toward training trillion parameter models, in: SC20: International

Conference for High Performance Computing, Networking, Storage and

Analysis, IEEE, 2020, pp. 1–16. 2, 4, 23

[38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards

a unified view of parameter-efficient transfer learning, arXiv preprint

arXiv:2110.04366 (2021). 2, 20, 21

[39] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Po-

ria, Llm-adapters: An adapter family for parameter-efficient fine-tuning

of large language models, arXiv preprint arXiv:2304.01933 (2023). 2,

20

[40] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-

efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8,

20

[41] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for

generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20

[42] X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of

large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 21

[43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang,

From dense to sparse: Contrastive pruning for better pre-trained lan-

guage model compression, in: Proceedings of the AAAI Conference on

Artificial Intelligence, Vol. 36, 2022, pp. 11547–11555. 2, 21

[44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, Smoothquant:

Accurate and efficient post-training quantization for large language

models, in: ICML, Vol. 202 of Proceedings of Machine Learning Re-

search, PMLR, 2023, pp. 38087–38099. 2, 20

[45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong,

Compression of generative pre-trained language models via quantiza-

tion, arXiv preprint arXiv:2203.10705 (2022). 2, 20

[46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu,

Giraffe: Adventures in expanding context lengths in llms, arXiv preprint

arXiv:2308.10882 (2023). 2, 17

[47] B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: Efficient con-

text window extension of large language models, arXiv preprint

arXiv:2309.00071 (2023). 2, 17

[48] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang,

Longt5: Efficient text-to-text transformer for long sequences, arXiv

preprint arXiv:2112.07916 (2021). 2, 17

[49] S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window

of large language models via positional interpolation, arXiv preprint

arXiv:2306.15595 (2023). 2, 17

[50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang,

J. Zhang, Z. Dong, et al., A survey of large language models, arXiv

preprint arXiv:2303.18223 (2023). 2, 3, 7

[51] U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive sur-

vey on word representation models: From classical to state-of-the-art

word representation language models, Transactions on Asian and Low-

Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3

[52] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz,

E. Agirre, I. Heinz, D. Roth, Recent advances in natural language pro-

cessing via large pre-trained language models: A survey, arXiv preprint

arXiv:2111.01243 (2021). 2, 3

[53] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan,

L. He, et al., A comprehensive survey on pretrained foundation models:

A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023).

2, 3

[54] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,

J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint

arXiv:2301.00234 (2022). 2, 7, 17

[55] J. Huang, K. C.-C. Chang, Towards reasoning in large language models:

A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 17

[56] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang,

Q. Liu, Aligning large language models with human: A survey, arXiv

preprint arXiv:2307.12966 (2023). 2

[57] X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression

for large language models, arXiv preprint arXiv:2308.07633 (2023). 2

[58] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multi-

modal large language models, arXiv preprint arXiv:2306.13549 (2023).

2, 22

[59] J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COL-

ING 1992 volume 4: The 14th international conference on computa-

tional linguistics, 1992. 4

[60] T. Kudo, Subword regularization: Improving neural network translation

models with multiple subword candidates, in: Proceedings of the 56th

Annual Meeting of the Association for Computational Linguistics (Vol-

ume 1: Long Papers), 2018, pp. 66–75. 4

[61] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare

words with subword units, in: Proceedings of the 54th Annual Meet-

ing of the Association for Computational Linguistics (Volume 1: Long

Papers), 2016, pp. 1715–1725. 4

[62] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012

IEEE international conference on acoustics, speech and signal process-

ing (ICASSP), IEEE, 2012, pp. 5149–5152. 4

[63] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé,

A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and char-

acters: A brief history of open-vocabulary modeling and tokenization in

nlp, arXiv preprint arXiv:2112.10508 (2021). 4

[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,

Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural

information processing systems 30 (2017). 4, 7

[65] O. Press, N. Smith, M. Lewis, Train short, test long: Attention with

linear biases enables input length extrapolation, in: International Con-

ference on Learning Representations, 2022.

URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17

[66] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, Roformer: En-

hanced transformer with rotary position embedding, arXiv preprint

arXiv:2104.09864 (2021). 4, 9, 17

[67] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences

with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7,

23

[68] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and

memory-efficient exact attention with io-awareness, Advances in Neural

Information Processing Systems 35 (2022) 16344–16359. 4

[69] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks

are universal approximators, Neural networks 2 (5) (1989) 359–366. 4

[70] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltz-

mann machines, in: Proceedings of the 27th international conference on

machine learning (ICML-10), 2010, pp. 807–814. 4

[71] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv

preprint arXiv:1606.08415 (2016). 4

[72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,

Dropout: a simple way to prevent neural networks from overfitting, The

journal of machine learning research 15 (1) (2014) 1929–1958. 4

[73] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R.

Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal, Zoneout: Regular-

izing rnns by randomly preserving hidden activations, arXiv preprint

arXiv:1606.01305 (2016). 4

[74] N. Shazeer, Glu variants improve transformer, arXiv preprint

arXiv:2002.05202 (2020). 4

[75] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with

gated convolutional networks, in: International conference on machine

learning, PMLR, 2017, pp. 933–941. 4

[76] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint

arXiv:1607.06450 (2016). 4

[77] B. Zhang, R. Sennrich, Root mean square layer normalization, Advances

in Neural Information Processing Systems 32 (2019). 4

[78] A. Baevski, M. Auli, Adaptive input representations for neural language

modeling, arXiv preprint arXiv:1809.10853 (2018). 4

[79] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling

transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4

[80] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro,

Megatron-lm: Training multi-billion parameter language models using

model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5

[81] "bmtrain: Efficient training for big models.".

URL https://github.com/OpenBMB/BMTrain 4, 5

[82] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cis-

tac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-

art natural language processing, in: Proceedings of the 2020 conference

on empirical methods in natural language processing: system demon-

strations, 2020, pp. 38–45. 5

[83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclau-

rin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al.,

Jax: composable transformations of python+numpy programs (2018).

5

[84] S. Li, J. Fang, Z. Bian, H. Liu, Y. Liu, H. Huang, B. Wang, Y. You,

Colossal-ai: A unified deep learning system for large-scale parallel train-

ing, arXiv preprint arXiv:2110.14883 (2021). 5

[85] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, J. Tang, Fastmoe: A

fast mixture-of-expert training system, arXiv preprint arXiv:2103.13262

(2021). 5

[86] Huawei Technologies Co., Ltd., Huawei mindspore ai development

framework, in: Artificial Intelligence Technology, Springer, 2022, pp. 137–

162. 5

[87] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,

T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imper-

ative style, high-performance deep learning library, Advances in neural

information processing systems 32 (2019). 5

[88] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,

S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for

large-scale machine learning, in: OSDI, Vol. 16, Savannah, GA, USA, 2016,

pp. 265–283. 5

[89] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao,

B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine

learning library for heterogeneous distributed systems, arXiv preprint

arXiv:1512.01274 (2015). 5

[90] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to tril-

lion parameter models with simple and efficient sparsity, The Journal of

Machine Learning Research 23 (1) (2022) 5232–5270. 5, 9

[91] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun,

Y. Zhou, A. W. Yu, O. Firat, et al., Glam: Efficient scaling of language

models with mixture-of-experts, in: International Conference on Ma-

chine Learning, PMLR, 2022, pp. 5547–5569. 5, 9, 23, 25

[92] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li,

X. Zhang, A. Podolskiy, G. Arshinov, et al., Pangu-Σ: Towards trillion

parameter language model with sparse heterogeneous computing, arXiv

preprint arXiv:2303.10845 (2023). 5, 10, 11, 23, 25

[93] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy,

J. Launay, C. Raffel, What language model architecture and pretrain-

ing objective works best for zero-shot generalization?, in: International

Conference on Machine Learning, PMLR, 2022, pp. 22964–22984. 5

[94] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou,

H.-W. Hon, Unified language model pre-training for natural language

understanding and generation, Advances in neural information process-

ing systems 32 (2019). 6

[95] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child,

S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language

models, arXiv preprint arXiv:2001.08361 (2020). 6

[96] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai,

E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark,

et al., Training compute-optimal large language models, arXiv preprint

arXiv:2203.15556 (2022). 6, 9, 25, 29

[97] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster,

T. Wang, Q. Liu, P. S. Koura, et al., Opt-iml: Scaling language model in-

struction meta learning through the lens of generalization, arXiv preprint

arXiv:2212.12017 (2022). 7, 11, 17, 22, 25, 28

[98] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, C. Gan,

Principle-driven self-alignment of language models from scratch with

minimal human supervision, arXiv preprint arXiv:2305.03047 (2023).

7, 16

[99] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones,

N. Joseph, B. Mann, N. DasSarma, et al., A general language assistant

as a laboratory for alignment, arXiv preprint arXiv:2112.00861 (2021).

7

[100] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei,

P. Christiano, G. Irving, Fine-tuning language models from human pref-

erences, arXiv preprint arXiv:1909.08593 (2019). 7

[101] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, M. Seo, The cot collec-

tion: Improving zero-shot and few-shot learning of language models via

chain-of-thought fine-tuning, arXiv preprint arXiv:2305.14045 (2023).

7, 11

[102] Q. Liu, F. Zhou, Z. Jiang, L. Dou, M. Lin, From zero to hero: Exam-

ining the power of symbolic tasks in instruction tuning, arXiv preprint

arXiv:2304.07995 (2023). 7, 11

[103] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,

D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large

language models, Advances in Neural Information Processing Systems

35 (2022) 24824–24837. 7, 19, 22

[104] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd-

hery, D. Zhou, Self-consistency improves chain of thought reasoning in

language models, arXiv preprint arXiv:2203.11171 (2022). 7, 19

[105] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan,

Tree of thoughts: Deliberate problem solving with large language mod-

els, arXiv preprint arXiv:2305.10601 (2023). 7, 19

[106] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe,

A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learn-

ing for nlp, in: International Conference on Machine Learning, PMLR,

2019, pp. 2790–2799. 7, 20

[107] S. McCandlish, J. Kaplan, D. Amodei, O. D. Team, An empirical model

of large-batch training, arXiv preprint arXiv:1812.06162 (2018). 7

[108] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang,

K. Wang, X. Zhang, et al., Pangu-α : Large-scale autoregressive pre-

trained chinese language models with auto-parallel computation, arXiv

preprint arXiv:2104.12369 (2021). 8, 22, 23, 25

[109] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang,

J. Tang, Wudaocorpora: A super large-scale chinese corpora for pre-

training language models, AI Open 2 (2021) 65–68. 8, 30

[110] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen,

Y. Zhao, Y. Lu, et al., Ernie 3.0: Large-scale knowledge enhanced

pre-training for language understanding and generation, arXiv preprint

arXiv:2107.02137 (2021). 8, 25

[111] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov,

Transformer-xl: Attentive language models beyond a fixed-length con-

text, arXiv preprint arXiv:1901.02860 (2019). 8

[112] O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical details

and evaluation, White Paper. AI21 Labs 1 (2021). 8, 23, 25

[113] Y. Levine, N. Wies, O. Sharir, H. Bata, A. Shashua, Limits to depth ef-

ficiencies of self-attention, Advances in Neural Information Processing

Systems 33 (2020) 22640–22651. 8, 11

[114] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park,

S. Kim, S. Kim, D. Seo, et al., What changes can large-scale language

models bring? intensive study on hyperclova: Billions-scale korean

generative pretrained transformers, arXiv preprint arXiv:2109.04650

(2021). 8, 25

[115] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo,

L. Xu, et al., Yuan 1.0: Large-scale pre-trained language model in zero-

shot and few-shot learning, arXiv preprint arXiv:2110.04725 (2021). 8,

23, 25

[116] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song,

J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling lan-

guage models: Methods, analysis & insights from training gopher, arXiv

preprint arXiv:2112.11446 (2021). 8, 9, 25, 28

[117] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari,

J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al.,

Using deepspeed and megatron to train megatron-turing nlg 530b, a

large-scale generative language model, arXiv preprint arXiv:2201.11990

(2022). 8, 9, 23, 25

[118] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding,

H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-

source autoregressive language model, arXiv preprint arXiv:2204.06745

(2022). 9, 22, 23, 24, 25

[119] B. Wang, A. Komatsuzaki, Gpt-j-6b: A 6 billion parameter autoregressive

language model (2021). 9

[120] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia,

B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed pre-

cision training, arXiv preprint arXiv:1710.03740 (2017). 9, 23

[121] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hin-

ton, J. Dean, Outrageously large neural networks: The sparsely-gated

mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017). 9, 23

[122] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza,

H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, et al., Alex-

atm 20b: Few-shot learning using a large-scale multilingual seq2seq

model, arXiv preprint arXiv:2208.01448 (2022). 9, 22, 23, 24, 25

[123] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos,

S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., Palm 2 technical report,

arXiv preprint arXiv:2305.10403 (2023). 9, 25

[124] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia,

H. S. Zheng, J. Rao, A. Chowdhery, et al., Transcending scaling laws

with 0.1% extra compute, arXiv preprint arXiv:2210.11399 (2022). 9,

23, 25

[125] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W.

Chung, D. Bahri, T. Schuster, S. Zheng, et al., Ul2: Unifying lan-

guage learning paradigms, in: The Eleventh International Conference

on Learning Representations, 2022. 9, 10, 23, 24, 25

[126] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, J. Tang, Glm: Gen-

eral language model pretraining with autoregressive blank infilling, in:

Proceedings of the 60th Annual Meeting of the Association for Compu-

tational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335. 10

[127] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,

T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al.,

Llama: Open and efficient foundation language models, arXiv preprint

arXiv:2302.13971 (2023). 10, 22, 25

[128] M. N. Rabe, C. Staats, Self-attention does not need O(n^2) memory, arXiv

preprint arXiv:2112.05682 (2021). 10

[129] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch,

M. Shoeybi, B. Catanzaro, Reducing activation recomputation in large

transformer models, Proceedings of Machine Learning and Systems 5

(2023). 10

[130] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese,

C. Xiong, Codegen: An open large language model for code with multi-

turn program synthesis, arXiv preprint arXiv:2203.13474 (2022). 10,

22, 25, 28

[131] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Ed-

wards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large lan-

guage models trained on code, arXiv preprint arXiv:2107.03374 (2021).

10, 25, 29

[132] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,

T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level

code generation with alphacode, Science 378 (6624) (2022) 1092–1097.

10, 23, 25, 29

[133] N. Shazeer, Fast transformer decoding: One write-head is all you need,

arXiv preprint arXiv:1911.02150 (2019). 10

[134] R. Y. Pang, H. He, Text generation by learning from demonstrations,

arXiv preprint arXiv:2009.07839 (2020). 10

[135] R. Dabre, A. Fujita, Softmax tempering for training neural machine

translation models, arXiv preprint arXiv:2009.09372 (2020). 10

[136] Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified

pre-trained encoder-decoder models for code understanding and genera-

tion, arXiv preprint arXiv:2109.00859 (2021). 10

[137] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,

M. Marone, C. Akiki, J. Li, J. Chim, et al., Starcoder: may the source be

with you!, arXiv preprint arXiv:2305.06161 (2023). 10, 25

[138] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia,

A. Poulton, V. Kerkez, R. Stojnic, Galactica: A large language model for

science, arXiv preprint arXiv:2211.09085 (2022). 10, 23, 25, 29

[139] FairScale authors, Fairscale: A general purpose modular pytorch library

for high performance and large scale training,

https://github.com/facebookresearch/fairscale (2021). 10

[140] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T.

Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., Lamda: Language models

for dialog applications, arXiv preprint arXiv:2201.08239 (2022). 11, 25

[141] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann,

P. Kambadur, D. Rosenberg, G. Mann, Bloomberggpt: A large language

model for finance, arXiv preprint arXiv:2303.17564 (2023). 11, 25, 32

[142] X. Zhang, Q. Yang, D. Xu, Xuanyuan 2.0: A large chinese finan-

cial chat model with hundreds of billions parameters, arXiv preprint

arXiv:2305.12002 (2023). 11, 16, 25

[143] B. Wang, Mesh-transformer-jax: Model-parallel implementation of

transformer language model with jax (2021). 12, 23

[144] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman,

T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al.,

Crosslingual generalization through multitask finetuning, arXiv preprint

arXiv:2211.01786 (2022). 11, 25, 28, 31

[145] D. Yin, X. Liu, F. Yin, M. Zhong, H. Bansal, J. Han, K.-W. Chang,

Dynosaur: A dynamic growth paradigm for instruction-tuning data cu-

ration, arXiv preprint arXiv:2305.14327 (2023). 16

[146] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu,

C. He, X. Yue, et al., Llama-adapter v2: Parameter-efficient visual in-

struction model, arXiv preprint arXiv:2304.15010 (2023). 16, 24

[147] OpenAI, Gpt-4 technical report (2023). 16, 34

[148] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang,

T. B. Hashimoto, Stanford alpaca: An instruction-following llama

model, https://github.com/tatsu-lab/stanford_alpaca

(2023). 16, 25, 28

[149] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng,

S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An

open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March

2023).

URL https://lmsys.org/blog/2023-03-30-vicuna/ 16, 22, 25,

28

[150] B. Peng, C. Li, P. He, M. Galley, J. Gao, Instruction tuning with gpt-4,

arXiv preprint arXiv:2304.03277 (2023). 16, 28

[151] T. Liu, B. K. H. Low, Goat: Fine-tuned llama outperforms gpt-4 on

arithmetic tasks, arXiv preprint arXiv:2305.14201 (2023). 16

[152] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo:

Tuning llama model with chinese medical knowledge, arXiv preprint

arXiv:2304.06975 (2023). 16

[153] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, D. Jiang,

Wizardlm: Empowering large language models to follow complex in-

structions, arXiv preprint arXiv:2304.12244 (2023). 16

[154] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin,

D. Jiang, Wizardcoder: Empowering code large language models with

evol-instruct, arXiv preprint arXiv:2306.08568 (2023). 16, 25

[155] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick,

M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al., Teach-

ing language models to support answers with verified quotes, arXiv

preprint arXiv:2203.11147 (2022). 16

[156] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim,

C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al., Webgpt: Browser-

assisted question-answering with human feedback, arXiv preprint

arXiv:2112.09332 (2021). 16, 18, 19, 25, 31

[157] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds,

M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al., Improving

alignment of dialogue agents via targeted human judgements, arXiv

preprint arXiv:2209.14375 (2022). 16, 19, 25

[158] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn,

Direct preference optimization: Your language model is secretly a re-

ward model, arXiv preprint arXiv:2305.18290 (2023). 16

[159] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum,

T. Zhang, Raft: Reward ranked finetuning for generative foundation

model alignment, arXiv preprint arXiv:2304.06767 (2023). 16

[160] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, F. Huang, Rrhf: Rank

responses to align language models with human feedback without tears,

arXiv preprint arXiv:2304.05302 (2023). 16

[161] F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, H. Wang, Preference rank-

ing optimization for human alignment, arXiv preprint arXiv:2306.17492

(2023). 16

[162] H. Liu, C. Sferrazza, P. Abbeel, Languages are rewards: Hindsight fine-

tuning using human feedback, arXiv preprint arXiv:2302.02676 (2023).

16

[163] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen,

A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional ai: Harm-

lessness from ai feedback, arXiv preprint arXiv:2212.08073 (2022). 16

[164] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin,

P. Liang, T. B. Hashimoto, Alpacafarm: A simulation frame-

work for methods that learn from human feedback, arXiv preprint

arXiv:2305.14387 (2023). 16

[165] C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, L. Wang,

Prompting gpt-3 to be reliable, arXiv preprint arXiv:2210.09150 (2022).

16

[166] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukošiūtė, A. Chen,

A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez, et al., The capac-

ity for moral self-correction in large language models, arXiv preprint

arXiv:2302.07459 (2023). 16

[167] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: How does llm safety

training fail?, arXiv preprint arXiv:2307.02483 (2023). 16

[168] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath,

B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., Red teaming lan-

guage models to reduce harms: Methods, scaling behaviors, and lessons

learned, arXiv preprint arXiv:2209.07858 (2022). 16, 28

[169] S. Casper, J. Lin, J. Kwon, G. Culp, D. Hadfield-Menell, Explore, estab-

lish, exploit: Red teaming language models from scratch, arXiv preprint

arXiv:2306.09442 (2023). 16

[170] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese,

N. McAleese, G. Irving, Red teaming language models with language

models, arXiv preprint arXiv:2202.03286 (2022). 16

[171] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are

continual learners, in: Proceedings of the 2022 Conference on Empirical

Methods in Natural Language Processing, 2022, pp. 6107–6122. 16

[172] Z. Shi, A. Lipani, Don’t stop pretraining? make prompt-based fine-

tuning powerful learner, arXiv preprint arXiv:2305.01711 (2023). 17

[173] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty,

C. Baral, Instruction tuned models are quick learners, arXiv preprint

arXiv:2306.05539 (2023). 17

[174] H. Chen, Y. Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y. Yanggong,

J. Zhao, Maybe only 0.5% data is needed: A preliminary exploration

of low training data instruction tuning, arXiv preprint arXiv:2305.09246

(2023). 17

[175] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat,

P. Yu, L. Yu, et al., Lima: Less is more for alignment, arXiv preprint

arXiv:2305.11206 (2023). 17, 25, 28

[176] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, S. Wang, Lm-infinite: Sim-

ple on-the-fly length generalization for large language models, arXiv

preprint arXiv:2308.16137 (2023). 17

[177] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y. Zemlyan-

skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y. Tay, et al., Colt5: Faster

long-range transformers with conditional computation, arXiv preprint

arXiv:2303.09752 (2023). 17

[178] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, F. Wei,

Longnet: Scaling transformers to 1,000,000,000 tokens, arXiv preprint

arXiv:2307.02486 (2023). 17

[179] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, Longlora: Effi-

cient fine-tuning of long-context large language models, arXiv preprint

arXiv:2309.12307 (2023). 17

[180] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend,

E. Karpas, A. Shashua, K. Leyton-Brown, Y. Shoham, Parallel context

windows for large language models, in: Proceedings of the 61st Annual

Meeting of the Association for Computational Linguistics (Volume 1:

Long Papers), 2023, pp. 6383–6402. 17

[181] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, F. Wei,

Augmenting language models with long-term memory, arXiv preprint

arXiv:2306.07174 (2023). 17

[182] X. Xu, Z. Gou, W. Wu, Z.-Y. Niu, H. Wu, H. Wang, S. Wang, Long

time no see! open-domain conversation with long-term persona memory,

arXiv preprint arXiv:2203.05797 (2022). 17

[183] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli-

can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al.,

Improving language models by retrieving from trillions of tokens, in:

International conference on machine learning, PMLR, 2022, pp. 2206–

2240. 17, 18, 33

[184] W. Zhong, L. Guo, Q. Gao, Y. Wang, Memorybank: Enhanc-

ing large language models with long-term memory, arXiv preprint

arXiv:2305.10250 (2023). 17

[185] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, S. Yao,

Reflexion: Language agents with verbal reinforcement learning, arXiv

preprint arXiv:2303.11366 (2023). 17, 19

[186] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, H. Zhao, Chatdb: Augment-

ing llms with databases as their symbolic memory, arXiv preprint

arXiv:2306.03901 (2023). 17

[187] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang,

J. Callan, G. Neubig, Active retrieval augmented generation, arXiv

preprint arXiv:2305.06983 (2023). 17, 18

[188] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-

Brown, Y. Shoham, In-context retrieval-augmented language models,

arXiv preprint arXiv:2302.00083 (2023). 17, 18, 33

[189] X. Li, X. Qiu, Mot: Pre-thinking and recalling enable chatgpt to self-

improve with memory-of-thoughts, arXiv preprint arXiv:2305.05181

(2023). 17

[190] D. Schuurmans, Memory augmented large language models are compu-

tationally universal, arXiv preprint arXiv:2301.04589 (2023). 17

[191] A. Modarressi, A. Imani, M. Fayyaz, H. Schütze, Ret-llm: Towards a

general read-write memory for large language models, arXiv preprint

arXiv:2305.14322 (2023). 17

[192] S. Robertson, H. Zaragoza, et al., The probabilistic relevance frame-

work: Bm25 and beyond, Foundations and Trends® in Information Re-

trieval 3 (4) (2009) 333–389. 18

[193] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, D. Zhou,

Rationale-augmented ensembles in language models, arXiv preprint

arXiv:2207.00747 (2022). 18

[194] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou, W. Chen,

Repocoder: Repository-level code completion through iterative retrieval

and generation, arXiv preprint arXiv:2303.12570 (2023). 18

[195] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y. Dong,

O. Kuchaiev, B. Li, C. Xiao, et al., Shall we pretrain autoregressive

language models with retrieval? a comprehensive study, arXiv preprint

arXiv:2304.06762 (2023). 18

[196] L. Wang, N. Yang, F. Wei, Learning to retrieve in-context examples for

large language models, arXiv preprint arXiv:2307.07164 (2023). 18

[197] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, W. Chen, What makes

good in-context examples for gpt-3?, arXiv preprint arXiv:2101.06804

(2021). 18

[198] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in-

context learning, arXiv preprint arXiv:2112.08633 (2021). 18

[199] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettle-

moyer, W.-t. Yih, Replug: Retrieval-augmented black-box language

models, arXiv preprint arXiv:2301.12652 (2023). 18

[200] O. Rubin, J. Berant, Long-range language modeling with self-retrieval,

arXiv preprint arXiv:2306.13421 (2023). 18

[201] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented

language model pre-training, in: International conference on machine

learning, PMLR, 2020, pp. 3929–3938. 18

[202] S. Hofstätter, J. Chen, K. Raman, H. Zamani, Fid-light: Efficient and ef-

fective retrieval-augmented text generation, in: Proceedings of the 46th

International ACM SIGIR Conference on Research and Development in

Information Retrieval, 2023, pp. 1437–1447. 18

[203] M. Komeili, K. Shuster, J. Weston, Internet-augmented dialogue gener-

ation, arXiv preprint arXiv:2107.07566 (2021). 18

[204] A. Lazaridou, E. Gribovskaya, W. Stokowiec, N. Grigorev, Internet-

augmented language models through few-shot prompting for open-

domain question answering, arXiv preprint arXiv:2203.05115 (2022).

18

[205] D. Gao, L. Ji, L. Zhou, K. Q. Lin, J. Chen, Z. Fan, M. Z. Shou, Assist-

gpt: A general multi-modal assistant that can plan, execute, inspect, and

learn, arXiv preprint arXiv:2306.08640 (2023). 18, 19

[206] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu,

J. Gao, Chameleon: Plug-and-play compositional reasoning with large

language models, arXiv preprint arXiv:2304.09842 (2023). 18, 19, 22

[207] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, M. T.

Ribeiro, Art: Automatic multi-step reasoning and tool-use for large lan-

guage models, arXiv preprint arXiv:2303.09014 (2023). 18

[208] C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, R. Kr-

ishna, T. Pfister, Tool documentation enables zero-shot tool-usage with

large language models, arXiv preprint arXiv:2308.00675 (2023). 18

[209] Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, S. Li, Restgpt:

Connecting large language models with real-world applications via rest-

ful apis, arXiv preprint arXiv:2306.06624 (2023). 18

[210] S. Hao, T. Liu, Z. Wang, Z. Hu, Toolkengpt: Augmenting frozen lan-

guage models with massive tools via tool embeddings, arXiv preprint

arXiv:2305.11554 (2023). 18

[211] S. G. Patil, T. Zhang, X. Wang, J. E. Gonzalez, Gorilla: Large language

model connected with massive apis, arXiv preprint arXiv:2305.15334

(2023). 18

[212] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, J. Zhang, On the tool manipu-

lation capability of open-source large language models, arXiv preprint

arXiv:2305.16504 (2023). 18

[213] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang,

B. Qian, et al., Toolllm: Facilitating large language models to master

16000+ real-world apis, arXiv preprint arXiv:2307.16789 (2023). 18,

19

[214] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, Y. Zhuang, Hugginggpt: Solv-

ing ai tasks with chatgpt and its friends in huggingface, arXiv preprint

arXiv:2303.17580 (2023). 19, 33

[215] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, S. Lu, L. Ji,

S. Mao, et al., Taskmatrix.ai: Completing tasks by connecting

foundation models with millions of apis, arXiv preprint arXiv:2303.16434

(2023). 19

[216] D. Surís, S. Menon, C. Vondrick, Vipergpt: Visual inference via python

execution for reasoning, arXiv preprint arXiv:2303.08128 (2023). 19

[217] A. Maedche, S. Morana, S. Schacht, D. Werth, J. Krumeich, Advanced

user assistance systems, Business & Information Systems Engineering

58 (2016) 367–370. 19

[218] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence

134 (1-2) (2002) 57–83. 19

[219] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang,

S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for

multi-agent collaborative framework, arXiv preprint arXiv:2308.00352

(2023). 19

[220] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang,

S. Jin, E. Zhou, et al., The rise and potential of large language model

based agents: A survey, arXiv preprint arXiv:2309.07864 (2023). 19

[221] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang,

X. Chen, Y. Lin, et al., A survey on large language model based au-

tonomous agents, arXiv preprint arXiv:2308.11432 (2023). 19

[222] W. Huang, P. Abbeel, D. Pathak, I. Mordatch, Language models as zero-

shot planners: Extracting actionable knowledge for embodied agents,

in: International Conference on Machine Learning, PMLR, 2022, pp.

9118–9147. 19

[223] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, Z. Hu, Reason-

ing with language model is planning with world model, arXiv preprint

arXiv:2305.14992 (2023). 19, 33

[224] W. Yao, S. Heinecke, J. C. Niebles, Z. Liu, Y. Feng, L. Xue, R. Murthy,

Z. Chen, J. Zhang, D. Arpit, et al., Retroformer: Retrospective

large language agents with policy gradient optimization, arXiv preprint

arXiv:2308.02151 (2023). 19, 33

[225] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng,

J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson,

N. Brown, L. Luu, S. Levine, K. Hausman, B. Ichter, Inner monologue:
Embodied reasoning through planning with language models, in:

6th Annual Conference on Robot Learning, 2022.

URL https://openreview.net/forum?id=3R3Pz5i0tye 19

[226] C. Jin, W. Tan, J. Yang, B. Liu, R. Song, L. Wang, J. Fu, Alphablock:

Embodied finetuning for vision-language reasoning in robot manipula-

tion, arXiv preprint arXiv:2305.18898 (2023). 19, 20, 33

[227] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox,

J. Thomason, A. Garg, Progprompt: Generating situated robot task plans

using large language models, in: 2023 IEEE International Conference on

Robotics and Automation (ICRA), IEEE, 2023, pp. 11523–11530. 19,

33

[228] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L.

Chiang, T. Erez, L. Hasenclever, J. Humplik, et al., Language to rewards

for robotic skill synthesis, arXiv preprint arXiv:2306.08647 (2023). 19

[229] X. Tang, A. Zou, Z. Zhang, Y. Zhao, X. Zhang, A. Cohan, M. Gerstein,

Medagents: Large language models as collaborators for zero-shot med-

ical reasoning, arXiv preprint arXiv:2311.10537 (2023). 19

[230] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho,

J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., Do as i can, not as i say:

Grounding language in robotic affordances, in: Conference on Robot

Learning, PMLR, 2023, pp. 287–318. 19, 33

[231] H. Ha, P. Florence, S. Song, Scaling up and distilling down: Language-

guided robot skill acquisition, arXiv preprint arXiv:2307.14535 (2023).

20

[232] A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, A. Velasquez, Say-

nav: Grounding large language models for dynamic planning to navi-

gation in new environments, arXiv preprint arXiv:2309.04077 (2023).

20

[233] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y. Su,

Llm-planner: Few-shot grounded planning for embodied agents with

large language models, arXiv preprint arXiv:2212.04088 (2022). 20

[234] V. S. Dorbala, J. F. Mullen Jr, D. Manocha, Can an embodied agent find

your "cat-shaped mug"? llm-based zero-shot object navigation, arXiv

preprint arXiv:2303.03480 (2023). 20

[235] C. Huang, O. Mees, A. Zeng, W. Burgard, Visual language maps for

robot navigation, in: 2023 IEEE International Conference on Robotics

and Automation (ICRA), IEEE, 2023, pp. 10608–10615. 20

[236] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Task and motion planning

with large language models for object rearrangement, arXiv preprint

arXiv:2303.06247 (2023). 20, 33

[237] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, Gpt under-

stands, too, arXiv preprint arXiv:2103.10385 (2021). 20

[238] G. Chen, F. Liu, Z. Meng, S. Liang, Revisiting parameter-efficient tun-

ing: Are we really there yet?, arXiv preprint arXiv:2202.07962 (2022).

20

[239] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, J. Gao,

Adamix: Mixture-of-adapter for parameter-efficient tuning of large language
models, arXiv preprint arXiv:2205.12410 (2022). 20

[240] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,

W. Chen, Lora: Low-rank adaptation of large language models, arXiv

preprint arXiv:2106.09685 (2021). 20, 21, 22

[241] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, J. Tang, P-tuning: Prompt

tuning can be comparable to fine-tuning across scales and tasks, in: Pro-

ceedings of the 60th Annual Meeting of the Association for Computa-

tional Linguistics (Volume 2: Short Papers), 2022, pp. 61–68. 20

[242] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, A. Almahairi,

Progressive prompts: Continual learning for language models, arXiv

preprint arXiv:2301.12314 (2023). 20

[243] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, S. Huang, To-

wards adaptive prefix tuning for parameter-efficient language model

fine-tuning, arXiv preprint arXiv:2305.15212 (2023). 20

[244] E. B. Zaken, S. Ravfogel, Y. Goldberg, Bitfit: Simple parameter-

efficient fine-tuning for transformer-based masked language-models,

arXiv preprint arXiv:2106.10199 (2021). 20

[245] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Llm.int8():

8-bit matrix multiplication for transformers at scale, arXiv preprint

arXiv:2208.07339 (2022). 20, 21

[246] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, Gptq: Accurate

post-training quantization for generative pre-trained transformers, arXiv

preprint arXiv:2210.17323 (2022). 20

[247] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, X. Liu, Outlier sup-

pression+: Accurate quantization of large language models by equiva-

lent and optimal shifting and scaling, arXiv preprint arXiv:2304.09145

(2023). 20

[248] E. Frantar, D. Alistarh, Optimal brain compression: A framework for

accurate post-training quantization and pruning, Advances in Neural In-

formation Processing Systems 35 (2022) 4475–4488. 20

[249] C. Lee, J. Jin, T. Kim, H. Kim, E. Park, Owq: Lessons learned from ac-

tivation outliers for weight quantization in large language models, arXiv

preprint arXiv:2306.02272 (2023). 21

[250] S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.-

W. Ha, N. Sung, D. Lee, Alphatuning: Quantization-aware parameter-

efficient adaptation of large-scale pre-trained language models, arXiv

preprint arXiv:2210.03858 (2022). 21

[251] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient

finetuning of quantized llms, arXiv preprint arXiv:2305.14314 (2023).

21

[252] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Kr-

ishnamoorthi, V. Chandra, Llm-qat: Data-free quantization aware train-

ing for large language models, arXiv preprint arXiv:2305.17888 (2023).

21

[253] Y. Guo, A. Yao, H. Zhao, Y. Chen, Network sketching: Exploiting bi-

nary structure in deep cnns, in: Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, 2017, pp. 5955–5963. 21

[254] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, D. Lee,

Memory-efficient fine-tuning of compressed large language models via

sub-4-bit integer quantization, arXiv preprint arXiv:2305.14152 (2023).

21

[255] M. Sun, Z. Liu, A. Bair, J. Z. Kolter, A simple and effective pruning

approach for large language models, arXiv preprint arXiv:2306.11695

(2023). 21

[256] Z. Wang, J. Wohlwend, T. Lei, Structured pruning of large language

models, arXiv preprint arXiv:1910.04732 (2019). 21

[257] L. Yin, Y. Wu, Z. Zhang, C.-Y. Hsieh, Y. Wang, Y. Jia, M. Pechenizkiy,

Y. Liang, Z. Wang, S. Liu, Outlier weighed layerwise sparsity (owl): A

missing secret sauce for pruning llms to high sparsity, arXiv preprint

arXiv:2310.05175 (2023). 21

[258] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, N. Wong,

Structured pruning for efficient generative pre-trained language models,

in: Findings of the Association for Computational Linguistics: ACL

2023, 2023, pp. 10880–10895. 21

[259] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc,

A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: a visual lan-

guage model for few-shot learning, Advances in Neural Information Pro-

cessing Systems 35 (2022) 23716–23736. 21, 22

[260] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image

pre-training with frozen image encoders and large language models,

arXiv preprint arXiv:2301.12597 (2023). 21, 22

[261] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint

arXiv:2304.08485 (2023). 21, 22

[262] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang,

Y. Qiao, Videochat: Chat-centric video understanding, arXiv preprint

arXiv:2305.06355 (2023). 21, 22

[263] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, Video-chatgpt: Towards de-

tailed video understanding via large vision and language models, arXiv

preprint arXiv:2306.05424 (2023). 21

[264] H. Zhang, X. Li, L. Bing, Video-llama: An instruction-tuned

audio-visual language model for video understanding, arXiv preprint

arXiv:2306.02858 (2023). 21

[265] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley,

Y. Zou, W. Wang, Wavcaps: A chatgpt-assisted weakly-labelled au-

dio captioning dataset for audio-language multimodal research, arXiv

preprint arXiv:2303.17395 (2023). 21

[266] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, Z. Tu, Macaw-

llm: Multi-modal language modeling with image, audio, video, and text

integration, arXiv preprint arXiv:2306.09093 (2023). 21

[267] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing

vision-language understanding with advanced large language models,

arXiv preprint arXiv:2304.10592 (2023). 22

[268] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,

T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,

An image is worth 16x16 words: Transformers for image recognition at

scale, arXiv preprint arXiv:2010.11929 (2020). 22

[269] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung,

S. Hoi, Instructblip: Towards general-purpose vision-language models

with instruction tuning, arXiv preprint arXiv:2305.06500 (2023). 22

[270] Z. Xu, Y. Shen, L. Huang, Multiinstruct: Improving multi-modal zero-

shot learning via instruction tuning, arXiv preprint arXiv:2212.10773

(2022). 22

[271] Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, J. Liu,

Chatbridge: Bridging modalities with large language model as a lan-

guage catalyst, arXiv preprint arXiv:2305.16103 (2023). 22

[272] L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu,

X. Sun, et al., M3it: A large-scale dataset towards multi-modal multilingual
instruction tuning, arXiv preprint arXiv:2306.04387 (2023). 22

[273] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han,

H. Xu, L. K. T. Zhang, Detgpt: Detect what you need via reasoning,

arXiv preprint arXiv:2305.14167 (2023). 22

[274] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, R. Ji, Cheap and quick:

Efficient vision-language instruction tuning for large language models,

arXiv preprint arXiv:2305.15023 (2023). 22

[275] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao,

Llama-adapter: Efficient fine-tuning of language models with zero-init

attention, arXiv preprint arXiv:2303.16199 (2023). 22

[276] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever,

Robust speech recognition via large-scale weak supervision, in: Inter-

national Conference on Machine Learning, PMLR, 2023, pp. 28492–

28518. 22

[277] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, A. Smola, Multi-

modal chain-of-thought reasoning in language models, arXiv preprint

arXiv:2302.00923 (2023). 22

[278] J. Ge, H. Luo, S. Qian, Y. Gan, J. Fu, S. Zhan, Chain of thought prompt

tuning in vision language models, arXiv preprint arXiv:2304.07919

(2023). 22

[279] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talk-

ing, drawing and editing with visual foundation models, arXiv preprint

arXiv:2303.04671 (2023). 22

[280] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu,

M. Zeng, L. Wang, Mm-react: Prompting chatgpt for multimodal rea-

soning and action, arXiv preprint arXiv:2303.11381 (2023). 22

[281] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao,

S. Zhao, Y. Shan, et al., Caption anything: Interactive image descrip-

tion with diverse multimodal controls, arXiv preprint arXiv:2305.02677

(2023). 22

[282] X. Zhu, R. Zhang, B. He, Z. Zeng, S. Zhang, P. Gao, Pointclip v2:

Adapting clip for powerful 3d open-world learning, arXiv preprint

arXiv:2211.11682 (2022). 22

[283] T. Gupta, A. Kembhavi, Visual programming: Compositional visual rea-

soning without training, in: Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, 2023, pp. 14953–14962.

22

[284] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, H. Li, Dynamic

fusion with intra-and inter-modality attention flow for visual question

answering, in: Proceedings of the IEEE/CVF conference on computer

vision and pattern recognition, 2019, pp. 6639–6648. 22

[285] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention net-

works for visual question answering, in: Proceedings of the IEEE/CVF

conference on computer vision and pattern recognition, 2019, pp. 6281–

6290. 22

[286] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi, K.-

W. Chang, S.-F. Chang, Idealgpt: Iteratively decomposing vision

and language reasoning via large language models, arXiv preprint

arXiv:2305.14985 (2023). 22

[287] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, H. Li,

Prompt, generate, then cache: Cascade of foundation models makes

strong few-shot learners, in: Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, 2023, pp. 15211–15222.

22

[288] T. Q. Nguyen, J. Salazar, Transformers without tears: Improving the

normalization of self-attention, CoRR abs/1910.05895 (2019). 23

[289] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,

L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pre-

training approach, arXiv preprint arXiv:1907.11692 (2019). 24, 30

[290] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine,

D. Song, Koala: A dialogue model for academic research, Blog post

(April 2023).

URL https://bair.berkeley.edu/blog/2023/04/03/koala/

25

[291] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster,

J. Phang, H. He, A. Thite, N. Nabeshima, et al., The pile: An

800gb dataset of diverse text for language modeling, arXiv preprint

arXiv:2101.00027 (2020). 28, 30

[292] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral,

T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen,

et al., The bigscience roots corpus: A 1.6 tb composite multilingual

dataset, Advances in Neural Information Processing Systems 35 (2022)

31809–31826. 28

[293] Wikipedia.

URL https://en.wikipedia.org/wiki/Main_Page 28

[294] Together Computer, Redpajama: An open source recipe to reproduce

llama training dataset (Apr. 2023).

URL https://github.com/togethercomputer/RedPajama-Data 28

[295] O. Honovich, T. Scialom, O. Levy, T. Schick, Unnatural instructions:

Tuning language models with (almost) no human labor, arXiv preprint

arXiv:2212.09689 (2022). 28

[296] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma,

D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., Training a helpful and

harmless assistant with reinforcement learning from human feedback,

arXiv preprint arXiv:2204.05862 (2022). 28

[297] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song,

J. Steinhardt, Measuring massive multitask language understanding,

arXiv preprint arXiv:2009.03300 (2020). 24, 29

[298] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch,

A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond

the imitation game: Quantifying and extrapolating the capabilities of

language models, arXiv preprint arXiv:2206.04615 (2022). 24, 29

[299] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Glue:

A multi-task benchmark and analysis platform for natural language un-

derstanding, arXiv preprint arXiv:1804.07461 (2018). 24, 29

[300] Y. Yao, Q. Dong, J. Guan, B. Cao, Z. Zhang, C. Xiao, X. Wang, F. Qi,

J. Bao, J. Nie, et al., Cuge: A chinese language understanding and gen-

eration evaluation benchmark, arXiv preprint arXiv:2112.13610 (2021).

29

[301] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu,

C. Yu, et al., Clue: A chinese language understanding evaluation bench-

mark, arXiv preprint arXiv:2004.05986 (2020). 29

[302] L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan,

X. Tian, L. Qin, et al., Fewclue: A chinese few-shot learning evaluation

benchmark, arXiv preprint arXiv:2107.07498 (2021). 29

[303] E. M. Smith, M. Williamson, K. Shuster, J. Weston, Y.-L. Boureau, Can

you put it all together: Evaluating conversational agents’ ability to blend

skills, arXiv preprint arXiv:2004.08449 (2020). 29

[304] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga,

Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al., Holistic evaluation of

language models, arXiv preprint arXiv:2211.09110 (2022). 29

[305] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim,

Y. Song, T. Oh, et al., Klue: Korean language understanding evaluation,

arXiv preprint arXiv:2105.09680 (2021). 29

[306] S. Reddy, D. Chen, C. D. Manning, Coqa: A conversational question

answering challenge, Transactions of the Association for Computational

Linguistics 7 (2019) 249–266. 25, 29

[307] M. T. Pilehvar, J. Camacho-Collados, Wic: 10,000 example

pairs for evaluating context-sensitive representations, arXiv preprint

arXiv:1808.09121 (2018). 25, 29

[308]

with APIs to train, fine-tune, infer,

and develop custom models.

DeepSpeed [36]: A library for scalable distributed training and

inference of deep learning models.

Megatron-LM [80]: It provides GPU-optimized techniques for

large-scale training of LLMs.

JAX [83]: A Python library for high-performance numerical
computing and scalable machine learning. It can differentiate
native Python and NumPy functions and execute them on
GPUs.

Colossal-AI [84]: A collection of components to write dis-

tributed deep learning models.

BMTrain [81]: A library for writing efficient stand-alone LLM
training code.

FastMoE [85]: Provides an API to build mixture-of-experts
(MoE) models in PyTorch.

MindSpore [86]: A deep learning training and inference frame-

work extendable to mobile, edge, and cloud computing.

PyTorch [87]: A framework developed by Facebook AI Re-

search lab (FAIR) to build deep learning models. The main

features of PyTorch include a dynamic computation graph and

a pythonic coding style.

TensorFlow [88]: A deep learning framework developed by
Google. Its key features include graph-based computation,
eager execution, and scalability.

MXNet [89]: Apache MXNet is a deep learning framework
that supports writing programs in multiple languages, including
Python, C++, Scala, and R. It also supports both dynamic and
static computation graphs.

2.8. Data Preprocessing

This section briefly summarizes data preprocessing techniques
used in LLM training.

Quality Filtering: Training data quality is essential for good
results. Data is typically filtered with 1) classifier-based or
2) heuristics-based approaches. Classifier-based approaches
train a classifier on high-quality data and use it to predict the
quality of text for filtering, whereas heuristics-based approaches
apply rules based on language, metrics, statistics, and keywords.
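A heuristics-based filter of this kind can be sketched in a few lines. The thresholds, the keyword blocklist, and the function name `passes_heuristics` below are illustrative assumptions, not values taken from any particular LLM pipeline.

```python
# Hypothetical heuristic quality filter; every threshold here is illustrative.
BLOCKLIST = {"lorem", "ipsum"}  # assumed keyword blocklist


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Return True if the text passes simple statistic and keyword rules."""
    words = text.split()
    if len(words) < min_words:  # statistic rule: document is too short
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > max_symbol_ratio:  # statistic rule: mostly symbols
        return False
    if any(w.lower().strip(".,!?") in BLOCKLIST for w in words):  # keyword rule
        return False
    return True


docs = [
    "Large language models are trained on web-scale text corpora.",
    "$$$ ### !!! ??? *** @@@ %%% ^^^",
    "Lorem ipsum dolor sit amet consectetur",
    "Too short.",
]
kept = [d for d in docs if passes_heuristics(d)]
```

Only the first document survives here; a classifier-based filter would replace these hand-written rules with a learned quality score.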

Data Deduplication: Duplicated data can degrade model performance
and increase data memorization; therefore, data deduplication is a
standard preprocessing step when training LLMs. It can be performed
at multiple levels, such as sentences, documents, and datasets.
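As a minimal sketch of document-level deduplication, the snippet below removes exact duplicates by hashing a normalized copy of each document. The normalization rule (lowercasing and collapsing whitespace) is an assumption made for illustration; near-duplicate detection, which real pipelines often add, is not shown.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["The cat sat.", "the  cat sat.", "A different document."]
unique_docs = deduplicate(corpus)
```

The same function can be applied to a list of sentences to deduplicate at the sentence level.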

Privacy Reduction: Most of the training data for LLMs is
collected from web sources. This data often contains private
information; therefore, many LLMs employ heuristics-based
methods to filter information such as names, addresses, and
phone numbers so that the model does not learn personal information.
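One simple heuristic of this kind is pattern-based redaction. The sketch below replaces e-mail addresses and phone-like digit sequences with placeholders; the two regular expressions and the placeholder strings are assumptions chosen for illustration and will miss many real-world formats. Names and postal addresses usually need more than regular expressions and are not handled here.

```python
import re

# Illustrative patterns only; real pipelines use broader, locale-aware rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or call +1 (555) 123-4567 today."
redacted = redact(sample)
```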

2.9. Architectures

Here we discuss the variants of the transformer architectures
used in LLMs. The difference arises due to the application of
the attention and the connection of transformer blocks. An
illustration of attention patterns of these architectures is shown
in Figure 4.

Figure 4: An example of attention patterns in language models, image is taken
from [93].

Figure 5: An example of language model training objectives, image from [93].

Encoder Decoder: This architecture processes inputs through
the encoder and passes the intermediate representation to the
decoder to generate the output. Here, the encoder sees the
complete sequence using self-attention, whereas the decoder
generates the output sequence one token at a time, attending to
the encoder representation through cross-attention.

Causal Decoder: An architecture that has no encoder and both
processes the input and generates the output with a decoder,
where each predicted token depends only on the previous time
steps.

Prefix Decoder: Also known as a non-causal decoder. Here the
attention calculation is not strictly dependent on past information:
attention over the prefix (input) tokens is bidirectional, while the
remaining tokens are still attended causally. An example
of a non-causal attention mask is shown in Figure 4.
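The attention patterns of these architectures can be written as binary masks, where entry (i, j) is 1 if position i may attend to position j. The sketch below uses plain Python lists; the function names are illustrative and not taken from any library.

```python
def full_mask(n):
    """Fully visible mask, as used by an encoder over its input."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Bidirectional attention inside the prefix, causal attention afterwards."""
    return [
        [1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
        for i in range(n)
    ]


for row in prefix_mask(4, prefix_len=2):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

With `prefix_len=0` the prefix mask reduces to the causal mask, and with `prefix_len=n` it reduces to the fully visible mask.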

Mixture-of-Experts: A variant of the transformer architecture
with parallel independent experts and a router that routes tokens
to experts. These experts are feed-forward layers placed after the
attention block [90]. Mixture-of-Experts (MoE) is an efficient
sparse architecture that offers performance comparable to dense
models and allows the model size to grow without a proportional
increase in computational cost, because only a few experts are
activated at a time [91, 92].
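To make the routing idea concrete, the toy sketch below selects the top-k experts for a single token and combines their outputs. It assumes softmax gating with renormalization over the selected experts, which is one common choice rather than the method of any specific model; the experts are scalar functions standing in for feed-forward layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = route_token(router_logits, k)
    return sum(w * experts[i](token) for i, w in gates.items())


# Toy experts: scalar functions standing in for feed-forward layers.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
gates = route_token(logits, k=2)
output = moe_layer(3.0, experts, logits, k=2)
```

Only two of the four experts are evaluated for this token, which is why adding experts increases parameter count much faster than per-token compute.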

2.10. Pre-Training Objectives

This section describes LLM pre-training objectives. For
more details, see [93].

Full Language Modeling: An autoregressive language modeling
objective where the model is asked to predict future tokens
given the previous tokens. An example is shown in Figure 5.

Prefix Language Modeling: A non-causal training objective
where a prefix is chosen randomly and only the remaining target
tokens are used to calculate the loss. An example is shown in
Figure 5.

Figure 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at

different training stages like pre-training, instruction-tuning, or alignment tuning. “RL” stands for reinforcement learning, “RM” represents reward-modeling, and

“RLHF” represents reinforcement learning with human feedback.

Masked Language Modeling: In this training objective, tokens
or spans (sequences of tokens) are masked randomly, and the
model is asked to predict the masked tokens given the past and
future context. An example is shown in Figure 5.

Unified Language Modeling: Unified language modeling [94]
is a combination of causal, non-causal, and masked language
training objectives. Here, in masked language modeling, the
attention is not bidirectional but unidirectional, attending to either
left-to-right or right-to-left context.
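The full, prefix, and masked objectives in this section differ mainly in which positions contribute to the loss. The sketch below builds (inputs, targets, loss mask) triples for a toy token list. The helper names, the `[MASK]` symbol, and the 15% masking rate are assumptions for illustration; the prefix length is passed explicitly here, whereas in training it would be sampled randomly.

```python
import random


def full_lm_example(tokens):
    """Inputs are tokens[:-1]; every next token is a target."""
    return tokens[:-1], tokens[1:], [1] * (len(tokens) - 1)


def prefix_lm_example(tokens, prefix_len):
    """Same shift as full LM, but the loss covers only targets after the prefix."""
    inputs, targets = tokens[:-1], tokens[1:]
    # The target at index t is tokens[t + 1]; it counts if it lies past the prefix.
    loss_mask = [1 if t + 1 >= prefix_len else 0 for t in range(len(targets))]
    return inputs, targets, loss_mask


def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with a mask symbol; the loss covers masked slots."""
    rng = random.Random(seed)
    inputs, loss_mask = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            loss_mask.append(1)
        else:
            inputs.append(tok)
            loss_mask.append(0)
    return inputs, list(tokens), loss_mask


tokens = ["the", "cat", "sat", "on", "the", "mat"]
_, prefix_targets, prefix_loss_mask = prefix_lm_example(tokens, prefix_len=3)
mlm_inputs, mlm_targets, mlm_loss_mask = masked_lm_example(tokens, mask_prob=0.5)
```

For the prefix example, the loss mask is `[0, 0, 1, 1, 1]`: the targets "cat" and "sat" belong to the prefix and are ignored, while "on", "the", and "mat" are scored.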

2.11. LLMs Scaling Laws

Scaling laws study the optimal combination of model parameters,
dataset size, and computational resources, and predict how
model performance improves as these grow. It has been shown
that the loss follows a power law in model size,
dataset size, and compute [95]. That study suggests that
larger models matter more than larger datasets for better performance.
Another variant of the scaling law [96] suggests that the model
size and the number of training tokens should be scaled equally.
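The "scaled equally" recommendation can be illustrated numerically. The sketch below assumes the commonly quoted approximation that training compute is about 6 FLOPs per parameter per token, and a fixed tokens-per-parameter ratio of 20. Both constants are rules of thumb adopted here for illustration and are not stated in this section.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed rule of thumb: C ~ 6 * N * D
TOKENS_PER_PARAM = 20      # assumed ratio, for illustration only


def compute_optimal(compute_flops):
    """Split a compute budget C into parameters N and tokens D with D = 20 * N."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(4e21)  # four times the compute budget
```

Under these assumptions, quadrupling the compute doubles both the parameter count and the token count, that is, both grow with the square root of the budget.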

2.12. LLMs Adaptation Stages

This section discusses the fundamentals of LLMs adaptation

stages, from pre-training to fine-tuning for downstream tasks

and utilization. An example of different training stages and in-

ference in LLMs is shown in Figure 6. In this paper, we refer

to alignment-tuning as aligning with human preferences, while

occasionally the literature uses the term alignment for different

purposes.

2.12.1. Pre-Training

In the very first stage, the model is trained in a self-supervised
manner on a large corpus to predict the next tokens
given the input. The design choices of LLMs vary from
encoder-decoder to decoder-only architectures with different
building blocks and loss functions, as discussed in Sections 2.4, 2.5, and 2.10.

2.12.2. Fine-Tuning

There are different styles to fine-tune an LLM. This section

briefly discusses fine-tuning approaches.

Transfer Learning: The pre-trained LLMs perform well for

various tasks [6, 15]. However, to improve the performance for

a downstream task, pre-trained models are fine-tuned with the

task-specific data [10, 11], known as transfer learning.

Instruction-tuning: To enable a model to respond to user
queries effectively, the pre-trained model is fine-tuned on
instruction-formatted data, i.e., an instruction and an input-output
pair. Instructions generally comprise multi-task data in plain
natural language, guiding the model to respond according to the
prompt and the input. This type of fine-tuning improves zero-shot
generalization and downstream task performance. Details
on formatting instruction data and its various styles are available
in [16, 50, 97].
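As a minimal sketch of instruction-formatted data, the snippet below turns one record (an instruction plus an input-output pair) into a prompt and a target string. The template wording and the field names are hypothetical; the cited works use a variety of formats.

```python
def format_example(instruction, input_text, output_text):
    """Build a (prompt, target) pair from one instruction-formatted record."""
    prompt = f"Instruction: {instruction}\n"
    if input_text:  # some records carry an instruction with no separate input
        prompt += f"Input: {input_text}\n"
    prompt += "Output:"
    return prompt, " " + output_text


record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt, target = format_example(
    record["instruction"], record["input"], record["output"]
)
```

During fine-tuning, the model would typically be trained to produce `target` given `prompt`.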

Alignment-tuning: LLMs are prone to generating false, biased,

and harmful text. To make them helpful, honest, and harmless,

models are aligned using human feedback. Alignment involves

asking LLMs to generate unexpected responses and then updat-

ing their parameters to avoid such responses [20, 21, 98].

It ensures LLMs operate according to human intentions and

values. A model is defined to be an “aligned” model

S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture

models, arXiv preprint arXiv:1609.07843 (2016). 25, 29

[309] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compres-

sive transformers for long-range sequence modelling, arXiv preprint

arXiv:1911.05507 (2019). 25, 29

[310] X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, B. Tang, Lcqmc: A

large-scale chinese question matching corpus, in: Proceedings of the

27th international conference on computational linguistics, 2018, pp.

1952–1962. 26, 29

[311] S. Iyer, N. Dandekar, K. Csernai, First quora dataset release:
Question pairs, https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs. 29

[312] R. Rudinger, J. Naradowsky, B. Leonard, B. Van Durme, Gender bias in

coreference resolution, arXiv preprint arXiv:1804.09301 (2018). 29

[313] M.-C. De Marneffe, M. Simons, J. Tonhauser, The commitmentbank: In-

vestigating projection in naturally occurring discourse, in: proceedings

of Sinn und Bedeutung, Vol. 23, 2019, pp. 107–124. 29

[314] Z. Li, N. Ding, Z. Liu, H. Zheng, Y. Shen, Chinese relation extraction

with multi-grained information and external linguistic knowledge, in:

Proceedings of the 57th Annual Meeting of the Association for Compu-

tational Linguistics, 2019, pp. 4377–4386. 29

[315] J. Xu, J. Wen, X. Sun, Q. Su, A discourse-level named entity recognition

and relation extraction dataset for chinese literature text, arXiv preprint

arXiv:1711.07010 (2017). 29

[316] J. Chen, Q. Chen, X. Liu, H. Yang, D. Lu, B. Tang, The bq corpus: A

large-scale domain-specific chinese corpus for sentence semantic equiv-

alence identification, in: Proceedings of the 2018 conference on empiri-

cal methods in natural language processing, 2018, pp. 4946–4951. 29

[317] B. Liu, D. Niu, H. Wei, J. Lin, Y. He, K. Lai, Y. Xu, Matching arti-

cle pairs with graphical decomposition and convolutions, arXiv preprint

arXiv:1802.07459 (2018). 29

[318] P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, Dataset and neu-

ral recurrent sequence labeling model for open-domain factoid question

answering, arXiv preprint arXiv:1607.06275 (2016). 29

[319] N. Peng, M. Dredze, Named entity recognition for chinese social media

with jointly trained embeddings, in: Proceedings of the 2015 conference

on empirical methods in natural language processing, 2015, pp. 548–

554. 29

[320] W. Ling, D. Yogatama, C. Dyer, P. Blunsom, Program induction by ratio-

nale generation: Learning to solve and explain algebraic word problems,

arXiv preprint arXiv:1705.04146 (2017). 29

[321] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Mar-

cus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, et al., Ontonotes re-

lease 4.0, LDC2011T03, Philadelphia, Penn.: Linguistic Data Consor-

tium (2011). 29

[322] D. Vilares, C. Gómez-Rodríguez, Head-qa: A healthcare dataset for

complex reasoning, arXiv preprint arXiv:1906.04701 (2019). 29

[323] S. L. Blodgett, L. Green, B. O’Connor, Demographic dialectal variation

in social media: A case study of african-american english, arXiv preprint

arXiv:1608.08868 (2016). 29

[324] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Van-

derwende, P. Kohli, J. Allen, A corpus and evaluation framework

for deeper understanding of commonsense stories, arXiv preprint

arXiv:1604.01696 (2016). 26, 29

[325] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi,

S. Pezzelle, M. Baroni, G. Boleda, R. Fernández, The lambada dataset:

Word prediction requiring a broad discourse context, arXiv preprint

arXiv:1606.06031 (2016). 26, 29

[326] B. Hu, Q. Chen, F. Zhu, Lcsts: A large scale chinese short text summa-

rization dataset, arXiv preprint arXiv:1506.05865 (2015). 29

[327] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text gener-

ation with planning-based hierarchical variational model, arXiv preprint

arXiv:1908.06605 (2019). 29

[328] J. Novikova, O. Dušek, V. Rieser, The e2e dataset: New challenges for

end-to-end generation, arXiv preprint arXiv:1706.09254 (2017). 29

[329] C. Zheng, M. Huang, A. Sun, Chid: A large-scale chinese idiom dataset

for cloze test, arXiv preprint arXiv:1906.01265 (2019). 29

[330] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., Piqa: Reasoning about phys-

ical commonsense in natural language, in: Proceedings of the AAAI

conference on artificial intelligence, Vol. 34, 2020, pp. 7432–7439. 26,

29

[331] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, Triviaqa: A large scale

distantly supervised challenge dataset for reading comprehension, arXiv

preprint arXiv:1705.03551 (2017). 26, 29, 31

[332] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,

O. Tafjord, Think you have solved question answering? try arc, the ai2

reasoning challenge, arXiv preprint arXiv:1803.05457 (2018). 26, 29,

31

[333] S. Aroca-Ouellette, C. Paik, A. Roncone, K. Kann, Prost: Phys-

ical reasoning of objects through space and time, arXiv preprint

arXiv:2106.03634 (2021). 29

[334] T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor con-

duct electricity? a new dataset for open book question answering, arXiv

preprint arXiv:1809.02789 (2018). 29

[335] T. C. Ferreira, C. Gardent, N. Ilinykh, C. Van Der Lee, S. Mille,

D. Moussallem, A. Shimorina, The 2020 bilingual, bi-directional

webnlg+ shared task overview and evaluation results (webnlg+ 2020),

in: Proceedings of the 3rd International Workshop on Natural Language

Generation from the Semantic Web (WebNLG+), 2020. 29

[336] C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, F. Wei, Blow the dog whistle:

A chinese dataset for cant understanding with common sense and world

knowledge, arXiv preprint arXiv:2104.02704 (2021). 29

[337] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale

reading comprehension dataset from examinations, arXiv preprint

arXiv:1704.04683 (2017). 26, 29

[338] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang,

L. Zettlemoyer, Quac: Question answering in context, arXiv preprint

arXiv:1808.07036 (2018). 27, 29

[339] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did aristo-

tle use a laptop? a question answering benchmark with implicit reason-

ing strategies, Transactions of the Association for Computational Lin-

guistics 9 (2021) 346–361. 29

[340] J. Boyd-Graber, B. Satinoff, H. He, H. Daumé III, Besting the quiz mas-

ter: Crowdsourcing incremental classification games, in: Proceedings of

the 2012 joint conference on empirical methods in natural language pro-

cessing and computational natural language learning, 2012, pp. 1290–

1301. 29

[341] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical

question answer matching using end-to-end character-level multi-scale

cnns, Applied Sciences 7 (8) (2017) 767. 29

[342] S. Zhang, X. Zhang, H. Wang, L. Guo, S. Liu, Multi-scale attentive in-

teraction networks for chinese medical question answer selection, IEEE

Access 6 (2018) 74061–74071. 29

[343] C. Xu, J. Pei, H. Wu, Y. Liu, C. Li, Matinf: A jointly labeled large-scale

dataset for classification, question answering and summarization, arXiv

preprint arXiv:2004.12302 (2020). 29

[344] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An

adversarial winograd schema challenge at scale, Communications of the

ACM 64 (9) (2021) 99–106. 25, 29

[345] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a

machine really finish your sentence?, arXiv preprint arXiv:1905.07830

(2019). 27, 29

[346] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alter-

natives: An evaluation of commonsense causal reasoning., in: AAAI

spring symposium: logical formalizations of commonsense reasoning,

2011, pp. 90–95. 29

[347] H. Levesque, E. Davis, L. Morgenstern, The winograd schema chal-

lenge, in: Thirteenth international conference on the principles of knowl-

edge representation and reasoning, 2012. 25, 27, 29

[348] A. Talmor, J. Herzig, N. Lourie, J. Berant, Commonsenseqa: A question

answering challenge targeting commonsense

knowledge, arXiv preprint

arXiv:1811.00937 (2018). 27, 29

[349] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa:

Commonsense reasoning about social interactions, arXiv preprint

arXiv:1904.09728 (2019). 29

[350] K. Sun, D. Yu, D. Yu, C. Cardie, Investigating prior knowledge for chal-

lenging chinese machine reading comprehension, Transactions of the

Association for Computational Linguistics 8 (2020) 141–155. 29

[351] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, B. Van Durme, Record: Bridg-

ing the gap between human and machine commonsense reading compre-

hension, arXiv preprint arXiv:1810.12885 (2018). 29

[352] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions

for machine comprehension of text, arXiv preprint arXiv:1606.05250

(2016). 28, 29

[353] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins,

K. Toutanova, Boolq: Exploring the surprising difficulty of natural

yes/no questions, arXiv preprint arXiv:1905.10044 (2019). 28, 29

[354] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswer-

able questions for squad, arXiv preprint arXiv:1806.03822 (2018). 28,

29

[355] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, Drop:

A reading comprehension benchmark requiring discrete reasoning over

paragraphs, arXiv preprint arXiv:1903.00161 (2019). 28, 29

[356] I. Dagan, O. Glickman, B. Magnini, The pascal recognising textual en-

tailment challenge, in: Machine learning challenges workshop, Springer,

2005, pp. 177–190. 28, 29

[357] Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, Y. Bisk, Webqa: Mul-

tihop and multimodal qa, in: Proceedings of the IEEE/CVF Conference

on Computer Vision and Pattern Recognition, 2022, pp. 16495–16504.

28, 29

[358] Y. Cui, T. Liu, Z. Chen, W. Ma, S. Wang, G. Hu, Dataset for the first

evaluation on chinese machine reading comprehension, arXiv preprint

arXiv:1709.08299 (2017). 29

[359] Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, G. Hu,

A span-extraction dataset for chinese machine reading comprehension,

arXiv preprint arXiv:1810.07366 (2018). 28, 29

[360] Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, G. Hu,

A sentence cloze dataset for chinese machine reading comprehension,

arXiv preprint arXiv:2004.03116 (2020). 29

[361] Y. Li, T. Liu, D. Li, Q. Li, J. Shi, Y. Wang, Character-based bilstm-crf

incorporating pos and dictionaries for chinese opinion target extraction,

in: Asian Conference on Machine Learning, PMLR, 2018, pp. 518–533.

29

[362] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, D. Roth, Look-

ing beyond the surface: A challenge set for reading comprehension

over multiple sentences, in: Proceedings of the 2018 Conference of the

North American Chapter of the Association for Computational Linguis-

tics: Human Language Technologies, Volume 1 (Long Papers), 2018,

pp. 252–262. 29

[363] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Al-

berti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., Natural ques-

tions: a benchmark for question answering research, Transactions of the

Association for Computational Linguistics 7 (2019) 453–466. 29

[364] C. C. Shao, T. Liu, Y. Lai, Y. Tseng, S. Tsai, Drcd: A chinese ma-

chine reading comprehension dataset, arXiv preprint arXiv:1806.00920

(2018). 29

[365] W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu,

Q. She, et al., Dureader: a chinese machine reading comprehension

dataset from real-world applications, arXiv preprint arXiv:1711.05073

(2017). 29

[366] H. Tang, J. Liu, H. Li, Y. Hong, H. Wu, H. Wang, Dureaderrobust: A

chinese dataset towards evaluating the robustness of machine reading

comprehension models, arXiv preprint arXiv:2004.11142 (2020). 29

[367] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science

questions, arXiv preprint arXiv:1707.06209 (2017). 29

[368] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc

ranking with kernel pooling, in: Proceedings of the 40th International

ACM SIGIR conference on research and development in information

retrieval, 2017, pp. 55–64. 29

[369] A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, R. Morante,

Qa4mre 2011-2013: Overview of question answering for machine read-

ing evaluation, in: Information Access Evaluation. Multilinguality, Mul-

timodality, and Visualization: 4th International Conference of the CLEF

Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Pro-

ceedings 4, Springer, 2013, pp. 303–320. 29

[370] S. Lim, M. Kim, J. Lee, Korquad1. 0: Korean qa dataset for machine

reading comprehension, arXiv preprint arXiv:1909.07005 (2019). 29

[371] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han,

Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judg-

ment prediction, arXiv preprint arXiv:1807.02478 (2018). 29

[372] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo,

C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge

competence with apps, arXiv preprint arXiv:2105.09938 (2021). 28, 29

[373] Y. Wang, X. Liu, S. Shi, Deep neural solver for math word problems,

in: Proceedings of the 2017 conference on empirical methods in natural

language processing, 2017, pp. 845–854. 28, 29

[374] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,

M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers

to solve math word problems, arXiv preprint arXiv:2110.14168 (2021).

29

[375] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan,

E. Jiang, C. J. Cai, M. Terry, Q. V. Le, C. Sutton, Program synthesis with

large language models, CoRR abs/2108.07732 (2021). 29

[376] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W.

Chung, Y. Tay, S. Ruder, D. Zhou, et al., Language models are mul-

tilingual chain-of-thought reasoners, arXiv preprint arXiv:2210.03057

(2022). 29

[377] S. Roy, D. Roth, Solving general arithmetic word problems, arXiv

preprint arXiv:1608.01413 (2016). 29

[378] S.-Y. Miao, C.-C. Liang, K.-Y. Su, A diverse corpus for evaluating

and developing english math word problem solvers, arXiv preprint

arXiv:2106.15772 (2021). 29

[379] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, H. Hajishirzi,

Mawps: A math word problem repository, in: Proceedings of the 2016

conference of the north american chapter of the association for computa-

tional linguistics: human language technologies, 2016, pp. 1152–1157.

29

[380] A. Patel, S. Bhattamishra, N. Goyal, Are nlp models really able to solve

simple math word problems?, arXiv preprint arXiv:2103.07191 (2021).

29

[381] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih,

D. Fried, S. Wang, T. Yu, Ds-1000: A natural and reliable benchmark for

data science code generation, in: International Conference on Machine

Learning, PMLR, 2023, pp. 18319–18345. 29

[382] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan,

E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large

language models, arXiv preprint arXiv:2108.07732 (2021). 29

[383] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adver-

sarial nli: A new benchmark for natural language understanding, arXiv

preprint arXiv:1910.14599 (2019). 29

[384] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge

corpus for sentence understanding through inference, arXiv preprint

arXiv:1704.05426 (2017). 29

[385] R. T. McCoy, E. Pavlick, T. Linzen, Right for the wrong reasons: Diag-

nosing syntactic heuristics in natural language inference, arXiv preprint

arXiv:1902.01007 (2019). 29

[386] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, Y. Zhang, Logiqa: A chal-

lenge dataset for machine reading comprehension with logical reason-

ing, arXiv preprint arXiv:2007.08124 (2020). 29

[387] P. Lewis, B. Oğuz, R. Rinott, S. Riedel, H. Schwenk, Mlqa: Eval-

uating cross-lingual

extractive question answering, arXiv preprint

arXiv:1910.07475 (2019). 29

[388] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman,

H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence rep-

resentations, arXiv preprint arXiv:1809.05053 (2018). 29

[389] Y. Yang, Y. Zhang, C. Tar, J. Baldridge, Paws-x: A cross-

lingual adversarial dataset for paraphrase identification, arXiv preprint

arXiv:1908.11828 (2019). 29

[390] S. Narayan, S. B. Cohen, M. Lapata, Don’t give me the details, just the

summary! topic-aware convolutional neural networks for extreme

summarization, arXiv preprint arXiv:1808.08745 (2018). 29

[391] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen,

Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv

preprint arXiv:2005.00333 (2020). 27, 29

[392] A. Tikhonov, M. Ryabinin, It’s all in the heads: Using attention heads

as a baseline for cross-lingual transfer in commonsense reasoning, arXiv

preprint arXiv:2106.12066 (2021). 29

[393] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Niko-

laev, J. Palomaki, Tydi qa: A benchmark for information-seeking ques-

tion answering in typologically diverse languages, Transactions of the

Association for Computational Linguistics 8 (2020) 454–470. 29

[394] T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano,

Mlsum: The multilingual summarization corpus, arXiv preprint

arXiv:2004.14900 (2020). 29

[395] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic

human falsehoods, arXiv preprint arXiv:2109.07958 (2021). 29

[396] I. Augenstein, C. Lioma, D. Wang, L. C. Lima, C. Hansen,

C. Hansen, J. G. Simonsen, Multifc: A real-world multi-domain

dataset for evidence-based fact checking of claims, arXiv preprint

arXiv:1909.03242 (2019). 29

[397] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a

large-scale dataset for fact extraction and verification, arXiv preprint

arXiv:1803.05355 (2018). 29

[398] I. Mollas, Z. Chrysopoulou, S. Karlos, G. Tsoumakas, Ethos: an online

hate speech detection dataset, arXiv preprint arXiv:2006.08328 (2020).

29, 31

[399] M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical

bias in pretrained language models, arXiv preprint arXiv:2004.09456

(2020). 29, 31

[400] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thomp-

son, P. M. Htut, S. R. Bowman, Bbq: A hand-built bias benchmark for

question answering, arXiv preprint arXiv:2110.08193 (2021). 29

[401] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias

in coreference resolution: Evaluation and debiasing methods, arXiv

preprint arXiv:1804.06876 (2018). 29

[402] N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A chal-

lenge dataset for measuring social biases in masked language models,

arXiv preprint arXiv:2010.00133 (2020). 29

[403] S. Gehman, S. Gururangan, M. Sap, Y. Choi, N. A. Smith, Realtoxic-

ityprompts: Evaluating neural toxic degeneration in language models,

arXiv preprint arXiv:2009.11462 (2020). 29

[404] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced

metrics for measuring unintended bias with real data for text classifica-

tion, in: Companion proceedings of the 2019 world wide web confer-

ence, 2019, pp. 491–500. 29

[405] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow,

M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al., Find-

ings of the 2016 conference on machine translation, in: Proceedings of

the First Conference on Machine Translation: Volume 2, Shared Task

Papers, 2016, pp. 131–198. 29

[406] B. Loïc, B. Magdalena, B. Ondřej, F. Christian, G. Yvette, G. Ro-

man, H. Barry, H. Matthias, J. Eric, K. Tom, et al., Findings of the

2020 conference on machine translation (wmt20), in: Proceedings of

the Fifth Conference on Machine Translation, Association for Computational

Linguistics, 2020, pp. 1–55. 29

[407] W. Li, F. Qi, M. Sun, X. Yi, J. Zhang, Ccpm: A chinese classical poetry

matching dataset, arXiv preprint arXiv:2106.01979 (2021). 29

[408] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, J. Weston, Wizard of

wikipedia: Knowledge-powered conversational agents, arXiv preprint

arXiv:1811.01241 (2018). 29

[409] H. Rashkin, E. M. Smith, M. Li, Y.-L. Boureau, Towards empathetic

open-domain conversation models: A new benchmark and dataset, arXiv

preprint arXiv:1811.00207 (2018). 29

[410] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek,

D. Kiela, A. Szlam, I. Serban, R. Lowe, et al., The second conversa-

tional intelligence challenge (convai2), in: The NeurIPS’18 Competi-

tion: From Machine Learning to Intelligent Conversations, Springer,

2020, pp. 187–208. 29

[411] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, Kdconv: A chinese

multi-domain dialogue dataset towards multi-turn knowledge-driven

conversation, arXiv preprint arXiv:2004.04100 (2020). 29

[412] L. CO, Iflytek: a multiple categories chinese text classifier. competition

official website (2019). 29

[413] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The

pushshift reddit dataset, in: Proceedings of the international AAAI con-

ference on web and social media, Vol. 14, 2020, pp. 830–839. 30

[414] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, M. Auli, Eli5: Long

form question answering, arXiv preprint arXiv:1907.09190 (2019). 31

[415] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei,

A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al.,

Benchmarking generalization via in-context instructions on 1,600+ lan-

guage tasks, arXiv preprint arXiv:2204.07705 (2022). 31

[416] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu,

M. Zhong, P. Yin, S. I. Wang, et al., Unifiedskg: Unifying and multi-

tasking structured knowledge grounding with text-to-text language mod-

els, arXiv preprint arXiv:2201.05966 (2022). 31

[417] Q. Ye, B. Y. Lin, X. Ren, Crossfit: A few-shot learning challenge

for cross-task generalization in nlp, arXiv preprint arXiv:2104.08835

(2021). 31

[418] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta,

H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, et al., Ext5: Towards extreme

multi-task scaling for transfer learning, arXiv preprint arXiv:2111.10952

(2021). 31

[419] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge cor-

pus for sentence understanding through inference, in: Proceedings of

the 2018 Conference of the North American Chapter of the Associ-

ation for Computational Linguistics: Human Language Technologies,

Volume 1 (Long Papers), Association for Computational Linguistics,

New Orleans, Louisiana, 2018, pp. 1112–1122. doi:10.18653/v1/N18-1101.

URL https://aclanthology.org/N18-1101 29

[420] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase adversaries from word

scrambling, in: Proceedings of the 2019 Conference of the North Amer-

ican Chapter of the Association for Computational Linguistics: Human

Language Technologies, Volume 1 (Long and Short Papers), Associa-

tion for Computational Linguistics, Minneapolis, Minnesota, 2019, pp.

1298–1308. doi:10.18653/v1/N19-1131.

URL https://aclanthology.org/N19-1131 29

[421] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chat-

GPT a general-purpose natural language processing task solver?, in: The

2023 Conference on Empirical Methods in Natural Language Process-

ing, 2023.

URL https://openreview.net/forum?id=u03xn1COsO 31

[422] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh,

N. Akhtar, J. Wu, S. Mirjalili, et al., Large language models: a com-

prehensive survey of its applications, challenges, limitations, and future

prospects, TechRxiv (2023). 31

[423] X. L. Dong, S. Moon, Y. E. Xu, K. Malik, Z. Yu, Towards next-

generation intelligent assistants leveraging llm techniques, in: Proceed-

ings of the 29th ACM SIGKDD Conference on Knowledge Discovery

and Data Mining, 2023, pp. 5792–5793. 31

[424] K. Pandya, M. Holia, Automating customer service using langchain:

Building custom open-source gpt

chatbot for organizations, arXiv

preprint arXiv:2310.05421 (2023). 31

[425] J. Li, B. Hui, G. Qu, B. Li, J. Yang, B. Li, B. Wang, B. Qin, R. Cao,

R. Geng, et al., Can llm already serve as a database interface? a

big bench for large-scale database grounded text-to-sqls, arXiv preprint

arXiv:2305.03111 (2023). 31

[426] A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, M. D. Succi, Evaluating

chatgpt as an adjunct for radiologic decision-making, medRxiv (2023)

2023–02. 31

[427] M. Benary, X. D. Wang, M. Schmidt, D. Soll, G. Hilfenhaus, M. Nas-

sir, C. Sigler, M. Knödler, U. Keller, D. Beule, et al., Leveraging large

language models for decision support in personalized oncology, JAMA

Network Open 6 (11) (2023) e2343689–e2343689. 31

[428] C. M. Chiesa-Estomba, J. R. Lechien, L. A. Vaira, A. Brunet, G. Cam-

maroto, M. Mayo-Yanez, A. Sanchez-Barrueco, C. Saga-Gutierrez, Ex-

ploring the potential of chat-gpt as a supportive tool for sialendoscopy

clinical decision making and patient information support, European

Archives of Oto-Rhino-Laryngology (2023) 1–6. 31

[429] S. Montagna, S. Ferretti, L. C. Klopfenstein, A. Florio, M. F. Pengo,

Data decentralisation of llm-based chatbot systems in chronic disease

self-management, in: Proceedings of the 2023 ACM Conference on In-

formation Technology for Social Good, 2023, pp. 205–212. 31

[430] D. Bill, T. Eriksson, Fine-tuning a llm using reinforcement learning from

human feedback for a therapy chatbot application (2023). 31

[431] M. Abbasian, I. Azimi, A. M. Rahmani, R. Jain, Conversational health

agents: A personalized llm-powered agent framework, arXiv preprint

arXiv:2310.02374 (2023). 31

[432] K. V. Lemley, Does chatgpt help us understand the medical literature?,

Journal of the American Society of Nephrology (2023) 10–1681. 31

[433] S. Pal, M. Bhattacharya, S.-S. Lee, C. Chakraborty, A domain-specific

next-generation large language model (llm) or chatgpt is required for

biomedical engineering and research, Annals of Biomedical Engineering

(2023) 1–4. 31

[434] Y. Du, S. Zhao, Y. Chen, R. Bai, J. Liu, H. Wu, H. Wang, B. Qin, The

calla dataset: Probing llms’ interactive knowledge acquisition from chi-

nese medical literature, arXiv preprint arXiv:2309.04198 (2023). 31

[435] A. Abd-Alrazaq, R. AlSaad, D. Alhuwail, A. Ahmed, P. M. Healy,

S. Latifi, S. Aziz, R. Damseh, S. A. Alrazak, J. Sheikh, et al., Large

language models in medical education: Opportunities, challenges, and

future directions, JMIR Medical Education 9 (1) (2023) e48291. 31

[436] A. B. Mbakwe, I. Lourentzou, L. A. Celi, O. J. Mechanic, A. Dagan,

Chatgpt passing usmle shines a spotlight on the flaws of medical educa-

tion (2023). 31

[437] S. Ahn, The impending impacts of large language models on medical

education, Korean Journal of Medical Education 35 (1) (2023) 103. 31

[438] E. Waisberg, J. Ong, M. Masalkhi, A. G. Lee, Large language model

(llm)-driven chatbots for neuro-ophthalmic medical education, Eye

(2023) 1–3. 31

[439] G. Deiana, M. Dettori, A. Arghittu, A. Azara, G. Gabutti, P. Castiglia,

Artificial intelligence and public health: Evaluating chatgpt responses to

vaccination myths and misconceptions, Vaccines 11 (7) (2023) 1217. 31

[440] L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E.

Tozzi, C. Rizzo, Chatgpt and the rise of large language models: the new

ai-driven infodemic threat in public health, Frontiers in Public Health 11

(2023) 1166120. 31

[441] N. L. Rane, A. Tawde, S. P. Choudhary, J. Rane, Contribution and per-

formance of chatgpt and other large language models (llm) for scientific

and research advancements: a double-edged sword, International Re-

search Journal of Modernization in Engineering Technology and Science

5 (10) (2023) 875–899. 31, 32

[442] W. Dai, J. Lin, H. Jin, T. Li, Y.-S. Tsai, D. Gašević, G. Chen, Can large

language models provide feedback to students? a case study on chatgpt,

in: 2023 IEEE International Conference on Advanced Learning Tech-

nologies (ICALT), IEEE, 2023, pp. 323–325. 32

[443] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva,

F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al.,

Chatgpt for good? on opportunities and challenges of large language

models for education, Learning and individual differences 103 (2023)

102274. 32

[444] N. Rane, Enhancing the quality of teaching and learning through chat-

gpt and similar large language models: Challenges, future prospects,

and ethical considerations in education, Future Prospects, and Ethical

Considerations in Education (September 15, 2023) (2023). 32

[445] J. C. Young, M. Shishido, Investigating openai’s chatgpt potentials in

generating chatbot’s dialogue for english as a foreign language learning,

International Journal of Advanced Computer Science and Applications

14 (6) (2023). 32

[446] J. Irons, C. Mason, P. Cooper, S. Sidra, A. Reeson, C. Paris, Exploring

the impacts of chatgpt on future scientific work, SocArXiv (2023). 32

[447] P. G. Schmidt, A. J. Meir, Using generative ai for literature searches and

scholarly writing: Is the integrity of the scientific discourse in jeopardy?,

arXiv preprint arXiv:2311.06981 (2023). 32

[448] Y. Zheng, H. Y. Koh, J. Ju, A. T. Nguyen, L. T. May, G. I. Webb, S. Pan,

Large language models for scientific synthesis, inference and explana-

tion, arXiv preprint arXiv:2310.07984 (2023). 32

[449] B. Aczel, E.-J. Wagenmakers, Transparency guidance for chatgpt usage

in scientific writing, PsyArXiv (2023). 32

[450] S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in sci-

entific writing: a friend or a foe?, Reproductive BioMedicine Online

(2023). 32

[451] S. Imani, L. Du, H. Shrivastava, Mathprompter: Mathematical reasoning

using large language models, arXiv preprint arXiv:2303.05398 (2023).

32

[452] Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, C. Zhou, Scaling relationship

on learning mathematical reasoning with large language models, arXiv

preprint arXiv:2308.01825 (2023). 32

[453] K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil,

R. Prenger, A. Anandkumar, Leandojo: Theorem proving with retrieval-

augmented language models, arXiv preprint arXiv:2306.15626 (2023).

32

[454] K. M. Collins, A. Q. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt,

T. Lukasiewicz, Y. Wu, J. B. Tenenbaum, W. Hart, et al., Evaluating

language models for mathematics through interactions, arXiv preprint

arXiv:2306.01694 (2023). 32

[455] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He,

Z. Liu, et al., Summary of chatgpt-related research and perspective

towards the future of large language models, Meta-Radiology (2023)

100017. 32

[456] J. Drápal, H. Westermann, J. Savelka, Using large language models

to support thematic analysis in empirical legal studies, arXiv preprint

arXiv:2310.18729 (2023). 32

[457] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, H. Xu, Explain-

ing legal concepts with augmented large language models (gpt-4), arXiv

preprint arXiv:2306.09525 (2023). 32

[458] N. Guha, J. Nyarko, D. E. Ho, C. Ré, A. Chilton, A. Narayana,

A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, et al., Legal-

bench: A collaboratively built benchmark for measuring legal reasoning

in large language models, arXiv preprint arXiv:2308.11462 (2023). 32

[459] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal

large language model with integrated external knowledge bases, arXiv

preprint arXiv:2306.16092 (2023). 32

[460] H. Yang, X.-Y. Liu, C. D. Wang, Fingpt: Open-source financial large

language models, arXiv

preprint arXiv:2306.06031 (2023). 32

[461] Y. Li, S. Wang, H. Ding, H. Chen, Large language models in finance: A

survey, in: Proceedings of the Fourth ACM International Conference on

AI in Finance, 2023, pp. 374–382. 33

[462] A. Lykov, D. Tsetserukou, Llm-brain: Ai-driven fast generation of

robot behaviour tree based on large language model, arXiv preprint

arXiv:2305.19352 (2023). 33

[463] E. Billing, J. Rosén, M. Lamb, Language models for human-robot inter-

action, in: ACM/IEEE International Conference on Human-Robot Inter-

action, March 13–16, 2023, Stockholm, Sweden, ACM Digital Library,

2023, pp. 905–906. 33

[464] Y. Ye, H. You, J. Du, Improved trust in human-robot collaboration with

chatgpt, IEEE Access (2023). 33

[465] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Leveraging commonsense

knowledge from large language models for task and motion planning,

in: RSS 2023 Workshop on Learning for Task and Motion Planning,

2023. 33

[466] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg,

S. Rusinkiewicz, T. Funkhouser, Tidybot: Personalized robot assistance

with large language models, arXiv preprint arXiv:2305.05658 (2023).

33

[467] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations

for deep learning in nlp, arXiv preprint arXiv:1906.02243 (2019). 33

[468] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dan-

gers of stochastic parrots: Can language models be too big?, in: Pro-

ceedings of the 2021 ACM conference on fairness, accountability, and

transparency, 2021, pp. 610–623. 33

[469] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding

deep learning (still) requires rethinking generalization, Communications

of the ACM 64 (3) (2021) 107–115. 33

[470] M. Tänzer, S. Ruder, M. Rei, Memorisation versus generalisation in pre-

trained language models, arXiv preprint arXiv:2105.00828 (2021). 33

[471] S. M. West, M. Whittaker, K. Crawford, Discriminating systems, AI

Now (2019) 1–33. 33

[472] K. Valmeekam, A. Olmo, S. Sreedharan, S. Kambhampati, Large lan-

guage models still can’t plan (a benchmark for llms on planning and

reasoning about change), arXiv preprint arXiv:2206.10498 (2022). 33

[473] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao,

Y. Zhang, Y. Chen, et al., Siren’s song in the ai ocean: A survey on hal-

lucination in large language models, arXiv preprint arXiv:2309.01219

(2023). 33

[474] A. Webson, E. Pavlick, Do prompt-based models really understand the

meaning of their prompts?, arXiv preprint arXiv:2109.01247 (2021). 33

[475] O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second

thought, let’s not think step by step! bias and toxicity in zero-shot rea-

soning, arXiv preprint arXiv:2212.08061 (2022). 33

[476] X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, J. Gao, Adversar-

ial training for large neural language models, ArXiv (April 2020).

URL https://www.microsoft.com/en-us/research/publication/adversarial-training-for-large-neural-language-models/

34

[477] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, N. Abu-

Ghazaleh, Survey of vulnerabilities in large language models revealed

by adversarial attacks (2023). arXiv:2310.10844. 34

[478] X. Xu, K. Kong, N. Liu, L. Cui, D. Wang, J. Zhang, M. Kankanhalli, An

llm can fool itself: A prompt-based adversarial attack (2023).

arXiv:2310.13345. 34

[479] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin,

M. Du, Explainability for large language models: A survey (2023).

arXiv:2309.01029. 34

[480] S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, L. H. Gilpin, Can large

language models explain themselves? a study of llm-generated self-

explanations (2023). arXiv:2310.11207. 34

[481] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, F. Tramèr, What does it

mean for a language model to preserve privacy?, in: Proceedings of the

2022 ACM Conference on Fairness, Accountability, and Transparency,

2022, pp. 2280–2292. 34

[482] R. Plant, V. Giuffrida, D. Gkatzia, You are what you write: Pre-

serving privacy in the era of large language models, arXiv preprint

arXiv:2204.09391 (2022). 34

[483] W. Niu, Z. Kong, G. Yuan, W. Jiang, J. Guan, C. Ding, P. Zhao, S. Liu,

B. Ren, Y. Wang, Real-time execution of large-scale language models

on mobile (2020). arXiv:2009.06823. 34

[484] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo,

Y. Zhu, Olive: Accelerating large language models via hardware-

friendly outlier-victim pair quantization, in: Proceedings of the 50th

Annual International Symposium on Computer Architecture, 2023, pp.

1–15. 34

[485] B. Meskó, E. J. Topol, The imperative for regulatory oversight of large

language models (or generative ai) in healthcare, npj Digital Medicine

6 (1) (2023) 120. 34

[486] J. Zhang, X. Ji, Z. Zhao, X. Hei, K.-K. R. Choo, Ethical considerations

and policy implications for large language models: Guiding responsible

development and deployment, arXiv preprint arXiv:2308.02678 (2023).

34

[487] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language

models: a three-layered approach, AI and Ethics (2023) 1–31. 34


if the

model fulfills three criteria of helpful, honest, and harmless or

“HHH” [99].

Researchers employ reinforcement learning with human feedback (RLHF) [100] for model alignment. In RLHF, a model fine-tuned on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), as shown in Figure 6. Below we briefly discuss the RM and RL pipelines in RLHF.

Reward modeling: trains a model to rank generated responses according to human preferences using a classification objective. To train the classifier, humans annotate LLM-generated responses based on the HHH criteria.
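As an illustration of the classification objective mentioned above, a reward model is often trained on pairs of responses with a pairwise logistic loss. The sketch below assumes that formulation; the function name and the scalar-score interface are hypothetical and are not taken from [100].

```python
import math


def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Loss for one annotated pair: -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one. (Assumed formulation.)
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A correctly ordered pair is penalized less than a wrongly ordered one.
good = pairwise_reward_loss(2.0, 0.0)
bad = pairwise_reward_loss(0.0, 2.0)
```

Averaging this loss over all annotated pairs gives a training signal that only depends on score differences, so the absolute scale of the reward is left free.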

Reinforcement learning: used in combination with the reward model for alignment in the next stage. The previously trained reward model ranks LLM-generated responses into preferred vs. non-preferred, and this signal is used to align the model with proximal policy optimization (PPO). This process repeats iteratively until convergence.

2.12.3. Prompting/Utilization

Prompting is a method to query trained LLMs for generating

responses, as illustrated in Figure 6. LLMs can be prompted in

various prompt setups, where they can be adapted to the instruc-

tions without fine-tuning and in other cases with fine-tuning on

data containing different prompt styles [16, 101, 102]. A good

guide on prompt engineering is available at [32]. Below, we

will discuss various widely used prompt setups.

Zero-Shot Prompting: LLMs are zero-shot learners and ca-

pable of answering queries never seen before. This style of

prompting requires LLMs to answer user questions without see-

ing any examples in the prompt.

In-context Learning: Also known as few-shot learning. Here, multiple input-output demonstration pairs are shown to the model to generate the desired response. A discussion on formatting in-context learning (ICL) templates is available in [54, 50, 18, 16].
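The difference between zero-shot prompting and in-context learning can be sketched as plain string construction. The "Input:"/"Output:" template and the function name below are illustrative assumptions; the cited works discuss many other template formats.

```python
def build_prompt(query, demonstrations=(), instruction=""):
    """Build a prompt string.

    With no demonstrations this is a zero-shot prompt; with k
    (input, output) pairs it becomes a k-shot in-context prompt.
    """
    parts = [instruction] if instruction else []
    for demo_input, demo_output in demonstrations:
        parts.append(f"Input: {demo_input}\nOutput: {demo_output}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


zero_shot = build_prompt("cheese", instruction="Translate English to French.")
few_shot = build_prompt(
    "cheese",
    demonstrations=[("sea otter", "loutre de mer"), ("mint", "menthe")],
    instruction="Translate English to French.",
)
```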

Reasoning in LLMs: LLMs are zero-shot reasoners and can be prompted to generate answers to logical problems, task planning, critical thinking, etc. with reasoning. Such reasoning can be elicited purely through different prompting styles, whereas to improve LLMs further on reasoning tasks many methods [16, 97] train them on reasoning datasets. We discuss various prompting techniques for reasoning below.

Chain-of-Thought (CoT): A special case of prompting where

demonstrations contain reasoning information aggregated with

inputs and outputs so that the model generates outcomes with

step-by-step reasoning. More details on CoT prompts are avail-

able in [55, 103, 101].

Self-Consistency: Improves CoT performance by generat-

ing multiple responses and selecting the most frequent an-

swer [104].
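The selection step of self-consistency amounts to a majority vote over the final answers of several sampled reasoning chains. A minimal sketch, assuming the final answers have already been extracted from each sampled response (the function name is hypothetical):

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its share of the votes."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Five sampled chains that ended in three different final answers.
answer, agreement = self_consistent_answer(["18", "18", "26", "18", "17"])
```

The agreement ratio can also serve as a rough confidence signal, although that use is an assumption of this sketch rather than something stated in the text.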

Tree-of-Thought (ToT): Explores multiple reasoning paths

with possibilities to look ahead and backtrack for problem-

solving [105].

Single-Turn Instructions: In this prompting setup, LLMs are

queried only once with all the relevant information in the

prompt. LLMs generate responses by understanding the con-

text either in a zero-shot or few-shot setting.

Multi-Turn Instructions: Solving a complex task requires mul-

tiple interactions with LLMs, where feedback and responses

from the other tools are given as input to the LLM for the next

rounds. This style of using LLMs in the loop is common in

autonomous agents.

3. Large Language Models

This section reviews LLMs, briefly describing their architec-

tures, training objectives, pipelines, datasets, and fine-tuning

details.

3.1. Pre-Trained LLMs

Here, we provide summaries of various well-known pre-trained LLMs with significant discoveries that changed the course of research and development in NLP. These LLMs have considerably improved performance in the NLU and NLG domains and are widely fine-tuned for downstream tasks. Moreover, we identify key findings and insights of pre-trained LLMs in Tables 1 and 2 that improve their performance.

3.1.1. General Purpose

T5 [10]: An encoder-decoder model employing a unified text-

to-text training for all NLP problems is shown in Figure 7. T5

places layer normalization outside the residual path in a conven-

tional transformer model [64]. It uses masked language mod-

eling as a pre-training objective where spans (consecutive to-

kens) are replaced with a single mask instead of separate masks

for each token. This type of masking speeds up the training as

it produces shorter sequences. After pre-training, the model is

fine-tuned using adapter layers [106] for downstream tasks.
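The span-masking objective described above can be sketched as follows. Each masked span is replaced by one sentinel token in the input, and the target lists each sentinel followed by the tokens it hides. The sentinel naming `<extra_id_i>` and the choice to pass spans explicitly (instead of sampling them) are assumptions made for illustration.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a single sentinel token.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    Returns (corrupted_input, target).
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one mask for the whole span
        target.append(sentinel)
        target.extend(tokens[start:end])        # the hidden tokens
        cursor = end
    corrupted.extend(tokens[cursor:])
    return corrupted, target


words = "Thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(words, [(2, 4), (8, 9)])
```

Because a multi-token span collapses into one sentinel, the corrupted input is shorter than the original sequence, which is the source of the training speed-up mentioned in the text.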

GPT-3 [6]: The GPT-3 architecture is the same as the GPT-

2 [5] but with dense and sparse attention in transformer layers

similar to the Sparse Transformer [67]. It shows that large models can train on larger batch sizes with a lower learning rate. To decide the batch size during training, GPT-3 uses the gradient noise scale as in [107]. Overall, GPT-3 increases model parameters to 175B, showing that the performance of large language

Figure 7: Unified text-to-text training example, source image from [10].

Figure 8: An example of the PanGu-α architecture, image from [108].

models improves with the scale and is competitive with the fine-

tuned models.

mT5 [11]: A multilingual T5 model [10] trained on the mC4

dataset with 101 languages. The dataset is extracted from the

public common crawl scrape. The model uses a larger vocab-

ulary size of 250,000 to cover multiple languages. To avoid

over-fitting or under-fitting for a language, mT5 employs a data

sampling procedure to select samples from all languages. The

paper suggests using a small amount of pre-training datasets,

including all languages when fine-tuning for a task using En-

glish language data. This allows the model to generate correct

non-English outputs.
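One common way to balance languages during sampling is to raise each language's share of the data to a power below one, which boosts low-resource languages without sampling them uniformly. The sketch below assumes this exponent-based scheme with a default of 0.3; both the form and the value are assumptions of this example and are not given in the text above.

```python
def language_sampling_probs(example_counts, alpha=0.3):
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw data proportions; smaller values
    flatten the distribution toward low-resource languages.
    """
    weights = {lang: count ** alpha for lang, count in example_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


counts = {"en": 1_000_000, "sw": 1_000}  # hypothetical corpus sizes
probs = language_sampling_probs(counts)
```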

PanGu-α [108]: An autoregressive model that has a query

layer at the end of standard transformer layers, example shown

in Figure 8, to predict the next token. Its structure is similar to

the transformer layer but with an additional embedding for the

next position in the attention mechanism, given in Eq. 3.

$a = p_n W_h^{q} (W_h^{k})^{\top} H_L^{\top}$ (3)

CPM-2 [12]: Cost-efficient Pre-trained language Models

(CPM-2) pre-trains bilingual (English and Chinese) 11B and

198B mixture-of-experts (MoE) models on the WuDaoCor-

pus [109] dataset. The tokenization process removes “_” white

space tokens in the sentencepiece tokenizer. The models are

trained with knowledge inheritance, starting with only the Chi-

nese language in the first stage and then adding English and

Chinese data. This trained model gets duplicated multiple times

to initialize the 198B MoE model. Moreover, to use the model

for downstream tasks, CPM-2 experimented with both com-

plete fine-tuning and prompt fine-tuning as in [40] where only

prompt-related parameters are updated by inserting prompts at

various positions, front, middle, and back. CPM-2 also pro-

poses the INFMOE, a memory-efficient framework with a strat-

egy to dynamically offload parameters to the CPU for inference

at a 100B scale. It overlaps data movement with inference com-

putation for lower inference time.

ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi-task learning to build a modular architecture using Transformer-XL [111] as the backbone. The universal representation module is shared by all the tasks and serves as the basic block for the task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primarily focused on the Chinese language. It claims to train on the largest Chinese text corpora for LLM training, and achieved state-of-the-art in 54 Chinese NLP tasks.

Jurassic-1 [112]: A pair of auto-regressive language models, including a 7B-parameter J1-Large model and a 178B-parameter J1-Jumbo model. The training vocabulary of Jurassic-1 comprises word pieces, complete words, and multi-word expressions without any word boundaries, where possible out-of-vocabulary instances are interpreted as Unicode bytes.

Compared to the GPT-3 counterparts, the Jurassic-1 models

apply a more balanced depth-to-width self-attention architec-

ture [113] and an improved tokenizer for a faster prediction

based on broader resources, achieving a comparable perfor-

mance in zero-shot learning tasks and a superior performance in

few-shot learning tasks given the ability to feed more examples

as a prompt.

HyperCLOVA [114]: A Korean language model with GPT-3

architecture.

Yuan 1.0 [115]: Trained on a Chinese corpus with 5TB of

high-quality text collected from the Internet. A Massive Data

Filtering System (MDFS) built on Spark is developed to pro-

cess the raw data via coarse and fine filtering techniques. To

speed up the training of Yuan 1.0 to save energy expenses and

carbon emissions, various factors that improve the performance

of distributed training are incorporated in architecture and train-

ing: like increasing the hidden state size improves pipeline and

tensor parallelism performance, larger micro batches improve

pipeline parallelism performance, and larger global batch size

improve data parallelism performance. In practice, the Yuan 1.0

model performs well on text classification, Winograd Schema,

natural language inference, and reading comprehension tasks.

Gopher [116]: The Gopher family of models ranges from 44M to 280B parameters in size to study the effect of scale on LLM performance. The 280B model beats GPT-3 [6], Jurassic-1 [112], MT-NLG [117], and others on 81% of the evaluated tasks.

ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0

by training a larger model with 26x the number of parameters

of the latter. This bigger model outperformed other state-of-the-

art models in 68 NLP tasks. LLMs produce text with incorrect

facts. In order to have control of the generated text with fac-

tual consistency, ERNIE 3.0 Titan adds another task, Credible

and Controllable Generations, to its multi-task learning setup.


It introduces additional self-supervised adversarial and control-

lable language modeling losses to the pre-training step, which

enables ERNIE 3.0 Titan to beat other LLMs in their manually

selected Factual QA task set evaluations.

GPT-NeoX-20B [118]: An auto-regressive model that largely

follows GPT-3 with a few deviations in architecture design,

trained on the Pile dataset without any data deduplication. GPT-

NeoX has parallel attention and feed-forward layers in a trans-

former block, given in Eq. 4, that increases throughput by 15%.

It uses rotary positional embedding [66], applying it to only

25% of embedding vector dimension as in [119]. This reduces

the computation without performance degradation. As opposed

to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B

uses only dense layers. The hyperparameter tuning at this scale

is difficult; therefore, the model chooses hyperparameters from

the method [6] and interpolates values between 13B and 175B

models for the 20B model. The model training is distributed

among GPUs using both tensor and pipeline parallelism.

x + Attn(LN1(x)) + FF(LN2(x)) (4)
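To make the difference between Eq. 4 and the conventional cascaded block concrete, the sketch below writes both with the sub-layers passed in as plain functions. The toy sub-layers are placeholders chosen only so the two results can be compared by hand; they are not real attention or feed-forward layers.

```python
def cascaded_block(x, attn, ff, ln1, ln2):
    """Conventional block: the feed-forward layer sees the attention output."""
    h = [xi + ai for xi, ai in zip(x, attn(ln1(x)))]
    return [hi + fi for hi, fi in zip(h, ff(ln2(h)))]


def parallel_block(x, attn, ff, ln1, ln2):
    """Eq. 4: both sub-layers read the same input x and are summed once."""
    attn_out = attn(ln1(x))
    ff_out = ff(ln2(x))  # does not depend on attn_out, so it can run concurrently
    return [xi + ai + fi for xi, ai, fi in zip(x, attn_out, ff_out)]


# Toy stand-ins for the sub-layers (hypothetical, for illustration only).
def identity(v):
    return list(v)

def toy_attn(v):
    return [2.0 * t for t in v]

def toy_ff(v):
    return [t + 1.0 for t in v]


x = [1.0, 2.0]
cascaded = cascaded_block(x, toy_attn, toy_ff, identity, identity)
parallel = parallel_block(x, toy_attn, toy_ff, identity, identity)
```

The two formulations are not numerically identical; the throughput gain comes from removing the dependency of the feed-forward layer on the attention output.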

OPT [14]: A clone of GPT-3, developed to open-source a model that replicates GPT-3 performance. Training of OPT employs dynamic loss scaling [120] and restarts from an earlier checkpoint with a lower learning rate whenever loss divergence is observed. Overall, the performance of the OPT-175B model is comparable to that of the GPT-3 175B model.

BLOOM [13]: A causal decoder model trained on the ROOTS corpus to open-source an LLM. The architecture of BLOOM is shown in Figure 9, with differences such as ALiBi positional embeddings and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes1 library. These changes stabilize training and improve downstream performance.

GLaM [91]: Generalist Language Model (GLaM) represents a

family of language models using a sparsely activated decoder-

only mixture-of-experts (MoE) structure [121, 90]. To gain

more model capacity while reducing computation, the experts

are sparsely activated where only the best two experts are used

to process each input token. The largest GLaM model, GLaM

(64B/64E), is about 7× larger than GPT-3 [6], while only part of

the parameters are activated per input token. The largest GLaM

(64B/64E) model achieves better overall results as compared

to GPT-3 while consuming only one-third of GPT-3’s training

energy.
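The sparse activation described above can be sketched as top-2 routing: a gate scores every expert for a token, only the two best experts run, and their outputs are combined. Renormalizing the gate weights over the two selected experts, and the scalar "token", are simplifying assumptions of this sketch.

```python
import math


def top2_route(gate_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for the two
    highest-scoring experts, with weights from a softmax over just those two."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top2 = ranked[:2]
    best = gate_logits[top2[0]]
    exps = [math.exp(gate_logits[i] - best) for i in top2]  # shifted for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


def moe_layer(token, experts, gate_logits):
    """Evaluate only the two selected experts and mix their outputs."""
    return sum(weight * experts[i](token) for i, weight in top2_route(gate_logits))


# Four toy experts; only two of them are evaluated for this token.
experts = [lambda t: t + 1.0, lambda t: t * 2.0, lambda t: t * 4.0, lambda t: -t]
routing = top2_route([0.1, 2.0, 2.0, -1.0])
output = moe_layer(3.0, experts, [0.1, 2.0, 2.0, -1.0])
```

All experts contribute to the parameter count, but per-token compute scales only with the two experts that are actually called.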

MT-NLG [117]: A 530B causal decoder based on the GPT-

2 architecture that has roughly 3× GPT-3 model parameters.

MT-NLG is trained on filtered high-quality data collected from

various public datasets and blends various types of datasets in a

single batch, which beats GPT-3 on several evaluations.

Chinchilla [96]: A causal decoder trained on the same dataset as Gopher [116] but with a slightly different data sampling distribution (sampled from MassiveText). The model architecture is similar to the one used for Gopher, except that it uses the AdamW optimizer instead of Adam. Chinchilla identifies the

1https://github.com/TimDettmers/bitsandbytes

Figure 9: The BLOOM architecture example sourced from [13].

relationship that model size should be doubled for every dou-

bling of training tokens. Over 400 language models ranging

from 70 million to over 16 billion parameters on 5 to 500 bil-

lion tokens are trained to get the estimates for compute-optimal

training under a given budget. The authors train a 70B model

with the same compute budget as Gopher (280B) but with 4

times more data. It outperforms Gopher [116], GPT-3 [6], and

others on various downstream tasks, after fine-tuning.
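The trade-off between model size and data under a fixed budget can be checked with a short calculation. The sketch assumes the common rule of thumb that training cost is roughly 6 x parameters x tokens; that approximation and the placeholder token count are assumptions of this example, not figures from the text.

```python
def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Rule-of-thumb training cost, C ~ 6 * N * D (assumed approximation)."""
    return 6.0 * num_params * num_tokens


BASE_TOKENS = 1.0e11  # hypothetical token count; only the ratio matters here

flops_280b = approx_training_flops(280e9, BASE_TOKENS)
flops_70b = approx_training_flops(70e9, 4 * BASE_TOKENS)
budget_ratio = flops_70b / flops_280b
```

Under this approximation, a model that is 4 times smaller and sees 4 times more tokens uses the same compute, which matches the 70B versus 280B comparison described above.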

AlexaTM [122]: An encoder-decoder model, where encoder

weights and decoder embeddings are initialized with a pre-

trained encoder to speed up training. The encoder stays frozen

for the initial 100k steps and is later unfrozen for end-to-end

training. The model is trained on a combination of denoising

and causal language modeling (CLM) objectives, concatenat-

ing a [CLM] token at the beginning for mode switching. Dur-

ing training, the CLM task is applied for 20% of the time, which

improves the in-context learning performance.

PaLM [15]: A causal decoder with parallel attention and feed-forward layers similar to Eq. 4, speeding up training by about 15%. Additional changes to the conventional transformer model include SwiGLU activation, RoPE embeddings,

multi-query attention that saves computation cost during decod-

ing, and shared input-output embeddings. During training, loss

spiking was observed, and to fix it, model training was restarted

from a 100-step earlier checkpoint by skipping 200-500 batches

around the spike. Moreover, the model was found to memo-

rize around 2.4% of the training data at the 540B model scale,

whereas this number was lower for smaller models.

PaLM-2 [123]: A smaller multi-lingual variant of PaLM,

trained for larger iterations on a better quality dataset. PaLM-

2 shows significant improvements over PaLM, while reducing

training and inference costs due to its smaller size. To lessen

toxicity and memorization, it appends special tokens with a

fraction of pre-training data, which shows a reduction in gener-

ating harmful responses.

U-PaLM [124]: This method trains PaLM for 0.1% additional compute with the UL2 objective [125], a procedure also named UL2Restore (UL2R). Using the same dataset, it outperforms the baseline significantly on various NLP tasks, including zero-shot, few-shot, commonsense reasoning, CoT, etc. Training with UL2R involves converting a causal decoder PaLM to a non-causal decoder PaLM and employing 50% sequential denoising, 25% regular denoising, and 25% extreme denoising loss functions.

UL2 [125]: An encoder-decoder architecture trained using a mixture of denoisers (MoD) objective. Denoisers include 1) R-Denoiser: regular span masking, 2) S-Denoiser: corrupts consecutive tokens of a large sequence, and 3) X-Denoiser: corrupts a large number of tokens randomly. During pre-training, UL2 includes a denoiser token from {R, S, X} to represent a denoising setup. It helps improve fine-tuning performance for downstream tasks that bind the task to one of the upstream training modes. This MoD style of training outperforms the T5 model on many benchmarks.
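Mode switching with a denoiser token can be sketched as sampling a denoiser for each training example and prepending its token. The mixing weights below reuse the 50% sequential, 25% regular, 25% extreme split quoted for UL2R; the token spellings `[S]`, `[R]`, `[X]` and the function names are assumptions, and the corruption itself is omitted.

```python
import random

# Denoiser token -> sampling weight (split quoted for UL2R in the text).
MIXTURE = {"[S]": 0.50, "[R]": 0.25, "[X]": 0.25}


def sample_mode(rng: random.Random) -> str:
    """Pick a denoiser token according to the mixture weights."""
    modes = list(MIXTURE)
    weights = [MIXTURE[m] for m in modes]
    return rng.choices(modes, weights=weights, k=1)[0]


def tag_example(tokens, rng: random.Random):
    """Prepend the sampled denoiser token so the model knows the mode.

    The span corruption that each mode implies would be applied here too;
    it is left out to keep the sketch short.
    """
    return [sample_mode(rng)] + list(tokens)


rng = random.Random(0)
tagged = tag_example(["the", "quick", "brown", "fox"], rng)
```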

GLM-130B [33]: GLM-130B is a bilingual (English and Chi-

nese) model trained using an auto-regressive mask infilling pre-

training objective similar to the GLM [126]. This training style

makes the model bidirectional as compared to GPT-3, which is

unidirectional. As opposed to GLM, the training of GLM-130B

includes a small amount of multi-task instruction pre-training

data (5% of the total data) along with self-supervised mask in-

filling. To stabilize the training, it applies embedding layer gra-

dient shrink.

LLaMA [127, 21]: A set of decoder-only language models varying from 7B to 70B parameters. The LLaMA model series is widely known in the community for parameter efficiency and instruction tuning.

LLaMA-1 [127]: Implements efficient causal attention [128]

by not storing and computing masked attention weights and

key/query scores. Another optimization is reducing the number

of activations recomputed in the backward pass, as in [129].

LLaMA-2 [21]: This work is more focused on fine-tuning a

safer and better LLaMA-2-Chat model for dialogue generation.

The pre-trained model has 40% more training data with a larger

context length and grouped-query attention.

PanGu-Σ [92]: An autoregressive model with parameters

copied from PanGu-α and extended to a trillion scale with Ran-

dom Routed Experts (RRE), the architectural diagram is shown

in Figure 10. RRE is similar to the MoE architecture, with

distinctions at the second level, where tokens are randomly

routed to experts in a domain instead of using a learnable gat-

ing method. The model has bottom layers densely activated

and shared across all domains, whereas top layers are sparsely

activated according to the domain. This training style allows

extracting task-specific models and reduces catastrophic forget-

ting effects in the case of continual learning.
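The two-level routing described above can be sketched as follows: the first level selects a group of experts by domain, and the second level picks an expert inside that group through a fixed random mapping instead of a learned gate. Keying the mapping on the token id, and all names and sizes below, are assumptions made for illustration.

```python
import random


def build_routing_table(vocab_size: int, experts_per_domain: int, seed: int = 0):
    """Fixed random token-id -> local expert index mapping (not learned)."""
    rng = random.Random(seed)
    return [rng.randrange(experts_per_domain) for _ in range(vocab_size)]


def route(token_id: int, domain: int, table, experts_per_domain: int) -> int:
    """Level 1: the domain selects a group. Level 2: the table selects an
    expert inside that group. Returns a global expert index."""
    return domain * experts_per_domain + table[token_id]


EXPERTS_PER_DOMAIN = 4  # hypothetical sizes
table = build_routing_table(vocab_size=100, experts_per_domain=EXPERTS_PER_DOMAIN)
expert = route(token_id=42, domain=2, table=table, experts_per_domain=EXPERTS_PER_DOMAIN)
```

Because each domain owns a disjoint group of experts, the experts of one domain can be lifted out as a smaller domain-specific model, which is the extraction property mentioned in the text.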

3.1.2. Coding

CodeGen [130]: CodeGen has a similar architecture to PaLM [15], i.e., parallel attention, MLP layers, and RoPE embeddings. The model is trained on both natural language and programming language data sequentially (trained on the first dataset, then the second, and so on) on the following datasets: 1) PILE, 2) BIGQUERY, and 3) BIGPYTHON. CodeGen proposes a multi-step approach to synthesizing code. The purpose is to simplify the generation of long sequences: the previous prompt and generated code are given as input with the next prompt to generate the next code sequence. CodeGen open-sources a Multi-Turn Programming Benchmark (MTPB) to evaluate multi-step program synthesis.
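The multi-step procedure can be sketched as a loop that keeps appending each prompt and the code generated for it to a running context. The `generate` callable stands in for a model call and, like the comment-style prompt format, is a hypothetical placeholder.

```python
def multi_turn_synthesis(prompts, generate):
    """Generate code step by step, feeding earlier prompts and code back in.

    `generate(context) -> str` is a stand-in for the language model.
    Returns the per-step programs and the final accumulated context.
    """
    context = ""
    programs = []
    for prompt in prompts:
        context += f"# {prompt}\n"  # the next natural-language step
        code = generate(context)    # the model sees all previous turns
        context += code + "\n"
        programs.append(code)
    return programs, context


def fake_generate(context):
    """Toy model: names each step after the number of prompts seen so far."""
    return f"step_{context.count('#')}()"


programs, transcript = multi_turn_synthesis(
    ["load the data", "filter empty rows", "plot the result"], fake_generate
)
```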

Codex [131]: This LLM is trained on a subset of public Python GitHub repositories to generate code from docstrings. Computer programming is an iterative process where programs are often debugged and updated before fulfilling the requirements. Similarly, Codex generates 100 versions of a program by repeated sampling for a given description, which produces a working solution that passes the unit tests for 77.5% of the problems. A more powerful version powers GitHub Copilot2.
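Results obtained by repeated sampling are usually summarized with the pass@k metric: the probability that at least one of k samples passes the unit tests. A minimal sketch of the standard combinatorial estimator, assuming n samples were drawn per problem and c of them passed (the function name is chosen for this example):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the samples contains no correct one.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# 100 samples for one problem, 5 of which pass the unit tests.
p1 = pass_at_k(100, 5, 1)
p100 = pass_at_k(100, 5, 100)
```

Averaging this value over all problems in a benchmark gives the reported score.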

AlphaCode [132]: A set of large language models, ranging

from 300M to 41B parameters, designed for competition-level

code generation tasks. It uses the multi-query attention [133] to

reduce memory and cache costs. Since competitive program-

ming problems highly require deep reasoning and an under-

standing of complex natural language algorithms, the Alpha-

Code models are pre-trained on filtered GitHub code in popular

languages and then fine-tuned on a new competitive program-

ming dataset named CodeContests. The CodeContests dataset

mainly contains problems, solutions, and test cases collected

from the Codeforces platform3. The pre-training employs stan-

dard language modeling objectives, while GOLD [134] with

tempering [135] serves as the training objective for the fine-

tuning on CodeContests data. To evaluate the performance of

AlphaCode, simulated programming competitions are hosted

on the Codeforces platform: overall, AlphaCode ranks at the

top 54.3% among over 5000 competitors, where its Codeforces

rating is within the top 28% of recently participated users.

CodeT5+ [34]: CodeT5+ is based on CodeT5 [136], with a shallow encoder and deep decoder, trained in multiple stages, initially on unimodal data (code) and later on bimodal data (text-code pairs). Each training stage has different training objectives and activates different model blocks (encoder, decoder, or both) according to the task. The unimodal pre-training includes span denoising and CLM objectives, whereas bimodal pre-training objectives contain contrastive learning, matching, and CLM for text-code pairs. CodeT5+ adds special tokens with the text to enable task modes, for example, [CLS] for contrastive loss, [Match] for text-code matching, etc.

StarCoder [137]: A decoder-only model with the SantaCoder architecture, employing Flash attention to scale up the context length to 8k. StarCoder trains an encoder to filter names, emails, and other personal data from the training data. Its fine-tuned variant outperforms PaLM, LLaMA, and LaMDA on the HumanEval and MBPP benchmarks.

3.1.3. Scientific Knowledge

Galactica [138]: A model trained on a large curated corpus of human scientific knowledge with 48 million papers, textbooks, lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more. Training uses the metaseq library3, which is built on PyTorch and fairscale [139]. The model wraps reasoning datasets with the <work> token to provide step-by-step reasoning context to the model, which has been shown to improve the performance on reasoning tasks.

2https://github.com/features/copilot

3https://codeforces.com/

Figure 10: An example of the PanGu-Σ architecture, image sourced from [92].

3.1.4. Dialog

LaMDA [140]: A decoder-only model pre-trained on pub-

lic dialog data, public dialog utterances, and public web doc-

uments, where more than 90% of the pre-training data is in

English. LaMDA is trained with the objective of producing re-

sponses that exhibit high levels of quality, safety, and grounded-

ness. To achieve this, discriminative and generative fine-tuning

techniques are incorporated to enhance the model’s safety and

quality aspects. As a result, the LaMDA models can be utilized

as a general language model performing various tasks.

3.1.5. Finance

BloombergGPT [141]: A non-causal decoder model trained using both financial ("FINPILE" from the Bloomberg archive) and general-purpose datasets. The model's architecture is similar to BLOOM [13] and OPT [14]. It allocates 50B parameters to different blocks of the model using the approach in [113]. For effective training, BloombergGPT packs documents together with <|endoftext|> to use the maximum sequence length, uses batch size warmup from 1024 to 2048, and manually reduces the learning rate multiple times during training.
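Document packing can be sketched as concatenating tokenized documents with a separator and cutting the stream into fixed-length training sequences. Dropping the trailing partial sequence and the function name are assumptions of this sketch.

```python
def pack_documents(tokenized_docs, seq_len, eot="<|endoftext|>"):
    """Concatenate documents separated by `eot`, then split the stream into
    full sequences of exactly `seq_len` tokens (no padding needed)."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot)  # marks the document boundary inside a sequence
    # Keep only complete sequences; a trailing remainder is dropped here.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]


docs = [["a", "b"], ["c"], ["d", "e", "f"]]
packed = pack_documents(docs, seq_len=4)
```

Every packed sequence is completely filled with real tokens, so no compute is spent on padding positions.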

Xuan Yuan 2.0 [142]: A Chinese financial chat model with

BLOOM’s [13] architecture trained on a combination of general

purpose, financial, general purpose instructions, and financial

institutions datasets. Xuan Yuan 2.0 combined the pre-training

and fine-tuning stages to avoid catastrophic forgetting.

3.2. Fine-Tuned LLMs

Pre-trained LLMs have excellent generalization abilities to

unseen tasks. However, because they are generally trained with

the objective of next token prediction, LLMs have limited ca-

pacity to follow user intent and are prone to generate unethical,

toxic or inaccurate responses [20]. For their effective utiliza-

tion, LLMs are fine-tuned to follow instructions [16, 17, 97] and

generate safe responses [20], which also results in increasing

zero-shot, few-shot, and cross-task generalization [97, 16, 18],

Figure 11: An instance of the Flan training paradigm, taken from [16].

with minimal compute increment, e.g., 0.2% of the total pre-

training for PaLM 540B [16].

We review various fine-tuned LLMs and strategies for effective

fine-tuning in this section.

3.2.1. Instruction-Tuning with Manually Created Datasets

Numerous hand-crafted instruction-tuning datasets with

different design choices are proposed in the literature to

instruction-tune LLMs. The performance of fine-tuned LLMs

depends on multiple factors, such as dataset, instruction diver-

sity, prompting templates, model size, and training objectives.

Keeping this in view, diverse fine-tuned models have emerged

in the literature using manually created datasets.

The models T0 [17] and mT0 (multi-lingual) [144] employ

templates to convert existing datasets into prompt datasets.

They have shown improvements in generalization to zero-shot

and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model

with in-context instructions to study generalization on unseen

tasks when given in-context instructions during test time. The

model outperformed Instruct-GPT, despite being smaller in

size, i.e., 11B parameters as compared to 175B of GPT-3.
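Converting an existing labeled dataset into a prompt dataset with a template can be sketched as below. The template wording, the label-to-answer mapping, and the field names are hypothetical and are not taken from [17] or [144].

```python
# Hypothetical template and answer mapping for a natural language inference task.
NLI_TEMPLATE = 'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes, no, or maybe?'
NLI_CHOICES = ["Yes", "Maybe", "No"]  # index = integer class label in the dataset


def apply_template(example, template=NLI_TEMPLATE, answer_choices=NLI_CHOICES):
    """Turn one labeled example into a (prompt, target) text pair."""
    prompt = template.format(
        premise=example["premise"], hypothesis=example["hypothesis"]
    )
    target = answer_choices[example["label"]]
    return prompt, target


example = {"premise": "A dog is running.", "hypothesis": "An animal moves.", "label": 0}
prompt, target = apply_template(example)
```

Writing several templates per dataset yields multiple prompt variants from the same labeled example.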

Increasing Tasks and Prompt Setups: Zero-shot and few-shot performance improves significantly by expanding task collection and prompt styles. OPT-IML [97] and Flan [16] curated larger 2k and 1.8k task datasets, respectively. Because increasing the number of tasks alone is not enough, OPT-IML and Flan add more prompting setups to their datasets: zero-shot, few-shot, and CoT. In continuation, CoT Collection [101] fine-tunes Flan-T5 further on 1.88M CoT samples. Another method [102] uses symbolic tasks together with tasks in T0, Flan, etc.

3.2.2. Instruction-Tuning with LLMs Generated Datasets

Generating an instruction-tuning dataset requires carefully

writing instructions and input-output pairs, which are often

written by humans, smaller in size, and less diverse. To

overcome this, self-instruct [19] proposed an approach to

prompt available LLMs to generate instruction-tuning datasets.

Self-instruct outperformed models trained on manually created

dataset SUPER-NATURALINSTRUCTIONS (a dataset with

1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks,

1 instruction, and 1 sample per task and iteratively generates


Table 1: Noteworthy findings and insights of pre-trained Large Language Models.

Models Findings & Insights

T5

• An encoder and decoder with shared parameters perform on par with an encoder and decoder whose parameters are not shared

• Fine-tuning model layers (adapter layers) work better than the conventional way of training on only

classification layers

GPT-3

• Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta-

learners

mT5

• Large multi-lingual models perform equivalently to single language models on downstream tasks.

However, smaller multi-lingual models perform worse

PanGu-α • LLMs have good few shot capabilities

CPM-2

• Prompt fine-tuning requires updating very few parameters while achieving performance compara-

ble to full model fine-tuning

• Prompt fine-tuning takes more time to converge as compared to full model fine-tuning

• Inserting prompt tokens in-between sentences can allow the model to understand relations between

sentences and long sequences

• In an analysis, CPM-2 finds that prompts work as a provider (additional context) and aggregator

(aggregate information with the input text) for the model

ERNIE 3.0

• A modular LLM architecture with a universal representation module and task-specific representa-

tion module helps in the finetuning phase

• Optimizing the parameters of a task-specific representation network during the fine-tuning phase is

an efficient way to take advantage of the powerful pre-trained model

Jurassic-1

• The performance of LLM is highly related to the network size

• To improve runtime performance, more operations can be performed in parallel (width) rather than

sequential (depth)

• To efficiently represent and fit more text in the same context length, the model uses a larger vo-

cabulary to train a SentencePiece tokenizer without restricting it to word boundaries. This further

benefits in few-shot learning tasks

HyperCLOVA

• By employing prompt-based tuning, the performances of models can be improved, often surpassing

those of state-of-the-art models when the backward gradients of inputs are accessible

Yuan 1.0

• The model architecture that excels in pre-training and fine-tuning cases may exhibit contrasting

behavior in zero-shot and few-shot learning

Gopher • Relative encodings enable the model to evaluate for longer sequences than training.

ERNIE 3.0 Titan

• Additional self-supervised adversarial loss to distinguish between real and generated text improves

the model performance as compared to ERNIE 3.0

GPT-NeoX-20B

• Parallel attention + FF layers speed-up training 15% with the same performance as with cascaded

layers

• Initializing feed-forward output layers before residuals with scheme in [143] avoids activations

from growing with increasing depth and width

• Training on Pile outperforms GPT-3 on five-shot


OPT

• Restart training from an earlier checkpoint with a lower learning rate if loss diverges

• The model is prone to generating repetitive text and getting stuck in a loop

Galactica

• Galactica’s performance has continued to improve across validation set, in-domain, and out-of-

domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing

research on LLMs

• A working memory token approach can achieve strong performance over existing methods on

mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream

tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%)

GLaM

• The model capacity can be maintained at reduced computation by replacing the feed-forward layer

in each transformer layer with a mixture-of-experts (MoE)

• The model trained on filtered data shows consistently better performances on both NLG and NLU

tasks, where the effect of filtering is more significant on the former tasks

• Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for

the downstream tasks

• The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in

the MoE layer. Given a fixed budget of computation, more experts contribute to a better perfor-

mance

LaMDA • The model can be fine-tuned to learn to call different external information resources and tools

AlphaCode

• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed

with a shallower encoder and a deeper decoder

• To achieve better performance, it is necessary to employ strategies such as massively scaling

up sampling, followed by the filtering and clustering of samples into a compact set

• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-

scale sampling is crucial

• Simplifying problem descriptions can effectively improve the model’s performance

Chinchilla

• The model size and the number of training tokens should be scaled proportionately: for each dou-

bling of the model size, the number of training tokens should be doubled as well

PaLM

• English-centric models produce better translations when translating to English as compared to non-

English

• Generalized models can have equivalent performance for language translation to specialized small

models

• Larger models have a higher percentage of training data memorization

• Performance has not yet saturated even at 540B scale, which means larger models are likely to

perform better

AlexaTM

• Encoder-decoder architecture is more suitable to train LLMs given bidirectional attention to the

context than decoder-only

• Causal Language Modeling (CLM) task can be added to benefit the model with efficient in-context

learning

• Placing layer norm at the beginning of each transformer layer improves the training stability

U-PaLM

• Training with a mixture of denoisers outperforms PaLM when trained further for a few more FLOPs

• Training with a mixture of denoisers improves the infilling ability and open-ended text generation

diversity

UL2

• Mode switching training enables better performance on downstream tasks

• CoT prompting outperforms standard prompting for UL2

GLM-130B

• Pre-training data with a small proportion of multi-task instruction data improves the overall model

performance

CodeGen

• Multi-step prompting for code synthesis leads to a better user intent understanding and code gen-

eration

LLaMA

• A constant performance improvement is observed when scaling the model

• Smaller models can achieve good performances with more training data and computing time

PanGu-Σ

• Sparse models provide the benefits of large models at a lower computation cost

• Randomly Routed Experts reduces catastrophic forgetting effects which in turn is essential for

continual learning

• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment which is

cost-efficient while maintaining a performance similar to the original

BloombergGPT

• Pre-training with general-purpose and task-specific data improves task performance without hurt-

ing other model capabilities

XuanYuan 2.0 • Combining pre-training and fine-tuning stages in single training avoids catastrophic forgetting

CodeT5+

• Causal LM is crucial for a model’s generation capability in encoder-decoder architectures

• Multiple training objectives like span corruption, Causal LM, matching, etc., complement each other

for better performance

StarCoder • HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

LLaMA-2

• Model trained on unfiltered data is more toxic but may perform better on downstream tasks after

fine-tuning

• Model trained on unfiltered data requires fewer samples for safety alignment

PaLM-2

• Data quality is important to train better models

• Model and data size should be scaled with 1:1 proportions

• Smaller models trained for more iterations outperform larger models
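The proportional-scaling findings in the Chinchilla and PaLM-2 rows above can be illustrated with a small calculation. This is only a sketch: it assumes the common approximation that training compute is roughly 6 × parameters × tokens, and a fixed ratio of 20 training tokens per parameter. Both values, along with the function name, are assumptions brought in for illustration and are not stated in the table.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a compute budget between model size N and training tokens D.

    Assumes C ~= 6 * N * D (an approximation) and D = tokens_per_param * N
    (an illustrative ratio), so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Quadrupling the compute budget doubles both the model size and the token count.
small = compute_optimal_split(1e21)
large = compute_optimal_split(4e21)
print(large[0] / small[0], large[1] / small[1])
```

Under these assumptions, model size and data grow at the same rate, which is the 1:1 proportion the two rows describe.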

Table 2: Key insights and findings from the study of instruction-tuned Large Language Models.

Models Findings & Insights

T0

• Multi-task prompting enables zero-shot generalization and outperforms baselines

• Even a single prompt per dataset task is enough to improve performance

WebGPT

• To aid the model in effectively filtering and utilizing relevant information, human labelers play a

crucial role in answering questions regarding the usefulness of the retrieved documents

• Letting a fine-tuned language model interact with a text-based web-browsing environment can improve

end-to-end retrieval and synthesis via imitation learning and reinforcement learning

• Generating answers with references makes it easier for labelers to judge the factual accuracy of answers

Tk-INSTRUCT

• Instruction tuning leads to stronger generalization to unseen tasks

• More tasks improve generalization whereas only increasing task instances does not help

• Supervised trained models are better than generalized models

• Models pre-trained with instructions and examples perform well for different types of inputs

mT0 and BLOOMZ

• Instruction tuning enables zero-shot generalization to tasks never seen before

• Multi-lingual training leads to even better zero-shot generalization for both English and non-

English

• Training on machine-translated prompts improves performance for held-out tasks with non-English

prompts

• English-only fine-tuning of a multilingual pre-trained language model is enough to generalize to

tasks in other pre-trained languages

OPT-IML

• Creating a batch with multiple task examples is important for better performance

• Example-proportional sampling alone is not enough; training datasets should also be proportional

for better generalization/performance

• Performance on fully held-out and partially supervised tasks improves by scaling tasks or categories,

whereas fully supervised tasks are unaffected

• Including a small amount (i.e., 5%) of pretraining data during fine-tuning is effective

• Only 1% reasoning data improves the performance, adding more deteriorates performance

• Adding dialogue data makes the performance worse

Sparrow

• Labelers’ judgment and well-defined alignment rules help the model generate better responses

• Good dialogue goals can be broken down into detailed natural language rules for the agent and the

raters

• The combination of reinforcement learning (RL) with reranking yields optimal performance in

terms of preference win rates and resilience against adversarial probing

Flan

• Finetuning with CoT improves performance on held-out tasks

• Fine-tuning along with CoT data improves reasoning abilities

• CoT tuning improves zero-shot reasoning

• Performance improves with more tasks

• Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models

• Improving the model’s performance with instruction tuning is compute-efficient

• Multitask prompting enables zero-shot generalization abilities in LLM

WizardCoder • Fine-tuning on instruction-tuning data re-written into a more complex set improves performance

LLaMA-2-Chat

• The model learns to write safe responses with fine-tuning on safe demonstrations, while an additional

RLHF step further improves model safety and makes it less prone to jailbreak attacks

LIMA • A small amount of high-quality data is enough for fine-tuned model generalization

new instructions (52k) and instances (82k input-output pairs)

using GPT-3 [6]. Contrary to this, Dynosaur [145] uses the

meta-data of datasets on Huggingface to prompt LLMs to

generate multiple task instruction-tuning datasets.

LLaMA Tuned: Various models in the literature instruction-

tune LLaMA [146] with GPT-3 [6] or GPT-4 [147] generated

datasets. Among these, Alpaca [148], Vicuna [149], and

LLaMA-GPT-4 [150] are a few general-purpose fine-tuned

models, where Alpaca is trained on 52k samples from text-

davinci-003, Vicuna on 70k samples from ShareGPT.com,

and LLaMA-GPT-4 by re-creating Alpaca instructions from

GPT-4. Goat [151] fine-tunes LLaMA for arithmetic tasks

(1 million samples) by generating data from ChatGPT and

outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its

success to LLaMA’s consistent tokenization of numbers.

HuaTuo [152] is a medical knowledge model, fine-tuned with

a generated QA dataset of 8k instructions.

Complex Instructions: Evol-Instruct [153, 154] prompts

LLMs to convert given instructions into a more complex set.

The instructions are iteratively evolved by re-writing instruc-

tions in complex wording and creating new instructions. With

this style of automated instruction generation, WizardLM [153]

(fine-tuned LLaMA on 250k instructions), outperforms Vicuna

and Alpaca, and WizardCoder [154] (fine-tuned StarCoder)

beats Claude-Plus, Bard, and others.

3.2.3. Aligning with Human Preferences

Incorporating human preferences into LLMs presents a

significant advantage in mitigating undesirable behaviors and

ensuring accurate outputs. Initial work on alignment, such

as InstructGPT [20], aligns GPT-3 using a three-step approach:

instruction-tuning, reward modeling, and fine-tuning with

reinforcement learning (RL). GPT-3, after supervised fine-tuning

on demonstrations, is queried to generate responses, which

human labelers rank according to human values, and a reward

model is trained on the ranked data. Lastly, GPT-3 is trained

with proximal policy optimization (PPO), using rewards that the

reward model assigns to the generated data. LLaMA 2-Chat [21]

improves alignment by dividing reward modeling into help-

fulness and safety rewards and using rejection sampling in

addition to PPO. The initial four versions of LLaMA 2-Chat

are fine-tuned with rejection sampling and then with PPO on

top of rejection sampling.
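As a rough illustration of the reward-modeling step, the sketch below computes a pairwise ranking loss of the form -log σ(r_better - r_worse), a common choice for training reward models on ranked comparisons. The function names and toy scores are hypothetical; in a real pipeline the scores would come from a neural reward model rather than hand-written numbers.

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """-log(sigmoid(score_better - score_worse)), in a numerically stable form."""
    margin = score_better - score_worse
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def ranking_to_pairs(ranked_scores):
    """Turn scores listed from best to worst response into (better, worse) pairs."""
    return [(ranked_scores[i], ranked_scores[j])
            for i in range(len(ranked_scores))
            for j in range(i + 1, len(ranked_scores))]

def reward_model_loss(ranked_scores):
    """Average pairwise loss over all pairs implied by one human ranking."""
    pairs = ranking_to_pairs(ranked_scores)
    return sum(pairwise_ranking_loss(a, b) for a, b in pairs) / len(pairs)

# Scores that agree with the human ranking give a lower loss than reversed scores.
print(reward_model_loss([2.0, 1.0, 0.0]), reward_model_loss([0.0, 1.0, 2.0]))
```

Minimizing this loss pushes the reward model to score human-preferred responses higher, and those scores are what the RL step then optimizes against.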

Aligning with Supported Evidence: This style of alignment

allows the model to generate responses with proofs and facts,

reduces hallucination, and assists humans more effectively,

which increases trust in the model’s output. Similar to

the RLHF training style, a reward model is trained to rank

generated responses containing web citations in answers

to questions, which is later used to train the model, as in

GopherCite [155], WebGPT [156], and Sparrow [157]. The

ranking model in Sparrow [157] is divided into two branches,

preference reward and rule reward, where human annotators

adversarially probe the model to break a rule. These two rewards

together rank a response to train with RL.

Aligning Directly with SFT: The PPO in the RLHF pipeline

is complex, memory-intensive, and unstable, requiring mul-

tiple models: reward, value, policy, and reference models.

Avoiding this sophisticated alignment pipeline is possible by

incorporating minimal changes in the supervised fine-tuning

(SFT) pipeline as in [158, 159, 160], with better or compa-

rable performance to PPO. Direct preference optimization

(DPO) [158] trains a model directly on the human-preferred

responses to maximize the likelihood of preferred against

unpreferred responses, with per-sample importance weight.

Reward ranked fine-tuning (RAFT) [159] fine-tunes the model

on responses ranked by the reward model. Preference ranking

optimization (PRO) [161] and RRHF [160] train the model to rank

responses in line with human preferences alongside a supervised loss.

On the other hand, chain-of-hindsight (CoH) [162] provides

feedback to the model in language rather than reward, to learn

good versus bad responses.
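A minimal sketch of a DPO-style objective for a single preference pair is given below, assuming the sequence log-probabilities of the preferred and unpreferred responses under the trained policy and under a frozen reference model are already available. The argument names and the value of beta are illustrative assumptions, not settings taken from [158].

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_preferred, policy_logp_unpreferred,
             ref_logp_preferred, ref_logp_unpreferred, beta=0.1):
    """Loss falls as the policy favors the preferred response more than the reference does."""
    preferred_logratio = policy_logp_preferred - ref_logp_preferred
    unpreferred_logratio = policy_logp_unpreferred - ref_logp_unpreferred
    return -log_sigmoid(beta * (preferred_logratio - unpreferred_logratio))

# A policy identical to the reference sits at log(2); shifting probability
# mass toward the preferred response lowers the loss.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0), dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

No separate reward or value model appears in this computation, which is the simplification over PPO that the paragraph describes.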

Aligning with Synthetic Feedback: Aligning LLMs with

human feedback is slow and costly. The literature suggests a

semi-automated process to align LLMs by prompting LLMs to

generate helpful, honest, and ethical responses to the queries,

and fine-tuning using the newly created dataset. Constitutional

AI [163] replaces human feedback in RLHF with AI, calling

it RL from AI feedback (RLAIF). AlpacaFarm [164] designs

prompts to imitate human feedback using LLM APIs. Unlike

Constitutional AI, AlpacaFarm injects noise in feedback

to replicate human mistakes. Self-Align [98] prompts the

LLM with ICL examples, instructing the LLM about what the

response should contain to be considered useful and ethical.

The same LLM is later fine-tuned with the new dataset.

Aligning with Prompts: LLMs can be steered with prompts to

generate desirable responses without training [165, 166]. The

self-correction prompting in [166] concatenates instructions

and CoT with questions, guiding the model to follow

a strategy that ensures moral safety before giving

the actual answer. This strategy is shown to reduce the harm in

generated responses significantly.

Red-Teaming/Jailbreaking/Adversarial Attacks: Under

adversarial probing, LLMs exhibit harmful behaviors, hallucinations,

leakage of personal information, and other shortcomings.

The models are susceptible to generating harmful responses

even though they are aligned for safety [167, 168]. Red-

teaming is a common approach to address illicit outputs, where

the LLMs are prompted to generate harmful outputs [168, 169].

The dataset collected through red-teaming is used to fine-tune

models for safety. While red-teaming largely relies on human

annotators, another work [170] red-teams LLMs to find prompts

that lead to harmful outputs for other LLMs.

3.2.4. Continue Pre-Training

Although fine-tuning boosts a model’s performance, it leads

to catastrophic forgetting of previously learned information.

Concatenating fine-tuning data with a few randomly selected

pre-training samples in every iteration avoids network forget-

ting [171, 142]. This is also effective in adapting LLMs for

cases where fine-tuning data is small and the original capac-

ity is to be maintained. Prompt-based continued pre-training

(PCP) [172] trains the model with text and instructions related

to tasks and then finally instruction-tunes the model for down-

stream tasks.
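The replay idea described above, mixing a few randomly selected pre-training samples into every fine-tuning iteration, can be sketched as a simple batch builder. The function name, batch size, and number of replayed samples are hypothetical choices for illustration; [171, 142] may use different proportions.

```python
import random

def mixed_batches(finetune_data, pretrain_data, batch_size=8, replay_per_batch=2, seed=0):
    """Yield batches holding fine-tuning samples plus a few random pre-training samples."""
    rng = random.Random(seed)
    finetune_per_batch = batch_size - replay_per_batch
    for start in range(0, len(finetune_data), finetune_per_batch):
        finetune_part = finetune_data[start:start + finetune_per_batch]
        # Fresh random pre-training samples are drawn for every iteration.
        replay_part = rng.sample(pretrain_data, replay_per_batch)
        batch = finetune_part + replay_part
        rng.shuffle(batch)
        yield batch

finetune = [("ft", i) for i in range(12)]
pretrain = [("pt", i) for i in range(100)]
for batch in mixed_batches(finetune, pretrain):
    print(batch)
```

Each fine-tuning sample is visited once per pass, while the replayed pre-training samples keep the model exposed to its original data distribution.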

3.2.5. Sample Efficiency

While fine-tuning data is generally many-fold smaller than

the pre-training data, it still has to be large enough for accept-

able performance [16, 97, 18] and requires proportional com-

puting resources. Studying the effects on performance with less

data, existing literature [173, 174] finds that models trained

on less data can outperform models trained with more data.

In [173], 25% of the total downstream data is found enough

for state-of-the-art performance. In [174], selecting a coreset of 0.5%

of the total instruction-tuning data improves model perfor-

mance by 2% compared to tuning on the complete data.

Less is more for alignment (LIMA) [175] uses only 1000

carefully created demonstrations to fine-tune the model and has

achieved comparable performance to GPT-4.

3.3. Increasing Context Window

LLMs are trained with limited context windows due to ex-

pensive attention and high memory requirements. A model

trained on limited sequence lengths fails to generalize to unseen

lengths at inference time [176, 49]. Alternatively, LLMs with

ALiBi [65] positional encodings can perform zero-shot length

extrapolation. However, ALiBi has less expressive power [66]

and inferior performance on multiple benchmarks [46], and

many LLMs use RoPE positional embedding that is unable to

perform zero-shot extrapolation. A larger context length has

benefits such as a better understanding of longer documents,

more samples in in-context learning, execution of bigger rea-

soning processes, etc. Expanding context length during fine-

tuning is slow, inefficient, and computationally expensive [49].

Therefore, researchers employ various context window extrap-

olation techniques discussed below.

Position Interpolation: Rather than extrapolating, [49] shows

that interpolating position encodings within the pre-trained con-

text window is more effective. The work demonstrates that

only 1000 steps of fine-tuning are enough to achieve better re-

sults on larger windows without reducing performance com-

pared to the original context size. Giraffe [46] uses power scal-

ing in RoPE, and YaRN [47] proposed NTK-aware interpola-

tion.
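The core of position interpolation can be sketched in a few lines: positions in the longer window are linearly rescaled so that they fall inside the range seen during pre-training, and the usual RoPE angles are then computed from the rescaled position. The window sizes (2048 and 8192), the head dimension, and the function names below are illustrative assumptions, and the sketch does not cover the power-scaling or NTK-aware variants.

```python
def rope_angles(position, head_dim, base=10000.0):
    """Rotation angles RoPE applies at one position (one angle per dimension pair)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def interpolate_position(position, train_len, target_len):
    """Linearly squeeze a position from [0, target_len) into [0, train_len)."""
    if target_len <= train_len:
        return float(position)
    return position * train_len / target_len

# Position 4096 in an 8192-token window is mapped to position 1024,
# which lies inside the 2048-token window used for pre-training.
scaled = interpolate_position(4096, train_len=2048, target_len=8192)
print(scaled, rope_angles(scaled, head_dim=8))
```

Because every rescaled position stays within the trained range, the model never sees rotation angles larger than those encountered in pre-training, which is why only a short fine-tuning run is needed.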

Efficient Attention Mechanism: Dense global attention is

one of the major constraints in training larger context win-

dow LLMs. Using efficient attention variants, such as lo-

cal, sparse, and dilated attention, reduces the computation cost

significantly. LongT5 [48] proposes transient global atten-

tion (TGlobal), applying attention to local and global tokens

(windowed token averaging). The model replaces attention

in T5 [10] with TGlobal attention, pre-trains the model on

a sequence length of 4096, fine-tunes on larger window sizes, as

large as 16k, and improves task performance on longer inputs.

This shows the extrapolation ability of TGlobal attention with

only fine-tuning. COLT5 [177] uses two branches, one with

lightweight and the other with heavyweight attention and feed-

forward layers. All tokens are processed from the lightweight

branch, and only important tokens are routed to the heavy-

weight branch. LongNet [178] replaces standard attention with

dilated attention, expanding sequence length to 1 billion tokens.

LongLoRA [179] proposes shift-short attention, used during

fine-tuning to reduce dense attention costs. However, the model

during inference uses dense attention and achieves similar per-

formance as full attention fine-tuning.
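To make the cost savings concrete, the sketch below builds boolean causal masks for dense, local (sliding-window), and dilated attention and counts how many query-key pairs each one keeps. These are generic textbook patterns written for illustration; they are not the exact attention schemes of LongT5, COLT5, LongNet, or LongLoRA, and the function names are hypothetical.

```python
def local_causal_mask(seq_len, window):
    """Each token attends to itself and the (window - 1) tokens before it."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def dilated_causal_mask(seq_len, window, dilation):
    """Each token attends to up to `window` earlier tokens spaced `dilation` apart."""
    return [[i - j >= 0 and (i - j) % dilation == 0 and (i - j) // dilation < window
             for j in range(seq_len)]
            for i in range(seq_len)]

def attended_pairs(mask):
    """Number of query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)

seq_len = 8
dense = local_causal_mask(seq_len, window=seq_len)  # full causal attention
local = local_causal_mask(seq_len, window=3)
dilated = dilated_causal_mask(seq_len, window=3, dilation=2)
print(attended_pairs(dense), attended_pairs(local), attended_pairs(dilated))
```

Dense attention grows quadratically with sequence length, whereas the local and dilated masks grow linearly for a fixed window, which is the source of the reduced computation.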

Extrapolation without Training: LM-Infinite [176] and par-

allel context windows (PCW) [180] show length extrapolation

is possible using pre-trained LLMs. LM-Infinite suggested Λ-

shaped attention applied within the original context window

limits. Likewise, PCW chunks larger inputs into the pre-trained

context lengths and applies the same positional encodings to

each chunk.
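The position reuse described for PCW can be sketched as follows: a long input is cut into chunks no longer than the pre-trained context length, and every chunk restarts its position indices from zero. The function names and the tiny window length are hypothetical, and the sketch covers only the positional part; how attention is handled across chunks is omitted.

```python
def split_into_windows(tokens, window_len):
    """Cut a long token list into chunks of at most window_len tokens."""
    return [tokens[i:i + window_len] for i in range(0, len(tokens), window_len)]

def reused_position_ids(num_tokens, window_len):
    """Position index per token, restarting from 0 at the start of each chunk."""
    return [t % window_len for t in range(num_tokens)]

tokens = list(range(10))
print(split_into_windows(tokens, window_len=4))
print(reused_position_ids(len(tokens), window_len=4))
```

Since no position index ever exceeds the pre-trained window length, the model only sees positional encodings it was trained on, even though the total input is longer.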

3.4. Augmented LLMs

LLMs are capable of learning from the examples concate-

nated with the input, known as context augmentation, in-

context learning (ICL), or few-shot prompting. They show ex-

cellent generalization to unseen tasks with few-shot prompt-

ing, enabling LLMs
