GLM-4

📄 Report • 🤗 HF Repo • 🤖 ModelScope • 🟣 WiseModel • 🐦 Twitter • 👋 加入我们的 Discord 和微信

📍在智谱AI开放平台体验和使用更大规模的 GLM 商业模型。

Read this in English

项目更新

🔥🔥 News: 2024/12/10: 本仓库微调代码支持使用Ascend NPU进行微调。请更新微调代码并查看代码内注释。
🔥 News: 2024/11/01: 本仓库依赖进行升级，请更新requirements.txt中的依赖以保证正常运行模型。glm-4-9b-chat-hf 是适配 transformers>=4.46.2 的模型权重，使用 transformers 库中的 GlmModel 类实现。同时，glm-4-9b-chat, glm-4v-9b 中的 tokenzier_chatglm.py 已经更新以适配最新版本的 transformers库。请前往 HuggingFace 更新文件。
🔥 News: 2024/10/27: 我们开源了 LongReward，这是一个使用 AI 反馈改进长上下文大型语言模型。
🔥 News: 2024/10/25: 我们开源了端到端中英语音对话模型 GLM-4-Voice。
🔥 News: 2024/09/05 我们开源了使LLMs能够在长上下文问答中生成细粒度引用的模型 longcite-glm4-9b 以及数据集 LongCite-45k, 欢迎在 Huggingface Space 在线体验。
🔥News: 2024/08/15: 我们开源具备长文本输出能力(单轮对话大模型输出可超过1万token) 的模型 longwriter-glm4-9b 以及数据集 LongWriter-6k, 欢迎在 Huggingface Space 或魔搭社区空间在线体验。
🔥 News: 2024/07/24: 我们发布了与长文本相关的最新技术解读，关注这里查看我们在训练 GLM-4-9B 开源模型中关于长文本技术的技术报告。
🔥 News: 2024/07/09: GLM-4-9B-Chat 模型已适配 Ollama, Llama.cpp，您可以在 PR 查看具体的细节。
🔥 News: 2024/06/18: 我们发布技术报告, 欢迎查看。
🔥 News: 2024/06/05: 我们发布 GLM-4-9B 系列开源模型。

模型介绍

GLM-4-9B 是智谱 AI 推出的最新一代预训练模型 GLM-4 系列中的开源版本。在语义、数学、推理、代码和知识等多方面的数据集测评中， GLM-4-9B 及其人类偏好对齐的版本 GLM-4-9B-Chat 均表现出超越 Llama-3-8B 的卓越性能。除了能进行多轮对话，GLM-4-9B-Chat 还具备网页浏览、代码执行、自定义工具调用（Function Call）和长文本推理（支持最大 128K 上下文）等高级功能。本代模型增加了多语言支持，支持包括日语，韩语，德语在内的 26 种语言。我们还推出了支持 1M 上下文长度（约 200 万中文字符）的 GLM-4-9B-Chat-1M 模型和基于 GLM-4-9B 的多模态模型 GLM-4V-9B。GLM-4V-9B 具备 1120 * 1120 高分辨率下的中英双语多轮对话能力，在中英文综合能力、感知推理、文字识别、图表理解等多方面多模态评测中，GLM-4V-9B 表现出超越 GPT-4-turbo-2024-04-09、Gemini 1.0 Pro、Qwen-VL-Max 和 Claude 3 Opus 的卓越性能。

Model List

Model	Type	Seq Length	Transformers Version	Download	Online Demo
GLM-4-9B	Base	8K	`4.44.0 - 4.45.0`	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	/
GLM-4-9B-Chat	Chat	128K	`>= 4.44.0`	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	🤖 ModelScope CPU 🤖 ModelScope vLLM
GLM-4-9B-Chat-HF	Chat	128K	`>= 4.46.0`	🤗 Huggingface 🤖 ModelScope	🤖 ModelScope CPU 🤖 ModelScope vLLM
GLM-4-9B-Chat-1M	Chat	1M	`>= 4.44.0`	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	/
GLM-4-9B-Chat-1M-HF	Chat	1M	`>= 4.46.0`	🤗 Huggingface 🤖 ModelScope	/
GLM-4V-9B	Chat	8K	`>= 4.46.0`	🤗 Huggingface 🤖 ModelScope 🟣 WiseModel	🤖 ModelScope

评测结果

对话模型典型任务

Model	AlignBench	MT-Bench	IFEval	MMLU	C-Eval	GSM8K	MATH	HumanEval	NaturalCodeBench
Llama-3-8B-Instruct	6.40	8.00	68.6	68.4	51.3	79.6	30.0	62.2	24.7
ChatGLM3-6B	5.18	5.50	28.1	61.4	69.0	72.3	25.7	58.5	11.3
GLM-4-9B-Chat	7.01	8.35	69.0	72.4	75.6	79.6	50.6	71.8	32.2

基座模型典型任务

Model	MMLU	C-Eval	GPQA	GSM8K	MATH	HumanEval
Llama-3-8B	66.6	51.2	-	45.8	-	33.5
Llama-3-8B-Instruct	68.4	51.3	34.2	79.6	30.0	62.2
ChatGLM3-6B-Base	61.4	69.0	26.8	72.3	25.7	58.5
GLM-4-9B	74.7	77.1	34.3	84.0	30.4	70.1

由于 GLM-4-9B 在预训练过程中加入了部分数学、推理、代码相关的 instruction 数据，所以将 Llama-3-8B-Instruct 也列入比较范围。

长文本

在 1M 的上下文长度下进行大海捞针实验，结果如下：

在 LongBench-Chat 上对长文本能力进行了进一步评测，结果如下:

多语言能力

在六个多语言数据集上对 GLM-4-9B-Chat 和 Llama-3-8B-Instruct 进行了测试，测试结果及数据集对应选取语言如下表

Dataset	Llama-3-8B-Instruct	GLM-4-9B-Chat	Languages
M-MMLU	49.6	56.6	all
FLORES	25.0	28.8	ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no
MGSM	54.0	65.3	zh, en, bn, de, es, fr, ja, ru, sw, te, th
XWinograd	61.7	73.1	zh, en, fr, jp, ru, pt
XStoryCloze	84.7	90.7	zh, en, ar, es, eu, hi, id, my, ru, sw, te
XCOPA	73.3	80.1	zh, et, ht, id, it, qu, sw, ta, th, tr, vi

工具调用能力

我们在 Berkeley Function Calling Leaderboard 上进行了测试并得到了以下结果：

Model	Overall Acc.	AST Summary	Exec Summary	Relevance
Llama-3-8B-Instruct	58.88	59.25	70.01	45.83
gpt-4-turbo-2024-04-09	81.24	82.14	78.61	88.75
ChatGLM3-6B	57.88	62.18	69.78	5.42
GLM-4-9B-Chat	81.00	80.26	84.40	87.92

多模态能力

GLM-4V-9B 是一个多模态语言模型，具备视觉理解能力，其相关经典任务的评测结果如下：

	MMBench-EN-Test	MMBench-CN-Test	SEEDBench_IMG	MMStar	MMMU	MME	HallusionBench	AI2D	OCRBench
gpt-4o-2024-05-13	83.4	82.1	77.1	63.9	69.2	2310.3	55.0	84.6	736
gpt-4-turbo-2024-04-09	81.0	80.2	73.0	56.0	61.7	2070.2	43.9	78.6	656
gpt-4-1106-preview	77.0	74.4	72.3	49.7	53.8	1771.5	46.5	75.9	516
InternVL-Chat-V1.5	82.3	80.7	75.2	57.1	46.8	2189.6	47.4	80.6	720
LLaVA-Next-Yi-34B	81.1	79.0	75.7	51.6	48.8	2050.2	34.8	78.9	574
Step-1V	80.7	79.9	70.3	50.0	49.9	2206.4	48.4	79.2	625
MiniCPM-Llama3-V2.5	77.6	73.8	72.3	51.8	45.8	2024.6	42.4	78.4	725
Qwen-VL-Max	77.6	75.7	72.7	49.5	52.0	2281.7	41.2	75.7	684
Gemini 1.0 Pro	73.6	74.3	70.7	38.6	49.0	2148.9	45.7	72.9	680
Claude 3 Opus	63.3	59.2	64.0	45.7	54.9	1586.8	37.8	70.6	694
GLM-4V-9B	81.1	79.4	76.8	58.7	47.2	2163.8	46.6	81.1	786

快速调用

硬件配置和系统要求，请查看这里。

使用以下方法快速调用 GLM-4-9B-Chat 语言模型

使用 transformers 后端进行推理:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0' # 设置 GPU 编号，如果单机单卡指定一个，单机多卡指定多个 GPU 编号
MODEL_PATH = "THUDM/glm-4-9b-chat-hf"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

使用 vLLM 后端进行推理:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# 如果遇见 OOM 现象，建议减少max_model_len，或者增加tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat-hf"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # GLM-4-9B-Chat-1M 如果遇见 OOM 现象，建议开启下述参数
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

使用以下方法快速调用 GLM-4V-9B 多模态模型