
Hello, friends! I'm Xiaozhi, your AI guide, and welcome to your daily AI learning session. Today we dive into the wonderful world of AI with "LLM Tutorial for Developers: Document Splitting (English Version)" and work through every topic it covers. As always: no need to venture into the unknown, just awaken your own potential! Follow Xiaozhi's lead, learn it, apply it, and discover what more you can do. Without further ado, let's begin this journey.

LLM Tutorial for Developers: Document Splitting (English Version)

English Version

1. Splitting short sentences

# Import the text splitters
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)

chunk_size = 26    # chunk size
chunk_overlap = 4  # overlap between adjacent chunks

# Initialize the text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Output of the recursive character splitter

text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"  # test text
r_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

len("l m n o p q r s t u v w x")

25
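Note that "l m" and "w x" each appear in two adjacent chunks: that repetition is the chunk_overlap=4 characters carried over from one chunk to the next. A minimal sketch to see the effect, rerunning with the overlap disabled:

# Same alphabet string, but with chunk_overlap=0: adjacent chunks
# should no longer share the trailing "l m" / "w x" characters
no_overlap_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=0
)
print(no_overlap_splitter.split_text(text))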

Output of the character splitter

# Character text splitter (default separator is "\n\n")
c_splitter.split_text(text)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

The text comes back as a single chunk because the character splitter's default separator, "\n\n", never appears in the string.

A character splitter with a space separator

# Use a space as the separator
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

2. Splitting long text

# Recursively split a long paragraph
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

'''
The recursive character splitter takes a list of separators tried in order:
double newline, single newline, space, empty string. It therefore splits on
"\n\n" first and only falls back to the later separators when a chunk is
still too large.
'''
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Output of the character splitter:

c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

Output of the recursive character splitter:

# Split result
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
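The contrast is worth noting: the character splitter cuts on spaces as soon as the 450-character budget is filled, even mid-sentence, while the recursive splitter tries "\n\n" first and so keeps the two paragraphs intact. A quick length check (a minimal sketch) confirms both stay within budget:

# Print the size of each chunk from both splitters; every chunk
# should come in at or under the 450-character budget
for name, splitter in [("character", c_splitter), ("recursive", r_splitter)]:
    chunks = splitter.split_text(some_text)
    print(name, [len(chunk) for chunk in chunks])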

Adding sentence-aware splitting:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
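Here "(?<=\. )" is a regex lookbehind: it splits after a period-plus-space while keeping the period attached to its sentence. Newer LangChain releases escape separators by default, so a regex separator has to be enabled explicitly; a sketch, assuming a version that supports the is_separator_regex flag:

# Opt in to regex separators on newer LangChain versions
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""],
    is_separator_regex=True,
)
r_splitter.split_text(some_text)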

3. Token-based splitting

# Use the token splitter with chunk_size=1 and chunk_overlap=0,
# which breaks any string into a list of single-token chunks
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
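TokenTextSplitter counts tokens with the tiktoken package, which is why "bazzyfoo" breaks into the sub-word pieces " b", "az", "zy". A chunk_size of 1 is purely illustrative; for retrieval you would use something larger (a sketch, the sizes here are assumptions):

# Roughly 500-token chunks with a 50-token overlap, a more typical
# configuration when preparing documents for retrieval
doc_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
doc_chunks = doc_splitter.split_text(some_text)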


4. Splitting a custom Markdown document

# Define a Markdown document
from langchain.document_loaders import NotionDirectoryLoader  # Notion loader
from langchain.text_splitter import MarkdownHeaderTextSplitter  # Markdown splitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

# Choose which headers to split on (the output below was produced with
# these two levels; "###" is deliberately left out)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown header splitter and split the document
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print("The first chunk")
print(md_header_splits[0])
print("The second chunk")
print(md_header_splits[1])

The first chunk
page_content='Hi this is Jim  \nHi this is Joe  \n### Section  \nHi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
The second chunk
page_content='Hi this is Molly' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'}

Because "###" was not registered in headers_to_split_on, the Section heading and its text stay inside the Chapter 1 chunk.
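Header-based splits can still be arbitrarily long, because MarkdownHeaderTextSplitter groups text by structure, not by size. A common follow-up (a minimal sketch, with assumed sizes) is to size-split each header chunk while keeping its header metadata:

# Size-split the header chunks; split_documents keeps each chunk's
# header metadata attached to the smaller pieces
from langchain.text_splitter import RecursiveCharacterTextSplitter

size_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=30)
final_splits = size_splitter.split_documents(md_header_splits)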

5. Splitting Markdown documents from a database

# Load the content of the Notion database
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])  # concatenate the documents

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown header splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)  # split the text
print(md_header_splits[0])  # first split

page_content='Let's talk about stress. Too much stress. \nWe know this can be a topic. \nSo let's get this conversation going. \n[Intro: two things you should know](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Intro%20two%20things%20you%20should%20know%20b5fd0c5393a9498b93396e79fe71e8bf.md) \n[What is stress](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20is%20stress%20b198b685ed6a474ab14f6fafff7004b6.md) \n[When is there too much stress?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/When%20is%20there%20too%20much%20stress%20dc135b9a86a843cbafd115aa128c5c90.md) \n[What can I do](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20I%20do%2009c1b13703ef42d4a889e2059c5b25fe.md) \n[What can Blendle do?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20Blendle%20do%20618ab89df4a647bf96e7b432af82779f.md) \n[Good reads](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Good%20reads%20e817491d84d549f886af972e0668192e.md) \nGo to **#letstalkaboutstress** on slack to chat about this topic' metadata={'Header 1': '#letstalkaboutstress'}
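Each resulting Document keeps the headers it fell under as metadata, which you can later use for filtering or attribution at retrieval time. A quick way to inspect this (a minimal sketch):

# Show the header metadata attached to the first few splits
for doc in md_header_splits[:3]:
    print(doc.metadata)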


Hey friends, today's AI exploration has come to an end. That's everything on "LLM Tutorial for Developers: Document Splitting (English Version)". Thanks for coming along; I hope this trip helped you understand and enjoy AI a little more. Remember, precise questions are the key that unlocks an AI's potential! If you want to keep learning, follow our site "AI智研社" for plenty more.


Copyright: please credit the source when reposting: https://www.ai-blog.cn/2791.html
