Hello, friends! I'm Xiaozhi, your AI guide, and welcome to today's AI study session. Today we'll dive into the wonderful world of AI together, explore "LLM Tutorial for Developers: Document Splitting (English Version)", and work through every knowledge point covered in this article. As always: "No need to journey into the unknown, just awaken your own potential!" Follow along with Xiaozhi and we'll learn, apply what we learn, and discover more of what we're capable of. Without further ado, let's begin this AI learning journey.

LLM Tutorial for Developers: Document Splitting (English Version)

English Version

1. Splitting Short Sentences

# Import the text splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26    # chunk size
chunk_overlap = 4  # overlap between adjacent chunks

# Initialize the text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Recursive character splitter result

text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"  # test text
r_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

len("l m n o p q r s t u v w x")

25

Each chunk stays within the 26-character limit, and because chunk_overlap=4, every chunk repeats the last few characters of the previous one, which is why 'l m' and 'w x' each appear twice.

Character splitter result

# Character text splitter
c_splitter.split_text(text)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

Nothing was split: CharacterTextSplitter cuts only on its default separator '\n\n', which never appears in the test string.

Character splitter with a space separator

# Use a space as the separator
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
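
To make the role of chunk_overlap concrete, here is a minimal sketch of my own (not from the course): the same space-separated splitter, but with the overlap set to 0, so adjacent chunks no longer share a tail.

# Hypothetical variant: same settings except chunk_overlap=0, so the
# "l m" / "w x" repetitions seen above disappear.
c_splitter_no_overlap = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=0,
    separator=' '
)
c_splitter_no_overlap.split_text(text)
# expected: ['a b c d e f g h i j k l m', 'n o p q r s t u v w x y z']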

2. Splitting Long Text

# A long paragraph for recursive splitting
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

# The recursive character splitter takes a list of separators, tried in
# order: double newline, single newline, space, then the empty string.
# It first splits on "\n\n" and falls back to the later separators for
# any piece that is still too long.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Character splitter result:

c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

Recursive character splitter result:

# Split result
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Adding sentence-level splitting:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
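
One caveat worth flagging (based on my understanding of later library versions, not on the course material): the lookbehind "(?<=\. )" only takes effect when separators are interpreted as regular expressions. Recent LangChain releases escape separators as literals by default, so reproducing this result there requires opting in:

# Hedged sketch for newer langchain / langchain-text-splitters releases:
# is_separator_regex=True makes "(?<=\. )" act as a regex lookbehind
# instead of a literal string.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,
)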

3. Token-based Splitting

# Use the token splitter. With chunk_size=1 and chunk_overlap=0, any
# string is split into a list of individual tokens.
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
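
Because LLM context windows are measured in tokens, token counts often track model limits better than character counts. As a small sketch of my own (reusing some_text from section 2, with an illustrative chunk size), a larger chunk_size packs roughly that many BPE tokens into each chunk:

# Sketch: each chunk now holds about 10 tokens. TokenTextSplitter relies
# on tiktoken for tokenization, so that package must be installed.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
chunks = text_splitter.split_text(some_text)
print(chunks[0])  # the first ~10-token slice of the paragraph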


4. Splitting a Custom Markdown Document

# Define a Markdown document
from langchain.document_loaders import NotionDirectoryLoader      # Notion loader
from langchain.text_splitter import MarkdownHeaderTextSplitter    # Markdown splitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

# Headers to split on, with the metadata keys to record for each level
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown header splitter and split the document
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print("The first chunk")
print(md_header_splits[0])
print("The second chunk")
print(md_header_splits[1])

The first chunk
page_content='Hi this is Jim \nHi this is Joe \n### Section \nHi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
The second chunk
page_content='Hi this is Molly' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'}
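
Each split is a Document whose metadata records the header path it came from, which is what later makes metadata-filtered retrieval possible. A quick way to see this (a minimal sketch, not in the original) is to loop over the splits:

# Sketch: show how the header hierarchy lands in each chunk's metadata
for doc in md_header_splits:
    print(doc.metadata, "->", doc.page_content[:40])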

5. Splitting Markdown Documents from a Database

# Load the contents of the Notion database
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])  # concatenate the documents

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the document splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)  # split the text
print(md_header_splits[0])  # first split

page_content="Let's talk about stress. Too much stress. \nWe know this can be a topic. \nSo let's get this conversation going. \n[Intro: two things you should know](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Intro%20two%20things%20you%20should%20know%20b5fd0c5393a9498b93396e79fe71e8bf.md) \n[What is stress](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20is%20stress%20b198b685ed6a474ab14f6fafff7004b6.md) \n[When is there too much stress?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/When%20is%20there%20too%20much%20stress%20dc135b9a86a843cbafd115aa128c5c90.md) \n[What can I do](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20I%20do%2009c1b13703ef42d4a889e2059c5b25fe.md) \n[What can Blendle do?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20Blendle%20do%20618ab89df4a647bf96e7b432af82779f.md) \n[Good reads](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Good%20reads%20e817491d84d549f886af972e0668192e.md) \nGo to **#letstalkaboutstress** on slack to chat about this topic" metadata={'Header 1': '#letstalkaboutstress'}
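
Header-level splits can still be arbitrarily long, so a common follow-up (a hedged sketch with illustrative sizes, not part of the course code) is to run a size-based splitter on top of the header splits, which preserves the header metadata on every sub-chunk:

# Sketch: second-stage split of the header-based Documents; chunk_size
# and chunk_overlap here are illustrative values, not from the source.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_splits = text_splitter.split_documents(md_header_splits)
print(len(final_splits))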


Hey friends, today's AI exploration has come to an end. That wraps up "LLM Tutorial for Developers: Document Splitting (English Version)". Thanks for coming along; I hope this trip helped you understand and enjoy AI a little more. Remember, asking precise questions is the key to unlocking AI's potential! If you'd like to learn more about AI, follow our official site "AI智研社", where there's plenty more to take away!
