Hello, friends! I'm Xiaozhi, your AI guide, and welcome to today's AI study session. Today we'll dive into the wonderful world of AI together, explore "LLM Tutorial for Developers: Document Splitting (English Version)", and work through every knowledge point covered in this article. As always: "No need to journey into the unknown, just awaken your own potential!" Follow along with Xiaozhi and we'll learn, apply what we learn, and discover more of what we're capable of. Without further ado, let's begin this AI learning journey.

LLM Tutorial for Developers: Document Splitting (English Version)

English Version

1. Splitting Short Sentences

# Import the text splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26    # chunk size
chunk_overlap = 4  # overlap between adjacent chunks

# Initialize the text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Recursive character splitter result

text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"  # test text
r_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

len("l m n o p q r s t u v w x")

25

Each chunk stays within the 26-character limit, and because chunk_overlap=4, every chunk repeats the last few characters of the previous one, which is why 'l m' and 'w x' each appear twice.

Character splitter result

# Character text splitter
c_splitter.split_text(text)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

Nothing was split: CharacterTextSplitter cuts only on its default separator '\n\n', which never appears in the test string.

Character splitter with a space separator

# Use a space as the separator
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
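
To make the role of chunk_overlap concrete, here is a minimal sketch of my own (not from the course): the same space-separated splitter, but with the overlap set to 0, so adjacent chunks no longer share a tail.

# Hypothetical variant: same settings except chunk_overlap=0, so the
# "l m" / "w x" repetitions seen above disappear.
c_splitter_no_overlap = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=0,
    separator=' '
)
c_splitter_no_overlap.split_text(text)
# expected: ['a b c d e f g h i j k l m', 'n o p q r s t u v w x y z']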

2. Splitting Long Text

# A long paragraph for recursive splitting
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

# The recursive character splitter takes a list of separators, tried in
# order: double newline, single newline, space, then the empty string.
# It first splits on "\n\n" and falls back to the later separators for
# any piece that is still too long.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Character splitter result:

c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

Recursive character splitter result:

# Split result
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Adding sentence-level splitting:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
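
One caveat worth flagging (based on my understanding of later library versions, not on the course material): the lookbehind "(?<=\. )" only takes effect when separators are interpreted as regular expressions. Recent LangChain releases escape separators as literals by default, so reproducing this result there requires opting in:

# Hedged sketch for newer langchain / langchain-text-splitters releases:
# is_separator_regex=True makes "(?<=\. )" act as a regex lookbehind
# instead of a literal string.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,
)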

3. Token-based Splitting

# Use the token splitter. With chunk_size=1 and chunk_overlap=0, any
# string is split into a list of individual tokens.
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
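
Because LLM context windows are measured in tokens, token counts often track model limits better than character counts. As a small sketch of my own (reusing some_text from section 2, with an illustrative chunk size), a larger chunk_size packs roughly that many BPE tokens into each chunk:

# Sketch: each chunk now holds about 10 tokens. TokenTextSplitter relies
# on tiktoken for tokenization, so that package must be installed.
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
chunks = text_splitter.split_text(some_text)
print(chunks[0])  # the first ~10-token slice of the paragraph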


4. Splitting a Custom Markdown Document

# Define a Markdown document
from langchain.document_loaders import NotionDirectoryLoader      # Notion loader
from langchain.text_splitter import MarkdownHeaderTextSplitter    # Markdown splitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

# Headers to split on, with the metadata keys to record for each level
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown header splitter and split the document
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print("The first chunk")
print(md_header_splits[0])
print("The second chunk")
print(md_header_splits[1])

The first chunk
page_content='Hi this is Jim \nHi this is Joe \n### Section \nHi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
The second chunk
page_content='Hi this is Molly' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'}
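
Each split is a Document whose metadata records the header path it came from, which is what later makes metadata-filtered retrieval possible. A quick way to see this (a minimal sketch, not in the original) is to loop over the splits:

# Sketch: show how the header hierarchy lands in each chunk's metadata
for doc in md_header_splits:
    print(doc.metadata, "->", doc.page_content[:40])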

5. Splitting Markdown Documents from a Database

# Load the contents of the Notion database
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])  # concatenate the documents

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the document splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)  # split the text
print(md_header_splits[0])  # first split

page_content="Let's talk about stress. Too much stress. \nWe know this can be a topic. \nSo let's get this conversation going. \n[Intro: two things you should know](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Intro%20two%20things%20you%20should%20know%20b5fd0c5393a9498b93396e79fe71e8bf.md) \n[What is stress](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20is%20stress%20b198b685ed6a474ab14f6fafff7004b6.md) \n[When is there too much stress?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/When%20is%20there%20too%20much%20stress%20dc135b9a86a843cbafd115aa128c5c90.md) \n[What can I do](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20I%20do%2009c1b13703ef42d4a889e2059c5b25fe.md) \n[What can Blendle do?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20Blendle%20do%20618ab89df4a647bf96e7b432af82779f.md) \n[Good reads](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Good%20reads%20e817491d84d549f886af972e0668192e.md) \nGo to **#letstalkaboutstress** on slack to chat about this topic" metadata={'Header 1': '#letstalkaboutstress'}
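
Header-level splits can still be arbitrarily long, so a common follow-up (a hedged sketch with illustrative sizes, not part of the course code) is to run a size-based splitter on top of the header splits, which preserves the header metadata on every sub-chunk:

# Sketch: second-stage split of the header-based Documents; chunk_size
# and chunk_overlap here are illustrative values, not from the source.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_splits = text_splitter.split_documents(md_header_splits)
print(len(final_splits))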


Hey friends, today's AI exploration has come to an end. That wraps up "LLM Tutorial for Developers: Document Splitting (English Version)". Thanks for coming along; I hope this trip helped you understand and enjoy AI a little more. Remember, asking precise questions is the key to unlocking AI's potential! If you'd like to learn more about AI, follow our official site "AI智研社", where there's plenty more to take away!
