AI Tutorials, January 17, 2025

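Hello, friends! I'm Xiaozhi, your AI guide, and welcome to today's AI study session. Today we'll work through "LLM Introductory Tutorial for Developers: Vector Databases and Word Embeddings (English Version)" and cover every point in it. As always: "No need to journey into the unknown, just awaken your potential!" Follow along and we'll learn it, apply it, and discover more of what we can do. Without further ado, let's begin.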

LLM Introductory Tutorial for Developers: Vector Databases and Word Embeddings (English Version)

English version

1. Load the documents

from langchain.document_loaders import PyPDFLoader

# Load the PDFs
loaders = [
    # Deliberately add a duplicate document to make the data messy
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

Split the documents

# Split the text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,  # Size of each chunk: the splitter aims for at most 1500 characters per chunk
    chunk_overlap=150  # Number of characters shared between consecutive chunks
)

splits = text_splitter.split_documents(docs)

print(len(splits))

209
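
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not how RecursiveCharacterTextSplitter works internally (the real splitter also tries to break on separators such as paragraph breaks and newlines). The helper name split_with_overlap and the sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. Hypothetical helper
    for illustration only: it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still appears with some surrounding context in at least one chunk.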

2. Embedding

from langchain.embeddings.openai import OpenAIEmbeddings
import numpy as np

embedding = OpenAIEmbeddings()

sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

print("Sentence 1 VS sentence 2")
print(np.dot(embedding1, embedding2))
print("Sentence 1 VS sentence 3")
print(np.dot(embedding1, embedding3))
print("Sentence 2 VS sentence 3")
print(np.dot(embedding2, embedding3))

Sentence 1 VS sentence 2
0.9632026347895142
Sentence 1 VS sentence 3
0.7711302839662464
Sentence 2 VS sentence 3
0.759699788340627
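
The code above scores similarity with a plain dot product. That is only a fair comparison when the vectors have the same length; OpenAI's documentation describes its embeddings as normalized to length 1, in which case the dot product equals cosine similarity. For other embedding models this is an assumption worth checking. The sketch below uses made-up 3-dimensional toy vectors (not real embeddings) to show the difference.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so magnitude is ignored."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors invented for illustration
v1 = [3.0, 4.0, 0.0]
v2 = [6.0, 8.0, 0.0]  # same direction as v1, twice as long
v3 = [0.0, 0.0, 5.0]  # orthogonal to v1

print(dot(v1, v2))                # 50.0, inflated by the vector lengths
print(cosine_similarity(v1, v2))  # 1.0, identical direction
print(cosine_similarity(v1, v3))  # 0.0, nothing in common
```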

3. Initialize Chroma

from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/cs229_lectures/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory  # Lets us save the vector store to disk under persist_directory
)

print(vectordb._collection.count())

100%|██████████| 1/1 [00:02<00:00, 2.62s/it]

209

3. Similarity search

question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question, k=3)

print("Length of docs: ", len(docs))
print("Page content:")
print(docs[0].page_content)

Length of docs: 3
Page content:
cs229-qa@cs.stanford.edu. This goes to an acc ount that’s read by all the TAs and me. So
rather than sending us email individually, if you send email to this account, it will
actually let us get back to you maximally quickly with answers to your questions.
If you’re asking questions about homework probl ems, please say in the subject line which
assignment and which question the email refers to, since that will also help us to route
your question to the appropriate TA or to me appropriately and get the response back to
you quickly.
Let’s see. Skipping ahead — let’s see — for homework, one midterm, one open and term
project. Notice on the honor code. So one thi ng that I think will help you to succeed and
do well in this class and even help you to enjoy this cla ss more is if you form a study
group.
So start looking around where you’ re sitting now or at the end of class today, mingle a
little bit and get to know your classmates. I strongly encourage you to form study groups
and sort of have a group of people to study with and have a group of your fellow students
to talk over these concepts with. You can also post on the class news group if you want to
use that to try to form a study group.
But some of the problems sets in this cla ss are reasonably difficult. People that have
taken the class before may tell you they were very difficult. And just I bet it would be
more fun for you, and you’d probably have a be tter learning experience if you form a
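
Conceptually, similarity_search embeds the question and returns the k stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force scan over made-up 2-dimensional toy vectors; it is an illustration under those assumptions, not Chroma's implementation (a real vector store typically relies on an index rather than a linear scan, and may use a different distance measure). The names top_k and store are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k):
    """Return the k texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,  # most similar first
    )
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs; the vectors are invented stand-ins for embeddings
store = [
    ("email the TAs", [0.9, 0.1]),
    ("form a study group", [0.5, 0.5]),
    ("MATLAB tutorial", [0.1, 0.9]),
]
query = [1.0, 0.0]  # pretend embedding of the question

results = top_k(query, store, k=2)
print(results)
# ['email the TAs', 'form a study group']
```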

Persist the database

vectordb.persist()

4. Duplicate chunks

question = "what did they say about matlab?"

docs = vectordb.similarity_search(question, k=5)

print("docs[0]")
print(docs[0])

print("docs[1]")
print(docs[1])

docs[0]
page_content='those homeworks will be done in either MATLA B or in Octave, which
is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it
sort of is, sort of isn’t. \nSo I guess for those of you that haven’t s een
MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the
programming language that makes it very easy to write codes using matrices, to
write code for numerical routines, to move data around, to \nplot data. And it’s
sort of an extremely easy to learn tool to use for implementing a lot of
\nlearning algorithms. \nAnd in case some of you want to work on your own home
computer or something if you \ndon’t have a MATLAB license, for the purposes of
this class, there’s also — [inaudible] \nwrite that down [inaudible] MATLAB —
there’ s also a software package called Octave \nthat you can download for free
off the Internet. And it has somewhat fewer features than MATLAB, but it’s free,
and for the purposes of this class, it will work for just about \neverything.
\nSo actually I, well, so yeah, just a side comment for those of you that
haven’t seen \nMATLAB before I guess, once a colleague of mine at a different
university, not at \nStanford, actually teaches another machine l earning course.
He’s taught it for many years. \nSo one day, he was in his office, and an old
student of his from, lik e, ten years ago came \ninto his office and he said,
“Oh, professo r, professor, thank you so much for your' metadata={'source':
'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
docs[1]
page_content='those homeworks will be done in either MATLA B or in Octave, which
is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it
sort of is, sort of isn’t. \nSo I guess for those of you that haven’t s een
MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the
programming language that makes it very easy to write codes using matrices, to
write code for numerical routines, to move data around, to \nplot data. And it’s
sort of an extremely easy to learn tool to use for implementing a lot of
\nlearning algorithms. \nAnd in case some of you want to work on your own home
computer or something if you \ndon’t have a MATLAB license, for the purposes of
this class, there’s also — [inaudible] \nwrite that down [inaudible] MATLAB —
there’ s also a software package called Octave \nthat you can download for free
off the Internet. And it has somewhat fewer features than MATLAB, but it’s free,
and for the purposes of this class, it will work for just about \neverything.
\nSo actually I, well, so yeah, just a side comment for those of you that
haven’t seen \nMATLAB before I guess, once a colleague of mine at a different
university, not at \nStanford, actually teaches another machine l earning course.
He’s taught it for many years. \nSo one day, he was in his office, and an old
student of his from, lik e, ten years ago came \ninto his office and he said,
“Oh, professo r, professor, thank you so much for your' metadata={'source':
'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
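
The two results are identical because Lecture01 was loaded twice, so every one of its chunks is stored twice. One simple workaround, which is not part of the original tutorial, is to drop results whose text has already been seen. The sketch below uses a small Doc dataclass as a hypothetical stand-in for LangChain's Document, and the sample results are made-up toy data.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, in order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Toy results: the first two entries have identical text
results = [
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("a short MATLAB tutorial in one of the discussion sections", {"page": 9}),
]

unique = drop_duplicate_chunks(results)
print(len(unique))
# 2
```

Filtering after retrieval means fewer than k results may come back; removing the duplicate file before indexing avoids the problem at its source.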

5. Retrieving the wrong answer

question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print(doc.metadata)

print("docs-4:")
print(docs[4].page_content)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
docs-4:
into his office and he said, “Oh, professo r, professor, thank you so much for your
machine learning class. I learned so much from it. There’s this stuff that I learned in your
class, and I now use every day. And it’s help ed me make lots of money, and here’s a
picture of my big house.”
So my friend was very excited. He said, “W ow. That’s great. I’m glad to hear this
machine learning stuff was actually useful. So what was it that you learned? Was it
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you
learned that was so helpful?” And the student said, “Oh, it was the MATLAB.”
So for those of you that don’t know MATLAB yet, I hope you do learn it. It’s not hard,
and we’ll actually have a short MATLAB tutori al in one of the discussion sections for
those of you that don’t know it.
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion
sections will be taught by the TAs, and atte ndance at discussion sections is optional,
although they’ll also be recorded and televi sed. And we’ll use the discussion sections
mainly for two things. For the next two or th ree weeks, we’ll use the discussion sections
to go over the prerequisites to this class or if some of you haven’t seen probability or
statistics for a while or maybe algebra, we’ll go over those in the discussion sections as a
refresher for those of you that want one.
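
The question asked about the third lecture, yet chunks from Lecture01 and Lecture02 came back: the embedding captures the overall meaning of the sentence, not the structured constraint "third lecture". One possible remedy, sketched here as an assumption rather than the tutorial's own method, is to restrict results by their source metadata. The helper filter_by_source is hypothetical; the metadata values are copied from the output above.

```python
def filter_by_source(metadatas, source):
    """Keep only the metadata entries that come from the given source file."""
    return [meta for meta in metadatas if meta["source"] == source]


# Metadata of the five results printed above
retrieved = [
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 14},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 6},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 8},
]

lecture3 = "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
kept = filter_by_source(retrieved, lecture3)
print([meta["page"] for meta in kept])
# [0, 14, 6]
```

LangChain's Chroma wrapper also accepts a filter argument on similarity_search (for example filter={"source": ...}), which applies the constraint inside the store instead of after retrieval; check the behavior in the version you have installed.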


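That wraps up today's AI session, friends. We've now covered "LLM Introductory Tutorial for Developers: Vector Databases and Word Embeddings (English Version)". Thank you for joining, and I hope it left you understanding and enjoying AI a little more. Remember, asking precise questions is the key to unlocking AI's potential! If you'd like to keep learning, follow our official site, "AI智研社".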

Copyright: please credit the source when reposting: https://www.ai-blog.cn/2807.html
