Author
송명하 / MLOps Engineer
Category
Hands-on
Tags
Vector Database, NLP
Published
December 1, 2023
- Setting Up the Environment
- Building the Databases
- qdrant DB
- milvus DB
- postgres (pgvector)
- Downloading the Dataset
- Installing Libraries
- What is the SQuAD v2.0 Dataset?
- Building the Embedding Function
- Running the Hands-on
- Qdrant
- milvus
- pgvector
Setting Up the Environment
Building the Databases
qdrant DB
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
milvus DB
wget https://github.com/milvus-io/milvus/releases/download/v2.3.3/milvus-standalone-docker-compose.yml -O docker-compose.yml
sudo docker-compose up -d
postgres (pgvector)
Build the image from a postgres Dockerfile that installs the pgvector extension on top of the official postgres image (the prebuilt pgvector/pgvector image on Docker Hub is an alternative), then run the container:
docker build -t pg-vector .
docker run -it -p 5432:5432 -e POSTGRES_PASSWORD=1234 -e POSTGRES_HOST_AUTH_METHOD=trust pg-vector
Downloading the Dataset
Installing Libraries
The packages that need to be installed before starting the hands-on are as follows.
- transformers
- torch
- qdrant_client
- pymilvus
- psycopg2
# install library
!pip install transformers
!pip install torch
!pip install qdrant_client
!pip install pymilvus
!pip install psycopg2
What is the SQuAD v2.0 Dataset?
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
import requests
# Download the SQuAD v2.0 dataset
squad_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
response = requests.get(squad_url)
squad_data = response.json()
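The JSON nests articles, paragraphs, and question-answer pairs, which is exactly the structure the loops below iterate over. A quick peek at the first entry (illustrative, not part of the original code):
# Peek at the nested structure: data -> paragraphs -> qas
first_qa = squad_data["data"][0]["paragraphs"][0]["qas"][0]
print(first_qa["question"])      # the question text
print(first_qa["answers"][:1])   # answer spans; empty for unanswerable questions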
Building the Embedding Function
Build the function that converts dataset entries into embeddings. The same embedding function is shared by every vector database (VDB) in the hands-on below.
from transformers import BertTokenizer, BertModel
import torch
# Loading the tokenizer and BERT model for use
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def embed_text(text):
    """Returns the BERT embedding for the given text (mean-pooled over tokens)."""
    inputs = tokenizer(text, return_tensors='pt')
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(1).detach().numpy()[0]
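As a quick sanity check (not part of the original code), the function returns a single 768-dimensional vector per input text, which matches the vector size configured for each database below:
sample = embed_text("who is Beyonce")
print(sample.shape)   # (768,) -- one mean-pooled BERT vector per input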
Running the Hands-on
Qdrant
- Initialize the qdrant client.
- Create a collection in qdrant.
- Store embeddings for 1,000 questions from the SQuAD v2.0 dataset.
# prepare qdrant client
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
qdrant = QdrantClient(host="localhost", port=6333)
# create collection
qdrant.create_collection(
    collection_name="questions",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
After the questions collection is created, the following output is returned.
[Output] True
Qdrant supports HNSW indexing by default, and the index parameters can be configured when the collection is created, before any embeddings are stored. Alternatively, passing the exact=True search parameter makes Qdrant perform an exact, brute-force search that bypasses the approximate index (a sketch of this follows the search results below).
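For example, the HNSW parameters can be set explicitly when a collection is created. A minimal sketch, where the collection name and parameter values are illustrative and not part of the original walkthrough:
from qdrant_client.http.models import HnswConfigDiff

# Hypothetical collection with explicit HNSW build settings
qdrant.create_collection(
    collection_name="questions_hnsw_tuned",                      # illustrative name
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=100),           # graph degree / build-time effort
)
The walkthrough below keeps the default settings and simply upserts the embeddings.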
points = []
count = 0
questions = []
answers = []
max_questions = 1000
# Store embeddings for 1,000 questions from the SQuAD v2.0 dataset in Qdrant.
for article in squad_data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            question = qa["question"]
            embedding = embed_text(question)
            answer = qa["answers"]
            point = PointStruct(
                id=count,
                vector=embedding.tolist(),
                payload={"question": question}
            )
            questions.append(question)
            answers.append(answer)
            points.append(point)
            count += 1
            if count >= max_questions:
                break
        if count >= max_questions:
            break
    if count >= max_questions:
        break
operation_info = qdrant.upsert(
    collection_name="questions",
    wait=True,
    points=points
)
The embeddings are upserted into the questions collection. Note that Qdrant only supports HNSW indexing natively, and there is no explicit index-build step in the code above: where the usual flow is insert embeddings -> build index -> search, Qdrant indexes the stored vectors itself.
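As a quick check (not in the original code), the number of stored points can be verified before moving on to search:
# Exact count of points in the collection; expected to be 1000
print(qdrant.count(collection_name="questions", exact=True).count)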
- Generate an embedding for the question to run a vector search on and perform the search. You can check how long the search takes and what it returns.
vector = embed_text("who is Beyonce")
%%time
results = qdrant.search(
    collection_name="questions", query_vector=vector, limit=5
)
The timing output is shown below.
[Output] CPU times: user 1.91 ms, sys: 1.1 ms, total: 3.02 ms
Wall time: 3.79 ms
The search results and similarity scores are shown below.
for i in results:
    print("question : {}, answer : {}, similarity score : {}".format(questions[i.id], answers[i.id][0]["text"], i.score))
[Output] question : Who is Beyoncé married to?, answer : Jay Z, similarity score : 0.831488
question : Who influenced Beyonce?, answer : Michael Jackson, similarity score : 0.8148161
question : Who did Beyoncé marry?, answer : Jay Z., similarity score : 0.8004105
question : When did Beyoncé release Formation?, answer : February 6, 2016, similarity score : 0.79073215
question : Which artist did Beyonce marry?, answer : Jay Z, similarity score : 0.7883082
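For comparison, the exact=True search parameter mentioned earlier performs a brute-force scan instead of using the HNSW index, which is handy for checking ANN recall. A minimal sketch using the same query vector:
from qdrant_client.http.models import SearchParams

# Exact (non-approximate) search over the same collection
exact_results = qdrant.search(
    collection_name="questions",
    query_vector=vector,
    search_params=SearchParams(exact=True),
    limit=5,
)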
milvus
- Initialize the milvus client.
- Create a collection in milvus.
- Store embeddings for 1,000 questions from the SQuAD v2.0 dataset.
- Load the stored collection, generate an embedding for the question to search on, and perform the search. You can check how long the search takes and what it returns.
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType, connections
import numpy as np
# prepare milvus client
connections.connect(host='localhost', port='19530')
# create collection
id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector_field = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
schema = CollectionSchema(fields=[id_field, vector_field], description="SQuAD Questions")
collection_name = "questions101"
collection = Collection(name=collection_name, schema=schema)
With the questions101 collection created, prepare the IDs and embeddings to insert.
count = 0
max_count = 1000
points = []
ids = []
embeddings = []
questions = []
answers = []
for article in squad_data['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            question = qa['question']
            answer = qa["answers"]
            embedding = embed_text(question)
            # Prepare IDs and embeddings as separate lists
            ids.append(count)
            embeddings.append(embedding.tolist())
            questions.append(question)
            answers.append(answer)
            count += 1
            if count >= max_count:
                break
        if count >= max_count:
            break
    if count >= max_count:
        break
# save milvus db
index_params = {
"metric_type": "COSINE",
"index_type": "IVF_FLAT",
"params": {"nlist": 768}
}
insert_data = [ids, embeddings]
collection.create_index(field_name="embedding", index_params=index_params)
insert_result = collection.insert(insert_data)  # MutationResult; kept separate so the ids list is not overwritten
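As an optional check (not part of the original flow), the insert can be flushed and the entity count inspected:
# Flush so the inserted data is sealed and reflected in the count
collection.flush()
print(collection.num_entities)   # expected: 1000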
Load the stored collection into memory and set the search parameters.
collection.load()
search_params = {
"metric_type": "COSINE",
"offset": 0,
"ignore_growing": False,
"params": {"nprobe": 10}
}
question = "who is Beyonce"
embedding = embed_text(question)
%%time
results = collection.search(
data=[embedding.tolist()],
anns_field="embedding",
param=search_params,
limit=5
)
The timing output is shown below.
[Output] CPU times: user 950 µs, sys: 895 µs, total: 1.85 ms
Wall time: 5.3 ms
The search results and similarity scores are shown below.
for hit in results[0]:  # results[0] holds the hits for the single query vector
    print("question : {}, answer : {}, similarity score : {}".format(questions[hit.id], answers[hit.id][0]["text"], hit.distance))
[Output] question : Who is Beyoncé married to?, answer : Jay Z, similarity score : 0.8314879536628723
question : Who influenced Beyonce?, answer : Michael Jackson, similarity score : 0.8148161768913269
question : Who did Beyoncé marry?, answer : Jay Z., similarity score : 0.8004105091094971
question : When did Beyoncé release Formation?, answer : February 6, 2016, similarity score : 0.79073215
question : Which artist did Beyonce marry?, answer : Jay Z, similarity score : 0.7883082
pgvector
- Initialize the pgvector client.
- Create a table in pgvector.
- Store embeddings for 1,000 questions from the SQuAD v2.0 dataset.
- Create an index on the table, generate an embedding for the question to search on, and perform the search. You can check how long the search takes and what it returns.
import psycopg2
# prepare pgvector client
conn = psycopg2.connect(host="localhost", user="postgres", password="1234", port=5432)
cur = conn.cursor()
# Install the vector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
# create table
cur.execute('CREATE TABLE questions (id bigserial PRIMARY KEY, embedding vector(768))')
conn.commit()
Store embeddings for 1,000 questions from the SQuAD v2.0 dataset in the questions table.
points = []
count = 0
max_questions = 1000  # maximum number of questions to store
questions = []
answers = []
try:
    for article in squad_data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                question = qa['question']
                answer = qa["answers"]
                embedding = embed_text(question)
                cur.execute('INSERT INTO questions (embedding) VALUES (%s)', (embedding.tolist(),))
                conn.commit()
                questions.append(question)
                answers.append(answer)
                count += 1
                if count >= max_questions:
                    break
            if count >= max_questions:
                break
        if count >= max_questions:
            break
except psycopg2.DatabaseError as e:
    print(f"Database error: {e}")
    conn.rollback()
# Increase maintenance_work_mem for the index build
cur.execute("SET maintenance_work_mem = '128MB';")
# Create an IVFFlat index on the questions table
cur.execute('''
    CREATE INDEX IF NOT EXISTS idx_vector
    ON questions
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 768);
''')
conn.commit()
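At query time, the number of IVF lists probed can be raised to trade latency for recall, playing the same role as nprobe in Milvus. A small sketch (the value 10 is illustrative):
# Probe more lists per query for better recall (pgvector default is 1)
cur.execute("SET ivfflat.probes = 10;")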
With the IVFFlat index created on the questions table, generate an embedding for the query question and run the search.
question = "who is Beyonce"
embedding = embed_text(question)
%%time
try:
    cur.execute("SELECT * FROM questions ORDER BY embedding <-> %s::vector LIMIT 5;", (embedding.tolist(),))
    results = cur.fetchall()
except psycopg2.DatabaseError as e:
    print(f"Database error: {e}")
    conn.rollback()
The timing output is shown below.
[Output] CPU times: user 1.36 ms, sys: 1.05 ms, total: 2.42 ms
Wall time: 4.1 ms
The search results and similarity scores are shown below.
def cosine_similarity(vec_a, vec_b):
    """Compute the cosine similarity between two vectors."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
for i in results:
    # pgvector returns the embedding column as a string like '[0.1, 0.2, ...]'
    db_vector = np.array(eval(i[1]))
    similarity = cosine_similarity(embedding, db_vector)
    print("question : {}, answer : {}, similarity score : {}".format(questions[i[0]-1], answers[i[0]-1][0]["text"], similarity))
[Output] question : Who is Beyoncé married to?, answer : Jay Z, similarity score : 0.8314879702500422
question : Who influenced Beyonce?, answer : Michael Jackson, similarity score : 0.8148160879479863
question : What band did Beyonce introduce in 2006?, answer : Suga Mama, similarity score : 0.7825187277864076
question : What solo album did Beyonce release in 2003?, answer : Dangerously in Love, similarity score : 0.7818855973596779
question : When did Beyoncé release Formation?, answer : February 6, 2016, similarity score : 0.7907322425502744
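One caveat: the <-> operator used above is L2 distance, while the index was built with vector_cosine_ops, so that query is neither served by the index nor ranked by cosine similarity (which is why its ordering differs slightly from the Qdrant and Milvus results). A sketch of a cosine-distance query against the same table, where 1 - distance is the cosine similarity:
# Order by cosine distance (<=>) so the vector_cosine_ops IVFFlat index can be used
cur.execute(
    "SELECT id, 1 - (embedding <=> %s::vector) AS cosine_similarity "
    "FROM questions ORDER BY embedding <=> %s::vector LIMIT 5;",
    (embedding.tolist(), embedding.tolist()),
)
print(cur.fetchall())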