Ensuring only updated docs are indexed
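# Import paths assume the current split LangChain packages; adjust to your installed versions.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_chroma import Chroma
from langchain.indexes import index

# Load the updated knowledge-base articles from disk
directory_loader = DirectoryLoader(
    path="../data/updates/IT_Helpdesk_KB_Articles_v2",
    glob="*.txt",
    loader_cls=TextLoader,
)
documents = directory_loader.load()

# Split the articles into small overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
chunks = text_splitter.split_documents(documents)

embedding = VertexAIEmbeddings(model_name="text-embedding-005")
vector_store = Chroma(
    collection_name="kb_collection",
    embedding_function=embedding,
    persist_directory="../vectordb/kb_collection_db_sample1",
)

# Reindex only new or changed docs; unchanged ones are skipped.
# sql_record_manager is assumed to be the record manager created earlier.
# docs_source=documents indexes whole files; pass chunks here to index the split text instead.
result = index(
    docs_source=documents,
    record_manager=sql_record_manager,
    vector_store=vector_store,
    cleanup='incremental',
    source_id_key='source',
)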
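To make the incremental cleanup idea concrete, here is a minimal, library-free sketch of the bookkeeping involved. It is not LangChain's implementation: the function names, the plain dicts standing in for the record manager and vector store, and the hashing scheme are all assumptions made for illustration. The returned counts loosely mirror the kind of summary that index() returns.

```python
import hashlib


def doc_hash(source, text):
    """Hypothetical fingerprint: hash of the source id plus the content."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def incremental_index(docs, record_store, vector_store):
    """Index (source, text) pairs, skipping unchanged ones.

    record_store: dict mapping hash -> source (stand-in for a record manager)
    vector_store: dict mapping hash -> text (stand-in for a vector store)
    """
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_sources = set()
    current_hashes = set()

    for source, text in docs:
        h = doc_hash(source, text)
        seen_sources.add(source)
        current_hashes.add(h)
        if h in record_store:
            # Same source and same content as before: nothing to do
            stats["num_skipped"] += 1
            continue
        record_store[h] = source
        vector_store[h] = text
        stats["num_added"] += 1

    # Remove old versions, but only for sources seen in this run
    stale = [
        h for h, src in record_store.items()
        if src in seen_sources and h not in current_hashes
    ]
    for h in stale:
        del record_store[h]
        vector_store.pop(h, None)
        stats["num_deleted"] += 1
    return stats


records, vectors = {}, {}
first = incremental_index(
    [("a.txt", "reset password"), ("b.txt", "vpn setup")], records, vectors
)
print(first)   # {'num_added': 2, 'num_skipped': 0, 'num_deleted': 0}
second = incremental_index(
    [("a.txt", "reset password"), ("b.txt", "vpn setup v2")], records, vectors
)
print(second)  # {'num_added': 1, 'num_skipped': 1, 'num_deleted': 1}
```

The second run only touches b.txt: the unchanged a.txt is skipped, the new version of b.txt is added, and its old version is deleted. Sources that do not appear in a run are left alone, which is what distinguishes this from a full cleanup.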
Project for RAG: HR Helpdesk
- The core idea of this project is that we will have HR documents (knowledge bases).
- For the purposes of this project, I will keep these docs in a folder; in your organization they might be on Confluence or wiki pages.
- What we will be building:
  - A ChatGPT-style interface
  - The user asks a question and the RAG system responds
  - This is enterprise RAG, so its answers are expected to be grounded in the HR documents
  - We should prove the precision and fairness of the RAG system before deploying
  - Deployment possibilities
- Refer Here for the git repo.
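As one possible way to put a number on the "prove the precision" point above, here is a small sketch of precision@k for the retrieval step. Treating precision as retrieval precision@k is an assumption; the project may use a different metric. The function name and the document ids are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ids: what the retriever returned vs. what a reviewer marked relevant
retrieved = ["leave_policy", "payroll_faq", "dress_code"]
relevant = {"leave_policy", "dress_code"}

print(precision_at_k(retrieved, relevant, k=1))  # 1.0
print(precision_at_k(retrieved, relevant, k=2))  # 0.5
```

Averaging this value over a set of test questions gives a single score that can be tracked before and after changes to chunking, embeddings, or prompts.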
Steps:
- Synthetic data creation
- Understanding data
- Indexing
- Vector stores
- Prompts
- Retrieval
- Grounding
- Scores
- Deployment