Dealing with PDF Loading in LangChain
Popular libraries for PDF loading
- pypdf (largely text)
- pymupdf (text + images)
- unstructured (elements)
- pdfplumber
- OCR (scanned PDFs)
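As a rough sketch of how the choice between these options can play out, the snippet below loads a text-based PDF with `PyPDFLoader` from `langchain_community` and applies a simple heuristic to flag PDFs that probably need OCR instead. The helper names, the `min_chars_per_page` threshold, and the file name in the usage comment are assumptions for illustration and are not part of LangChain.

```python
from typing import List


def load_pdf_pages(path: str) -> List[str]:
    """Return the extracted text of each page using LangChain's PyPDFLoader."""
    # Imported lazily so the heuristic below works without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    docs = PyPDFLoader(path).load()  # by default, one Document per page
    return [doc.page_content for doc in docs]


def probably_needs_ocr(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic (assumption): very little extractable text suggests a scanned PDF."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# Example usage (hypothetical file name):
# pages = load_pdf_pages("ncert_chapter.pdf")
# print(probably_needs_ocr(pages))
```

If the heuristic returns `True`, pypdf alone is unlikely to be enough and an OCR-based loader is the more sensible starting point.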
We need to write extra code to extract images
- Refer Here for pypdf loading examples:
  - NCERT
  - Panchatantra
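A minimal sketch of the "extra code" for image extraction, assuming PyMuPDF (imported as `fitz`) is installed. The function names, the file-naming scheme, and the paths in the usage comment are hypothetical; only `fitz.open`, `page.get_images`, and `doc.extract_image` come from PyMuPDF.

```python
from pathlib import Path
from typing import List


def image_filename(page_index: int, image_index: int, ext: str) -> str:
    """Build a predictable, 1-based file name such as 'page003_img01.png'."""
    return f"page{page_index + 1:03d}_img{image_index + 1:02d}.{ext}"


def extract_images(pdf_path: str, out_dir: str) -> List[Path]:
    """Save every embedded image in the PDF to out_dir and return the saved paths."""
    import fitz  # PyMuPDF, imported lazily

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    doc = fitz.open(pdf_path)
    try:
        for page_index, page in enumerate(doc):
            for image_index, img in enumerate(page.get_images(full=True)):
                xref = img[0]  # first item is the image's cross-reference number
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = out / image_filename(page_index, image_index, info["ext"])
                target.write_bytes(info["image"])
                saved.append(target)
    finally:
        doc.close()
    return saved


# Example usage (hypothetical paths):
# extract_images("panchatantra.pdf", "extracted_images")
```

The saved images can then be passed to an OCR step if they contain text.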
Scenario 1
- The PDF is full of illustrations that contain text to be extracted
- We need to build an indexing pipeline
- Exercise:
- Find out how to use OCR in LangChain via document loaders, and write code to load the Panchatantra PDF for indexing
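One possible starting point for the exercise, not the only one: `UnstructuredLoader` from `langchain_unstructured` forwards extra keyword arguments such as `strategy` to Unstructured, and `strategy="ocr_only"` asks it to run OCR on each page (this needs Tesseract installed locally). The helper names and the file name are hypothetical, and the sketch assumes each returned document carries a `page_number` entry in its metadata, which may not hold for every input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_with_ocr(pdf_path: str, strategy: str = "ocr_only") -> List[Tuple[int, str]]:
    """Return (page_number, text) pairs produced by Unstructured's OCR partitioning."""
    from langchain_unstructured import UnstructuredLoader  # lazy import

    # Extra keyword arguments such as `strategy` are passed on to Unstructured.
    loader = UnstructuredLoader(file_path=pdf_path, strategy=strategy)
    pairs: List[Tuple[int, str]] = []
    for doc in loader.lazy_load():
        page = doc.metadata.get("page_number", 0)  # 0 when the page is unknown
        pairs.append((page, doc.page_content))
    return pairs


def merge_by_page(pairs: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Join the non-empty element texts of each page into one string per page."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for page, text in pairs:
        if text.strip():
            grouped[page].append(text.strip())
    return {page: "\n".join(grouped[page]) for page in sorted(grouped)}


# Example usage (hypothetical file name):
# pages = merge_by_page(load_with_ocr("panchatantra.pdf"))
# for number, text in pages.items():
#     print(number, text[:80])
```

Page-level strings like these are a convenient unit to split, embed, and store in the indexing pipeline. Swapping `strategy` to `"hi_res"` is another option worth comparing on illustrated pages.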
- pyproject.toml
[project]
name = "example2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "langchain>=1.2.7",
    "langchain-community>=0.4.1",
    "langchain-unstructured[local]>=1.0.1",
    "unstructured[image]>=0.18.31",
]
- We need to install Tesseract before continuing
Give me steps to install Tesseract on Windows and add it to PATH
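Once Tesseract is installed, a quick check like the one below can confirm that it is reachable from PATH before running any OCR loader. This is a small sketch using only the Python standard library; the function names are hypothetical. On Windows the installer often places the executable in a folder such as `C:\Program Files\Tesseract-OCR`, but that location is an assumption and depends on the installer used.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


# Example usage:
# path = find_tesseract()
# if path is None:
#     print("tesseract not found; add its install folder to PATH and reopen the terminal")
# else:
#     print(path, tesseract_version(path))
```

If `find_tesseract()` returns `None` right after editing PATH, opening a new terminal is usually needed so the updated environment is picked up.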
