Gen-AI Developer Classroom notes 29/Jan/2026

Dealing with PDF Loading in LangChain

  • Popular libraries for PDFs

    • pypdf (largely text)
    • pymupdf (text + images)
    • unstructured (elements)
    • pdfplumber
    • OCR (scanned PDFs)
  • We need to write extra code to extract images
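
As a rough illustration of the "extra code" for images, the sketch below uses pymupdf directly rather than a LangChain loader. The function names `image_filename` and `extract_images` and the output naming scheme are hypothetical; `import pymupdf` assumes a recent PyMuPDF release (older releases are imported as `fitz`).

```python
from pathlib import Path


def image_filename(page_index, image_index, ext):
    """Build a sortable file name such as page001_img01.png (hypothetical scheme)."""
    return f"page{page_index + 1:03d}_img{image_index + 1:02d}.{ext}"


def extract_images(pdf_path, out_dir):
    """Save every embedded image of the PDF into out_dir and return the paths."""
    # Imported lazily so the naming helper works without PyMuPDF installed.
    import pymupdf

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with pymupdf.open(str(pdf_path)) as doc:
        for page_index, page in enumerate(doc):
            # Each entry describes one image on the page; item 0 is its xref.
            for image_index, info in enumerate(page.get_images(full=True)):
                image = doc.extract_image(info[0])
                target = out_dir / image_filename(page_index, image_index, image["ext"])
                target.write_bytes(image["image"])
                saved.append(target)
    return saved
```

The saved images can then be passed to an OCR step if they contain text.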

  • pypdf loading (Refer Here) for:
    • NCERT
    • Panchatantra
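
The linked class code is not reproduced in these notes, so the following is only a minimal sketch of pypdf-based loading through LangChain's `PyPDFLoader`. The helper names and the idea of measuring empty pages are assumptions added for illustration; a high ratio of empty pages hints that the PDF is scanned and needs OCR.

```python
def load_pdf_pages(pdf_path):
    """Return one LangChain Document per page, extracted with pypdf."""
    # Imported lazily; needs langchain-community and pypdf installed.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(str(pdf_path))
    return loader.load()


def empty_page_ratio(page_texts):
    """Fraction of pages with no extractable text (hypothetical helper)."""
    if not page_texts:
        return 0.0
    empty = sum(1 for text in page_texts if not text.strip())
    return empty / len(page_texts)
```

A possible use, assuming a local file named `panchatantra.pdf`: `docs = load_pdf_pages("panchatantra.pdf")`, then `empty_page_ratio([d.page_content for d in docs])`.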

Scenario 1

  • The PDF is full of image illustrations containing text that has to be extracted.
  • We need to build an indexing pipeline.
  • Exercise:
    • Find out how to use OCR in LangChain via document loaders, and write code to index the Panchatantra PDF.
  • pyproject.toml
[project]
name = "example2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "langchain>=1.2.7",
    "langchain-community>=0.4.1",
    "langchain-unstructured[local]>=1.0.1",
    "unstructured[image]>=0.18.31",
]
  • We need to install Tesseract before continuing
Give me steps to install Tesseract on Windows and add it to PATH
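
One possible starting point for the exercise, assuming the dependencies above and a working Tesseract install: `UnstructuredLoader` from `langchain-unstructured` forwards extra keyword arguments to unstructured's partitioning, so `strategy="ocr_only"` is used here to force OCR. The helper `merge_by_page` and the choice of strategy are assumptions, not the class solution.

```python
from collections import defaultdict


def load_with_ocr(pdf_path, languages=("eng",)):
    """Run OCR over the PDF and return LangChain Documents (one per element)."""
    # Imported lazily; needs langchain-unstructured[local] and Tesseract on PATH.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(
        file_path=str(pdf_path),
        strategy="ocr_only",        # assumption: ignore any text layer, always OCR
        languages=list(languages),  # Tesseract language codes
    )
    return loader.load()


def merge_by_page(elements):
    """Join (page_number, text) pairs into one text block per page, in page order."""
    pages = defaultdict(list)
    for page_number, text in elements:
        if text.strip():
            pages[page_number].append(text.strip())
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}
```

The loaded documents could be regrouped with something like `merge_by_page((d.metadata.get("page_number"), d.page_content) for d in docs)` before splitting and embedding, assuming the loader sets `page_number` in the metadata.
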
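
Once Tesseract is installed and its folder has been added to PATH (often `C:\Program Files\Tesseract-OCR` on Windows, though that location is an assumption and depends on the installer), a small check like the one below can confirm that Python sees it. The function names are hypothetical.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; reopen the terminal after editing PATH")
    else:
        print(path, "->", tesseract_version(path))
```

If the check reports that Tesseract is missing right after installation, opening a new terminal usually picks up the updated PATH.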

By continuous learner

enthusiastic technology learner
