Gen-AI Developer Classroom notes 07/Apr/2025

Problem

  • We have a textbook that is primarily text but also contains some images.
  • We need to parse both the text and the images from the PDF file.

Relation between chunk and token

  • A document is broken down into chunks.
  • Each chunk contains multiple tokens.
  • Note: When using an LLM, ensure that no chunk has more tokens than the model accepts.
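
To make the note above concrete, here is a minimal sketch of a token-budget check. It does not use a real tokenizer: it assumes the common rule of thumb of roughly 0.75 English words per token, and the names `estimate_tokens`, `fits_model` and `reserved` are made up for this illustration. A real pipeline would count tokens with the tokenizer of the model being used.

```python
import math

# Assumption: roughly 0.75 English words per token (a rule of thumb, not exact).
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the whitespace-separated word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_model(chunk: str, token_limit: int, reserved: int = 0) -> bool:
    """Return True if the chunk's estimated tokens fit within the limit,
    keeping `reserved` tokens free for the prompt and the model's reply."""
    return estimate_tokens(chunk) <= token_limit - reserved


chunk = "word " * 3000                  # a chunk of 3,000 words
print(estimate_tokens(chunk))           # 4000
print(fits_model(chunk, 4096))          # True
print(fits_model(chunk, 4096, 500))     # False
```

The `reserved` parameter is there because the chunk usually shares the context window with the question and the generated answer, so the usable budget is smaller than the headline limit.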

Popular models and their token limit

| Model | Token Limit | Estimated Word Count |
|-------|-------------|----------------------|
| GPT-3.5 | 4,096 | Approximately 3,083 words[3] |
| GPT-3.5 Turbo | 4,000 | Not specified[1] |
| GPT-3.5 Turbo-16k | 16,000 | Not specified[1] |
| GPT-4 (32K variant) | 32,768 | Approximately 25,000 words[3] |
| Llama 2 | 4,096 | Approximately 3,083 words[3] |
| Claude 2 | 100,000 | Approximately 60,000 words[3] |
| PaLM | 8,000 | Approximately 6,200 words[3] |
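
The word counts in the table are estimates derived from the token limits. As a quick sanity check, the sketch below computes the words-per-token ratio for a few rows copied from the table. The helper names `words_per_token` and `estimate_words`, and the default ratio of 0.75, are assumptions for this illustration only.

```python
# (token limit, estimated words) pairs copied from rows of the table above.
TABLE = {
    "GPT-3.5": (4096, 3083),
    "GPT-4": (32768, 25000),
    "Claude 2": (100000, 60000),
    "PaLM": (8000, 6200),
}


def words_per_token(token_limit: int, words: int) -> float:
    """Ratio implied by one table row."""
    return words / token_limit


def estimate_words(token_limit: int, ratio: float = 0.75) -> int:
    """Rough word capacity for a token limit (ratio is an assumed rule of thumb)."""
    return int(token_limit * ratio)


for model, (limit, words) in TABLE.items():
    print(f"{model}: {words_per_token(limit, words):.3f} words per token")
```

The ratios fall between about 0.6 and 0.8, which is why word counts are only a loose guide: the exact number of tokens depends on the model's tokenizer and on the text itself.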

Different Types of Chunking

Basic Problem

  • Parsing content from the PDF:
    • Let's parse the text page by page.
    • Let's also extract any images found on each page.
  • The page information will be metadata, stored alongside the vectors in the vector database.
  • Now let's find a Python library that can parse text page by page.

  • Refer Here for the code written
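
The linked code is not reproduced in these notes. As a rough sketch of the page-by-page approach, the example below assumes the pypdf library (the notes do not say which library was actually chosen). The function names `build_page_record` and `parse_pdf`, the `image_dir` parameter and the output file naming are hypothetical. Depending on the image format, pypdf may also need Pillow installed to decode images.

```python
from pathlib import Path


def build_page_record(page_number: int, text: str, image_paths: list[str]) -> dict:
    """Bundle one page's text with the metadata to store next to its vectors."""
    return {
        "text": text.strip(),
        "metadata": {"page": page_number, "images": list(image_paths)},
    }


def parse_pdf(pdf_path: str, image_dir: str = "images") -> list[dict]:
    """Extract text and embedded images page by page (assumes `pip install pypdf`)."""
    # Imported lazily so build_page_record can be used without pypdf installed.
    from pypdf import PdfReader

    out_dir = Path(image_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    records = []
    for page_number, page in enumerate(PdfReader(pdf_path).pages, start=1):
        text = page.extract_text() or ""
        image_paths = []
        for image in page.images:
            # Prefix with the page number so names from different pages do not clash.
            target = out_dir / f"page{page_number}_{image.name}"
            target.write_bytes(image.data)
            image_paths.append(str(target))
        records.append(build_page_record(page_number, text, image_paths))
    return records
```

Calling `parse_pdf` with the path of the textbook PDF would return one record per page, each carrying the page number and the saved image paths as metadata.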

  • We now have a parser that can extract text as well as images.
  • Note: The image extraction still requires a fix.
  • Design:
    • Is our system going to perform similarity search based on images?
      • If yes, we need to create image embeddings as well.
      • If no, perform text embedding only and add the images as metadata.
  • Next Steps:
    • Chunking
    • Look for libraries that are model-aware when chunking.
  • Concepts required for this:
    • Transformer
    • LangChain
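
To illustrate the "text embedding with images as metadata" design together with the chunking step, here is a minimal sketch. It is not model-aware: it splits on words and assumes about 0.75 words per token, where a model-aware splitter from a library such as LangChain would count tokens with the model's own tokenizer. The `Chunk` class, the `chunk_page` function and their parameters are hypothetical names for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str                                         # text that will be embedded
    page: int                                         # metadata: source page number
    images: list[str] = field(default_factory=list)   # metadata: image paths


def chunk_page(text: str, page: int, images: list[str],
               max_tokens: int = 200, words_per_token: float = 0.75) -> list[Chunk]:
    """Split one page into chunks that stay under an approximate token budget."""
    # Convert the token budget into a word budget (assumed ratio, not exact).
    max_words = max(1, int(max_tokens * words_per_token))
    words = text.split()
    return [
        Chunk(" ".join(words[i:i + max_words]), page, list(images))
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_page("w " * 10, page=2, images=["page2_Im0.png"], max_tokens=4)
print(len(chunks))       # 4
print(chunks[0].text)    # w w w
```

Every chunk from a page carries the same page number and image paths, so a text-only similarity search can still return the related images from the stored metadata.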

By continuous learner

enthusiastic technology learner
