RAG for Beginners — Part 1: Loading and Chunking Documents

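Retrieval-augmented generation starts before any model call: documents first have to be turned into searchable chunks. The snippet below uses LangChain's RecursiveCharacterTextSplitter to split a document's text into overlapping chunks.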

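# Newer LangChain releases also ship this class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the document as a single string (the file name is a placeholder).
with open("my_document.txt", encoding="utf-8") as f:
    document_text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # maximum characters per chunk
    chunk_overlap=120,  # maximum characters shared by neighboring chunks
)
chunks = splitter.split_text(document_text)  # list of strings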

Chunk size is a trade-off: too small and each chunk loses the surrounding context it needs to make sense; too large and irrelevant text dilutes the part that actually answers the question. 500–1000 characters with a modest overlap (the example above uses 120 characters, 15% of the chunk size) is a reasonable starting point.
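
To make the size and overlap parameters concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraph and line boundaries). The function name chunk_with_overlap and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Hypothetical helper for illustration; it ignores sentence and
    paragraph boundaries, unlike LangChain's recursive splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
small = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in small:
    print(piece)
```

With these toy settings each chunk begins with the last three characters of the previous one. That repetition is what lets a sentence straddling a boundary appear in full in at least one chunk, provided it is shorter than the overlap.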

Try it

Run the splitter on one real document and print the chunk count and a sample chunk. That's the input every later step in this series builds on.
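
One way to do that inspection is sketched below. It assumes chunks is the list of strings returned by split_text. The helper name summarize_chunks and the stand-in list are hypothetical, included only so the snippet runs on its own.

```python
def summarize_chunks(chunks: list[str]) -> dict:
    """Return basic statistics about a list of text chunks."""
    if not chunks:
        return {"count": 0, "min_len": 0, "max_len": 0, "avg_len": 0.0}
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
    }


# Stand-in data; replace with the output of splitter.split_text(document_text).
chunks = [
    "Retrieval-augmented generation starts with documents.",
    "Documents are split into chunks before indexing.",
    "Each chunk should be small enough to stay on topic.",
]

stats = summarize_chunks(chunks)
print(f"{stats['count']} chunks, lengths {stats['min_len']}-{stats['max_len']} characters")
print("Sample chunk:", chunks[0][:200])  # truncate long chunks for display
```

If the maximum length sits well below your chunk_size, or the sample chunk ends mid-thought, that is a hint to adjust the size or overlap before moving on.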