Parse and Chunk Documents for AI With Docling
Building a RAG pipeline means your data needs to be clean, structured, and appropriately chunked before it reaches the embedding model. The tricky part is converting your source documents like dense PDFs and complex Word files into something an LLM can actually use without losing the tables, headings, and structure that make the content meaningful.
Docling is an open-source library from IBM that handles document conversion with structure in mind. In this tutorial you will work with the Docling Python library to convert documents to AI-ready markdown while retaining table and document structure. You will then use the Docling HybridChunker to break the converted documents into chunks within token size limits, preparing them for embedding creation and storage in your vector database.
In this tutorial you will learn to:
Install and use Docling DocumentConverter
Convert PDF and DOCX to markdown
Use Docling HybridChunker to break the full document markdown into chunks for embedding
Setting Up Your Docling Project
Before you can jump into coding, you will need to set up your project structure and install the libraries Docling needs to run.
First, create your working directory and a virtual environment:
mkdir docling-project
cd docling-project
python -m venv .venv

# For Linux and macOS users
source .venv/bin/activate

# For Windows PowerShell users
.venv\Scripts\Activate.ps1
Now that you have created your working directory and virtual environment, you can install the necessary dependencies.
pip install docling transformers
Note that docling and transformers are large libraries: the install and the model-artifact downloads may take a few minutes and require significant disk space. Check that your machine has enough free space to both download the libraries and run them.
Converting a Document Using Docling
Once you have finished setting up your project, you are ready to dive into document conversion. First, import the DocumentConverter class from Docling.
The first run using Docling may take longer than expected, as the model artifacts will need to be downloaded and loaded into memory for use.
from docling.document_converter import DocumentConverter

# Instantiate the converter; the first run downloads and loads model artifacts
converter = DocumentConverter()
result = converter.convert('/path/to/yourdocument.pdf')

# The resulting DoclingDocument retains headings and table structure
doc = result.document
markdown = doc.export_to_markdown()
print(markdown)
Your document is now fully converted to markdown, with headings and table structure preserved, in a format that can be passed directly to your AI agent or chatbot, or handed to Docling's HybridChunker to chunk and create embeddings.
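You may also want to persist the converted output so you do not re-run the conversion every time. A minimal sketch using only the standard library; the output path and the sample markdown string (standing in for `doc.export_to_markdown()`) are assumptions:

```python
from pathlib import Path

# Hypothetical markdown string standing in for doc.export_to_markdown()
markdown = "# Quarterly Report\n\n| Region | Revenue |\n|--------|---------|\n| EMEA   | 1.2M    |\n"

# Write the converted markdown to a reusable location
out_path = Path("converted") / "yourdocument.md"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(markdown, encoding="utf-8")

print(f"Saved {len(markdown)} characters to {out_path}")
```

Caching the markdown this way keeps repeated pipeline runs fast, since conversion is the most expensive step.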
Using Docling HybridChunker
Now that you have your document processed into markdown, you will instantiate the HybridChunker, set up the parameters for your desired chunk size, and perform document chunking.
The HybridChunker works together with Docling's hierarchical chunker. It begins by taking the DoclingDocument from the previous step and performs an initial chunking pass based on document structure using the hierarchical chunker. It then performs a second pass where it splits any chunks that exceed the token limit. Finally, it performs a third pass where it merges undersized neighboring chunks that share the same headings and captions.
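The split-then-merge idea behind those passes can be sketched in plain Python. This is a conceptual illustration only, not Docling's implementation: the word-count "tokenizer" and the `(heading, text)` chunk structure are simplified stand-ins.

```python
# Conceptual sketch of the split-and-merge passes, not Docling's implementation.
MAX_TOKENS = 8  # tiny limit so the behavior is visible

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: one token per whitespace-separated word
    return len(text.split())

def split_oversized(heading: str, text: str, limit: int) -> list[tuple[str, str]]:
    # Pass 2: split any chunk that exceeds the token limit
    words = text.split()
    return [
        (heading, " ".join(words[i:i + limit]))
        for i in range(0, len(words), limit)
    ]

def merge_undersized(chunks: list[tuple[str, str]], limit: int) -> list[tuple[str, str]]:
    # Pass 3: merge neighboring undersized chunks that share the same heading
    merged: list[tuple[str, str]] = []
    for heading, text in chunks:
        if (
            merged
            and merged[-1][0] == heading
            and count_tokens(merged[-1][1]) + count_tokens(text) <= limit
        ):
            merged[-1] = (heading, merged[-1][1] + " " + text)
        else:
            merged.append((heading, text))
    return merged

# Pass 1 stand-in: structure-based chunks from a hierarchical chunking pass
structural_chunks = [
    ("Intro", "Docling converts documents while keeping their structure intact for retrieval"),
    ("Intro", "short tail"),
    ("Usage", "install it"),
]

split_chunks = [
    piece
    for heading, text in structural_chunks
    for piece in split_oversized(heading, text, MAX_TOKENS)
]
final_chunks = merge_undersized(split_chunks, MAX_TOKENS)
for heading, text in final_chunks:
    print(heading, "->", count_tokens(text), "tokens")
```

Note how the oversized "Intro" chunk is split, and its small leftover is merged with the neighboring undersized "Intro" chunk, while the "Usage" chunk stays separate because it sits under a different heading.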
In the code snippet below, you will also use HuggingFaceTokenizer to ensure that your chunks respect the same token limits as your chosen embedding model. This aligns how the chunker counts tokens with how the embedding model processes them. Different models tokenize text differently: one model may count a section of text as 250 tokens while another counts it as 300. Without this alignment, you could create chunks that exceed the embedding model's context window and lose rich semantic context.
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import (
    HuggingFaceTokenizer,
)
from transformers import AutoTokenizer
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 256
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)
chunker = HybridChunker(tokenizer=tokenizer)
chunks = list(chunker.chunk(dl_doc=doc))
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    enriched = chunker.contextualize(chunk=chunk)
    print(f"--- Chunk {i} ({len(enriched.split())} words) ---")
    print(enriched[:300])
    print()
The contextualize() method enriches each chunk by prepending its heading hierarchy and caption metadata from the DoclingDocument, which provides the embedding model with additional document context that normally is lost when the text is split into smaller pieces.
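The effect is similar to this simplified sketch, which prepends a heading path to a chunk's text before embedding. The heading list and document content here are illustrative, not Docling's internal representation:

```python
# Simplified illustration of contextualization: prepend the heading
# hierarchy to the chunk text so the embedding model sees its origin.
def contextualize_sketch(headings: list[str], chunk_text: str) -> str:
    # Join the heading path and the chunk body into one string
    return "\n".join(headings + [chunk_text])

enriched = contextualize_sketch(
    ["Annual Report", "Financials", "Q3 Revenue"],
    "Revenue grew 12% quarter over quarter, driven by EMEA.",
)
print(enriched)
```

A chunk that reads only "Revenue grew 12%..." is ambiguous on its own; prefixing "Annual Report / Financials / Q3 Revenue" restores the context that splitting removed.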
Conclusion
You have now built a document processing pipeline ready for use in your AI application. This process is necessary for RAG and other document-processing AI systems, and Docling gives you the tools to achieve it on your own hardware.
Keep in mind that Docling is a heavyweight process, and standing it up to run on a schedule in a production environment is not trivial. If you are already using Azure or other cloud-based tools, Azure Document Intelligence will be significantly faster and easier to integrate and run. Docling also offers an API-based deployment through Docling Serve, though there are known issues with the VLM pipeline being fragile, and on personal hardware it can be resource intensive with slow run times. Keep this in mind when designing your system, and choose the product that best fits your end product's requirements for security, speed, and cost.