How to Build a RAG Pipeline with LangChain and Pinecone in 2026

Dev Nakamura 12 min read Updated May 29, 2026

TL;DR

  • Build a complete RAG system that ingests documents, creates vector embeddings, and answers questions with citations
  • Stack: LangChain 0.1.x, Pinecone serverless, OpenAI embeddings (text-embedding-3-small), GPT-4
  • Real code: Every configuration, API call, and error handler you need for production
  • End result: A Python application that processes PDFs, chunks intelligently, stores 1536-dimensional vectors in Pinecone, and returns contextual answers with source references in about 2 seconds

Prerequisites

Before we start, make sure you have:

Required:

  • Python 3.11+ installed (python --version to check)
  • OpenAI API key with credits (get from platform.openai.com)
  • Pinecone account (free tier works fine — sign up at pinecone.io)
  • pip package manager (comes with Python)
  • Git for cloning any sample documents

Knowledge:

  • Basic Python syntax (functions, classes, imports)
  • Familiarity with API keys and environment variables
  • Understanding of what embeddings are (vectors representing text meaning)

Time:

  • ~45 minutes to complete the full tutorial
  • ~10 minutes more if you hit dependency conflicts

Cost:

  • Pinecone: Free tier (1 serverless index, sufficient for testing)
  • OpenAI: ~$0.50 for embeddings + LLM calls during this tutorial

What We’re Building

We’re building a RAG (Retrieval-Augmented Generation) pipeline that lets an LLM answer questions from your documents with factual grounding. Rather than answering from the model’s training data alone, the system first retrieves relevant chunks from a vector database, then generates a response based only on that context, which reduces the risk of hallucinated answers.

Architecture flow:

PDF Documents → Text Extraction → Chunking → OpenAI Embeddings → Pinecone Vector Store → Query (with embedding) → Top-K Retrieval → Context + Prompt → GPT-4 → Answer with Citations

Why this stack? LangChain abstracts RAG complexity (chunking strategies, retrieval chains, prompt templates), Pinecone offers serverless vector search with no infrastructure management, and OpenAI’s text-embedding-3-small delivers strong retrieval quality at roughly one-fifth the price of the older text-embedding-ada-002.

Step 1: Set Up Your Python Environment

Create a clean project directory and virtual environment to avoid dependency conflicts.

mkdir rag-langchain-pinecone
cd rag-langchain-pinecone
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the required packages with pinned versions to ensure reproducibility:

pip install langchain==0.1.20 \
            langchain-openai==0.1.7 \
            langchain-pinecone==0.1.0 \
            pinecone-client==3.2.2 \
            pypdf==4.2.0 \
            python-dotenv==1.0.1 \
            tiktoken==0.7.0

Expected output:

Successfully installed langchain-0.1.20 langchain-openai-0.1.7 ...

What each package does:

  • langchain: Core RAG framework with document loaders and chains
  • langchain-openai: OpenAI integration for embeddings and chat models
  • langchain-pinecone: Pinecone vector store integration
  • pinecone-client: Official Pinecone SDK
  • pypdf: PDF text extraction
  • python-dotenv: Environment variable management
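  • tiktoken: OpenAI’s tokenizer, which langchain-openai uses to count tokens before sending embedding requests (the splitter in this tutorial measures chunks in characters)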

Step 2: Configure API Keys

Create a .env file in your project root to keep credentials out of your source code:

touch .env

Add your API keys (replace with your actual keys):

OPENAI_API_KEY=sk-proj-abcdefghijklmnopqrstuvwxyz1234567890
PINECONE_API_KEY=pcsk_abcdef_1234567890abcdefghijklmnopqrstuvwxyz
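PINECONE_ENVIRONMENT=us-east-1

Where to find these:

  • OpenAI key: platform.openai.com/api-keys
  • Pinecone key: app.pinecone.io → API Keys tab
  • Pinecone environment: The AWS region for your serverless index, such as us-east-1. Step 3 passes this value as the region in ServerlessSpec, so use the plain region name; legacy pod-style names like us-east-1-aws or gcp-starter will not work there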

Create a .gitignore to prevent committing secrets:

echo ".env" >> .gitignore
echo "venv/" >> .gitignore
echo "__pycache__/" >> .gitignore
echo "*.pyc" >> .gitignore
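Before moving on, it can help to fail fast when a key is missing rather than discovering it through an authentication error several steps later. The helper below is a minimal sketch; `require_env` is a hypothetical name (not part of python-dotenv or LangChain), and it accepts an optional mapping so it can be exercised without real credentials.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables, raising if any are missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


if __name__ == "__main__":
    # A plain dict stands in for os.environ so the example runs anywhere.
    fake_env = {"OPENAI_API_KEY": "sk-test", "PINECONE_API_KEY": "pcsk-test"}
    settings = require_env(["OPENAI_API_KEY", "PINECONE_API_KEY"], environ=fake_env)
    print(sorted(settings))
```

In the real scripts you would call it right after `load_dotenv()` with no `environ` argument, so it reads from the process environment.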

Step 3: Initialize Pinecone Vector Store

Create setup_pinecone.py to configure your vector database:

import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

load_dotenv()

def initialize_pinecone():
    """Create Pinecone index for storing document embeddings."""
    
    # Initialize Pinecone client
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
    
    index_name = "langchain-rag-demo"
    
    # Check if index already exists
    existing_indexes = [index.name for index in pc.list_indexes()]
    
    if index_name not in existing_indexes:
        print(f"Creating index '{index_name}'...")
        
        # Create serverless index with 1536 dimensions (OpenAI text-embedding-3-small)
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region=os.getenv("PINECONE_ENVIRONMENT")
            )
        )
        print(f"✓ Index '{index_name}' created successfully")
    else:
        print(f"✓ Index '{index_name}' already exists")
    
    return index_name

if __name__ == "__main__":
    index_name = initialize_pinecone()
    print(f"\nPinecone index ready: {index_name}")

Run the setup script:

python setup_pinecone.py

Expected output:

Creating index 'langchain-rag-demo'...
✓ Index 'langchain-rag-demo' created successfully

Pinecone index ready: langchain-rag-demo

Key parameters explained:

  • dimension=1536: Matches OpenAI’s text-embedding-3-small output size
  • metric="cosine": Measures similarity between vectors by the angle between them (-1 to 1 scale, higher = more similar)
  • ServerlessSpec: No infrastructure management, usage-based pricing
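To make the cosine metric concrete, here is a small pure-Python sketch of the calculation. It is for illustration only: Pinecone computes this server-side, and the two-dimensional vectors are toy stand-ins for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Note that vector length does not affect the score: only direction matters, which is why cosine is a common choice for text embeddings.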

Step 4: Load and Chunk Documents

Create a data/ folder and add sample PDFs:

mkdir data
# Add your PDFs to data/ folder
# For testing, you can use any PDF — research papers, reports, manuals, etc.

Create document_processor.py to handle PDF ingestion:

import os
from typing import List
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class DocumentProcessor:
    """Load and chunk documents for embedding."""
    
    def __init__(self, data_dir: str = "./data"):
        self.data_dir = data_dir
        # Configure chunking strategy
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
    
    def load_documents(self) -> List[Document]:
        """Load all PDFs from data directory."""
        print(f"Loading documents from {self.data_dir}...")
        
        loader = PyPDFDirectoryLoader(self.data_dir)
        documents = loader.load()
        
        print(f"✓ Loaded {len(documents)} document pages")
        return documents
    
    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents into smaller chunks for better retrieval."""
        print("Chunking documents...")
        
        chunks = self.text_splitter.split_documents(documents)
        
        # Add metadata for tracking sources
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["chunk_size"] = len(chunk.page_content)
        
        # Guard against an empty data/ folder to avoid dividing by zero
        avg_chars = sum(len(c.page_content) for c in chunks) // max(len(chunks), 1)
        print(f"✓ Created {len(chunks)} chunks (avg {avg_chars} chars)")
        return chunks
    
    def process(self) -> List[Document]:
        """Full pipeline: load → chunk."""
        docs = self.load_documents()
        chunks = self.chunk_documents(docs)
        return chunks

if __name__ == "__main__":
    processor = DocumentProcessor()
    chunks = processor.process()
    print(f"\nReady to embed {len(chunks)} chunks")
    if chunks:  # data/ may contain no readable PDFs
        print(f"Sample chunk: {chunks[0].page_content[:200]}...")

Test the document processor:

python document_processor.py

Expected output:

Loading documents from ./data...
✓ Loaded 15 document pages
Chunking documents...
✓ Created 42 chunks (avg 847 chars)

Ready to embed 42 chunks
Sample chunk: Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...

Chunking strategy explained:

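  • chunk_size=1000: Maximum of 1000 characters per chunk (balances context vs. precision)
  • chunk_overlap=200: Up to 200 characters are repeated between neighboring chunks, so text near a boundary keeps some of its surrounding context
  • separators: Split on paragraph breaks first, then line breaks, sentences, words, and finally individual characters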
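To see how chunk size and overlap interact, here is a deliberately simplified sketch that slides a fixed-size character window over the text. It is not how RecursiveCharacterTextSplitter works internally (that class also tries each separator in turn); `chunk_text` is a hypothetical helper used only to illustrate the two parameters.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    print([len(c) for c in chunk_text(sample)])  # → [1000, 1000, 900]
```

With the defaults, each window advances 800 characters, so the last 200 characters of one chunk are the first 200 of the next.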

Step 5: Create Embeddings and Store in Pinecone

Create vector_store.py to handle embedding and indexing:

import os
from typing import List
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.schema import Document
from document_processor import DocumentProcessor

load_dotenv()

class VectorStoreManager:
    """Manage vector embeddings and Pinecone storage."""
    
    def __init__(self, index_name: str = "langchain-rag-demo"):
        self.index_name = index_name
        
        # Initialize OpenAI embeddings (text-embedding-3-small)
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
    
    def create_vector_store(self, documents: List[Document]) -> PineconeVectorStore:
        """Embed documents and store in Pinecone."""
        print(f"Creating embeddings for {len(documents)} documents...")
        print("(This may take 30-60 seconds depending on document count)")
        
        vector_store = PineconeVectorStore.from_documents(
            documents=documents,
            embedding=self.embeddings,
            index_name=self.index_name
        )
        
        print(f"✓ Stored {len(documents)} vectors in Pinecone index '{self.index_name}'")
        return vector_store
    
    def load_vector_store(self) -> PineconeVectorStore:
        """Load existing vector store (for querying)."""
        vector_store = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings
        )
        return vector_store

def index_documents():
    """Main function to process and index documents."""
    # Load and chunk documents
    processor = DocumentProcessor()
    chunks = processor.process()
    
    # Create embeddings and store
    manager = VectorStoreManager()
    vector_store = manager.create_vector_store(chunks)
    
    print("\n✓ Indexing complete! Your documents are now searchable.")
    return vector_store

if __name__ == "__main__":
    index_documents()

Run the indexing pipeline:

python vector_store.py

Expected output:

Loading documents from ./data...
✓ Loaded 15 document pages
Chunking documents...
✓ Created 42 chunks (avg 847 chars)
Creating embeddings for 42 documents...
(This may take 30-60 seconds depending on document count)
✓ Stored 42 vectors in Pinecone index 'langchain-rag-demo'

✓ Indexing complete! Your documents are now searchable.

What’s happening behind the scenes:

  1. Chunks are sent to OpenAI’s embedding API in batches
  2. The API returns a 1536-dimensional vector for each chunk, representing its semantic meaning
  3. Each vector is stored in Pinecone together with its metadata and the chunk text
  4. Pinecone indexes the vectors for approximate nearest-neighbor search, so queries do not need to scan every vector
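Conceptually, a similarity query ranks stored vectors by how close they are to the query vector. The sketch below does this exhaustively in memory; it is a baseline for intuition only, and says nothing about Pinecone’s actual index, which avoids scanning every vector. The function names and the three-dimensional toy vectors are assumptions for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, stored, k=4):
    """Return the k (id, score) pairs most similar to query_vector.

    `stored` maps a vector ID to its embedding. Every vector is scored,
    which is exact but only practical for small collections.
    """
    scored = [(vec_id, cosine(query_vector, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones here have 1536 dimensions.
    stored = {
        "chunk-0": [0.9, 0.1, 0.0],
        "chunk-1": [0.0, 1.0, 0.0],
        "chunk-2": [0.7, 0.7, 0.0],
    }
    for vec_id, score in top_k([1.0, 0.0, 0.0], stored, k=2):
        print(vec_id, round(score, 3))
```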

Step 6: Build the RAG Query Pipeline

Create rag_pipeline.py for question-answering:

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from vector_store import VectorStoreManager

load_dotenv()

class RAGPipeline:
    """Complete RAG pipeline for document Q&A."""
    
    def __init__(self, index_name: str = "langchain-rag-demo"):
        # Load vector store
        manager = VectorStoreManager(index_name)
        self.vector_store = manager.load_vector_store()
        
        # Initialize LLM (GPT-4)
        self.llm = ChatOpenAI(
            model="gpt-4-turbo-preview",
            temperature=0.1,  # Low temperature for factual responses
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
        
        # Custom prompt template
        self.prompt_template = PromptTemplate(
            template="""You are a helpful AI assistant answering questions based on provided context.

Context: {context}

Question: {question}

Instructions:
- Answer based ONLY on the context provided
- If the context doesn't contain enough information, say "I don't have enough information to answer that"
- Cite specific parts of the context in your answer
- Be concise but complete

Answer:""",
            input_variables=["context", "question"]
        )
        
        # Build retrieval chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # "stuff" passes all retrieved docs to LLM at once
            retriever=self.vector_store.as_retriever(
                search_type="similarity",
                search_kwargs={"k": 4}  # Retrieve top 4 most relevant chunks
            ),
            return_source_documents=True,
            chain_type_kwargs={"prompt": self.prompt_template}
        )
    
    def query(self, question: str) -> dict:
        """Ask a question and get answer with sources."""
        print(f"\nQuery: {question}")
        print("Retrieving relevant context...\n")
        
        result = self.qa_chain.invoke({"query": question})
        
        return {
            "question": question,
            "answer": result["result"],
            "sources": result["source_documents"]
        }
    
    def format_response(self, result: dict) -> str:
        """Format response with sources."""
        output = []
        output.append(f"Question: {result['question']}")
        output.append(f"\nAnswer: {result['answer']}")
        output.append("\n--- Sources ---")
        
        for i, doc in enumerate(result['sources'], 1):
            source = doc.metadata.get('source', 'Unknown')
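            page = doc.metadata.get('page')
            # PyPDF page numbers are 0-indexed, and Pinecone returns numeric metadata as floats
            page_label = int(page) + 1 if page is not None else 'N/A'
            output.append(f"\n[{i}] {source} (Page {page_label})")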
            output.append(f"    Excerpt: {doc.page_content[:150]}...")
        
        return "\n".join(output)

if __name__ == "__main__":
    # Initialize pipeline
    rag = RAGPipeline()
    
    # Example queries
    questions = [
        "What is machine learning?",
        "How does gradient descent work?",
        "What are the main types of neural networks?"
    ]
    
    for question in questions:
        result = rag.query(question)
        print(rag.format_response(result))
        print("\n" + "="*80 + "\n")

Run the RAG pipeline:

python rag_pipeline.py

Expected output:

Query: What is machine learning?
Retrieving relevant context...

Question: What is machine learning?

Answer: Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. According to the provided context, it involves algorithms that can identify patterns in data and make predictions or decisions based on those patterns.

--- Sources ---

[1] data/intro_to_ml.pdf (Page 1)
    Excerpt: Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...

[2] data/intro_to_ml.pdf (Page 2)
    Excerpt: The core idea behind machine learning is to allow computers to learn from data rather than following only explicitly programmed instructions...

================================================================================

Key pipeline components:

  • search_kwargs={"k": 4}: Returns top 4 most similar chunks (tune based on context window needs)
  • temperature=0.1: Low randomness = more deterministic, factual answers
  • return_source_documents=True: Enables citation tracking
  • chain_type="stuff": Inserts all retrieved chunks into a single prompt. Alternatives: "map_reduce" (for contexts too large for one prompt), "refine" (iterative refinement)
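The "stuff" strategy is simple enough to sketch by hand: join the retrieved chunk texts and drop them into the prompt. The code below is an illustration under stated assumptions, not LangChain’s implementation; `build_stuff_prompt` is a hypothetical helper, the template is a shortened version of the one in rag_pipeline.py, and the blank-line separator is assumed.

```python
PROMPT_TEMPLATE = """Context: {context}

Question: {question}

Answer:"""


def build_stuff_prompt(question, chunks, separator="\n\n"):
    """Join all retrieved chunk texts into one context block and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    retrieved = [
        "Machine learning is a subset of artificial intelligence.",
        "Models learn patterns from data.",
    ]
    print(build_stuff_prompt("What is machine learning?", retrieved))
```

Because every chunk lands in one prompt, raising k or chunk_size directly increases prompt length, which is the main reason to consider the other chain types for large contexts.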

Step 7: Create an Interactive CLI

Create app.py for a simple command-line interface:

import sys
from rag_pipeline import RAGPipeline

def main():
    """Interactive RAG application."""
    print("="*80)
    print("RAG Document Q&A System")
    print("="*80)
    print("\nInitializing pipeline...")
    
    try:
        rag = RAGPipeline()
        print("✓ Pipeline ready!\n")
    except Exception as e:
        print(f"Error initializing pipeline: {e}")
        sys.exit(1)
    
    print("Ask questions about your documents (type 'quit' to exit)\n")
    
    while True:
        try:
            question = input("You: ").strip()
            
            if not question:
                continue
            
            if question.lower() in ['quit', 'exit', 'q']:
                print("\nGoodbye!")
                break
            
            result = rag.query(question)
            formatted = rag.format_response(result)
            print(f"\n{formatted}\n")
            print("-"*80 + "\n")
            
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"\nError processing query: {e}\n")

if __name__ == "__main__":
    main()

Run the interactive application:

python app.py

Expected interaction:

================================================================================
RAG Document Q&A System
================================================================================

Initializing pipeline...
✓ Pipeline ready!

Ask questions about your documents (type 'quit' to exit)

You: What is supervised learning?

Query: What is supervised learning?
Retrieving relevant context...

Question: What is supervised learning?

Answer: Supervised learning is a type of machine learning where the model is trained on labeled data...
[Sources and citations follow]

Testing Your Implementation

Create test_rag.py to verify the complete pipeline:

import time
from rag_pipeline import RAGPipeline

def test_retrieval_quality():
    """Test vector retrieval performance."""
    print("Testing retrieval quality...\n")
    
    rag = RAGPipeline()
    
    # Test queries with expected context
    test_cases = [
        {
            "query": "What is the definition of machine learning?",
            "expected_keywords": ["learn", "data", "algorithm", "pattern"]
        },
        {
            "query": "How do neural networks work?",
            "expected_keywords": ["neuron", "layer", "weight", "activation"]
        }
    ]
    
    for i, test in enumerate(test_cases, 1):
        print(f"Test {i}: {test['query']}")
        start = time.time()
        
        result = rag.query(test['query'])
        elapsed = time.time() - start
        
        answer = result['answer'].lower()
        found_keywords = [kw for kw in test['expected_keywords'] if kw in answer]
        
        print(f"  ✓ Response time: {elapsed:.2f}s")
        print(f"  ✓ Found keywords: {found_keywords}")
        print(f"  ✓ Source documents: {len(result['sources'])}")
        print()

def test_source_attribution():
    """Verify source documents are returned."""
    print("Testing source attribution...\n")
    
    rag = RAGPipeline()
    result = rag.query("What is machine learning?")
    
    assert len(result['sources']) > 0, "No source documents returned"
    assert all('source' in doc.metadata for doc in result['sources']), "Missing source metadata"
    
    print(f"✓ All {len(result['sources'])} sources have proper metadata\n")

if __name__ == "__main__":
    print("="*80)
    print("RAG Pipeline Test Suite")
    print("="*80 + "\n")
    
    test_retrieval_quality()
    test_source_attribution()
    
    print("\n" + "="*80)
    print("All tests passed!")
    print("="*80)

Run the test suite:

python test_rag.py

Expected output:

================================================================================
RAG Pipeline Test Suite
================================================================================

Testing retrieval quality...

Test 1: What is the definition of machine learning?
  ✓ Response time: 1.84s
  ✓ Found keywords: ['learn', 'data', 'algorithm', 'pattern']
  ✓ Source documents: 4

Test 2: How do neural networks work?
  ✓ Response time: 2.01s
  ✓ Found keywords: ['neuron', 'layer', 'weight', 'activation']
  ✓ Source documents: 4

Testing source attribution...

✓ All 4 sources have proper metadata

================================================================================
All tests passed!
================================================================================
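As written, test_retrieval_quality prints the keywords it found but never fails when they are missing. One way to turn it into a real check is to compute a coverage score and assert on a threshold. The sketch below assumes a hypothetical `keyword_coverage` helper; the 0.5 threshold in the usage note is an arbitrary starting point, not a recommendation from any library.

```python
def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        raise ValueError("expected_keywords must not be empty")
    text = answer.lower()
    found = [kw for kw in expected_keywords if kw.lower() in text]
    return len(found) / len(expected_keywords)


if __name__ == "__main__":
    answer = "Models LEARN from data to make predictions."
    print(keyword_coverage(answer, ["learn", "data", "algorithm", "pattern"]))  # → 0.5
```

Inside the test loop you could then write `assert keyword_coverage(result['answer'], test['expected_keywords']) >= 0.5`. Keyword matching is a coarse proxy, so treat a failure as a prompt to inspect the answer rather than as proof of a bad retrieval.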

Common Issues & Fixes

Problem: ModuleNotFoundError: No module named 'langchain_openai'
Cause: Package not installed or wrong virtual environment active
Fix: Ensure venv is activated and reinstall:

source venv/bin/activate
pip install langchain-openai==0.1.7

Problem: PineconeException: Index 'langchain-rag-demo' not found
Cause: Index wasn’t created or wrong environment variable
Fix: Run setup script and verify environment:

python setup_pinecone.py
# Check PINECONE_ENVIRONMENT in .env matches your dashboard

Problem: Retrieval returns irrelevant documents
Cause: Chunks are too large or too small, or k value is wrong
Fix: Tune chunking parameters in document_processor.py:

self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Try smaller chunks
    chunk_overlap=100,
    length_function=len
)

And adjust k in retriever:

search_kwargs={"k": 6}  # Retrieve more candidates

Problem: RateLimitError: You exceeded your current quota
Cause: OpenAI API quota exceeded
Fix: Check usage at platform.openai.com/usage or add a payment method. For development, reduce the number of test documents.
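Transient rate limits (as opposed to an exhausted quota, which retrying will not fix) are usually handled by retrying with exponential backoff. The helper below is a generic sketch; `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delays can be observed without actually waiting.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to simulate a transient error.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

In the pipeline you might wrap a call such as `lambda: rag.query(question)` and narrow `retry_on` to the rate-limit exception raised by your OpenAI SDK version, so that unrelated errors are not retried.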

Problem: Answers are too verbose or off-topic
Cause: Prompt template or temperature settings
Fix: Adjust prompt in rag_pipeline.py to be more specific:

template="""Answer in 2-3 sentences maximum. Use only the context provided.

Context: {context}
Question: {question}

Brief answer:"""

And lower temperature further:

temperature=0.0  # Maximum determinism

Next Steps

Now that you have a working RAG pipeline, here are ways to extend it:

Add conversation memory: Implement chat history to enable follow-up questions

from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history")

Support more document types: Add loaders for DOCX, TXT, Markdown

from langchain_community.document_loaders import (
    TextLoader,
    UnstructuredWordDocumentLoader
)

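Diversify retrieval with MMR: Re-rank a larger candidate set so the returned chunks are relevant but not redundant. (True hybrid search, which combines vector similarity with keyword matching, needs sparse vectors and a separate setup.)

retriever=self.vector_store.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 4, "fetch_k": 20}  # Fetch 20 candidates, return the 4 selected by MMR
)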
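The search_type="mmr" option refers to Maximal Marginal Relevance, which picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The sketch below shows the idea in pure Python; `mmr_select` is a hypothetical helper, the vectors are toy values, and the default `lambda_mult=0.5` is an assumption for the example rather than a statement about any library’s default.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=4, lambda_mult=0.5):
    """Pick k candidate indices balancing query relevance against redundancy.

    `candidates` plays the role of the fetch_k vectors returned by the
    initial similarity search.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = _cosine(query, candidates[i])
            # Similarity to the closest already-selected result (0 if none yet)
            redundancy = max(
                (_cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [[0.9, 0.1], [0.9, 0.12], [0.9, -0.4]]  # first two are near-duplicates
    print(mmr_select(query, candidates, k=2))                   # → [0, 2]
    print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the result matches plain similarity ranking; lower values trade some relevance for diversity.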

Add streaming responses: Stream LLM output token-by-token

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Deploy as an API: Wrap in FastAPI for production use

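from fastapi import FastAPI
from pydantic import BaseModel
from rag_pipeline import RAGPipeline

app = FastAPI()
rag = RAGPipeline()

class QueryRequest(BaseModel):
    question: str

@app.post("/query")
def query(request: QueryRequest):
    # Plain def: FastAPI runs it in a threadpool, so the blocking call is safe
    result = rag.query(request.question)
    # Return JSON-friendly fields instead of raw Document objects
    return {"answer": result["answer"], "sources": [d.metadata for d in result["sources"]]}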

Challenge: Extend this system to handle a corpus of 1000+ documents. You’ll need to implement batch processing for embeddings and consider Pinecone’s namespaces feature to organize documents by category. Can you keep query latency under 2 seconds?
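As a starting point for the batch-processing part of the challenge, here is a small generator that splits any sequence of chunks into fixed-size batches. It is a sketch only: `batched` is defined by hand because itertools.batched requires Python 3.12 and this tutorial targets 3.11+, and the batch size of 100 in the usage note is an assumed value you would tune against your own rate limits.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(7), 3):
        print(batch)
```

Each batch of chunks could then be passed to the vector store from Step 5, for example `for batch in batched(chunks, 100): vector_store.add_documents(batch)`, so that a failure partway through a large corpus only requires retrying one batch.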

For more tutorials on production RAG systems, check out our guides on semantic caching and evaluation metrics for retrieval quality.
