Introduction: The Challenge of Creating High-Quality Training Data
In the rapidly evolving landscape of artificial intelligence and natural language processing, one of the most persistent challenges is obtaining high-quality training data. Whether you're fine-tuning a language model, building a RAG (Retrieval-Augmented Generation) system, or creating a domain-specific chatbot, you need question-answer pairs that are contextually rich, semantically coherent, and representative of your knowledge base.
Traditional approaches to generating Q&A datasets often fall short:
- Manual creation is time-consuming, expensive, and doesn't scale
- Simple extraction methods miss contextual relationships between different parts of documents
- Random chunk-based generation creates isolated Q&A pairs that lack broader context
- Cloud-based solutions raise privacy concerns and incur ongoing costs
This is where our Q&A Dataset Generator comes in—a sophisticated, privacy-first tool that leverages local language models, semantic embeddings, and vector databases to create contextually-aware question-answer pairs from your PDF documents.
What Makes This Tool Different?
The key innovation of this tool lies in its use of semantic similarity to provide contextual awareness during Q&A generation. Here's what sets it apart:
1. Context-Aware Generation
Unlike traditional approaches that process each text chunk in isolation, our tool:
- Generates embeddings for all text chunks
- Builds a FAISS vector index for lightning-fast similarity search
- Finds semantically related chunks for each piece of text
- Uses these related chunks as context when generating Q&A pairs
The result? Answers that are more comprehensive, accurate, and contextually grounded.
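The four steps above can be sketched end to end with toy pieces: hypothetical bag-of-words "embeddings" and a brute-force nearest-neighbor scan standing in for the real embedding model and FAISS.

```python
import math

# Toy stand-in for the pipeline: bag-of-words "embeddings" and a
# brute-force scan instead of a real model and FAISS.
def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

chunks = [
    "machine learning models need data",
    "training data improves machine learning",
    "office chairs and desks",
]
vocab = sorted({w for c in chunks for w in c.split()})
vectors = [embed(c, vocab) for c in chunks]    # step 1: embed every chunk

def find_related(idx, k=1):                    # steps 2-3: index + search
    scored = sorted((l2(vectors[idx], v), i)
                    for i, v in enumerate(vectors) if i != idx)
    return [chunks[i] for _, i in scored[:k]]

# step 4: the related chunks become supporting context for Q&A generation
print(find_related(0))  # the other ML chunk, not the furniture one
```

The real tool swaps in dense model embeddings and a FAISS index, but the flow is the same: embed, index, search, then feed neighbors into the prompt.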
2. Privacy-First Architecture
Everything runs locally:
- Uses Ollama for local LLM inference
- No data leaves your machine
- Perfect for sensitive documents (medical records, legal documents, proprietary research)
- No API costs or rate limits
3. Persistent Vector Indexes
The tool creates and saves FAISS indexes for each PDF, enabling:
- Fast semantic search across your Q&A dataset
- Reusability without regenerating embeddings
- Efficient similarity-based retrieval
- Foundation for building RAG systems
4. User-Friendly Interface
Built with Streamlit, the tool offers:
- Intuitive folder management
- Real-time progress tracking
- Interactive search capabilities
- Dataset management features
Technical Architecture: Under the Hood
Let's dive deep into the technical components that power this tool.
Core Technologies
1. Ollama: Local LLM Infrastructure
Ollama provides the foundation for running large language models locally. In our tool, we use it for two distinct purposes:
Language Generation (e.g., Gemma 3:4b)
llm = Ollama(model=ollama_model)
This model generates the actual questions and answers. Smaller models like Gemma 3:4b offer an excellent balance between quality and speed.
Embeddings Generation (e.g., Qwen3-Embedding)
embeddings_model = OllamaEmbeddings(model=embedding_model)
Specialized embedding models convert text into high-dimensional vectors that capture semantic meaning. The 0.6b embedding model is lightweight yet effective for similarity search.
2. FAISS: Facebook AI Similarity Search
FAISS is Meta's library for efficient similarity search and clustering of dense vectors. Here's why it's perfect for our use case:
- Speed: Searches millions of vectors in milliseconds
- Efficiency: Optimized for both CPU and GPU
- Scalability: Handles large-scale vector databases
- Flexibility: Supports various distance metrics (L2, inner product, cosine)
Our implementation uses IndexFlatL2 for exact L2 distance search:
dimension = embeddings_array.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings_array)
Why L2 Distance?
L2 (Euclidean) distance measures the straight-line distance between vectors in high-dimensional space. While cosine similarity is also popular, the two produce identical nearest-neighbor rankings for normalized (unit-length) embeddings, so L2 loses nothing while keeping an intuitive geometric interpretation.
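For unit vectors the two metrics are tied by the identity ||a - b||^2 = 2 - 2*cos(a, b). A quick stdlib check (the helper functions are illustrative, not from the tool):

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cos(a, b):
    return sum(x * y for x, y in zip(a, b))  # unit vectors: dot product == cosine

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 4.0, 5.0])

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(abs(l2(a, b) ** 2 - (2 - 2 * cos(a, b))) < 1e-9)  # True
```

Because the identity is monotonic, sorting by L2 distance and sorting by cosine similarity return neighbors in the same order.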
3. LangChain: Orchestration Framework
LangChain provides the glue that connects different components:
Text Splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
The recursive splitter intelligently breaks text at natural boundaries (paragraphs, sentences) while maintaining context through overlap.
Prompt Templates
prompt = PromptTemplate.from_template(qa_prompt_template)
chain = prompt | llm | StrOutputParser()
LangChain's expression language (LCEL) creates clean, composable pipelines.
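A dependency-free sketch of what `chain.invoke(...)` does under the hood; `fake_llm` is a stub standing in for the Ollama call, and the template text here is illustrative:

```python
# Stubbed pipeline mirroring: prompt | llm | StrOutputParser()
qa_prompt_template = "Generate one Q&A pair from:\n{context}\nOUTPUT:"

def format_prompt(inputs):            # role of PromptTemplate.from_template(...)
    return qa_prompt_template.format(**inputs)

def fake_llm(prompt):                 # stands in for Ollama(model=...)
    return "Q: What is FAISS?\nA: A similarity-search library.\n"

def parse(output):                    # role of StrOutputParser()
    return output.strip()

def invoke(inputs):                   # role of chain.invoke({...})
    return parse(fake_llm(format_prompt(inputs)))

result = invoke({"context": "FAISS is Meta's similarity-search library."})
print(result)
```

Each `|` stage in LCEL is a Runnable whose output feeds the next stage, exactly as the nested calls do here.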
4. PyMuPDF (fitz): PDF Processing
For reliable, fast PDF text extraction:
doc = fitz.open(file_path)
text = "".join(page.get_text() for page in doc)
PyMuPDF handles complex PDF layouts, embedded fonts, and multi-column documents better than many alternatives.
The Complete Workflow: From PDF to Searchable Q&A
Let's walk through the entire process step by step.
Phase 1: Document Processing
Step 1: PDF Text Extraction
def extract_and_chunk_pdf(file_path, chunk_size=1000, chunk_overlap=200):
    doc = fitz.open(file_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
The tool opens each PDF and extracts raw text. The page-by-page extraction preserves document structure while building a complete text corpus.
Step 2: Intelligent Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.create_documents([text], metadatas=[metadata])
Why 1000 characters with 200 overlap?
- 1000 characters (~150-200 words) provides enough context for meaningful questions
- 200 character overlap ensures concepts spanning chunk boundaries aren't lost
- Maintains semantic coherence across splits
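To see why overlap matters, here is a simplified fixed-stride splitter (the real RecursiveCharacterTextSplitter additionally prefers paragraph and sentence boundaries; this stand-in only shows the sliding window):

```python
def split_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Fixed-stride splitter: each chunk repeats the tail of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

text = "abcdefghijklmnopqrstuvwxyz" * 2   # 52 characters
chunks = split_with_overlap(text)
print(len(chunks))                        # 4 chunks at stride 15
print(chunks[0][-5:] == chunks[1][:5])    # True: boundaries are shared
```

Any concept straddling a boundary appears whole in at least one chunk, which is exactly what the 200-character overlap buys at production scale.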
Phase 2: Embedding Generation
Step 3: Converting Text to Vectors
This is where the magic begins. For each chunk:
embedding = embeddings_model.embed_query(chunk.page_content)
chunk_embeddings.append(embedding)
Each chunk becomes a dense vector (typically 384-1024 dimensions depending on the model). These vectors capture semantic meaning—similar concepts cluster together in vector space.
Example:
- Chunk about "machine learning algorithms" → Vector A
- Chunk about "neural network training" → Vector B (close to A)
- Chunk about "office furniture" → Vector C (far from A and B)
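That clustering can be checked directly with cosine similarity; the 3-d vectors below are made-up stand-ins for real 384-1024-dimensional embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for the three example chunks
vec_a = [0.9, 0.1, 0.0]   # "machine learning algorithms"
vec_b = [0.8, 0.3, 0.1]   # "neural network training"
vec_c = [0.0, 0.1, 0.9]   # "office furniture"

print(cosine(vec_a, vec_b) > cosine(vec_a, vec_c))  # True: A and B cluster
```

Real embedding models produce the same effect automatically: related text ends up with high pairwise cosine similarity, unrelated text with low.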
Phase 3: FAISS Index Construction
Step 4: Building the Similarity Search Index
embeddings_array = np.array(chunk_embeddings).astype('float32')
index = faiss.IndexFlatL2(dimension)
index.add(embeddings_array)
The FAISS index enables fast similarity search. A flat index still compares the query against every stored vector, but FAISS's heavily vectorized implementation keeps per-query latency in the millisecond range; approximate indexes such as IVF (covered later) cut the number of comparisons further for very large collections.
Step 5: Finding Related Context
For each chunk, we find semantically similar chunks:
def find_similar_chunks(query_idx, faiss_index, chunks, k=5):
    # reconstruct() returns a 1-D vector; search() expects a 2-D batch
    query_vector = faiss_index.reconstruct(query_idx).reshape(1, -1)
    distances, indices = faiss_index.search(query_vector, k + 1)
    similar_chunks = []
    for idx in indices[0]:
        if idx != query_idx:
            similar_chunks.append(chunks[idx].page_content)
    return similar_chunks[:k]
This returns the top-k most similar chunks (excluding the query chunk itself).
Phase 4: Context-Aware Q&A Generation
Step 6: Generating Questions and Answers
This is where the contextual awareness shines:
def generate_qa_from_chunk_with_context(main_chunk, related_chunks, llm):
    context = f"MAIN TEXT:\n{main_chunk}\n\n"
    if related_chunks:
        context += "RELATED CONTEXT (semantically similar sections):\n"
        for i, chunk in enumerate(related_chunks, 1):
            context += f"{i}. {chunk}\n\n"
    # ...the assembled context then fills the prompt template and goes to the LLM
The Prompt Strategy:
The tool uses a carefully crafted prompt that:
- Distinguishes between MAIN TEXT (focus) and RELATED CONTEXT (supporting)
- Instructs the LLM to generate questions primarily from main text
- Encourages comprehensive answers leveraging all context
- Enforces consistent output formatting
Example Scenario:
Main Chunk:
"FAISS uses Product Quantization for compression, reducing memory usage by 97%."
Related Chunks (found via similarity):
- "Vector databases enable similarity search at scale..."
- "Embedding models convert text to high-dimensional vectors..."
- "Product Quantization divides vectors into subspaces..."
Generated Q&A:
- Q: "How does FAISS achieve such significant memory reduction?"
- A: "FAISS achieves 97% memory reduction through Product Quantization, a technique that divides high-dimensional vectors into subspaces. This allows vector databases to perform similarity search at scale while maintaining efficiency in both storage and computation."
Notice how the answer synthesizes information from multiple chunks, creating a richer response than using the main chunk alone would provide.
Phase 5: Persistent Storage
Step 7: Saving FAISS Indexes
def save_pdf_faiss_index(pdf_filename, embeddings_list, qa_indices):
    base_name = os.path.splitext(pdf_filename)[0]
    faiss_path = f"embeddings/{base_name}.faiss"
    faiss.write_index(index, faiss_path)
    metadata = {
        'qa_indices': qa_indices,
        'dimension': dimension,
        'total_vectors': len(embeddings_list)
    }
    with open(metadata_path, 'wb') as f:
        pickle.dump(metadata, f)
Each PDF gets its own FAISS index and metadata file, enabling:
- Targeted semantic search within specific documents
- Efficient storage (vectors compressed in binary format)
- Easy management and deletion
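Loading is the mirror image: `faiss.read_index` for the vectors and `pickle.load` for the metadata. The metadata half round-trips with the stdlib alone (the file path and `restored` name below are illustrative):

```python
import os
import pickle
import tempfile

metadata = {
    'qa_indices': [0, 1, 2],
    'dimension': 512,
    'total_vectors': 3,
}

# Save side, as in save_pdf_faiss_index()
path = os.path.join(tempfile.mkdtemp(), "example.pkl")
with open(path, 'wb') as f:
    pickle.dump(metadata, f)

# Load side: what a hypothetical load_pdf_faiss_index() would do,
# alongside index = faiss.read_index(faiss_path)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == metadata)  # True
```

Keeping the metadata beside the binary index lets the tool validate dimensions before searching.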
Step 8: CSV Dataset Export
df = pd.DataFrame(all_qa_data)
df.to_csv(OUTPUT_CSV, index=False)
The resulting dataset includes:
- file: Source PDF filename
- question: Generated question
- answer: Comprehensive answer
- context_chunks_used: Number of related chunks used
Key Features in Detail
1. Flexible Folder Management
The tool offers multiple ways to organize your PDFs:
- Browse existing folders in your current directory
- Create new folders on-the-fly
- Enter custom paths (relative or absolute)
- Automatic validation ensures folders exist before processing
This flexibility accommodates different organizational structures and workflows.
2. Real-Time Progress Tracking
During processing, you see:
- Overall progress across all PDFs
- Per-PDF status updates
- Chunk processing progress
- Embedding generation progress
- Q&A generation status
This transparency helps you estimate completion time and identify bottlenecks.
3. Dual Search Capabilities
Traditional Text Search:
mask = df['question'].str.contains(search_term, case=False, na=False) | \
df['answer'].str.contains(search_term, case=False, na=False)
Simple keyword matching for quick lookups.
Semantic Search with FAISS:
query_embedding = embeddings_model.embed_query(semantic_query)
results = search_faiss_index(index, metadata, query_embedding, top_k=10)
Find conceptually similar Q&A pairs even if they don't share exact keywords.
Example:
- Search query: "How do neural networks learn?"
- Semantic matches might include Q&A about:
- Backpropagation algorithms
- Gradient descent optimization
- Weight adjustment mechanisms
None of these contain "neural networks learn" but are semantically related.
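Under the hood, semantic search is just nearest-neighbor ranking over stored Q&A embeddings. A brute-force stand-in for `search_faiss_index` (hypothetical 2-d embeddings; FAISS replaces the linear scan):

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_index(vectors, qa_pairs, query_vec, top_k=2):
    """Rank stored Q&A vectors by distance to the query embedding."""
    scored = sorted((l2(query_vec, v), qa) for v, qa in zip(vectors, qa_pairs))
    return [qa for _, qa in scored[:top_k]]

# Hypothetical embeddings for three stored Q&A pairs
vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
qa_pairs = ["Q about backprop", "Q about gradient descent", "Q about furniture"]

print(search_index(vectors, qa_pairs, query_vec=[1.0, 0.1], top_k=2))
```

The query "How do neural networks learn?" would land near the first two vectors and far from the third, no keyword overlap required.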
4. Comprehensive Dataset Management
The tool provides cleanup options:
- Clear Q&A Dataset: Remove the CSV while keeping FAISS indexes
- Clear All FAISS Indexes: Remove vector indexes while keeping Q&A data
- Clear Everything: Fresh start
This granular control lets you iterate and experiment efficiently.
5. FAISS Index Management
The dedicated management section shows:
- All available FAISS indexes
- File sizes (storage efficiency)
- Vector counts (dataset size)
- Individual deletion options
Perfect for managing large document collections.
Use Cases and Applications
1. Fine-Tuning Dataset Creation
Generate training data for:
- Domain-specific language models
- Instruction-tuned models
- Question-answering systems
Example: A medical organization could process thousands of research papers to create a specialized medical Q&A dataset for fine-tuning a healthcare chatbot.
2. RAG System Development
Build retrieval-augmented generation systems:
- The FAISS indexes enable fast retrieval
- Q&A pairs serve as example interactions
- Semantic search finds relevant context
Example: A legal firm could create a searchable knowledge base where lawyers query past cases and precedents using natural language.
3. Educational Content Generation
Create study materials:
- Quiz questions from textbooks
- Practice tests from course materials
- Review questions from lecture notes
Example: An online learning platform processes course PDFs to generate thousands of practice questions for students.
4. Documentation Intelligence
Make technical documentation interactive:
- Extract Q&A from API docs
- Create FAQs from user manuals
- Build searchable troubleshooting guides
Example: A software company processes product documentation to create an intelligent help system.
5. Research Literature Review
Analyze academic papers:
- Extract key findings as Q&A
- Find related research through semantic search
- Build knowledge graphs from paper relationships
Example: Researchers query across hundreds of papers to find works discussing specific methodologies.
Best Practices and Optimization Tips
1. Choosing the Right Models
For Generation:
- Small documents (< 100 pages): Gemma 3:4b, Llama 3:8b
- Large documents: Qwen 2.5:7b, Mistral:7b
- High quality needs: Llama 3:70b, Mixtral:8x7b
For Embeddings:
- Speed priority: nomic-embed-text (dimension: 768)
- Quality priority: bge-large (dimension: 1024)
- Balanced: qwen3-embedding:0.6b (dimension: 512)
2. Optimal Chunking Strategy
Adjust based on document type:
Technical documents:
chunk_size=1500 # Longer for complex concepts
chunk_overlap=300 # More overlap for technical continuity
Narrative content:
chunk_size=800 # Shorter for distinct ideas
chunk_overlap=150 # Less overlap needed
Structured content (FAQs, lists):
chunk_size=500 # Very short for discrete items
chunk_overlap=50 # Minimal overlap
3. Context Chunk Optimization
The num_related_chunks parameter significantly impacts quality:
- 0-1 chunks: Fast but limited context
- 2-3 chunks: Balanced (recommended default)
- 4-5 chunks: Rich context but slower, may overwhelm LLM
- 6+ chunks: Diminishing returns, token limit concerns
Pro tip: Start with 3, then experiment based on your document type.
4. Batch Processing Strategy
For large document collections:
- Process in batches (e.g., 10 PDFs at a time)
- Monitor system resources (RAM, GPU if available)
- Use incremental CSV append (already built-in)
- Schedule overnight for hundreds of documents
5. Quality Assurance
Regularly review generated Q&A:
# Sample random entries
sample = df.sample(n=20)
# Check for common issues:
# - Incomplete questions/answers
# - Repetitive content
# - Formatting problems
Adjust prompts based on findings.
Advanced Techniques
1. Custom Prompt Engineering
Modify the QA generation prompt for specific needs:
For factual Q&A:
qa_prompt_template = """Generate a FACTUAL question that can be answered
directly from the main text. Avoid opinion or interpretation questions.
Focus on: who, what, when, where, how many, how much
{context}
OUTPUT:"""
For comprehension Q&A:
qa_prompt_template = """Generate a COMPREHENSION question that requires
understanding and synthesis of the main concept.
Focus on: why, how, what is the significance, what are the implications
{context}
OUTPUT:"""
2. Multi-Stage Filtering
Add quality filters:
def is_quality_qa(question, answer):
    # Check minimum length
    if len(question) < 20 or len(answer) < 50:
        return False
    # Check for question mark
    if not question.strip().endswith('?'):
        return False
    # Check for generic responses
    generic_phrases = ["as mentioned", "the text says", "according to"]
    if any(phrase in answer.lower() for phrase in generic_phrases):
        return False
    return True
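A filter like this can prune rows before export. The sketch below applies it with pandas, using the same column names as the CSV schema (the inlined `is_quality_qa` is a shortened copy for self-containment):

```python
import pandas as pd

def is_quality_qa(question, answer):
    """Shortened copy of the length and question-mark checks."""
    if len(question) < 20 or len(answer) < 50:
        return False
    return question.strip().endswith('?')

df = pd.DataFrame({
    'question': ["What trade-off does the IVF index's nprobe control?",
                 "Too short?"],
    'answer': ["nprobe sets how many inverted lists are scanned, "
               "trading search speed against recall.",
               "Short."],
})

# Keep only rows that pass the filter
mask = df.apply(lambda row: is_quality_qa(row['question'], row['answer']), axis=1)
clean = df[mask]
print(len(clean))  # 1
```

Running the filter before `to_csv` keeps malformed generations out of the training set entirely.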
3. Hierarchical Chunking
For documents with clear structure:
# Create parent chunks (sections)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=500
)
# Create child chunks (paragraphs)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
# Generate Q&A from children, but use parent as context
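One way to wire the child-to-parent lookup, sketched with a fixed-stride splitter and offset bookkeeping (the `parent_context` helper is hypothetical; real chunks would carry their offsets in metadata):

```python
def make_chunks(text, size, overlap):
    """Fixed-stride splitter returning (start_offset, chunk_text) pairs."""
    step = size - overlap
    return [(i, text[i:i + size])
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "A" * 9000  # placeholder document text
parents = make_chunks(text, size=3000, overlap=500)
children = make_chunks(text, size=1000, overlap=200)

def parent_context(child_start):
    """Return the last parent chunk starting at or before the child."""
    return max((p for p in parents if p[0] <= child_start),
               key=lambda p: p[0])[1]

# Q&A would be generated from children[4], with its parent section as context
start, child = children[4]
context = parent_context(start)
print(start, len(context))
```

The child stays small enough for a focused question while the parent supplies the surrounding section as context.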
4. Embedding Model Ensembles
Use multiple embedding models for robustness:
# Generate embeddings with multiple models
embedding1 = model1.embed_query(text)
embedding2 = model2.embed_query(text)
# Concatenate or average
combined = np.concatenate([embedding1, embedding2])
# or
averaged = (np.array(embedding1) + np.array(embedding2)) / 2
5. Dynamic Context Window
Adjust context based on main chunk complexity:
def get_adaptive_context_count(main_chunk):
    # More context for complex chunks
    if len(main_chunk) < 500:
        return 5  # Short chunk needs more context
    elif len(main_chunk) > 1500:
        return 2  # Long chunk already has context
    else:
        return 3  # Standard
Performance Optimization
Memory Management
For large document sets:
import gc

# Process in batches to manage memory
def process_large_pdf_set(pdf_files, batch_size=10):
    for i in range(0, len(pdf_files), batch_size):
        batch = pdf_files[i:i+batch_size]
        process_batch(batch)
        # Clear memory between batches
        gc.collect()
FAISS Optimization
For very large indexes (millions of vectors):
# Use IVF (Inverted File) index for faster search
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist=100)
# Train the index
index.train(embeddings_array)
index.add(embeddings_array)
# Search with probe parameter
index.nprobe = 10 # Trade-off between speed and accuracy
Parallel Processing
Speed up embedding generation:
from concurrent.futures import ThreadPoolExecutor
def generate_embeddings_parallel(chunks, model, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        embeddings = list(executor.map(
            lambda c: model.embed_query(c.page_content),
            chunks
        ))
    return embeddings
Troubleshooting Common Issues
Issue 1: Out of Memory Errors
Symptoms: Process crashes during embedding generation
Solutions:
- Reduce chunk size
- Process fewer PDFs at once
- Use smaller embedding model
- Enable batch processing
Issue 2: Poor Quality Q&A
Symptoms: Generic, repetitive, or incomplete Q&A pairs
Solutions:
- Adjust prompt template
- Increase related chunks for more context
- Try different generation model
- Increase chunk size for more information
Issue 3: Slow Processing
Symptoms: Takes hours for small document sets
Solutions:
- Use smaller/faster models
- Reduce number of related chunks
- Enable GPU acceleration in Ollama
- Implement parallel processing
Issue 4: FAISS Index Errors
Symptoms: Can't load or search indexes
Solutions:
- Ensure float32 dtype for embeddings
- Check dimension consistency
- Verify index files aren't corrupted
- Regenerate indexes if needed
Issue 5: Inconsistent Formatting
Symptoms: Q&A pairs missing or malformed
Solutions:
- Strengthen prompt formatting instructions
- Implement regex-based parsing
- Add validation before saving
- Use few-shot examples in prompt
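For the regex-based parsing option, a validator for the expected `Q: ... A: ...` format might look like this (the markers depend on your prompt; `parse_qa` is a hypothetical helper):

```python
import re

def parse_qa(output):
    """Extract (question, answer) from 'Q: ... A: ...' text, or None if malformed."""
    match = re.search(r"Q:\s*(.+?)\s*A:\s*(.+)", output, re.DOTALL)
    if not match:
        return None
    question, answer = match.group(1).strip(), match.group(2).strip()
    if not question.endswith('?'):
        return None
    return question, answer

good = "Q: How does FAISS reduce memory?\nA: Through Product Quantization."
bad = "Here is some text without the expected markers."
print(parse_qa(good))
print(parse_qa(bad))  # None
```

Rejected outputs can be retried or logged rather than silently written to the CSV.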
Future Enhancements and Roadmap
Near-Term Improvements
1. Multi-Modal Support
- Extract and process images from PDFs
- Generate Q&A about diagrams and charts
- Use vision-language models
2. Active Learning
- Flag uncertain Q&A for human review
- Learn from corrections
- Iteratively improve quality
3. Customizable Templates
- Different Q&A formats (multiple choice, true/false)
- Domain-specific templates (medical, legal, technical)
- Difficulty levels
Medium-Term Features
4. Advanced FAISS Indexes
- Implement IVF for large-scale search
- Add PQ (Product Quantization) for compression
- Support GPU-accelerated search
5. Metadata Enrichment
- Extract document structure (headings, sections)
- Preserve page numbers for citations
- Add topic/category tags
6. Quality Metrics
- Automatic quality scoring
- Diversity measures
- Coherence evaluation
Long-Term Vision
7. Distributed Processing
- Multi-machine processing for massive datasets
- Distributed FAISS indexes
- Cluster coordination
8. Interactive Refinement
- Web interface for editing Q&A
- Feedback loop for model improvement
- Collaborative annotation
9. Export Formats
- JSONL for LLM fine-tuning
- Hugging Face datasets
- Custom formats for specific platforms
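Of these, JSONL export needs only the stdlib; the `instruction`/`output` field names below are one common convention, not a fixed standard:

```python
import json

# Convert Q&A rows (as read from the CSV) into fine-tuning JSONL records
rows = [
    {"file": "paper.pdf", "question": "What is FAISS?",
     "answer": "A library for similarity search."},
]

lines = [
    json.dumps({"instruction": r["question"], "output": r["answer"]})
    for r in rows
]
print(lines[0])
```

Writing one JSON object per line (`"\n".join(lines)`) produces a file most fine-tuning toolchains accept directly.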
Conclusion
Building high-quality Q&A datasets is no longer a bottleneck for AI development. This tool demonstrates how combining modern technologies—local LLMs, semantic embeddings, and vector databases—can create a powerful, privacy-first solution for generating contextually-aware training data.
Whether you're a researcher, developer, educator, or enterprise team, this approach offers a scalable, cost-effective way to transform your document collections into intelligent, searchable knowledge bases.
The future of AI lies not just in larger models, but in better data. With tools like this, we're democratizing access to high-quality training data, enabling everyone to build specialized AI systems tailored to their unique domains.