Engineering

Topic Modeling in Python: LDA, NMF, and BERTopic

Boundev Team

Mar 9, 2026
14 min read

Topic modeling underpins many modern NLP pipelines that need to make sense of unstructured text at scale. Customer support ticket classification, research paper categorization, social media sentiment clustering, and content recommendation engines all depend on algorithms that can discover latent thematic structures without labeled training data. Python’s ecosystem offers three dominant approaches: Latent Dirichlet Allocation (LDA) for probabilistic topic discovery, Non-Negative Matrix Factorization (NMF) for deterministic decomposition, and BERTopic for transformer-powered semantic clustering. This guide compares the architecture, preprocessing requirements, coherence benchmarks, and production deployment patterns of all three, with implementation guidance using Gensim, scikit-learn, and the BERTopic library.

Key Takeaways

  • LDA is a probabilistic generative model best suited for long-form documents where interpretability and stable topic distributions matter more than semantic precision
  • NMF produces deterministic, non-negative factorizations that often outperform LDA on coherence metrics while running 3–5x faster on medium-sized corpora
  • BERTopic leverages transformer embeddings (BERT/Sentence-BERT) with UMAP + HDBSCAN clustering, achieving 15–25% higher coherence scores than LDA on short-text datasets
  • Preprocessing depth varies dramatically: LDA/NMF require tokenization, stopword removal, and lemmatization, while BERTopic works best with raw sentences to preserve context
  • Boundev’s dedicated engineering teams build production NLP pipelines that integrate topic modeling with real-time classification, search, and recommendation systems

At Boundev, we build NLP systems that process millions of documents for enterprise clients — from customer support ticket routing to regulatory document analysis. The choice of topic modeling algorithm is never academic; it directly determines classification accuracy, inference latency, and whether the system can handle the client’s data scale. We have shipped all three approaches in production and know exactly where each one excels and where it breaks down.

This guide provides the engineering-level comparison you need to choose the right algorithm for your specific dataset characteristics, latency requirements, and interpretability needs.

Algorithm Architecture Comparison

Each topic modeling approach makes fundamentally different assumptions about how topics exist in text. Understanding these architectural differences is essential for choosing the right algorithm before writing a single line of Python.

| Characteristic | LDA (Gensim) | NMF (scikit-learn) | BERTopic |
| --- | --- | --- | --- |
| Model Type | Probabilistic generative | Matrix factorization | Embedding + clustering |
| Text Representation | Bag-of-words (BoW) | TF-IDF matrix | Dense sentence embeddings |
| Topic Count | Must specify k upfront | Must specify k upfront | Auto-detected via HDBSCAN |
| Deterministic | No (stochastic sampling) | Yes (given same init) | No (UMAP randomness) |
| Semantic Understanding | Word co-occurrence only | Word co-occurrence only | Full contextual semantics |
| Short Text Performance | Poor (sparse BoW) | Moderate | Excellent |

LDA: Latent Dirichlet Allocation

LDA remains the most widely deployed topic model in production systems because of its interpretability and mature library support. It models each document as a mixture of topics, and each topic as a distribution over words, using Dirichlet priors to enforce sparse, interpretable distributions.

When to Use LDA

  • Long-form documents (research papers, articles, reports)
  • When topic interpretability is a hard requirement
  • Corpora with well-separated thematic clusters
  • Resource-constrained environments (CPU-only)

LDA Limitations

  • Poor on short text (tweets, chat messages, search queries)
  • Must pre-specify number of topics (k)
  • Stochastic — different runs produce different results
  • Heavy preprocessing requirement (tokenize, stem, lemmatize)

NMF: Non-Negative Matrix Factorization

NMF decomposes the document-term matrix (typically TF-IDF weighted) into two non-negative matrices: one mapping terms to topics and another mapping documents to topics. The non-negativity constraint produces naturally interpretable, additive topic representations.

| NMF Advantage | Engineering Impact |
| --- | --- |
| Deterministic Output | Same input always produces the same topics, critical for reproducible pipelines and CI/CD testing |
| 3–5x Faster Than LDA | The coordinate descent solver converges faster than LDA’s iterative variational inference, enabling real-time retraining on updated corpora |
| Higher Coherence on Medium Corpora | Research benchmarks show NMF averaging 0.55–0.65 coherence vs LDA’s 0.40–0.55 on datasets under 100K documents |
| Sparse Output | Produces sparser topic-word distributions than LDA, resulting in cleaner, more distinct topic definitions |

BERTopic: Transformer-Powered Topic Modeling

BERTopic represents the current state-of-the-art by combining pre-trained transformer embeddings with dimensionality reduction and density-based clustering. Unlike LDA and NMF, BERTopic captures contextual semantic meaning rather than relying on word co-occurrence statistics.

BERTopic Pipeline Architecture

The four-stage pipeline that powers BERTopic’s semantic topic discovery.

1. Document Embedding

Sentence-BERT (or any Sentence Transformer model) converts each document into a dense 384–768 dimensional vector that captures semantic meaning. No tokenization or stopword removal needed.

2. Dimensionality Reduction (UMAP)

UMAP reduces 768-dimensional embeddings to 5–15 dimensions while preserving local and global structure. This step is critical — HDBSCAN cannot cluster effectively in high-dimensional space.

3. Density-Based Clustering (HDBSCAN)

HDBSCAN automatically discovers the number of clusters (topics) and identifies outlier documents that do not belong to any topic. No need to specify k in advance.

4. Topic Representation (c-TF-IDF)

Class-based TF-IDF extracts the most representative words for each discovered cluster, creating human-readable topic labels from the semantically grouped documents.

Build Production NLP Pipelines

Boundev’s staff augmentation engineers specialize in deploying topic modeling pipelines at scale — from preprocessing and model training through real-time inference APIs and monitoring dashboards.

Talk to Our ML Engineers

Preprocessing Pipeline Comparison

The preprocessing requirements differ dramatically between classical and neural approaches. Getting this wrong is the most common source of poor topic quality in production systems.

| Preprocessing Step | LDA | NMF | BERTopic |
| --- | --- | --- | --- |
| Lowercasing | Required | Required | Not needed |
| Tokenization | Required (NLTK/spaCy) | Handled by TfidfVectorizer | Not needed (model tokenizes) |
| Stopword Removal | Critical for quality | Critical for quality | Can degrade quality |
| Lemmatization | Strongly recommended (spaCy) | Recommended | Not needed |
| N-gram Generation | Bigrams/trigrams via Gensim Phrases | Via TfidfVectorizer ngram_range | Captured inherently by embeddings |
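The classical (LDA/NMF) column can be sketched in pure Python. This is a deliberately dependency-light stand-in: it borrows scikit-learn's built-in English stopword list and omits the lemmatization step, which a real pipeline would hand to spaCy.

```python
# Minimal classical preprocessing for LDA/NMF: lowercase, tokenize, drop
# stopwords and very short tokens. Lemmatization is intentionally omitted;
# add spaCy (e.g. token.lemma_) for production quality.
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(doc: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", doc.lower())  # lowercase + crude tokenize
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS and len(t) > 2]


print(preprocess("The courts issued several rulings on the contract."))
```

Per the table, this function should feed LDA and NMF only; BERTopic input should skip it entirely.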

Boundev Insight: We observe that BERTopic coherence scores drop by 8–12% when teams apply traditional preprocessing (stopword removal + lemmatization) to the input text. Transformer models need the full sentence context to generate meaningful embeddings. Preprocessing that helps LDA actively hurts BERTopic.

Evaluation and Tuning

Choosing between algorithms requires objective metrics. We use these evaluation approaches in our software outsourcing engagements to validate topic model quality before deployment.

Benchmark Performance Ranges

Typical coherence scores and processing characteristics across the three approaches.

  • LDA average C_v coherence (long documents): 0.45
  • NMF average C_v coherence (medium corpora): 0.61
  • BERTopic average coherence (short text): 0.72
  • NMF speed advantage over LDA: 3–5x

Common Topic Modeling Mistakes:

  • Using perplexity alone — lower perplexity does not guarantee more interpretable topics
  • Applying LDA to tweets/chat — bag-of-words fails on short, noisy text
  • Preprocessing BERTopic input — stripping stopwords destroys sentence context
  • Skipping hyperparameter search — default k often produces incoherent topics

Production Best Practices:

  • Use C_v coherence — most reliable single metric for topic quality
  • Grid search over k — test 5–50 topics with coherence plots for LDA/NMF
  • Human evaluation sample — score 50 random topic-document pairs for relevance
  • Monitor topic drift — retrain periodically as corpus vocabulary evolves

FAQ

What is topic modeling in Python?

Topic modeling is an unsupervised machine learning technique that discovers hidden thematic structures in large collections of unstructured text. In Python, it is implemented using libraries like Gensim (for LDA), scikit-learn (for NMF), and BERTopic (for transformer-based topic discovery). Each approach analyzes word patterns and semantic relationships to automatically group documents into coherent topics without requiring labeled training data.

Which is better: LDA or BERTopic?

BERTopic generally produces higher coherence scores (15–25% improvement) and better handles short text, polysemous words, and nuanced topics because it uses contextual transformer embeddings rather than word co-occurrence statistics. However, LDA remains preferable when you need full interpretability of probabilistic topic-document distributions, when running on CPU-only environments, or when working with well-separated topics in long-form documents. Choose based on your text length, interpretability requirements, and compute resources.

How do I choose the number of topics for LDA?

Use a grid search over a range of k values (typically 5–50) and compute the C_v coherence score for each model. Plot coherence against k and select the value where the curve peaks or begins to plateau. Complement this with human evaluation by sampling 50–100 documents and scoring topic assignment relevance. Avoid using perplexity alone as the selection criterion — lower perplexity does not necessarily correlate with more interpretable topics.

What preprocessing does BERTopic need?

BERTopic works best with minimal preprocessing — ideally raw sentences or paragraphs. Unlike LDA and NMF, you should NOT remove stopwords, stem, or lemmatize the text before passing it to BERTopic. The underlying Sentence-BERT model needs complete sentence context to generate meaningful embeddings. The only recommended preprocessing is removing HTML tags, URLs, and non-text artifacts that are not part of the natural language content.
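That recommended minimal cleanup can be sketched with the standard library alone. The regexes and function name here are illustrative, not a BERTopic API; the point is that casing, stopwords, and word forms are left untouched.

```python
# The only preprocessing BERTopic usually needs: strip HTML tags, URLs, and
# other non-language artifacts, leaving the sentence itself intact.
import re


def clean_for_bertopic(doc: str) -> str:
    doc = re.sub(r"<[^>]+>", " ", doc)       # drop HTML tags
    doc = re.sub(r"https?://\S+", " ", doc)  # drop URLs
    return re.sub(r"\s+", " ", doc).strip()  # collapse leftover whitespace


print(clean_for_bertopic("<p>Visit https://example.com for the full report.</p>"))
```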

Can I use topic modeling for real-time classification?

Yes, but the approach differs by algorithm. For LDA and NMF, train the model offline and use the fitted model to transform new documents into topic distributions at inference time (sub-millisecond latency). For BERTopic, pre-compute embeddings and use the trained model’s transform method, though embedding generation adds 10–100ms latency per document depending on model size and hardware. For sub-10ms requirements, consider distilling BERTopic topics into a lightweight classifier trained on the model’s output labels.
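The distillation idea at the end of that answer can be sketched with scikit-learn. The topic labels below stand in for whatever a fitted LDA/NMF/BERTopic model assigned offline; the TF-IDF + logistic regression pairing is one reasonable lightweight choice, not the only one.

```python
# Distill topic-model output into a fast classifier: use the labels the
# offline topic model assigned as supervision for a lightweight pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "training data for the neural model",
    "the model overfits the training set",
    "court ruling on the contract dispute",
    "legal contract reviewed by the court",
]
topic_labels = [0, 0, 1, 1]  # assigned by a previously fitted topic model

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, topic_labels)

# Inference is now a sparse dot product plus a softmax: well under 10 ms.
print(clf.predict(["judge issued a court ruling"]))
```

The distilled classifier cannot discover new topics, so pair it with the periodic retraining and topic-drift monitoring described earlier.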

Tags

#Python, #NLP, #Machine Learning, #Topic Modeling, #Data Science

Boundev Team

At Boundev, we're passionate about technology and innovation. Our team of experts shares insights on the latest trends in AI, software development, and digital transformation.

Ready to Transform Your Business?

Let Boundev help you leverage cutting-edge technology to drive growth and innovation.

Get in Touch
