{"id":3661,"date":"2025-05-08T07:02:24","date_gmt":"2025-05-08T07:02:24","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/08\/a-practical-guide-to-bertopic-for-transformer-based-topic-modeling\/"},"modified":"2025-05-08T07:02:24","modified_gmt":"2025-05-08T07:02:24","slug":"a-practical-guide-to-bertopic-for-transformer-based-topic-modeling","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/08\/a-practical-guide-to-bertopic-for-transformer-based-topic-modeling\/","title":{"rendered":"A Practical Guide to BERTopic for Transformer-Based Topic Modeling"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Topic modeling has a wide range of use cases in the natural language processing (NLP) domain, such as document tagging, survey analysis, and content organization. It is an unsupervised learning technique, which makes it cost-effective because it reduces the resources required to collect human-annotated data. We will dive deeper into BERTopic, a popular Python library for transformer-based topic modeling, to help us process financial news faster and reveal how the trending topics change over time.<br \/>BERTopic consists of 6 core modules that can be customized to suit different use cases. 
In this article, we\u2019ll examine and experiment with each module individually and explore how they work together to produce the end results.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"658\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/Screenshot-2025-05-06-at-7.34.43%25E2%2580%25AFPM-1024x658.png?resize=1024%2C658&#038;ssl=1\" alt=\"BERTopic: Transformer-Based Topic Modeling\" class=\"wp-image-603396\"><figcaption class=\"wp-element-caption\">BERTopic: Transformer-Based Topic Modeling (unless otherwise noted, all images are by the author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">At a high level, a typical BERTopic architecture is composed of:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Embeddings: transform text into vector representations (i.e. embeddings) that capture semantic meaning using sentence-transformer models.<\/li>\n<li class=\"wp-block-list-item\">Dimensionality Reduction: reduce the high-dimensional embeddings to a lower-dimensional space while preserving important relationships, using techniques such as PCA or UMAP.<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/towardsdatascience.com\/tag\/clustering\/\" title=\"Clustering\">Clustering<\/a>: group similar documents together based on their reduced-dimensionality embeddings to form distinct topics, using algorithms such as HDBSCAN or K-Means.<\/li>\n<li class=\"wp-block-list-item\">Vectorizers: after topic clusters are formed, vectorizers convert text into numerical features that can be used for topic analysis, for example the count vectorizer or the online vectorizer.<\/li>\n<li class=\"wp-block-list-item\">c-TF-IDF: calculate importance scores for words within and across topic clusters to identify key terms.<\/li>\n<li class=\"wp-block-list-item\">Representation Model: leverage semantic similarity between the embedding of 
candidate keywords and the embedding of documents to find the most representative topic keywords, including KeyBERT, LLM-based techniques \u2026<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Project Overview<\/h2>\n<p class=\"wp-block-paragraph\">In this practical application, we will use <a href=\"https:\/\/towardsdatascience.com\/tag\/topic-modeling\/\" title=\"Topic Modeling\">Topic Modeling<\/a> to identify trending topics in Apple financial news. Using NewsAPI, we collect daily top-ranked Apple stock news from Google Search and compile them into a dataset of 250 documents, with each document containing financial news for one specific day. However, this is not the main focus of this article so feel free to replace it with your own dataset. The objective is to demonstrate how to transform raw text documents containing top Google search results into meaningful topic keywords and refine those keywords to be more representative.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"241\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-33-1024x241.png?resize=1024%2C241&#038;ssl=1\" alt=\"\" class=\"wp-image-603397\"><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">BERTopic\u2019s 6 Fundamental Modules<\/h2>\n<h3 class=\"wp-block-heading\">1. 
Embeddings<\/h3>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"612\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-70-1024x612.png?resize=1024%2C612&#038;ssl=1\" alt=\"embeddings\" class=\"wp-image-603544\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">BERTopic uses sentence transformer models as its first building block, converting sentences into dense vector representations (i.e. embeddings) that capture semantic meaning. These models are based on transformer architectures like BERT and are specifically trained to produce high-quality sentence embeddings. We can then measure the semantic similarity between sentences using the cosine similarity of their embeddings. Common models include:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">all-MiniLM-L6-v2: lightweight, fast, good general performance<\/li>\n<li class=\"wp-block-list-item\">BAAI\/bge-base-en-v1.5: larger model with stronger semantic understanding, at the cost of much slower training and inference.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">There is a wide range of pre-trained sentence transformers to choose from on the \u201c<a href=\"https:\/\/www.sbert.net\/docs\/sentence_transformer\/pretrained_models.html\">Sentence Transformers<\/a>\u201d website and the <a href=\"https:\/\/huggingface.co\/models\">Hugging Face model hub<\/a>. 
We can use a few lines of code to load a sentence transformer model and encode the text sequences into high-dimensional numerical embeddings.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sentence_transformers import SentenceTransformer\n\n# Initialize model\nmodel = SentenceTransformer(\"all-MiniLM-L6-v2\")\n\n# Convert sentences to embeddings\nsentences = [\"First sentence\", \"Second sentence\"]\nembeddings = model.encode(sentences)  # Returns numpy array of embeddings<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In this instance, we input a collection of financial news data from October 2024 to March 2025 into the sentence transformer \u201call-MiniLM-L6-v2\u201d. As shown in the result below, these text documents are transformed into vector embeddings with a shape of 250 rows, each with 384 dimensions.<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-35.png?ssl=1\" alt=\"embeddings result\" class=\"wp-image-603399\" style=\"width:524px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">We can then feed this sentence transformer to the BERTopic pipeline and keep all other modules at their default settings.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sentence_transformers import SentenceTransformer\nfrom bertopic import BERTopic\n\nemb_minilm = SentenceTransformer(\"all-MiniLM-L6-v2\")\ntopic_model = BERTopic(\n    embedding_model=emb_minilm,\n)\n\ntopic_model.fit_transform(docs)\ntopic_model.get_topic_info()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As the end result, we get the following topic representation.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"159\" width=\"1024\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-36-1024x159.png?resize=1024%2C159&#038;ssl=1\" alt=\"topic result \" class=\"wp-image-603400\"><\/figure>\n<p class=\"wp-block-paragraph\">Using the larger and more powerful \u201cbge-base-en-v1.5\u201d model, we get the following result, which is slightly more meaningful than that of the smaller \u201call-MiniLM-L6-v2\u201d model but still leaves considerable room for improvement. <\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"284\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-38-1024x284.png?resize=1024%2C284&#038;ssl=1\" alt=\"\" class=\"wp-image-603402\"><\/figure>\n<p class=\"wp-block-paragraph\">One area for improvement is reducing the dimensionality, because sentence transformers typically produce high-dimensional embeddings. As BERTopic relies on the spatial proximity between embeddings to form meaningful clusters, it is crucial to apply a dimensionality reduction technique so that the embedding space is less sparse. Therefore, we are going to introduce various dimensionality reduction techniques in the next section.<\/p>\n<h3 class=\"wp-block-heading\">2. Dimensionality Reduction<\/h3>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"678\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-71-1024x678.png?resize=1024%2C678&#038;ssl=1\" alt=\"dimensionality reduction\" class=\"wp-image-603545\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">After converting the financial news documents into embeddings, we face the problem of high dimensionality. 
Since each embedding contains 384 dimensions, the vector space becomes too sparse for distance measurements between two embeddings to be meaningful. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) are common techniques for reducing dimensionality while preserving the important structure in the data. We will look at UMAP, BERTopic\u2019s default dimensionality reduction technique, in more detail. It is a non-linear algorithm rooted in topological data analysis that seeks to preserve the structure within the data. It works by extending a radius outwards from each data point and connecting each point with its close neighbors. You can explore interactive UMAP visualizations on the \u201c<a href=\"https:\/\/pair-code.github.io\/understanding-umap\/\">Understanding UMAP<\/a>\u201d website.<\/p>\n<p class=\"wp-block-paragraph\"><strong>UMAP <code>n_neighbors<\/code> Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">An important UMAP parameter is <code>n_neighbors<\/code>, which controls how UMAP balances local and global structure in the data. Low values of <code>n_neighbors<\/code> force UMAP to concentrate on local structure, while large values make it look at larger neighborhoods of each point.<br \/>The diagram below shows multiple scatterplots demonstrating the effect of different <code>n_neighbors<\/code> values, with each plot visualizing the embeddings in a 2-dimensional space after applying UMAP dimensionality reduction. <\/p>\n<p class=\"wp-block-paragraph\">With smaller <code>n_neighbors<\/code> values (e.g. n=2, n=5), the plots show more tightly coupled micro clusters, indicating a focus on local structure. 
As <code>n_neighbors<\/code> increases (towards n=100, n=150), the points form more cohesive global patterns, demonstrating how larger neighborhood sizes help UMAP capture broader relationships in the data.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1022\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-39-1024x1022.png?resize=1024%2C1022&#038;ssl=1\" alt=\"UMAP experimentation\" class=\"wp-image-603403\" style=\"width:483px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\"><strong>UMAP <code>min_dist<\/code> Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The <code>min_dist<\/code> parameter in UMAP controls how tightly points are allowed to be packed together in the lower-dimensional representation. It sets the minimum distance between points in the embedding space. A smaller <code>min_dist<\/code> allows points to be packed very closely together, whereas a larger <code>min_dist<\/code> forces points to be more scattered and evenly spread out. The diagram below shows an experiment with <code>min_dist<\/code> values from 0.0001 to 1 while keeping <code>n_neighbors=5<\/code>. When <code>min_dist<\/code> is set to smaller values, UMAP emphasizes preserving local structure, whereas larger values transform the embeddings into a circular shape. 
<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-40-1024x1024.png?resize=1024%2C1024&#038;ssl=1\" alt=\"UMAP experimentation\" class=\"wp-image-603404\" style=\"width:504px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">We decide to set <code>n_neighbors=5<\/code> and <code>min_dist=0.01<\/code> based on the hyperparameter tuning results, as it forms more distinct data clusters that are easier for the subsequent clustering model to process.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import umap\n\nUMAP_N = 5\nUMAP_DIST = 0.01\numap_model = umap.UMAP(\n    n_neighbors=UMAP_N,\n    min_dist=UMAP_DIST, \n    random_state=0\n)<\/code><\/pre>\n<h3 class=\"wp-block-heading\">3. Clustering<\/h3>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"681\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-72-1024x681.png?resize=1024%2C681&#038;ssl=1\" alt=\"clustering\" class=\"wp-image-603546\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">Following the dimensionality reduction module, it\u2019s the process of grouping embeddings with close proximity into clusters. This process is fundamental to topic modeling, as it categorizes relevant text documents together by looking at their semantic relationships. BERTopic employs HDBSCAN model by default, which has the advantage in capturing structures with diverse densities. 
Additionally, BERTopic provides the flexibility of choosing other clustering models based on the nature of the dataset, such as K-Means (for spherical, equally-sized clusters) or agglomerative clustering (for hierarchical clusters).<\/p>\n<p class=\"wp-block-paragraph\"><strong>HDBSCAN Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">We will explore how two important parameters, <code>min_cluster_size<\/code> and <code>min_samples<\/code>, impact the behavior of the HDBSCAN model.<br \/><code>min_cluster_size<\/code> determines the minimum number of data points required to form a cluster, and clusters not meeting the threshold are treated as outliers. If you set <code>min_cluster_size<\/code> too low, you might get many small, unstable clusters which might be noise. If you set it too high, you might merge multiple clusters into one, losing their distinct characteristics.<\/p>\n<p class=\"wp-block-paragraph\"><code>min_samples<\/code> sets k, the number of neighbors used when calculating the distance between a point\u00a0and its k-th nearest neighbor, which determines how strict the cluster formation process is. The larger the <code>min_samples<\/code> value, the more conservative the clustering becomes, as clusters are restricted to forming in dense areas and sparse points are classified as noise.<\/p>\n<p class=\"wp-block-paragraph\">The condensed tree is a useful tool to help us decide appropriate values for these two parameters. Clusters that persist for a large range of lambda values (shown on the left vertical axis in a condensed tree plot) are considered stable and more meaningful. We prefer the selected clusters to be both tall (more stable) and wide (large cluster size). We use <code>condensed_tree_<\/code> from HDBSCAN to compare <code>min_cluster_size<\/code> values from 3 to 50, then visualize the data points in their vector space, color coded by the predicted cluster labels. 
As we progress through different <code>min_cluster_size<\/code> values, we can identify optimal values that group close data points together.<\/p>\n<p class=\"wp-block-paragraph\">In this experiment, we selected <code>min_cluster_size=15<\/code> as it generates 4 clusters (highlighted in red in the condensed tree plot below) with good stability and cluster size. The scatterplot also indicates reasonable cluster formation based on proximity and density.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"976\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-41-1024x976.png?resize=1024%2C976&#038;ssl=1\" alt=\"Condensed Tree for HDBSCAN min_cluster_size\" class=\"wp-image-603406\" style=\"width:408px;height:auto\"><figcaption class=\"wp-element-caption\">Condensed Trees for HDBSCAN <code>min_cluster_size<\/code> Experimentation<\/figcaption><\/figure>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"1010\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-42-1010x1024.png?resize=1010%2C1024&#038;ssl=1\" alt=\"Scatterplots for HDBSCAN min_cluster_size\" class=\"wp-image-603407\" style=\"width:407px;height:auto\"><figcaption class=\"wp-element-caption\">Scatterplots for HDBSCAN <code>min_cluster_size<\/code> Experimentation<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We then carried out a similar exercise to compare <code>min_samples<\/code> values from 1 to 80 and selected <code>min_samples=5<\/code>. 
As you can observe from the visuals, the parameters <code>min_samples<\/code> and <code>min_cluster_size<\/code> exert distinct impacts on the clustering process.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"976\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-43-1024x976.png?resize=1024%2C976&#038;ssl=1\" alt=\"\" class=\"wp-image-603408\" style=\"width:458px;height:auto\"><figcaption class=\"wp-element-caption\">Condensed Trees for HDBSCAN <code>min_samples<\/code> Experimentation<\/figcaption><\/figure>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"1013\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-44-1013x1024.png?resize=1013%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-603409\" style=\"width:444px;height:auto\"><figcaption class=\"wp-element-caption\">Scatterplots for HDBSCAN <code>min_samples<\/code> Experimentation<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import hdbscan\n\nMIN_CLUSTER_SIZE = 15\nMIN_SAMPLES = 5\n# HDBSCAN has no random_state parameter; it is deterministic for a fixed input\nclustering_model = hdbscan.HDBSCAN(\n    min_cluster_size=MIN_CLUSTER_SIZE,\n    metric='euclidean',\n    cluster_selection_method='eom',\n    min_samples=MIN_SAMPLES\n)\n\ntopic_model = BERTopic(\n    embedding_model=emb_bge,\n    umap_model=umap_model,\n    hdbscan_model=clustering_model, \n)\n\ntopic_model.fit_transform(docs)\ntopic_model.get_topic_info()<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong>K-Means Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Compared to HDBSCAN, using K-Means clustering allows us to generate more granular topics by specifying the <code>n_clusters<\/code> parameter, which directly controls the number of topics generated 
from the text documents.<\/p>\n<p class=\"wp-block-paragraph\">The image below shows a series of scatter plots demonstrating different clustering results when varying the number of clusters (<code>n_clusters<\/code>) from 3 to 50 using K-Means. With <code>n_clusters=3<\/code>, the data is divided into just three large groups. As <code>n_clusters<\/code> increases (5, 8, 10, etc.), the data points are split into more granular groupings. Overall, K-Means forms more rounded clusters than HDBSCAN. We selected <code>n_clusters=8<\/code>, where the clusters are neither too broad (losing important distinctions) nor too granular (creating artificial divisions). Additionally, it is a reasonable number of topics for categorizing 250 days of financial news. However, feel free to adjust the code snippet to your requirements if you need to identify more granular or broader topics.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"1010\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-45-1010x1024.png?resize=1010%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-603410\" style=\"width:482px;height:auto\"><figcaption class=\"wp-element-caption\">Scatterplots for K-Means <code>n_clusters<\/code> Experimentation<\/figcaption><\/figure>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sklearn.cluster import KMeans\n\nN_CLUSTER = 8\nclustering_model = KMeans(\n    n_clusters=N_CLUSTER,\n    random_state=0\n)\n\n# BERTopic accepts other clustering models through the hdbscan_model argument\ntopic_model = BERTopic(\n    embedding_model=emb_bge,\n    umap_model=umap_model,\n    hdbscan_model=clustering_model, \n)\n\ntopic_model.fit_transform(docs)\ntopic_model.get_topic_info()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Comparing the topic cluster results of K-Means and HDBSCAN reveals that K-Means produces more distinct and meaningful topic representations. 
However, both methods still generate many stop words, indicating that subsequent modules are critical to refine the topic representations.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"105\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-46-1024x105.png?resize=1024%2C105&#038;ssl=1\" alt=\"HDBSCAN Output\" class=\"wp-image-603419\"><figcaption class=\"wp-element-caption\">HDBSCAN Output<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"166\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-47-1024x166.png?resize=1024%2C166&#038;ssl=1\" alt=\"K-Means Output\" class=\"wp-image-603426\"><figcaption class=\"wp-element-caption\">K-Means Output<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">4. Vectorizer<\/h3>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"607\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-73-1024x607.png?resize=1024%2C607&#038;ssl=1\" alt=\"vectorizer\" class=\"wp-image-603547\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">Previous modules serve the role of grouping documents into semantically similar clusters, and starting from this module the main focus is to fine-tune the topics by choosing more representative and meaningful keywords. BERTopic offers various Vectorizer options from the basic <code>CountVectorizer<\/code> to more advanced <code>OnlineCountVectorizer<\/code> which incrementally update topic representations. For this exercise, we will experiment on <code>CountVectorizer<\/code>, a text processing tool that creates a matrix of token counts out of a collection of documents. 
Each row in the matrix represents a document and each column represents a term from the vocabulary, with the values showing how many times each term appears in each document. This matrix representation enables machine learning algorithms to process the text data mathematically.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Vectorizer Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">We will go through a few important parameters of the\u00a0<code>CountVectorizer<\/code>\u00a0and see how they might affect the topic representations.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>ngram_range<\/code> specifies how many words to combine together into topic phrases. It is particularly useful for documents consisting of short phrases, so it is not needed in this situation.<br \/>Example output if we set <code>ngram_range=(1, 3)<\/code>:\n<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-\">0                -1_apple nasdaq aapl_apple stock_apple nasdaq_nasdaq aapl   \n1  0_apple warren buffett_apple stock_berkshire hathaway_apple nasdaq aapl   \n2           1_apple nasdaq aapl_nasdaq aapl apple_apple stock_apple nasdaq   \n3              2_apple aapl stock_apple nasdaq aapl_apple stock_aapl stock   \n4           3_apple nasdaq aapl_cramer apple aapl_apple nasdaq_apple stock <\/code><\/pre>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>stop_words<\/code> determines whether stop words are removed from the topics; removing them significantly improves topic representations.<\/li>\n<li class=\"wp-block-list-item\">\n<code>min_df<\/code> and <code>max_df<\/code> determine the frequency thresholds for terms to be included in the vocabulary. 
<code>min_df<\/code> sets the minimum number of documents a term must appear in, while <code>max_df<\/code> sets the maximum document frequency above which terms are considered too common and discarded.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">We explore the effect of adding <code>CountVectorizer<\/code> with <code>max_df=0.8<\/code> (i.e. ignore words appearing in more than 80% of the documents) to both the HDBSCAN and K-Means models from the previous step. <\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sklearn.feature_extraction.text import CountVectorizer\nvectorizer_model = CountVectorizer(\n    max_df=0.8, \n    stop_words=\"english\"\n)\n\ntopic_model = BERTopic(\n    embedding_model=emb_bge,\n    umap_model=umap_model,\n    hdbscan_model=clustering_model, \n    vectorizer_model=vectorizer_model\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Both show improvements after introducing the <code>CountVectorizer<\/code>, which significantly reduces keywords that appear in nearly all documents and add little value, such as \u201caapl\u201d, \u201cstock\u201d, and \u201capple\u201d.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"149\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-49-1024x149.png?resize=1024%2C149&#038;ssl=1\" alt=\"HDBSCAN Output with Vectorizer\" class=\"wp-image-603431\"><figcaption class=\"wp-element-caption\">HDBSCAN Output with Vectorizer<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"148\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-50-1024x148.png?resize=1024%2C148&#038;ssl=1\" alt=\"K-Means Output with Vectorizer\" class=\"wp-image-603432\"><figcaption class=\"wp-element-caption\">K-Means Output with 
Vectorizer<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">5. c-TF-IDF<\/h3>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"559\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-74-1024x559.png?resize=1024%2C559&#038;ssl=1\" alt=\"c-TF-IDF\" class=\"wp-image-603549\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">While the Vectorizer module focuses on adjusting the topic representation at the document level, c-TF-IDF mainly looks at the cluster level to down-weight words that occur frequently across clusters. This is achieved by treating all documents belonging to one cluster as a single document and calculating keyword importance with the traditional TF-IDF approach.<\/p>\n<p class=\"wp-block-paragraph\"><strong>c-TF-IDF Experimentation<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>reduce_frequent_words<\/code>: determines whether to down-weight frequently occurring words across topics<\/li>\n<li class=\"wp-block-list-item\">\n<code>bm25_weighting<\/code>: when set to True, uses BM25 weighting instead of standard TF-IDF, which can help better handle document length variations. 
In smaller datasets, this variant can be more robust to stop words.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">We use the following code snippet to add c-TF-IDF (with <code>bm25_weighting=True<\/code>) into our BERTopic pipeline.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from bertopic.vectorizers import ClassTfidfTransformer\n\nctfidf_model = ClassTfidfTransformer(bm25_weighting=True)\ntopic_model = BERTopic(\n    embedding_model=emb_bge,\n    umap_model=umap_model,\n    hdbscan_model=clustering_model, \n    vectorizer_model=vectorizer_model,\n    ctfidf_model=ctfidf_model\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The topic cluster outputs below show that adding c-TF-IDF has no major impact on the end results when <code>CountVectorizer<\/code> has already been added. This is potentially because our <code>CountVectorizer<\/code> has already set a high bar by eliminating words appearing in more than 80% of the documents. Consequently, this already reduces overlapping vocabulary at the topic cluster level, which is what c-TF-IDF is intended to achieve.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"90\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-51-1024x90.png?resize=1024%2C90&#038;ssl=1\" alt=\"\" class=\"wp-image-603433\"><figcaption class=\"wp-element-caption\">HDBSCAN Output with Vectorizer and c-TF-IDF<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"148\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-52-1024x148.png?resize=1024%2C148&#038;ssl=1\" alt=\"\" class=\"wp-image-603434\"><figcaption class=\"wp-element-caption\">K-Means Output with Vectorizer and c-TF-IDF<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">However, if we 
replace <code>CountVectorizer<\/code> with c-TF-IDF, the result below shows slight improvements compared to using neither, but too many stop words remain, making the topic representations less valuable. Therefore, it appears that for the documents we are dealing with in this scenario, the c-TF-IDF module does not bring extra value.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"145\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-53-1024x145.png?resize=1024%2C145&#038;ssl=1\" alt=\"\" class=\"wp-image-603435\"><figcaption class=\"wp-element-caption\">HDBSCAN Output with c-TF-IDF only<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"282\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-54-1024x282.png?resize=1024%2C282&#038;ssl=1\" alt=\"\" class=\"wp-image-603436\"><figcaption class=\"wp-element-caption\">K-Means Output with c-TF-IDF only<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">6. Representation Model<\/h3>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-76.png?ssl=1\" alt=\"\" class=\"wp-image-603553\" style=\"width:425px\"><\/figure>\n<p class=\"wp-block-paragraph\">The last module is the representation model, which we observed to have a significant impact on tuning the topic representations. Instead of using a frequency-based approach like Vectorizer and c-TF-IDF, it leverages semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords. 
This can result in more semantically coherent topic representations and reduce the number of near-synonymous keywords. BERTopic also offers various customization options for representation models, including but not limited to the following:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<code>KeyBERTInspired<\/code>: employ the <a href=\"https:\/\/maartengr.github.io\/KeyBERT\/\">KeyBERT<\/a> technique to extract topic words based on semantic similarity.<\/li>\n<li class=\"wp-block-list-item\">\n<code>ZeroShotClassification<\/code>: make the most of open-source transformers in the <a href=\"https:\/\/huggingface.co\/models?pipeline_tag=zero-shot-classification&amp;sort=downloads\">Hugging Face\u00a0model hub<\/a> to assign labels to topics.<\/li>\n<li class=\"wp-block-list-item\">\n<code>MaximalMarginalRelevance<\/code>: reduce near-duplicate keywords in topics (e.g. stock and stocks).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>KeyBERTInspired Experimentation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">We found that KeyBERTInspired is a very cost-effective approach: it significantly improves the end result with only a few extra lines of code and without the need for extensive hyperparameter tuning.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from bertopic.representation import KeyBERTInspired\n\n# Rank candidate keywords by semantic similarity to the topic's documents\nrepresentation_model = KeyBERTInspired()\n\ntopic_model = BERTopic(\n    embedding_model=emb_bge,\n    umap_model=umap_model,\n    hdbscan_model=clustering_model,\n    vectorizer_model=vectorizer_model,\n    representation_model=representation_model\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After incorporating the <code>KeyBERTInspired<\/code> representation model, we now observe that both the HDBSCAN and K-Means models generate noticeably more coherent and valuable topics.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"83\" width=\"1024\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-68-1024x83.png?resize=1024%2C83&#038;ssl=1\" alt=\"HDBSCAN Output with KeyBERTInspired\" class=\"wp-image-603536\"><figcaption class=\"wp-element-caption\">HDBSCAN Output with KeyBERTInspired<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"141\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-69-1024x141.png?resize=1024%2C141&#038;ssl=1\" alt=\"K-Means Output with KeyBERTInspired\" class=\"wp-image-603537\"><figcaption class=\"wp-element-caption\">K-Means Output with KeyBERTInspired<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Take-Home Message<\/h2>\n<p class=\"wp-block-paragraph\">This article explores the BERTopic technique and its implementation for topic modeling, detailing its six key modules with practical examples that use Apple stock market news data to demonstrate each component\u2019s impact on the quality of topic representations.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Embeddings:<\/strong> use transformer-based embedding models to convert documents into numerical representations that capture semantic meaning and contextual relationships in text.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Dimensionality Reduction:<\/strong> employ UMAP or other dimensionality reduction techniques to reduce high-dimensional embeddings while preserving both local and global structure of the data.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Clustering:<\/strong> compare the HDBSCAN (density-based) and K-Means (centroid-based) clustering algorithms to group similar documents into coherent topics.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Vectorizers:<\/strong> use Count Vectorizer to create document-term 
matrices and refine topics using a statistical approach.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>c-TF-IDF:<\/strong> update topic representations by analyzing term frequency at the cluster level (topic class) and reduce the weight of words that are common across different topics.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Representation Model:<\/strong> refine topic keywords using semantic similarity, offering options like <code>KeyBERTInspired<\/code> and <code>MaximalMarginalRelevance<\/code> for better topic descriptions.<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/a-practical-guide-to-bertopic-for-transformer-based-topic-modeling\/\">A Practical Guide to BERTopic for Transformer-Based Topic Modeling<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Destin Gong<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/a-practical-guide-to-bertopic-for-transformer-based-topic-modeling\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Practical Guide to BERTopic for Transformer-Based Topic Modeling Topic modeling has a wide range of use cases in the natural language processing (NLP) domain, such as document tagging, survey analysis, and content organization. 
It falls under the realm of unsupervised learning technique, making it a very cost-effective technique that reduces the resources required to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,350,83,67,70,260,1288],"tags":[189,2603,1289],"class_list":["post-3661","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-clustering","category-data-science","category-deep-dives","category-machine-learning","category-nlp","category-topic-modeling","tag-based","tag-bertopic","tag-topic"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3661"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3661"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3661\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3661"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3661"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3661"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}