{"id":1169,"date":"2025-01-14T07:02:38","date_gmt":"2025-01-14T07:02:38","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/14\/contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648\/"},"modified":"2025-01-14T07:02:38","modified_gmt":"2025-01-14T07:02:38","slug":"contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/14\/contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648\/","title":{"rendered":"Contextual Topic Modelling in Chinese Corpora with KeyNMF"},"content":{"rendered":"<div>\n<h4>A comprehensive guide on getting the most out of your Chinese topic models, from preprocessing to interpretation.<\/h4>\n<p>With our <a href=\"https:\/\/arxiv.org\/abs\/2410.12791\">recent paper<\/a> on discourse dynamics in European Chinese diaspora media, our team has tapped into an almost unanimous frustration with the quality of topic modelling approaches when applied to Chinese data. 
In this article, I will introduce you to our novel topic modelling method, KeyNMF, and show you how to apply it most effectively to Chinese textual\u00a0data.<\/p>\n<h3>Topic Modelling with Matrix Factorization<\/h3>\n<p>Before diving into practicalities, I would like to give you a brief introduction to topic modelling theory, and motivate the advancements introduced in our\u00a0paper.<\/p>\n<p>Topic modelling is a subfield of Natural Language Processing concerned with uncovering latent topical information in textual corpora in an unsupervised manner. This information is then presented to the user in a human-interpretable way (usually as the 10 highest-ranking keywords for each\u00a0topic).<\/p>\n<p>There are many ways to formalize this task in mathematical terms, but one rather popular conceptualization of topic discovery is matrix factorization. This is a rather natural and intuitive way to tackle the problem, and in a minute, you will see why. The primary insight behind topic modelling as matrix factorization is the following: words that frequently occur together are likely to belong to the same latent structure. In other words, terms whose occurrences are highly correlated are part of the same\u00a0topic.<\/p>\n<p>You can discover topics in a corpus by first constructing a bag-of-words matrix of the documents. A bag-of-words matrix represents documents in the following way: each row corresponds to a document, while each column corresponds to a unique word from the model\u2019s vocabulary. 
The values in the matrix are then the number of times a word occurs in a given document.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AKrks8ObBLUyeiBAxpJuubg.png?ssl=1\"><figcaption>Schematic Overview of Non-negative Matrix Factorization<\/figcaption><\/figure>\n<p>This matrix can be approximately decomposed into the product of a <em>document-topic matrix<\/em>, which indicates how important a given topic is for a given document, and a <em>topic-term matrix<\/em>, which indicates how important a word is for a given topic. A method for this decomposition is Non-negative Matrix Factorization (NMF), where we decompose a non-negative matrix into two other non-negative matrices, instead of allowing arbitrary signed\u00a0values.<\/p>\n<p>NMF is not the only method one can use for decomposing the bag-of-words matrix. A method of high historical significance, Latent Semantic Analysis, utilizes Truncated Singular-Value Decomposition for this purpose. NMF, however, is generally a better choice,\u00a0as:<\/p>\n<ol>\n<li>The latent factors it discovers differ in kind from those found by other decomposition methods. NMF typically discovers <strong>localized patterns<\/strong> or <strong>parts<\/strong> in the data, which are easier to interpret.<\/li>\n<li>\n<strong>Non-negative<\/strong> topic-term and document-topic relations are easier to interpret than signed\u00a0ones.<\/li>\n<\/ol>\n<p>Using NMF with just BoW matrices, however attractive and simple it may be, does come with its drawbacks:<\/p>\n<ol>\n<li>NMF typically minimizes the Frobenius norm of the error matrix. This entails an <strong>assumption of Gaussianity<\/strong> of the outcome variable, which is obviously false, as we are modelling word\u00a0counts.<\/li>\n<li>BoW representations are <strong>just word counts<\/strong>. 
This means that words won\u2019t be interpreted in context, and syntactical information will be\u00a0ignored.<\/li>\n<\/ol>\n<h3>KeyNMF<\/h3>\n<p>With the help of transformer-based language representations, we can address these limitations and significantly improve NMF for our purposes.<\/p>\n<p>The key intuition behind KeyNMF is that most words in a document are <em>semantically insignificant<\/em>, and we can get an overview of topical information in the document by highlighting the top N most relevant terms. We will select these terms by using <strong>contextual embeddings<\/strong> from sentence-transformer models.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/0%2AuHugo1J8uKwnQnhP.png?ssl=1\"><figcaption>A Schematic Overview of the KeyNMF\u00a0Model<\/figcaption><\/figure>\n<p>The KeyNMF algorithm consists of the following steps:<\/p>\n<ol>\n<li>Embed each document using a sentence-transformer, along with all words in the document.<\/li>\n<li>Calculate cosine similarities of word embeddings to document embeddings.<\/li>\n<li>For each document, keep the N words with the highest cosine similarity to the document, discarding any words whose similarity is not positive.<\/li>\n<li>Arrange cosine similarities into a <strong>keyword matrix<\/strong>, where each row is a document, each column is a keyword, and values are cosine similarities of the word to the document.<\/li>\n<li>Decompose the keyword matrix with\u00a0NMF.<\/li>\n<\/ol>\n<p>This formulation helps us in multiple ways. 
a) We substantially reduce the model\u2019s vocabulary and thus the number of parameters, which results in a faster and better model fit; b) we get a continuous distribution, which is a better fit for NMF\u2019s assumptions; and c) we incorporate contextual information into our topic\u00a0model.<\/p>\n<h3>Chinese Topic Modelling with\u00a0KeyNMF<\/h3>\n<p>Now that you understand how KeyNMF works, let\u2019s get our hands dirty and apply the model in a practical context.<\/p>\n<h4>Preparation and\u00a0Data<\/h4>\n<p>First, let\u2019s install the packages we are going to use in this demonstration:<\/p>\n<pre>pip install \"turftopic[jieba]\" datasets sentence_transformers topicwizard<\/pre>\n<p>Then let\u2019s get some openly available data. I chose to go with the <a href=\"https:\/\/huggingface.co\/datasets\/Davlan\/sib200\">SIB200<\/a> corpus, as it is freely available under the CC-BY-SA 4.0 open license. This piece of code will fetch us the\u00a0corpus.<\/p>\n<pre>from datasets import load_dataset<br><br># Load the Simplified Chinese (zho_Hans) subset, with all splits combined<br>ds = load_dataset(\"Davlan\/sib200\", \"zho_Hans\", split=\"all\")<br># Keep only the raw texts; the labels are not needed for topic modelling<br>corpus = ds[\"text\"]<\/pre>\n<h4>Building a Chinese Topic\u00a0Model<\/h4>\n<p>There are a number of tricky aspects to applying language models to Chinese, since most of these systems are developed and tested on English data. When it comes to KeyNMF, there are two aspects that need to be taken into\u00a0account.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/0%2AiqCsY1VAXpM3kL60.png?ssl=1\"><figcaption>Elements of a Topic Modelling Pipeline in Turftopic<\/figcaption><\/figure>\n<p>Firstly, we will need to figure out how to tokenize texts in Chinese. Luckily, the <a href=\"https:\/\/github.com\/x-tabdeveloping\/turftopic\">Turftopic<\/a> library, which contains our implementation of KeyNMF (among other things), comes prepackaged with tokenization utilities for Chinese. 
Normally, you would use a CountVectorizer object from sklearn to extract words from text. We added a ChineseCountVectorizer object that uses the Jieba tokenizer in the background, and comes with an optional Chinese stop word\u00a0list.<\/p>\n<pre>from turftopic.vectorizers.chinese import ChineseCountVectorizer<br><br>vectorizer = ChineseCountVectorizer(stop_words=\"chinese\")<\/pre>\n<p>Secondly, we will need an embedding model that supports Chinese for producing document and word representations. We will use the paraphrase-multilingual-MiniLM-L12-v2 model for this, as it is quite compact and fast, and was specifically trained to be used in multilingual retrieval contexts.<\/p>\n<pre>from sentence_transformers import SentenceTransformer<br><br>encoder = SentenceTransformer(\"paraphrase-multilingual-MiniLM-L12-v2\")<\/pre>\n<p>We can then build a fully Chinese KeyNMF model! I will initialize a model with 20 topics and N=25 (a maximum of 25 keywords will be extracted for each document).<\/p>\n<pre>from turftopic import KeyNMF<br><br>model = KeyNMF(<br>    n_components=20,<br>    top_n=25,<br>    vectorizer=vectorizer,<br>    encoder=encoder,<br>    random_state=42, # Setting seed so that our results are reproducible<br>)<\/pre>\n<p>We can then fit the model to the corpus and see what results we\u00a0get!<\/p>\n<pre>document_topic_matrix = 
model.fit_transform(corpus)<br>model.print_topics()<\/pre>\n<pre>\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513<br>\u2503 Topic ID \u2503 Highest Ranking                                                                              \u2503<br>\u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529<br>\u2502        0 \u2502 \u65c5\u884c, \u975e\u6d32, \u5f92\u6b65\u65c5\u884c, \u6f2b\u6b65, \u6d3b\u52a8, \u901a\u5e38, \u53d1\u5c55\u4e2d\u56fd\u5bb6, \u8fdb\u884c, \u8fdc\u8db3, \u5f92\u6b65                         
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        1 \u2502 \u6ed1\u96ea, \u6d3b\u52a8, \u6ed1\u96ea\u677f, \u6ed1\u96ea\u8fd0\u52a8, \u96ea\u677f, \u767d\u96ea, \u5730\u5f62, \u9ad8\u5c71, \u65c5\u6e38, \u6ed1\u96ea\u8005                           \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        2 \u2502 \u4f1a, \u53ef\u80fd, \u4ed6\u4eec, \u5730\u7403, \u5f71\u54cd, \u5317\u52a0\u5dde, \u5e76, \u5b83\u4eec, \u5230\u8fbe, \u8239                                       
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        3 \u2502 \u6bd4\u8d5b, \u9009\u624b, \u9526\u6807\u8d5b, \u5927\u56de\u8f6c, \u8d85\u7ea7, \u7537\u5b50, \u6210\u7ee9, \u83b7\u80dc, \u963f\u6839\u5ef7, \u83b7\u5f97                             \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        4 \u2502 \u822a\u7a7a\u516c\u53f8, \u822a\u73ed, \u65c5\u5ba2, \u98de\u673a, \u52a0\u62ff\u5927\u822a\u7a7a\u516c\u53f8, \u673a\u573a, \u8fbe\u7f8e\u822a\u7a7a\u516c\u53f8, \u7968\u4ef7, \u5fb7\u56fd\u6c49\u838e\u822a\u7a7a\u516c\u53f8, \u884c\u674e 
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        5 \u2502 \u539f\u5b50\u6838, \u8d28\u5b50, \u80fd\u91cf, \u7535\u5b50, \u6c22\u539f\u5b50, \u6709\u70b9\u50cf, \u539f\u5b50\u5f39, \u6c22\u79bb\u5b50, \u884c\u661f, \u7c92\u5b50                         \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        6 \u2502 \u75be\u75c5, \u4f20\u67d3\u75c5, \u75ab\u60c5, \u7ec6\u83cc, \u7814\u7a76, \u75c5\u6bd2, \u75c5\u539f\u4f53, \u868a\u5b50, \u611f\u67d3\u8005, \u771f\u83cc                             
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        7 \u2502 \u7ec6\u80de, cella, \u5c0f\u623f\u95f4, cell, \u751f\u7269\u4f53, \u663e\u5fae\u955c, \u5355\u4f4d, \u751f\u7269, \u6700\u5c0f, \u79d1\u5b66\u5bb6                          \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        8 \u2502 \u536b\u661f, \u671b\u8fdc\u955c, \u592a\u7a7a, \u706b\u7bad, \u5730\u7403, \u98de\u673a, \u79d1\u5b66\u5bb6, \u536b\u661f\u7535\u8bdd, \u7535\u8bdd, \u5de8\u578b                           
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502        9 \u2502 \u732b\u79d1\u52a8\u7269, \u52a8\u7269, \u730e\u7269, \u72ee\u5b50, \u72ee\u7fa4, \u556e\u9f7f\u52a8\u7269, \u9e1f\u7c7b, \u72fc\u7fa4, \u884c\u4e3a, \u5403                             \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       10 \u2502 \u611f\u67d3, \u79bd\u6d41\u611f, \u533b\u9662, \u75c5\u6bd2, \u9e1f\u7c7b, \u571f\u8033\u5176, \u75c5\u4eba, h5n1, \u5bb6\u79bd, \u533b\u62a4\u4eba\u5458                           
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       11 \u2502 \u6297\u8bae, \u9152\u5e97, \u767d\u5385, \u6297\u8bae\u8005, \u4eba\u7fa4, \u8b66\u5bdf, \u4fdd\u5b88\u515a, \u5e7f\u573a, \u59d4\u5458\u4f1a, \u653f\u5e9c                             \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       12 \u2502 \u65c5\u884c\u8005, \u6587\u5316, \u8010\u5fc3, \u56fd\u5bb6, \u76ee\u7684\u5730, \u9002\u5e94, \u4eba\u4eec, \u6c34, \u65c5\u884c\u793e, \u56fd\u5916                               
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       13 \u2502 \u901f\u5ea6, \u82f1\u91cc, \u534a\u82f1\u91cc, \u8dd1\u6b65, \u516c\u91cc, \u8dd1, \u8010\u529b, \u6708\u7403, \u53d8\u7126\u955c\u5934, \u955c\u5934                               \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       14 \u2502 \u539f\u5b50, \u7269\u8d28, \u5149\u5b50, \u5fae\u5c0f, \u7c92\u5b50, \u5b87\u5b99, \u8f90\u5c04, \u7ec4\u6210, \u4ebf, \u800c\u5149                                     
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       15 \u2502 \u6e38\u5ba2, \u5bf9, \u5730\u533a, \u81ea\u7136, \u5730\u65b9, \u65c5\u6e38, \u65f6\u95f4, \u975e\u6d32, \u5f00\u8f66, \u5546\u5e97                                     \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       16 \u2502 \u4e92\u8054\u7f51, \u7f51\u7ad9, \u8282\u76ee, \u5927\u4f17\u4f20\u64ad, \u7535\u53f0, \u4f20\u64ad, toginetradio, \u5e7f\u64ad\u5267, \u5e7f\u64ad, \u5185\u5bb9                   
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       17 \u2502 \u8fd0\u52a8, \u8fd0\u52a8\u5458, \u7f8e\u56fd, \u4f53\u64cd, \u534f\u4f1a, \u652f\u6301, \u5965\u59d4\u4f1a, \u5965\u8fd0\u4f1a, \u53d1\u73b0, \u5b89\u5168                             \u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       18 \u2502 \u706b\u8f66, metroplus, metro, metrorail, \u8f66\u53a2, \u5f00\u666e\u6566, \u901a\u52e4, \u7ed5\u57ce, \u57ce\u5185, \u4e09\u7b49\u8231                    
\u2502<br>\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524<br>\u2502       19 \u2502 \u6295\u7968, \u6295\u7968\u7bb1, \u4fe1\u5c01, \u9009\u6c11, \u6295\u7968\u8005, \u6cd5\u56fd, \u5019\u9009\u4eba, \u7b7e\u540d, \u900f\u660e, \u7bb1\u5185                             \u2502<br>\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<\/pre>\n<p>As you see, we\u2019ve already gained a sensible overview of what there is in our corpus! You can see that the topics are quite distinct, with some of them being concerned with scientific topics, such as astronomy (8), chemistry (5) or animal behaviour (9), while others were oriented at leisure (e.g. 
0, 1, 12), or politics (19,\u00a011).<\/p>\n<h4>Visualization<\/h4>\n<p>To help us understand the results further, we can use the topicwizard library to visually investigate the topic model\u2019s parameters.<\/p>\n<p>Since topicwizard uses wordclouds, we will need to tell the library to use a font that is compatible with Chinese. I took a font from the <a href=\"https:\/\/github.com\/shangjingbo1226\/ChineseWordCloud\">ChineseWordCloud<\/a> repo, which we will download and then pass to topicwizard.<\/p>\n<pre>import urllib.request<br>import topicwizard<br><br># Download a font that can render Chinese characters<br>urllib.request.urlretrieve(<br>    \"https:\/\/github.com\/shangjingbo1226\/ChineseWordCloud\/raw\/refs\/heads\/master\/fonts\/STFangSong.ttf\",<br>    \".\/STFangSong.ttf\",<br>)<br># Launch the app, using the downloaded font for the wordclouds<br>topicwizard.visualize(<br>    corpus=corpus, model=model, wordcloud_font_path=\".\/STFangSong.ttf\"<br>)<\/pre>\n<p>This will open the topicwizard web app in a notebook or in your browser, with which you can interactively investigate your topic\u00a0model:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/800\/0%2AZ8O_xN1Pxrri-8Do.gif?ssl=1\"><figcaption>Investigating the relations between topics, documents and words in your corpus using topicwizard<\/figcaption><\/figure>\n<h3>Conclusion<\/h3>\n<p>In this article, we\u2019ve looked at what KeyNMF is, how it works, what motivates it and how it can be used to discover high-quality topics in Chinese text, as well as how to visualize and interpret your results. I hope this tutorial will prove useful to those who are looking to explore Chinese textual\u00a0data.<\/p>\n<p>For further information on the models, and how to improve your results, I encourage you to check out our <a href=\"https:\/\/x-tabdeveloping.github.io\/turftopic\/\">Documentation<\/a>. 
If you have any questions or encounter issues, feel free to submit an <a href=\"https:\/\/github.com\/x-tabdeveloping\/turftopic\/issues\">issue on GitHub<\/a>, or reach out in the comments\u00a0:))<\/p>\n<p>All figures presented in the article were produced by the\u00a0author.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648\">Contextual Topic Modelling in Chinese Corpora with KeyNMF<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium.<\/p>\n<\/div>\n<p>M\u00e1rton Kardos<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcontextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Contextual Topic Modelling in Chinese Corpora with KeyNMF A comprehensive guide on getting the most out of your Chinese topic models, from preprocessing to interpretation. 
With our recent paper on discourse dynamics in European Chinese diaspora media, our team has tapped into an almost unanimous frustration with the quality of topic modelling approaches when applied [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1286,260,92,1288,1287],"tags":[419,1290,1289],"class_list":["post-1169","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-chinese","category-nlp","category-thoughts-and-theory","category-topic-modeling","category-transformers","tag-matrix","tag-modelling","tag-topic"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1169"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1169"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1169\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}