Tuning BERTopic aka "depends on your use case" #2473
mehdigreefhorst
The first time I worked with BERTopic was more than two years ago, during my master's in Data Science at the Jheronimus Academy of Data Science (JADS). I thought it would be nice to start a discussion around a topic that many others and I struggle with: how do you tune BERTopic?
Tuning BERTopic's clustering is a tedious process, and there is no single set of "winning" hyperparameters; it depends on the use case. You change one variable, look at the output clusters, and try again. If you're lucky, you remember that fixing the random seed matters, set it, and then try to converge on the "correct" parameters. You could optimize for metrics such as coherence or diversity, but those don't really work on their own, because coherence doesn't tell the whole story: you might end up with very few clusters and still get a relatively high coherence score. The conclusion: you have to look at the clusters every time you run BERTopic.
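For anyone reading along, the random-seed point concretely concerns UMAP, which is the stochastic step. A minimal sketch of fixing it (the parameter values are illustrative, not recommendations, and `docs` is assumed to be your list of documents):

```python
# Sketch: fix the stochastic UMAP step so reruns with the same parameters
# produce the same topics. Values below are illustrative defaults.
from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,  # fixing this seed is what makes reruns reproducible
)

topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)  # docs: your list of documents
```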
I came up with an analogy between BERTopic and billiard trick shots. Imagine the following: BERTopic consists of six steps that all depend on each other, and some are stochastic by nature. The output can differ enormously even when you change something only a little. Once, for one of my data science courses at JADS, I thought I was performing hyperparameter tuning with a fixed random seed. In reality, I reran BERTopic 250 times with the same parameters but without fixing the seed. The differences were immense: the number of topics ranged from 50 to 150 clusters, even though the only thing that changed was the random seed.
So why the billiard trick-shot analogy? Imagine you put six billiard balls in a row, where each ball is a step in BERTopic. Each ball in the trick shot depends heavily on the previous ball that was hit. To pull off the shot properly, you must watch each ball before it hits the next one in the row. If you only look at the final ball (the topics), you will only land the trick shot by being lucky. It's similar with BERTopic: if you only look at the final set of clusters, the process feels like working with some sort of magical system, until at some point you look at the topics and think, "hmm, this looks super interesting." That is what I call being lucky.
BERTopic's parameters do have an impact, but it is very hard to see the effect of each individual change, because it is like a billiard trick shot: a small change to the first ball completely changes the resulting topics. And since you're changing something in step 2 or 3 of the pipeline while only looking at the final output, it becomes very hard to see WHY BERTopic does what it does.
This brings me to my hypothesis: if we manage each step of BERTopic carefully and see how each step influences the next, could we then guarantee a good set of topics? For example, we could hyperparameter-tune the UMAP step on its own, evaluating the reduced embeddings visually and measuring how well they reflect the structure of the original space at each iteration. Once we determine that a step is done properly, we move on to the next step in BERTopic. A rough sketch of what I mean is shown below.
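As one possible way to make this per-step evaluation concrete (a sketch under my own assumptions: `embeddings` is an array of document embeddings, and trustworthiness is just one proxy for how well the reduction preserves local structure, not something BERTopic itself prescribes):

```python
# Sketch: tune the UMAP step in isolation, before clustering.
# Assumes `embeddings` is an (n_docs, dim) array of document embeddings.
# Trustworthiness is one possible proxy for how well the low-dimensional
# space preserves local neighborhoods; visual inspection of a 2D projection
# remains essential.
from umap import UMAP
from sklearn.manifold import trustworthiness

for n_neighbors in (5, 15, 50):
    umap_model = UMAP(
        n_neighbors=n_neighbors,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    reduced = umap_model.fit_transform(embeddings)
    score = trustworthiness(embeddings, reduced, n_neighbors=n_neighbors, metric="cosine")
    print(f"n_neighbors={n_neighbors}: trustworthiness={score:.3f}")
```

Once the reduced space looks reasonable, the same seeded UMAP model can be passed to BERTopic via `umap_model`, and the next step (the clustering) can then be tuned on top of it.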
Curious to hear your point of view, Maarten, regarding hyperparameter tuning of BERTopic. Do you agree or disagree with my story? You've said that the parameters depend a lot on the specific case, but what does your hyperparameter tuning process look like?