Extracting Synonyms (Similar Words) From Text Using BERT & NMSLIB 🔥
An approach to extracting similar words (synonyms) from multiple rows of text using BERT & NMSLIB.

We will begin by tokenizing the text into words, since we want single-word outputs. Then we will use BERT (via sentence transformers) to embed the most common words, and NMSLIB to find the closest matches to each of them. We will use a data set of tweets from Twitter and look for similar words within it.
NOTE: In this article, we are looking for similar words/synonyms across the whole data set. Hence, we will take all the rows, extract the most common words that are nouns, and work on them as a single pool; the rows themselves play no further role. Also, the resulting words won't necessarily be perfectly replaceable synonyms but, rather, similar words that might or might not be directly interchangeable in a sentence. For example, we will end up with pairs like “excellence” & “quality” and “soundcloud” & “spotify”.
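To make the overall idea concrete before touching the real libraries, here is a minimal, standard-library-only sketch of the two core steps: picking the most common words across all rows, and finding each word's closest neighbour by cosine similarity. It is only an illustration under loud assumptions: the function names are hypothetical, the 2-d vectors are made up by hand (in the actual pipeline they would come from BERT), and the brute-force search merely stands in for the approximate nearest-neighbour index that NMSLIB provides.

```python
import math
from collections import Counter


def most_common_words(rows, top_n):
    """Pool all rows together and return the top_n most frequent words."""
    counts = Counter(word for row in rows for word in row.split())
    return [word for word, _ in counts.most_common(top_n)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, embeddings, k=1):
    """Brute-force k nearest neighbours of `word` (stand-in for an NMSLIB index)."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy rows, already cleaned (illustrative only, not from the tweets data set).
rows = ["music spotify playlist", "music soundcloud playlist", "music quality"]

# Hand-made 2-d vectors used purely for illustration; real ones would be BERT embeddings.
toy_embeddings = {
    "spotify": [0.9, 0.1],
    "soundcloud": [0.8, 0.2],
    "quality": [0.1, 0.9],
    "excellence": [0.2, 0.8],
}

print(most_common_words(rows, 2))
print(nearest_words("spotify", toy_embeddings))
```

With real embeddings the dictionary would hold one high-dimensional vector per frequent word, and the brute-force loop would be replaced by an index query, which matters once the vocabulary grows beyond a few thousand words.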
Cleaning The Tweets
We start by cleaning up the data: I remove stopwords and numbers and lowercase the text.
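A minimal sketch of that cleaning step, using only the standard library, might look like the following. The `clean_tweet` helper and the tiny `STOPWORDS` set are assumptions made for illustration; in practice a fuller stopword list (for example the one shipped with NLTK) would likely be used, and the exact cleaning rules may differ.

```python
import re

# Tiny illustrative stopword set (an assumption); swap in a full list for real data.
STOPWORDS = {"a", "an", "and", "are", "i", "in", "is", "it", "of", "on", "the", "to"}


def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase the text, strip digit runs, and drop stopwords."""
    text = text.lower()
    # Replace every run of digits with a space so numbers disappear.
    # Note: this also removes digits embedded inside tokens.
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    return " ".join(token for token in tokens if token not in stopwords)


print(clean_tweet("The 3 Best Songs of 2020 are on Spotify"))
```

Lowercasing happens first so that the stopword lookup, which uses lowercase entries, matches words regardless of how they were capitalised in the tweet.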