Hello, dear readers, and I wish you more than a beautiful day. In this article, I’ll share a little about the area of NLP that I enjoy the most. To write about word embedding, I researched many different sources. And I have to say, natural language processing is one of the topics I like to research most in the field of machine learning! Let’s make this beautiful process more fun together.
“Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.” (Wikipedia)
The statement above says that word embedding is the common name for a number of language modeling and feature learning techniques. Before that confuses you, let’s get into the details. You know, machines can’t understand words the way we do. Instead, they work with numerical values and vectors. Wherever numeric values are expected, categorical data such as words cannot be used as it is. Do you think we’re any closer to what I want to say?
In this article, I’ll talk about the two methods I use most often. During my master’s degree, I had the opportunity to use one-hot encoding and word2vec methods in projects I worked on. For this reason, I want to tell you about these two methods in the most understandable way.
1. Frequency-Based Word Embedding Method
As I mentioned above, machine learning algorithms do not work directly on categorical data. For this reason, categorical data must be converted to numeric values. To be a little more specific about these numeric values, we are talking about binary values.
How Does It Work?
First, each categorical value that exists in the text is matched to an integer value. This match is then expressed as a vector. Initially, every element of the vector is set to 0. While the text is examined, the element that corresponds to the value actually present is replaced by 1.
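The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function name `one_hot_encode` and the sample sentence are my own assumptions, and I sort the vocabulary alphabetically simply to get a stable order.

```python
def one_hot_encode(tokens):
    """Map each distinct token to an integer index, then to a 0/1 vector."""
    # Step 1: match every distinct token to an integer index.
    # Sorting is an arbitrary choice that keeps the order stable.
    vocabulary = sorted(set(tokens))
    index_of = {token: i for i, token in enumerate(vocabulary)}

    vectors = []
    for token in tokens:
        # Step 2: start from a vector of zeros, one slot per vocabulary entry.
        vector = [0] * len(vocabulary)
        # Step 3: replace the slot that belongs to this token with 1.
        vector[index_of[token]] = 1
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical sample text, already split into words.
vocabulary, vectors = one_hot_encode(["the", "cat", "sat", "on", "the", "mat"])

for token, vector in zip(["the", "cat", "sat", "on", "the", "mat"], vectors):
    print(token, vector)
```

Every vector has exactly one 1, and its length equals the number of distinct words, which is why these vectors grow quickly as the vocabulary grows.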
⬆️ The above table contains product names, values and prices.
One Hot Encoding
If you look closely, the table contains only the values 0 and 1. For example, the value in the row where chocolate is present is 1: chocolate has a value of 1, while pasta and detergent have a value of 0. The other products can be interpreted in the same way.
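To make the table idea concrete, here is a small sketch that turns a product column into one 0/1 column per product. The product names come from the example above, but the rows, the prices, and the helper name `expand_one_hot` are made-up placeholders of mine, not the values from the original table.

```python
# Hypothetical rows: the prices are placeholders, not real data.
rows = [
    {"product": "chocolate", "price": 10},
    {"product": "pasta", "price": 5},
    {"product": "detergent", "price": 20},
    {"product": "chocolate", "price": 12},
]


def expand_one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column (here: price) unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category: 1 if it matches, else 0.
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


encoded = expand_one_hot(rows, "product")
for row in encoded:
    print(row)
```

In the first row, the chocolate column is 1 while the pasta and detergent columns are 0, which is exactly the pattern described above.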
With the methods mentioned above, a matrix is usually created according to whether or not groups of words appear in a document or text.
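A matrix like this could be built roughly as follows. Please treat it as a sketch under my own assumptions: the two sample documents and the function name `binary_term_matrix` are hypothetical, words are split on whitespace only, and I work with single words rather than longer word groups to keep the example short.

```python
def binary_term_matrix(documents):
    """Build a 0/1 matrix: one row per document, one column per word."""
    # Collect every distinct lowercase word across all documents.
    vocabulary = sorted(
        {word for doc in documents for word in doc.lower().split()}
    )
    matrix = []
    for doc in documents:
        words = set(doc.lower().split())
        # 1 if the word occurs in this document, 0 otherwise.
        matrix.append([1 if term in words else 0 for term in vocabulary])
    return vocabulary, matrix


# Hypothetical sample documents.
documents = ["I bought chocolate", "I bought pasta and detergent"]
vocabulary, matrix = binary_term_matrix(documents)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row only records whether a word is present, not how many times it occurs.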
We have finished the first embedding method without exhausting you. I’m looking forward to my next post, which will cover the TF/IDF and word2vec methods. What about you?
- Karabuk University, Natural Language Processing, Oguz Findik.
- Istanbul Medeniyet University, Department of Business, Sadi Evren Seker, Natural Language Processing.