Understanding vectorization
Introduction
I needed to deeply understand vectorization because it’s a fundamental concept in machine learning and natural language processing (NLP): it converts text data into numerical vectors that algorithms can process. To explain how vectorization works, I wanted to use a simple example involving a corpus of phrases about Pokémon (yeah, why not?).
Create a corpus
First, we’ll define a small corpus of phrases related to Pokémon:
- “Pikachu is an electric type.”
- “Charizard is a fire and flying type.”
- “Bulbasaur is a grass and poison type.”
- “Squirtle is a water type.”
- “Jigglypuff is a fairy type.”
Tokenization
Tokenization is the process of breaking down text into individual words or tokens. For our corpus, the tokenized phrases are:
- [“Pikachu”, “is”, “an”, “electric”, “type”]
- [“Charizard”, “is”, “a”, “fire”, “and”, “flying”, “type”]
- [“Bulbasaur”, “is”, “a”, “grass”, “and”, “poison”, “type”]
- [“Squirtle”, “is”, “a”, “water”, “type”]
- [“Jigglypuff”, “is”, “a”, “fairy”, “type”]
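To make this concrete, here is a minimal Python sketch of the tokenization step. The `tokenize` helper is purely illustrative (a regex split that drops punctuation); real tokenizers such as those in NLTK or spaCy handle many more edge cases.

```python
import re

corpus = [
    "Pikachu is an electric type.",
    "Charizard is a fire and flying type.",
    "Bulbasaur is a grass and poison type.",
    "Squirtle is a water type.",
    "Jigglypuff is a fairy type.",
]

def tokenize(phrase):
    # Keep runs of word characters, dropping punctuation like the trailing period
    return re.findall(r"\w+", phrase)

tokenized = [tokenize(phrase) for phrase in corpus]
print(tokenized[0])  # ['Pikachu', 'is', 'an', 'electric', 'type']
```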
Building the vocabulary
The next step is to build a vocabulary from the tokenized phrases. The vocabulary is a set of unique words in the corpus. For our example, the vocabulary is:
[“Pikachu”, “is”, “an”, “electric”, “type”, “Charizard”, “a”, “fire”, “and”, “flying”, “Bulbasaur”, “grass”, “poison”, “Squirtle”, “water”, “Jigglypuff”, “fairy”]
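Continuing the sketch, building the vocabulary is just collecting the unique tokens. A plain Python `set` would lose the ordering shown above, so this illustrative `build_vocabulary` helper keeps first-occurrence order instead:

```python
def build_vocabulary(tokenized_corpus):
    vocabulary = []
    seen = set()
    for tokens in tokenized_corpus:
        for token in tokens:
            if token not in seen:   # keep only the first occurrence of each word
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

vocabulary = build_vocabulary(tokenized)
print(vocabulary)
# ['Pikachu', 'is', 'an', 'electric', 'type', 'Charizard', 'a', 'fire', 'and',
#  'flying', 'Bulbasaur', 'grass', 'poison', 'Squirtle', 'water', 'Jigglypuff', 'fairy']
```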
Vectorization methods
There are several methods for vectorizing text. We will discuss two common ones: Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
Bag-of-Words (BoW)
In the BoW model, each phrase is represented as a vector with a length equal to the vocabulary size. Each position in the vector corresponds to a word in the vocabulary, and the value at each position is the count of the word’s occurrences in the phrase.
Let’s create the BoW vectors for our corpus:
“Pikachu is an electric type.”
- Vector: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
“Charizard is a fire and flying type.”
- Vector: [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
“Bulbasaur is a grass and poison type.”
- Vector: [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]
“Squirtle is a water type.”
- Vector: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
“Jigglypuff is a fairy type.”
- Vector: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
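Here is how those vectors can be reproduced in the running sketch. (scikit-learn’s `CountVectorizer` does the same job in practice, but it lowercases the text and orders the vocabulary differently, so its output wouldn’t line up with the hand-built vectors above.)

```python
def bow_vector(tokens, vocabulary):
    # Count how many times each vocabulary word occurs in the phrase
    return [tokens.count(word) for word in vocabulary]

bow_vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]
print(bow_vectors[0])  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```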
Term Frequency-Inverse Document Frequency (TF-IDF)
Imagine you have a collection of articles about Pokémon. Common words like “Pokémon” and “type” might appear frequently across all articles, while specific Pokémon names like “Pikachu” or “Charizard” might appear only in particular articles. TF-IDF weighs each word by how often it appears in a document (term frequency) and penalizes words that appear in many documents (inverse document frequency), so it highlights which Pokémon names are most relevant to each article while down-weighting the common words.
For instance, in an article mostly about “Pikachu,” the word “Pikachu” will have a high TF-IDF score compared to its score in other articles where it might be mentioned less frequently. Conversely, the word “Pokémon” will have a lower TF-IDF score because it appears in every article and is not specific to just one.
Using TF-IDF, you can effectively highlight the unique terms that best describe each document within a larger set, making it a powerful tool for text analysis and information retrieval.
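You rarely compute TF-IDF by hand; a minimal sketch with scikit-learn’s `TfidfVectorizer` (assuming scikit-learn is installed; note that it lowercases the text and uses a smoothed IDF formula, so the exact weights depend on its defaults) looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # shape: (5 documents, vocabulary size)

# Words that appear in every phrase ("is", "type") get the lowest IDF and thus small
# weights, while a name like "pikachu" gets the highest weight in its own phrase.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```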
Query and Relevance
Let’s assume the query is: “electric type”
Tokenize the query:
- [“electric”, “type”]
Create a query vector:
- The query vector is built against the same vocabulary as the documents:
- Query Vector: [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
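In the running sketch, the query is vectorized exactly like a document, against the same vocabulary:

```python
query = "electric type"
query_vector = bow_vector(tokenize(query), vocabulary)
print(query_vector)  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```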
Calculate Relevance
To find the most relevant documents, we can use the dot product of the query vector with each document vector. The dot product will give a score indicating how many words from the query appear in each document.
Dot product calculation (query vector × document vector, position by position):
Document 1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- Score: (0×1) + (0×1) + (0×1) + (1×1) + (1×1) + … = 1 + 1 = 2
Document 2: [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
- Score: (0×0) + (0×1) + (0×0) + (1×0) + (1×1) + … = 1
Document 3: [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]
- Score: (0×0) + (0×1) + (0×0) + (1×0) + (1×1) + … = 1
Document 4: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
- Score: (0×0) + (0×1) + (0×0) + (1×0) + (1×1) + … = 1
Document 5: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
- Score: (0×0) + (0×1) + (0×0) + (1×0) + (1×1) + … = 1
Retrieve the most relevant documents
Based on the scores, the most relevant documents for the query “electric type” are:
- Document 1: “Pikachu is an electric type.” (Score: 2)
- Documents 2, 3, 4, and 5 all have a score of 1, because they match the query only on the word “type”; any of them can be returned next, depending on further context or criteria.
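Ranking the documents is then just a sort by score, for example:

```python
ranking = sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True)
for phrase, score in ranking:
    print(score, phrase)
# 2 Pikachu is an electric type.
# 1 Charizard is a fire and flying type.
# ...
```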
Conclusion
Vectorization transforms text into numerical representations that machine learning algorithms can efficiently process. By using a simple Pokémon corpus, we’ve demonstrated the process of tokenization, vocabulary building, and vectorization using BoW and TF-IDF. Understanding these concepts is crucial for tasks like text classification, clustering, and similarity measurement in NLP.
Hope it helps, see ya 🤟