Understanding vectorization

Michaël Scherding
May 17, 2024


Introduction

I needed to understand vectorization deeply because it’s a fundamental concept in machine learning and natural language processing (NLP). It involves converting text data into numerical vectors that algorithms can process. To explain how vectorization works, I’ll use a simple example involving a corpus of phrases about Pokémon (yeah, why not?).

Create a corpus

First, we’ll define a small corpus of phrases related to Pokémon:

  1. “Pikachu is an electric type.”
  2. “Charizard is a fire and flying type.”
  3. “Bulbasaur is a grass and poison type.”
  4. “Squirtle is a water type.”
  5. “Jigglypuff is a fairy type.”

Tokenization

Tokenization is the process of breaking down text into individual words or tokens. For our corpus, the tokenized phrases are:

  1. [“Pikachu”, “is”, “an”, “electric”, “type”]
  2. [“Charizard”, “is”, “a”, “fire”, “and”, “flying”, “type”]
  3. [“Bulbasaur”, “is”, “a”, “grass”, “and”, “poison”, “type”]
  4. [“Squirtle”, “is”, “a”, “water”, “type”]
  5. [“Jigglypuff”, “is”, “a”, “fairy”, “type”]
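To make this concrete, here is a minimal Python sketch of the tokenization step. It assumes simple whitespace splitting with surrounding punctuation stripped; real pipelines usually lowercase and use a proper tokenizer, but case is kept here so the output matches the lists above. The `tokenize` helper is just for illustration.

```python
import string

# The Pokémon corpus defined above
corpus = [
    "Pikachu is an electric type.",
    "Charizard is a fire and flying type.",
    "Bulbasaur is a grass and poison type.",
    "Squirtle is a water type.",
    "Jigglypuff is a fairy type.",
]

def tokenize(text):
    # Split on whitespace and strip surrounding punctuation from each token
    return [word.strip(string.punctuation) for word in text.split()]

tokenized = [tokenize(phrase) for phrase in corpus]
print(tokenized[0])  # ['Pikachu', 'is', 'an', 'electric', 'type']
```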

Building the vocabulary

The next step is to build a vocabulary from the tokenized phrases. The vocabulary is the set of unique words in the corpus, listed here in order of first appearance. For our example, it contains 17 words:

[“Pikachu”, “is”, “an”, “electric”, “type”, “Charizard”, “a”, “fire”, “and”, “flying”, “Bulbasaur”, “grass”, “poison”, “Squirtle”, “water”, “Jigglypuff”, “fairy”]
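Here is a small sketch of the vocabulary-building step, starting from the tokenized phrases above. Keeping first-occurrence order is a choice made for illustration, so the positions line up with the vectors in the next section; the `build_vocabulary` helper is hypothetical.

```python
# Tokenized phrases from the previous step
tokenized = [
    ["Pikachu", "is", "an", "electric", "type"],
    ["Charizard", "is", "a", "fire", "and", "flying", "type"],
    ["Bulbasaur", "is", "a", "grass", "and", "poison", "type"],
    ["Squirtle", "is", "a", "water", "type"],
    ["Jigglypuff", "is", "a", "fairy", "type"],
]

def build_vocabulary(tokenized_phrases):
    # Collect each word the first time it appears, preserving that order
    vocabulary = []
    seen = set()
    for tokens in tokenized_phrases:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

vocabulary = build_vocabulary(tokenized)
print(len(vocabulary))  # 17
```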

Vectorization methods

There are several methods for vectorizing text. We will discuss two common ones: Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).

Bag-of-Words (BoW)

In the BoW model, each phrase is represented as a vector with a length equal to the vocabulary size. Each position in the vector corresponds to a word in the vocabulary, and the value at each position is the count of the word’s occurrences in the phrase.

Let’s create the BoW vectors for our corpus:

“Pikachu is an electric type.”

  • Vector: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

“Charizard is a fire and flying type.”

  • Vector: [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

“Bulbasaur is a grass and poison type.”

  • Vector: [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]

“Squirtle is a water type.”

  • Vector: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

“Jigglypuff is a fairy type.”

  • Vector: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
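The counting step itself is short. Here is a minimal sketch, assuming the vocabulary order listed earlier; the hypothetical `bow_vector` helper counts each token and reads the counts off in vocabulary order.

```python
from collections import Counter

vocabulary = [
    "Pikachu", "is", "an", "electric", "type", "Charizard", "a", "fire",
    "and", "flying", "Bulbasaur", "grass", "poison", "Squirtle", "water",
    "Jigglypuff", "fairy",
]

def bow_vector(tokens, vocabulary):
    # Count the tokens, then read the counts off in vocabulary order
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

print(bow_vector(["Pikachu", "is", "an", "electric", "type"], vocabulary))
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

In practice, scikit-learn’s CountVectorizer does the tokenization, vocabulary building, and counting in one step; note that it lowercases by default, so its column order will differ from the hand-built vocabulary above.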

Term Frequency–Inverse Document Frequency (TF-IDF)

Imagine you have a collection of articles about Pokémon. Common words like “Pokémon” and “type” might appear frequently across all articles, while specific Pokémon names like “Pikachu” or “Charizard” might appear only in particular articles. TF-IDF helps you identify which Pokémon names are most relevant to each article, ignoring the common words.

For instance, in an article mostly about “Pikachu,” the word “Pikachu” will have a high TF-IDF score compared to its score in other articles where it might be mentioned less frequently. Conversely, the word “Pokémon” will have a lower TF-IDF score because it appears in every article and is not specific to just one.

Using TF-IDF, you can effectively highlight the unique terms that best describe each document within a larger set, making it a powerful tool for text analysis and information retrieval.
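Concretely, the standard formulation weights a term by its frequency in the document (TF) multiplied by its inverse document frequency (IDF), typically log(N / df), where N is the number of documents and df is the number of documents containing the term. As a rough sketch (not a full walkthrough of the math), here is how this looks with scikit-learn’s TfidfVectorizer, which implements a smoothed variant of this weighting; the output values will therefore differ slightly from a hand computation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Pikachu is an electric type.",
    "Charizard is a fire and flying type.",
    "Bulbasaur is a grass and poison type.",
    "Squirtle is a water type.",
    "Jigglypuff is a fairy type.",
]

# TfidfVectorizer tokenizes, builds the vocabulary, and applies TF-IDF weights.
# Note: it lowercases and drops single-character tokens by default, so its
# columns differ slightly from the hand-built vocabulary above.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Non-zero weights for the first phrase, "Pikachu is an electric type."
words = vectorizer.get_feature_names_out()
for word, weight in zip(words, tfidf_matrix.toarray()[0]):
    if weight > 0:
        print(f"{word}: {weight:.2f}")
```

Rare, document-specific words like “Pikachu” end up with higher weights than words like “is” or “type” that appear in every phrase, which is exactly the behavior described above.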

Query and relevance

Let’s assume the query is: “electric type”

Tokenize the query:

  • [“electric”, “type”]

Create a query vector:

  • The query vector is built over the same vocabulary positions as the document vectors, with a 1 at the positions of “electric” and “type”:
  • Query Vector: [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Calculate relevance

To find the most relevant documents, we can use the dot product of the query vector with each document vector. The dot product will give a score indicating how many words from the query appear in each document.

Dot product calculation:

Document 1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

  • Score: (0 × 1) + (0 × 1) + (0 × 1) + (1 × 1) + (1 × 1) + … = 1 + 1 = 2

Document 2: [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

  • Score: (0 × 0) + (0 × 1) + (0 × 0) + (1 × 0) + (1 × 1) + … = 1

Document 3: [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]

  • Score: (0 × 0) + (0 × 1) + (0 × 0) + (1 × 0) + (1 × 1) + … = 1

Document 4: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

  • Score: (0 × 0) + (0 × 1) + (0 × 0) + (1 × 0) + (1 × 1) + … = 1

Document 5: [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

  • Score: (0 × 0) + (0 × 1) + (0 × 0) + (1 × 0) + (1 × 1) + … = 1
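Putting the pieces together, here is a minimal sketch that builds the query vector the same way as the document vectors and scores every document with a dot product. The `bow_vector` helper is the same illustrative one used earlier, with the trailing period stripped during tokenization.

```python
from collections import Counter

vocabulary = [
    "Pikachu", "is", "an", "electric", "type", "Charizard", "a", "fire",
    "and", "flying", "Bulbasaur", "grass", "poison", "Squirtle", "water",
    "Jigglypuff", "fairy",
]

documents = [
    "Pikachu is an electric type.",
    "Charizard is a fire and flying type.",
    "Bulbasaur is a grass and poison type.",
    "Squirtle is a water type.",
    "Jigglypuff is a fairy type.",
]

def bow_vector(text, vocabulary):
    # Tokenize by whitespace, strip the trailing period, count in vocabulary order
    counts = Counter(word.strip(".") for word in text.split())
    return [counts[word] for word in vocabulary]

query_vector = bow_vector("electric type", vocabulary)
doc_vectors = [bow_vector(doc, vocabulary) for doc in documents]

# Dot product = number of shared query/document word occurrences
scores = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]

for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(score, doc)
# 2 Pikachu is an electric type.
# 1 Charizard is a fire and flying type.
# ...
```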

Retrieve the most relevant documents

Based on the scores, the most relevant documents for the query “electric type” are:

  1. Document 1: “Pikachu is an electric type.” (Score: 2)
  2. Documents 2, 3, 4, and 5: they all have a score of 1 (they match only “type”), so ranking among them would require further context or criteria.

Conclusion

Vectorization transforms text into numerical representations that machine learning algorithms can efficiently process. By using a simple Pokémon corpus, we’ve demonstrated the process of tokenization, vocabulary building, and vectorization using BoW and TF-IDF. Understanding these concepts is crucial for tasks like text classification, clustering, and similarity measurement in NLP.

Hope it helps, see ya 🤟
