The standard approach to text vectorization is to define a fixed-length vector of features, one for each unique word in a predefined dictionary. Each entry in the vector corresponds to one word from the dictionary, so the size of the vector equals the size of the dictionary.

Each word in a text is represented as an array whose length equals the total number of unique words in the dictionary. Thus, each word in the text is mapped to a real number at the corresponding position in the feature vector.

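For example, consider a predefined dictionary that contains the unique words {cat, loves, to, play, with, ball}. If the text “A cat loves to play with a ball” is vectorized word by word, the result is (0, 1, 1, 1, 1, 1, 0, 1): one entry per word of the text, where 1 marks a word that is found in the dictionary and 0 marks a word that is not (here, “A” and “a”). Expressed with one entry per dictionary word, the same text gives (1, 1, 1, 1, 1, 1), because all six dictionary words occur in it.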
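The example can be reproduced with a short Python sketch. The function names `token_flags` and `dictionary_vector` are hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions; this is an illustration of the idea, not Rubiscape's implementation.

```python
# Dictionary from the example in the text.
DICTIONARY = ["cat", "loves", "to", "play", "with", "ball"]


def tokenize(text):
    """Lowercase the text and split it on whitespace (an assumption)."""
    return text.lower().split()


def token_flags(text, dictionary):
    """One entry per word of the text: 1 if the word is in the dictionary."""
    known = set(dictionary)
    return [1 if token in known else 0 for token in tokenize(text)]


def dictionary_vector(text, dictionary):
    """One entry per dictionary word: 1 if the word occurs in the text."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in dictionary]


text = "A cat loves to play with a ball"
flags = token_flags(text, DICTIONARY)
vector = dictionary_vector(text, DICTIONARY)
print(flags)   # one flag per word of the text
print(vector)  # one flag per dictionary word
```

The first list has as many entries as the text has words, while the second always has as many entries as the dictionary, which is what makes it usable as a fixed-length feature vector.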

In Rubiscape, you can convert multiple texts to numeric feature vectors using techniques such as Count Vectorization and TF-IDF (Term Frequency-Inverse Document Frequency), together with options such as stop-word removal and N-grams.
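To make these techniques concrete, the following is a minimal pure-Python sketch of count vectorization with stop-word removal and N-grams, followed by a textbook TF-IDF weighting (term frequency multiplied by log(N / document frequency)). The stop-word list, the function names, and the exact TF-IDF formula are assumptions for illustration; tools commonly add smoothing or normalization, and Rubiscape's exact formula is not stated here.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not Rubiscape's list).
STOP_WORDS = {"a", "the", "to", "with"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in stop_words]


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectors(texts, n=1):
    """Return the sorted vocabulary and one count vector per text."""
    docs = [ngrams(tokenize(t), n) for t in texts]
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


def tfidf_vectors(texts, n=1):
    """Weight each count by tf * log(N / df), a textbook TF-IDF variant."""
    vocabulary, counts = count_vectors(texts, n)
    n_docs = len(texts)
    # Document frequency: number of texts that contain each term.
    df = [sum(1 for row in counts if row[j] > 0)
          for j in range(len(vocabulary))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for row in counts:
        total = sum(row) or 1  # avoid division by zero for empty texts
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocabulary, vectors


texts = ["A cat loves to play with a ball", "The cat loves the cat"]
vocab, counts = count_vectors(texts)            # single words (unigrams)
bi_vocab, bi_counts = count_vectors(texts, n=2)  # word pairs (bigrams)
_, tfidf = tfidf_vectors(texts)
print(vocab, counts)
print(bi_vocab, bi_counts)
print(tfidf)
```

In this sketch, terms that appear in every text (such as "cat") receive a TF-IDF weight of zero, while terms that appear in only one text are weighted up. That is the general intent of TF-IDF: to emphasize the terms that distinguish one text from the others.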

Text vectorization does not capture the meaning of a text or the context in which its words appear.
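A small sketch illustrates this limitation: two sentences with opposite meanings produce the same count vector, because word order is discarded. The helper `count_vector` and the sample sentences are hypothetical and only serve the illustration.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]
first = count_vector("Dog bites man", vocabulary)
second = count_vector("Man bites dog", vocabulary)
print(first, second, first == second)
```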

List of Text Vectorization Algorithms

In Rubiscape, two text vectorization algorithms are available.

  • CountVectorizer
  • TF-IDF

Notes:

  • The Reader (dataset) should be connected to the algorithm.
  • These algorithms can be used only on textual data.
  • CountVectorizer and TF-IDF convert processed texts to numeric feature vectors.
  • The resulting feature vectors are used to determine similarities between texts, for example for text classification.
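As an illustration of how feature vectors can be compared, the sketch below computes the cosine similarity between count vectors. Cosine similarity is one common choice and is an assumption here; the notes above do not state which similarity measure is used. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["ball", "cat", "dog", "loves", "play"]
a = count_vector("cat loves ball", vocabulary)
b = count_vector("cat loves play", vocabulary)
c = count_vector("dog", vocabulary)

sim_ab = cosine_similarity(a, b)  # texts sharing two of three words
sim_ac = cosine_similarity(a, c)  # texts with no words in common
print(sim_ab, sim_ac)
```

Texts that share more vocabulary score closer to 1, and texts with no words in common score 0, which is why such scores can serve as input to a classifier or a nearest-neighbor lookup.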

