The standard approach to text vectorization is to define a fixed-length vector with one entry (feature) for each unique word in a predefined dictionary. The size of the vector is therefore equal to the size of the dictionary.
Each text is represented as an array whose length is the total number of unique words in the dictionary. Each dictionary word is mapped to a real number at its own position in the feature vector, for example 1 if the word occurs in the text and 0 if it does not.
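For example, consider a predefined dictionary that contains the unique words {cat, loves, to, play, with, ball}. If the text “A cat loves to play with a ball” is checked against this dictionary word by word, the result is (0, 1, 1, 1, 1, 1, 0, 1): the two zeros mark the word “a”, which is not in the dictionary. The corresponding fixed-length feature vector, with one entry per dictionary word, is (1, 1, 1, 1, 1, 1), because all six dictionary words occur in the text.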
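The example above can be reproduced with a few lines of Python. This is only an illustrative sketch, not how Rubiscape implements vectorization: the function names `tokenize`, `presence_vector`, and `token_flags` are hypothetical, and tokenization is assumed to be simple lowercase word splitting. It computes both the per-token values (0, 1, 1, 1, 1, 1, 0, 1) and a vector with one entry per dictionary word.

```python
import re

# Dictionary from the example; the list order fixes the vector positions.
DICTIONARY = ["cat", "loves", "to", "play", "with", "ball"]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def presence_vector(text, dictionary):
    """One 0/1 entry per dictionary word: 1 if the word occurs in the text."""
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in dictionary]

def token_flags(text, dictionary):
    """One 0/1 entry per token of the text: 1 if the token is in the dictionary."""
    known = set(dictionary)
    return [1 if token in known else 0 for token in tokenize(text)]

text = "A cat loves to play with a ball"
print(presence_vector(text, DICTIONARY))  # [1, 1, 1, 1, 1, 1]
print(token_flags(text, DICTIONARY))      # [0, 1, 1, 1, 1, 1, 0, 1]
```

The first result always has the same length as the dictionary, whatever the length of the text, which is what makes it usable as a fixed-length feature vector.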
In Rubiscape, you can convert multiple texts to numeric feature vectors with techniques such as Count Vectorization and TF-IDF (Term Frequency-Inverse Document Frequency), together with options such as stop-word removal and N-grams.
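To illustrate what stop-word removal and N-grams mean, here is a minimal Python sketch. It is not Rubiscape's implementation: the stop-word list is a deliberately tiny, hypothetical one, and the helper names `remove_stop_words` and `ngrams` are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that carry little content on their own."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("A cat loves to play with a ball")
content = remove_stop_words(tokens)
bigrams = ngrams(content, 2)

print(content)  # ['cat', 'loves', 'play', 'ball']
print(bigrams)  # ['cat loves', 'loves play', 'play ball']
```

Removing stop words shrinks the dictionary, while N-grams enlarge it with multi-word features that preserve a little local word order.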
Text vectorization of this kind does not capture the meaning of a text or the context of the words in it.
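A short sketch makes this limitation concrete. Assuming a simple word-count vector over a hypothetical three-word vocabulary (the function name `count_vector` is made up for this example), two sentences with opposite meanings produce exactly the same vector, because word order is discarded.

```python
import re
from collections import Counter

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
v1 = count_vector("Dog bites man", vocabulary)
v2 = count_vector("Man bites dog", vocabulary)

print(v1)        # [1, 1, 1]
print(v2)        # [1, 1, 1]
print(v1 == v2)  # True
```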
List of Text Vectorization Algorithms
In Rubiscape, two text vectorization algorithms are available:
- CountVectorizer
- TF-IDF
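The difference between the two algorithms can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not Rubiscape's implementation: the vocabulary is built from the input texts, term frequency is taken as count divided by document length, and inverse document frequency as the natural logarithm of (number of documents / number of documents containing the word). Libraries often use smoothed or normalized variants of this formula, so actual values may differ. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Collect the unique words of all documents in sorted order."""
    return sorted({token for doc in docs for token in tokenize(doc)})

def count_vectorize(docs, vocabulary):
    """Count vectorization: one row per document, one count per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

def tfidf_vectorize(docs, vocabulary):
    """TF-IDF (unsmoothed variant): tf = count / doc length, idf = ln(N / df)."""
    counts = count_vectorize(docs, vocabulary)
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([(c / total) * w if total else 0.0 for c, w in zip(row, idf)])
    return rows

docs = ["the cat loves the ball", "the dog loves the cat"]
vocabulary = build_vocabulary(docs)
counts = count_vectorize(docs, vocabulary)
tfidf = tfidf_vectorize(docs, vocabulary)

print(vocabulary)                        # ['ball', 'cat', 'dog', 'loves', 'the']
print(counts)                            # [[1, 1, 0, 1, 2], [0, 1, 1, 1, 2]]
print([round(x, 4) for x in tfidf[0]])   # [0.1386, 0.0, 0.0, 0.0, 0.0]
```

In the count vectors, the frequent word "the" has the largest value. With this TF-IDF variant, words that occur in every document ("cat", "loves", "the") receive a weight of 0, and only the words that distinguish one document from the other ("ball", "dog") keep a non-zero weight.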