CountVectorizer

Description

Transforms a collection of texts into a sparse matrix at the token level, based on the frequency of each unique word (feature) in the whole text (dictionary).

Why to use

For vectorization of multiple texts in a dictionary.

When to use

  • To convert each word in a text into a vector, based on the frequency of the word in the dictionary.

When not to use

On numerical data.

Prerequisites

  • The input variable should be of text type.
  • The input variable should be processed text.

Input

Any dataset that contains text data.

Output

  • A sparse matrix in which each number in each cell represents the count of a word in a particular text.
  • Each column of the matrix represents a feature from the dictionary.

Statistical Methods used

  • N-gram
  • Stop words

Limitations

Cannot be used on data other than text data.

The terms that are useful in understanding CountVectorizer are given below.

Text – it is a single text data point within a textual dataset, for example, a user review on a product X.

Textual Dataset – it is a collection of all the texts, for example, a collection of all user reviews for the product X.

Feature – each unique word in the textual dataset.

Consider a textual dataset with the following user reviews as sample texts:

[“Product X is overrated”, “It is good”, “Product X needs improvement”]

where,

  1. Product X is overrated is text0.
  2. It is good is text1.
  3. Product X needs improvement is text2.

Here, CountVectorizer creates a sparse matrix in which each text is mapped to a feature vector of word counts. Each column in the sparse matrix represents a feature, and each feature has an index number. Each text from the textual dataset is a row in the sparse matrix. The value of each cell is the count of the feature in that particular text.

This can be visualized as below.

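| Feature → | Product | x | is | overrated | it | good | needs | improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| text0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| text1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| text2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Count → | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |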

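Thus, the rows text0, text1, and text2 (shown in blue) are the actual representation of the sparse matrix for the example. The Count row gives the total count of each feature across all texts.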
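The construction of this matrix can be sketched in a few lines of plain Python. This is only an illustration of the counting idea, not the product's implementation: it assumes that texts are lowercased and split on whitespace, and it lists features in order of first appearance rather than alphabetically.

```python
# Pure-Python sketch of count vectorization (illustrative only).
texts = ["Product X is overrated", "It is good", "Product X needs improvement"]

# Assumption: lowercase each text and split it on whitespace.
tokenized = [text.lower().split() for text in texts]

# Build the feature list in order of first appearance.
features = []
for tokens in tokenized:
    for token in tokens:
        if token not in features:
            features.append(token)

# One row per text, one column per feature; each cell is a count.
matrix = [[tokens.count(feature) for feature in features] for tokens in tokenized]

# Summing each column gives the total count of each feature.
totals = [sum(column) for column in zip(*matrix)]

print(features)
for row in matrix:
    print(row)
print(totals)
```

Most cells are zero because each text uses only a few of the features, which is why the result is stored as a sparse matrix.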

CountVectorizer is located under Textual Analysis, in Text Vectorization, in the task pane on the left. Drag and drop the node (or double-click it) to use the algorithm on the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of CountVectorizer.

Properties of CountVectorizer

The available properties of CountVectorizer are as shown in the figure given below.



The table given below describes the different fields present on the Properties pane of CountVectorizer.

| Field | Description | Remark |
| --- | --- | --- |
| Task Name | It displays the name of the selected task. | You can click the text field to edit or modify the name of the task as required. |
| Text | It allows you to select the text variable for which you need to perform the task. | Only one data field can be selected. Only data fields with text values are available. |
| Advanced: Lowercase | It converts the features to lowercase if selected as True. | The default value is True. |
| Advanced: Ngram Minimum Range | It sets the minimum number of words N in a word sequence (N-gram) that is extracted as a feature, where N = 1, 2, 3, and so on. | The default value is 1. Only numerical values can be entered. |
| Advanced: Ngram Maximum Range | It sets the maximum number of words N in a word sequence (N-gram) that is extracted as a feature, where N = 1, 2, 3, and so on. | The default value is 1. Only numerical values can be entered. |
| Advanced: Stop Words | It allows you to enter one or more stop words. You can also enter English to use the standard English set of stop words. | The default value is None. Entering English excludes all the stop words in the standard set from the feature columns. When entering multiple stop words, separate them with commas. |
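The effect of the Lowercase, Ngram Minimum Range, Ngram Maximum Range, and Stop Words properties can be illustrated with a small plain-Python sketch. The helper `extract_features` and the stop-word subset below are hypothetical and are not part of the product; the sketch assumes whitespace tokenization and that stop words are removed before N-grams are formed.

```python
from collections import Counter

# Illustrative subset only; the standard English stop-word set is much larger.
SAMPLE_ENGLISH_STOP_WORDS = {"is", "it", "the", "a", "and"}

def extract_features(text, lowercase=True, ngram_min=1, ngram_max=1, stop_words=None):
    """Return a Counter of N-gram features for one text (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    counts = Counter()
    # Collect every run of n consecutive tokens for n in [ngram_min, ngram_max].
    for n in range(ngram_min, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

unigrams = extract_features("Product X is overrated")
bigrams_too = extract_features("Product X is overrated", ngram_max=2)
no_stop = extract_features("Product X is overrated", stop_words=SAMPLE_ENGLISH_STOP_WORDS)
cased = extract_features("And and", lowercase=False)

print(sorted(unigrams))     # single words only
print(sorted(bigrams_too))  # single words plus two-word sequences
print(sorted(no_stop))      # "is" is excluded
print(dict(cased))          # 'And' and 'and' stay separate features
```

With both range values set to 1, only single words become features. Raising the maximum to 2 adds two-word sequences such as "product x" as additional feature columns.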

Example of CountVectorizer

Consider a dataset with one of the variables as a text variable. A snippet of the input data is shown in the figure given below.




In the Properties pane, the values are selected as shown in the table below.

| Property | Value |
| --- | --- |
| Text | Text |
| Lowercase | True |
| Ngram Minimum Range | 1 |
| Ngram Maximum Range | 1 |
| Stop Words | None |


The first part of the Result of CountVectorizer is shown in the figure below.


The second part of the Result of CountVectorizer is shown in the figure below.


The Result page displays the Sparse Matrix for the selected text variable.
A sparse matrix is a structure that contains as many rows as there are data points and as many columns as there are features. On the Result page,

  • Each Features column represents one feature in the dictionary.
  • Each Count column represents the number of occurrences of that feature in the dictionary.
  • Each Probability column represents the probability of occurrence of that feature in a sequence of N words.

Key Observations:

  • The dictionary (content_Original) contains 9 features.
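  • Each text is a row in the sparse matrix and each feature is a column. Since the text variable contains 5 texts and the dictionary contains 9 features, the sparse matrix has 5 rows and 9 columns.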
  • Each cell in a row contains a number which is the count of the feature in that text.
  • The Result page displays five features on each page. Here, the total number of features is 9. In our example, the nine features are code, contains, in, is, language, notebook, programming, python, and this.
  • The features in the columns are arranged alphabetically.
  • To navigate to the next five features, you can click the next arrow icon.
  • Each feature is converted to lowercase since the value selected in the Lowercase drop-down in the Properties pane is True.

In the above example, the count of the feature code is 3. Its probability is 0.13636 and is calculated according to the Ngram Minimum Range and Ngram Maximum Range values entered in the Properties pane. No word is excluded from the features columns since stop words are not defined in the Properties pane.
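The source does not give the formula behind the Probability column. One plausible reading, stated here purely as an assumption, is a relative frequency: the count of a feature divided by the total count of all extracted features. Under that reading, a count of 3 out of a total of 22 would give 3/22 ≈ 0.13636; the total of 22 is not stated on this page and is only inferred from the reported value. The sketch below applies the same idea to the column totals of the three-review example earlier on this page; `relative_frequencies` is a hypothetical helper.

```python
def relative_frequencies(counts):
    """Divide each feature count by the total of all counts (assumed definition)."""
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}

# Column totals obtained by summing each feature over the three sample reviews.
counts = {"product": 2, "x": 2, "is": 2, "overrated": 1,
          "it": 1, "good": 1, "needs": 1, "improvement": 1}

probabilities = relative_frequencies(counts)
print(round(probabilities["product"], 5))  # 2 / 11
print(round(3 / 22, 5))                    # value implied by the assumed total of 22
```

Because changing the N-gram range changes the set of extracted features, it also changes the total used in the denominator, which would explain why the probability depends on the Ngram Minimum Range and Ngram Maximum Range values.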

Notes:

  • If stop words are used, the words specified as stop words are excluded from the feature columns in the matrix.
  • If the value selected in the Lowercase drop-down in the Properties pane is False, the features retain the case of the original text. For example, the words 'and' and 'And' are considered different features.

You can click the publish icon on the CountVectorizer task node to publish the model. The model can be reused in a workbook and workflow for training and experimenting, or it can be used in a workflow for production. For more information on publishing a task, refer to Publishing Models.
A snippet of the text variable is shown in the figure given below.


Note the following on the Data page:

  • The text variable column is named <variable name>_Original and shows its texts in rows.
  • The Sparse Matrix column contains the binary representation of each feature (in the alphabetical sequence shown on the Result page) for each row.
  • You can download the text data from the Data page using the download icon.
  • You can hover over a row in the text variable column to view the entire text in that row.

You can see that

  • The selected text variable (dictionary) contains 5 rows of text data.
  • Each row represents a text in the dictionary.
  • The Data page in the CountVectorizer result displays the text variable and its texts.
