The fuzzy lookup is based on the fuzzy logic in mathematics. It is supported only for categorical variables.

Methods

There are three methods for this feature:

Threshold Matching:

It compares the string values based on fuzzy logic and calculates a match score for each dataset row. Also, the user defines a match threshold value.
If the match score is more than or equal to the match threshold value, the two strings are said to be mapped; otherwise, they are not mapped.
For example, consider two records related to a customer's name, Stephan and Stefan. They repeatedly appear in customer data in multiple rows. The two names are similarly pronounced but have different spelling. You can compare the customer records for the two records using fuzzy logic. You can use the lookup functionality to map the two records.

Metaphone:

It generates a phonetic key for both the Lookup Feature (selected from the Lookup connection) and Input Feature (selected from the predecessor node to lookup) based on the pronunciation of the word/sentence. It then produces a match score to see whether the two strings match.

Double Metaphone:

It generates two phonetic keys for the Lookup Feature and Input Feature and then produces a match score.

(info)

Notes:

  • The match threshold value should be between 0 and 1 and is defined by the user.

  • The rows are matched only if the match score is greater than the match threshold value.

Example of Fuzzy Logic using Threshold Matching

Consider the datasets IRISDatasetModified and IRIS_Dataset with ID, Species, Sepal Length, Sepal Width, Petal Length, and Petal Width columns. They are used for selecting the Input Feature and Lookup Feature, respectively. The input data is shown below.



Figure: IRISDatasetModified Dataset

The Lookup feature dataset is shown below.

Figure: IRIS_Dataset

We apply the following condition to implement the Threshold Match method of fuzzy lookup.

Figure: Lookup Condition for Threshold Matching

The following Lookup properties are selected as below.

Property

Value

Lookup Connection

IRIS_Dataset

Lookup Features to Return

Id

If non-matching records found?

Capture Non-Matching Records

A snippet of the Lookup output data is displayed in the figure below.

Figure: Lookup Output Data for Threshold Matching

Observations:
  • A Lookup_Flag value of 1 for all values indicates that for a Match Threshold of 0.7, all the three species values from input data are mapped. (for an unmapped value, the Lookup_Flag value is zero.)
  • The Id_Lookup column shows those row IDs in the IRIS_Dataset whose species values are mapped with those from the IRISDatasetModified dataset. Each Id is mapped with the first value from the Lookup_Id whose match score is greater than the threshold.
IdsMapped to Lookup_Id
1, 2, and 31
4, 5, 6 and 751
8, 9, 10, 11, and 12101

For example, the Lookup_Id 51 (Lookup_Species) has a match score greater than the threshold for the versicolor species. Hence all the versicolor and iris-versicolor values from input data are mapped with it.
The result page displays the ratio value (match score) for the three species.

Figure: Lookup Output Result for Threshold Matching

As you can see, all the three values, 1, 0.8, and 0.7826 are greater than the match threshold of 0.7. Hence, all the species values are mapped.

Example of Fuzzy Logic using Metaphone

For the same pair of datasets, we use the Metaphone method.

For this, we apply the following condition.

Figure: Lookup Condition for Metaphone

A snippet of the Lookup output data is displayed in the figure below.

Figure: Lookup Output Data for Metaphone

The result page displays the Lookup Metaphone phonetic keys and their ratio value (match score) for the three species.

Figure: Lookup Output Result for Metaphone

As you can see,

  • The phonetic key created for

  • setosa – STS

  • versicolor – FRSKLR

  • virginica – FRJNK

  • All the three values, 1, 0.8, and 0.7692, are greater than the match threshold of 0.7. Hence, all the species values are mapped.

Example of Fuzzy Logic using Double Metaphone

For the same pair of datasets, we use the Double Metaphone method.
For this, we apply the following condition.


Figure: Lookup Condition for Double Metaphone

A snippet of the Lookup output data is displayed in the figure below.

Figure: Lookup Output Data for Double Metaphone

The result page displays the lookup double metaphone phonetic key and its ratio value (match score) for the three species.

Figure: Lookup Output Result for Double Metaphone

As you can see,

  • The double phonetic key created for

  • setosa – ('STS', ")

  • versicolor – ('FRSKLR', ")

  • iris-virginica – ('FRJNK', FRKNK ")

  • virginica – ('FRJNK', ")

  • All the four values, 1, 0.8, 0.5455, and 0.7692, are greater than the match threshold of 0.5. Hence, all the species values are mapped.

Table of Contents