Methods
There are three methods for this feature:
Threshold Matching:
It compares the string values based on fuzzy logic and calculates a match score for each dataset row. Also, the user defines a match threshold value.
If the match score is more than or equal to the match threshold value, the two strings are said to be mapped; otherwise, they are not mapped.
For example, consider two records related to a customer's name, Stephan and Stefan. They repeatedly appear in customer data in multiple rows. The two names are similarly pronounced but have different spelling. You can compare the customer records for the two records using fuzzy logic. You can use the lookup functionality to map the two records.
Metaphone:
It generates a phonetic key for both the Lookup Feature (selected from the Lookup connection) and Input Feature (selected from the predecessor node to lookup) based on the pronunciation of the word/sentence. It then produces a match score to see whether the two strings match.
Double Metaphone:
It generates two phonetic keys for the Lookup Feature and Input Feature and then produces a match score.
Notes: |
|
Example of Fuzzy Logic using Threshold Matching
Consider the datasets IRISDatasetModified and IRIS_Dataset with ID, Species, Sepal Length, Sepal Width, Petal Length, and Petal Width columns. They are used for selecting the Input Feature and Lookup Feature, respectively. The input data is shown below.
Figure: IRISDatasetModified Dataset
The Lookup feature dataset is shown below.
Figure: IRIS_Dataset
We apply the following condition to implement the Threshold Match method of fuzzy lookup.
Figure: Lookup Condition for Threshold Matching
The following Lookup properties are selected as below.
Property | Value |
Lookup Connection | IRIS_Dataset |
Lookup Features to Return | Id |
If non-matching records found? | Capture Non-Matching Records |
A snippet of the Lookup output data is displayed in the figure below.
Figure: Lookup Output Data for Threshold Matching
Observations:
- A Lookup_Flag value of 1 for all values indicates that for a Match Threshold of 0.7, all the three species values from input data are mapped. (for an unmapped value, the Lookup_Flag value is zero.)
- The Id_Lookup column shows those row IDs in the IRIS_Dataset whose species values are mapped with those from the IRISDatasetModified dataset. Each Id is mapped with the first value from the Lookup_Id whose match score is greater than the threshold.
Ids | Mapped to Lookup_Id |
---|---|
1, 2, and 3 | 1 |
4, 5, 6 and 7 | 51 |
8, 9, 10, 11, and 12 | 101 |
For example, the Lookup_Id 51 (Lookup_Species) has a match score greater than the threshold for the versicolor species. Hence all the versicolor and iris-versicolor values from input data are mapped with it.
The result page displays the ratio value (match score) for the three species.
Figure: Lookup Output Result for Threshold Matching
As you can see, all the three values, 1, 0.8, and 0.7826 are greater than the match threshold of 0.7. Hence, all the species values are mapped.
Example of Fuzzy Logic using Metaphone
For the same pair of datasets, we use the Metaphone method.
For this, we apply the following condition.
Figure: Lookup Condition for Metaphone
A snippet of the Lookup output data is displayed in the figure below.
Figure: Lookup Output Data for Metaphone
The result page displays the Lookup Metaphone phonetic keys and their ratio value (match score) for the three species.
Figure: Lookup Output Result for Metaphone
As you can see,
The phonetic key created for
setosa – STS
versicolor – FRSKLR
virginica – FRJNK
All the three values, 1, 0.8, and 0.7692, are greater than the match threshold of 0.7. Hence, all the species values are mapped.
Example of Fuzzy Logic using Double Metaphone
For the same pair of datasets, we use the Double Metaphone method.
For this, we apply the following condition.
Figure: Lookup Condition for Double Metaphone
A snippet of the Lookup output data is displayed in the figure below.
Figure: Lookup Output Data for Double Metaphone
The result page displays the lookup double metaphone phonetic key and its ratio value (match score) for the three species.
Figure: Lookup Output Result for Double Metaphone
As you can see,
The double phonetic key created for
setosa – ('STS', ")
versicolor – ('FRSKLR', ")
iris-virginica – ('FRJNK', FRKNK ")
virginica – ('FRJNK', ")
All the four values, 1, 0.8, 0.5455, and 0.7692, are greater than the match threshold of 0.5. Hence, all the species values are mapped.
Table of Contents