During my time at Epimorphics, I developed a machine learning algorithm to match food business addresses in one database against a second address database with properties better suited to linked data.
One of the most important factors in predicting matches between the two databases is having a robust method for comparing the business name within address pairs. Features encoding information about the business names were consistently valued by the machine learning algorithms I tested. Given this, I spent part of my time investigating novel approaches to improving the name matching capabilities of my algorithm. One of my favourite approaches was to look at the term frequencies across the name fields in the data set.
Initially, I weighted words by how often they occurred, giving less weight to common ones, on the assumption that frequent words, such as ‘the’ or ‘and’, are less likely to carry valuable information. Weighting terms by their frequency did improve my models, but it is not an original idea [1]. I wondered whether I could go further: predict whether each pair in a large set of unlabelled pairs was a match or not, then compare the term frequencies across the sets generated by this labelling.
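To make the initial frequency weighting concrete, here is a minimal sketch in plain Python. It is not the code I used at Epimorphics: the function names, the smoothed inverse-document-frequency formula and the example names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(name):
    """Lower-case a business name and split it on whitespace."""
    return name.lower().split()


def idf_weights(names):
    """Give each term a weight that shrinks as it appears in more names.

    Uses a smoothed inverse document frequency (an assumed formula,
    similar in spirit to TF-IDF term weighting).
    """
    n = len(names)
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(tokenize(name)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def weighted_overlap(a, b, weights, default=1.0):
    """Weighted Jaccard similarity: shared weight divided by total weight."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    union = ta | tb
    if not union:
        return 0.0

    def weight(term):
        # Terms never seen when building the weights fall back to `default`.
        return weights.get(term, default)

    return sum(weight(t) for t in ta & tb) / sum(weight(t) for t in union)


# Made-up example names, purely for illustration.
names = ["the red lion", "the golden fleece", "the crown and anchor", "red lion inn"]
weights = idf_weights(names)
print(weighted_overlap("the red lion", "the golden fleece", weights))
print(weighted_overlap("the red lion", "red lion inn", weights))
```

With these weights, two names that only share ‘the’ score lower than they would under an unweighted token overlap, which is the effect the weighting is meant to have.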
Using this approach, I found the term frequency comparisons were fairly insightful. I produced the following word clouds to visualise the results.
The first compares the term frequencies over all the examples, irrespective of their predicted label, with those over the examples that were labelled as non-matches:
This word cloud tends to highlight terms that often indicate the premises are not a food business.
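A comparison of this kind can be sketched as follows. The scoring rule here, the difference between a term's relative frequency in the labelled subset and its relative frequency over all the data, is an assumption for illustration rather than the exact measure behind the word clouds, and the names and predicted labels are made up.

```python
from collections import Counter


def term_counts(names):
    """Count how often each lower-cased term occurs across the names."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts


def distinctive_terms(subset_names, all_names, top=3):
    """Rank terms by (relative frequency in subset) - (relative frequency overall)."""
    sub, full = term_counts(subset_names), term_counts(all_names)
    sub_total = sum(sub.values()) or 1    # avoid dividing by zero on an empty subset
    full_total = sum(full.values()) or 1
    scores = {t: sub[t] / sub_total - full[t] / full_total for t in full}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical (name, predicted_is_match) records standing in for model output.
predicted = [
    ("joes cafe", True),
    ("village hall", False),
    ("corner cafe", True),
    ("church hall", False),
    ("sunrise bakery", True),
]
names = [name for name, _ in predicted]
unmatched = [name for name, is_match in predicted if not is_match]

# Terms over-represented among the negatively labelled names.
print(distinctive_terms(unmatched, names))
```

The scores could then be used as the word sizes in a word cloud. Passing the names predicted as matches instead of the non-matches gives the opposite comparison.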
Conversely, we can compare the term frequencies over all the data with those over the pairs that were labelled as matches. This finds words that indicate the premises are likely to be a food business:
My final model took a bootstrapped approach: it relied on an earlier version of the machine learning model to make the initial predictions from which the term frequencies were extracted. A more advanced machine learning algorithm then used this information to make the final predictions.
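The shape of that two-stage pipeline can be sketched roughly as below. Everything in it is a stand-in: `first_stage_predict` is a hypothetical threshold on plain token overlap rather than the earlier model, the term score is an assumed relative-frequency difference, and the pairs are invented. The point is only to show how pseudo-labels from one stage can become a feature for the next.

```python
from collections import Counter


def tokens(name):
    return set(name.lower().split())


def first_stage_predict(pair, threshold=0.5):
    """Stand-in for the earlier model: unweighted Jaccard overlap on name tokens."""
    a, b = tokens(pair[0]), tokens(pair[1])
    if not (a | b):
        return False
    return len(a & b) / len(a | b) >= threshold


def term_scores(pairs, labels):
    """Score = relative frequency among predicted matches minus relative frequency overall."""
    all_counts, match_counts = Counter(), Counter()
    for (a, b), is_match in zip(pairs, labels):
        terms = a.lower().split() + b.lower().split()
        all_counts.update(terms)
        if is_match:
            match_counts.update(terms)
    n_all = sum(all_counts.values()) or 1
    n_match = sum(match_counts.values()) or 1
    return {t: match_counts[t] / n_match - all_counts[t] / n_all for t in all_counts}


def second_stage_features(pair, scores):
    """Features a later, stronger model could consume alongside its other inputs."""
    a, b = tokens(pair[0]), tokens(pair[1])
    shared, union = a & b, a | b
    return {
        "jaccard": len(shared) / len(union) if union else 0.0,
        "shared_term_score": sum(scores.get(t, 0.0) for t in shared),
    }


# Invented unlabelled pairs, purely for illustration.
unlabelled_pairs = [
    ("joes cafe", "joes cafe ltd"),
    ("corner cafe", "the corner cafe"),
    ("village hall", "church hall"),
    ("sunrise bakery", "northgate garage"),
]
pseudo_labels = [first_stage_predict(p) for p in unlabelled_pairs]   # stage one
scores = term_scores(unlabelled_pairs, pseudo_labels)                # extract term information
features = [second_stage_features(p, scores) for p in unlabelled_pairs]  # input to stage two
print(features)
```

In this toy data, a pair that only shares ‘hall’ ends up with a negative `shared_term_score`, while pairs sharing ‘cafe’ get a positive one.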
You can read a more detailed version of my time at Epimorphics here.
[1] https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting