In another post, I recall mentioning that AI would likely become the best approach to de-duping data sets. A few people have emailed me asking for a concrete example.
As we know, OpenAI has shaken the tree and shown us a path to artificial general intelligence (AGI), with examples and eye-opening experiences demonstrating that it seems ready to help or harm humanity in profound ways. But back on earth, we presently have simple, practical needs such as de-duping data records or finding similarities among records. A search engine, for example, is all about discovering similarities, a task at which Airtable is woefully inept.
What if you could perform a simple mathematical computation and find all similarities in a data set?
In this brief example, I define the essence of a “dot” product, which is like a cosine similarity function but a little less elegant. Its objective is to compare two arrays of numbers that represent embeddings. Embeddings are a fundamental element of AGI; they take advantage of billions of parameters already computed by OpenAI.
As it happens, embeddings are for sale; they each cost about 1/600th of a cent, making their use quite practical for AI applications. I also realized recently that it’s not necessary to store embedding vectors in a specialized database like Pinecone or Weaviate; they can be stored in Airtable. They’re big arrays, but not onerously large. And I learned that computing similarities, while not simple, is also not prohibitively expensive.
Given a topic, like all records similar to John Smith, we can get the embeddings for all names in a table and then decide which are closely related through simple math. This applies to any data, not just names, of course.
Note the similarity outcome values in the code. Sally Smith (0.8865357851100655) scores far lower than John R. Smith (0.952003729179478). If you want a really powerful search feature, perform these computations and order the results descending. Bob’s your uncle. And how about that de-dupe process that craves fuzzy search? Embeddings might be the answer.
Using this technique, you can create magic in your apps while positioning yourself as a purveyor of AI.
// define the dot product (assumes a and b are the same length)
let dot = (a, b) => a.reduce((sum, n, i) => sum + n * b[i], 0);
// get the first data value
// (getEmbedding_ is a helper that fetches an embedding from OpenAI)
let data1 = "John Smith";
let data1E = JSON.parse(getEmbedding_(data1)).data[0].embedding;
// test the second data value
let data2 = "John L Smith";
let data2E = JSON.parse(getEmbedding_(data2)).data[0].embedding;
Logger.log(data2 + ": " + dot(data1E, data2E)); // 0.9470840588116363
// test the third data value
let data3 = "John Larry Smith";
let data3E = JSON.parse(getEmbedding_(data3)).data[0].embedding;
Logger.log(data3 + ": " + dot(data1E, data3E)); // 0.9181326228180411
// test the fourth data value
let data4 = "John Lawrence Smith";
let data4E = JSON.parse(getEmbedding_(data4)).data[0].embedding;
Logger.log(data4 + ": " + dot(data1E, data4E)); // 0.9289881895977837
// test the fifth data value
let data5 = "John R. Smith";
let data5E = JSON.parse(getEmbedding_(data5)).data[0].embedding;
Logger.log(data5 + ": " + dot(data1E, data5E)); // 0.952003729179478
// test the sixth data value
let data6 = "Sally Smith";
let data6E = JSON.parse(getEmbedding_(data6)).data[0].embedding;
Logger.log(data6 + ": " + dot(data1E, data6E)); // 0.8865357851100655