AI: Discovering Dupe Similarities is Just the Beginning

In another post, I recall mentioning that AI would likely become the best approach to de-duping data sets. A few people have emailed me asking for a concrete example.

As we know, OpenAI has shaken the tree and shown us a path to artificial general intelligence (AGI) with examples and eye-opening experiences demonstrating that it seems ready to help or harm humanity in profound ways. But back on earth, we presently have simple, practical needs such as de-duping data records, or finding similarities among records. A search engine, for example, is all about discovering similarities, a task that Airtable is woefully inept at doing.

What if you could perform a simple mathematical computation and find all similarities in a data set?

In this brief example, I define the essence of a “dot” product, which is like a cosine similarity function but without the normalization step. Its objective is to compare two arrays of numbers that each represent an embedding. Embeddings are a fundamental element of AGI. They take advantage of billions of parameters already computed by OpenAI.
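To make the relationship between the two measures concrete, here is a minimal sketch. It assumes toy three-dimensional vectors (real embeddings are far longer), and the helper names `norm`, `cosine`, and `normalize` are mine, not from the script in this post. OpenAI documents its embeddings as normalized to length 1, which, if that holds for the model you use, is why a plain dot product gives the same number as cosine similarity.

```javascript
// Dot product: sum of the element-wise products of two equal-length arrays.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Euclidean length (norm) of a vector.
const norm = (a) => Math.sqrt(dot(a, a));

// Cosine similarity: the dot product divided by the product of the lengths.
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b));

// Scale a vector so its length is 1.
const normalize = (a) => {
  const n = norm(a);
  return a.map((x) => x / n);
};

// Toy vectors, made up for illustration only.
const u = normalize([3, 4, 0]); // [0.6, 0.8, 0]
const v = normalize([4, 3, 0]); // [0.8, 0.6, 0]

console.log(dot(u, v));    // approximately 0.96
console.log(cosine(u, v)); // approximately 0.96, same as the dot product
```

For vectors that are not unit length, the two values differ, so cosine is the safer choice when you are unsure how the vectors were produced.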

As it happens, embeddings are for sale; they each cost about 1/600th of a cent, making their use quite practical for AI applications. I also realized recently that it’s not necessary to store embedding vectors in a specialized database like Pinecone or Weaviate; they can be stored in Airtable. They’re big arrays, but not onerously large. I also learned that computing similarities, while not trivial, is not a performance problem either.
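One way to keep a vector in an ordinary text field is to store it as a JSON string. The sketch below is only an illustration: `packEmbedding` and `unpackEmbedding` are hypothetical helper names, and rounding to six decimals is my own choice to trim the size (it will nudge similarity scores very slightly).

```javascript
// Serialize an embedding to a compact JSON string for a text field.
// Rounding is optional; fewer digits means a shorter string.
function packEmbedding(vector, digits = 6) {
  return JSON.stringify(vector.map((x) => Number(x.toFixed(digits))));
}

// Turn the stored string back into an array of numbers.
function unpackEmbedding(text) {
  return JSON.parse(text);
}

// Made-up sample values, not a real embedding.
const sample = [0.0123456789, -0.9876543211, 0.5];
const packed = packEmbedding(sample);

console.log(packed);                  // [0.012346,-0.987654,0.5]
console.log(packed.length);           // characters needed to store it
console.log(unpackEmbedding(packed)); // back to an array of numbers
```

Checking `packed.length` for a real vector tells you whether it fits the field type you plan to use.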

Given a topic, like all records similar to John Smith, we can get the embeddings for all names in a table and then decide which are closely related through simple math. This applies to any data, not just names, of course.
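For de-duping specifically, the "simple math" could be a pairwise comparison with a cutoff. This is a sketch under assumptions: the record shape (`name`, `embedding`), the function name `findDuplicateCandidates`, and the threshold are all hypothetical, and the two-dimensional vectors are made up so the arithmetic is easy to follow. They are not real embeddings.

```javascript
// Dot product of two equal-length arrays.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Compare every pair of records once and keep pairs scoring at or above
// the threshold, most similar first.
function findDuplicateCandidates(records, threshold) {
  const pairs = [];
  for (let i = 0; i < records.length; i++) {
    for (let j = i + 1; j < records.length; j++) {
      const score = dot(records[i].embedding, records[j].embedding);
      if (score >= threshold) {
        pairs.push({ a: records[i].name, b: records[j].name, score });
      }
    }
  }
  return pairs.sort((x, y) => y.score - x.score);
}

// Toy unit-length vectors, invented for illustration only.
const records = [
  { name: "John Smith", embedding: [1, 0] },
  { name: "John R. Smith", embedding: [0.8, 0.6] },
  { name: "Sally Smith", embedding: [0, 1] },
];

// One candidate pair: John Smith / John R. Smith with a score of 0.8.
console.log(findDuplicateCandidates(records, 0.7));
```

With real embeddings the scores bunch together (the values in this post run from roughly 0.89 to 0.95), so the threshold needs tuning against your own data. The pairwise loop is also quadratic in the number of records, which is fine for small tables but worth keeping in mind for large ones.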

Note the similarity outcome values in the code. Sally Smith (0.8865357851100655) is far less similar to John Smith than John R. Smith (0.952003729179478) is. If you want a really powerful search feature, perform these computations and order the results descending. Bob’s your uncle. And how about that de-dupe process that craves fuzzy search? Embeddings might be the answer.
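The ordering step can be sketched using the five scores logged by the script in this post (each is a comparison against "John Smith"). The `{ name, score }` shape is just an assumption for illustration.

```javascript
// Scores copied from the log comments in this post's script.
const scores = [
  { name: "John L Smith", score: 0.9470840588116363 },
  { name: "John Larry Smith", score: 0.9181326228180411 },
  { name: "John Lawrence Smith", score: 0.9289881895977837 },
  { name: "John R. Smith", score: 0.952003729179478 },
  { name: "Sally Smith", score: 0.8865357851100655 },
];

// Copy the array, then sort by score, highest first.
const ranked = [...scores].sort((a, b) => b.score - a.score);

ranked.forEach((r) => console.log(r.score.toFixed(3) + "  " + r.name));
// 0.952  John R. Smith
// 0.947  John L Smith
// 0.929  John Lawrence Smith
// 0.918  John Larry Smith
// 0.887  Sally Smith
```

The gap between the top result and the bottom one is small in absolute terms, so for search it is the relative order that matters more than any single score.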

Using this technique, you can create magic in your apps while posturing yourselves as purveyors of AI.

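  // define the dot product: multiply the arrays element by element, then sum
  // (the initial value of 0 keeps reduce from throwing on an empty array)
  let dot = (a, b) => a.map((x, i) => x * b[i]).reduce((m, n) => m + n, 0);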

  // get the first data value
  let data1  = "John Smith";
  let data1E = JSON.parse(getEmbedding_(data1)).data[0].embedding;

  // test the second data value
  let data2  = "John L Smith";
  let data2E = JSON.parse(getEmbedding_(data2)).data[0].embedding;
  Logger.log(data2 + ": " + dot(data1E, data2E)); // 0.9470840588116363

  // test the third data value
  let data3  = "John Larry Smith";
  let data3E = JSON.parse(getEmbedding_(data3)).data[0].embedding;
  Logger.log(data3 + ": " + dot(data1E, data3E)); // 0.9181326228180411

  // test the fourth data value
  let data4  = "John Lawrence Smith";
  let data4E = JSON.parse(getEmbedding_(data4)).data[0].embedding;
  Logger.log(data4 + ": " + dot(data1E, data4E)); // 0.9289881895977837

  // test the fifth data value
  let data5  = "John R. Smith";
  let data5E = JSON.parse(getEmbedding_(data5)).data[0].embedding;
  Logger.log(data5 + ": " + dot(data1E, data5E)); // 0.952003729179478

  // test the sixth data value
  let data6  = "Sally Smith";
  let data6E = JSON.parse(getEmbedding_(data6)).data[0].embedding;
  Logger.log(data6 + ": " + dot(data1E, data6E)); // 0.8865357851100655
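The script calls `getEmbedding_` without showing it. Here is one way it might look, offered as a guess rather than the actual helper. It assumes Google Apps Script (`UrlFetchApp`, `PropertiesService`), OpenAI's embeddings endpoint, the `text-embedding-ada-002` model, and an API key kept in a script property I have named `OPENAI_API_KEY`. It returns the raw response text because the caller does its own `JSON.parse(...).data[0].embedding`.

```javascript
// Hypothetical sketch of the helper used above; not the original author's code.
// Returns the raw JSON text of the embeddings response for one input string.
function getEmbedding_(text) {
  // Assumption: the key is stored as a script property named OPENAI_API_KEY.
  const apiKey = PropertiesService.getScriptProperties().getProperty("OPENAI_API_KEY");

  const options = {
    method: "post",
    contentType: "application/json",
    headers: { Authorization: "Bearer " + apiKey },
    // Assumption: the model name; substitute whichever embedding model you use.
    payload: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  };

  // The caller parses this text and reads data[0].embedding.
  const response = UrlFetchApp.fetch("https://api.openai.com/v1/embeddings", options);
  return response.getContentText();
}
```

Since every comparison in the script calls the API again, caching each vector alongside its record would avoid paying for the same embedding twice.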

For people who aren’t familiar with Bill’s use of this phrase …

@bfrench That doesn’t look like an Airtable script. Is it a Google Apps Script or something else?

V8 (Modern Ecma)

Lol. Not just a simple “yes”, but an actual link to the details. Bill, I don’t have the time to learn a fraction of the neat stuff you publish. But thank you for throwing it out there. The idea of embeddings is getting higher up on my list of things to learn.

In about four months you’ll run into a tough problem and you’ll realize GPT embeddings hold the answer.

You’re welcome!

Holy cow, this is scary … “OpenAI’s embeddings allow companies to more easily find and tag customer call transcripts with feature requests.”

I’ll be running out of smart ideas on how to talk to my insurance company. I need a personal AI assistant. It’s gonna be AI against AI; I wonder who is gonna win.

Or possibly = AI + AI = happier humans.

I wouldn’t be so quick to think of AI as a battle. Sure, there will be areas where conflicting interests will collide with AI actors. But it’s like texting - when it became mainstream, humans feared it would lead to nothing good. The reality is texting saves time and even lives. AI will compress time and vastly increase collaboration through inference automation.

Automation, today, requires we lay out every step and every instruction. LLMs will soon be creating automated processes that exchange data without much effort from humans or technical mapping of data or logic. We’re probably weeks away from using Zapier without ever opening Zapier and configuring an integration recipe. Solutions like this will vastly overshadow any conflicts that AI bots create.

Maybe, although one should clarify that happiness comes from things other than the number of characters we send or how fast we can send our instructions around the globe. Being able to control outcomes in life is perhaps a better indicator, and that’s where AI could help specifically, but only if it is designed around that. I can change my number if I receive too many fake SMS messages, or stick with letter mail (for now at least), but once AI becomes interwoven with our lives, knowing what is real and what is not may get out of hand. That doesn’t mean people will go live in a forest cabin, and doing things with better tools doesn’t make a better human. But interacting with AI might: for example, it may provide better introspection into human character and help us understand what is not making us happy. Otherwise the expectations of AI may be misplaced when it comes to happiness or happyness.

We were already in a shit-storm of fakery before LLMs arrived on the scene. AI could make it worse, but I believe it will do the opposite. AI will be able to tell us what is probably fake and what is probably real. AI stands to be the one entity capable of knowing because it is made of AI itself. :wink:

I suspect this will happen. We’ve been using AI in many systems for about a decade. Autocorrect is based on word vectors and for the most part it makes us happy. There are times we hate it. But for all you mobile-first typists, would you get off that 72nd tweet in one day without autocorrect? Probably not.

So, let’s take a simple real-world example that involves only some basic AI; completely within reach and no need for a long proof-of-concept.

  • Every day, two workers examine about a thousand customer emails received at help@itoldusoandso.com to determine how they should be handled. This is a large construction firm and they have big clients with lots of questions.
  • The work represents about 14 person-hours, and by the end of their shifts, both workers are exhausted from reading and classifying.
  • The data shows that these workers grow increasingly weary and make more errors as the day goes on.
  • They also complain that there’s not enough time to complete the shift reports that help management know when there are spikes in questions or chronic issues.
  • Their manager is contemplating hiring another person to assist them.
  • A really smart gal from IT is there fixing an Airtable automation and overhears their conversation. Oddly, her name rhymes with @Kuovonne (let that sink in).
  • The IT worker says - “Why don’t you use GPT to classify the messages automatically; it might free up a little time.”
  • The manager leans into this solution and they create a process that classifies automatically and then the workers review and correct any misclassification.
    This has worked nicely because they’re less stressed, and they work faster, so they’re able to actually get the breaks they need to remain accurate.
  • @Toowhan checks in and is really happy she can have a relaxing cup of coffee with her pals now that they’re not so stressed.
  • While chatting, @Toowhan suggests, they should accelerate their AI adoption - this time, use it to build the end-of-day summary for the executive team.
  • They hammer out the requirements for the new report and a few days later, the first executive summary automatically generates and prints a narrative of the most prevalent questions, and includes the most inquiries from each client and various patterns in the conversations.
  • The executive team is thrilled - it has lots of insights about their daily encounters with clients.
  • The manager is delighted - two very happy workers who now like to come to work and the executive team is no longer asking for reports.

These bullet points are nice, but we can do better with AI and we’re all a little “happier” because it reads much more naturally. :wink:

At the construction firm “IToldUsOandSo.com,” two workers named John and Maria spend their entire day reading and classifying about a thousand customer emails received at help@itoldusoandso.com. They work for a large construction firm with big clients who often have questions. After about 14 hours of effort, both workers are exhausted from reading and classifying. The more tired they get, the more errors they make, which leads to an increase in customer complaints. In addition, they don’t have enough time to complete the shift reports that help management know when there are spikes in questions or chronic issues.

Their manager is aware of their difficulties and is considering hiring another person to assist them. However, a really smart IT worker whose name rhymes with Kuovonne overhears their conversation. She is there to fix an Airtable automation and suggests using GPT to classify the messages automatically, freeing up some time for the workers. The manager leans into this solution, and they create a process that classifies automatically, and then the workers review and correct any misclassification.

This solution works really well. The workers are less stressed and can work faster, so they can take the breaks they need to remain accurate. Toowhan checks in and is delighted that the workers can now have a relaxing cup of coffee with her pals since they are no longer so stressed.

While chatting, @Toowhan suggests they should accelerate their AI adoption by using it to build the end-of-day summary for the executive team. They hammer out the requirements for the new report, and a few days later, the first executive summary automatically generates and prints a narrative of the most prevalent questions, including the most inquiries from each client and various patterns in the conversations.

The executive team is thrilled with the insights they receive from the summary, which provides them with a daily overview of client interactions. The manager is also delighted because the two workers are now happy, which makes them want to come to work. In addition, the executive team is no longer requesting reports, which makes things easier for everyone. All in all, AI systems have made the workers happier, and the firm more productive.

This is just the beginning. The question is how AI will bring productivity to a team, not if. Improved productivity, in almost every case, can be measured in happiness by at least some workers who are impacted by it. Alas, there is someone in this story who is unhappy: the nameless and faceless person who was not hired to help John and Maria. Perhaps it’s John’s sister who separately needed that new opening. That would suck for John and his sis Jenny.

Feel free to say - I Told You So! AI is bad! But be prepared for chapter two - it has some surprises you probably do not see coming.

@bfrench I always enjoy reading your posts: thorough, thoughtful and precise.

Thanks for the fun read. I agree with most of your points.

And what would her actual name be? Few people know how to pronounce my name, so how on earth are they supposed to figure out names that rhyme with mine?

Oh, wait, I’m overthinking it. You’re just tagging me.

Read the story. Toowhan.

Are you saying that Toowhan rhymes with Kuovonne? Makes me wonder how you pronounce Toowhan. Or maybe it is a slant rhyme. (Sorry, I’ve been helping my daughter with her homework analyzing romantic poetry.)