Searching for a duplicate in a database

We’d like a streamlined process for completing a duplicate record check on our database, accounting for any possible data inaccuracies. I know we have a Dedupe extension in Airtable and there is an option to look for duplicates in “Exact, Similar, or Fuzzy” method. Is there a way to accomplish this type of looking for a duplicate through an automation/formula/regex/script? (script as the last resort)

We’re looking into creating a simple form and the framework searches through a database to find that name or very close names. The system should leverage fuzzy logic to account for errors in spelling or other data inaccuracies.

If you want fuzzy logic, you are going to need a script or some other code-based system.

Formula fields can help with issues like changes between upper/lower case and leading/trailing spaces. Formula fields won’t let you do other fuzzy logic like small typos.

Formula fields also cannot identify duplicates on their own. They would work in conjunction with an automation and/or a script.

You can automate the system. A very simple system is detecting duplicate email addresses for records created via a form. Have a formula field that converts the email address to all lower case and removes leading/trailing spaces. Have a same-table link for holding potential duplicates. When the record is created, trigger the automation. The automation finds records where the formula version of the email matches. Then the automation updates the same-table link with the found records and sends an email to a human. Finally the human resolves the duplicates.

This is a very, very simple no-code method. Once you add code, it can get more robust and include fuzzier logic and actual consolidation of duplicates.

2 Likes

We’re probably weeks away from some clever GPT-4 approach to ending the deduping madness that exists in all database systems worldwide.

Using this basic approach, I’d be willing to bet a nice sub sandwich that spotting records that are very close to duplicates is possible with very elegant code that doesn’t need to know anything about the fields or their values.

Fuzzy logic is another way of saying “inference”. :wink:

The beauty of embedding vectors is they represent a search index for things of likeness.

1 Like

@jhi - I wrote a script for fuzzy matching using the Levenshtein distance algorithm. You can find it on the official Airtable community here:

About half way down the thread you’ll see the JS version of this that can be incorporated into an Airtable script. I’ve tried using this in an Airtable automation, but, generally, it does not work. Not because it won’t work technically, but, if you have a large number of records to compare the automation will run out of time (30 seconds is the maximum execution time for a single script)

4 Likes

Thanks @Kuovonne! It really looks like we’re going to be using scripts here.

@bfrench I’ll be keeping an eye on a possible GPT-4 approach :eyes:

@JonathanBowen thank you so much, that is exactly what I need. I tried running the script and it works well though I haven’t tried it using thousands of records yet.

I see that the script updates all the records in Match table, is there any way to have it trigger on just one record only? We’ll be using the automation script instead of the script extension and are okay with having a default match ratio for the search. Use case is that a name is gonna be submitted through a simple form and that’s gonna trigger the automation script to look for any similar names on the original table and link that to the match table. I know it’s probably a matter of tweaking the script but it’s not right up to my alley yet :see_no_evil:

1 Like

My embedded AI suggests a change to your statement.

I’ll be keeping an eye on a probable GPT-4 approach

:wink:

Hi @jhi - yes you could run this against a single record. You would need to modify the script to accept a record as user input - see here:

You can also run it on an automation, but whenever I’ve done this I come up against the time that a script can run within an automation (currently limited to 30 seconds). Whether or not you hit this limitation will depend upon the number of records you are comparing, but, from memory, even comparing 1 input record with a few thousand reference records hits the limit. Def one to try on your data though.

Thanks @JonathanBowen! We’ve decided to go for Scripting extension. I’ve added two variations of the script triggered via buttons, one where they can find an exact match (with a default match ratio so they don’t have to input them every time) and one where they can find fuzzy matches. The script has been very helpful, thank you so much again!

2 Likes