diff --git a/_posts/2023-10-23-llm-powered-data-classification.md b/_posts/2024-07-15-llm-powered-data-classification.md
similarity index 95%
rename from _posts/2023-10-23-llm-powered-data-classification.md
rename to _posts/2024-07-15-llm-powered-data-classification.md
index afefe95c..dc3db5e2 100644
--- a/_posts/2023-10-23-llm-powered-data-classification.md
+++ b/_posts/2024-07-15-llm-powered-data-classification.md
@@ -1,8 +1,8 @@
 ---
 layout: post
-id: 2023-10-23-llm-powered-data-classification
+id: 2024-07-15-llm-powered-data-classification
 title: 'LLM-powered data classification for data entities at scale'
-date: 2023-10-23 00:00:10
+date: 2024-07-15 00:00:10
 authors: [hualin-liu,stefan-jaro,harvey-li,jerome-tong,andrew-lam,chamal-sapumohotti,feng-cheng,aaqib-kufran]
 categories: [Engineering, Data Science]
 tags: [Data, Machine Learning, Generative AI]
@@ -11,6 +11,8 @@ cover_photo: /img/llm-powered-data-classification/cover.png
 excerpt: "With the advent of the Large Language Model (LLM), new possibilities dawned for metadata generation and sensitive data identification at Grab. This prompted the inception of our project aimed to integrate LLM classification into our existing data management service. Read to find out how we transformed what used to be a tedious and painstaking process to a highly efficient system and how it has empowered the teams across the organisation."
 ---

+ Editor’s note: This post was originally published in October 2023 and has been updated to reflect Grab’s partnership with the Infocomm Media Development Authority as part of its Privacy Enhancing Technology Sandbox that concluded in March 2024.
+
 ## Introduction

 At Grab, we deal with PetaByte-level data and manage countless data entities ranging from database tables to Kafka message schemas. Understanding the data inside is crucial for us, as it not only streamlines the data access management to safeguard the data of our users, drivers and merchant-partners, but also improves the data discovery process for data analysts and scientists to easily find what they need.
@@ -177,7 +179,6 @@ The predictions are published to the Kafka queue to downstream data platforms.



-
 ### Impact

 Since the new system was rolled out, we have successfully integrated this with Grab’s metadata management platform and production database management platform. Within a month since its rollout, we have scanned more than 20,000 data entities, averaging around 300-400 entities per day.
@@ -202,6 +203,9 @@ To track the performance of the prompt given, we are building analytical pipelin

 We are also planning to scale out this solution to more data platforms to streamline governance-related metadata generation to more teams. The development of downstream applications using our metadata is also on the way. These exciting applications are from various domains such as security, data discovery, etc.

+## Acknowledgements
+
+Grab recently participated in the Singapore government’s regulatory [sandbox](https://www.imda.gov.sg/how-we-can-help/data-innovation/privacy-enhancing-technology-sandboxes), where we successfully demonstrated how LLMs can efficiently and effectively perform data classification, allowing Grab to compound the value of its data for innovative use cases while safeguarding sensitive information such as PII.

 # Join us