Ever since OpenAI rolled out ChatGPT a few months ago, generative AI (GenAI) has been making waves in the news. Enterprises are actively working to understand this technology and to evaluate how to realize real dollars from it. This requires a strategic approach that also ensures it is leveraged responsibly.
The British mathematician Clive Humby is credited with the famous analogy used by CDOs: “data is the new oil”. This analogy is truer now than ever in the context of GenAI. We will see a two-way dependency between data and GenAI: 1) data and its management will be critical to enable GenAI use cases, and 2) GenAI can be used to manage and govern data better.
In this article, we will explore this two-way relationship in more depth.
Data for GenAI
For Enterprises to enable multiple use cases, they need to evolve and advance their data estates to enable multi-modal, unstructured data management & governance. This evolution needs to happen across four areas:
Curating proprietary multimodal data sets – GenAI applications will increasingly rely on unstructured and multimodal data, necessitating effective management of both internal and external data sources to ensure their availability. To accomplish this, the enterprise data team needs to work on:
- Tracing data origin carefully to ensure the reliability of data. This is achieved by prioritizing the early capture of metadata throughout the data lifecycle and maintaining proper traceability to the source. Additionally, ensuring the authenticity of the data for its intended purpose is vital for preserving its integrity.
- Categorizing how data is labeled and described. Metadata should include tags related to biases, personally identifiable information (PII), and regulatory compliance.
- Understanding the sequence of processing steps applied to the data, including during model training. Teams need to be able to reproduce model results and interpret the data used for outputs.
- Ensuring that data usage is ethical and compliant with regulations. Transparent and explainable models promote greater compliance, as does the secure handling of personal information, including anonymization and the removal of unintended personal information from models.
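As an illustration of the points above, the following is a minimal sketch of a metadata record captured at ingestion. It is not a reference design: the class names, field names, and the example URI are all hypothetical, chosen only to show how origin, tags (PII, bias, regulation), and processing lineage could travel together with a data asset.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib


@dataclass
class ProcessingStep:
    """One entry in the lineage of an asset (hypothetical structure)."""
    name: str          # e.g. "transcribe", "deduplicate"
    performed_at: str  # ISO 8601 timestamp in UTC


@dataclass
class AssetMetadata:
    """Metadata captured early and carried through the data lifecycle."""
    source_uri: str       # where the asset originated
    modality: str         # "text", "image", "audio", or "video"
    content_sha256: str   # fingerprint used to trace back to the source
    contains_pii: bool = False
    bias_notes: list[str] = field(default_factory=list)
    regulations: list[str] = field(default_factory=list)
    lineage: list[ProcessingStep] = field(default_factory=list)

    def record_step(self, name: str) -> None:
        # Append a processing step so results can be reproduced later.
        stamp = datetime.now(timezone.utc).isoformat()
        self.lineage.append(ProcessingStep(name, stamp))


def capture_metadata(source_uri: str, modality: str, content: bytes) -> AssetMetadata:
    # Hash the raw content at ingestion, before any transformation.
    digest = hashlib.sha256(content).hexdigest()
    return AssetMetadata(source_uri=source_uri, modality=modality, content_sha256=digest)
```

A record would be created once at ingestion, for example `capture_metadata("s3://example-bucket/call.txt", "text", raw_bytes)`, and each later transformation would call `record_step` so the sequence of processing steps stays attached to the asset.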
Refining data lifecycle management and governance – As the challenge of determining trusted data increases, it’s crucial to incorporate robust audit capabilities. This involves building and defining processes to validate the source of content within unstructured documents and adopting data ingestion methods that account for higher degrees of classification. Processing unstructured data must be enabled across various formats, with an emphasis on increasing the ability to handle noise, which is more prevalent in such data.
Enhanced rigor is needed to identify relevant data for GenAI models, alongside enabling continuous model training to keep generative content relevant. There is a growing need for high-quality labeled training data to fine-tune or retrain models. Clear ownership and responsibility for generated data must be defined, along with establishing guardrails for ethical usage, data leakage prevention, and appropriate sourcing of training data. Data models should be designed to generate responses and retain the context of conversations. Additionally, reference data, such as images, audio, and text, should be accessible to provide context, while controls must be considered for managing external model influences.
Multimodal data processing and integration – To support multimodal data processing, a data platform must evolve based on its current maturity level, potentially requiring new capabilities. As data sources diversify, platforms need to handle a variety of data types (text, images, audio, and video) seamlessly. This involves integrating advanced processing tools that can manage and analyze different modalities in unison, ensuring data consistency and relevance. Depending on the platform’s existing infrastructure, new capabilities may need to be developed to facilitate this processing, enabling more sophisticated insights and decision-making. Continuous enhancement and adaptation are crucial for meeting the demands of multimodal data environments.
New skills and training – To support GenAI initiatives, it’s essential to form dedicated teams and ensure their involvement in governance. Data platform teams must continually enable data product teams to utilize the infrastructure effectively. Also, the data product teams, organized within specific data domains, require guidance from domain and product owners. Teams must be tasked with utilizing unstructured, semi-structured, and curated data to develop AI/LLM models and generate business insights. Centers of Excellence (CoEs) can promote consistency and standardization across the organization, while governance roles uphold accountability for data ethics, quality, and the complete end-to-end data flow.
GenAI for Data
GenAI can be used to support data lifecycle management through several high-value use cases across different stages, from acquisition to consumption.
In the acquisition and ingestion phases, automating data labeling and classification ensures that data is accurately categorized and ready for downstream processes. This is complemented by data cleansing automation, which helps maintain high data quality by identifying and correcting errors early on.
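To make the labeling step concrete, here is a small sketch of how automated classification could be wired in at ingestion. It assumes the GenAI model is reachable through some text-completion function, passed in as `complete`; that function, the label set, and the prompt wording are all hypothetical, and `fake_model` is only a stand-in so the example runs without any external service.

```python
from typing import Callable

# Hypothetical label set; a real one would come from the data catalog.
ALLOWED_LABELS = ("invoice", "contract", "support_ticket", "other")


def build_prompt(text: str) -> str:
    # Constrain the model to a closed set of labels.
    labels = ", ".join(ALLOWED_LABELS)
    return (
        f"Classify the document into exactly one of: {labels}.\n"
        f"Document:\n{text}\nLabel:"
    )


def label_document(text: str, complete: Callable[[str], str]) -> str:
    # Normalize the model's answer and reject anything outside the label set,
    # so a malformed response never reaches downstream processes.
    raw = complete(build_prompt(text)).strip().lower()
    return raw if raw in ALLOWED_LABELS else "other"


def fake_model(prompt: str) -> str:
    # Stand-in for a real GenAI call (assumption, for illustration only).
    return " Invoice\n" if "amount due" in prompt.lower() else "memo"
```

Passing the model in as a plain callable keeps the validation logic independent of any particular vendor API, and the fallback to `"other"` gives a natural place to route documents for human review.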
As data moves into storage and processing, continuing to automate data cleansing is crucial for preserving integrity and reliability. Additionally, automating Master Data Management (MDM) processes ensures consistency and governance across data assets, making it easier to manage large volumes of information. These steps set the foundation for effective model processing, where synthetic data generation can be leveraged to create diverse and representative datasets, while data anonymization techniques protect sensitive information without compromising analytical value.
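The anonymization step can be pictured with a deliberately simple sketch. Note that this swaps in plain regular expressions rather than a GenAI model, and the two patterns (email addresses and one common phone number layout) are assumptions for illustration; they are far from complete PII coverage.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text: str) -> str:
    # Replace each match with a typed placeholder so the sentence
    # structure, and much of its analytical value, is preserved.
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

For example, `mask_pii("Reach Ana at ana@example.com or 555-123-4567.")` returns `"Reach Ana at [EMAIL] or [PHONE]."`. A model-based approach would aim to catch what fixed patterns miss, such as names or addresses written in free text.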
In the content generation phase, these automated processes contribute to producing high-quality outputs that are accurate, reliable, and compliant with data privacy standards.
Finally, during the input and consumption stage, augmented analytics can be employed to provide advanced insights, driving informed decision-making. By integrating these automated solutions across the data lifecycle, organizations can achieve greater efficiency, accuracy, and compliance, ultimately leading to more effective and strategic data utilization.
Conclusion
The interplay between data management and generative AI (GenAI) is transforming both technology and business practices. For enterprises to fully harness the potential of GenAI, they must evolve their data strategies to handle multimodal and unstructured data, ensure ethical and compliant data usage, and invest in advanced data processing capabilities. Conversely, GenAI can significantly enhance data lifecycle management by automating processes such as data labeling, cleansing, and anonymization, thereby improving data quality and facilitating more effective decision-making. As both fields advance, a strategic approach that balances technological innovation with rigorous data governance will be essential for achieving optimal value and driving progress in the digital age.
About the Author
Ruchi Agarwal is a seasoned expert in data analytics and Generative AI, currently a leader in the Global Black Belt team for Cloud Scale Analytics – North America at Microsoft. She has a wealth of experience in delivering cutting-edge analytics solutions to Fortune 100 companies and spearheading large-scale digital transformation projects. With deep knowledge in both the technical and strategic dimensions of data analytics, Ruchi has consistently led cross-functional teams to achieve significant business results. Her expertise in cloud computing, machine learning, and advanced analytics positions her as a key influencer in the evolution of AI-driven technologies. Additionally, Ruchi is a strong advocate for diversity and inclusion in the tech sector, actively mentoring emerging talent and supporting initiatives that promote equal opportunities in the industry.
Opinions expressed in this article are solely the Author’s own and do not express the views or opinions of the Author’s employer.