September 18, 2024
What do AI/ML teams need from their databases?
Data scientists pushing the boundaries of AI/ML technology have a database problem.
While they’re rapidly experimenting to discover, build, maintain, and optimize cutting-edge AI/ML products, their pipeline hits a traffic jam when it comes to managing databases that support constantly evolving AI/ML datasets and structures.
Managing data, handling schema changes, and iterating on models can be a bottleneck for experimentation and releases, or worse. Insufficient change management can be a risk to reliability, security, and compliance, too.
Database change management simply isn’t a user-friendly experience for them. It’s cumbersome, frustrating, and slow – not built for their advanced use case and stifling the pace of progress.
Maximizing the AI/ML pipeline's potential for experimentation and innovation means we must understand what these teams require of their databases and what challenges they face as they update schemas to support advanced elements.
With the plethora of DevOps-style automation opportunities available, they also need guidance on how best to integrate with the upstream and downstream groups who have skin in the game and wisdom to share: DBA, DevOps, and development teams.
To achieve optimal integrity, efficiency, value, and security in these data science pipelines, we have to sincerely listen when we ask what AI/ML teams actually need from their databases.
What do AI/ML teams need from their databases?
Databases built with application development and more traditional data pipelines in mind aren’t quite suited for the demands of AI product pipelines.
AI/ML teams need speed, scalability, and flexibility as they work with massive datasets. With constant experimentation, iteration, reverting, and adjusting, the data scientist's workflow is tightly cyclical and repetitive right up until release. It helps to break that workflow down (a minimal code sketch follows the list):
- Data collection and ingestion, when they gather raw data from multiple sources and import it into the system, often in large volumes and various formats.
- Data cleaning and preparation, in which inconsistencies, duplicates, and irrelevant data are removed, and the data is normalized and transformed for analysis.
- Feature engineering, where meaningful variables are identified and created to improve model performance, often involving dimensionality reduction or new feature creation.
- Model selection and training, during which appropriate machine learning models are chosen and trained on prepared data, requiring multiple iterations and computational power.
- Evaluation and validation, in which models are tested on a separate dataset to fine-tune performance based on metrics like accuracy or precision.
- Deployment, when the final model is integrated into production environments for real-time use, such as prediction systems or applications.
- Monitoring and maintenance, where models are continuously monitored and re-trained as needed to adapt to changing data or requirements.
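To make that cycle concrete, here is a minimal sketch of the early stages in Python, assuming a hypothetical tabular dataset (raw_events.csv, with made-up column names) and scikit-learn. A real pipeline would loop through these steps many times before anything reaches deployment or monitoring.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Data collection and ingestion: load raw data (path and columns are hypothetical)
df = pd.read_csv("raw_events.csv")

# Data cleaning and preparation: drop duplicates and rows missing the label
df = df.drop_duplicates().dropna(subset=["label"])

# Feature engineering: derive a simple ratio feature from two raw columns
df["clicks_per_view"] = df["clicks"] / df["views"].clip(lower=1)

features = df[["clicks", "views", "clicks_per_view"]]
labels = df["label"]

# Model selection and training: hold out a validation split, train a baseline model
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=100).fit(scaler.transform(X_train), y_train)

# Evaluation and validation: score on the held-out split before any deployment
print("validation accuracy:", accuracy_score(y_val, model.predict(scaler.transform(X_val))))
```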
This emphasizes the data-first approach of AI/ML compared to the code-centric workflows in traditional application development pipelines. That means their challenges differ, too.
What is a “Database for AI”?
When you think of a database for AI – a data store that specifically and comprehensively meets the needs of data scientists and the rest of the AI/ML team – it should focus on supporting complex model training and enabling retrieval-augmented generation (RAG).
Supporting model training
Data scientists training AI/ML models receive regular data updates from upstream teams, continuously enhancing their models and insights – ideally. When something goes awry, and performance starts to decline, data scientists need an easy way to look back through their data’s lineage. Finding when and where things went wrong, or were last “right,” enables them to initiate a rollback and quickly recover.
Lineage and versioning get even more critical as AI/ML extends into more regulated and secure industries. Data teams need to show data source provenance during auditing or approval processes to ensure the right data was used in training, both short- and long-term.
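As a rough illustration of the idea (not a substitute for purpose-built lineage tooling), a team could record a content hash and timestamp for every dataset version it trains on, so a known-good rollback target is easy to find later. The file paths and registry format below are purely hypothetical.

```python
import hashlib
import json
import time
from pathlib import Path

def snapshot_dataset(data_path: str, registry_path: str = "lineage.jsonl") -> str:
    """Record a content hash plus metadata for a dataset file so a known-good
    version can be identified later and used as a rollback target."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    record = {
        "dataset": data_path,
        "sha256": digest,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(registry_path, "a") as registry:
        registry.write(json.dumps(record) + "\n")
    return digest

# Usage: snapshot_dataset("training_data_v7.parquet") before each training run
```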
Complicating the matter, the data at hand isn't limited to the standard data types found in traditional databases. A database for AI must handle a vast and diverse amount of data, including millions of large-format files like videos and images. It also needs to accommodate AI-specific elements, such as bounding boxes used in machine vision, and make those elements searchable, too.
Built for NoSQL and RAG
With its nearly endless scope of relevant information, an AI/ML tool needs more than traditional structured data to perform optimally. It needs unstructured data like images, video, text, and more, and it thrives on the diverse, rich datasets that the SQL solutions we're used to simply can't deliver.
MongoDB and Cassandra, among others, are big names in the AI/ML space since they can efficiently store unstructured or semi-structured data without predefined schemas, offering scalability and flexibility that are essential for AI workflows. However, schema-less databases in an already mismanaged change management workflow exacerbate every issue.
Another NoSQL database pervasive in AI/ML is the vector database, thanks to its role in RAG. RAG models perform similarity searches across high-dimensional data, such as retrieving relevant documents or images based on vector embeddings, using methods like cosine similarity. This requires databases that can process large, complex data structures quickly and efficiently.
On top of similarity searches, RAG applications also require traditional filtering, such as by permissions, date ranges, or sources. A database for AI needs to balance both advanced vector search capabilities and basic filtering across large datasets. Structuring data in smaller, more granular pieces enables faster, more precise retrieval, which improves the quality and specificity of RAG model outputs. Managing this process requires robust data management techniques, including ETL and schema management, to optimize data for both traditional and AI-driven queries.
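For illustration, here is a minimal sketch of that combination in Python with NumPy: a cosine-similarity search over vector embeddings, gated by a simple metadata filter. The embeddings, chunk text, and "source" field are made up for the example; a production vector database does this at far larger scale with approximate-nearest-neighbor indexes.

```python
import numpy as np

def retrieve(query_vec, embeddings, metadata, source=None, top_k=3):
    """Rank stored chunks by cosine similarity to the query embedding,
    after applying a simple metadata filter (here, by source)."""
    keep = [i for i, m in enumerate(metadata) if source is None or m["source"] == source]
    candidates = embeddings[keep]

    # Cosine similarity: normalize both sides, then take dot products
    q = query_vec / np.linalg.norm(query_vec)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q

    ranked = np.argsort(scores)[::-1][:top_k]
    return [(metadata[keep[i]], float(scores[i])) for i in ranked]

# Example: three small "chunks" with embeddings and per-chunk metadata
embeddings = np.random.rand(3, 8)
metadata = [
    {"chunk": "refund policy", "source": "docs"},
    {"chunk": "api rate limits", "source": "docs"},
    {"chunk": "q3 revenue", "source": "finance"},
]
print(retrieve(np.random.rand(8), embeddings, metadata, source="docs", top_k=2))
```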
In addition to those mentioned earlier, AI-centric database options include:
- Elasticsearch
- Deep Lake
- Pinecone
- Milvus
- Qdrant
…and more launching nearly every day. Among these choices, some approach the challenges below better than others. All, though, require additional tooling for a complete workflow, positive experience, and reliable scalability.
The unique challenges of AI/ML database workflows
Balancing constant experimentation, massive datasets, endless change, and enduring pressures for reliability and security, AI/ML data science teams understand the importance of optimized change management. Yet, unfamiliar with the tools and processes at the database layer, which may not quite serve their needs anyway, they get stuck in problematic, high-friction experiences every time a schema change arises.
Let’s break out the main challenges causing delays and disruptions to experimentation and releases.
Data handling and quality issues
Managing large-scale datasets for AI/ML projects involves substantial preparation, cleaning, and quality control. Teams need high-quality data in order to avoid bias, improve models, and deliver optimal experiences.
Traditional tech and workflows aren't cut out for the challenges of protecting data integrity at scale. They either break down at speed or never reach that speed and scale in the first place because they're slow, manual, and error-prone. Automatic data cleansing and validation could slot in to accelerate experimentation and keep teams out of remediation mode. Yet in a new frontier of data management, the ideal workflow is still in flux.
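One hedged sketch of what that automation might look like: a small pandas validation step that rejects data missing required columns, strips duplicates, and flags suspiciously sparse columns before anything reaches feature engineering. The column names and thresholds here are illustrative.

```python
import pandas as pd

def clean_and_validate(df: pd.DataFrame, required_columns: list[str]) -> pd.DataFrame:
    """Apply a few automatic checks before data reaches feature engineering."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Remove exact duplicates and rows missing any required value
    cleaned = df.drop_duplicates().dropna(subset=required_columns)

    # Flag columns that are still mostly empty so a human can review them
    sparse = [c for c in cleaned.columns if cleaned[c].isna().mean() > 0.5]
    if sparse:
        print(f"warning: columns more than half empty: {sparse}")
    return cleaned

# Usage: clean_and_validate(raw_df, required_columns=["user_id", "label"])
```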
Tracking and resetting experiments
These pipelines have to serve near-constant experimentation as data scientists and others run their tests and optimize their models, for better or worse. Iteration, testing, refining results, and pushing out the right releases all happen without much tracking or monitoring. The same issue extends all the way to production AI/ML environments, where there is no way to roll back a release that includes a problematic schema change.
That makes resetting to previous versions difficult and risky, unlike in application pipelines, where version control tools like Git make experimenting and resetting a much smoother endeavor.
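Even without a full experiment-tracking platform, the concept can be sketched simply: log every run's parameters, metrics, and dataset version to an append-only file, so the last known-good configuration can be found and restored. The log format and helper names below are hypothetical.

```python
import json
import time
import uuid

def log_run(params: dict, metrics: dict, dataset_version: str,
            log_path: str = "runs.jsonl") -> str:
    """Append one experiment run (parameters, metrics, dataset version) to a local log."""
    run_id = uuid.uuid4().hex[:8]
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "run_id": run_id,
            "timestamp": time.time(),
            "dataset_version": dataset_version,
            "params": params,
            "metrics": metrics,
        }) + "\n")
    return run_id

def best_run(metric: str, log_path: str = "runs.jsonl") -> dict:
    """Scan the log and return the run with the highest value for the given metric."""
    with open(log_path) as f:
        runs = [json.loads(line) for line in f]
    return max(runs, key=lambda r: r["metrics"].get(metric, float("-inf")))

# Usage: log_run({"lr": 0.01}, {"accuracy": 0.91}, dataset_version="sha256:abc123")
```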
Inefficient data streaming
Data scientists often find themselves working with cutting-edge hardware, such as GPUs, for intensive model training, but their database systems frequently lag behind in terms of data streaming. This mismatch leads to resource underutilization, with GPUs left idle as they wait for data to be processed and fed into the pipeline.
This inefficiency slows down both experimentation and the entire development pipeline, reducing opportunities for real-time learning and faster iteration. To keep up with AI hardware’s speed, data streaming solutions must be optimized, ensuring that data is delivered as fast as it can be processed. Without this, AI teams are left waiting for their pipelines to catch up, wasting both time and resources.
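A common mitigation is to prefetch batches on a background thread so the accelerator always has work queued. The sketch below is a generic Python illustration of that pattern, not any specific framework's data loader; the batch reader and training step it references are assumed.

```python
import queue
import threading

def prefetching_batches(batch_source, buffer_size=4):
    """Load batches on a background thread so the accelerator isn't idle
    while the next batch is read and prepared."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in batch_source:
            buffer.put(batch)
        buffer.put(sentinel)  # signal that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is sentinel:
            break
        yield batch

# Usage: wrap any iterable of batches, e.g. one that reads files from disk
# for batch in prefetching_batches(read_batches_from_disk()):  # hypothetical reader
#     train_step(batch)                                        # hypothetical train step
```

Keeping the buffer small bounds memory use while still hiding most of the read-and-prepare latency behind the previous training step.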
User-friendly database interfaces
AI/ML teams often find themselves stuck working with databases that weren’t designed with their needs in mind. Many data scientists lack the deep expertise in SQL or schema management that traditional database administration requires.
What they need instead are systems that allow them to focus on experimentation and model optimization without having to grapple with complex database interfaces. Otherwise, data scientists must frequently turn to database administrators for even routine tasks, introducing delays and disrupting their workflows.
User-friendly, self-service tools that let them manage data pipelines independently, without waiting for external assistance, speed up workflows and empower teams to experiment more freely.
Solving the growing pains of AI/ML data pipelines
Just like in software development, AI models benefit from robust version control for datasets, allowing teams to experiment freely without the fear of losing past iterations. Easier rollbacks enable teams to revert to previous dataset versions or schemas quickly and efficiently, supporting smoother experimentation and deployment in AI databases.
Database DevOps automation can play a transformative role in AI/ML database workflows by streamlining complex processes like schema management and version control. By integrating and automating schema updates, reducing human errors, speeding up iterations, and ensuring consistency across environments – no matter how many or how frequently experiments run – the workflow meets the fast-paced experimentation needs of AI/ML and data science teams.
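Conceptually, that automation tracks which schema changes have already been applied and pairs each one with a way to undo it. The sketch below illustrates the idea in plain Python against SQLite; it is a toy version of what dedicated database change management tools do, not their actual implementation, and the table names are invented for the example.

```python
import sqlite3

# Each change pairs a forward statement with the statement that undoes it,
# so any applied change can be rolled back in order (newest first).
CHANGES = [
    ("001_add_features_table",
     "CREATE TABLE features (id INTEGER PRIMARY KEY, name TEXT, value REAL)",
     "DROP TABLE features"),
    ("002_add_experiments_table",
     "CREATE TABLE experiments (run_id TEXT PRIMARY KEY, dataset_version TEXT)",
     "DROP TABLE experiments"),
]

def migrate(conn: sqlite3.Connection) -> None:
    """Apply any change not yet recorded in the schema log."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_log (change_id TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT change_id FROM schema_log")}
    for change_id, forward, _ in CHANGES:
        if change_id not in applied:
            conn.execute(forward)
            conn.execute("INSERT INTO schema_log VALUES (?)", (change_id,))
    conn.commit()

def rollback_last(conn: sqlite3.Connection) -> None:
    """Undo the most recently applied change and remove it from the log."""
    applied = [row[0] for row in conn.execute(
        "SELECT change_id FROM schema_log ORDER BY change_id")]
    if applied:
        last = applied[-1]
        reverse = next(rev for cid, _, rev in CHANGES if cid == last)
        conn.execute(reverse)
        conn.execute("DELETE FROM schema_log WHERE change_id = ?", (last,))
        conn.commit()

# Usage: conn = sqlite3.connect("experiments.db"); migrate(conn); rollback_last(conn)
```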
This approach simplifies database change management in the modern data pipeline while allowing data scientists to focus more on innovation and experimentation.