How Databricks is adding generative AI capabilities to its Delta Lake lakehouse

The Delta Lake updates aim at helping data professionals create generative AI capabilities for their enterprise with foundation models from MosaicML and Hugging Face, among others.


It's been a busy few weeks for Databricks. After releasing a new iteration of its data lakehouse with a universal table format and introducing Lakehouse Apps, the company on Wednesday announced new tools aimed at helping data professionals develop generative AI capabilities. 

The new capabilities — which include a proprietary enterprise knowledge engine dubbed LakehouseIQ, a new vector search capability, a low-code large language model (LLM) tuning tool named AutoML, and open source foundation models — are being added to the company's Delta Lake lakehouse.

The new capabilities draw on technology from the company's recent acquisitions — MosaicML this week, and Okera in May.

LakehouseIQ to open up enterprise search via NLP

The new LakehouseIQ engine is meant to help enterprise users search for data and insights in its Delta Lake, without the need to seek technical help from data professionals. To simplify data search for nontechnical users, the LakehouseIQ engine uses natural language processing (NLP).

In order to enable NLP-based enterprise searches, LakehouseIQ uses generative AI to understand jargon, data usage patterns, and concepts like organizational structure.  

It's a different approach from the common practice of creating knowledge graphs, a method used by companies including Glean and Salesforce. A knowledge graph is a representation of structured and unstructured data in the form of nodes and edges, where nodes represent entities (such as people, places, or concepts) and edges represent relationships between those entities.
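The node-and-edge structure described above can be sketched in a few lines. Everything here — the entities, the relation names, and the `related` helper — is illustrative, and not how Glean or Salesforce actually build their graphs:

```python
# A toy knowledge graph: nodes are entities, and each edge is a
# (source, relation, target) triple linking two of them.
nodes = {"alice", "data_team", "sales_dashboard"}

edges = [
    ("alice", "member_of", "data_team"),
    ("data_team", "owns", "sales_dashboard"),
]

def related(entity, relation):
    """Return every target linked to `entity` by `relation`."""
    return [t for s, r, t in edges if s == entity and r == relation]

print(related("alice", "member_of"))  # → ['data_team']
```

Answering a question like "what does Alice's team own?" then becomes a walk along the edges, which is why the graph approach is popular for enterprise search.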

In contrast, the LakehouseIQ engine, according to SanjMo principal analyst Sanjeev Mohan, consists of machine learning models that infer the context of the data sources and make them available for searching via natural language queries.

Enterprise users will be able to access the search capabilities of LakehouseIQ via Notebooks and the Assistant in its SQL editor, the company said. The Assistant will be able to carry out various tasks, such as writing queries and answering data-related questions.

Databricks said that it is adding LakehouseIQ to many management features inside its lakehouse, in order to deliver automated suggestions. These could include informing the user about an incomplete data set, or suggestions for debugging jobs and SQL queries.

Additionally, the company is exposing LakehouseIQ’s API, to help enterprises use its abilities in any custom applications they develop, said Joel Minnick, vice president of Marketing at Databricks.

The LakehouseIQ-powered Assistant is currently in preview.

Delta Lake gets AI toolbox for developing generative AI use cases

The addition of the Lakehouse AI toolbox to its lakehouse is meant to support the development of enterprise generative AI applications such as the creation of intelligent assistants, Databricks said. The toolbox consists of features including vector search, low-code AutoML, a collection of open source models, MLflow 2.5, and Lakehouse Monitoring.

“With embeddings of files automatically created and managed in Unity Catalog, plus the ability to add query filters for searches, vector search will help developers improve the accuracy of generative AI responses,” Minnick said, adding that the embeddings are kept updated using Databricks’ Model Serving.

Embeddings are vectors, or arrays, that are used to give context to AI models, a process known as grounding. Grounding allows enterprises to avoid having to fully train or fine-tune AI models on the enterprise information corpus.
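As a rough sketch of the grounding pattern described above, the snippet below retrieves the stored document whose embedding is most similar to the query embedding; that document would then be handed to the model as context. The `embed` function is a deliberately crude stand-in based on character counts — a real system would use a trained embedding model, and the sample documents are invented:

```python
import math

def embed(text):
    # Stand-in embedding: a 26-dimensional vector of letter counts.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = ["quarterly sales report", "employee onboarding guide"]
doc_vecs = [embed(d) for d in docs]

query = "sales figures for the quarter"
qv = embed(query)

# Rank stored documents by similarity; the top hit becomes model context.
best = max(range(len(docs)), key=lambda i: cosine(qv, doc_vecs[i]))
print(docs[best])  # → quarterly sales report
```

The point of the sketch is the shape of the pipeline — embed once, store, then search by vector similarity at query time — which is what a managed vector search service automates.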

Lakehouse AI also comes with a low-code interface to help enterprises tune foundational models.

“With AutoML, technically skilled developers and non-technical users have a low code way to fine-tune LLMs using their own enterprise data. The end result is a proprietary model with data input from within their organization, not third-party,” Minnick said, underlining the company’s open source foundation model policy.

As part of Lakehouse AI, Databricks is also providing several foundation models that can be accessed via the Databricks marketplace. Models from Stable Diffusion, Hugging Face and MosaicML, including MPT-7B and Falcon-7B, will be provided, the company said.

The addition of MLflow 2.5 — including new features such as prompt tools and an AI Gateway — is meant to help enterprises manage operations around LLMs.

AI Gateway will enable enterprises to centrally manage credentials for SaaS models or model APIs and provide access-controlled routes for querying. The prompt tools, meanwhile, provide a new no-code interface designed to allow data scientists to compare the output of various models on a set of prompts before deploying them in production via Model Serving.

“Using AI Gateway, developers can easily swap out the backend model at any time to improve cost and quality, and switch across LLM providers,” Minnick said.
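The swap Minnick describes can be illustrated with a minimal routing abstraction: callers always query one named route, and the model behind it can change without touching application code. The `Route` class and the two stand-in providers below are hypothetical, not the actual AI Gateway API:

```python
class Route:
    """A named endpoint whose backing model can be swapped at runtime."""

    def __init__(self, name, backend):
        self.name = name
        self.backend = backend  # a callable: prompt -> completion

    def swap_backend(self, backend):
        # Change the provider; callers keep using the same route.
        self.backend = backend

    def query(self, prompt):
        return self.backend(prompt)

# Two stand-in "providers" with different cost/quality trade-offs.
def cheap_model(prompt):
    return f"[cheap] {prompt[:20]}"

def premium_model(prompt):
    return f"[premium] {prompt[:20]}"

route = Route("completions", cheap_model)
print(route.query("Summarize Q3 sales"))  # served by cheap_model

route.swap_backend(premium_model)         # swap providers in one place
print(route.query("Summarize Q3 sales"))  # now served by premium_model
```

The design benefit is the single indirection layer: credentials, access control, and provider choice live in the route, not scattered across every application that calls an LLM.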

Enterprises will be able to continuously monitor and manage all data and AI assets within the lakehouse with the new Lakehouse Monitoring feature, Databricks said, adding that the feature provides end-to-end visibility into data pipelines.

Databricks already offers an AI governance kit in the form of Unity Catalog.

Do Databricks’ updates leave Snowflake trailing?

The new updates from Databricks, specifically targeting development of generative AI applications in the enterprise, may leave Snowflake trailing, according to Constellation Research principal analyst Doug Henschen.

“Both Databricks and Snowflake want their customers to handle all their workloads on their respective platforms, but in my estimation, Databricks is already ready to help them with building custom ML [machine learning], AI and generative AI models and applications,” Henschen said, adding that Snowflake’s generative AI capabilities, such as the recently announced Snowpark Container Services, are currently in private preview.

Snowflake, according to Amalgam Insights principal analyst Hyoun Park, is just starting to build out language and generative AI capabilities through the NVIDIA NeMO partnership and the Neeva acquisition.

In contrast, most of Databricks’ capabilities are either in general availability or in public preview, analysts said.

Databricks’ new updates may also lead to query performance gains across generative AI use cases, according to Gartner analyst Aaron Rosenbaum, and this may act as a differentiator against rival Snowflake.

“While Snowflake and Databricks have many common customers, running a wide variety of SQL queries cheaply, quickly and simply is a goal for every one of them,” Rosenbaum said.

Copyright © 2023 IDG Communications, Inc.