Today, Databricks announced the acquisition of Lilac, a Boston-based applied research startup that offers tools for data understanding and manipulation. Terms of the deal were not disclosed.
The Ali Ghodsi-led data giant plans to bring Lilac's team and technology into its data intelligence platform, formerly known as the data lakehouse, to give domain users a smoother way to improve the quality of their datasets for developing production-quality large language model (LLM) applications.
The deal is Databricks' latest effort to become a one-stop shop not only for data but also for all things artificial intelligence. Just recently, it also invested an undisclosed amount in Mistral, the generative AI startup that raised the largest seed round in Europe last year and has become a strong player in the gen AI space.
How Lilac will facilitate data exploration
When Databricks acquired MosaicML in a massive deal last year, the company shifted gears toward an AI-driven future in which users would use data stored securely on its platform to build AI applications. Since then, the company has made several developments in the field and even released several open source models to give customers everything they need to build, deploy and maintain high-quality LLM applications for various business use cases.
However, as is widely acknowledged in the industry, data remains critical to all AI efforts, including LLM systems. Teams need to make sure they have high-quality data to train their models and to test real-world performance, covering aspects such as bias and hallucinations. That is the problem Lilac will help Databricks tackle.
Traditionally, teams had to use time-consuming manual methods to explore unstructured data and address gaps in it. Founded in 2023 by ex-Google engineers Daniel Smilkov and Nikhil Thorat, Lilac addresses this challenge with a scalable open source solution that offers an intuitive user interface and AI-driven features to analyze, understand and transform unstructured text data.
According to the company's website, data scientists and AI researchers can do a lot with Lilac when handling unstructured data: clustering documents and assigning them categories, running semantic and keyword searches, detecting personal information or duplicates, and making the necessary edits (with a comparison view) to remove them and refine the data.
"The team behind Lilac specifically built their product to enable analysis of model outputs for bias or toxicity, and data preparation for RAG and fine-tuning or pre-training of LLMs," Databricks' Matei Zaharia, Naveen Rao, Jonathan Frankle, Hanlin Tang and Akhil Gupta wrote in a joint blog post.
They added that Lilac's entire technology stack will come under Databricks' Mosaic AI tooling to give developers a way to better curate datasets for custom AI systems. Although the details of the integration remain undisclosed at this time, it is expected to do the same job: simplify data curation so that teams can more easily assess and monitor the outputs of their LLMs, as well as prepare datasets for RAG, fine-tuning and pre-training.
"We believe that bringing Lilac's real-time, interactive data collection experience to Databricks' enterprise-scale platform will enable businesses to gain much more visibility and control over their unstructured data. This will enable world-class, customizable AI products that serve end users. Joining forces with Databricks will enable to a whole new class of corporate developers for