
Onehouse Automates Vector Embedding for Its Data Lakehouse

The solution's vector embeddings generator relies on Change Data Capture to move data and vectors from sources to targets.
Aug 22nd, 2024 10:30am

Onehouse, which offers a data lakehouse solution with a managed ETL cloud service, has added support for automating pipelines to generate vector embeddings. These capabilities enable users to pipeline Onehouse data to either OpenAI or Voyage AI and store the returned embeddings in the lakehouse.

Although the vendor refers to this functionality as a vector embeddings generator, the embeddings themselves are produced by OpenAI's and Voyage AI's models; Onehouse automates the pipelines that feed content to those models and store the results. In the future, Onehouse plans to expand its selection of resources, and supported models, for creating embeddings.

The data lakehouse’s vector embeddings generator is an alternative to the traditional Retrieval Augmented Generation (RAG) architecture in which users create a prompt, augment it with data contained in a vector database or other sources, and then send it to a language model for a response.

“If you look at this architecture as companies try to go to production, if you were at the AI Summit, you famously heard how 85% of these use cases are not yet in production,” commented Onehouse CEO Vinoth Chandar.

Onehouse’s vector embeddings generator is designed to account for the data management rigors that may otherwise become prohibitive for GenAI initiatives. By supporting open data formats on inexpensive cloud storage, the data lakehouse supplies a number of data management constructs that can scale, and reduce the costs of, working with vector embeddings.

“With vector databases, you see cost grows with the amount of embeddings that you store,” Chandar explained. “Having a lakehouse layer as a scalable storage layer built on open data formats and scalable, ubiquitous cloud storage, is a good option for these embeddings.”

Moreover, the platform’s Change Data Capture (CDC) approach to pipelining data enables users to quickly transmit data where it needs to go: from sources, to the lakehouse, to embedding models and back, and to downstream vector databases.

Pipeline Efficiency

The vector embeddings generator complements, rather than replaces, contemporary reliance on vector databases. It allows organizations to carefully cull which data, out of all their data assets, is utilized for generative model applications.

Onehouse’s ETL service implements pipelines by automatically provisioning the CDC infrastructure required to transmit source data to the lakehouse to support even low-latency use cases. When ingesting source data from PostgreSQL, for example, “as the upstream Postgres database gets updated, the lakehouse table gets updated,” Chandar said. “There is no lag and we auto-scale the compute infrastructure you need to go up and down to deal with volumes.”
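
Conceptually, CDC replication of this kind amounts to applying an ordered stream of change events to a keyed target table. The sketch below is illustrative only; the event shape and function name are assumptions, not Onehouse's API:

```python
def apply_cdc(table, events):
    """Apply ordered change events (op, key, row) to a keyed table.

    'table' is a dict keyed by primary key: inserts and updates upsert
    the row, deletes remove it, so the target mirrors the upstream source.
    """
    for op, key, row in events:
        if op in ("insert", "update"):
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table

events = [
    ("insert", 1, {"id": 1, "review": "great"}),
    ("update", 1, {"id": 1, "review": "great product"}),
    ("insert", 2, {"id": 2, "review": "ok"}),
]
print(apply_cdc({}, events))
```

In a real deployment the events would arrive continuously from PostgreSQL's write-ahead log rather than as an in-memory list, which is what keeps the lakehouse table current without batch lag.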

Some users ingest raw data into Onehouse for its data management capabilities — including indexes to expedite updates between sources and targets — before deciding what data to embed as vectors. Once they have, the automated pipelines transmit the data to OpenAI or Voyage AI for embedding, and the resulting vectors are returned to the lakehouse.

“We make calls to OpenAI’s or Voyage AI’s APIs to generate these embeddings and efficiently save them in another downstream table,” Chandar said. With this method, a retailer collecting customer reviews of products in a transactional database could pipeline data from the source to Onehouse, choose the correct field to generate the embeddings from, quickly embed that data, and have it neatly arranged in a Onehouse table for downstream applications.
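
A minimal sketch of that embedding step, with hypothetical names throughout: batch the chosen text field, pass each batch to an embedding function, and attach the returned vectors to the rows. A stand-in function replaces the real API call here; a production pipeline would call the OpenAI or Voyage AI embeddings endpoint instead:

```python
def batch(rows, size):
    """Yield fixed-size batches so each API call stays within request limits."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def embed_reviews(rows, embed_fn, text_field="review_text", size=2):
    """Embed the chosen field of each row; return rows with a 'vector' column."""
    out = []
    for chunk in batch(rows, size):
        vectors = embed_fn([r[text_field] for r in chunk])
        for row, vec in zip(chunk, vectors):
            out.append({**row, "vector": vec})
    return out

# A real pipeline would pass an API-backed function here, e.g. something like:
#   lambda texts: [d.embedding for d in client.embeddings.create(
#       model="text-embedding-3-small", input=texts).data]
fake_embed = lambda texts: [[float(len(t))] for t in texts]  # stand-in model

rows = [{"id": 1, "review_text": "great"}, {"id": 2, "review_text": "ok"}]
print(embed_reviews(rows, fake_embed))
```

The output rows, with vectors attached, correspond to the downstream table Chandar describes.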

Data Management

Although Onehouse supports omnidirectional interoperability between Delta Lake, Apache Iceberg, and Apache Hudi, its storage capabilities involving Hudi are optimized for its rapid CDC-based ingestion. The data management typifying its data lakehouse experience naturally extends to working with vector embeddings. The platform’s capabilities for indexing and what Chandar described as “database clustering” are designed to improve query performance. Coupled with the aforementioned pipeline automation to vector databases, they make Onehouse a compelling option for specialized, nuanced applications of generative AI.

“So, if a retailer has a free user, a paid user, and a prime user or a non-prime user, then you can choose to send your data to different vector databases by just querying and saying, ‘give me all the embeddings for the free users,’ and send them to this vector database,” Chandar said. “All my premium users will go to a different one. For practical enterprise scenarios, you might need like 10s of vector databases to hold that much data.”

Onehouse provides a handful of indexes, including rule-based filters and file-level statistics that limit which portions of tables are scanned. There’s also “a hash index based on indexing specific records that maintain some mapping, and it works well on random write scenarios,” Chandar added.
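
The tier-based routing Chandar describes can be sketched as grouping embedded rows by a tier field and handing each group to its own sink. Everything here is illustrative; in practice each sink would be an upsert into a separate vector database:

```python
from collections import defaultdict

def route_by_tier(rows, sinks, tier_field="user_tier"):
    """Group rows by tier and hand each group to that tier's sink.

    'sinks' maps a tier name to a callable that ingests a batch of rows,
    standing in for a per-tier vector-database client.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[tier_field]].append(row)
    for tier, batch_rows in groups.items():
        sinks[tier](batch_rows)

# Demo with plain lists standing in for vector databases.
stores = {"free": [], "premium": []}
route_by_tier(
    [{"id": 1, "user_tier": "free"}, {"id": 2, "user_tier": "premium"}],
    {tier: stores[tier].extend for tier in stores},
)
print(stores)
```

In the lakehouse itself, the grouping step would be the SQL query Chandar mentions ("give me all the embeddings for the free users") rather than an in-memory pass.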

Clustering and Space-Filling Curves

The database clustering Onehouse offers shouldn’t be confused with the unsupervised learning variety of clustering. The former is simply a method of sorting data or presenting it in a layout that “is conducive to the use of these indexes on the access side,” Chandar said. For example, users can cluster Onehouse’s data according to types of products, or demographic information like age or city, then query according to the cluster field to improve query performance.
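
A toy model of why this kind of clustering pays off at query time, assuming nothing about Onehouse's internals: sort rows by the cluster field, split them into files, keep per-file min/max statistics, and prune any file that cannot contain the queried value:

```python
def write_clustered(rows, key, file_size):
    """Sort rows by 'key', split into files, record per-file min/max stats."""
    rows = sorted(rows, key=lambda r: r[key])
    files = []
    for i in range(0, len(rows), file_size):
        chunk = rows[i:i + file_size]
        files.append({"min": chunk[0][key], "max": chunk[-1][key], "rows": chunk})
    return files

def query(files, key, value):
    """Scan only the files whose min/max range could contain 'value'."""
    hits = []
    for f in files:
        if f["min"] <= value <= f["max"]:  # stats-based file pruning
            hits += [r for r in f["rows"] if r[key] == value]
    return hits

files = write_clustered(
    [{"city": c} for c in ["Delhi", "Austin", "Cairo", "Boston"]],
    key="city", file_size=2)
print(query(files, "city", "Boston"))
```

Because the data is sorted before being split, each value lands in few files, so the min/max statistics prune most of the table; on unsorted data the same statistics would overlap and prune almost nothing.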

The platform also supports space-filling curves which, conceptually, function as a sort of multidimensional clustering. “Space-filling curves do a great job of having you query by, not just one field, but by multiple ones,” Chandar mentioned. This technique, in addition to database clustering, works well for helping users understand their data in Onehouse. When coupled with the indexing options in the data lakehouse, it returns fast query results to determine which data should be vectorized and which vectors should be routed to vector databases for generative AI use cases.
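
The classic space-filling curve used for this purpose is the Z-order (Morton) curve, which interleaves the bits of several fields into one sort key so that rows close in any of the dimensions tend to land near each other on disk. A minimal two-field sketch (not Onehouse's implementation):

```python
def z_order(x, y, bits=8):
    """Interleave the bits of two integer fields into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x occupies even bit slots
        code |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit slots
    return code

# Sorting by z_order of, say, an age bucket and a city id lays the data out
# so that range filters on either field, or both, can skip most files.
points = sorted(
    [(x, y) for x in range(4) for y in range(4)],
    key=lambda p: z_order(*p))
print(points[:4])  # the 2x2 quadrant nearest the origin sorts first
```

Plain single-field clustering optimizes for one query dimension; interleaving the bits trades a little locality in each dimension for usable locality in all of them, which is why it helps multi-field filters.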

Long-Term Viability

The costs of employing vector databases can rapidly increase, particularly when these AI retrieval systems only provide in-memory methods for maintaining their embeddings and indices. Onehouse’s vector embeddings generator is a viable alternative to storing all vectors in these similarity search engines. It can potentially minimize costs, increase scalability, and heighten efficiency — particularly for enterprise applications of generative AI. The platform’s cheap storage, open data formats, and pipeline automation may prove an integral aspect of the long-term viability of storing and managing vector embeddings.
