Unstructured, a Big Data Business With an AI Emphasis, Raises $40 Million to Prepare Raw Data for LLMs
- Business
- March 15, 2024
Unstructured Technologies Inc., a firm that prepares data for generative artificial intelligence, has announced a $40 million funding round, its second significant raise in less than a year.
Menlo Ventures led today’s Series B round, which included participation from a number of well-known backers, including the venture capital arm of Nvidia Corp., IBM Ventures, Databricks Ventures, and angel investors like Vivek Ranadivé, chairman of the Sacramento Kings, Chet Kapoor, CEO of Datastax Inc., and Allison Pickens of the New Normal Fund.
Existing investors, including Madrona, Bain Capital Ventures, and Mango Capital, also participated in the round, which follows a $25 million raise announced in July 2023. In total, Unstructured has raised over $65 million in funding so far.
Unstructured is gaining a lot of interest because it is at the forefront of transforming unstructured data (such as written notes, photos, audio, and video) into formats that large language models can readily understand. LLMs are the class of AI models that underpin generative AI services like Google LLC’s Gemini and OpenAI’s ChatGPT, and few people will need reminding how popular those services are these days, so it’s an intriguing prospect for many organizations.
The startup observes that over half of businesses worldwide have increased their spending on generative AI technology in the last 12 months, yet they are up against a significant data challenge. While advances in modern data stacks have long since made structured data accessible for advanced analytics, unstructured data, which makes up over 80% of all information held by businesses, remains difficult to leverage. If generative AI can access this data more readily, it is expected to become significantly more capable and to power more potent chatbots and other applications.
Unstructured has made it its mission to take on this obstacle, and it asserts that it is the first and only business capable of ingesting and transforming any kind of unstructured data into a format that LLMs can use right away.
The startup’s platform offers users three entry points: a cloud-hosted application programming interface, containers, and an open-source Python library. The API can process more than 20 natural language file types, turning raw data into LLM-ready files. It also has several enterprise-grade data connectors to services such as Dropbox and Elasticsearch, Azure Blob and OneDrive from Microsoft Corp., S3 from Amazon Web Services Inc., and Cloud Storage and Google Drive from Google LLC.
Founded in 2022 by former U.S. Central Intelligence Agency analyst Brian Raymond, Unstructured developed its technology in partnership with commercial companies, the open-source community, and several U.S. government defense and intelligence agencies. The U.S. Air Force and Space Force have awarded the firm Phase I and Phase II Small Business Innovation Research contracts, and U.S. Special Operations Command is providing further support.
Unstructured launched its platform that same year, and it has since grown into a useful resource for businesses wishing to start building their own LLMs. Its technology lets users automatically convert unstructured data into a format that can be used for retrieval-augmented generation (RAG), fine-tuning, and LLM training. RAG allows pretrained generative AI models to access additional data to enrich their knowledge.
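To make the RAG idea concrete, here is a minimal sketch in Python of the retrieval step, written for illustration only. It is not Unstructured's code or API: every function name is hypothetical, the sample text chunks are invented, and a simple bag-of-words count vector stands in for the embedding model and vector database that a production system would likely use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 1) -> str:
    """Prepend the retrieved chunks to the question, as a RAG system would."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical, already-cleaned text chunks (the "LLM-ready" data).
chunks = [
    "Invoices are due within 30 days of receipt.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]
prompt = build_prompt("When are invoices due?", chunks)
```

The resulting prompt would then be sent to a pretrained model, which answers using the retrieved context rather than only what it learned during training. The quality of that context depends on how cleanly the source documents were converted into chunks in the first place.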
According to Raymond, the company’s CEO, the development of LLMs embedded in RAG systems now lets firms create a new generation of LLMs and analytics products based on unstructured data. “With large foundation models, developers can now interact with all of their data for the first time,” he stated.
Raymond argues that a crucial obstacle to maximizing the potential of LLMs is the inability to ingest and preprocess human-generated data, and that his company will be the one to help businesses overcome it. “2024 will be the year of moving LLM prototypes into production and organizations of all types and sizes are hungry to build out these architectures efficiently and at scale,” he said. “Automating the process of structuring data and seamlessly delivering it into storage is critical for enterprises that want to build solutions on this new tech stack and go to market quickly.”
Andy Thurai, vice president and principal analyst at Constellation Research Inc., told SiliconANGLE that data preparation is often overlooked in AI development because it is far less glamorous than prompt engineering, RAG, and the finished LLMs themselves. However, he stated that because data scientists spend the majority of their time preparing data, this is an area that may greatly benefit from automation.
“Unstructured data can be a real mess, primarily because there are no established standards and it is difficult to find meaning within it,” Thurai said. “While vector databases help with storing unstructured data, getting the data ready to be put into a vector database or data lake is a considerable challenge.”
This difficulty is why Unstructured believes its platform has already established itself as a vital component of the infrastructure for generative AI projects. It converts data into an LLM-ready format and makes it compatible with vector databases, which hold unstructured data as more easily accessible numerical representations. The company asserts that it can help speed up generative AI applications by up to 20% without requiring any modification.
According to the startup, this is the reason why its open-source library has been downloaded over six million times. More than 45,000 enterprises, including more than a third of the Fortune 500, and more than 12,000 code bases use it.
Since launching its paid software-as-a-service API in January, Unstructured has acquired over 1,000 paying clients. The following month saw the release of its enterprise platform, which is billed as the first in the world to continuously extract raw data from existing databases, convert it in near real time into LLM-ready formats, and load it into a vector database.
Given that surveys indicate data scientists dedicate over 75% of their time to data preparation, that automation offers a significant benefit. According to the company, Unstructured is uniquely equipped to keep LLMs current because it offers continuous, real-time access to the most recent unstructured data.
Although there are other data preparation tools available for unstructured information, Thurai notes that they are not widely used and that many businesses still rely heavily on manual labor. Furthermore, he said the process is getting harder because the most sophisticated LLMs require far more data than previous models. “Unstructured does have good traction with its open-source downloads, and the recently announced enterprise version of its platform helps companies more by continuously extracting raw, unstructured data from existing databases, which wasn’t possible before,” Thurai said. “Unstructured’s tools can be very useful for enterprises that need to use raw, unstructured information for RAG workloads, especially given its new ability to provide models with continually updated and current information.”
Unsurprisingly, Menlo Ventures partner Tim Tully went further in praising Unstructured, calling it an “exceptional platform” that can change the way developers create new data pipelines for RAG, AI applications, chatbots, and other uses. He claimed that “it has become the preferred way developers assemble data pipelines and build AI applications.” “Those in the know understand that RAG soon rose to prominence as the industry norm,” he said. “They will quickly see that Unstructured is just the very tip of the RAG spear.”
Unstructured said it will use the money from today’s round to expand its technical and sales teams and accelerate the development of its LLM data preparation tools.