SpreadsheetLLM, an early-stage research project from Microsoft, is described in a recent pre-print. Its goal is to close the gap between sophisticated AI models and spreadsheets, a ubiquitous business tool that poses a challenge for existing AI.
Spreadsheets can hold everything from basic calculations to complex financial models. However, their structured layouts, formulas, and cell references make it difficult for current LLMs to process and analyze the data efficiently.
SpreadsheetLLM, according to Microsoft, uses a dedicated encoding technique that unlocks LLM capabilities on spreadsheets.
One obstacle is the enormous volume of tokens (the units of data LLMs process) a spreadsheet generates. To address this, Microsoft built a framework called SheetCompressor. According to the company, it can compress spreadsheet data by up to 96%, letting LLMs handle even large sheets within their context limits while preserving the structure and relationships in the data.
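To see why compression matters, consider a naive cell-by-cell serialization, roughly the kind of "vanilla" encoding the paper compares against. The sketch below is purely illustrative; the exact output format and the use of openpyxl are assumptions, not Microsoft's code.

```python
# Illustrative only: a naive "one entry per cell" serialization of a worksheet.
# Spelling out every address and value makes prompt length scale with
# rows x columns, which is exactly the waste SheetCompressor targets.
from openpyxl import load_workbook  # assumed helper library for reading .xlsx

def vanilla_encode(path: str, sheet_name: str) -> str:
    ws = load_workbook(path, data_only=True)[sheet_name]
    entries = []
    for row in ws.iter_rows():
        for cell in row:
            # Even empty and repeated cells get their own entry here,
            # so token counts balloon on large or sparse sheets.
            entries.append(f"{cell.coordinate},{cell.value}")
    return "|".join(entries)
```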
The paper claims that SheetCompressor outperforms the vanilla encoding approach on spreadsheet table detection by more than 25%.
SheetCompressor consists of three modules (a simplified sketch of the inverse index step follows the list):
- Structural-anchor-based compression: places "anchors" at structurally important points across the spreadsheet to help the LLM grasp the data layout, then removes superfluous rows and columns, leaving a leaner "skeleton" of the table.
- Inverse index translation: tackles the tokens wasted on empty cells and repeated values. Using a JSON-based representation, it builds a dictionary that keeps only non-empty cells and merges identical text under a single key, cutting token usage without losing data integrity.
- Data-format-aware aggregation: handles adjacent numerical cells that share a structure but differ in format. Because the exact numerical values matter less than the overall data structure, it extracts data types and number formats and clusters neighboring cells with similar characteristics, saving further tokens.
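As a rough illustration of the inverse index idea, the sketch below merges identical non-empty values under a single key and drops empty cells entirely. The input format (a dict mapping cell addresses to values) and the function name are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of the inverse index translation described above,
# not Microsoft's code: identical non-empty values share one key and
# empty cells are omitted, so blanks and repeats no longer cost tokens.
import json
from collections import defaultdict

def inverse_index_encode(cells: dict[str, object]) -> str:
    """Map each distinct value to the list of cell addresses holding it."""
    index: dict[str, list[str]] = defaultdict(list)
    for address, value in cells.items():
        if value in (None, ""):
            continue  # empty cells are simply dropped
        index[str(value)].append(address)
    return json.dumps(index)

# Example: three identical labels and one blank collapse to two JSON entries.
cells = {"A1": "Revenue", "A2": "Revenue", "A3": "Revenue", "B1": None, "B2": 1200}
print(inverse_index_encode(cells))
# {"Revenue": ["A1", "A2", "A3"], "1200": ["B2"]}
```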
SpreadsheetLLM also introduces a "Chain of Spreadsheet" (CoS) framework, which applies the "Chain of Thought" prompting methodology to spreadsheets by breaking reasoning into stages: table detection, matching the question to the relevant region, and reasoning over it. According to Microsoft, this broad approach could significantly change how spreadsheet data is analyzed and managed, opening the door to more effective user interactions.
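Based only on the three stages named above, a CoS-style pipeline might look roughly like the schematic below. `call_llm` and the prompts are hypothetical placeholders, not Microsoft's implementation.

```python
# Schematic sketch of a Chain of Spreadsheet style pipeline: detect tables,
# match the question to a region, then reason over that region.
def call_llm(prompt: str) -> str:
    # Hypothetical helper standing in for any chat-completion API.
    raise NotImplementedError("wire this to your LLM provider")

def chain_of_spreadsheet(compressed_sheet: str, question: str) -> str:
    # Stage 1: detect which table(s) the compressed sheet contains.
    tables = call_llm(
        "Identify the table boundaries in this compressed spreadsheet:\n"
        + compressed_sheet
    )
    # Stage 2: match the question to the relevant table region.
    region = call_llm(
        f"Question: {question}\nWhich of these tables is needed to answer it?\n{tables}"
    )
    # Stage 3: reason over the selected region to produce the final answer.
    return call_llm(f"Using only this region:\n{region}\nAnswer: {question}")
```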
SpreadsheetLLM performed well on question answering over spreadsheet data, although high compression ratios and long contexts reduced accuracy. In tests, it outperformed existing methods on spreadsheet table detection, "the foundational task of spreadsheet understanding," by 12.3%.
The approach also markedly improved spreadsheet understanding in well-known LLMs such as GPT-3.5 and GPT-4, with GPT-4 reaching a table detection score of about 79% when paired with SpreadsheetLLM.
Despite its promise, SpreadsheetLLM has limits. Spreadsheets with complex formatting, such as borders and background colors, can still confound the model because they drive token usage back up.
SheetCompressor also currently struggles with cells containing natural-language text.
For now, SpreadsheetLLM remains a research project; Microsoft has not released it publicly. And while the probabilistic nature of generative AI will not always suit the precision expected of spreadsheet data, the work opens up some intriguing possibilities.
Meanwhile, a growing number of LLM makers have been announcing models that improve on their predecessors.
Llama 3, the most recent entry in Meta's LLM series, was released in April, and the company claims notable performance gains over Llama 2. It launched in two sizes: Llama 3 8B, with 8 billion parameters, and Llama 3 70B, with 70 billion.
Meta is also training still larger models with more than 400 billion parameters. Future versions are expected to handle more than just text, support multiple languages, and offer stronger coding and reasoning capabilities.