While Apple may have entered the large language model (LLM) arena later than some competitors, recent moves showcase the company's innovative strides. Apple's researchers unveiled a groundbreaking method aimed at significantly reducing the memory required to run LLMs. Their approach uses inventive storage and memory-management techniques that dynamically transfer the model's weights between flash memory and DRAM while maintaining impressively low latency.
This breakthrough could allow future iPhones and Macs to harness the power of LLMs without overwhelming the device's memory, a significant advantage for Apple, which has no hyperscaler business and therefore has a strong stake in on-device AI.
The memory demands of LLMs have long posed a challenge: a 7-billion-parameter model stored at 16-bit precision requires over 14 gigabytes of RAM for its weights alone, surpassing the capacity of many edge devices. Rather than relying on typical quantization methods to shrink the model, Apple's approach addresses the problem of deploying full models on hardware with limited memory.
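For context, the 14-gigabyte figure follows directly from the per-weight storage cost: at 16-bit (half-precision) floating point, each parameter occupies two bytes. The short back-of-the-envelope calculation below illustrates the arithmetic; the variable names are just for illustration and are not from the paper.

```python
# Back-of-the-envelope memory estimate for a 7B-parameter model
# whose weights are stored in 16-bit (half-precision) floating point.
num_parameters = 7_000_000_000      # 7 billion weights
bytes_per_parameter = 2             # fp16 / bf16 uses 2 bytes per weight

total_bytes = num_parameters * bytes_per_parameter
total_gb = total_bytes / 1e9        # decimal gigabytes

print(f"Approximate weight storage: {total_gb:.1f} GB")  # ~14.0 GB
```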
The strategy involves storing the LLM on flash memory and incrementally loading its weights into RAM for inference. Flash memory is far more plentiful than DRAM on consumer devices, but it is also slower, and naive inference approaches built on it can be sluggish and energy-intensive. In their paper, Apple's researchers introduce a suite of optimization techniques to streamline the process, loading data from flash into DRAM quickly while keeping memory consumption low.
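As a minimal sketch of the general idea only (the paper's actual techniques for deciding which weights to load go well beyond this), one can keep the weight file on flash storage and memory-map it, so that only the slices needed for the current layer are copied into DRAM when they are read, instead of loading the whole model up front. The file name and model dimensions below are hypothetical and kept tiny so the example runs anywhere.

```python
import numpy as np

# Toy model dimensions (hypothetical, far smaller than a real LLM).
NUM_LAYERS, ROWS, COLS = 4, 64, 32

# Simulate a flash-resident weight file; in practice this already exists.
flash_file = "toy_weights.bin"
np.random.rand(NUM_LAYERS, ROWS, COLS).astype(np.float16).tofile(flash_file)

# Memory-map the file: this does not pull the weights into DRAM yet.
weights = np.memmap(flash_file, dtype=np.float16, mode="r",
                    shape=(NUM_LAYERS, ROWS, COLS))

def load_rows(layer: int, row_indices: np.ndarray) -> np.ndarray:
    """Copy only the requested rows of one layer from flash into DRAM."""
    return np.array(weights[layer, row_indices, :])  # explicit DRAM copy

# During inference, fetch just the rows the current layer actually needs.
needed = np.array([0, 3, 17])
dram_block = load_rows(2, needed)
print(dram_block.shape)  # (3, 32)
```

The design point this illustrates is that DRAM usage scales with the rows actually requested per layer rather than with the full model size, which is why the slower flash medium becomes workable at all.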
Testing their techniques on different models and hardware setups, Apple's research team achieved remarkable results. On an Apple M1 Max, for instance, per-token latency dropped from 2.1 seconds with a naive loading approach to approximately 200 milliseconds. The team demonstrated the ability to run LLMs twice the size of the available DRAM, with inference speeds 4-5x faster on CPU and 20-25x faster on GPU than traditional loading methods.
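To make the "seconds per token" unit concrete (this is not a reproduction of the paper's benchmark), per-token latency is typically measured by timing a generation loop and dividing by the number of tokens produced; the paper's speedups compare this figure under naive loading versus the optimized scheme. The sketch below uses a hypothetical stand-in for the token generator.

```python
import time

def measure_per_token_latency(generate_next_token, num_tokens: int = 32) -> float:
    """Time a generation loop and return the average seconds per token."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        generate_next_token()     # stand-in for producing one token
    elapsed = time.perf_counter() - start
    return elapsed / num_tokens

# Example with a dummy "model" that simply sleeps briefly per token.
latency = measure_per_token_latency(lambda: time.sleep(0.01))
print(f"~{latency * 1000:.0f} ms per token")
```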
The researchers assert that their findings hold significant implications for future work, emphasizing the need to consider hardware characteristics when developing inference-optimized algorithms. The paper underscores how Apple pairs machine-learning insight with deep knowledge of hardware and memory design, offering a glimpse of the practical, applicable research the company may contribute going forward. These innovations are expected to find their way into upcoming Apple products, bringing advanced AI capabilities to consumer devices.