Meta MultiRay has developed a platform that allows cost-effective running of state-of-the-art machine learning models. MultiRay allows running models on the same input to share most of the running cost with a small addiction cost per model.
Current AI models are very large as they can handle text, images and other modalities. The best results are obtained by training a massive model with a massive amount of data and then specializing the model on a particular task (ie armful speech). This process is very expensive. The best solution is to calculate the expensive part of the training once (embedding) and then reuse this training, with another dedicated training step, for the specific task.
This is the goal of MutliRay, the platform developed by Meta. MutliRay’s universal models are trained to perform well across a wide set of tasks and domains. This way, the ML teams in Meta can quickly iterate on tons of models across different functions, increasing efficiency instead of each team developing all the models individually.
MultiRay uses large point-return foundation models, called embeddings, in a high-dimensional vector space that represent an ML-friendly version of the original input. This way, the model can consume this input instead of handling the real input which is much simpler. The base models used in MultiRay are optimized to work on different tasks.
Another goal of MultiRay is to centralize the implementation of GPUs and cache to save time and money, democratizing access to large underlying models. Centralization as amortization has certain advantages over many teams; All the teams use the expensive specialized high-end hardware like GPUs, and the cost is shared between them. Simple development and operations: instead of each team managing the models, infrastructure, and maintenance of the model, centralization allows optimization techniques to be applied across different groups. Faster research to production, MultiRay is the sandbox of ML models and serving 125 clients among Meta, it supports up to 20 million queries per second (QPS) while serving 800 billion queries per day , this allows specialist systems to contribute to key optimization.
Two other major advantages of MultiRay are the efficiency of accelerators and cache management. Accelerator hardware works well when it processes an aggregate group of requests in parallel (batch). The batch can change if the hardware changes so MultiRay can manage the requests to create the optimized batches based on the available hardware. Cache management is an integral part of MultiRay as it uses cache to optimize recalculation costs. MultiRay models are large and the cache is finite so the results cannot be cached for a long time. MultiRay has an optimized algorithm that analyzes request patterns across clients and informs the cache size, time to live, and update policy for the best cache management strategy and cost save more recalculation.
A centralized service also presents challenges, such as quotas, client management, and cost allocation. These problems are considered solved for large-scale systems but had to be adapted to the AI domain. MultiRay models work well if used extensively, requiring a state-of-the-art model across multiple use cases, which entails a lot of work in versioning and refreshing models and investments in new training architecture and training flows to reduce research to production. time.