Is MLX Really Faster Than Ollama? A Performance Benchmark on Apple Silicon
Benchmarking Speed: MLX vs. Ollama on Apple Silicon
Current MLX performance is reasonably close to llama.cpp for LLM inference: prompt processing is approximately 15% slower and token generation about 25% slower, with comparable RAM usage. It lags further behind in quantized token generation, running at roughly half the speed you would expect based on llama.cpp's behavior. Additionally, MLX's model loading time is significantly slower.
For specific performance metrics, consider the following (a rough reproduction sketch follows the list):
- Llama-2-7B fp16:
  - llama.cpp: model load ~2.8 s, prompt processing ~772 token/s, token generation ~23 token/s, ~16 GB RAM used (pure model: 12.55 GB)
  - MLX: model load ~4 s, prompt processing ~652 token/s, token generation ~19 token/s, ~16 GB RAM used
- Llama-2-7B 4-bit quantized:
  - llama.cpp: model load ~0.9 s, prompt processing ~685 token/s, token generation ~61 token/s, ~6 GB RAM used (pure model: 3.56 GB)
  - MLX: model load ~1.7 s, prompt processing ~438 token/s, token generation ~31 token/s, ~6 GB RAM used
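To get a feel for where figures like these come from, here is a minimal timing sketch using the mlx_lm Python package. It is not the harness behind the numbers above: the model identifier is a placeholder, the mlx_lm API has shifted between versions, and results will vary with hardware and quantization.

```python
# Minimal timing sketch (assumes `pip install mlx-lm` on an Apple Silicon Mac).
# The model id is a placeholder; substitute any MLX-converted model you have.
import time

from mlx_lm import load, generate

MODEL_ID = "mlx-community/Llama-2-7b-chat-4bit"  # placeholder repo id
PROMPT = "Explain the difference between prompt processing and token generation."

start = time.perf_counter()
model, tokenizer = load(MODEL_ID)            # measures model-load time
load_seconds = time.perf_counter() - start

start = time.perf_counter()
# verbose=True makes mlx_lm print its own prompt/generation tokens-per-second
generate(model, tokenizer, prompt=PROMPT, max_tokens=256, verbose=True)
generate_seconds = time.perf_counter() - start

print(f"model load: {load_seconds:.2f}s, end-to-end generation: {generate_seconds:.2f}s")
```

On the llama.cpp side, the project ships its own llama-bench tool for comparable measurements on GGUF models.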
Furthermore, data collected from various Apple Silicon devices illustrates the performance differences among M1, M2, and M3 chips (PP = prompt-processing throughput):
- The M1 Max achieves a maximum of 599.53 t/s for F16 PP and 400.26 t/s for Q4_0 PP.
- The M2 Ultra significantly increases these numbers with 1128.59 t/s for F16 PP and 1013.81 t/s for Q4_0 PP.
- The M3 Max records an F16 PP of 779.17 t/s.
As a note, Apple Silicon chips, from M1 through M4, have transformed Macs; however, they are not as fast as dedicated GPUs when handling large language models (LLMs), which often surprises new users. They excel at multitasking and efficiency but lack the raw speed of GPUs optimized for machine learning.
However, MLX has a notable advantage: it can load a Llama 3 8-bit model in under 10 seconds, while llama.cpp typically takes about 30 seconds to load it into VRAM (unified memory on Apple Silicon). Nevertheless, keeping MLX models resident in memory continuously poses a challenge.
As of mlx version 0.14, MLX has reached the same performance level as llama.cpp and Ollama, achieving about 65 t/s with a Llama 8B 4-bit model on an M3 Max. Reports suggest that mlx version 0.15 has further increased FFT performance.
In terms of price-to-performance ratio, the best Mac for local LLM inference is the 2022 Apple Mac Studio with the M1 Ultra chip, featuring 48 GPU cores and 64 GB or 96 GB of RAM, which delivers impressive capabilities for its price.
Understanding Performance Optimization in MLX and Ollama
The memory requirements for running LLMs are significant: 7B-parameter models generally require at least 8 GB of RAM, while 13B-parameter models require at least 16 GB. This highlights the importance of adequate hardware for optimal performance when working with models such as mistral:7b and llama2:13b.
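As a rough rule of thumb, you can estimate these footprints yourself from the parameter count and the bits per weight. The sketch below assumes a flat 20% overhead for the KV cache and runtime; that factor is an assumption for illustration, not a measured constant.

```python
# Back-of-the-envelope RAM estimate: parameters x bytes per weight, plus a
# rough overhead factor for KV cache, activations, and runtime (assumed 20%).
def estimate_ram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight * overhead / 1024**3

print(f"7B  fp16 : ~{estimate_ram_gb(7, 16):.1f} GB")   # ~15.6 GB
print(f"7B  4-bit: ~{estimate_ram_gb(7, 4):.1f} GB")    # ~3.9 GB
print(f"13B 4-bit: ~{estimate_ram_gb(13, 4):.1f} GB")   # ~7.3 GB
```

These estimates line up reasonably well with the ~8 GB and ~16 GB guidance above once you leave headroom for the operating system and other applications.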
To run inference faster, it is recommended to use powerful GPUs, since most of the computation happens on GPU cores and relies on GPU VRAM. The benchmark results indicate that a throughput of around 7 tokens/sec is enough for comfortable interaction between humans and AI models, while 13 tokens/sec is already faster than most users can read.
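For context on that comfort threshold, a quick back-of-the-envelope calculation helps; the reading speed and words-per-token figures below are rough assumptions, not measurements from the benchmark.

```python
# Why ~7 tokens/sec feels comfortable: compare against typical reading speed.
# Both constants below are rough assumptions and vary by reader and tokenizer.
words_per_minute = 240   # assumed average silent-reading speed
words_per_token = 0.75   # assumed ratio for English text

reading_tokens_per_sec = words_per_minute / 60 / words_per_token
print(f"typical reading speed ~= {reading_tokens_per_sec:.1f} tokens/sec")
# ~5.3 tokens/sec, so ~7 t/s stays ahead of the reader and 13 t/s outpaces them
```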
Running LLMs locally enhances data security and privacy while opening opportunities for professionals, developers, and enthusiasts. The benchmark results show that a Raspberry Pi 5 is inadequate for LLM inference due to its slow speed, whereas an Apple Mac mini with 16 GB of RAM is considered good enough for running LLMs.
The experimental setup benchmarks different models across various systems, including a Raspberry Pi 5, Ubuntu installations, and an Apple Mac mini, allowing for a comprehensive comparison of their performance. The findings suggest that for better performance, one should consider renting cloud VMs with GPUs.
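If you want to reproduce this kind of comparison on your own machines, a small script against a local Ollama server is enough. The sketch below assumes Ollama is running on its default port (11434) and that the example model tags have already been pulled; it relies on the timing fields that the /api/generate endpoint reports.

```python
# Benchmark sketch against a local Ollama server using its reported timings.
import requests

def bench(model: str, prompt: str) -> None:
    """Run one non-streaming generation and print Ollama's own timing fields."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    # Durations are reported in nanoseconds. Note: a repeated prompt may be
    # served from cache, which makes prompt_eval numbers unreliable on reruns.
    pp_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model}: load {resp['load_duration'] / 1e9:.2f}s, "
          f"prompt {pp_tps:.1f} t/s, generation {gen_tps:.1f} t/s")

for tag in ["mistral:7b", "llama2:13b"]:  # example tags mentioned above
    bench(tag, "Summarize the benefits of running LLMs locally in one paragraph.")
```

Running the same script unchanged on a Raspberry Pi, an Ubuntu box, and a Mac mini is what makes this kind of cross-system comparison straightforward.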
Ollama Performance Considerations:
If you’re using Ollama, you might have encountered situations where response times leave you tapping your fingers in frustration. There are multiple factors at play:
- Hardware Limitations: Insufficient CPU, RAM, and GPU can bottleneck the performance.
- Model Size: Larger models, like the Llama3:70b, require more computational power & memory.
- Optimization Settings: Incorrect configuration can hinder performance.
- Context Size: With a large context window, models may slow down as they strain to manage more data.
- Software Updates: Not using the latest version can prevent you from taking advantage of the latest optimizations.
Upgrading components can significantly enhance Ollama’s performance. Here’s what to look at:
- CPU Power: Choose a processor with high clock speeds & multiple cores. Think Intel Core i9 or AMD Ryzen 9 for a robust performance boost.
- Memory Matters: Aim for a minimum of 16GB RAM to comfortably handle smaller models, bump it to 32GB for medium-sized ones, or a whopping 64GB for those hulking beasts (30B+ parameters).
- Leverage GPUs: If you’re not already using a powerful GPU, consider investing in an NVIDIA RTX series for fantastic CUDA support which Ollama can take full advantage of.
Additionally, setting the appropriate configuration to split the load is crucial, for example: export OLLAMA_NUM_GPUS=4. Enabling multi-GPU support ensures the heavy lifting gets distributed, which tends to yield faster inference times. Once you have the right hardware, it’s time to optimize your software settings: make sure you’re always running the latest version of Ollama to benefit from the newest performance improvements.
Consider reducing the context size to lighten the processing load: by experimenting with different sizes, you can strike a balance between speed & the model’s capacity to keep track of context (a per-request example follows below). When selecting models in Ollama, it’s advantageous to choose those optimized for speed, especially if response times are critical to your tasks.
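As one concrete example, the context window and other runtime knobs can be set per request through the options field of Ollama’s API. The values below are illustrative only; which settings actually help depends on your hardware and model.

```python
# Passing runtime options (context size, GPU offload, CPU threads) per request.
# Values are illustrative, not recommendations.
import requests

payload = {
    "model": "llama2:7b",  # example tag
    "prompt": "Give a one-paragraph overview of Apple Silicon.",
    "stream": False,
    "options": {
        "num_ctx": 2048,   # smaller context window -> less memory and compute
        "num_gpu": 99,     # number of layers to offload to the GPU
        "num_thread": 8,   # CPU threads for whatever is not offloaded
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600).json()
print(resp["response"][:200])
```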
By carefully analyzing hardware choices, optimizing configuration settings, & adjusting your model selection, you can improve Ollama’s performance dramatically. Vendors constantly innovate, so keep up with the latest developments in Ollama & the broader LLM space for maximal returns on your investment.
Model selection significantly impacts Ollama’s performance. Smaller models generally run faster but may be less capable, so consider models optimized for speed: they offer a good balance between performance and capability. Quantization reduces model size and improves inference speed, and Ollama supports various quantization levels.
Performance Tuning Steps:
- Adjusting Resource Allocation: Learn how to manage the CPU and memory resources dedicated to Ollama to balance performance with system demands (see the sketch after this list).
- Model Optimization: Techniques for refining model parameters to enhance speed and accuracy without compromising the quality of results.
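For the resource-allocation step, one practical lever is the keep_alive field on Ollama’s generate endpoint, which controls how long a model stays resident in memory after a request. The sketch below is a minimal illustration; the model tag and durations are examples only.

```python
# Trading memory residency against reload latency via keep_alive.
# "5m" keeps the model loaded for five minutes, "0" unloads it immediately,
# and -1 keeps it loaded indefinitely.
import requests

def generate(prompt: str, keep_alive: str = "5m") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2:7b",      # example tag
            "prompt": prompt,
            "stream": False,
            "keep_alive": keep_alive,  # more RAM held, but faster follow-ups
        },
        timeout=600,
    ).json()
    return resp["response"]

print(generate("What is unified memory on Apple Silicon?", keep_alive="30m"))
```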
Comparison with MLX:
As of mlx version 0.14, MLX already matches the performance of llama.cpp and Ollama, at about 65 t/s for a Llama 8B 4-bit model on an M3 Max. I’ve read that mlx version 0.15 increased FFT performance by 30x.
While Ollama can run on CPUs, its performance is significantly better on modern, powerful processors. Consider upgrading your CPU: for example, an Intel Core i9 or AMD Ryzen 9 can provide a substantial performance boost for Ollama. RAM also plays a crucial role in Ollama’s performance, especially when working with larger models.
Fine-tuning Ollama models can significantly enhance the performance of LLMs in various applications. This process involves adjusting the model parameters to better fit specific tasks or datasets, leading to improved accuracy and relevance in outputs.