How to Increase Ollama Context Size: A Complete Step-by-Step Guide

Step-by-Step Guide to Increase Ollama Context Size

Below is a comprehensive guide on how to modify an Ollama model’s context window:

  1. Adjusting the Context Window: First, you can adjust the context window parameter when running Ollama in the console. For example:

     ollama run llama3.1
     >>> /set parameter num_ctx 4096

     This command sets the parameter ‘num_ctx’ to ‘4096’, allowing you to execute longer prompts.
  2. Creating a Custom Model: Here’s how you can do it in a few easy steps:
    1. Create a Configuration File: Using a text editor like Nano, create a file named llama3.1_ctx_4096 with the following content:

       FROM llama3.1:latest
       PARAMETER num_ctx 4096
    2. Build the New Model: Use the Ollama create command:

       ollama create llama3.1_ctx_4096 -f llama3.1_ctx_4096
    3. Run Fabric with the New Model: Execute the following command:

       fabric --remoteOllamaServer 192.168.1.100 --model "llama3.1_ctx_4096:latest" -sp extract_wisdom < video_transcript.txt
  3. Important Considerations: Increasing the context window allows the model to consider more context, improving the coherence of longer texts. However, it also requires more computational resources. Conversely, reducing this value can speed up the generation process but might result in less coherent or contextually aware outputs for longer texts.
  4. Increasing Context Size in the Modelfile: To increase the context window in Ollama, create a new Modelfile that extends the context size. The Modelfile should specify the model you are using and include the ‘num_ctx’ parameter with the value you want. For example:

     FROM llama3.1:8b
     PARAMETER num_ctx 32768

     This gives you a context window four times larger than the roughly 8K default described in the next step. Apply it with:

     ollama create -f Modelfile llama3.1:8b
  5. Default Context Size: By default, Ollama runs the Llama 3.1 model with a context window of about 8K tokens. After creating the new Modelfile to increase the context window to 32K tokens, processing of large amounts of data improved significantly without information loss. If you need an even larger context window, you can set ‘num_ctx’ to 131072, which is the maximum supported by Llama 3.1.
  6. VRAM Considerations: When increasing the context size, it is important to note that this will also increase the amount of VRAM used. While exact numbers were not provided, it was found that a context size of 32768 was sufficiently fast and low on VRAM for most practical purposes, making it suitable for handling larger documents.
  7. Parallel Request Processing: Serving parallel requests for a given model multiplies the context allocation by the number of parallel requests. For example, a 2K context with 4 parallel requests results in an 8K context allocation and correspondingly more memory use (see the sketch after this list).
  8. Editing Context Size in the Workspace: You can increase a model’s context size from the Models section of the Workspace option in the chat sidebar. All of your models are listed there, each with a pencil icon for editing. Edit a model, then click Show next to Advanced Parameters to reveal the context length setting.
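
As a rough illustration of the parallel-request point above, here is a minimal sketch; it assumes you launch the server yourself and that your Ollama version honors the OLLAMA_NUM_PARALLEL setting:

# Serve up to 4 requests per model in parallel; with num_ctx 2048 the server
# reserves KV-cache memory for 4 x 2048 = 8192 tokens.
OLLAMA_NUM_PARALLEL=4 ollama serve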

It’s crucial to understand the limitations and enhancements that come with context size adjustments to optimize performance accordingly.

Understanding Context Window Limits in Ollama Models

The default context window size in Ollama is 2048 tokens. However, if you need broader context understanding, you can change this size at the interactive prompt:

ollama run llama3
>>> /set parameter num_ctx 4096

You can also specify the context window size via the API or the Modelfile. In some models within the Ollama library, a larger context window size is used by default. Additionally, context window size can be adjusted in the ollama section of the plugin settings. Here’s how it works:

  • If you add a context window size field, fill it with the desired context window size. When the field is non-empty, send its value as the key/value pair ‘num_ctx’: <value> in the options dict of every chat API request (see the example after this list). Send it in every request; otherwise the server reverts to the default model parameter settings for any parameters left out.
  • If the field is left empty, do not send num_ctx in the chat API request’s options dict, and the server will utilize the model default in that case.
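
For example, a minimal sketch of such a request against Ollama’s native chat endpoint looks like this (the model name and values are placeholders; the key point is that num_ctx travels in options on every call):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "Summarize the attached notes." }],
  "options": { "num_ctx": 32768 },
  "stream": false
}'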

Moreover, context size should be determined dynamically at runtime based on the amount of memory available. It is crucial to avoid costly overflow from VRAM to RAM and from RAM to SSD/swap. Always prefer longer context over loading multiple models or maintaining multiple contexts.

If the context comes out too small after applying the above rules, set it to a reasonable minimum, such as 10% of the model size: Llama 3 readily writes 700-token responses to simple questions, which can exhaust a 2048-token context by the third turn.
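
A rough sketch of that sizing heuristic is shown below; it assumes an NVIDIA GPU with nvidia-smi available, and the per-1K-token memory figure is a placeholder you would calibrate for your model and quantization:

# Derive num_ctx from free VRAM once the model weights are loaded.
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
reserve_mib=1024            # headroom for CUDA buffers and other processes (assumption)
kv_mib_per_1k_tokens=64     # placeholder: measure for your model and quantization
budget_mib=$(( free_mib - reserve_mib ))
num_ctx=$(( budget_mib / kv_mib_per_1k_tokens * 1024 ))
# Clamp to a sensible range: at least 2048, at most Llama 3.1's 131072 limit.
[ "$num_ctx" -lt 2048 ] && num_ctx=2048
[ "$num_ctx" -gt 131072 ] && num_ctx=131072
echo "num_ctx=$num_ctx"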

If you require an even larger context window, you can increase num_ctx to 131072, which corresponds to Llama 3.1’s 128K context limit. The context size, which defaults to 2048, can also be raised through the num_ctx option in the API.

To adjust the context window size interactively, run:

ollama run <model>

and then, at the prompt:

/set parameter num_ctx <value>

This allows longer prompts to be executed; note, however, that when using Fabric you cannot pass the parameter directly. Instead, create a custom Llama model with the new parameter, as shown in the step-by-step guide above.

Finally, while small documents yield accurate results, larger ones may lead to inaccurate or incomplete data extraction. The OpenAI API integration with Ollama currently does not offer a way to modify the context window, which restricts flexibility and may exacerbate inconsistencies in results across varying document sizes.
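
One hedged workaround, building on the custom-model approach above, is to bake the larger num_ctx into a derived model so that clients using the OpenAI-compatible endpoint inherit it without sending extra options (the model and file names here are illustrative):

# Bake the context size into a derived model...
printf 'FROM llama3.1:8b\nPARAMETER num_ctx 32768\n' > Modelfile.32k
ollama create llama3.1-32k -f Modelfile.32k
# ...then point OpenAI-style clients at that model name.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "llama3.1-32k", "messages": [{ "role": "user", "content": "Hello" }] }'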

A simple binary search can help find the maximum usable context size; ideally the runtime would use memory optimally and cap the upper limit with a percentage (e.g., a hypothetical max_mem setting allowing 95% of GPU memory for context expansion). A sketch of such a search follows. Be aware that a larger context window costs substantial memory, which is severely limited on GPUs and, to a degree, on CPUs as well.
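
Here is a minimal sketch of that binary search, assuming the probe criterion is whether ollama ps still reports the model as 100% GPU-resident; the base model tag and the 2048-token stopping step are illustrative:

# Binary-search the largest num_ctx that keeps the model fully in VRAM.
lo=2048; hi=131072
while [ $((hi - lo)) -gt 2048 ]; do
  mid=$(( (lo + hi) / 2 ))
  printf 'FROM llama3.1:8b\nPARAMETER num_ctx %s\n' "$mid" > Modelfile.probe
  ollama create ctx-probe -f Modelfile.probe
  echo "hello" | ollama run ctx-probe > /dev/null      # force a load at this context size
  if ollama ps | grep -q '100% GPU'; then
    lo=$mid    # still fully on the GPU: try a larger context
  else
    hi=$mid    # spilled to CPU/RAM: try a smaller context
  fi
done
echo "Largest num_ctx that stayed fully on the GPU: $lo"
ollama rm ctx-probe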

Best Practices for Modifying Ollama Context Length for Optimal Performance

To optimize performance when modifying the context length in Ollama, follow these best practices:

  • Context Window Size: It is recommended to set the ‘Size of context window’ in Dify or Continue to a reasonably small value, preferably the default (2048) or 4096. Testing with a short prompt can help confirm whether a problem such as slow processing has been resolved.
  • Adjusting Settings in Dify: In Dify, you can open the LLM block in the studio app to find ‘Size of content window’. You can uncheck it or enter 4096. The default value is 2048 when unchecked.
  • Changing Values in Continue: In the Continue extension for Visual Studio Code, you can change the contextLength and maxTokens values to 4096 and 2048, respectively.
  • Checking Context Length: Check the maximum context length with ‘ollama show <model>’ to ensure the correct settings are applied. For instance, Llama 3.1 has a context length of 131072 tokens, roughly 65,536 words of text (see the example after this list).
  • Finding Suitable Size: Finding a suitable size for context length involves adjusting the value manually. A context length of 24576 (4096*6) was found to work well for Llama 3.1 8B F16 and DeepSeek-Coder-V2-Lite-Instruct Q6_K. However, using non-multiple-of-4096 values may cause character corruption.
  • Avoiding Slow Processing: When using Ollama through Dify or Continue, slow processing can occur with a large context length. For example, a model with a maximum context length of 131072 may run slowly if the memory required for that context exceeds the actual model size (16 GB), since Ollama then effectively has to process it as a much larger model.
  • Consider Hardware Limitations: Aiming for a shorter context length is advisable if you are facing hardware limitations. It’s important to keep the context within RAM constraints to ensure optimal performance of your Ollama models.
  • Ensure Sufficient RAM: To run your Ollama models effectively, make sure you have enough RAM for the model’s needs. For instance, running the Llama 3 70B model effectively may require upwards of 64GB of RAM. If your system falls short, consider running a smaller model or reducing memory usage with quantization techniques.
  • Tweaking Model Parameters: Tweaking model parameters can significantly affect output and speed. For example, adjusting settings such as Temperature, Top-K, and Top-P can lead to smarter sampling strategies and faster outputs.
  • Monitoring Performance: Regularly monitoring your model’s logs is crucial to identify spikes in performance or lag. Utilizing performance profiling tools will help to identify bottlenecks in your model’s performance, making it easier to implement necessary optimizations.
  • Dynamic Model Management: Ollama allows loading and unloading models dynamically based on user needs. If a model isn’t being used, unloading it can free resources, contributing to overall performance optimization.
  • Experimenting with Context Window Size: Experiment with different sizes to find the optimal balance between speed and context understanding for your use case. Caching can significantly improve Ollama’s performance, especially for repeated queries or similar prompts.
  • Streamlining Processing: Adjusting context window sizes can lead to faster processing without sacrificing too much understanding. For instance, using --context-size 2048 can streamline inference times drastically.
  • Using Quantized Models: Ollama supports various quantization levels. For example, running the Llama 2 7B model with 4-bit quantization is faster and uses less memory than the full-precision version. Remember that the context window size affects both performance and the model’s ability to understand context.
  • Registering Models in Dify: When adding an Ollama model to Dify, you can override the default value of 4096 for Model context size and Upper bound for max tokens. Since a low upper limit can make debugging difficult if issues arise, it’s better to set both values to the model’s full context length and adjust the Size of content window in individual AI apps.
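
As a quick illustration of the checks above, the commands below inspect a model’s limits and memory footprint; the tags are examples, and which quantized variants exist depends on what the Ollama library publishes:

ollama show llama3.1:8b    # the "context length" line reports the maximum (131072 for Llama 3.1), along with the quantization level
ollama run llama3.1:8b     # load the model, then in another terminal:
ollama ps                  # shows how much memory the loaded model plus its context occupies, and whether it fits on the GPU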
