Well that’s a loaded question.
There are probably some websites that let you try out the model while they run it on their own hardware (or on hardware rented from Amazon, etc.). But the biggest advantage of these models is being able to run them locally if you have the hardware to handle it (a beefy GPU for quicker responses and a lot of RAM).
To quickly answer your question, you can download the model from here:
https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF
I would recommend Q5_K_M.
But you’ll also need some software to run it.
A large number of users are using “Text-Generation-WebUI” https://github.com/oobabooga/text-generation-webui
There’s also “LM Studio” https://lmstudio.ai/
And “Ollama” https://github.com/ollama/ollama
There are more options beyond these.
I know that LM Studio supports both NVIDIA and AMD GPUs.
Text-Generation-WebUI can support AMD GPUs as well; it just requires some additional setup to get it working.
Some things to keep in mind…
Hardware requirements:
- RAM is the biggest limiting factor for which model you can run, while your GPU/CPU decides how quickly the LLM can respond.
- If you can fit the entire model inside your GPU’s VRAM, you’ll get the most speed. In this case I would suggest using a GPTQ model instead of GGUF: https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GPTQ
- Even the highest-end consumer-grade GPUs only have 24GB of VRAM right now (RTX 4090, RTX 3090, and RX 7900 XTX). The next generation of consumer GPUs looks like it will be capped at 24GB of VRAM as well, unless AMD decides that more VRAM is their way of competing with NVIDIA.
GGUF models let you compensate for VRAM limitations: as much of the model as fits gets loaded into VRAM first, and whatever is left over gets loaded into system RAM (the part in system RAM runs on the CPU, so it is slower).
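To get a feel for how that split works out, here is a rough back-of-the-envelope sketch. It assumes the model file is divided evenly across its layers and that some VRAM is held back for the context and the desktop; the function name, the 32 GB file size, and the 1.5 GB reserve are illustrative values, not measurements. Tools built on llama.cpp expose the result as a "GPU layers" setting.

```python
def split_layers(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM and how much spills to system RAM.

    Assumption: every layer is the same size (model_gb / n_layers).
    reserve_gb is VRAM kept free for context and other programs (a guess).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    ram_gb = model_gb - gpu_layers * per_layer_gb
    return gpu_layers, ram_gb


# Hypothetical 32 GB GGUF file with 32 layers on a 24 GB card
gpu_layers, ram_gb = split_layers(model_gb=32.0, n_layers=32, vram_gb=24.0)
print(f"{gpu_layers} layers on the GPU, about {ram_gb:.1f} GB in system RAM")
```

Treat the number as a starting point only: real layers are not perfectly equal in size, so you may need to go a few layers up or down from the estimate.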
Context Length: Think of an LLM as something that only has a fixed amount of short-term memory. The bigger you set the context length, the more short-term memory you give it. The maximum you can set depends on the model you’re using, and a larger context also requires more RAM. Mixtral 8x7B models have a max context length of 32k tokens.
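To put rough numbers on the extra RAM a bigger context needs, here is a minimal sketch of the usual key/value cache estimate. The defaults (32 layers, 8 key/value heads, head size 128) are what I understand Mixtral 8x7B’s published config to be, and the 2 bytes per value assumes a 16-bit cache; treat all of them as assumptions, since actual usage depends on the software you run the model with.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate memory for the key/value cache at a given context length.

    The leading 2 accounts for storing both keys and values per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


for ctx in (4096, 16384, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.1f} GiB")
# 4096 tokens -> 0.5 GiB, 16384 tokens -> 2.0 GiB, 32768 tokens -> 4.0 GiB
```

This comes on top of the model file itself, which is why setting the context to the max can push a model that barely fit out of VRAM.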