• brucethemoose@lemmy.world
    12 days ago

    No, all the weights (essentially all the “data”) have to be in RAM. If you “talk to” an LLM on your GPU, it is not making any calls to the internet; it is making a pass through all the weights every time a token is generated.
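
    To make that concrete, here’s a rough sketch with Hugging Face transformers (the model id is just an example, swap in whatever you run). Once the weights are downloaded, generation is pure local compute:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        # One-time download; after this, the weights sit entirely in (V)RAM.
        model_id = "Qwen/Qwen2.5-7B-Instruct"  # example model id
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype="auto", device_map="auto"
        )

        inputs = tok("What is a token?", return_tensors="pt").to(model.device)
        # Every new token is one forward pass through all the loaded weights;
        # nothing here touches the network.
        out = model.generate(**inputs, max_new_tokens=64)
        print(tok.decode(out[0], skip_special_tokens=True))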

    There are systems that augment the prompt with external data (retrieval-augmented generation, or RAG, is one term for this), but fundamentally the model itself is closed.
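
    A bare-bones sketch of the RAG idea (the embedding model and documents below are placeholders): retrieve the closest snippets, then paste them into the prompt. The LLM itself still only ever sees text:

        import numpy as np
        from sentence_transformers import SentenceTransformer

        docs = [
            "GGUF is a model file format used by llama.cpp.",
            "exl2 is the quantization format used by exllamav2.",
        ]
        embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
        doc_vecs = embedder.encode(docs, normalize_embeddings=True)

        def retrieve(question, k=1):
            # Vectors are normalized, so a dot product is cosine similarity.
            q = embedder.encode([question], normalize_embeddings=True)[0]
            scores = doc_vecs @ q
            return [docs[i] for i in np.argsort(scores)[::-1][:k]]

        question = "What format does llama.cpp use?"
        context = "\n".join(retrieve(question))
        # The "augmentation" is literally just prepending retrieved text:
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"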

    • Hackworth@lemmy.world
      12 days ago

      Yeah, I’ve had decent results running the 7B/8B models, particularly the ones fine-tuned for specific use cases. But as ya mentioned, they’re only really good within their scope for a single prompt or maybe a few follow-ups. I’ve seen little improvement with the 13B/14B models and find them mostly not worth the performance hit.

      • brucethemoose@lemmy.world
        12 days ago

        Depends which 14B. Arcee’s 14B SuperNova Medius (a Qwen 2.5 model with training distilled from larger models) is really incredible, but old Llama 2-based 13B models are awful.
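
        If you want to kick the tires, something like this should work (I’m assuming the Hugging Face repo id is arcee-ai/SuperNova-Medius; double-check the model card):

            from transformers import pipeline

            # 14B Qwen 2.5 base: roughly 28 GB in fp16, so grab a quantized
            # build if your GPU is smaller.
            pipe = pipeline(
                "text-generation",
                model="arcee-ai/SuperNova-Medius",  # assumed repo id, verify on HF
                torch_dtype="auto",
                device_map="auto",
            )
            out = pipe("Explain distillation in one paragraph.", max_new_tokens=128)
            print(out[0]["generated_text"])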

        • Hackworth@lemmy.world
          12 days ago

          I’ll try it out! It’s been a hot minute, and it seems like there are new options all the time.

          • brucethemoose@lemmy.world
            12 days ago

            Try a new quantization as well! Like an IQ4_M GGUF, depending on the size of your GPU, or even better, a 4.5bpw exl2 with Q6 cache if you can manage to set up TabbyAPI.
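
            For the GGUF route, a minimal llama-cpp-python sketch (the repo and filename below are placeholders; pick an IQ4_M file that actually fits your VRAM):

                from llama_cpp import Llama

                # IQ4_M is one of llama.cpp's "i-quants": ~4.5 bits per weight,
                # noticeably better than the older Q4 formats at the same size.
                llm = Llama.from_pretrained(
                    repo_id="bartowski/SuperNova-Medius-GGUF",  # placeholder repo
                    filename="*IQ4_M.gguf",                     # placeholder pattern
                    n_gpu_layers=-1,  # offload all layers to the GPU
                    n_ctx=8192,
                )
                out = llm("Q: What is quantization?\nA:", max_tokens=128)
                print(out["choices"][0]["text"])

            The exl2 + Q6 cache route goes through exllamav2/TabbyAPI instead, which has its own config, so this only covers the llama.cpp side.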