Interactive PowerBasic Forum

CX32 and CX64 Compiler Project => Project Progress and Learning => Topic started by: Theo Gottwald on April 24, 2026, 03:44:21 PM

Title: 🚀 The Way of Local AI: From Prompting to Fine-Tuning 🧠🔥
Post by: Theo Gottwald on April 24, 2026, 03:44:21 PM
🚀 The Way of Local AI: From Prompting to Fine-Tuning 🧠🔥

If you go the "way of the local AI", you will of course test multiple AIs/models, different prompts, different system messages, different quantizations, different samplers, different context sizes and all sorts of settings.

But the final step, when you really want the model to behave in your own special way, is:

🎯 Model Fine-Tuning

This is where you stop only "talking nicely" to the model and start teaching it a specific behavior, style, format, workflow or domain knowledge.

Examples:

✅ Your own coding style 
✅ Your own support-bot behavior 
✅ Your own company documentation style 
✅ Your own tool-calling format 
✅ Your own agent workflow 
✅ Your own prompt language 
✅ Your own local AI assistant personality 
✅ Better answers for your exact use-case 

But fine-tuning is not magic. It is also the point where many things can fall on your leg. 😄



🧩 First Important Question: Do You Really Need Fine-Tuning?

Before fine-tuning, check these easier steps:

1. Better prompting ✍️ 
Many problems are simply bad prompts, missing examples or unclear system instructions.

2. RAG / Document Search 📚 
If the model only needs access to facts, manuals, PDFs, source code or documentation, then RAG is often better than fine-tuning.

3. Tool usage / MCP / Agents 🛠️ 
If the model must do real actions, use tools. Fine-tuning does not magically give the model access to files, APIs or programs.

4. Fine-tuning 🧠 
Use fine-tuning when you want the model to learn a repeatable response pattern, format, coding style, classification behavior or domain-specific task behavior.



🖥️ Hardware Needed for Local Fine-Tuning

For local fine-tuning, the most important thing is not CPU speed.

The most important thing is:

🔥 VRAM, VRAM, VRAM

The GPU memory decides what model size, batch size, context length and training method you can use.

Rough practical guide:

A modern local high-end card like an RTX 5090 with 32 GB VRAM is already very strong for LoRA/QLoRA style fine-tuning, but it is still not the same as having an 80 GB data-center GPU.

For most private users, the realistic local path is:

✅ LoRA / QLoRA fine-tuning 
❌ Not full fine-tuning of huge models
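As a back-of-the-envelope check, you can estimate weight-only memory from parameter count and quantization bits. This is a rough sketch, not a precise tool: KV cache, activations and (for training) optimizer state come on top, so treat the numbers as a lower bound.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate in GiB.

    KV cache, activations and optimizer state are NOT included,
    so real usage is noticeably higher.
    """
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

# A 7B model: ~13 GiB in 16-bit, ~3.3 GiB in 4-bit (QLoRA-style)
print(f"7B  @ 16-bit: {weight_memory_gb(7, 16):.1f} GiB")
print(f"7B  @  4-bit: {weight_memory_gb(7, 4):.1f} GiB")
print(f"70B @  4-bit: {weight_memory_gb(70, 4):.1f} GiB")
```

This also shows why a 32 GB card barely holds the 4-bit weights of a 70B model, with no headroom left for context.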



🧠 LoRA, QLoRA and Full Fine-Tuning

Full fine-tuning changes the whole model. 
This needs much more VRAM, more compute, more storage and more care.

LoRA trains small adapter layers instead of the whole model. 
This is much cheaper and much easier.

QLoRA uses quantization to reduce memory usage even more. 
This is currently the realistic method for most local users.

So for local AI users, the usual recommendation is:

Start with QLoRA. Do not start with full fine-tuning.
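To see why LoRA is so much cheaper, count its parameters: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank matrices of shapes d_out × r and r × d_in. A minimal sketch with illustrative dimensions:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA adapter pair (A: rank x d_in, B: d_out x rank)."""
    return rank * d_in + d_out * rank

# One 4096x4096 projection matrix (a typical size inside a ~7B model):
full = 4096 * 4096                                   # 16,777,216 weights in full fine-tuning
lora = lora_trainable_params(4096, 4096, rank=16)    # 131,072 weights
print(f"LoRA trains {lora / full:.2%} of this matrix")  # LoRA trains 0.78% of this matrix
```

Under one percent of the weights per adapted matrix, which is exactly why LoRA fits on consumer GPUs.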



☁️ Renting GPUs Instead of Buying Hardware

If you do not want to buy expensive hardware, you can rent GPUs in the cloud.

There are many GPU rental platforms to choose from, from big cloud providers to smaller GPU marketplaces.


The advantage:

✅ No need to buy a 3000–6000 EUR workstation 
✅ You can rent stronger GPUs like A100, H100, H200 or B200 
✅ You only pay while training 
✅ Good for experiments 
✅ Good if you need more VRAM only sometimes 

The disadvantage:

⚠️ You must upload your data 
⚠️ You must secure your API keys and SSH keys 
⚠️ Bad configuration can waste money fast 
⚠️ Storage costs can continue after the GPU is stopped 
⚠️ Some cheap marketplace machines may be unreliable 
⚠️ You need Linux knowledge 
⚠️ You must download your finished adapters/models before deleting the machine 

Very important:

🛑 Stop the GPU when you are finished. 
🛑 Delete unused volumes if you no longer need them. 
🛑 Do not leave expensive machines running overnight by mistake.

A small LoRA job can be cheap. 
A badly configured cloud training run can become expensive very quickly. 💸
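A quick cost sanity check before launching a run helps avoid exactly this. The prices below are purely illustrative placeholders, not real quotes from any provider:

```python
def cloud_run_cost_eur(gpu_eur_per_hour: float, gpu_hours: float,
                       storage_gb: float = 0.0,
                       storage_eur_per_gb_month: float = 0.0,
                       storage_months: float = 0.0) -> float:
    """GPU time plus persistent storage, which keeps billing after the GPU stops."""
    gpu_cost = gpu_eur_per_hour * gpu_hours
    storage_cost = storage_gb * storage_eur_per_gb_month * storage_months
    return gpu_cost + storage_cost

# A focused 3-hour QLoRA job on a rented GPU (assumed 2.50 EUR/h):
print(cloud_run_cost_eur(2.50, 3))

# The same machine forgotten over a weekend, plus a 200 GB volume left behind:
print(cloud_run_cost_eur(2.50, 48, storage_gb=200,
                         storage_eur_per_gb_month=0.10, storage_months=1))
```

The second number is roughly twenty times the first, from nothing but forgetting the stop button.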



🧱 Dense Models vs. MoE Models

This is one of the most important differences.

1. Dense Models

A dense model uses the whole model for each token.

Example:

If it is a 14B dense model, then all 14 billion parameters are active for every generated token.

Advantages:

✅ Easier to understand 
✅ Easier to fine-tune 
✅ Easier to deploy 
✅ More predictable memory behavior 
✅ Usually simpler for beginners 

Disadvantages:

⚠️ Bigger dense models need more VRAM 
⚠️ Training cost rises directly with model size 
⚠️ A 32B dense model is much heavier than a 7B dense model 

2. MoE Models – Mixture of Experts

MoE means "Mixture of Experts".

The model contains multiple expert networks, but only some experts are active for each token.

Example idea:

A model may have 8 experts, but only 2 are used per token.

So the model may have a large total parameter count, but a smaller active parameter count.
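The total vs. active distinction can be sketched with numbers. The layout below is hypothetical, not any specific model:

```python
def moe_param_counts(shared_b: float, n_experts: int,
                     expert_b: float, active_experts: int):
    """Return (total, active) parameter counts in billions for a simple MoE layout."""
    total = shared_b + n_experts * expert_b
    active = shared_b + active_experts * expert_b
    return total, active

# Hypothetical model: 2B shared weights, 8 experts of 5B each, 2 active per token
total, active = moe_param_counts(2.0, 8, 5.0, 2)
print(f"total: {total}B, active: {active}B")  # total: 42.0B, active: 12.0B
```

Only 12B parameters compute per token, but all 42B must still fit in memory.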

Advantages:

✅ Can be very powerful 
✅ Can have high total capacity 
✅ Only part of the model is active per token 
✅ Often strong for reasoning and broad knowledge 

Disadvantages:

⚠️ More complicated to fine-tune 
⚠️ More complicated to serve 
⚠️ Can need a lot of total VRAM anyway 
⚠️ Expert routing can behave unexpectedly 
⚠️ Multi-GPU setups can become more difficult 
⚠️ Not always beginner-friendly 

Important:

A MoE model may say "only 13B active parameters", 
but you may still need to load a much larger total model into memory.

So do not only look at "active parameters". 
Also check:

✅ the total parameter count 
✅ the total memory needed to load the full model 


📦 What Data Do You Need?

Fine-tuning quality depends heavily on your dataset.

Bad data creates a bad model.

You need examples like:

User: Please convert this CSV to JSON.
Assistant: Sure. Here is the JSON output...

Or for coding:

Instruction: Write a PowerBASIC function that trims and validates a string.
Answer: FUNCTION ...
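Most fine-tuning tools accept training data as JSON Lines: one conversation per line. A minimal stdlib-only sketch of the pairs above (the exact field names vary by tool; "messages"/"role"/"content" is a common convention, but check your trainer's docs):

```python
import json

# Two illustrative training examples in the common "messages" convention.
examples = [
    {"messages": [
        {"role": "user", "content": "Please convert this CSV to JSON."},
        {"role": "assistant", "content": "Sure. Here is the JSON output..."},
    ]},
    {"messages": [
        {"role": "user",
         "content": "Write a PowerBASIC function that trims and validates a string."},
        {"role": "assistant", "content": "FUNCTION TrimAndValidate$(s AS STRING) ..."},
    ]},
]

# JSONL = one JSON object per line; this string is what you would save as train.jsonl.
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"

# Every line must parse back on its own - that is the whole point of JSONL.
rows = [json.loads(line) for line in jsonl.splitlines()]
print(len(rows))  # 2
```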

Good training data should be:

✅ Clean 
✅ Consistent 
✅ Correct 
✅ Deduplicated 
✅ Legally usable 
✅ In the right format 
✅ Similar to the task you want later 
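"Clean" and "deduplicated" can be partly automated. A small sketch that normalizes whitespace and case, then drops exact duplicates (real pipelines often add fuzzy deduplication on top of this):

```python
def normalize(text: str) -> str:
    """Collapse whitespace so trivial variants compare equal."""
    return " ".join(text.split())

def dedup_examples(examples):
    """Keep the first occurrence of each (prompt, answer) pair after normalization."""
    seen = set()
    kept = []
    for prompt, answer in examples:
        key = (normalize(prompt).lower(), normalize(answer).lower())
        if key not in seen:
            seen.add(key)
            kept.append((prompt, answer))
    return kept

data = [
    ("Trim a string.", "FUNCTION Trim$ ..."),
    ("Trim  a string. ", "function trim$ ..."),   # duplicate after normalization
    ("Validate a string.", "FUNCTION Validate% ..."),
]
print(len(dedup_examples(data)))  # 2
```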

Bad training data causes:

❌ hallucinations 
❌ broken code 
❌ strange formatting 
❌ overfitting 
❌ repeated phrases 
❌ worse general ability 
❌ model personality damage 



⚠️ Things That Can Fall on Your Leg

Here are the common traps:

1. Too little VRAM 🧯 
The training crashes with CUDA out-of-memory errors.

2. Wrong model format 📦 
GGUF is usually for inference. Fine-tuning often needs Hugging Face / safetensors models.

3. Wrong tokenizer 🔤 
If tokenizer and model do not match, the result can be broken.

4. Bad dataset format 📄 
The model learns garbage formatting.

5. Too high learning rate 🔥 
The model becomes stupid very quickly.

6. Too many epochs 🔁 
The model memorizes your examples instead of generalizing.

7. No evaluation set 🧪 
You do not know if the model improved or just became worse.
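Trap 7 is easy to avoid: hold out a fixed evaluation slice before training and never train on it. A minimal sketch with a seeded shuffle so the split stays reproducible between runs:

```python
import random

def train_eval_split(examples, eval_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out eval_fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

train, evalset = train_eval_split(range(100))
print(len(train), len(evalset))  # 90 10
```

Run the base model and the fine-tuned model on the same held-out set; only then can you say whether the tuning actually helped.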

8. Mixing languages badly 🌍 
If you mix English, German, code, comments and instructions without structure, the model may become inconsistent.

9. Expecting new knowledge from fine-tuning 🧠 
Fine-tuning is not a database. For facts, use RAG.

10. Fine-tuning the wrong base model 🎯 
If the base model is bad at your task, fine-tuning may not rescue it.

11. Ignoring licensing ⚖️ 
Check model license and data license before commercial usage.

12. Forgetting deployment 🚀 
Training is only half the job. You also need to run the result locally in LM Studio, Ollama, text-generation-webui, vLLM, llama.cpp or your own system.



🛠️ Typical Fine-Tuning Workflow

A practical workflow looks like this:

1. Choose a base model that already fits your task 🎯 
2. Prepare and clean your dataset 📦 
3. Run a short QLoRA training 🧠 
4. Evaluate against a held-out test set 🧪 
5. Adjust data and settings, then scale up 🔁 
6. Export the adapter and deploy it locally 🚀 

Do not start with a huge training run.

Start small:

✅ 100 examples 
✅ short test 
✅ check result 
✅ then scale up



🧪 What Should You Fine-Tune For?

Good fine-tuning targets:

✅ answer format 
✅ coding style 
✅ tool-call syntax 
✅ classification 
✅ support replies 
✅ domain-specific workflows 
✅ structured output 
✅ JSON output 
✅ agent behavior 
✅ company-specific style 
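For targets like "JSON output" or "structured output", evaluation can be fully automatic: parse every model reply and check it for the required keys. A minimal sketch (the key names here are invented for illustration):

```python
import json

def check_json_reply(reply: str, required_keys=("status", "result")) -> bool:
    """True if the reply is valid JSON containing all required keys."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

print(check_json_reply('{"status": "ok", "result": 42}'))    # True
print(check_json_reply('Sure! Here is the JSON: {"a": 1}'))  # False - chatty preamble
```

Running this over the evaluation set gives a pass rate you can compare before and after fine-tuning.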

Bad fine-tuning targets:

❌ storing large documentation 
❌ replacing a search engine 
❌ fixing a fundamentally bad base model 
❌ forcing a small model to become GPT-5 
❌ training from random scraped garbage 
❌ training without tests 



🏁 Practical Recommendation

For most local AI users:

Use a good dense 7B/8B/14B model first. 
Then try QLoRA. 
Then test with your real use-case. 
Only then move to larger dense or MoE models.

Dense models are usually easier for beginners.

MoE models can be powerful, but they bring more complexity, especially for fine-tuning and deployment.

If you have a strong local GPU, use it for experiments. 
If you need more VRAM, rent a cloud GPU for a few hours.

But always remember:

Fine-tuning is not magic. 
Fine-tuning is data quality + correct settings + evaluation. 🧠⚙️

The real secret is not just the GPU.

The real secret is:

Good examples. Good tests. Good workflow. 🚀


2026-04-24 15_35_59-Greenshot.png
Title: Re: 🚀 The Way of Local AI: From Prompting to Fine-Tuning 🧠🔥
Post by: Theo Gottwald on April 25, 2026, 02:38:45 PM
Fine-tuning the Qwen 27b-q4 for PowerBasic.
Let's see what we get. Even without the fine-tuning, the model is really good.
Let's see if it picks up more nuance.

Finetuning.png