Fine-tuning German LLMs with Model Merging and DPO for Improving Customer Support


By Daniel Hallmann

Large Language Models (LLMs) can perform a wide range of natural language processing tasks, such as text generation, summarization, question answering, and other complex language understanding tasks. LLMs are trained on large datasets, and the amount of training text differs by language: it is largest for English, where models therefore perform best. Tasks in other languages, such as German, can yield weaker results because the model has seen less training material in that language. Fine-tuning can mitigate this, improve an LLM’s performance, and produce more capable customer-facing systems such as chatbots.

This blog post covers Model Merging and Direct Preference Optimization (DPO), two state-of-the-art approaches that can be combined to improve an LLM’s performance in a target language.

Foundations of Model Merging and DPO

Let’s start with a quick overview of both methods.

Model Merging

To begin with, a model can gain significantly when several models are consolidated into one, which fills in existing knowledge gaps [3]. By absorbing the knowledge of these diverse models, the unified model gains broader capabilities and can handle additional tasks.

Merging LLMs means integrating multiple pre-existing models into a single, efficient model. This can simplify deployment, reduce resource requirements, and enhance performance. The merging process requires an understanding of each model’s structure and weights in order to preserve learned patterns, reduce redundancy, and optimize speed. A standard method is to align similar parameters or neurons across the models. This process is complex and usually relies on dedicated algorithms and tools. The primary challenge is preserving each source model’s performance once several models have been condensed into a single one.
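To make the idea of combining aligned parameters concrete, here is a minimal sketch of the simplest merging scheme, a weighted average of parameters that share the same name and shape. It is an illustration only: the function name `linear_merge` and the toy "models" (plain dictionaries of float lists) are assumptions, and real tools operate on full tensors with more elaborate methods.

```python
from typing import Dict, List


def linear_merge(models: List[Dict[str, List[float]]],
                 weights: List[float]) -> Dict[str, List[float]]:
    """Weighted average of same-shaped parameter lists, normalised by the weight sum."""
    total = sum(weights)
    merged = {}
    for name, reference in models[0].items():
        # Average each parameter position across all models.
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy "models" with one parameter vector each (hypothetical values).
model_a = {"layer.weight": [1.0, 3.0]}
model_b = {"layer.weight": [3.0, 5.0]}
print(linear_merge([model_a, model_b], weights=[1.0, 1.0]))
# {'layer.weight': [2.0, 4.0]}
```

Plain averaging only works when the models share an architecture and a common ancestor, which is why the models merged later in this post are all based on the same 7B base model.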


DPO

DPO [4] offers a simple way to fine-tune LLMs on preference data. Unlike earlier Reinforcement Learning from Human Feedback (RLHF) methods, DPO uses a specific parameterization of the reward model that eliminates the need for a separate reinforcement learning loop. The focus thus shifts from reward functions to policies: the model is optimized directly toward human preferences without constructing an independent reward model. The resulting policy network represents both the language model and the implicit reward.

So, DPO can fine-tune the LLM directly without training an explicit reward model first. The data itself, consisting of pairs of chosen and rejected answers, steers the model’s behavior. The training procedure thus relies on an implicit reward model to find the policy that drives the optimization.

Both strategies can be combined to improve LLM performance in German. First, model merging takes a strong general-purpose LLM and merges smaller German models into it, resulting in a capable German-speaking LLM. Second, DPO takes this merged model and trains it on a preference dataset to obtain a stronger German model.
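The implicit reward can be made concrete with the DPO loss from [4] for a single preference pair. The sketch below assumes the four log-probabilities (policy and frozen reference model, each for the chosen and the rejected answer) have already been computed; the function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values taken from our training run.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each answer
    # compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet, loss = log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen answer: the loss is lower.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Minimizing this loss pushes the policy to raise the likelihood of chosen answers relative to rejected ones, which is exactly the role a separate reward model plays in RLHF.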

How it works, step by step

We walk through the model merging and DPO pipeline to show how the process works, using real-world examples to make what happens in the background more transparent. We also link the models and code repositories that we used. In detail, we cover:

  • Merging models
  • Build a DPO training dataset
  • DPO training
  • Run benchmarks

Merging models

Mergekit is a toolkit for combining pre-trained language models. Thanks to its out-of-core approach, Mergekit can perform very complex merges even when resources are limited. Merges can run entirely on a CPU or be accelerated with as little as 8 GB of VRAM. A wide range of merging algorithms is supported, with more planned as they gain traction. We used LazyMergekit, a notebook that makes it easy to merge multiple models with Mergekit. The following listing shows our merge configuration.

models:
  - model: mistralai/Mistral-7B-v0.1
    # No parameters necessary for base model
  - model: DiscoResearch/DiscoLM_German_7b_v1
    parameters:
      density: 0.6
      weight: 0.3
  - model: DRXD1000/Phoenix
    parameters:
      density: 0.6
      weight: 0.3
  - model: OpenPipe/mistral-ft-optimized-1227
    parameters:
      density: 0.6
      weight: 0.4
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  int8_mask: true
dtype: bfloat16

We started with Mistral-7B-v0.1, a pre-trained generative text model with 7.3B parameters that is easy to fine-tune on any task. In addition, we chose DiscoLM_German_7b_v1, a German generative text model trained on a substantial dataset of German instructions. It is optimized for German text and skilled at understanding, generating, and conversing in German. We also used Phoenix, a well-performing pre-trained generative text model for German. Finally, we included mistral-ft-optimized-1227 because it is intended as a strong base for downstream fine-tuning on various tasks. We ran the configuration with LazyMergekit to obtain our Merged-German-Model, which serves as the base model for DPO.
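The `dare_ties` method in the configuration combines two ideas: DARE randomly drops part of each model's difference from the base model and rescales what remains (the `density` value is the share that is kept), and TIES resolves sign conflicts between the models before adding the weighted differences back onto the base. The following is a simplified sketch on flat lists of floats, not Mergekit's actual implementation; the function name `dare_ties_merge` is hypothetical and details such as weight normalization are left out.

```python
import random
from typing import List


def dare_ties_merge(base: List[float], finetuned: List[List[float]],
                    weights: List[float], density: float, seed: int = 0) -> List[float]:
    """Simplified DARE-TIES merge of several fine-tuned models into one base model."""
    rng = random.Random(seed)
    merged = []
    for i, base_value in enumerate(base):
        deltas = []
        for model, weight in zip(finetuned, weights):
            delta = model[i] - base_value
            # DARE: keep each delta with probability `density`, rescale survivors.
            if rng.random() < density:
                deltas.append(weight * delta / density)
            else:
                deltas.append(0.0)
        # TIES: elect the dominant sign and keep only deltas that agree with it.
        total = sum(deltas)
        sign = (total > 0) - (total < 0)
        agreeing = [d for d in deltas if d * sign > 0]
        merged.append(base_value + sum(agreeing))
    return merged


base = [0.0, 0.0]
german_a = [1.0, 1.0]   # toy stand-ins for fine-tuned weights
german_b = [2.0, -3.0]
# With density=1.0 nothing is dropped, so only the sign election matters.
print(dare_ties_merge(base, [german_a, german_b], [0.5, 0.5], density=1.0))
# [1.5, -1.5]
```

In the second position the two models disagree on the direction of the change; the larger weighted delta wins and the conflicting one is discarded instead of being averaged away.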

Build a DPO training dataset

For the DPO training, we first built the German dataset used to fine-tune our base model. We started with orca_dpo_pairs, a well-curated English dataset of 12k samples, each pairing a chosen and a rejected completion for a given user prompt. The contrast between the chosen and rejected texts provides the implicit reward signal used to optimize the base model. We translated the dataset into German with Azure Machine Learning and uploaded it as intel_orca_dpo_pairs_de.
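The translation step can be pictured as mapping every text field of a preference pair through a translator. The sketch below is an assumption-laden illustration: the field names (`system`, `question`, `chosen`, `rejected`) reflect our understanding of the orca_dpo_pairs layout, and `fake_translate` is a placeholder standing in for a real translation service call, which is not shown here.

```python
from typing import Callable, Dict

# Assumed column names of one preference record.
FIELDS = ("system", "question", "chosen", "rejected")


def translate_record(record: Dict[str, str],
                     translate: Callable[[str], str]) -> Dict[str, str]:
    """Translate every text field of one preference pair; empty fields stay empty."""
    translated = {}
    for field in FIELDS:
        text = record.get(field, "")
        translated[field] = translate(text) if text else text
    return translated


def fake_translate(text: str) -> str:
    """Placeholder translator (hypothetical); a real pipeline would call a translation API."""
    return "[de] " + text


example = {
    "system": "",
    "question": "What is the capital of France?",
    "chosen": "Paris is the capital of France.",
    "rejected": "France is a country in Europe.",
}
print(translate_record(example, fake_translate))
```

Translating the chosen and rejected answers with the same system matters, because DPO learns from the contrast between the two and a quality gap introduced by translation would leak into the preference signal.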

DPO training

We used the LLaMA-Factory-de library to run the DPO training. LLaMA-Factory is an open-source Python project for fine-tuning and training large language models, such as LLaMA and Mistral, on top of PyTorch. It offers customizable configurations and lets users adjust training parameters for improved performance. The framework works with pre-trained models, which can be used directly or fine-tuned for other NLP tasks.

Listing 2 shows the DPO training command.

python src/
    --stage dpo  # run DPO tuning
    --model_name_or_path mayflowergmbh/Merged-German-Model  # the merged model
    --create_new_adapter  # create a new LoRA adapter
    --dataset orca_dpo_de  # the German DPO dataset
    --template chatml_de  # chatml_de is used as the chat template
    --finetuning_type lora  # fine-tune with LoRA
    --lora_target q_proj,v_proj  # train only the q_proj and v_proj modules to keep training cheap
    --output_dir mayflowergmbh/DiscoPhoenix-7B-dpo  # the DPO fine-tuned model
    --per_device_train_batch_size 8  # batch size: 8 samples are trained in parallel, which allocates 24 GB of RAM for Mistral
    --gradient_accumulation_steps 4
    --lr_scheduler_type cosine
    --logging_steps 10
    --save_steps 1000
    --learning_rate 5e-6
    --num_train_epochs 3.0  # 3 epochs, i.e. each dataset entry is used 3 times; 2 were not enough, and little changes after 3
    --weight_decay 0.0
    --warmup_ratio 0.1
    --use_unsloth  # Unsloth reduces time and memory requirements by a little more than half
    --quantization_bit 4  # QLoRA quantization, i.e. only about 1/4 as much memory is used

This listing details the DPO tuning run with mayflowergmbh/Merged-German-Model as the base model. It creates a new LoRA adapter and uses our DPO dataset intel_orca_dpo_pairs_de. We used chatml_de as the chat template, which contains the human and system prompts. Training targeted only the q_proj and v_proj modules to lower training costs. We trained for three epochs, so each dataset entry was used three times. The batch size was eight samples per device, which consumed 24 GB of RAM. Unsloth cut the time and memory requirements by slightly more than half. Lastly, QLoRA 4-bit quantization reduced memory usage by about 75%.

We published our DPO fine-tuned model in our Hugging Face repository.
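The batch size, gradient accumulation, warmup ratio, and cosine scheduler interact, so it helps to work out what they imply. The sketch below is a back-of-the-envelope calculation under stated assumptions: exactly 12,000 training samples (from the "12k" dataset size), a single GPU, and straightforward rounding. The helper names are hypothetical, and a real training framework may round steps slightly differently.

```python
import math


def training_plan(num_samples, per_device_batch, grad_accum, epochs,
                  warmup_ratio, num_devices=1):
    """Derive the effective batch size and step counts from the training flags."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = int(steps_per_epoch * epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


plan = training_plan(num_samples=12_000, per_device_batch=8, grad_accum=4,
                     epochs=3, warmup_ratio=0.1)
print(plan)
# (32, 375, 1125, 113)

effective_batch, steps_per_epoch, total_steps, warmup_steps = plan
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, warmup_steps, peak_lr=5e-6))
```

Under these assumptions each optimizer update sees 32 preference pairs, and the learning rate peaks at 5e-6 after roughly the first tenth of training before decaying smoothly to zero.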

Run benchmarks

Finally, we tested the performance and effectiveness of our fine-tuned model. We used MT-Bench [5], a challenging multi-turn benchmark that measures the ability of LLMs to hold coherent, informative, and engaging conversations. We ran it with FastEval and the German mt-bench-de version.

Listing 3 shows our benchmark results.

    Metric         Model 1    Model 2    Model 3 (DPO fine-tuned)
    first_turn     6.39375    7.350      7.335
    second_turn    5.1625     5.875      6.65
    writing        7.45       7.525      8.7
    roleplay       7.9        8.025      7.605
    reasoning      4.3        6.5        5.75
    math           3.25       4.55       3.3
    coding         2.5        3.6        5.3
    extraction     5.9        5.45       7.55
    stem           7.125      8.55       8.4
    humanities     7.8        8.7        9.35
    average        5.778      6.613      6.993

Our DPO fine-tuned model (last column) achieves better results in six out of eight categories. On average, performance increased by 16% in writing, 6% in reasoning, 74% in coding, 12% in extraction, 7% in STEM, and 13% in humanities, for an overall improvement of 13%. Performance decreased by 4% in roleplay and 15% in math. A possible reason for these decreases is that the fine-tuning focused on writing-oriented tasks. Optimizing a model for a specific task can affect its behaviour on a different task, e.g., roleplay, because changing the model weights can erode the model’s embedded general knowledge and skills [2]. In addition, we assume the weaker math performance is related to the general difficulties LLMs have with mathematical tasks [1].
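The category percentages can be sanity-checked with a few lines of Python. This sketch assumes that each percentage compares the DPO model's score with the mean of the two other models in the benchmark listing; that reading is an assumption, and the helper name `relative_change` is hypothetical. Under this assumption the calculation reproduces, for example, the quoted figures for writing, coding, and math.

```python
# Category scores from the benchmark listing:
# (first model, second model, DPO fine-tuned model).
scores = {
    "writing": (7.45, 7.525, 8.7),
    "roleplay": (7.9, 8.025, 7.605),
    "reasoning": (4.3, 6.5, 5.75),
    "math": (3.25, 4.55, 3.3),
    "coding": (2.5, 3.6, 5.3),
    "extraction": (5.9, 5.45, 7.55),
    "stem": (7.125, 8.55, 8.4),
    "humanities": (7.8, 8.7, 9.35),
}


def relative_change(first: float, second: float, tuned: float) -> float:
    """Percentage change of `tuned` against the mean of the two other scores."""
    baseline = (first + second) / 2  # assumed comparison baseline
    return (tuned / baseline - 1.0) * 100.0


changes = {name: relative_change(*values) for name, values in scores.items()}
for name, pct in changes.items():
    print(f"{name:<11} {pct:+6.1f}%")
```

Reporting the comparison baseline alongside such percentages makes results easier to reproduce, since a change relative to one model can differ considerably from a change relative to an average of models.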

Best practices

We identified a few best practices during our fine-tuning.

Model Merging:

  1. Select base models that have been pre-trained on wide-ranging and comprehensive datasets, as this provides a solid foundation for fine-tuning. Ideally, choose models that are robust against forgetting knowledge, so that performance stays stable across all desired tasks.
  2. The weighting between the old and the new tasks is crucial in model merging. Pay close attention to it so that the merged model retains knowledge from the original base model while also learning the new task.


DPO fine-tuning:

  1. Use a specific dataset for fine-tuning that closely resembles the target task. In some cases, you may want to generate custom synthetic data.
  2. Choose a lower learning rate during fine-tuning. This lets the model adjust slowly and avoids drastic parameter changes that might cause it to forget previously learned knowledge.

In general:

  1. Continuously monitor models after model merging and fine-tuning to check for any potential drifts or anomalies that need to be corrected.
  2. Consider balancing the trade-off between capabilities for general and specific fine-tuned skills.

Summary and future directions

LLMs can perform worse in languages for which fewer texts were available during training. Fine-tuning can help optimize model performance in these languages. In this blog post, we covered the combination of Model Merging and DPO as techniques to overcome these limitations and built an optimized German model that outperforms the base models in six out of eight categories. Fine-tuned models can help mitigate language-specific issues, e.g., in text generation, and provide more capable systems that can positively influence the customer experience. Fine-tuning LLMs is an active field of research. Future developments include enhanced multilingual capabilities, better performance on unseen data, reduced use of computational resources, increased explainability, bias detection, and safe usage. So, stay tuned to keep up with these cutting-edge techniques.

[1] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, W. Yin. (2024) Large Language Models for Mathematical Reasoning: Progresses and Challenges.

[2] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang. (2023) An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.

[3] M. McQuade, C. Goddard. (2024) Arcee and mergekit unite. Last accessed: 25 March 2024.

[4] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn. (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint.

[5] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica. (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint.
