3.4.8 Set Parameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./keywords_gemma_results",  # Directory to save model checkpoints and logs.
    max_steps=800,  # Total number of training steps to perform; this overrides `num_train_epochs`.
    per_device_train_batch_size=4,  # Training batch size on each GPU/TPU.
                                    # With multiple GPUs, the effective batch size is multiplied accordingly.
    per_device_eval_batch_size=8,  # Evaluation batch size on each GPU/TPU.
                                   # A larger evaluation batch size can speed up evaluation.
    warmup_steps=0,  # Number of steps over which the learning rate ramps up from 0 to `learning_rate`
                     # before the scheduler's decay begins. Set to 0 here, meaning no warmup phase.
    weight_decay=0.01,  # Weight decay (L2 regularization) to help reduce overfitting.
    learning_rate=2e-4,  # Initial learning rate. This is relatively high;
                         # adjust it based on your dataset size and model behavior.
    logging_dir="./logs",  # Directory for log files (e.g., for TensorBoard).
    logging_steps=100,  # Log metrics every 100 steps to track progress.
    report_to="wandb",  # Send logs and metrics to Weights & Biases.
)

3.4.9 Define Evaluation Metrics

Weights & Biases (W&B)

Key Features of W&B:

  1. Experiment Tracking:
  • Log metrics (e.g., loss, accuracy) during training and evaluation (a minimal sketch follows this list).
  • Save and visualize model parameters, gradients, and other information.
  2. Hyperparameter Management:
  • Log and analyze hyperparameter values for experiments.
  • Compare experiments with different configurations to identify optimal parameters.
  3. Model Versioning:
  • Track different versions of your model during development.
  • Save and restore model weights easily.
  4. Visualization:
  • Interactive plots for training and validation metrics.
  • Tools like loss curves, confusion matrices, and custom visualizations.
  5. Collaboration:
  • Share experiment dashboards with team members or publicly.
  • Tag experiments for organization and better searchability.
  6. Scalable Logging:
  • Handle experiments at scale, suitable for large datasets and complex pipelines.
  7. Integrations:
  • Works with Python scripts and Jupyter Notebooks.
  • Compatible with cloud platforms like AWS, Google Cloud, and Azure.
  8. Sweeps:
  • Automates hyperparameter tuning using grid search, random search, or Bayesian optimization.
  9. Data Management:
  • Supports logging raw datasets, intermediate results, and processed outputs for reproducibility.
  10. Reports:
  • Allows creating detailed reports combining text, code, and visualizations for documentation and presentations.
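
To make the experiment-tracking feature above concrete, here is a minimal sketch of logging metrics to W&B by hand. It assumes the wandb package is installed and you are logged in; the project name and metric values are placeholders, not the ones used in this series.

import wandb

# Assumption: `wandb login` (or the WANDB_API_KEY environment variable) has been set up beforehand.
run = wandb.init(
    project="keywords-gemma",  # hypothetical project name
    config={"learning_rate": 2e-4, "max_steps": 800},  # hyperparameters to track
)

# Inside (or after) a training loop, log whatever metrics you compute.
for step in range(100, 801, 100):
    wandb.log({"train/loss": 2.0 / step, "step": step})  # dummy values

run.finish()  # flush and close the run

When report_to="wandb" is set in TrainingArguments, the Hugging Face Trainer performs this logging automatically; the explicit calls above only show what happens behind the scenes.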

3.4.10 Model Training and Evaluation
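
The training code itself is not reproduced in this excerpt, so here is a minimal sketch of how the training_args from 3.4.8 are typically wired into a Hugging Face Trainer. The names model, tokenizer, train_dataset, and eval_dataset are assumed to come from the earlier preparation steps (a trl SFTTrainer or a PEFT/LoRA setup could be used instead).

from transformers import Trainer, DataCollatorForLanguageModeling

# Assumption: `model`, `tokenizer`, `train_dataset`, and `eval_dataset`
# were created in the previous steps of this series.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined in 3.4.8
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

trainer.train()                   # runs for max_steps=800, logging to W&B
metrics = trainer.evaluate()      # returns a dict of evaluation metrics
print(metrics)

The dictionary returned by trainer.evaluate() contains keys such as eval_loss (and, if a compute_metrics function is supplied, entries like eval_bleu or eval_accuracy), which are explained below.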

eval_loss

What it is:

• eval_loss represents the average loss computed on the evaluation (or validation) dataset.

• Loss functions measure how well a model’s predictions match the true labels. Lower loss indicates better performance.

Why it’s used:

• It provides a single scalar value summarizing model performance. This is particularly useful for tasks like regression, classification, or sequence-to-sequence problems.

Common Loss Functions:

• Mean Squared Error (MSE) for regression.

• Cross-Entropy Loss for classification.

• Negative Log-Likelihood for probabilistic models.

Example:

• In natural language processing (NLP), if the model is predicting tokens in a sequence, eval_loss could be the cross-entropy loss averaged over the evaluation dataset.
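
As an illustration of how a token-level cross-entropy loss such as eval_loss is computed, here is a small sketch in PyTorch; the logits and labels are random dummy values, not real model outputs.

import torch
import torch.nn.functional as F

# Dummy example: batch of 2 sequences, 4 positions each, vocabulary of 10 tokens.
logits = torch.randn(2, 4, 10)         # unnormalized model scores
labels = torch.randint(0, 10, (2, 4))  # "true" next-token ids

# cross_entropy expects (N, C) logits and (N,) targets, so the batch and
# sequence dimensions are flattened; positions labeled -100 would be ignored.
loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(loss.item())  # averaging this over the whole evaluation set gives eval_loss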

eval_bleu

What it is:

• eval_bleu refers to the BLEU (Bilingual Evaluation Understudy) score, a metric used to evaluate text generation models (e.g., machine translation, text summarization).

• BLEU measures how similar the generated text is to one or more reference texts based on overlapping n-grams (unigrams, bigrams, etc.).

• The score ranges from 0 to 1 (or 0 to 100 when expressed as a percentage), where higher is better.

Why it’s used:

• It quantitatively evaluates how “fluent” or “accurate” the generated text is compared to a human-provided reference.

Limitations:

• It doesn’t account for semantic similarity—text can have different words but mean the same thing.

Example:

• A BLEU score of 0.85 indicates very high n-gram overlap with the reference. Strictly speaking, BLEU is the geometric mean of modified n-gram precisions multiplied by a brevity penalty, so it is not literally the percentage of matching n-grams.
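
The sketch below shows one way to compute a BLEU score with the Hugging Face evaluate library; the candidate and reference sentences are invented for illustration, and each prediction may have several references, hence the nested list.

import evaluate

predictions = ["a machine learning model extracts keywords from text"]
references = [["the machine learning model extracts keywords from the text"]]

bleu = evaluate.load("bleu")
result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])  # a value between 0 and 1; higher is better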

eval_accuracy

What it is:

• eval_accuracy measures the percentage of correct predictions made by the model on the evaluation dataset.

Why it’s used:

• Accuracy is a straightforward metric, especially for tasks like binary or multi-class classification, where the predictions are categorical labels.

Limitations:

• It doesn’t account for class imbalance. For example, if 99% of the data belongs to one class, always predicting the majority class yields 99% accuracy even though the model is useless for the minority class.

Example:

• If a model correctly classifies 90 out of 100 samples in the evaluation dataset, eval_accuracy would be 90%.
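
A small sketch of computing accuracy with the evaluate library is shown below, using made-up predictions and labels; in a classification setup this logic would typically live inside a compute_metrics function passed to the Trainer.

import evaluate

accuracy = evaluate.load("accuracy")

# Dummy predicted class ids and true labels for 10 samples.
predictions = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
labels = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

result = accuracy.compute(predictions=predictions, references=labels)
print(result["accuracy"])  # 0.8 -> 8 of the 10 samples were classified correctly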

3.4.11 Test Fine-Tuned Model

Experiment with the number of training epochs (or max_steps) to get better results.
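
A minimal sketch for testing the fine-tuned model follows. The checkpoint path reuses the output_dir from 3.4.8, but the exact checkpoint subfolder, prompt template, and generation settings are assumptions; if the model was fine-tuned with LoRA/PEFT adapters, the adapter would need to be loaded with the peft library instead.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the final checkpoint was saved under the output_dir from 3.4.8.
checkpoint = "./keywords_gemma_results"  # adjust to the actual checkpoint folder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Hypothetical prompt; use the same prompt template as during fine-tuning.
prompt = "Extract keywords from the following text: Large language models ..."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))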
