3.4.8 Set Parameters
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./keywords_gemma_results",  # Directory for model checkpoints and logs.
    max_steps=800,                          # Total number of training steps; overrides `num_train_epochs`.
    per_device_train_batch_size=4,          # Training batch size per GPU/TPU.
                                            # With multiple GPUs, the effective batch size is multiplied accordingly.
    per_device_eval_batch_size=8,           # Evaluation batch size per GPU/TPU.
                                            # A larger evaluation batch size can speed up evaluation.
    warmup_steps=0,                         # Number of steps over which the learning rate ramps up from 0
                                            # to `learning_rate`. Set to 0 here, meaning no warmup phase.
    weight_decay=0.01,                      # Weight decay (L2 regularization) to help reduce overfitting.
    learning_rate=2e-4,                     # Initial learning rate. 2e-4 is relatively high; adjust based on
                                            # dataset size and model behavior.
    logging_dir="./logs",                   # Directory for log files (e.g., TensorBoard).
    logging_steps=100,                      # Log metrics every 100 steps to track progress.
    report_to="wandb",                      # Send logs and metrics to Weights & Biases.
)
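Because report_to="wandb" is set, Weights & Biases has to be authenticated before training starts, or the run will stop and prompt for a login. A minimal setup sketch; the project name here is an assumption, not from the book:

import os
import wandb

# Log in once per environment (uses WANDB_API_KEY if it is already set).
wandb.login()

# The Transformers W&B callback reads the project name from this variable.
os.environ["WANDB_PROJECT"] = "keywords-gemma"  # hypothetical project name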
3.4.9 Define Evaluation Metrics
Weights & Biases (W&B)
Key Features of W&B (a minimal usage sketch follows this list):
- Experiment Tracking:
- Log metrics (e.g., loss, accuracy) during training and evaluation.
- Save and visualize model parameters, gradients, and other information.
- Hyperparameter Management:
- Log and analyze hyperparameter values for experiments.
- Compare experiments with different configurations to identify optimal parameters.
- Model Versioning:
- Track different versions of your model during development.
- Save and restore model weights easily.
- Visualization:
- Interactive plots for training and validation metrics.
- Tools like loss curves, confusion matrices, and custom visualizations.
- Collaboration:
- Share experiment dashboards with team members or publicly.
- Tag experiments for organization and better searchability.
- Scalable Logging:
- Handle experiments at scale, suitable for large datasets and complex pipelines.
- Integrations:
- Works with Python scripts and Jupyter Notebooks.
- Compatible with cloud platforms like AWS, Google Cloud, and Azure.
- Sweeps:
- Automates hyperparameter tuning using grid search, random search, or Bayesian optimization.
- Data Management:
- Supports logging raw datasets, intermediate results, and processed outputs for reproducibility.
- Reports:
- Allows creating detailed reports combining text, code, and visualizations for documentation and presentations.
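The experiment-tracking workflow described above comes down to three calls: initialize a run, log metrics, and finish. A minimal standalone sketch; the project name, config values, and metric names are placeholders:

import wandb

# Start a run; the config values appear in the W&B dashboard for comparison across experiments.
run = wandb.init(project="keywords-gemma", config={"learning_rate": 2e-4, "max_steps": 800})

for step in range(3):  # stand-in for a real training loop
    wandb.log({"train/loss": 1.0 / (step + 1)}, step=step)

run.finish()

When report_to="wandb" is used with the Trainer, this logging happens automatically through a callback; the sketch only shows what is going on underneath.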
3.4.10 Model Training and Evaluation
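With the arguments from 3.4.8, training and evaluation reduce to two calls on the trainer. A sketch assuming a Trainer (or SFTTrainer) named trainer has already been built with the model, tokenizer, datasets, and training_args:

# trainer is assumed to already wrap the model, datasets, and training_args.
trainer.train()               # runs for max_steps=800, logging every 100 steps
metrics = trainer.evaluate()  # returns a dict of evaluation metrics
print(metrics)                # e.g. eval_loss plus anything produced by compute_metrics

The keys of that dictionary are the metrics explained below.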
eval_loss
What it is:
• eval_loss represents the average loss computed on the evaluation (or validation) dataset.
• Loss functions measure how well a model’s predictions match the true labels. Lower loss indicates better performance.
Why it’s used:
• It provides a single scalar value summarizing model performance. This is particularly useful for tasks like regression, classification, or sequence-to-sequence problems.
Common Loss Functions:
• Mean Squared Error (MSE) for regression.
• Cross-Entropy Loss for classification.
• Negative Log-Likelihood for probabilistic models.
Example:
• In natural language processing (NLP), if the model is predicting tokens in a sequence, eval_loss could be the cross-entropy loss averaged over the evaluation dataset.
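As a concrete illustration of the token-level cross-entropy case, PyTorch computes the loss directly from logits and target token ids. This is a generic sketch with dummy tensors, not the book's code:

import torch
import torch.nn.functional as F

# Dummy logits for a batch of 2 sequences, 4 positions, vocabulary of 10 tokens.
logits = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))

# cross_entropy expects (N, C) logits, so flatten the batch and sequence dimensions.
eval_loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(eval_loss.item())  # lower is better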
eval_bleu
What it is:
• eval_bleu refers to the BLEU (Bilingual Evaluation Understudy) score, a metric used to evaluate text generation models (e.g., machine translation, text summarization).
• BLEU measures how similar the generated text is to one or more reference texts based on overlapping n-grams (unigrams, bigrams, etc.).
• The score ranges from 0 to 1 (or 0 to 100% if expressed as a percentage), where higher is better.
Why it’s used:
• It quantitatively evaluates how “fluent” or “accurate” the generated text is compared to a human-provided reference.
Limitations:
• It doesn’t account for semantic similarity—text can have different words but mean the same thing.
Example:
• A BLEU score of 0.85 indicates very high n-gram overlap with the reference. Strictly, BLEU is a geometric mean of modified n-gram precisions combined with a brevity penalty, so it is not literally the percentage of matching n-grams.
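A quick way to compute BLEU in the Hugging Face ecosystem is the evaluate library; the sentences below are made-up examples:

import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one or more references per prediction

result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])  # value between 0 and 1, higher is better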
eval_accuracy
What it is:
• eval_accuracy measures the percentage of correct predictions made by the model on the evaluation dataset.
Why it’s used:
• Accuracy is a straightforward metric, especially for tasks like binary or multi-class classification, where the predictions are categorical labels.
Limitations:
• It doesn’t account for class imbalance. For example, if 99% of the data belongs to one class, always predicting the majority class yields high accuracy even though the model is poor.
Example:
• If a model correctly classifies 90 out of 100 samples in the evaluation dataset, eval_accuracy would be 90%.
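For classification-style evaluation, a compute_metrics function passed to the Trainer typically reduces logits to label ids and compares them with the references; the Trainer then reports the result as eval_accuracy. A generic sketch, not tied to the keyword-extraction task:

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # pick the highest-scoring class per example
    accuracy = (predictions == labels).mean()  # fraction of correct predictions
    return {"accuracy": accuracy}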
3.4.11 Test the Fine-Tuned Model
Experiment with more training epochs (or a larger max_steps) to get better results.
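To sanity-check the fine-tuned checkpoint, load it back and generate on a held-out prompt. A hedged sketch: the path comes from output_dir above, but the prompt format is an assumption, and if training used LoRA/PEFT adapters they would need to be loaded with the peft library instead.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./keywords_gemma_results"  # a specific checkpoint-XXX subfolder may be needed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Extract keywords: ..."  # assumed prompt format
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))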