3.4.8 Set Parameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./keywords_gemma_results",  # Directory to save model checkpoints and logs.
    max_steps=800,  # Total number of training steps to perform; this overrides `num_train_epochs`.
    per_device_train_batch_size=4,  # Training batch size on each GPU/TPU.
                                    # With multiple GPUs, the effective batch size is multiplied accordingly.
    per_device_eval_batch_size=8,  # Evaluation batch size on each GPU/TPU.
                                   # A larger evaluation batch size can speed up evaluation.
    warmup_steps=0,  # Number of steps over which the learning rate ramps up from 0 to `learning_rate`
                     # before the scheduler's decay begins. Set to 0 here, meaning no warmup phase.
    weight_decay=0.01,  # Weight decay (L2 regularization) to help reduce overfitting.
    learning_rate=2e-4,  # Initial learning rate. This is relatively high;
                         # adjust it based on your dataset size and model behavior.
    logging_dir="./logs",  # Directory for log files (e.g., for TensorBoard).
    logging_steps=100,  # Log metrics every 100 steps to track progress.
    report_to="wandb",  # Send logs and metrics to Weights & Biases.
)

3.4.9 Define Evaluation Metrics

Weights & Biases (W&B)

Key Features of W&B:

  1. Experiment Tracking:
  • Log metrics (e.g., loss, accuracy) during training and evaluation (a minimal sketch follows this list).
  • Save and visualize model parameters, gradients, and other information.
  2. Hyperparameter Management:
  • Log and analyze hyperparameter values for experiments.
  • Compare experiments with different configurations to identify optimal parameters.
  3. Model Versioning:
  • Track different versions of your model during development.
  • Save and restore model weights easily.
  4. Visualization:
  • Interactive plots for training and validation metrics.
  • Tools like loss curves, confusion matrices, and custom visualizations.
  5. Collaboration:
  • Share experiment dashboards with team members or publicly.
  • Tag experiments for organization and better searchability.
  6. Scalable Logging:
  • Handle experiments at scale, suitable for large datasets and complex pipelines.
  7. Integrations:
  • Works with Python scripts and Jupyter Notebooks.
  • Compatible with cloud platforms like AWS, Google Cloud, and Azure.
  8. Sweeps:
  • Automates hyperparameter tuning using grid search, random search, or Bayesian optimization.
  9. Data Management:
  • Supports logging raw datasets, intermediate results, and processed outputs for reproducibility.
  10. Reports:
  • Allows creating detailed reports combining text, code, and visualizations for documentation and presentations.
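
To make the experiment-tracking feature above concrete, here is a minimal sketch of logging metrics to W&B by hand. It assumes the wandb package is installed and you are logged in; the project name and metric values are placeholders, not the ones used in this series.

import wandb

# Assumption: `wandb login` (or the WANDB_API_KEY environment variable) has been set up beforehand.
run = wandb.init(
    project="keywords-gemma",  # hypothetical project name
    config={"learning_rate": 2e-4, "max_steps": 800},  # hyperparameters to track
)

# Inside (or after) a training loop, log whatever metrics you compute.
for step in range(100, 801, 100):
    wandb.log({"train/loss": 2.0 / step, "step": step})  # dummy values

run.finish()  # flush and close the run

When report_to="wandb" is set in TrainingArguments, the Hugging Face Trainer performs this logging automatically; the explicit calls above only show what happens behind the scenes.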

3.4.10 Model Training and Evaluation
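
The training code itself is not reproduced in this excerpt, so here is a minimal sketch of how the training_args from 3.4.8 are typically wired into a Hugging Face Trainer. The names model, tokenizer, train_dataset, and eval_dataset are assumed to come from the earlier preparation steps (a trl SFTTrainer or a PEFT/LoRA setup could be used instead).

from transformers import Trainer, DataCollatorForLanguageModeling

# Assumption: `model`, `tokenizer`, `train_dataset`, and `eval_dataset`
# were created in the previous steps of this series.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined in 3.4.8
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

trainer.train()                   # runs for max_steps=800, logging to W&B
metrics = trainer.evaluate()      # returns a dict of evaluation metrics
print(metrics)

The dictionary returned by trainer.evaluate() contains keys such as eval_loss (and, if a compute_metrics function is supplied, entries like eval_bleu or eval_accuracy), which are explained below.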

eval_loss

What it is:

• eval_loss represents the average loss computed on the evaluation (or validation) dataset.

• Loss functions measure how well a model’s predictions match the true labels. Lower loss indicates better performance.

Why it’s used:

• It provides a single scalar value summarizing model performance. This is particularly useful for tasks like regression, classification, or sequence-to-sequence problems.

Common Loss Functions:

• Mean Squared Error (MSE) for regression.

• Cross-Entropy Loss for classification.

• Negative Log-Likelihood for probabilistic models.

Example:

• In natural language processing (NLP), if the model is predicting tokens in a sequence, eval_loss could be the cross-entropy loss averaged over the evaluation dataset.
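
As an illustration of how a token-level cross-entropy loss such as eval_loss is computed, here is a small sketch in PyTorch; the logits and labels are random dummy values, not real model outputs.

import torch
import torch.nn.functional as F

# Dummy example: batch of 2 sequences, 4 positions each, vocabulary of 10 tokens.
logits = torch.randn(2, 4, 10)         # unnormalized model scores
labels = torch.randint(0, 10, (2, 4))  # "true" next-token ids

# cross_entropy expects (N, C) logits and (N,) targets, so the batch and
# sequence dimensions are flattened; positions labeled -100 would be ignored.
loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(loss.item())  # averaging this over the whole evaluation set gives eval_loss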

eval_bleu

What it is:

• eval_bleu refers to the BLEU (Bilingual Evaluation Understudy) score, a metric used to evaluate text generation models (e.g., machine translation, text summarization).

• BLEU measures how similar the generated text is to one or more reference texts based on overlapping n-grams (unigrams, bigrams, etc.).

• The score ranges from 0 to 1 (or 0 to 100 when expressed as a percentage), where higher is better.

Why it’s used:

• It quantitatively evaluates how “fluent” or “accurate” the generated text is compared to a human-provided reference.

Limitations:

• It doesn’t account for semantic similarity—text can have different words but mean the same thing.

Example:

• A BLEU score of 0.85 indicates very high n-gram overlap with the reference. Strictly speaking, BLEU is the geometric mean of modified n-gram precisions multiplied by a brevity penalty, so it is not literally the percentage of matching n-grams.
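
The sketch below shows one way to compute a BLEU score with the Hugging Face evaluate library; the candidate and reference sentences are invented for illustration, and each prediction may have several references, hence the nested list.

import evaluate

predictions = ["a machine learning model extracts keywords from text"]
references = [["the machine learning model extracts keywords from the text"]]

bleu = evaluate.load("bleu")
result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])  # a value between 0 and 1; higher is better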

eval_accuracy

What it is:

• eval_accuracy measures the percentage of correct predictions made by the model on the evaluation dataset.

Why it’s used:

• Accuracy is a straightforward metric, especially for tasks like binary or multi-class classification, where the predictions are categorical labels.

Limitations:

• It doesn’t account for class imbalance. For example, if 99% of the data belongs to one class, always predicting the majority class yields 99% accuracy even though the model is useless for the minority class.

Example:

• If a model correctly classifies 90 out of 100 samples in the evaluation dataset, eval_accuracy would be 90%.
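
A small sketch of computing accuracy with the evaluate library is shown below, using made-up predictions and labels; in a classification setup this logic would typically live inside a compute_metrics function passed to the Trainer.

import evaluate

accuracy = evaluate.load("accuracy")

# Dummy predicted class ids and true labels for 10 samples.
predictions = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
labels = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

result = accuracy.compute(predictions=predictions, references=labels)
print(result["accuracy"])  # 0.8 -> 8 of the 10 samples were classified correctly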

3.4.11 Test Fine-Tuned Model

Experiment with the number of training epochs (or max_steps) to get better results.
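
A minimal sketch for testing the fine-tuned model follows. The checkpoint path reuses the output_dir from 3.4.8, but the exact checkpoint subfolder, prompt template, and generation settings are assumptions; if the model was fine-tuned with LoRA/PEFT adapters, the adapter would need to be loaded with the peft library instead.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the final checkpoint was saved under the output_dir from 3.4.8.
checkpoint = "./keywords_gemma_results"  # adjust to the actual checkpoint folder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Hypothetical prompt; use the same prompt template as during fine-tuning.
prompt = "Extract keywords from the following text: Large language models ..."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))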
