fine-tuning

## for google colab, uncomment these lines to install the libraries:
# %%capture
# %pip install transformers trl datasets
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from trl import SFTTrainer, SFTConfig
# load the tokenizer and the base model, then the dataset we'll fine-tune on
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
dataset = load_dataset("gofilipa/bedtime_stories")

Padding is necessary to account for different sizes of text in our dataset.

From the docs on 🤗: “Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences. In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well.”
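As a quick illustration (just a sketch, using the tokenizer we loaded above and two made-up sentences; the next cell sets the pad token properly), padding turns a ragged batch into one rectangular tensor:

batch = ["Once upon a time", "There once was a little girl who loved to look at the stars"]

tokenizer.pad_token = tokenizer.eos_token # GPT-Neo has no pad token, so reuse the end-of-sequence token

encoded = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # both rows now have the length of the longest sequence
print(encoded["attention_mask"])    # the 0s mark the padded positions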

# GPT-Neo has no pad token of its own, so reuse the end-of-sequence token and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
training_params = SFTConfig(
    output_dir="./results",
    num_train_epochs = 3, # how many times we iterate over the dataset as a whole
    learning_rate = 2e-4, # how big a "step" we take when adjusting the parameters to reduce the loss
    weight_decay = 0.001, # a way of regularizing (shrinking) the parameters
    dataset_text_field = "stories", # the column in our dataset that holds the text
    report_to="none" # avoids a prompt to log in to W&B
)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    processing_class = tokenizer,
    args = training_params
)
### commenting out this line so it doesn't re-run when I create this website (the output below is from an earlier run)

# trainer.train()
[75/75 00:48, Epoch 3/3]
Step Training Loss

TrainOutput(global_step=75, training_loss=1.781717529296875, metrics={'train_runtime': 54.1603, 'train_samples_per_second': 11.023, 'train_steps_per_second': 1.385, 'total_flos': 40251401496576.0, 'train_loss': 1.781717529296875})
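A quick note on those numbers: 75 steps over 3 epochs is 25 optimizer steps per epoch, and since we never set a batch size in SFTConfig, the trainer uses its default of 8 examples per step, which suggests the training split holds roughly 200 stories.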

What’s happening in the training process? Basically, the process involves three pieces (sketched in toy code below this list):

  • a hypothesis function, which makes a guess as to what word to generate following a given word

  • a loss function, which calculates the difference between the guess and the actual word

  • gradient descent, which very gradually updates the parameters so as to minimize the loss
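Here is a minimal toy sketch of those three pieces in plain PyTorch, using made-up numbers rather than the real model (in the actual fine-tune, the guess is a probability distribution over the next token, the loss is cross-entropy, and the trainer repeats this update for every batch of stories):

import torch

weights = torch.randn(10, requires_grad=True)  # a stand-in for the model's parameters
inputs = torch.randn(10)                       # a stand-in for the tokenized text
target = torch.tensor(2.0)                     # the "actual" answer we want to predict
learning_rate = 2e-4

for step in range(3):
    guess = (weights * inputs).sum()      # 1. hypothesis: a guess from the current parameters
    loss = (guess - target) ** 2          # 2. loss: how far off was the guess?
    loss.backward()                       #    compute gradients of the loss w.r.t. the parameters
    with torch.no_grad():
        weights -= learning_rate * weights.grad  # 3. gradient descent: a tiny step downhill
        weights.grad.zero_()
    print(step, loss.item())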

running inference on our new model

# commenting out this code so that it doesn't run (when I create this website)!

### first, save the model to a folder called "models" 
# trainer.model.save_pretrained("models")
# trainer.tokenizer.save_pretrained("models") # trainer.processing_class works here too and isn't deprecated

### then, load our model from that folder
# model = AutoModelForCausalLM.from_pretrained("./models")
# tokenizer = AutoTokenizer.from_pretrained("./models")

### create a pipe() function that calls our new model
# pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=50)

### run inference!
pipe("There once was a little girl named Filipa and")
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
[{'generated_text': 'There once was a little girl named Filipa and her momma. Every night, she would go outside and look up at the stars in the sky. One night, she saw a magical rainbow that would bring her all the joy she wanted. She'}]