fine-tuning

# %%capture
# %pip install transformers trl datasets
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from trl import SFTTrainer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
dataset = load_dataset("gofilipa/gender_congress_117-118")
Found cached dataset csv (/Users/caladof/.cache/huggingface/datasets/gofilipa___csv/gofilipa--gender_congress_117-118-304e9fdc48b3d0d4/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
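
Before doing anything else, it’s worth peeking at what we just loaded. The lines below are a quick check (assuming the dataset has a train split with a definitions column, which is the field we point the trainer at later): they show the dataset’s structure and one example row.

# inspect the dataset: its splits and columns, plus one example row
# (assumes a "train" split with a "definitions" column, used by the trainer below)
dataset
dataset['train']['definitions'][0]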

Padding is necessary to account for different sizes of text in our dataset.

From the 🤗 docs: Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences. In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well.

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
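
To see these settings in action, we can tokenize two sentences of different lengths and let the tokenizer pad the shorter one. The sentences below are made up for illustration; the point is just the rectangular shape of the output.

# padding in action: two sentences of different lengths become one
# rectangular batch, with the shorter one padded using the eos token
batch = tokenizer(
    ["A short sentence.", "A somewhat longer sentence with quite a few more tokens in it."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # both rows now have the same length
print(batch["attention_mask"])    # 0s mark the padded positions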
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs = 3, # how many times we iterate over the dataset as a whole
    learning_rate = 2e-4, # how big a "step" we take when adjusting the parameters to reduce the loss
    weight_decay = 0.001, # shrinks the parameters a little at each step, a way of regularizing the model
)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    dataset_text_field = "definitions",
    tokenizer = tokenizer,
    args = training_params
)
/Users/caladof/anaconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:246: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(
Loading cached processed dataset at /Users/caladof/.cache/huggingface/datasets/gofilipa___csv/gofilipa--gender_congress_117-118-304e9fdc48b3d0d4/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-3618ad3aa0d944c2.arrow
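
The warning above is harmless: it just means the trainer will cut examples down to 1024 tokens by default. If we wanted to set that limit ourselves (these definitions are short, so a smaller value would do), we could pass max_seq_length when building the trainer; a sketch, with 512 as an arbitrary choice:

# optional: pass max_seq_length explicitly to silence the warning
# (512 is an arbitrary illustrative value)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    dataset_text_field = "definitions",
    tokenizer = tokenizer,
    args = training_params,
    max_seq_length = 512,
)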
trainer.train()

What’s happening in the training process? Basically, it comes down to three pieces (a toy sketch of one update step follows the list):

  • a hypothesis function, which guesses what word should come next after a given word

  • a loss function, which measures how far that guess is from the actual next word

  • gradient descent, which very slowly nudges the model’s parameters in whatever direction shrinks the loss
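
The sketch below is not the trainer’s actual code; it just plays out that loop on a single made-up parameter, with learning_rate doing the same job as the learning rate we set in TrainingArguments.

# a toy version of the training loop, for illustration only (not the trainer's real code)
w = 0.0              # the single "parameter" we are learning
target = 3.0         # the value the hypothesis should eventually produce
learning_rate = 2e-4

for step in range(1000):
    guess = w                          # hypothesis: our current prediction
    loss = (guess - target) ** 2       # loss: squared distance from the target
    gradient = 2 * (guess - target)    # slope of the loss with respect to w
    w = w - learning_rate * gradient   # gradient descent: a tiny step downhill

print(w)  # w inches toward 3.0; real training does this for millions of parameters at once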

# save the fine-tuned model and tokenizer to a local folder
trainer.model.save_pretrained("models")
trainer.tokenizer.save_pretrained("models")

# reload them from that folder and try generating text with the fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./models")
tokenizer = AutoTokenizer.from_pretrained("./models")
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=50)
pipe("To affirm a person's gender means")

The results aren’t great. To get better results, we’d need to adjust the hyperparameters (primarily the learning rate and the number of epochs) in the TrainingArguments we set up at the top.
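
As one possible starting point (a sketch, not a tuned recipe), we might give the model more passes over the data and a gentler learning rate, then rerun the training cells above:

# illustrative values only, not tuned for this dataset
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs = 5,       # a few more passes over the dataset
    learning_rate = 5e-5,       # smaller steps when adjusting the parameters
    weight_decay = 0.001,
)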