fine-tuning

## for google colab, uncomment these lines to install the libraries:
# %%capture
# %pip install transformers trl datasets
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from trl import SFTTrainer, SFTConfig
# load the tokenizer and the base model, then the dataset we'll fine-tune on
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
dataset = load_dataset("gofilipa/bedtime_stories")

Padding is necessary to account for different sizes of text in our dataset.

From the docs on 🤗: “Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences. In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well.”
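As a quick illustration (just a sketch, using the tokenizer we loaded above and two made-up sentences; the next cell sets the pad token properly), padding turns a ragged batch into one rectangular tensor:

batch = ["Once upon a time", "There once was a little girl who loved to look at the stars"]

tokenizer.pad_token = tokenizer.eos_token # GPT-Neo has no pad token, so reuse the end-of-sequence token

encoded = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # both rows now have the length of the longest sequence
print(encoded["attention_mask"])    # the 0s mark the padded positions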

# GPT-Neo has no pad token of its own, so reuse the end-of-sequence token and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
training_params = SFTConfig(
    output_dir="./results",
    num_train_epochs = 3, # how many times we iterate over the dataset as a whole
    learning_rate = 2e-4, # how big a "step" we take when adjusting the parameters to reduce the loss
    weight_decay = 0.001, # a way of regularizing (shrinking) the parameters
    dataset_text_field = "stories", # the column in our dataset that holds the text
    report_to="none" # avoids a prompt to log in to W&B
)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    processing_class = tokenizer,
    args = training_params
)
### commenting out this line so it doesn't re-run when I create this website (the output below is from an earlier run)

# trainer.train()
[75/75 00:48, Epoch 3/3]
Step Training Loss

TrainOutput(global_step=75, training_loss=1.781717529296875, metrics={'train_runtime': 54.1603, 'train_samples_per_second': 11.023, 'train_steps_per_second': 1.385, 'total_flos': 40251401496576.0, 'train_loss': 1.781717529296875})
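A quick note on those numbers: 75 steps over 3 epochs is 25 optimizer steps per epoch, and since we never set a batch size in SFTConfig, the trainer uses its default of 8 examples per step, which suggests the training split holds roughly 200 stories.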

What’s happening in the training process? Basically, the process involves three pieces (sketched in toy code below this list):

  • a hypothesis function, which makes a guess as to what word to generate following a given word

  • a loss function, which calculates the difference between the guess and the actual word

  • gradient descent, which very gradually updates the parameters so as to minimize the loss
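Here is a minimal toy sketch of those three pieces in plain PyTorch, using made-up numbers rather than the real model (in the actual fine-tune, the guess is a probability distribution over the next token, the loss is cross-entropy, and the trainer repeats this update for every batch of stories):

import torch

weights = torch.randn(10, requires_grad=True)  # a stand-in for the model's parameters
inputs = torch.randn(10)                       # a stand-in for the tokenized text
target = torch.tensor(2.0)                     # the "actual" answer we want to predict
learning_rate = 2e-4

for step in range(3):
    guess = (weights * inputs).sum()      # 1. hypothesis: a guess from the current parameters
    loss = (guess - target) ** 2          # 2. loss: how far off was the guess?
    loss.backward()                       #    compute gradients of the loss w.r.t. the parameters
    with torch.no_grad():
        weights -= learning_rate * weights.grad  # 3. gradient descent: a tiny step downhill
        weights.grad.zero_()
    print(step, loss.item())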

running inference on our new model

# commenting out this code so that it doesn't run (when I create this website)!

### first, save the model to a folder called "models" 
# trainer.model.save_pretrained("models")
# trainer.tokenizer.save_pretrained("models") # trainer.processing_class works here too and isn't deprecated

### then, load our model from that folder
# model = AutoModelForCausalLM.from_pretrained("./models")
# tokenizer = AutoTokenizer.from_pretrained("./models")

### create a pipe() function that calls our new model
# pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=50)

### run inference!
pipe("There once was a little girl named Filipa and")
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
[{'generated_text': 'There once was a little girl named Filipa and her momma. Every night, she would go outside and look up at the stars in the sky. One night, she saw a magical rainbow that would bring her all the joy she wanted. She'}]