fine-tuning - INFO-664 notebooks

downloading Transformers & required libraries

The cells below install Transformers and the other libraries we need for machine learning tasks.

Choose the cell that matches your system, either Colab or Jupyter. On Jupyter, you only need to run the relevant cell once, the first time that you ever load the Transformers library.

# ### Google Colab ###
# ### un-comment the code below to run it ###

# %pip install transformers trl

# ### Jupyter-Lab ###
# ### un-comment the code below to run it ###
# ### only run this cell one time, after that it will be already downloaded ###

# !pip install transformers datasets trl

loading our libraries

!python --version

Python 3.13.12

# we need these libraries to do the fine-tuning

# imports the dataset
from datasets import load_dataset

# imports the transformers fine-tuning classes and functions
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from trl import SFTTrainer, SFTConfig

Note: if you are on Jupyter and you get an AttributeError that points to an issue with the pyarrow module, run the code below to re-install a working version of pyarrow. Then restart your Jupyter notebook, and the above cell should work. (See huggingface/datasets#6985)

!pip install pyarrow==15.0.2

loading our training dataset and base model

# loads our fine-tuning dataset: transcript text from the Love Is Blind TV show

dataset = load_dataset('gofilipa/love_is_blind_sample')

# print the first 10 lines of the dataset (the show transcripts)

dataset['train']['text'][:10]

['What if I come here and everyone\'s like, "That girl has an unattractive voice."',
 "And that's all I have right now.",
 'I have a dirt bike too, just to top it off.',
 'I wish I looked like J. Lo. - Hello? -',
 '- Uh, I work in real estate.',
 "'Cause I feel like you have to be somewhat attractive to sell a lot.",
 "There's a lot of shallow people in this world.",
 'Is that a normal question to ask?',
 'You have to ask all the questions up front, okay?',
 'Obviously, I have, like, a lot of friends.']

# loading our model and tokenizer. This will be the "base" model we will be fine-tuning
# so that it generates text like the show.

model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-125m')
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-125m')

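# some tokenizer settings that we need: this tokenizer has no padding token by default,
# so we reuse the end-of-sequence token and add padding on the right side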

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
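
To illustrate what these two settings control, here is a minimal sketch of right-padding in plain Python. The helper `pad_right` and the short token-id lists are hypothetical, and the pad id 50256 is assumed to be the end-of-sequence id of the GPT-Neo tokenizer. The real tokenizer does this work internally when it batches text, so this is only a picture of the idea.

```python
# Assumed end-of-sequence id, reused as the pad id (mirrors pad_token = eos_token)
PAD_ID = 50256

def pad_right(sequences, pad_id=PAD_ID):
    """Pad every token-id list on the right to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    padded, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return padded, attention_mask

# Hypothetical token ids for two lines of different lengths
batch = [[40, 423, 257], [3792, 326, 257, 3487, 1808]]
padded, attention_mask = pad_right(batch)
print(padded)
print(attention_mask)
```

The shorter line is filled out with the pad id on its right side, and the mask records which positions are padding so that they can be ignored.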

configuring training parameters

# our training parameters

training_params = SFTConfig(
    output_dir="./results",
    num_train_epochs = 3, # how many times we iterate over the dataset as a whole
    learning_rate = 2e-4, # size of each "step" we take when adjusting the parameters to reduce the loss
    weight_decay = 0.001, # regularizes the parameters by shrinking them slightly at each step
    dataset_text_field = "text", # the dataset column that holds the training text
    report_to="none" # turns off experiment logging, which avoids a login to W&B
)
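
To get a feel for `learning_rate` and `weight_decay`, here is a minimal sketch with a single parameter and a toy loss. It uses plain gradient descent, not the AdamW optimizer that the trainer uses by default, and the names `sgd_step` and `loss_grad`, the toy loss, and the learning rates 0.1 and 0.01 are all hypothetical choices made for illustration.

```python
def loss_grad(w):
    """Gradient of the toy loss (w - 3) ** 2, which is smallest at w = 3."""
    return 2 * (w - 3)

def sgd_step(w, grad, learning_rate, weight_decay=0.0):
    """One gradient-descent update with decoupled weight decay."""
    w_decayed = w - learning_rate * weight_decay * w  # shrink the weight slightly
    return w_decayed - learning_rate * grad           # step against the gradient

# One update from the same starting point with two different learning rates
big_step = sgd_step(0.0, loss_grad(0.0), learning_rate=0.1)
small_step = sgd_step(0.0, loss_grad(0.0), learning_rate=0.01)
print(big_step, small_step)

# Many updates: weight decay pulls the result slightly below the loss minimum at 3
w = 0.0
for _ in range(100):
    w = sgd_step(w, loss_grad(w), learning_rate=0.1, weight_decay=0.001)
print(w)
```

A larger learning rate moves the parameter further in a single update, while weight decay nudges the final value a little toward zero.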

# creating the "trainer" object

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    processing_class = tokenizer,
    args = training_params
)

/opt/anaconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:309: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(

training!

Most of the cells below are commented out to prevent them from running each time this website is generated. To run them, make sure you un-comment them first.

# # Un-comment this cell to run the train() function

# # finally -- training! 
# # (this will take a couple of minutes)

# trainer.train()

TrainOutput(global_step=750, training_loss=2.9969449055989585, metrics={'train_runtime': 126.0682, 'train_samples_per_second': 47.593, 'train_steps_per_second': 5.949, 'total_flos': 83606677954560.0, 'train_loss': 2.9969449055989585, 'epoch': 3.0})
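
The numbers in this output can be cross-checked with a little arithmetic. The sketch below copies four values from the `TrainOutput` above and derives the rest. It assumes training ran on a single device with no gradient accumulation. The resulting figures (about 8 examples per step and about 2,000 examples per epoch) are inferred from the log, not stated anywhere in the notebook.

```python
# Values copied from the TrainOutput shown above
global_step = 750
train_runtime = 126.0682        # seconds
samples_per_second = 47.593
epochs = 3.0

# Total training examples processed over the whole run
samples_seen = samples_per_second * train_runtime

# Derived quantities (inferred, not reported directly by the trainer)
samples_per_step = samples_seen / global_step   # the effective batch size
rows_per_epoch = samples_seen / epochs          # approximate dataset size
steps_per_epoch = global_step / epochs

print(round(samples_seen), round(samples_per_step))
print(round(rows_per_epoch), round(steps_per_epoch))
```

An effective batch size of 8 matches the default per-device batch size in the Transformers training arguments, which is consistent with the configuration above not setting one.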

## now, to interact with our new model, we have to save it, and then re-load it
## into our python notebook.

# # Un-comment this cell to run it
# # this code saves the model to a folder called "models" in your working directory
# # you should now be able to see this folder in the sidebar (exciting!)

# trainer.save_model('./models')

# # Un-comment this cell to run it
# # then, load our model from that folder we just created

# model = AutoModelForCausalLM.from_pretrained("./models")
# tokenizer = AutoTokenizer.from_pretrained("./models")

# create a text-generation pipeline called pipe that calls our new model
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=50)

# run inference!
pipe("I love")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

[{'generated_text': "I love that you're taking my mom and I on your lives."}]

pipe("My name is Filipa and")

[{'generated_text': "My name is Filipa and I'm here to find my mom."}]

pipe("When I'm with you")

[{'generated_text': 'When I\'m with you is like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What'}]