downloading Transformers & required libraries
The cells below install Transformers and the other libraries we need for our machine learning tasks.
Choose the cell that matches your system, either Colab or Jupyter. On Jupyter, you only need to run the install cell once, the first time that you ever load the Transformers library.
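If you are not sure whether the install cell has already been run on your machine, a quick check like the one below can tell you. This is only a sketch: the helper name `missing_packages` is made up for illustration, and it checks only that each package can be found, not that its version is compatible.

```python
from importlib.util import find_spec

# The packages that the install cells in this notebook download.
REQUIRED = ["transformers", "datasets", "trl"]

def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    # find_spec() looks a top-level package up without importing it.
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required libraries are already installed.")
```

If the list comes back empty, you can probably skip the install cell.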
# ### Google Colab ###
# ### un-comment the code below to run it ###
# %pip install transformers trl

# ### Jupyter-Lab ###
# ### un-comment the code below to run it ###
# ### only run this cell one time, after that it will be already downloaded ###
# !pip install transformers datasets trl

loading our libraries

!python --version

Python 3.13.12
# we need these libraries to do the fine-tuning
# imports the dataset loader
from datasets import load_dataset
# imports the transformers fine-tuning classes and functions
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from trl import SFTTrainer, SFTConfig

Note: if you are on Jupyter and you get an AttributeError that points to an issue with the pyarrow module, run this code to re-install a working version of pyarrow. Then, restart your Jupyter notebook and the above cell should work. (See huggingface
!pip install pyarrow==15.0.2
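Before re-installing, it can help to see which pyarrow version you currently have. The sketch below uses only the Python standard library. The helper name `installed_version` is invented for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Prints the installed version, or None if pyarrow is not installed at all.
print("pyarrow:", installed_version("pyarrow"))
```

Remember that a version printed here only changes after you restart the notebook kernel.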
loading our training dataset and base model

# loads our fine-tuning dataset: transcript text from the Love Is Blind TV show
dataset = load_dataset('gofilipa/love_is_blind_sample')

# print the first 10 lines of the dataset (the show transcripts)
dataset['train']['text'][:10]

['What if I come here and everyone\'s like, "That girl has an unattractive voice."',
"And that's all I have right now.",
'I have a dirt bike too, just to top it off.',
'I wish I looked like J. Lo. - Hello? -',
'- Uh, I work in real estate.',
"'Cause I feel like you have to be somewhat attractive to sell a lot.",
"There's a lot of shallow people in this world.",
'Is that a normal question to ask?',
'You have to ask all the questions up front, okay?',
'Obviously, I have, like, a lot of friends.']

# loading our model and tokenizer. This will be the "base" model we will be fine-tuning
# so that it generates text like the show.
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-125m')
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-125m')

# some tokenizer settings that we need: this tokenizer has no padding token of its own,
# so we reuse the end-of-sequence token, and we pad on the right side of each example
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

configuring training parameters
# our training parameters
training_params = SFTConfig(
    output_dir="./results",
    num_train_epochs = 3, # how many times we iterate over the dataset as a whole
    learning_rate = 2e-4, # how big a "step" we take when adjusting the parameters to reduce the loss
    weight_decay = 0.001, # way of regularizing the parameters
    dataset_text_field = "text", # the dataset column that holds the training text
    report_to="none" # this is a new param, to avoid a login to W&B
)

# creating the "trainer" object
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    processing_class = tokenizer,
    args = training_params
)

/opt/anaconda3/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:309: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
warnings.warn(
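To get a feel for what `learning_rate` and `weight_decay` do, here is a deliberately simplified sketch of one parameter update. It is plain gradient descent with decoupled weight decay, which is an assumption made for illustration: the trainer itself uses a more elaborate optimizer, so treat this as intuition rather than a description of what SFTTrainer does internally. The function name `update_step` and the toy numbers are made up.

```python
def update_step(weights, grads, learning_rate=2e-4, weight_decay=0.001):
    """One simplified update: step against the gradient, then shrink the weights a little."""
    new_weights = []
    for w, g in zip(weights, grads):
        w = w - learning_rate * g                  # the learning rate scales the step size
        w = w - learning_rate * weight_decay * w   # weight decay pulls each weight toward zero
        new_weights.append(w)
    return new_weights

# Toy values, not taken from the real model.
weights = [0.5, -1.2, 3.0]
grads = [0.1, -0.4, 2.0]
print(update_step(weights, grads))
```

A larger `learning_rate` moves the weights further on every step, and a larger `weight_decay` shrinks them more strongly toward zero.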
training!
Most of the cells below are commented out so that they do not run each time this website is generated. To run them, make sure you un-comment the cells first.
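The training output further down reports `global_step=750`. A number like that can be estimated ahead of time from the dataset size, the batch size, and the number of epochs. The sketch below rests on assumptions: a training set of 2,000 rows and a batch size of 8 on a single device with no gradient accumulation. Neither value is stated in this notebook. They are picked because they are consistent with the reported step count. The function name `estimate_total_steps` is invented for this example.

```python
import math

def estimate_total_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps: batches per epoch (rounded up) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Assumed values (not stated in the notebook): 2,000 rows, batch size 8.
# num_epochs=3 matches num_train_epochs in the training parameters.
print(estimate_total_steps(num_examples=2000, batch_size=8, num_epochs=3))  # 750
```

If your own run reports a very different step count, the batch size or dataset size on your machine is probably different from these assumptions.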
# # Un-comment this cell to run the train() function
# # finally -- training!
# # (this will take a couple of minutes)
# trainer.train()

TrainOutput(global_step=750, training_loss=2.9969449055989585, metrics={'train_runtime': 126.0682, 'train_samples_per_second': 47.593, 'train_steps_per_second': 5.949, 'total_flos': 83606677954560.0, 'train_loss': 2.9969449055989585, 'epoch': 3.0})

## now, to interact with our new model, we have to save it, and then re-load it
## into our python notebook.

# # Un-comment this cell to run it
# # this code saves the model to a folder called "models" in your working directory
# # you should now be able to see this folder in the sidebar (exciting!)
# trainer.save_model('./models')

# # Un-comment this cell to run it
# # then, load our model from that folder we just created
# model = AutoModelForCausalLM.from_pretrained("./models")
# tokenizer = AutoTokenizer.from_pretrained("./models")

# create a pipe() function that calls our new model
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=50)
# run inference!
pipe("I love")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
[{'generated_text': "I love that you're taking my mom and I on your lives."}]

pipe("My name is Filipa and")

[{'generated_text': "My name is Filipa and I'm here to find my mom."}]

pipe("When I'm with you")

[{'generated_text': 'When I\'m with you is like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What the fuck?" - I\'m like, "What'}]
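The last output gets stuck repeating the same phrase, which is common when a small model always picks its single most likely next token. One thing to try is passing sampling options when calling the pipeline. The sketch below is a starting point, not a tuned recipe: the specific values are guesses, and the wrapper name `generate_text` is made up for this example. The option names themselves (`do_sample`, `top_p`, `temperature`, `repetition_penalty`, `no_repeat_ngram_size`) are standard Transformers generation settings.

```python
# Assumed starting values. Adjust them and compare the outputs.
DEFAULT_GENERATION_KWARGS = {
    "do_sample": True,            # sample from the distribution instead of always taking the top token
    "top_p": 0.9,                 # only sample from the most likely tokens covering 90% of the probability
    "temperature": 0.8,           # values below 1 make sampling a little more conservative
    "repetition_penalty": 1.2,    # discourage tokens that were already generated
    "no_repeat_ngram_size": 3,    # never repeat the same three-token sequence
}

def generate_text(pipe, prompt, **overrides):
    """Call a text-generation pipeline with the defaults above and return the generated string."""
    kwargs = {**DEFAULT_GENERATION_KWARGS, **overrides}
    outputs = pipe(prompt, **kwargs)
    return outputs[0]["generated_text"]

# Usage with the `pipe` object created in this notebook:
# generate_text(pipe, "When I'm with you")
# generate_text(pipe, "When I'm with you", temperature=1.0)
```

Because sampling is random, each call will give a different result, so run it a few times before judging a setting.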