Now that you have a sense of how things work on the HF website, we are going to practice running inference on Google Colab.
Our goal is to create a text generator, using Python code, taking the following steps:
First, we will import the model, “gpt-neo-125m”, into the Colab coding space.
Then we will write code that processes an input text to generate an output, a continuation.
Finally, we will import a dataset from the library and practice running inference with it.
downloading Transformers & required libraries (only run once)¶
The cells below install Transformers and the other libraries required for machine learning tasks. You only need to run the relevant cell once, the first time that you load the Transformers library.
Based on your system, either Colab or Jupyter, choose the appropriate cell below.
# ### Google Colab ###
# ### uncomment the code below to run it ###
# %pip install transformers trl

# ### Jupyter-Lab ###
# ### uncomment the code below to run it ###
# !pip install transformers datasets trl
# Read more about installations here: https://huggingface.co/docs/transformers/installation

After installing, go back to the models page. Search for gpt-neo, select 125m. On the top right, click on “Use in Transformers.” Copy that code, and paste it to your notebook.
from transformers import pipeline
# if you have a GPU: device=0 selects the first GPU (on a Mac with an M1 chip, you can pass device="mps" instead)
pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-125m", device=0)
# if you do not have a GPU
# pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")
Here we have a function, called pipeline(), which takes parameters (a fancy word for input). The parameters specify the task and the model that we will be using. We save the object that pipeline() returns to a variable called pipe, which we will later use to process our prompt.
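To make the idea of "a function that takes parameters and hands back something we can call later" concrete, here is a toy sketch in plain Python. It does not use Transformers at all: make_text_generator and its canned reply are hypothetical stand-ins, invented only to mirror the roles of pipeline() and pipe.

```python
def make_text_generator(task, model):
    """Toy stand-in for pipeline(): takes parameters, returns a callable."""

    def generate(prompt):
        # A real model would predict a continuation; this toy version
        # appends a fixed string so the example runs without any downloads.
        text = f"{prompt} [continuation from {model} for task {task}]"
        # Wrap the text in a list holding one dict, keyed by 'generated_text'.
        return [{"generated_text": text}]

    return generate


# The parameters configure the generator once...
toy_pipe = make_text_generator(task="text-generation", model="toy-model")

# ...and the returned object is then called on each prompt.
result = toy_pipe("Hello, my name is Filipa and")
print(result[0]["generated_text"])
```

The point of the sketch is only the two-step pattern: configure once with parameters, then call the saved object as many times as needed.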
inference¶
Now we are going to “run inference.”
First, we will type up a prompt, and save it to a variable prompt. Then we will pass that prompt to the pipe variable that we created before, saving the output to a new variable, called output.
prompt = "Hello, my name is Filipa and"
pipe(prompt, max_length = 50)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
[{'generated_text': "Hello, my name is Filipa and I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in"}]
# saving the output
output = pipe(prompt, max_length = 50)
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Now let’s look at the response, and inspect the data structure contained within it, which is a list.
A list is a collection of objects, or bits of information, so our output is stored as this kind of collection.
output
[{'generated_text': "Hello, my name is Filipa and I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in"}]
type(output)
list
What if we wanted to extract just the output text, and not the rest of the data? How would we go about it? We use list indexing. When we check the type, we find out that the first item of the list is a dict.
output[0]
{'generated_text': "Hello, my name is Filipa and I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in"}
type(output[0])
dict
To get items from a dict, access them by their keys.
output[0]['generated_text']
"Hello, my name is Filipa and I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in the world of web development. I'm a newbie in"
accessing data from datasets:¶
Now we will practice what we’ve learned about accessing data on the Datasets library from HF.
from datasets import load_dataset
# load the dataset and its subset
dataset = load_dataset("gofilipa/love_is_blind_sample")
# check the dataset object
dataset
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2000
    })
})
type(dataset)
datasets.dataset_dict.DatasetDict
# how do we get items from a dict? by the key
dataset['train']
Dataset({
    features: ['text'],
    num_rows: 2000
})
# how would we get the first row from this dataset?
dataset['train']['text'][0]
'What if I come here and everyone\'s like, "That girl has an unattractive voice."'
# another row
dataset['train']['text'][10]
"I think I'm really good at these pods because, like, with my job, I have, like, five to, like, 20 people in my car, like, at a time."
Now, we are going to feed lines from our dataset into the pipe that we created above.
outputs = []
for i in dataset['train']['text'][:5]:
    out = pipe(i, max_new_tokens=100)
    outputs.append(out)
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
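The loop above uses the slice [:5] to take the first five rows of the text column. As a reminder of how zero-based indexing and slicing behave, here is a minimal sketch on a hand-written stand-in. The name toy_train and its rows are made up for illustration; it is a plain Python dict, not a real Dataset object, and is only meant to show the same bracket pattern.

```python
# Hand-written stand-in for dataset['train']: one column, 'text', holding a list of rows.
toy_train = {
    "text": [
        "row zero",
        "row one",
        "row two",
        "row three",
        "row four",
        "row five",
    ]
}

first_row = toy_train["text"][0]    # index 0 is the first row
second_row = toy_train["text"][1]   # index 1 is the second row
first_five = toy_train["text"][:5]  # rows at index 0 through 4; index 5 is excluded

print(first_row)
print(second_row)
print(len(first_five))
```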
outputs
[[{'generated_text': 'What if I come here and everyone\'s like, "That girl has an unattractive voice."\n\nI\'m not saying that I\'m not a girl, but I\'m saying that I\'m not a girl. I\'m saying that I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl.'}],
[{'generated_text': 'And that\'s all I have right now. I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," she said. "I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," he said. "I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," she said. "I\'m going to be back in the office."\n\n"I\'m not'}],
[{'generated_text': 'I have a dirt bike too, just to top it off. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike.'}],
[{'generated_text': "I wish I looked like J. Lo. - Hello? - I'm a little confused. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do"}],
[{'generated_text': "- Uh, I work in real estate. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm"}]]
Let’s get just the output text from the outputs.
outs = []
for i in outputs:
    out = i[0]['generated_text']
    outs.append(out)
outs
['What if I come here and everyone\'s like, "That girl has an unattractive voice."\n\nI\'m not saying that I\'m not a girl, but I\'m saying that I\'m not a girl. I\'m saying that I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl. I\'m not a girl.',
'And that\'s all I have right now. I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," she said. "I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," he said. "I\'m going to be back in the office."\n\n"I\'m not going to be back in the office," she said. "I\'m going to be back in the office."\n\n"I\'m not',
'I have a dirt bike too, just to top it off. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike. I have a couple of other bikes that I have been riding for a while, but I have never ridden a dirt bike.',
"I wish I looked like J. Lo. - Hello? - I'm a little confused. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do this. I'm trying to find the right way to do",
"- Uh, I work in real estate. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm a real estate agent. I'm"]
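The extraction loop can also be written as a list comprehension, which some readers find easier to scan. The sketch below runs on a small hand-written stand-in, so toy_outputs and its two strings are invented for illustration; only the nesting (a list of results, each a list holding one dict with a 'generated_text' key) is taken from the output shown above.

```python
# Hand-written stand-in with the same nesting as `outputs`:
# a list of pipeline results, each a list holding one dict.
toy_outputs = [
    [{"generated_text": "first prompt and its continuation"}],
    [{"generated_text": "second prompt and its continuation"}],
]

# Same steps as the loop: take item 0 of each result, then its 'generated_text' value.
toy_outs = [result[0]["generated_text"] for result in toy_outputs]

print(toy_outs)
```

Both forms build the same list; the explicit loop is easier to extend with extra steps, while the comprehension keeps a simple extraction on one line.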