transformers: introduction#
# first, install the Transformers library
# you only need to install it once
!pip install transformers
Requirement already satisfied: transformers in /Users/caladof/anaconda3/lib/python3.11/site-packages (4.29.2)
# import the transformers library, along with the pipeline and set_seed functions
import transformers
from transformers import pipeline, set_seed
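The bare transformers import also lets us confirm which version is running, which matters because pipeline defaults change between releases. A small check (the pip output above reported 4.29.2):

# confirm the installed version of the library
transformers.__version__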
text generation#
Generates new text that continues an input prompt, like a chatbot.
# load the text-generation "pipeline" and assign it to a variable
# called "generator"
generator = pipeline('text-generation')
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
# pass a prompt, a maximum length, and the number of responses
# to the generator
generator('This summer, I was rock climbing in Yosemite when',
          max_length=50,
          num_return_sequences=2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'This summer, I was rock climbing in Yosemite when I heard the first of six live moose, one of many of the first live booms with my mom. For weeks as we worked to clean up after the booms, we walked around the'},
{'generated_text': "This summer, I was rock climbing in Yosemite when a group of volunteers showed up to see me. I wasn't on the route, but I had been working hard and had a good time.\n\nI got a photo of my buddy Jesse,"}]
fill mask#
Fills in a masked (blanked-out) word in a sentence with the model's best guesses.
# create the "unmasker" variable set to the "fill-mask" task
unmasker = pipeline('fill-mask')
No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
# give it a sentence with <mask> marking the blank to fill in
# the "top_k" argument asks for the model's top 4 predictions
unmasker('To be or not to be; that is the <mask>', top_k=4)
[{'score': 0.10899507254362106,
'token': 2249,
'token_str': ' difference',
'sequence': 'To be or not to be; that is the difference'},
{'score': 0.057924505323171616,
'token': 2031,
'token_str': ' choice',
'sequence': 'To be or not to be; that is the choice'},
{'score': 0.05728177726268768,
'token': 3157,
'token_str': ' truth',
'sequence': 'To be or not to be; that is the truth'},
{'score': 0.04440455138683319,
'token': 1948,
'token_str': ' answer',
'sequence': 'To be or not to be; that is the answer'}]
unmasker('My name is Professor Calado and I teach at <mask>', top_k=4)
[{'score': 0.13512930274009705,
'token': 20124,
'token_str': ' MIT',
'sequence': 'My name is Professor Calado and I teach at MIT'},
{'score': 0.07084149122238159,
'token': 10441,
'token_str': ' UCLA',
'sequence': 'My name is Professor Calado and I teach at UCLA'},
{'score': 0.06717373430728912,
'token': 8607,
'token_str': ' Stanford',
'sequence': 'My name is Professor Calado and I teach at Stanford'},
{'score': 0.06465483456850052,
'token': 23706,
'token_str': ' BYU',
'sequence': 'My name is Professor Calado and I teach at BYU'}]
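The fill-mask pipeline can also score a hand-picked set of candidate words through its targets argument. A minimal sketch, assuming the same default model as above; note that this model's tokenizer distinguishes words with and without a leading space, so the exact scores depend on which variant it matches.

# a sketch restricting predictions to specific candidates with "targets"
# scores are computed only for the words we list
unmasker('To be or not to be; that is the <mask>',
         targets=['question', 'answer'])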
summarization#
Takes a longer text and condenses it.
# taking the "summarization" task and saving it to "summarizer"
# then passing some text into the "summarizer"
# we use three quotes at the beginning and end of the string
# if we want to put in a text that spans multiple lines
summarizer = pipeline('summarization')
summarizer('''The past 3 years of work in NLP have been characterized
by the development and deployment of ever larger language models,
especially for English. BERT, its variants, GPT-2/3, and others,
most recently Switch-C, have pushed the boundaries of the possible
both through architectural innovations and through sheer size. Using
these pretrained models and the methodology of fine-tuning them for
specific tasks, researchers have extended the state of the art on a
wide array of tasks as measured by leaderboards on specific benchmarks
for English. In this paper, we take a step back and ask: How big is too
big? What are the possible risks associated with this technology and
what paths are available for mitigating those risks? We provide
recommendations including weighing the environmental and financial costs
first, investing resources into curating and carefully documenting
datasets rather than ingesting everything on the web, carrying out
pre-development exercises evaluating how the planned approach fits into
research and development goals and supports stakeholder values, and
encouraging research directions beyond ever larger language models.''')
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
KeyboardInterrupt (traceback trimmed: the model download was interrupted, so this cell produced no summary)
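Because the download was interrupted, the cell above never returned a summary. Here is a minimal sketch of rerunning the call, naming the default model explicitly (the checkpoint the pipeline chose above) and bounding the length of the summary; min_length and max_length count tokens, not words.

# a sketch naming the default summarization model explicitly
# and bounding the length of the summary (measured in tokens)
summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')
summarizer('''The past 3 years of work in NLP have been characterized
by the development and deployment of ever larger language models,
especially for English. BERT, its variants, GPT-2/3, and others,
most recently Switch-C, have pushed the boundaries of the possible
both through architectural innovations and through sheer size.''',
           max_length=40, min_length=10)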
question-answering#
Takes an input question and a context, and extracts the answer from the context.
# calling the question-answering pipeline
# passing a question and a context into the pipeline
# the model searches the context for the answer
question_answer = pipeline('question-answering')
question_answer(question='Was the writer of Frankenstein a man or a woman?',
                context='''Frankenstein is a book written by Mary Shelley who is
a woman''')
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.6520994901657104, 'start': 71, 'end': 78, 'answer': 'a woman'}
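The pipeline can also return several candidate answers rather than just the best one. A minimal sketch, assuming the same default model as above; the "top_k" argument returns the k highest-scoring spans from the context.

# a sketch asking for the two highest-scoring answer spans
question_answer(question='Who wrote Frankenstein?',
                context='Frankenstein is a book written by Mary Shelley.',
                top_k=2)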
ner (named entity recognition)#
Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.
ner = pipeline("ner", grouped_entities=True)
ner("My name is Filipa Calado and I work at City College in Manhattan.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/usr/local/lib/python3.9/dist-packages/transformers/pipelines/token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="simple"` instead.
warnings.warn(
[{'entity_group': 'PER',
'score': 0.9985998,
'word': 'Filipa Calado',
'start': 11,
'end': 24},
{'entity_group': 'ORG',
'score': 0.9940423,
'word': 'City College',
'start': 39,
'end': 51},
{'entity_group': 'LOC',
'score': 0.9883624,
'word': 'Manhattan',
'start': 55,
'end': 64}]
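The warning above notes that grouped_entities is deprecated; the same grouping of sub-word tokens into whole entities comes from the aggregation_strategy argument. A minimal sketch with the replacement argument:

# the non-deprecated way to group sub-word tokens into whole entities
ner = pipeline('ner', aggregation_strategy='simple')
ner('My name is Filipa Calado and I work at City College in Manhattan.')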