transformers: introduction#

# first, install the Transformers library
# you only need to install it once

!pip install transformers
# import the transformers library, along with the pipeline and set_seed functions

import transformers
from transformers import pipeline, set_seed

text generation#

Generates new text based on an input prompt, like a chatbot.

# pulling in the text-generation "pipeline" and assigning it to a
# variable called "generator"

generator = pipeline('text-generation')
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
# passing a prompt sentence, a maximum length, and the number of
# responses to the generator

generator('This summer, I was rock climbing in Yosemite when',
          max_length=50,
          num_return_sequences=2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'This summer, I was rock climbing in Yosemite when I heard the first of six live moose, one of many of the first live booms with my mom. For weeks as we worked to clean up after the booms, we walked around the'},
 {'generated_text': "This summer, I was rock climbing in Yosemite when a group of volunteers showed up to see me. I wasn't on the route, but I had been working hard and had a good time.\n\nI got a photo of my buddy Jesse,"}]
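
The warnings above note that no model was supplied, so the pipeline fell back to gpt2. A minimal sketch of a more explicit setup, pinning that default model by name and using the set_seed function we imported earlier so that the sampled text is reproducible (the seed value 42 is arbitrary):

# pin the model explicitly and fix the random seed so that repeated
# runs sample the same text
set_seed(42)
generator = pipeline('text-generation', model='gpt2')
generator('This summer, I was rock climbing in Yosemite when',
          max_length=50,
          num_return_sequences=2)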

fill mask#

Fills in the blank with the model's best guesses.

# create the "unmasker" variable set to the "fill-mask" task

unmasker = pipeline('fill-mask')
No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
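
The tokenizers message above is only a warning; as it suggests, we can silence it by setting an environment variable before doing any more work with the pipeline:

# silence the fork warning, following the suggestion in the message
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
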
# give it a sentence, with <mask> standing in for the blank
# the "top_k" argument asks for the 4 most likely responses

unmasker('To be or not to be; that is the <mask>', top_k=4)
[{'score': 0.10899507254362106,
  'token': 2249,
  'token_str': ' difference',
  'sequence': 'To be or not to be; that is the difference'},
 {'score': 0.057924505323171616,
  'token': 2031,
  'token_str': ' choice',
  'sequence': 'To be or not to be; that is the choice'},
 {'score': 0.05728177726268768,
  'token': 3157,
  'token_str': ' truth',
  'sequence': 'To be or not to be; that is the truth'},
 {'score': 0.04440455138683319,
  'token': 1948,
  'token_str': ' answer',
  'sequence': 'To be or not to be; that is the answer'}]
unmasker('My name is Professor Calado and I teach at <mask>', top_k=4)
[{'score': 0.13512930274009705,
  'token': 20124,
  'token_str': ' MIT',
  'sequence': 'My name is Professor Calado and I teach at MIT'},
 {'score': 0.07084149122238159,
  'token': 10441,
  'token_str': ' UCLA',
  'sequence': 'My name is Professor Calado and I teach at UCLA'},
 {'score': 0.06717373430728912,
  'token': 8607,
  'token_str': ' Stanford',
  'sequence': 'My name is Professor Calado and I teach at Stanford'},
 {'score': 0.06465483456850052,
  'token': 23706,
  'token_str': ' BYU',
  'sequence': 'My name is Professor Calado and I teach at BYU'}]
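
Note that the mask token differs between models: distilroberta-base expects <mask>, while BERT-style models expect [MASK]. If in doubt, the pipeline's tokenizer can tell you, as in this small sketch using the unmasker built above:

# look up the mask token that the underlying tokenizer expects
print(unmasker.tokenizer.mask_token)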

summarization#

Takes a longer text and condenses it.

# taking the "summarization" task and saving it to "summarizer"
# then passing some text into the "summarizer"

# we use triple quotes at the beginning and end of the string
# when the text spans multiple lines

summarizer = pipeline('summarization')
summarizer('''The past 3 years of work in NLP have been characterized 
by the development and deployment of ever larger language models, 
especially for English. BERT, its variants, GPT-2/3, and others, 
most recently Switch-C, have pushed the boundaries of the possible 
both through architectural innovations and through sheer size. Using 
these pretrained models and the methodology of fine-tuning them for 
specific tasks, researchers have extended the state of the art on a 
wide array of tasks as measured by leaderboards on specific benchmarks 
for English. In this paper, we take a step back and ask: How big is too 
big? What are the possible risks associated with this technology and 
what paths are available for mitigating those risks? We provide 
recommendations including weighing the environmental and financial costs 
first, investing resources into curating and carefully documenting 
datasets rather than ingesting everything on the web, carrying out 
pre-development exercises evaluating how the planned approach fits into 
research and development goals and supports stakeholder values, and 
encouraging research directions beyond ever larger language models.''')
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
(The cell above was interrupted with a KeyboardInterrupt while the model weights were still downloading, so no summary output appears here. Rerun the cell and let the download finish to see the summary.)
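
The summarization pipeline also accepts length controls, which are passed through to the underlying model. A minimal sketch, assuming the download above completes and using long_text as a stand-in variable for the passage (max_length and min_length count tokens, not characters):

# bound the length of the generated summary
# long_text stands in for the multi-line passage above
summary = summarizer(long_text, max_length=60, min_length=20, do_sample=False)
print(summary[0]['summary_text'])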

question-answering#

Takes an input question and a context passage and extracts an answer from the context.

# calling the question-answering pipeline
# we pass a question and a context into the pipeline
# the model looks inside the context to find the answer

question_answer = pipeline('question-answering')
question_answer(question='Was the writer of Frankenstein a man or a woman?', 
                context='''Frankenstein is a book written by Mary Shelley who is 
                a woman''')
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.6520994901657104, 'start': 71, 'end': 78, 'answer': 'a woman'}
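
The start and end fields are character offsets into the context string, so we can recover the answer span ourselves. A brief sketch (we store the context and result in variables for illustration):

# the start/end offsets index into the original context string
context = 'Frankenstein is a book written by Mary Shelley who is a woman'
result = question_answer(question='Was the writer of Frankenstein a man or a woman?',
                         context=context)
print(context[result['start']:result['end']])  # should match result['answer']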

ner (named entity recognition)#

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

ner = pipeline("ner", grouped_entities=True)
ner("My name is Filipa Calado and I work at City College in Manhattan.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/usr/local/lib/python3.9/dist-packages/transformers/pipelines/token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="simple"` instead.
  warnings.warn(
[{'entity_group': 'PER',
  'score': 0.9985998,
  'word': 'Filipa Calado',
  'start': 11,
  'end': 24},
 {'entity_group': 'ORG',
  'score': 0.9940423,
  'word': 'City College',
  'start': 39,
  'end': 51},
 {'entity_group': 'LOC',
  'score': 0.9883624,
  'word': 'Manhattan',
  'start': 55,
  'end': 64}]
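
Each entity group also carries start and end character offsets, so we can map the detected entities back onto the input string. A brief sketch looping over the results above (the sentence variable is introduced here for illustration):

# map each detected entity back onto the input sentence
sentence = 'My name is Filipa Calado and I work at City College in Manhattan.'
for entity in ner(sentence):
    span = sentence[entity['start']:entity['end']]
    print(entity['entity_group'], span, round(float(entity['score']), 3))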