transformers: introduction#
# first, install the Transformers library
# you only need to install it once
!pip install transformers
Requirement already satisfied: transformers in /Users/caladof/anaconda3/lib/python3.11/site-packages (4.29.2)
# import the transformers library, along with the pipeline and set_seed functions
import transformers
from transformers import pipeline, set_seed
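The bare transformers import also lets us confirm which version is running, which matters because pipeline defaults change between releases. A small check (the pip output above reported 4.29.2):

# confirm the installed version of the library
transformers.__version__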
text generation#
Generates new text that continues an input prompt, like a chatbot.
# load the text-generation "pipeline" and assign it to a variable
# called "generator"
generator = pipeline('text-generation')
No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
# pass a prompt, a maximum length, and the number of responses
# to the generator
generator('This summer, I was rock climbing in Yosemite when',
          max_length=50,
          num_return_sequences=2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'This summer, I was rock climbing in Yosemite when I heard the first of six live moose, one of many of the first live booms with my mom. For weeks as we worked to clean up after the booms, we walked around the'},
{'generated_text': "This summer, I was rock climbing in Yosemite when a group of volunteers showed up to see me. I wasn't on the route, but I had been working hard and had a good time.\n\nI got a photo of my buddy Jesse,"}]
fill mask#
Fills in a masked (blanked-out) word in a sentence with the model's best guesses.
# create the "unmasker" variable set to the "fill-mask" task
unmasker = pipeline('fill-mask')
No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
# give it a sentence with <mask> marking the blank to fill in
# the "top_k" argument asks for the model's top 4 predictions
unmasker('To be or not to be; that is the <mask>', top_k=4)
[{'score': 0.10899507254362106,
'token': 2249,
'token_str': ' difference',
'sequence': 'To be or not to be; that is the difference'},
{'score': 0.057924505323171616,
'token': 2031,
'token_str': ' choice',
'sequence': 'To be or not to be; that is the choice'},
{'score': 0.05728177726268768,
'token': 3157,
'token_str': ' truth',
'sequence': 'To be or not to be; that is the truth'},
{'score': 0.04440455138683319,
'token': 1948,
'token_str': ' answer',
'sequence': 'To be or not to be; that is the answer'}]
unmasker('My name is Professor Calado and I teach at <mask>', top_k=4)
[{'score': 0.13512930274009705,
'token': 20124,
'token_str': ' MIT',
'sequence': 'My name is Professor Calado and I teach at MIT'},
{'score': 0.07084149122238159,
'token': 10441,
'token_str': ' UCLA',
'sequence': 'My name is Professor Calado and I teach at UCLA'},
{'score': 0.06717373430728912,
'token': 8607,
'token_str': ' Stanford',
'sequence': 'My name is Professor Calado and I teach at Stanford'},
{'score': 0.06465483456850052,
'token': 23706,
'token_str': ' BYU',
'sequence': 'My name is Professor Calado and I teach at BYU'}]
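The fill-mask pipeline can also score a hand-picked set of candidate words through its targets argument. A minimal sketch, assuming the same default model as above; note that this model's tokenizer distinguishes words with and without a leading space, so the exact scores depend on which variant it matches.

# a sketch restricting predictions to specific candidates with "targets"
# scores are computed only for the words we list
unmasker('To be or not to be; that is the <mask>',
         targets=['question', 'answer'])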
summarization#
Takes a longer text and condenses it.
# taking the "summarization" task and saving it to "summarizer"
# then passing some text into the "summarizer"
# we use three quotes at the beginning and end of the string
# if we want to put in a text that spans multiple lines
summarizer = pipeline('summarization')
summarizer('''The past 3 years of work in NLP have been characterized
by the development and deployment of ever larger language models,
especially for English. BERT, its variants, GPT-2/3, and others,
most recently Switch-C, have pushed the boundaries of the possible
both through architectural innovations and through sheer size. Using
these pretrained models and the methodology of fine-tuning them for
specific tasks, researchers have extended the state of the art on a
wide array of tasks as measured by leaderboards on specific benchmarks
for English. In this paper, we take a step back and ask: How big is too
big? What are the possible risks associated with this technology and
what paths are available for mitigating those risks? We provide
recommendations including weighing the environmental and financial costs
first, investing resources into curating and carefully documenting
datasets rather than ingesting everything on the web, carrying out
pre-development exercises evaluating how the planned approach fits into
research and development goals and supports stakeholder values, and
encouraging research directions beyond ever larger language models.''')
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
KeyboardInterrupt (traceback trimmed: the model download was interrupted, so this cell produced no summary)
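Because the download was interrupted, the cell above never returned a summary. Here is a minimal sketch of rerunning the call, naming the default model explicitly (the checkpoint the pipeline chose above) and bounding the length of the summary; min_length and max_length count tokens, not words.

# a sketch naming the default summarization model explicitly
# and bounding the length of the summary (measured in tokens)
summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')
summarizer('''The past 3 years of work in NLP have been characterized
by the development and deployment of ever larger language models,
especially for English. BERT, its variants, GPT-2/3, and others,
most recently Switch-C, have pushed the boundaries of the possible
both through architectural innovations and through sheer size.''',
           max_length=40, min_length=10)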
question-answering#
Takes an input question and a context, and extracts the answer from the context.
# calling the question-answering pipeline
# passing a question and a context into the pipeline
# the model searches the context for the answer
question_answer = pipeline('question-answering')
question_answer(question='Was the writer of Frankenstein a man or a woman?',
                context='''Frankenstein is a book written by Mary Shelley who is
a woman''')
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
{'score': 0.6520994901657104, 'start': 71, 'end': 78, 'answer': 'a woman'}
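The pipeline can also return several candidate answers rather than just the best one. A minimal sketch, assuming the same default model as above; the "top_k" argument returns the k highest-scoring spans from the context.

# a sketch asking for the two highest-scoring answer spans
question_answer(question='Who wrote Frankenstein?',
                context='Frankenstein is a book written by Mary Shelley.',
                top_k=2)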
ner (named entity recognition)#
Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.
ner = pipeline("ner", grouped_entities=True)
ner("My name is Filipa Calado and I work at City College in Manhattan.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/usr/local/lib/python3.9/dist-packages/transformers/pipelines/token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="simple"` instead.
warnings.warn(
[{'entity_group': 'PER',
'score': 0.9985998,
'word': 'Filipa Calado',
'start': 11,
'end': 24},
{'entity_group': 'ORG',
'score': 0.9940423,
'word': 'City College',
'start': 39,
'end': 51},
{'entity_group': 'LOC',
'score': 0.9883624,
'word': 'Manhattan',
'start': 55,
'end': 64}]
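The warning above notes that grouped_entities is deprecated; the same grouping of sub-word tokens into whole entities comes from the aggregation_strategy argument. A minimal sketch with the replacement argument:

# the non-deprecated way to group sub-word tokens into whole entities
ner = pipeline('ner', aggregation_strategy='simple')
ner('My name is Filipa Calado and I work at City College in Manhattan.')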