the spaCy pipeline#

Note: Because we are using Python in Google Colab’s cloud environment, no installation of Python or spaCy is necessary. But if you’re not on Google Colab and are using Python on your local computer (through a distribution like Anaconda), you can follow the instructions to download and install spaCy here: https://spacy.io/usage

The spaCy library offers powerful text processing capabilities. It processes text by adding tags, which the library’s creators call “annotations,” to the text. These annotations carry linguistic information about each word or bit of punctuation in the text. For example, they can describe parts of speech, grammatical dependencies, punctuation, sentence and clause spans, and a lot more.

How does spaCy know what information to annotate to each piece of text? The program gets this information from a language model, such as en_core_web_sm, which you load before you can process the text. This language model is a statistical model that enables spaCy to make predictions about, for example, a word’s part of speech. It has been trained on popular lexical datasets, like Princeton’s WordNet, and it also contains information about words, such as their root forms, in data files and lookup tables. From this information, the model can make predictions about the linguistic features of new text.

For example, if it comes across the word “trans,” which can be an adjective, a prefix, or a shortened form of a longer word like “transgender,” it will make a guess about how to categorize this particular usage of “trans” based on other aspects of the sentence, such as the parts of speech of the surrounding words.
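To get a feel for how these predictions work, here is a minimal sketch (the sentence is invented for illustration, and the setup duplicates the import and model loading shown in the next cell) that asks the model how it categorizes “trans” in one particular context:

# a quick sketch: checking how the model categorizes "trans" in context
# (this duplicates the import and model loading shown in the next cell)
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("She is a trans activist.")
for token in doc:
    if token.lower_ == "trans":
        # the model predicts a part-of-speech tag for this usage
        # based on the surrounding words
        print(token.text, token.pos_)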

The code below demonstrates how to import the library, load up the language model, and save it to a variable called nlp.

# importing our spacy library
import spacy

# loading up the model in english
nlp = spacy.load("en_core_web_sm")

After saving the model to nlp, it can be used to process our text. In spaCy speak, this is called passing a text through the “pipeline.”

# passing a dataset (a sentence, in this case) into the nlp() function
doc = nlp("My name is Filipa, and I teach workshops about Python programming at Princeton University.")

the pipeline#

So what happens to a text when we process it with the nlp() function? It goes through a series of “pipes” that separate out the individual words and add linguistic annotations to each of them. These steps include “tokenization,” “tagging,” “parsing,” and “named entity recognition.”

We will go through the steps one by one.

[image of the spaCy processing pipeline]
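We can also ask the loaded pipeline which pipes it contains. A quick check (the exact component names depend on the model and spaCy version):

# listing the processing components ("pipes") in the loaded pipeline
print(nlp.pipe_names)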

tokenization#

First in the pipeline is tokenization. This is the separation of our dataset, which is a long list of words within sentences within documents, into individual words or tokens. Tokens make data more amenable to counting and other kinds of analysis. Punctuation counts as a token, by the way, an important one!

# printing each token, its lowercase form, and its lemma (root form)
for token in doc:
    print(token)          # the token exactly as it appears in the text
    print(token.lower_)   # the lowercase form of the token
    print(token.lemma_)   # the lemma, or root (dictionary) form of the token
My
my
my
name
name
name
is
is
be
Filipa
filipa
Filipa
,
,
,
and
and
and
I
i
I
teach
teach
teach
workshops
workshops
workshop
about
about
about
Python
python
Python
programming
programming
programming
at
at
at
Princeton
princeton
Princeton
University
university
University
.
.
.
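Because the text is now split into tokens, counting becomes straightforward. Here is a small sketch using Python’s built-in collections.Counter to count the lowercase word tokens, skipping punctuation:

# counting lowercase word tokens, skipping punctuation
from collections import Counter

word_counts = Counter(token.lower_ for token in doc if not token.is_punct)
print(word_counts.most_common(5))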

Parts of Speech & Dependency#

After the tokenizer come the tagger and the parser. For each token in the text, the tagger applies part-of-speech tags (like noun, verb, adverb, preposition, and so on), and the parser determines its grammatical dependency (subject, predicate, object, for example).

Some of these annotations might seem excessive, but later on they will become really useful for writing code that selects only specific parts of the text.

for token in doc:
    # prints the token, part of speech, and grammatical dependency
    print(token, token.pos_, token.dep_)
    
My PRON poss
name NOUN nsubj
is AUX ROOT
Filipa PROPN attr
, PUNCT punct
and CCONJ cc
I PRON nsubj
teach VERB conj
workshops NOUN dobj
about ADP prep
Python PROPN compound
programming NOUN pobj
at ADP prep
Princeton PROPN compound
University PROPN pobj
. PUNCT punct
# doing something similar, but using "f-strings" to format the 
# results within strings

for token in doc:
    print(f'token: {token}')
    print(f'part of speech: {token.pos_}')
token: My
part of speech: PRON
token: name
part of speech: NOUN
token: is
part of speech: AUX
token: Filipa
part of speech: PROPN
token: ,
part of speech: PUNCT
token: and
part of speech: CCONJ
token: I
part of speech: PRON
token: teach
part of speech: VERB
token: workshops
part of speech: NOUN
token: about
part of speech: ADP
token: Python
part of speech: PROPN
token: programming
part of speech: NOUN
token: at
part of speech: ADP
token: Princeton
part of speech: PROPN
token: University
part of speech: PROPN
token: .
part of speech: PUNCT
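With these tags in place, we can already write code that selects only specific parts of the text. For example, a quick sketch that keeps just the nouns and proper nouns:

# selecting only the nouns and proper nouns from the sentence
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)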

We can also use a module called displaCy to visualize these relationships.

# importing displaCy, spaCy's built-in visualizer
from spacy import displacy

# rendering the dependency parse; "compact" keeps the diagram smaller
displacy.render(doc, style="dep", options={"compact":True})
[displaCy dependency visualization: the sentence rendered with part-of-speech labels and arrows showing the grammatical dependencies between tokens]

Named Entity Recognition (NER)#

Named Entity Recognition, or NER, annotates words that seem to indicate real-world objects, people, or concepts: “entities,” in other words. Entities include names of persons, places, titles, monetary and other values, times, and dates.

NER is based on predictions made from the underlying model, en_core_web_sm.

# we use doc.ents to access the entities, and label_ to get their label
for ent in doc.ents:
    print(ent, ent.label_)
Filipa PERSON
Princeton University ORG
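If a label like ORG is unfamiliar, spaCy can provide a short description of it with spacy.explain():

# looking up short descriptions of the entity labels
print(spacy.explain("PERSON"))
print(spacy.explain("ORG"))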
# we can also use displacy to visualize the entities in context

displacy.render(doc, style="ent")
[displaCy entity visualization: the sentence with “Filipa” highlighted as PERSON and “Princeton University” highlighted as ORG]

the Doc object#

At the end of the pipeline, we are left with the Doc object, which contains all of the annotations on our text, including POS, Dependency, and NER.
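Because the Doc object holds all of these annotations, we can keep working with it directly. A few quick examples of what it gives us access to:

# the Doc behaves like a sequence of tokens
print(len(doc))          # the number of tokens
print(doc[3])            # a single token, selected by position
print(list(doc.sents))   # the sentences found by the parser
print(doc.ents)          # the named entities we saw earlier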

In the next sections, I will show you how we can leverage these annotations to search for specific patterns in the text.