the PhraseMatcher

The PhraseMatcher lets you specify exact phrases or sequences of text to find in the dataset. This is useful when you already know the kind of thing you want to pick out, including the exact variations of those phrases. It is less useful when you need to account for more than a few variations. For a way to handle more complex variations of phrases, see the token Matcher section.

Using the PhraseMatcher involves four steps, each covered in its own section below.

  1. Write down & code the exact phrase you’re looking for in the text

  2. Create the PhraseMatcher object and pass your phrase into it

  3. Run the PhraseMatcher on your doc

  4. Print out the matches

1. write down & code the phrase

From a close reading of the bills dataset (in the defining gender section), we saw that the definitions include at least a single quote in the form of a backtick, a term like “gender” or “sex”, and the word “means”. Because the PhraseMatcher looks for exact phrases, we need to narrow the pattern down to the element that all of the definitions share: the backtick ` followed by a term like “gender” or “sex”. I am leaving out everything after the term because it is sometimes followed by a single quote and sometimes by a double quote, and I want to catch all of the possibilities for now.

Our patterns would therefore be the following:

`gender
`sex
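
Because the PhraseMatcher compares sequences of tokens, it can help to check how spaCy splits these patterns before building the matcher. The sketch below is an illustration rather than part of the workflow in this section. It assumes a blank English pipeline (`spacy.blank("en")`), which provides the English tokenizer without requiring a downloaded model; the loop and variable names are my own.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained
# components) is enough to inspect how the patterns are tokenized.
nlp = spacy.blank("en")

for phrase in ["`gender", "`sex"]:
    # collect the text of each token that spaCy produces for the phrase
    tokens = [token.text for token in nlp(phrase)]
    print(phrase, "->", tokens)
```

If the backtick comes out as its own token, each pattern is two tokens long, which is worth keeping in mind when reading the start and end positions that the matcher returns.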

2. create PhraseMatcher object and pass your phrase

First, we will import the necessary libraries and load the English model that provides the nlp() pipeline.

import spacy
from spacy.matcher import PhraseMatcher
import requests # for getting the dataset

# loading up the model in english
nlp = spacy.load("en_core_web_sm")

Then, we create the PhraseMatcher object, code our phrases, and pass them into the object.

# create a matcher object.
# we will then add phrases to the object

matcher = PhraseMatcher(nlp.vocab)
# adding our phrases under the label "definitions".
# each phrase is run through nlp() to create its
# own "doc" object.
matcher.add("definitions", [
  nlp("`gender"),
  nlp("`sex")])
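
For a longer list of phrases, the same step can be written as a loop. The sketch below is one possible variation, not the code used in this section. It assumes a blank English pipeline so that it runs without a downloaded model, and it builds the pattern docs with `nlp.make_doc()`, which only tokenizes the text; that is enough when matching on exact token text, the PhraseMatcher default. The `terms` list is a hypothetical name of my own.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline stands in for en_core_web_sm here.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical list of phrases; extend it as needed.
terms = ["`gender", "`sex"]

# make_doc() runs only the tokenizer, skipping the rest of the pipeline.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("definitions", patterns)

# How many rules (labels) the matcher holds, and whether ours is one of them.
print(len(matcher))
print("definitions" in matcher)
```

Both phrases sit under the single label "definitions", so the matcher counts them as one rule with two patterns.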

3. run the PhraseMatcher

We can now run the PhraseMatcher on our doc. The results will first appear in a numeric form, but we will convert them to plain text in the next step.

Before running the matcher, let’s load up our dataset, convert it to a string, and finally turn it into a doc object in spaCy.

# loading up our sample text, which is the first million characters
# of our cleaned dataset

source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
type(text)
bytes
decoded = text.decode('utf-8')
# passing our dataset into the nlp() function.
# we slice the string to stay within memory constraints

doc = nlp(decoded[:500000])
# remember list slicing?

doc[:100]
Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeS. 5242 Introduced in Senate (IS)<DOC>117th CONGRESS2d SessionS. 5242To prevent international violence against women, and for otherpurposes. IN THE SENATE OF THE UNITED STATES December 13, 2022Mrs. Shaheen (for herself and Ms. Collins) introduced the following bill; which was read twice and referred to the Committee on ForeignRelations A BILL To prevent international violence against women, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1.
type(doc)
spacy.tokens.doc.Doc
len(doc)
86025
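
Note that len(doc) counts tokens, not characters, which matters because the matcher reports positions as token indices. As a rough check, the two numbers above can be compared directly. This is a small sketch with variable names of my own, and it assumes the decoded text holds at least 500,000 characters, so that the slice is exactly that long.

```python
# Numbers taken from the cells above.
n_characters = 500_000  # length of the slice passed to nlp()
n_tokens = 86_025       # len(doc)

# Average number of characters per token, whitespace included.
chars_per_token = n_characters / n_tokens
print(round(chars_per_token, 2))
```

This comes to a little under six characters per token on average, so a token index in the match results will be much smaller than the character position of the same spot in the text.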
# run the matcher on the doc
matches = matcher(doc)

# printing out the first 10 results.
# each result holds the hash of our label, then the start and end token positions
matches[:10]
[(5344954752463023658, 2287, 2289),
 (5344954752463023658, 4384, 4386),
 (5344954752463023658, 7828, 7830),
 (5344954752463023658, 8041, 8043),
 (5344954752463023658, 8169, 8171),
 (5344954752463023658, 8340, 8342),
 (5344954752463023658, 8463, 8465),
 (5344954752463023658, 8470, 8472),
 (5344954752463023658, 8490, 8492),
 (5344954752463023658, 8501, 8503)]
# see how many we got total
len(matches)
72
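
Each result is a tuple of three integers: the hash of the label we gave the rule, the index of the first token in the match, and the index one past the last token. The sketch below uses plain Python on the first three tuples copied from the output above to make that structure explicit. The variable names are my own, and no spaCy objects are needed to run it.

```python
# The first three results, copied from the output above.
sample_matches = [
    (5344954752463023658, 2287, 2289),
    (5344954752463023658, 4384, 4386),
    (5344954752463023658, 7828, 7830),
]

for match_id, start, end in sample_matches:
    # end is exclusive, so the number of tokens matched is end - start
    print(match_id, start, end, "length:", end - start)

# Count the distinct hashes among these results.
unique_ids = {match_id for match_id, _, _ in sample_matches}
print(len(unique_ids))
```

Every span here is two tokens long, and the results share a single hash because both phrases were added under the one label "definitions".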