the PhraseMatcher#
The PhraseMatcher allows you to write specific phrases or sequences of text to find in the dataset. This is really useful if you already know the kind of thing that you want to pick out, including exact variations of those phrases. It is less useful if you want to account for more than a few variations. For a way to handle more complex variations of phrases, see the token Matcher section.
The process of using the PhraseMatcher involves four steps, each covered in its own section below:
1. Write down & code the exact phrase you’re looking for in the text
2. Create the PhraseMatcher object and pass your phrase into it
3. Run the PhraseMatcher on your doc
4. Print out the matches
1. write down & code the phrase#
From close reading the bills dataset (in the defining gender section), we saw that the definitions include at least a single quote in the form of a backtick, terms like “gender” and “sex”, and the word “means”. The PhraseMatcher matches exact sequences, so we need to narrow our phrases down to the element that all of the definitions share. This would be the backtick ` followed by a term like “gender” or “sex”. I am leaving out everything after the term “gender” or “sex” because sometimes they are followed by single quotes and sometimes by double quotes, and I want to catch all of the possibilities for now.
Our patterns would therefore be the following:
`gender
`sex
2. create PhraseMatcher object and pass your phrase#
First, we will import the necessary libraries and load the English model, which gives us the nlp() pipeline that we will later pass our text through.
import spacy
from spacy.matcher import PhraseMatcher
import requests # for getting the dataset
# loading up the model in english
nlp = spacy.load("en_core_web_sm")
Then, we create the PhraseMatcher object, code our phrases, and pass them into the object.
# create a matcher object.
# we will then add phrases to the object
matcher = PhraseMatcher(nlp.vocab)
# adding our phrases under the label "definitions"
# each phrase is run through nlp() to create its
# own "doc" object.
matcher.add("definitions", [
    nlp("`gender"),
    nlp("`sex")])
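By default the PhraseMatcher compares the exact text of each token, so a pattern like `gender would not match a capitalized `Gender. The sketch below shows one way to relax this with the matcher's attr argument. It assumes spaCy v3, uses a blank English pipeline (tokenizer only) so that no model download is needed, and runs on a made-up sample sentence rather than the bills dataset.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: the PhraseMatcher only needs a tokenizer.
nlp = spacy.blank("en")

# attr="LOWER" compares the lowercased form of each token instead of its exact text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# make_doc() only tokenizes, which is all a phrase pattern requires.
matcher.add("definitions", [nlp.make_doc("`gender"), nlp.make_doc("`sex")])

# Hypothetical sample sentence, not taken from the bills dataset.
doc = nlp("The terms `Gender identity' and `SEX characteristics' are defined below.")

# Collect the matched text for each (match_id, start, end) tuple.
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Whether case-insensitive matching is desirable depends on the dataset: it catches more variants, but it can also add noise.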
3. run the PhraseMatcher#
We can now run the PhraseMatcher on our doc. The results will first appear in a numeric form, but we will convert them to plain text in the next step.
Before running the matcher, let’s load up our dataset, convert it to a string, and finally turn it into a doc object in spaCy.
# loading up our sample text, which is the first million characters
# of our cleaned dataset
source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
type(text)
bytes
decoded = text.decode('utf-8')
# passing our dataset into the nlp() function
# will have to use slicing in order to get around the memory constraints
doc = nlp(decoded[:500000])
# remember list slicing?
doc[:100]
b"Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeS. 5242 Introduced in Senate (IS)<DOC>117th CONGRESS2d SessionS. 5242To prevent international violence against women, and for otherpurposes. IN THE SENATE OF THE UNITED STATES December 13, 2022Mrs. Shaheen (for herself and Ms. Collins) introduced the following bill; which was read twice and referred to the Committee on ForeignRelations A BILL To prevent international violence against women, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1.
type(doc)
spacy.tokens.doc.Doc
len(doc)
86025
# run the matcher on the doc
matches = matcher(doc)
# printing out the first 10 results.
# we get the hash, start and end locations
matches[:10]
[(5344954752463023658, 2287, 2289),
(5344954752463023658, 4384, 4386),
(5344954752463023658, 7828, 7830),
(5344954752463023658, 8041, 8043),
(5344954752463023658, 8169, 8171),
(5344954752463023658, 8340, 8342),
(5344954752463023658, 8463, 8465),
(5344954752463023658, 8470, 8472),
(5344954752463023658, 8490, 8492),
(5344954752463023658, 8501, 8503)]
# see how many we got total
len(matches)
72
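The long number at the start of each match is the hash of the label we passed to matcher.add(). As a minimal sketch, assuming spaCy v3 and a blank English pipeline with a made-up sentence, the hash can be turned back into its label by looking it up in the vocabulary's string store:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp.make_doc("`gender"), nlp.make_doc("`sex")])

# Hypothetical sample sentence, not taken from the bills dataset.
doc = nlp("The term `gender identity' means a person's sense of self.")

# Each match is a tuple of (hash of the label, start token, end token).
match_id, start, end = matcher(doc)[0]

# The string store maps the hash back to the label text.
label = nlp.vocab.strings[match_id]
print(label, start, end, doc[start:end].text)
```

This becomes more useful once several labels are registered on the same matcher, since the hash is the only thing that tells the matches apart.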
4. print the results#
Finally, we print out the plain text of our results.
# our first match consists of numbers: a numeric hash
# and the start and end positions of the match in our text data
matches[0]
(5344954752463023658, 2287, 2289)
# to see the actual text, we need to use the .sent attribute
number, start, end = matches[0]
print(doc[start:end].sent)
Gender analysis.--The term ``gender analysis''--(A) means a socioeconomic analysis of available or gathered quantitative and qualitative information to identify, understand, and explain gaps between men and women, which typically involves examining--(i) differences in the status of women and men and differential access to and control over assets, resources, education, opportunities, and services;(ii) the influence of gender roles, structural barriers, and norms on the division of time between paid, unpaid work (including the subsistence production and care for family members), and volunteer activities;(iii) the influence of gender roles, structural barriers, and norms on leadership roles and decision making; constraints, opportunities, and entry points for narrowing gender gaps and empowering women; and(iv) potential differential impacts of development policies and programs on men and women, including unintended or negative consequences; and(B) includes conclusions and recommendations to enable development policies and programs--(i) to narrow gender gaps; and(ii) to improve the lives of women and girls.(5) Office.--The term ``Office'' means the Office of Global Women's Issues established by the Secretary of State pursuant to section 202(a).(6)
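As an alternative to unpacking each tuple by hand, the matcher can return Span objects directly. The sketch below assumes spaCy v3 (where the as_spans keyword is available), a blank English pipeline, and a made-up sentence. Because a blank pipeline sets no sentence boundaries, it prints the matched text rather than .sent.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp.make_doc("`gender"), nlp.make_doc("`sex")])

# Hypothetical sample sentence, not taken from the bills dataset.
doc = nlp("The term `gender identity' means a person's sense of self.")

# as_spans=True returns Span objects instead of (hash, start, end) tuples.
spans = matcher(doc, as_spans=True)

for span in spans:
    # label_ is the readable label; start and end are token positions.
    print(span.label_, span.start, span.end, span.text)
```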
# to see the actual text for every match, we loop over the matches
# and slice the doc with each match's start and end positions
# we can also print out the whole sentence, with .sent
for match in matches[:10]:
    number, start, end = match
    print(doc[start:end].sent)
    print('\n')
Gender analysis.--The term ``gender analysis''--(A) means a socioeconomic analysis of available or gathered quantitative and qualitative information to identify, understand, and explain gaps between men and women, which typically involves examining--(i) differences in the status of women and men and differential access to and control over assets, resources, education, opportunities, and services;(ii) the influence of gender roles, structural barriers, and norms on the division of time between paid, unpaid work (including the subsistence production and care for family members), and volunteer activities;(iii) the influence of gender roles, structural barriers, and norms on leadership roles and decision making; constraints, opportunities, and entry points for narrowing gender gaps and empowering women; and(iv) potential differential impacts of development policies and programs on men and women, including unintended or negative consequences; and(B) includes conclusions and recommendations to enable development policies and programs--(i) to narrow gender gaps; and(ii) to improve the lives of women and girls.(5) Office.--The term ``Office'' means the Office of Global Women's Issues established by the Secretary of State pursuant to section 202(a).(6)
Gender reassignment medical intervention defined``For purposes of this chapter, the term `gender reassignment medical intervention' means--``(1) performing a surgery that sterilizes an individual, including castration, vasectomy, hysterectomy, oophorectomy, metoidioplasty, penectomy, phalloplasty, and vaginoplasty, to change the body of such individual to correspond to a sex that is discordant with biological sex;``(2) performing a mastectomy on an individual for the purpose described in paragraph (1); and``(3) administering or supplying to an individual medications for the purpose described in paragraph (1), including--``(A)
3. PUBLIC ACCOMMODATIONS.(a) Prohibition on Discrimination or Segregation in Public Accommodations.--Section 201 of the Civil Rights Act of 1964 (42 U.S.C. 2000a) is amended--(1) in subsection (a), by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin''; and(2) in subsection (b)--(A) in paragraph (3), by striking ``stadium'' and all that follows and inserting ``stadium or other place of or establishment that provides exhibition, entertainment, recreation, exercise, amusement, public gathering, or public display;'';(B) by redesignating paragraph (4) as paragraph (6); and(C) by inserting after paragraph (3) the following:``(4) any establishment that provides a good, service, or program, including a store, shopping center, online retailer or service provider, salon, bank, gas station, food bank, service or care center, shelter, travel agency, or funeral parlor, or establishment that provides health care, accounting, or legal services;``(5) any train service, bus service, car service, taxi service, airline service, station, depot, or other place of or establishment that provides transportation service; and''.(b) Prohibition on Discrimination or Segregation Under Law.--Section 202 of such Act (42 U.S.C. 2000a-1) is amended by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin''.(c)
3. PUBLIC ACCOMMODATIONS.(a) Prohibition on Discrimination or Segregation in Public Accommodations.--Section 201 of the Civil Rights Act of 1964 (42 U.S.C. 2000a) is amended--(1) in subsection (a), by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin''; and(2) in subsection (b)--(A) in paragraph (3), by striking ``stadium'' and all that follows and inserting ``stadium or other place of or establishment that provides exhibition, entertainment, recreation, exercise, amusement, public gathering, or public display;'';(B) by redesignating paragraph (4) as paragraph (6); and(C) by inserting after paragraph (3) the following:``(4) any establishment that provides a good, service, or program, including a store, shopping center, online retailer or service provider, salon, bank, gas station, food bank, service or care center, shelter, travel agency, or funeral parlor, or establishment that provides health care, accounting, or legal services;``(5) any train service, bus service, car service, taxi service, airline service, station, depot, or other place of or establishment that provides transportation service; and''.(b) Prohibition on Discrimination or Segregation Under Law.--Section 202 of such Act (42 U.S.C. 2000a-1) is amended by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin''.(c)
4. DESEGREGATION OF PUBLIC FACILITIES.Section 301(a) of the Civil Rights Act of 1964 (42 U.S.C. 2000b(a)) is amended by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin''.
6. FEDERAL FUNDING.Section 601 of the Civil Rights Act of 1964 (42 U.S.C. 2000d) is amended by inserting ``sex (including sexual orientation and gender identity),'' before ``or national origin,''.
Unlawful Employment Practices.--Section 703 of the Civil Rights Act of 1964 (42 U.S.C. 2000e-2) is amended--(1) in the section header, by striking ``sex,'' and inserting ``sex (including sexual orientation and gender identity),'';(2) except in subsection (e), by striking ``sex,'' each place it appears and inserting ``sex (including sexual orientation and gender identity),'';(3) in subsection (e)(1), by striking ``enterprise,'' and inserting ``enterprise, if, in a situation in which sex is a bona fide occupational qualification, individuals are recognized as qualified in accordance with their gender identity,''; and(4) in subsection (h), by striking ``sex'' the second place it appears and inserting ``sex (including sexual orientation and gender identity),''.(c) Other Unlawful Employment Practices.--Section 704(b) of the Civil Rights Act of 1964 (42 U.S.C. 2000e-3(b)) is amended--(1) by striking ``sex,'' the first place it appears and inserting ``sex (including sexual orientation and gender identity),''; and(2) by striking ``employment.''
Unlawful Employment Practices.--Section 703 of the Civil Rights Act of 1964 (42 U.S.C. 2000e-2) is amended--(1) in the section header, by striking ``sex,'' and inserting ``sex (including sexual orientation and gender identity),'';(2) except in subsection (e), by striking ``sex,'' each place it appears and inserting ``sex (including sexual orientation and gender identity),'';(3) in subsection (e)(1), by striking ``enterprise,'' and inserting ``enterprise, if, in a situation in which sex is a bona fide occupational qualification, individuals are recognized as qualified in accordance with their gender identity,''; and(4) in subsection (h), by striking ``sex'' the second place it appears and inserting ``sex (including sexual orientation and gender identity),''.(c) Other Unlawful Employment Practices.--Section 704(b) of the Civil Rights Act of 1964 (42 U.S.C. 2000e-3(b)) is amended--(1) by striking ``sex,'' the first place it appears and inserting ``sex (including sexual orientation and gender identity),''; and(2) by striking ``employment.''
Unlawful Employment Practices.--Section 703 of the Civil Rights Act of 1964 (42 U.S.C. 2000e-2) is amended--(1) in the section header, by striking ``sex,'' and inserting ``sex (including sexual orientation and gender identity),'';(2) except in subsection (e), by striking ``sex,'' each place it appears and inserting ``sex (including sexual orientation and gender identity),'';(3) in subsection (e)(1), by striking ``enterprise,'' and inserting ``enterprise, if, in a situation in which sex is a bona fide occupational qualification, individuals are recognized as qualified in accordance with their gender identity,''; and(4) in subsection (h), by striking ``sex'' the second place it appears and inserting ``sex (including sexual orientation and gender identity),''.(c) Other Unlawful Employment Practices.--Section 704(b) of the Civil Rights Act of 1964 (42 U.S.C. 2000e-3(b)) is amended--(1) by striking ``sex,'' the first place it appears and inserting ``sex (including sexual orientation and gender identity),''; and(2) by striking ``employment.''
Unlawful Employment Practices.--Section 703 of the Civil Rights Act of 1964 (42 U.S.C. 2000e-2) is amended--(1) in the section header, by striking ``sex,'' and inserting ``sex (including sexual orientation and gender identity),'';(2) except in subsection (e), by striking ``sex,'' each place it appears and inserting ``sex (including sexual orientation and gender identity),'';(3) in subsection (e)(1), by striking ``enterprise,'' and inserting ``enterprise, if, in a situation in which sex is a bona fide occupational qualification, individuals are recognized as qualified in accordance with their gender identity,''; and(4) in subsection (h), by striking ``sex'' the second place it appears and inserting ``sex (including sexual orientation and gender identity),''.(c) Other Unlawful Employment Practices.--Section 704(b) of the Civil Rights Act of 1964 (42 U.S.C. 2000e-3(b)) is amended--(1) by striking ``sex,'' the first place it appears and inserting ``sex (including sexual orientation and gender identity),''; and(2) by striking ``employment.''
We can see that we’ve captured a lot here, even more than what we wanted, which was only the definitions of our gender terms.
For example, we also captured amendment phrases like “striking ‘sex’” and “inserting ‘sex’”. In the token Matcher section, we will look at ways of writing patterns that can handle more variations in our results.
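Until then, one rough way to separate definitions from amendments is to keep only the matches that are followed shortly by the word “means”, which we saw in the definitions during close reading. This is a sketch under stated assumptions, not the approach used in the token Matcher section: the WINDOW size is an arbitrary choice, the sample text is made up, and a blank English pipeline stands in for the full model.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp.make_doc("`gender"), nlp.make_doc("`sex")])

# Hypothetical text: one definition followed by one amendment.
doc = nlp(
    "The term `gender identity' means a person's identity. "
    "The Act is amended by inserting `sex (including gender identity),' "
    "before the words."
)

# Assumption: in a definition, "means" appears within a few tokens of the term.
WINDOW = 6

kept = []
for _, start, end in matcher(doc):
    # Lowercased text of the tokens that follow the matched phrase.
    following = [token.lower_ for token in doc[end:end + WINDOW]]
    if "means" in following:
        kept.append(doc[start:end].text)

print(kept)
```

A fixed window will miss definitions with long term names and may keep unrelated uses of “means”, so the results still need to be checked by reading.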
Let’s save the data to a plain text file.
import os
# first, create an empty list to store our definitions
defs = []
# then, write a loop that appends our data to that list with some useful labels
for match in matches:
    number, start, end = match
    defs.append(f'Phrase: "{doc[start:end]}", ')
    defs.append('\n')
    defs.append(f"Sentence: {doc[start].sent}")
    defs.append('\n')
    defs.append(f'Starts: {start} of {len(doc)}')
    defs.append('\n')
    defs.append('\n')
# make sure the output folder exists before writing to it
os.makedirs('./out', exist_ok=True)
# finally, save that list to a plain text file called 'definitions'
with open('./out/definitions.txt', 'w', encoding='utf-8') as f:
    for item in defs:
        f.write(str(item))
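Because several matches can fall inside the same sentence, the printed results above repeat some sentences more than once, and the saved file will too. If only the distinct sentences are of interest, a small sketch like the following removes repeats while keeping the original order. The sample strings are hypothetical stand-ins for the text of each match's sentence.

```python
# Hypothetical stand-ins for the sentence text of each match,
# e.g. what doc[start:end].sent.text would return in the loop above.
sentences = [
    "SEC. 3. PUBLIC ACCOMMODATIONS.",
    "SEC. 3. PUBLIC ACCOMMODATIONS.",
    "SEC. 4. DESEGREGATION OF PUBLIC FACILITIES.",
    "SEC. 3. PUBLIC ACCOMMODATIONS.",
]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_sentences = list(dict.fromkeys(sentences))

for sentence in unique_sentences:
    print(sentence)
```

Dropping repeats loses the count of matches per sentence, so keep the full list as well if that count matters for the analysis.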