the token Matcher#
The token Matcher is very similar to the PhraseMatcher from two sections before. The difference is that this Matcher allows for more variation, so we can capture different forms of the same basic pattern. For example, we could get definitions of gender (and sex, and sexuality) that use different words (like “means” or “includes”) or kinds of punctuation (like single or double quotes) in the definition.
The token Matcher works by matching a pattern of tokens that we define using the token attributes. We can, for example, leverage the work we did with the EntityRuler in the previous section to help create our token Matcher.
First, we will import the matcher to create a matcher object. Then, we will write patterns and save them. After that, we add our new patterns to the matcher. Finally, we will run the matcher on our document. The steps are the following:
write patterns to matcher
configure and run matcher
print the results
Let’s take it one step at a time.
1. write patterns to the matcher#
# loading up our libraries and text
import spacy
import requests
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
We want to capture not just “gender,” but “sex” and “sexuality,” as well as other synonyms for these terms. That’s where the custom entities from the last section will become useful.
Below I am re-creating the custom entity ruler so that we can leverage these entities in our token matcher.
ruler = nlp.add_pipe("entity_ruler", after="ner")
patterns = [
{"label": "GENDER", "pattern": 'gender'},
{"label": "GENDER", "pattern": 'trans'},
{"label": "GENDER", "pattern": 'nonbinary'},
{"label": "GENDER", "pattern": 'male'},
{"label": "GENDER", "pattern": 'female'},
{"label": "SEX", "pattern": 'sex'},
{"label": "SEX", "pattern": 'biological'},
{"label": "SEXUALITY", "pattern": 'sexuality'},
{"label": "SEXUALITY", "pattern": 'orientation'},
{"label": "SEXUALITY", "pattern": 'queer'},
{"label": "IDENTITY", "pattern": 'LGBTQ'},
{"label": "IDENTITY", "pattern": 'LGBT'},
{"label": "IDENTITY", "pattern": 'LGBTQIA+'},
{"label": "IDENTITY", "pattern": 'queer'}
]
ruler.add_patterns(patterns)
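To check that a ruler like this labels tokens the way we expect, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`) instead of `en_core_web_sm`, so there is no `ner` component to place the ruler after, and it uses only a few of the patterns above. The names `nlp_demo`, `ruler_demo`, and `demo_doc`, and the sample sentence, are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm,
# so the ruler is simply added as the only component.
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
    {"label": "SEXUALITY", "pattern": "orientation"},
])

# Hypothetical sample sentence, not taken from the dataset
demo_doc = nlp_demo("The bill defines gender, sex, and sexual orientation.")

# Collect each entity's text together with its custom label
found = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(found)
```

Each token that matches one of the ruler's strings should come back with its custom label, and that label is what the `ENT_TYPE` attribute reads later in the token pattern.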
This is the basic format of a Matcher pattern. We will add much more detail to this format later on, but it’s a good idea to get a sense of how it’s structured now: a list of JSON-style key-value pairs, with one dictionary per token.
The keys are token attributes, which are listed on this page: https://spacy.io/api/matcher
pattern_format = [
{
'LOWER': 'gender'
},
{
'IS_PUNCT': True
},
{
'LOWER': 'means'
}
]
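To see a pattern of this shape in action, here is a small sketch that runs it on a short sentence. It assumes a blank English pipeline, since only the tokenizer is needed here. The names `nlp_demo`, `basic_pattern`, `demo_matcher`, and `basic_hits`, and the sample text, are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, enough for LOWER and IS_PUNCT
nlp_demo = spacy.blank("en")

# Same three-token shape as above: "gender", one punctuation token, "means"
basic_pattern = [
    {"LOWER": "gender"},
    {"IS_PUNCT": True},
    {"LOWER": "means"},
]

demo_matcher = Matcher(nlp_demo.vocab)
demo_matcher.add("basic_definition", [basic_pattern])

# Hypothetical text: the first sentence has a quote mark between
# "gender" and "means", the second does not.
demo_doc = nlp_demo("The term 'gender' means a social identity. Gender means more.")

basic_hits = [demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc)]
print(basic_hits)
```

Only the first sentence should match, because the pattern requires exactly one punctuation token between the two words. The operators introduced next relax that kind of rigidity.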
We want to capture a specific pattern where gender is being defined. We’d want a phrase that captures text like: “gender means”, and to also get variations on the punctuation and/or terminologies in that text. For example, we want to get instances where they use both single and double quotes.
pattern = [
# specifying the entity type, which can be one of our three
# custom entities
{"ENT_TYPE": {
'IN': [
'GENDER', 'SEX', 'SEXUALITY'
]
}
},
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{'OP': '?'}, # catches a "wild card" if it appears zero or one time.
{
'IS_PUNCT': True, 'OP': '+' #one or more times
},
{
# getting the lowercase word of any of the following terms
'LOWER': {
'IN': [
'means', 'signifies', 'includes'
]
}
}
]
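The following sketch tries a shortened version of this pattern on two made-up sentences, to show how the optional wildcards and the `IN` lists let one pattern cover several phrasings. It assumes a blank English pipeline with a two-pattern entity ruler, and uses two wildcards instead of seven. All names ending in `_demo`, along with `short_pattern` and `hits`, are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline plus a tiny entity ruler for the demo
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEXUALITY", "pattern": "orientation"},
])

short_pattern = [
    # a token carrying one of our custom entity labels
    {"ENT_TYPE": {"IN": ["GENDER", "SEX", "SEXUALITY"]}},
    {"OP": "?"},  # optional wildcard token
    {"OP": "?"},  # optional wildcard token
    {"IS_PUNCT": True, "OP": "+"},  # one or more punctuation tokens
    {"LOWER": {"IN": ["means", "signifies", "includes"]}},
]

demo_matcher = Matcher(nlp_demo.vocab)
demo_matcher.add("definition", [short_pattern])

# Hypothetical sentences imitating the style of the bills
demo_doc = nlp_demo(
    "The term 'gender identity' means one thing. "
    "The term 'sexual orientation' includes another."
)

hits = {demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc)}
print(hits)
```

The first match uses one wildcard (for "identity") and the second uses none, which is the kind of variation a fixed phrase list cannot express.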
2. configure and run matcher#
Now we can configure the Matcher. First, create the matcher, then add our pattern to the matcher, and finally run the matcher on our doc.
# loading up our sample text, which is the first 500,000 characters
# of our cleaned dataset
source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
decoded = text.decode('utf-8')
doc = nlp(decoded[:500000])
# use matcher class to create a matcher object
matcher = Matcher(nlp.vocab)
# add pattern to matcher
matcher.add('definition', [pattern])
# run matcher over doc
matches = matcher(doc)
# how many matches did we get?
len(matches)
33
3. print the matches#
Let’s see the actual text.
for match_id, start, end in matches[:10]:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(string_id, start, end, span.text)
    print(doc[start].sent)
    print('\n')
definition 2288 2292 gender analysis''--(A) means
Gender analysis.--The term ``gender analysis''--(A) means a socioeconomic analysis of available or gathered quantitative and qualitative information to identify, understand, and explain gaps between men and women, which typically involves examining--(i) differences in the status of women and men and differential access to and control over assets, resources, education, opportunities, and services;(ii) the influence of gender roles, structural barriers, and norms on the division of time between paid, unpaid work (including the subsistence production and care for family members), and volunteer activities;(iii) the influence of gender roles, structural barriers, and norms on leadership roles and decision making; constraints, opportunities, and entry points for narrowing gender gaps and empowering women; and(iv) potential differential impacts of development policies and programs on men and women, including unintended or negative consequences; and(B) includes conclusions and recommendations to enable development policies and programs--(i) to narrow gender gaps; and(ii) to improve the lives of women and girls.(5) Office.--The term ``Office'' means the Office of Global Women's Issues established by the Secretary of State pursuant to section 202(a).(6)
definition 9680 9684 gender identity' means
Gender identity.--The term `gender identity' means the gender-related identity, appearance, mannerisms, or other gender-related characteristics of an individual, regardless of the individual's designated sex at birth.
definition 9785 9788 orientation' means
``(5) Sexual orientation.--The term `sexual orientation' means homosexuality, heterosexuality, or bisexuality.
definition 11856 11861 gender transition procedure' means
In general.--The term `gender transition procedure' means any medical or surgical service which seeks to alter or remove physiological or anatomical characteristics or features which are typical for the individual's biological sex, or to instill or create physiological or anatomical characteristics which resemble a sex different from the individual's birth sex, for the purpose of gender transition, including--``(I) physician's services and inpatient and outpatient hospital services, including gender transition surgery, and``(II) prescribed drugs related to gender transition, including puberty-blocking drugs, cross-sex hormones, or other mechanisms to promote the development of feminizing or masculinizing features (in the opposite sex).``(ii) Exceptions.--Such term does not include--``(I) services for treatment of a medically-verifiable disorder of sex development, including--``(aa) external biological sex characteristics which are irresolvably ambiguous, such as presence of 46 XX chromosomes with virilization, 46 XY chromosomes with undervirilization, or both ovarian and testicular tissue, or``(bb) other physician-diagnosed disorder of sexual development, with respect to which the physician has determined through genetic or biochemical testing that the individual does not have normal sex chromosome structure, sex steroid hormone production, or sex steroid hormone action for a biological male or biological female, or``(II) treatment of any infection, injury, disease, or disorder caused or exacerbated by the performance of any gender transition procedure, whether or not the gender transition procedure was performed in accordance with State and Federal law or whether not a deduction for expenses in connection with the gender transition procedure is allowable under this chapter.
definition 12143 12146 gender' means
Gender.--The term `gender' means the psychological, behavioral, social, and cultural aspects of being male or female.
definition 12168 12172 gender transition' means
Gender transition.--The term `gender transition' means the process in which an individual goes from identifying with and living as a gender that corresponds to his or her biological sex to identifying with and living as a gender different from his or her biological sex, and may involve social, legal, or physical changes.
definition 12232 12237 gender transition surgery' means
In general.--The term `gender transition surgery' means any surgical service, including genital or non-genital surgery, performed for the purpose of assisting an individual with a gender transition.
definition 12548 12552 sex hormones' means
hormones.--The term `cross-sex hormones' means testosterone or other androgens given to biological females at doses which are profoundly larger or more potent than would normally occur naturally in healthy biological females, and estrogen given to biological males at doses which are profoundly larger or more potent than would normally occur naturally in healthy biological males.
definition 12542 12552 sex hormones.--The term `cross-sex hormones' means
``(ix) Cross-sex
definition 13845 13848 sex'' means
Biological sex.--The term ``biological sex'' means the biological indication of male and female in the context of reproductive potential or capacity, such as sex chromosomes, naturally occurring sex hormones, gonads, and nonambiguous internal and external genitalia present at birth, without regard to a person's psychological, chosen, or subjective experience of gender.(2)
Due to the versatility of the token Matcher, we can catch instances like gender dysphoria’’ means and orientation includes, which go beyond what we were able to do with the PhraseMatcher. Pretty cool, right?
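The printed matches include two overlapping spans that both end at token 12552, because the optional wildcards allow the pattern to start from either "sex" token. If that is unwanted, one possible way to collapse overlaps is sketched below in plain Python. The function name `drop_overlaps` and its `longest` flag are hypothetical, and the string "definition" stands in for the integer match ID. spaCy's `Matcher.add` also accepts a `greedy` argument ("FIRST" or "LONGEST") that may serve a similar purpose.

```python
def drop_overlaps(matches, longest=True):
    """Keep one match from each group of overlapping (match_id, start, end) tuples.

    With longest=True the longest span wins; otherwise the shortest wins.
    """
    sign = -1 if longest else 1
    # Preferred lengths first; ties are broken by the earlier start
    ordered = sorted(matches, key=lambda m: (sign * (m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        span_tokens = set(range(start, end))
        if taken.isdisjoint(span_tokens):
            kept.append((match_id, start, end))
            taken |= span_tokens
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])


# Token offsets copied from the printed output above
sample = [
    ("definition", 12143, 12146),
    ("definition", 12548, 12552),
    ("definition", 12542, 12552),
]

longest_only = drop_overlaps(sample)
shortest_only = drop_overlaps(sample, longest=False)
print(longest_only)
print(shortest_only)
```

For definitions like these, the shortest span is probably the more readable one, since the longer match drags in the heading text that precedes the defined term.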
The next step is to save our data as a plain text file, so we can review it later.
We will include just the matched phrase and the full sentence in which it occurs.
import os

# make sure the output folder exists before writing to it
os.makedirs('./out', exist_ok=True)

with open('./out/matcher_defs.txt', 'w', encoding='utf-8') as f:
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        f.write(span.text + '\n')
        # the full sentence that contains the start of the match
        f.write(doc[start].sent.text + '\n\n')
That’s all, folks!