The Entity-Ruler#
“Entities,” from “Named Entity Recognition” (NER), are labels added to certain words or numbers that fit within a category such as person, place, time, or date. These categories represent words and numbers that are relatively important in the data, which is why NER picks them out and labels them.
By writing a custom Entity-Ruler, or “ruler” for short, we can define our own entities and write instructions for how to find and label entities of our choosing.
For this project, we will write a custom ruler that captures words and phrases related to gender in the text, such as “gender,” “sex,” “male,” and “female.” After we write the patterns we want, we add them to our ruler. At the end, we will pass our text (our dataset of the bills) through the nlp() pipeline once more, this time with our ruler added to the pipeline after the NER component. Here are the steps in order:
1. Write down and code the exact patterns you’re looking for in the text
2. Create the custom ruler and add the patterns to the ruler
3. Run the nlp() pipeline (which now includes our custom ruler) on the text
4. Print the results
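The four steps above can be sketched in a few lines. This is a minimal sketch, not the tutorial's full setup: it assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without downloading a model, and the sample sentence is invented for illustration.

```python
import spacy

# Assumption: a blank English pipeline keeps this sketch light;
# the tutorial itself loads en_core_web_sm.
nlp = spacy.blank("en")

# Step 1: write the patterns (a label plus the exact text to look for)
patterns = [
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
]

# Step 2: create the ruler and add the patterns to it
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Step 3: run the pipeline on some text (an invented sample sentence)
doc = nlp("The survey asked about sex and gender.")

# Step 4: print the results
results = [(ent.text, ent.label_) for ent in doc.ents]
print(results)
```

With the full model loaded instead of a blank pipeline, the same steps apply, but the model's own entities (ORG, DATE, and so on) appear alongside ours.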
Note for advanced users: If you wanted to train a model to find definitions of gender/sex/sexuality from new data, you could use the Entity-Ruler to help prepare the dataset. You would first write a ruler, run it on your dataset, then use the results to “fine-tune” a model. Then, when the model is trained, you can use it to process new text (that it’s never seen before) and automatically apply your entity rules to that text. For more on training an NER model, see Dr. Mattingly’s excellent tutorial on the subject.
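As a rough illustration of the "prepare the dataset" idea in the note, the sketch below collects the ruler's matches as character offsets with labels, one common intermediate form for NER training examples. It assumes a blank English pipeline and an invented sentence; the helper name `to_training_example` is hypothetical, and the exact format a training setup expects may differ.

```python
import spacy

# Assumption: blank pipeline plus a small ruler, just for illustration
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
])

def to_training_example(doc):
    """Hypothetical helper: return (text, [(start_char, end_char, label), ...])."""
    spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    return doc.text, spans

text, spans = to_training_example(
    nlp("Discrimination based on sex or gender is prohibited.")
)
for start, end, label in spans:
    # slicing the text with the offsets recovers the labeled word
    print(label, text[start:end])
```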
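# load our libraries and spaCy's small English model
# (if spaCy is missing: pip install spacy, then
# python -m spacy download en_core_web_sm)
import spacy
import requests
nlp = spacy.load("en_core_web_sm")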
1. write down & code the patterns#
Now we will use NER to add entities to our text. For example, we can create a custom entity to represent any time a word like “gender” or “sex” is mentioned. When writing our patterns, let’s try to separate out terms for gender, sex, and sexuality within the bills.
# List of entities and patterns
# Each pattern is a dictionary (written much like JSON) that pairs a label
# with the pattern that should receive it. With a string pattern, the ruler
# looks for that exact text in the data.
patterns = [
    {"label": "GENDER", "pattern": 'gender'},
    {"label": "SEX", "pattern": 'sex'},
    {"label": "SEXUALITY", "pattern": 'sexuality'},
    {"label": "SEXUALITY", "pattern": 'orientation'}
]
If we want to catch more examples of gender, sex, and sexuality terms, we can add more words to our entity ruler.
# Note: 'queer' is listed under both SEXUALITY and IDENTITY. Entities in
# doc.ents cannot overlap, so each match ends up with only one of the labels.
patterns = [
    {"label": "GENDER", "pattern": 'gender'},
    {"label": "GENDER", "pattern": 'trans'},
    {"label": "GENDER", "pattern": 'nonbinary'},
    {"label": "GENDER", "pattern": 'male'},
    {"label": "GENDER", "pattern": 'female'},
    {"label": "SEX", "pattern": 'sex'},
    {"label": "SEX", "pattern": 'biological'},
    {"label": "SEXUALITY", "pattern": 'sexuality'},
    {"label": "SEXUALITY", "pattern": 'orientation'},
    {"label": "SEXUALITY", "pattern": 'queer'},
    {"label": "IDENTITY", "pattern": 'LGBTQ'},
    {"label": "IDENTITY", "pattern": 'LGBT'},
    {"label": "IDENTITY", "pattern": 'LGBTQIA+'},
    {"label": "IDENTITY", "pattern": 'queer'}
]
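String patterns like the ones above are, by default, matched against the exact token text, so a lowercase 'gender' pattern would not be expected to catch "Gender" or "GENDER" (legislative headings are often in capitals). One possible alternative is a token pattern that compares the lowercase form of each token. The sketch below assumes a blank English pipeline and an invented sample sentence.

```python
import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline, to keep the sketch light
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # token pattern: one token whose lowercase form is "gender"
    {"label": "GENDER", "pattern": [{"LOWER": "gender"}]},
    # two-token pattern for the phrase "sexual orientation", in any casing
    {"label": "SEXUALITY", "pattern": [{"LOWER": "sexual"}, {"LOWER": "orientation"}]},
])

doc = nlp("GENDER equality, Gender identity, and Sexual Orientation")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Token patterns are written as a list with one dictionary per token, which is also how multi-word phrases can be described.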
2. create our ruler & add patterns#
# create the EntityRuler object
ruler = nlp.add_pipe("entity_ruler", after="ner")
# after writing the pattern, we need to add it to our ruler
ruler.add_patterns(patterns)
# check to see that our ruler is now in the pipeline
print(nlp.pipe_names)
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'entity_ruler']
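Because our ruler sits after "ner", any token the model has already labeled keeps the model's label: by default the ruler does not overwrite existing entities. The sketch below illustrates that behavior and the `overwrite_ents` setting. To avoid loading a model, it assumes a blank pipeline and uses a first ruler as a stand-in for the "ner" component; the CHAMBER label and the helper names are hypothetical.

```python
import spacy

def build_pipeline(overwrite):
    """Build a two-ruler pipeline; the first ruler stands in for 'ner'."""
    nlp = spacy.blank("en")
    # Stand-in for the statistical NER (assumption for this sketch)
    first = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    first.add_patterns([{"label": "ORG", "pattern": "Senate"}])
    # Our custom ruler runs second, with or without permission to overwrite
    custom = nlp.add_pipe(
        "entity_ruler", name="custom_ruler", config={"overwrite_ents": overwrite}
    )
    custom.add_patterns([{"label": "CHAMBER", "pattern": "Senate"}])  # hypothetical label
    return nlp

def labels(nlp, text):
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

kept = labels(build_pipeline(False), "The Senate met today.")
replaced = labels(build_pipeline(True), "The Senate met today.")
print(kept)      # the earlier ORG label survives
print(replaced)  # the custom ruler's label wins
```

Another option, if our patterns should always take priority, is to add the ruler with `before="ner"` instead of `after="ner"`.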
3. run nlp() on our text#
Remember that we need to run nlp() after adding our pattern to the ruler. This will ensure that our new pipeline (which contains our custom ruler) has a chance to run on our text.
# load our sample text and run the pipeline on the first 500,000
# characters of our cleaned dataset
source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
decoded = text.decode('utf-8')
doc = nlp(decoded[:500000])
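The cell above processes only the first 500,000 characters of the dataset. If we wanted to cover the whole text, one possible approach is to split it into pieces and process them one at a time (spaCy refuses texts longer than `nlp.max_length`, which defaults to 1,000,000 characters). The helper below is a sketch written for this purpose, not part of spaCy; it breaks at line ends where it can, so that a phrase is less likely to be cut in half.

```python
def split_into_chunks(text, max_chars=500_000):
    """Split text into pieces of at most max_chars, breaking at line ends where possible."""
    chunks = []
    current = ""
    for line in text.splitlines(keepends=True):
        # start a new chunk when adding this line would exceed the limit
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        # a single line longer than the limit is cut into fixed-size slices
        while len(line) > max_chars:
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        current += line
    if current:
        chunks.append(current)
    return chunks

# small demonstration with a tiny limit
sample = "first line\nsecond line\nthird line\n"
chunks = split_into_chunks(sample, max_chars=25)
print(chunks)
```

With the pipeline from this section, the chunks could then be processed with `for doc in nlp.pipe(split_into_chunks(decoded)): ...`, collecting entities from each `doc` in turn.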
4. print the results#
# extract entities
for ent in doc.ents[:60]:
    print(ent.text, ent.label_)
the U.S. Government Publishing ORG
Senate ORG
IS)<DOC>117th ORG
5242To CARDINAL
THE UNITED STATES GPE
December 13 DATE
2022Mrs CARDINAL
Collins PERSON
the Committee on ForeignRelations A ORG
Senate ORG
House of Representatives ORG
the United States of America GPE
Congress ORG
SECTION 1 LAW
TITLE ORG
the ``International Violence Against Women Act PRODUCT
Sec ORG
1 CARDINAL
Sec ORG
2 CARDINAL
Sec ORG
3 CARDINAL
STRATEGY TO PREVENT PERSON
GENDER-BASEDVIOLENCE ORG
101 CARDINAL
201 CARDINAL
Sec ORG
202 CARDINAL
Sec ORG
203 CARDINAL
204 CARDINAL
SEC ORG
2 CARDINAL
An estimated 1 CARDINAL
3 CARDINAL
sex SEX
Up to 70 percent PERCENT
gender GENDER
Swaziland GPE
Tanzania GPE
Zimbabwe GPE
Kenya GPE
Haiti GPE
between 28 CARDINAL
38 percent PERCENT
between 9 and 18 percent DATE
18 years DATE
6 CARDINAL
the International Men and Gender Equality Survey dataset.(6 ORG
gender GENDER
gender GENDER
gender GENDER
up to three CARDINAL
The World Health Organization ORG
more than 50 percent PERCENT
four-fold CARDINAL
gender GENDER
gender GENDER
gender GENDER
The World Health Organization ORG
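A long printout like this is hard to summarize by eye. A small tally of labels can help; the sketch below uses Python's `collections.Counter` on a handful of (text, label) pairs copied from the output above. The function name `count_labels` is our own.

```python
from collections import Counter

def count_labels(pairs):
    """Count how often each label occurs in an iterable of (text, label) pairs."""
    return Counter(label for _text, label in pairs)

# a handful of (text, label) pairs copied from the printed output above
sample_pairs = [
    ("Senate", "ORG"),
    ("Collins", "PERSON"),
    ("sex", "SEX"),
    ("gender", "GENDER"),
    ("Swaziland", "GPE"),
    ("gender", "GENDER"),
]

counts = count_labels(sample_pairs)
for label, n in counts.most_common():
    print(label, n)
```

On the real data, the same function can be fed directly from the doc: `count_labels((ent.text, ent.label_) for ent in doc.ents)`.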
# Remember to run nlp() on the text again after changing the pipeline;
# a doc created earlier will not contain the new entities.
doc = nlp(decoded[:500000])
# extract entities again
for ent in doc.ents[:20]:
    print(ent.text, ent.label_)
the U.S. Government Publishing ORG
Senate ORG
IS)<DOC>117th ORG
5242To CARDINAL
THE UNITED STATES GPE
December 13 DATE
2022Mrs CARDINAL
Collins PERSON
the Committee on ForeignRelations A ORG
Senate ORG
House of Representatives ORG
the United States of America GPE
Congress ORG
SECTION 1 LAW
TITLE ORG
the ``International Violence Against Women Act PRODUCT
Sec ORG
1 CARDINAL
Sec ORG
2 CARDINAL
# print the GENDER entities found among the first 100 entities
for ent in doc.ents[:100]:
    if ent.label_ == 'GENDER':
        print(ent.label_, ent.text)
GENDER gender
GENDER male
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER female
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
GENDER gender
# print the SEX entities found among the first 300 entities
for ent in doc.ents[:300]:
    if ent.label_ == 'SEX':
        print(ent.label_, ent.text)
SEX sex
SEX sex
SEX sex
SEX sex
SEX biological
SEX sex
SEX sex
SEX sex
SEX sex
SEX biological
# print the SEXUALITY entities found among the first 500 entities
for ent in doc.ents[:500]:
    if ent.label_ == 'SEXUALITY':
        print(ent.label_, ent.text)
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
SEXUALITY orientation
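The three loops above slice doc.ents first (the first 100, 300, or 500 entities of any label) and only then filter by label, so the slice size limits how many entities are examined, not how many are printed. If we would rather ask for "the first N entities with this label," one option is to filter first and limit afterwards. The sketch below assumes a blank pipeline and an invented sentence; the helper name is our own.

```python
from itertools import islice

import spacy

def first_ents_with_label(doc, label, limit=20):
    """Return up to `limit` entities in `doc` that carry `label`."""
    matching = (ent for ent in doc.ents if ent.label_ == label)
    return list(islice(matching, limit))

# Assumption: blank pipeline and a tiny ruler, just to have a doc to query
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
])
doc = nlp("gender, sex, gender, and gender again")

gender_ents = first_ents_with_label(doc, "GENDER", limit=2)
for ent in gender_ents:
    print(ent.label_, ent.text)
```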
In the next section, we will leverage these entities to write a more sophisticated pattern matcher using the Matcher class.