the Entity-Ruler#

“Entities,” from “Named Entity Recognition,” are labels added to certain words or numbers that fit within a category like person, place, time, or date. These categories represent words and numbers that are relatively important in the data, which is why NER picks them out and labels them.

By writing a custom Entity-Ruler, or “ruler” for short, we can define our own entities and write instructions for how to find and label them.

For this project, we will write a custom ruler that captures words and phrases related to gender in the text, such as “gender,” “sex,” “male,” and “female.” Once we have written the patterns we want, we add them to our ruler. Finally, we pass our text (our dataset of the bills) through the nlp() pipeline once more, this time with our ruler added to the pipeline after the NER component. Here are the steps in order:

  1. Write down & code the exact pattern you’re looking for in the text

  2. Create the custom ruler and add the patterns to the ruler

  3. Run the nlp() pipeline (which now includes our custom ruler) on the text

  4. Print the results

Note for advanced users: If you wanted to train a model to find definitions of gender/sex/sexuality from new data, you could use the Entity-Ruler to help prepare the dataset. You would first write a ruler, run it on your dataset, then use the results to “fine-tune” a model. Then, when the model is trained, you can use it to process new text (that it’s never seen before) and automatically apply your entity rules to that text. For more on training an NER, see Dr. Mattingly’s excellent tutorial on the subject.
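As a rough illustration of the first half of that workflow, the sketch below runs a ruler over a couple of texts and collects the labelled results in spaCy's `DocBin` container, which is the format spaCy's training command reads. It assumes a blank English pipeline and two invented sentences. The names `prep_nlp`, `sample_texts`, and `train_bin` are hypothetical, and the fine-tuning step itself is not shown.

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank English pipeline (tokenizer only) is enough to show the idea.
prep_nlp = spacy.blank("en")
prep_ruler = prep_nlp.add_pipe("entity_ruler")
prep_ruler.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
])

# Hypothetical sentences, invented for illustration (not taken from the bills).
sample_texts = [
    "The act prohibits discrimination on the basis of sex.",
    "The term gender is not defined in this section.",
]

# Label each text with the ruler and collect the Docs for later training.
train_bin = DocBin()
for labelled_doc in prep_nlp.pipe(sample_texts):
    train_bin.add(labelled_doc)

print(len(train_bin), "labelled docs collected")
# train_bin.to_disk("train.spacy") would save them for spaCy's training command
```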

# loading up our libraries and text
import spacy
import requests
nlp = spacy.load("en_core_web_sm")

1. write down & code the patterns#

Now we will use NER to add entities to our text. For example, we can create a custom entity that marks every mention of a word like “gender” or “sex.” When writing our patterns, let’s try to separate out terms for gender, sex, and sexuality within the bills.

# List of Entities and Patterns

# Each pattern is a dictionary (JSON-style) with a label and the pattern
# that the label applies to. A string pattern is the exact text that the
# ruler will look for in the data.

patterns = [
              {"label": "GENDER", "pattern": 'gender'},
              {"label": "SEX", "pattern": 'sex'},
              {"label": "SEXUALITY", "pattern": 'sexuality'},
              {"label": "SEXUALITY", "pattern": 'orientation'}
          ]
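To see what patterns like these do before running them on the bills, here is a minimal sketch on a single sentence. It assumes a blank English pipeline (tokenizer only, so no model download is needed) and an invented sentence. Names such as `demo_nlp` and `demo_patterns` are hypothetical and chosen so they do not overwrite the `nlp` and `patterns` used in this tutorial.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")

demo_patterns = [
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
    {"label": "SEXUALITY", "pattern": "orientation"},
]
demo_ruler.add_patterns(demo_patterns)

# Hypothetical sentence, invented for illustration (not from the bills).
demo_doc = demo_nlp("The bill defines gender and sex but does not mention orientation.")
demo_ents = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(demo_ents)
# [('gender', 'GENDER'), ('sex', 'SEX'), ('orientation', 'SEXUALITY')]
```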

If we want to catch more examples of gender, sex, and sexuality terms, we can add more words to our entity ruler.

patterns = [
                {"label": "GENDER", "pattern": 'gender'},
                {"label": "GENDER", "pattern": 'trans'},
                {"label": "GENDER", "pattern": 'nonbinary'},
                {"label": "GENDER", "pattern": 'male'},
                {"label": "GENDER", "pattern": 'female'},
                {"label": "SEX", "pattern": 'sex'},
                {"label": "SEX", "pattern": 'biological'},
                {"label": "SEXUALITY", "pattern": 'sexuality'},
                {"label": "SEXUALITY", "pattern": 'orientation'},
                {"label": "SEXUALITY", "pattern": 'queer'},
                {"label": "IDENTITY", "pattern": 'LGBTQ'},
                {"label": "IDENTITY", "pattern": 'LGBT'},
                {"label": "IDENTITY", "pattern": 'LGBTQIA+'},
                # 'queer' is also listed under SEXUALITY above; a matched span
                # can carry only one entity label, so only one of the two applies
                {"label": "IDENTITY", "pattern": 'queer'}
            ]
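One thing to keep in mind: a string pattern matches the exact text, so 'gender' will not match “Gender” at the start of a sentence. One possible way around this, sketched below, is spaCy's token-based pattern syntax, where each token is described by a dictionary and the `LOWER` attribute compares the lowercased text. The sentence and names like `ci_nlp` are invented for illustration, and the blank pipeline is an assumption made to keep the example small.

```python
import spacy

ci_nlp = spacy.blank("en")
ci_ruler = ci_nlp.add_pipe("entity_ruler")

# Each inner dictionary describes one token; LOWER ignores capitalization.
ci_ruler.add_patterns([
    {"label": "GENDER", "pattern": [{"LOWER": "gender"}]},
    {"label": "SEXUALITY", "pattern": [{"LOWER": "sexual"}, {"LOWER": "orientation"}]},
])

# Hypothetical sentence, invented for illustration.
ci_doc = ci_nlp("Gender identity and Sexual Orientation are defined in this act.")
ci_ents = [(ent.text, ent.label_) for ent in ci_doc.ents]
print(ci_ents)
# [('Gender', 'GENDER'), ('Sexual Orientation', 'SEXUALITY')]
```

The second pattern also shows how a multi-word phrase can be captured as a single entity.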

2. create our ruler & add patterns#

# create the EntityRuler object
ruler = nlp.add_pipe("entity_ruler", after="ner")
# after writing the pattern, we need to add it to our ruler
ruler.add_patterns(patterns)
# check to see that our ruler is now in the pipeline
print(nlp.pipe_names)
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'entity_ruler']
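In a notebook it is easy to run the cell above twice, and spaCy raises an error when a component with the same name is added a second time. A small guard like the one below removes the old ruler first so the cell can be re-run. This is only a sketch on a blank pipeline: `guard_nlp` and `add_fresh_ruler` are hypothetical names, and with the full model you would keep the `after="ner"` argument.

```python
import spacy

guard_nlp = spacy.blank("en")

def add_fresh_ruler(pipeline, patterns):
    """Add an entity ruler, replacing any ruler already in the pipeline."""
    if "entity_ruler" in pipeline.pipe_names:
        pipeline.remove_pipe("entity_ruler")
    ruler = pipeline.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return ruler

# Calling it twice does not raise, and the pipeline holds a single ruler.
add_fresh_ruler(guard_nlp, [{"label": "GENDER", "pattern": "gender"}])
add_fresh_ruler(guard_nlp, [{"label": "GENDER", "pattern": "gender"}])
print(guard_nlp.pipe_names)
# ['entity_ruler']
```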

3. run nlp() on our text#

Remember that we need to run nlp() after adding our pattern to the ruler. This will ensure that our new pipeline (which contains our custom ruler) has a chance to run on our text.
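The short sketch below illustrates why the order matters: a Doc created before the ruler is added carries no custom entities, while the same text processed afterwards does. It assumes a blank English pipeline and an invented sentence, and `rerun_nlp`, `doc_before`, and `doc_after` are hypothetical names.

```python
import spacy

rerun_nlp = spacy.blank("en")
rerun_text = "The term sex is used throughout this section."

# Processed before the ruler exists: no entities are assigned.
doc_before = rerun_nlp(rerun_text)

rerun_ruler = rerun_nlp.add_pipe("entity_ruler")
rerun_ruler.add_patterns([{"label": "SEX", "pattern": "sex"}])

# Processed again after adding the ruler: the custom entity appears.
doc_after = rerun_nlp(rerun_text)

print(len(doc_before.ents), len(doc_after.ents))
# 0 1
```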

# loading up our sample text, which is the first 500,000 characters
# of our cleaned dataset

source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
decoded = text.decode('utf-8')
    
doc = nlp(decoded[:500000])