the token Matcher

The token Matcher is very similar to the PhraseMatcher from two sections ago. The difference is that this Matcher allows for more variation, so we can capture different forms of the same basic pattern. For example, we could get definitions of gender (and sex, and sexuality) that use different words (like “means” or “includes”) or kinds of punctuation (like single or double quotes) in the definition.

To use the token Matcher, we write a pattern of tokens, describing each token by its attributes (such as its lowercase form, whether it is punctuation, or its entity type). We can, for example, leverage the work we did with the EntityRuler in the previous section to help create our token Matcher.
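As a rough illustration of that flexibility, here is a minimal sketch in which a single pattern covers several wordings. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no model download is needed, and the sentences and the name `definition_demo` are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because the pattern relies only on the lowercase form of each token.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One token pattern: the word "gender", followed by any one of three verbs.
pattern = [
    {"LOWER": "gender"},
    {"LOWER": {"IN": ["means", "signifies", "includes"]}},
]
matcher.add("definition_demo", [pattern])

# Hypothetical sentences, written only for this example.
sentences = [
    "Gender means a set of social roles.",
    "In this Act, gender includes gender identity.",
    "Gender is not defined here.",
]

found = []
for sentence in sentences:
    doc = nlp(sentence)
    for match_id, start, end in matcher(doc):
        found.append(doc[start:end].text)

print(found)
```

A PhraseMatcher would need one exact phrase per wording, whereas here the list under `IN` absorbs the variation in the verb.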

First, we will import the matcher to create a matcher object. Then, we will write patterns and save them. After that, we add our new patterns to the matcher. Finally, we will run the matcher on our document. The steps are the following:

  1. write patterns to matcher

  2. configure and run matcher

  3. print the results

Let’s take it one step at a time.

1. write patterns to the matcher

# loading up our libraries and text
import spacy
import requests
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

We want to capture not just “gender,” but “sex” and “sexuality,” as well as other synonyms for these terms. That’s where the custom entities from the last section will become useful.

Below, I re-create the custom entity ruler so that we can leverage these entities in our token Matcher.

ruler = nlp.add_pipe("entity_ruler", after="ner")

patterns = [
                {"label": "GENDER", "pattern": 'gender'},
                {"label": "GENDER", "pattern": 'trans'},
                {"label": "GENDER", "pattern": 'nonbinary'},
                {"label": "GENDER", "pattern": 'male'},
                {"label": "GENDER", "pattern": 'female'},
                {"label": "SEX", "pattern": 'sex'},
                {"label": "SEX", "pattern": 'biological'},
                {"label": "SEXUALITY", "pattern": 'sexuality'},
                {"label": "SEXUALITY", "pattern": 'orientation'},
                {"label": "SEXUALITY", "pattern": 'queer'},
                {"label": "IDENTITY", "pattern": 'LGBTQ'},
                {"label": "IDENTITY", "pattern": 'LGBT'},
                {"label": "IDENTITY", "pattern": 'LGBTQIA+'},
                {"label": "IDENTITY", "pattern": 'queer'}
            ]

ruler.add_patterns(patterns)

Below is the basic format of a Matcher pattern: a list of dictionaries, one per token, where each dictionary holds key-value pairs (much like JSON) that describe the token. We will add much more detail to this format later on, but it’s a good idea to get a sense of how it’s structured now.

The keys are token attributes, which are listed on this page: https://spacy.io/api/matcher

pattern_format = [
    {
        'LOWER': 'gender'
    },
    {
        'IS_PUNCT': True
    },
    {
        'LOWER': 'means'
    }
]
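To see what this basic format matches, the sketch below runs it over two short sentences. It assumes a blank English pipeline rather than `en_core_web_sm`, and the sentences and the name `basic_format` are invented for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline, since LOWER and IS_PUNCT
# do not need a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Same structure as pattern_format: one dictionary per token, in order.
pattern_format = [
    {"LOWER": "gender"},
    {"IS_PUNCT": True},
    {"LOWER": "means"},
]
matcher.add("basic_format", [pattern_format])

# Hypothetical sentence with a closing quote between "gender" and "means".
doc = nlp('The term "gender" means a social category.')
basic_spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in basic_spans:
    print([token.text for token in span])

# Without a punctuation token in the middle, this strict pattern finds nothing.
no_punct_matches = matcher(nlp("Gender means a social category."))
print(no_punct_matches)
```

Because every dictionary must match exactly one token, the second sentence is missed. That strictness is what the operators introduced next are meant to relax.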

We want to capture a specific pattern where gender is being defined. We’d want a pattern that captures text like “gender means”, and that also captures variations in the punctuation and/or terminology of that text. For example, we want to get instances that use either single or double quotes.

pattern = [
      # specifying the entity type, which can be one of our three
      # custom entities
      {"ENT_TYPE": {
          'IN': [
              'GENDER', 'SEX', 'SEXUALITY'
              ]
          }
      },
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {'OP': '?'}, # catches a "wild card" if it appears zero or one time.
      {
          'IS_PUNCT': True, 'OP': '+' #one or more times
      },
      {
          # getting the lowercase word of any of the following terms
          'LOWER': {
              'IN': [
                  'means', 'signifies', 'includes'
              ]
          }
       }
  ]
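The sketch below tries a shortened version of this pattern on two invented sentences, one with double quotes and one with single quotes. It makes several assumptions that differ from the walkthrough: a blank pipeline with only an entity ruler (so there is no `ner` component to place it after), two entity patterns instead of the full list, and two wildcards instead of seven.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline plus an entity ruler stands in for
# en_core_web_sm with the custom ruler added after "ner".
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GENDER", "pattern": "gender"},
    {"label": "SEX", "pattern": "sex"},
])

# Shortened pattern: two optional wildcards instead of seven.
short_pattern = [
    {"ENT_TYPE": {"IN": ["GENDER", "SEX", "SEXUALITY"]}},
    {"OP": "?"},                    # any token, zero or one time
    {"OP": "?"},                    # any token, zero or one time
    {"IS_PUNCT": True, "OP": "+"},  # punctuation, one or more times
    {"LOWER": {"IN": ["means", "signifies", "includes"]}},
]
matcher = Matcher(nlp.vocab)
matcher.add("definition_demo", [short_pattern])

# Hypothetical sentences, one per quoting style.
samples = [
    'The term "gender identity" means a sense of self.',
    "The term 'sex' includes pregnancy.",
]

definitions = []
for sample in samples:
    doc = nlp(sample)
    for _, start, end in matcher(doc):
        definitions.append(doc[start:end].text)

print(definitions)
```

In the first sentence a wildcard absorbs “identity” before the closing quote, and in the second the wildcards match nothing at all, which is why both quoting styles are caught by one pattern.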

2. configure and run matcher

Now we can configure the Matcher. First, we create the matcher, then add our pattern to it, and finally run the matcher on our doc.

# loading up our sample text, which is the first 500,000 characters
# of our cleaned dataset

source = requests.get('https://bit.ly/senate_117_bills_clean')
text = source.content
decoded = text.decode('utf-8')

doc = nlp(decoded[:500000])
# use matcher class to create a matcher object
matcher = Matcher(nlp.vocab)

# add pattern to matcher
matcher.add('definition', [pattern])

# run matcher over doc
matches = matcher(doc)
# how many matches did we get?

len(matches)
33
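Each item the matcher returns is a tuple of a match ID and the start and end token positions of the matched span. The following is a minimal sketch of unpacking those tuples, using a blank pipeline, a two-token pattern, and a made-up sentence rather than the Senate bills text.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tiny stand-in for the real doc and pattern.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("definition", [[{"LOWER": "gender"}, {"LOWER": "means"}]])

doc = nlp("Here, gender means one thing. There, gender means another.")
matches = matcher(doc)

rows = []
for match_id, start, end in matches:
    # The match ID is a hash; the vocab's string store maps it back
    # to the name we passed to matcher.add().
    rule_name = nlp.vocab.strings[match_id]
    # start and end are token positions, so they slice the doc directly.
    rows.append((rule_name, start, end, doc[start:end].text))

print(len(matches))
for row in rows:
    print(row)
```

The same loop should work on the `matches` and `doc` objects built above, since it relies only on the tuple structure of each match.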