scraping our text

Researchers generally share only the analysis they did after they got the data, which makes it hard for beginners to replicate the process. For that reason, I’m showing the data gathering process. (FYI: I first tried using the congress.gov API to get this dataset, which is always the right thing to try first! But it doesn’t offer the full text of the bill, so I turned to scraping. For future reference, you can request an API key here: https://api.congress.gov/)

Here, I’ll go over the code I wrote for downloading bill data from congress.gov and scraping the text of the individual bills.

First, I got a list of the relevant bills using the regular search function on the congress.gov website. I searched for the term “transgender” and then downloaded the results to a spreadsheet. (Side note: the ability to download results from a search is super useful, and most websites won’t offer that functionality.)

image of congress.gov search interface

Then, I loaded up the file (a csv file) into a Python notebook.

import requests # for making http (web) requests
import pandas as pd # for working with tabular (spreadsheet) data
import csv # also for working with tabular data, in csv format

# this grabs the CSV from the previous section. If you get a file
# not found error make sure you go through the previous section to 
# save that csv
bills = pd.read_csv('congress_clean.csv')

df = pd.DataFrame(bills)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               148 non-null    int64  
 1   Legislation Number       148 non-null    object 
 2   URL                      148 non-null    object 
 3   Congress                 148 non-null    object 
 4   Title                    146 non-null    object 
 5   Sponsor                  148 non-null    object 
 6   Party of Sponsor         148 non-null    object 
 7   Date of Introduction     115 non-null    object 
 8   Committees               114 non-null    object 
 9   Latest Action            146 non-null    object 
 10  Latest Action Date       146 non-null    object 
 11  Latest Summary           61 non-null     object 
 12  Amends Bill              33 non-null     object 
 13  Date Offered             31 non-null     object 
 14  Date Submitted           2 non-null      object 
 15  Date Proposed            0 non-null      float64
 16  Amendment Text (Latest)  33 non-null     object 
 17  Amends Amendment         0 non-null      float64
dtypes: float64(2), int64(1), object(15)
memory usage: 20.9+ KB

extracting the bill number

To scrape the bill text, we need just the bill number. To get it, we go through the Legislation Number column and pull the number out of each entry.

df['Legislation Number']
0        H.R. 1112
1           S. 435
2       H.Res. 886
3       S.Res. 464
4       H.Res. 269
          ...     
143    H.Amdt. 195
144    H.Amdt. 193
145    H.Amdt. 256
146    H.Amdt. 257
147    H.Amdt. 255
Name: Legislation Number, Length: 148, dtype: object
# we can use the split() method to split up the single string
# into two strings, by the empty space in between them

bill = "H.R. 1112"
bill.split(' ')
['H.R.', '1112']
# we can write a for loop to append the number to a list
# involves checking if the item is a number, using "isnumeric"

numbers = []
for item in bill.split(' '):
    if item.isnumeric():
        numbers.append(item)
numbers
['1112']
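
One detail worth knowing about split(): passing ' ' cuts at every single space, while calling it with no argument treats any run of whitespace as one separator. The strings below are made up for illustration (the second has a doubled space); this is a small sketch of the difference, not something the dataset is known to contain.

```python
# Hypothetical inputs: the second string has two spaces between the parts.
clean = "H.R. 1112"
messy = "H.R.  1112"

# split(' ') cuts at every single space, so a doubled space leaves an empty string.
print(clean.split(' '))   # ['H.R.', '1112']
print(messy.split(' '))   # ['H.R.', '', '1112']

# split() with no argument treats any run of whitespace as one separator.
print(messy.split())      # ['H.R.', '1112']

# Either way, isnumeric() still picks out only the number,
# because an empty string is not numeric.
numbers_only = [part for part in messy.split(' ') if part.isnumeric()]
print(numbers_only)       # ['1112']
```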

To extract the bill numbers, we will write a loop that:

  • goes through each row of df['Legislation Number']

  • turns that row into a string, using the str() function

  • splits that row by the empty space using split()

  • runs an inner loop that checks whether each item is a number, using isnumeric()

  • appends the numeric item to a new list

## go through each row in the Legislation Number column of our spreadsheet
## extract the number and put it into a separate list
numbers = []
for row in df['Legislation Number']:
    # need to change data type to string in order to use split()
    row = str(row)
    splitted = row.split(' ')
    for item in splitted:
        if item.isnumeric():
            numbers.append(item)
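
The loop above collects the number from every row, whatever kind of legislation it is (S., H.Res., H.Amdt., and so on). The scraping URL in the next section is built for House bills (hr), so it may be worth keeping only the H.R. rows. Here is a minimal sketch of one way to do that with a regular expression. The function name extract_hr_numbers is my own, and the sample list just reuses values shown in the column output above.

```python
import re

# Matches strings like "H.R. 1112" and captures the digits.
# "H.Res. 886" or "S. 435" will not match.
HR_PATTERN = re.compile(r"H\.R\.\s*(\d+)")

def extract_hr_numbers(legislation_numbers):
    """Return the bill numbers (as strings) for H.R. rows only."""
    hr_numbers = []
    for row in legislation_numbers:
        # str() guards against non-string values such as missing entries
        match = HR_PATTERN.fullmatch(str(row).strip())
        if match:
            hr_numbers.append(match.group(1))
    return hr_numbers

sample_rows = ["H.R. 1112", "S. 435", "H.Res. 886", "S.Res. 464", "H.Amdt. 195"]
print(extract_hr_numbers(sample_rows))  # ['1112']
```

With the DataFrame from earlier, the call would be extract_hr_numbers(df['Legislation Number']).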

scraping the bill text

Using that list of numbers as input, we will write a function that scrapes the bill text.

# here we are introducing "f-strings", which are a way of writing "formatted strings"
# in python that lets us insert variables, like a bill number in this case

def scrape_bill_text(numbers):
    bills_text = []
    for item in numbers:
        # f-string is used to add the specific bill number to the URL
        url = (f'https://www.congress.gov/117/bills/hr{item}/BILLS-117hr{item}ih.htm')
        # requests library to scrape the URL, which is formatted for each bill number
        response = requests.get(url)
        content = response.content
        bills_text.append(content)
    return bills_text
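
The function above sends its requests back to back and keeps whatever comes back, including error pages. Below is a hedged sketch of a slightly more cautious variant: it pauses between requests and skips any bill whose download raises an error. The names build_bill_url, fetch_with_requests, and scrape_bills_politely, the one-second delay, and the 30-second timeout are all my own choices, not part of the original code. The URL pattern is the one used above; whether it holds for other Congresses is an assumption I have not checked.

```python
import time

def build_bill_url(number, congress=117):
    """Build the introduced-in-House URL, following the pattern used above.

    Assumption: other values of `congress` follow the same pattern.
    """
    return (f"https://www.congress.gov/{congress}/bills/"
            f"hr{number}/BILLS-{congress}hr{number}ih.htm")

def fetch_with_requests(url):
    """Download one page with requests; raises on HTTP errors such as 404."""
    import requests  # imported here so the other helpers work without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content

def scrape_bills_politely(numbers, fetch=fetch_with_requests, delay=1.0):
    """Return {bill number: page bytes}, skipping bills that fail to download."""
    results = {}
    for number in numbers:
        url = build_bill_url(number)
        try:
            results[number] = fetch(url)
        except Exception as error:  # assumption: report and move on, don't stop
            print(f"skipping {number}: {error}")
        time.sleep(delay)  # pause so we don't overload the website
    return results
```

Note that this version returns a dictionary keyed by bill number rather than a list, so each page stays attached to the bill it came from.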

Next, we call the function and save the results to sample:

# so we don't overload the website, we will scrape just a sample of
# the first 10 bills. This will be more than enough data for us to
# practice cleaning.

sample = scrape_bill_text(numbers[:10])
len(sample)
10
# let's check out our first item (the first bill text) in the list 

sample[0]
b"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n  1st Session\n                                H. R. 1112\n\n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n\n_______________________________________________________________________\n\n\n                    IN THE HOUSE OF REPRESENTATIVES\n\n                           February 18, 2021\n\n    Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n  Buchanan) introduced the following bill; which was referred to the \n                      Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n                                 A BILL\n\n\n \n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n    Be it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled,\n\nSECTION 1. SHORT TITLE.\n\n    This Act may be cited as the ``Protect Democracy in Burma Act of \n2021''.\n\nSEC. 2. FINDINGS.\n\n    Congress finds the following:\n            (1) On March 14, 2005, the House of Representatives agreed \n        to H. Res. 
135, which established the House Democracy \n        Assistance Commission (later changed to the House Democracy \n        Partnership, hereafter referred to as ``HDP'') to work directly \n        with parliaments around the world to support the development of \n        effective, independent, and responsive legislative \n        institutions.\n            (2) HDP approved a legislative strengthening partnership \n        with Burma in 2016 and organized the first congressional \n        delegation to meet with the new civilian-led government, led by \n        State Counselor Aung San Suu Kyi, and civil society leaders in \n        May 2016.\n            (3) On February 2, 2021, the U.S. Department of State \n        assessed that Daw Aung San Suu Kyi, the leader of Burma's \n        ruling party, and President Win Myint, the duly elected head of \n        government, were deposed in a military coup on February 1, \n        2021.\n            (4) As part of the military coup, the Burmese military \n        declared martial law, suspended the civilian-led government, \n        and detained newly elected Members of Parliament in the \n        capitol, Naypyidaw, thereby usurping the role of the \n        democratically elected government and parliament.\n\nSEC. 3. 
SENSE OF CONGRESS.\n\n    It is the sense of Congress that--\n            (1) due to the Burmese military's seizure of government \n        through the detention of State Counsellor Aung San Suu Kyi, \n        President Win Myint, and other government leaders, Burma is not \n        represented by a democratically elected government;\n            (2) the inability of newly elected Members of Parliament to \n        begin their official mandate due to the Burmese military's \n        actions directly threatens the democratic trajectory of Burma's \n        Parliament, and thereby the country;\n            (3) the will and determination of those duly-elected \n        Members of Parliament who are taking it upon themselves to \n        continue serving as representatives of the people through \n        alternative methods of communicating and convening should be \n        lauded; and\n            (4) by preventing the Parliament from completing its work, \n        the Burmese military has rendered impossible and effectively \n        nullified the international collaborative relationships that \n        have supported and strengthened the institution, including the \n        Burmese parliament's partnership with HDP.\n\nSEC. 4. 
STATEMENT OF POLICY.\n\n    It is the policy of the United States to--\n            (1) engage with the Association of Southeast Asian Nations \n        (ASEAN) and ASEAN member states to--\n                    (A) condemn the military coup in Burma;\n                    (B) urge the unconditional release of detained \n                democratically elected leaders and civil society \n                members; and\n                    (C) support a return to Burma's democratic \n                transition; and\n            (2) instruct, as appropriate, representatives of the United \n        States Government to use the voice, vote, and influence of the \n        United States at the United Nations to hold accountable those \n        responsible for the military coup in Burma.\n\nSEC. 5. REPORT.\n\n    Not later than 90 days after the date of the enactment of this Act, \nthe Secretary of State shall submit to the Committee on Foreign Affairs \nand the Committee on Appropriations of the House of Representatives and \nthe Committee on Foreign Relations and the Committee on Appropriations \nof the Senate a report on the military coup in Burma, including a \ndescription of efforts to implement the policy specified in section 4.\n                                 &lt;all&gt;\n</pre></body></html>\n"

decoding text from bytes to string

The scraped pages come back as bytes (notice the b at the start of the output above). We will “decode” those bytes into strings so that we can eventually save them to a text file.
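
To see what decoding does on a small scale, here is a sketch with a made-up bytes value (not taken from the bills). It also shows the errors argument of decode(), which controls what happens when some bytes are not valid UTF-8.

```python
# A made-up bytes value; \xc3\xa9 is the UTF-8 encoding of "é".
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)  # bytes str
print(len(raw), len(text))                      # 5 4 (five bytes, four characters)

# Bytes that are not valid UTF-8 raise UnicodeDecodeError by default.
# errors="replace" swaps each invalid byte for the replacement character U+FFFD.
broken = b"caf\xe9"
fixed = broken.decode("utf-8", errors="replace")
print(ascii(fixed))                             # 'caf\ufffd'
```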

# Use type() to see what kind of data we are working with.
# list type 

type(sample)
list
# within the list, bytes type

type(sample[0])
bytes
# turn bytes into string using decode()
decoded = []
for item in sample:
    decoded.append(item.decode('utf-8'))
type(decoded[0])
str
decoded[0]
"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n  1st Session\n                                H. R. 1112\n\n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n\n_______________________________________________________________________\n\n\n                    IN THE HOUSE OF REPRESENTATIVES\n\n                           February 18, 2021\n\n    Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n  Buchanan) introduced the following bill; which was referred to the \n                      Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n                                 A BILL\n\n\n \n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n    Be it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled,\n\nSECTION 1. SHORT TITLE.\n\n    This Act may be cited as the ``Protect Democracy in Burma Act of \n2021''.\n\nSEC. 2. FINDINGS.\n\n    Congress finds the following:\n            (1) On March 14, 2005, the House of Representatives agreed \n        to H. Res. 
135, which established the House Democracy \n        Assistance Commission (later changed to the House Democracy \n        Partnership, hereafter referred to as ``HDP'') to work directly \n        with parliaments around the world to support the development of \n        effective, independent, and responsive legislative \n        institutions.\n            (2) HDP approved a legislative strengthening partnership \n        with Burma in 2016 and organized the first congressional \n        delegation to meet with the new civilian-led government, led by \n        State Counselor Aung San Suu Kyi, and civil society leaders in \n        May 2016.\n            (3) On February 2, 2021, the U.S. Department of State \n        assessed that Daw Aung San Suu Kyi, the leader of Burma's \n        ruling party, and President Win Myint, the duly elected head of \n        government, were deposed in a military coup on February 1, \n        2021.\n            (4) As part of the military coup, the Burmese military \n        declared martial law, suspended the civilian-led government, \n        and detained newly elected Members of Parliament in the \n        capitol, Naypyidaw, thereby usurping the role of the \n        democratically elected government and parliament.\n\nSEC. 3. 
SENSE OF CONGRESS.\n\n    It is the sense of Congress that--\n            (1) due to the Burmese military's seizure of government \n        through the detention of State Counsellor Aung San Suu Kyi, \n        President Win Myint, and other government leaders, Burma is not \n        represented by a democratically elected government;\n            (2) the inability of newly elected Members of Parliament to \n        begin their official mandate due to the Burmese military's \n        actions directly threatens the democratic trajectory of Burma's \n        Parliament, and thereby the country;\n            (3) the will and determination of those duly-elected \n        Members of Parliament who are taking it upon themselves to \n        continue serving as representatives of the people through \n        alternative methods of communicating and convening should be \n        lauded; and\n            (4) by preventing the Parliament from completing its work, \n        the Burmese military has rendered impossible and effectively \n        nullified the international collaborative relationships that \n        have supported and strengthened the institution, including the \n        Burmese parliament's partnership with HDP.\n\nSEC. 4. 
STATEMENT OF POLICY.\n\n    It is the policy of the United States to--\n            (1) engage with the Association of Southeast Asian Nations \n        (ASEAN) and ASEAN member states to--\n                    (A) condemn the military coup in Burma;\n                    (B) urge the unconditional release of detained \n                democratically elected leaders and civil society \n                members; and\n                    (C) support a return to Burma's democratic \n                transition; and\n            (2) instruct, as appropriate, representatives of the United \n        States Government to use the voice, vote, and influence of the \n        United States at the United Nations to hold accountable those \n        responsible for the military coup in Burma.\n\nSEC. 5. REPORT.\n\n    Not later than 90 days after the date of the enactment of this Act, \nthe Secretary of State shall submit to the Committee on Foreign Affairs \nand the Committee on Appropriations of the House of Representatives and \nthe Committee on Foreign Relations and the Committee on Appropriations \nof the Senate a report on the military coup in Burma, including a \ndescription of efforts to implement the policy specified in section 4.\n                                 &lt;all&gt;\n</pre></body></html>\n"
# save the decoded text to a file; encoding='utf-8' matches the decoding above
with open('sample.txt', 'w', encoding='utf-8') as f:
    for item in decoded:
        f.write(item)
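
The loop above writes every bill into one file, with nothing marking where one bill ends and the next begins. If you would rather keep the bills apart, one option is to write each to its own file. This is a sketch under my own assumptions: the folder name bills_text and the bill_<number>.txt naming scheme are hypothetical, and the stand-in data replaces the real numbers and decoded lists so the example runs by itself.

```python
from pathlib import Path

def save_bills(numbers, texts, folder="bills_text"):
    """Write each text to folder/bill_<number>.txt and return the paths."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in zip(numbers, texts):
        path = out_dir / f"bill_{number}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths

# Stand-in data; in the notebook this would be numbers[:10] and decoded.
demo_numbers = ["1112", "435"]
demo_texts = ["<html>first bill</html>", "<html>second bill</html>"]
saved = save_bills(demo_numbers, demo_texts)
print([p.name for p in saved])  # ['bill_1112.txt', 'bill_435.txt']
```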