scraping our text#
Researchers generally only share the analysis they have done after they got the data, which makes it hard for beginners to replicate the process. For that reason, I’m showing the data gathering process. (FYI - I tried using the congress.gov API to get this dataset, which is always the right thing to do! But it doesn’t offer the full text of the bill, so that’s why I turned to scraping. For future reference, you can request an API key here: https://api.congress.gov/)
Here, I’ll go over the code I wrote for downloading bill data from congress.gov
and scraping the text of the individual bills.
First, I got a list of the relevant bills using the regular search function on the congress.gov website. I searched for the term “transgender” and then downloaded the results to a spreadsheet. (Side note: the ability to download search results is super useful, and most websites won’t offer that functionality.)
Then, I loaded up the file (a csv file) into a Python notebook.
import requests # for making http (web) requests
import pandas as pd # for working with tabular (spreadsheet) data
import csv # also for working with tabular data, in csv format
# this grabs the CSV from the previous section. If you get a file
# not found error make sure you go through the previous section to
# save that csv
bills = pd.read_csv('congress_clean.csv')
df = pd.DataFrame(bills)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 148 non-null int64
1 Legislation Number 148 non-null object
2 URL 148 non-null object
3 Congress 148 non-null object
4 Title 146 non-null object
5 Sponsor 148 non-null object
6 Party of Sponsor 148 non-null object
7 Date of Introduction 115 non-null object
8 Committees 114 non-null object
9 Latest Action 146 non-null object
10 Latest Action Date 146 non-null object
11 Latest Summary 61 non-null object
12 Amends Bill 33 non-null object
13 Date Offered 31 non-null object
14 Date Submitted 2 non-null object
15 Date Proposed 0 non-null float64
16 Amendment Text (Latest) 33 non-null object
17 Amends Amendment 0 non-null float64
dtypes: float64(2), int64(1), object(15)
memory usage: 20.9+ KB
extracting the bill number#
To scrape the bill text, we need just the bill number. To get it, we go through the Legislation Number column and extract the numeric part of each entry.
df['Legislation Number']
0 H.R. 1112
1 S. 435
2 H.Res. 886
3 S.Res. 464
4 H.Res. 269
...
143 H.Amdt. 195
144 H.Amdt. 193
145 H.Amdt. 256
146 H.Amdt. 257
147 H.Amdt. 255
Name: Legislation Number, Length: 148, dtype: object
# we can use the split() method to split up the single string
# into two strings, by the empty space in between them
bill = "H.R. 1112"
bill.split(' ')
['H.R.', '1112']
# we can write a for loop to append the number to a list
# involves checking if the item is a number, using "isnumeric"
numbers = []
for item in bill.split(' '):
    if item.isnumeric():
        numbers.append(item)
item
'1112'
To extract the bill numbers, we will write a loop that:
- goes through each row of df['Legislation Number']
- turns that row into a string, using the str() function
- splits that row by the empty space, using split()
- runs another loop to check whether each item isnumeric()
- appends the numeric item to a new list
## go through each row in numbers column of our spreadsheet
## extract the number and put into a separate list
numbers = []
for row in df['Legislation Number']:
    # need to change data type to string in order to use split()
    row = str(row)
    splitted = row.split(' ')
    for item in splitted:
        if item.isnumeric():
            numbers.append(item)
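One thing the loop above throws away is the prefix (H.R., S., H.Res., H.Amdt.), even though the URL used for scraping later on is built specifically for House bills. Here is a minimal sketch of an alternative that keeps both parts, using a regular expression instead of split(). The function name `parse_legislation_number` and the pattern are my own assumptions for illustration, not part of the original notebook, and the sample list just reuses a few values from the column shown earlier.

```python
import re

# hypothetical pattern: a prefix with no spaces, some whitespace, then digits
PATTERN = re.compile(r'^(?P<kind>\S+)\s+(?P<number>\d+)$')

def parse_legislation_number(value):
    """Return (prefix, number) for values like 'H.R. 1112', or None if no match."""
    match = PATTERN.match(str(value).strip())
    if match is None:
        return None
    return match.group('kind'), match.group('number')

# a few values copied from the Legislation Number column
examples = ['H.R. 1112', 'S. 435', 'H.Res. 886', 'H.Amdt. 195']
parsed = [parse_legislation_number(value) for value in examples]

# keep only the House bills, since those are what the scraping URL expects
house_numbers = [number for kind, number in parsed if kind == 'H.R.']
print(house_numbers)  # ['1112']
```

With the real data you could pass df['Legislation Number'] in place of `examples`. Whether you want to filter this way depends on which kinds of legislation you plan to scrape.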
scraping the bill text#
Using that list of numbers as input, we will write a function that scrapes the bill text.
# here we are introducing "f-strings", which are a way of writing "formatted strings"
# in Python that lets us insert variables, like a bill number in this case
def scrape_bill_text(numbers):
    bills_text = []
    for item in numbers:
        # the f-string adds the specific bill number to the URL
        # note: this URL pattern is for House bills (hr) in the 117th Congress,
        # in their "Introduced in House" (ih) version
        url = f'https://www.congress.gov/117/bills/hr{item}/BILLS-117hr{item}ih.htm'
        # the requests library fetches the URL built for each bill number
        response = requests.get(url)
        content = response.content
        bills_text.append(content)
    return bills_text
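The function above assumes every request succeeds and sends them back to back. As a rough sketch of a gentler version, the code below pauses between requests and records None when the server does not answer with status 200. The names `build_bill_url` and `scrape_politely`, the one-second pause, and the 30-second timeout are all assumptions of mine for illustration. The `get` argument is meant to be `requests.get`, and passing it in makes the function easy to try out without touching the website.

```python
import time

def build_bill_url(number, congress=117):
    """Build the URL for the 'Introduced in House' text of a House bill."""
    return (f'https://www.congress.gov/{congress}/bills/hr{number}/'
            f'BILLS-{congress}hr{number}ih.htm')

def scrape_politely(numbers, get, pause=1.0):
    """Fetch each bill with `get`, waiting `pause` seconds between requests.

    Returns a dict mapping bill number -> bytes, or None if the request
    did not come back with status code 200.
    """
    results = {}
    for number in numbers:
        response = get(build_bill_url(number), timeout=30)
        if response.status_code == 200:
            results[number] = response.content
        else:
            results[number] = None
        time.sleep(pause)
    return results

# usage, assuming `requests` is imported and `numbers` is the list from above:
# sample_by_number = scrape_politely(numbers[:10], get=requests.get)
```

Returning a dictionary keyed by bill number, rather than a plain list, makes it easier to see which bills came back empty.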
Calling the function and saving the results to sample
# so we don't overload the website, we will scrape just a sample of
# the first 10 bills. This will be more than enough data for us to
# practice cleaning.
sample = scrape_bill_text(numbers[:10])
len(sample)
10
# let's check out our first item (the first bill text) in the list
sample[0]
b"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n<DOC>\n\n\n\n\n\n\n117th CONGRESS\n 1st Session\n H. R. 1112\n\n To require a report on the military coup in Burma, and for other \n purposes.\n\n\n_______________________________________________________________________\n\n\n IN THE HOUSE OF REPRESENTATIVES\n\n February 18, 2021\n\n Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n Buchanan) introduced the following bill; which was referred to the \n Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n A BILL\n\n\n \n To require a report on the military coup in Burma, and for other \n purposes.\n\n Be it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled,\n\nSECTION 1. SHORT TITLE.\n\n This Act may be cited as the ``Protect Democracy in Burma Act of \n2021''.\n\nSEC. 2. FINDINGS.\n\n Congress finds the following:\n (1) On March 14, 2005, the House of Representatives agreed \n to H. Res. 135, which established the House Democracy \n Assistance Commission (later changed to the House Democracy \n Partnership, hereafter referred to as ``HDP'') to work directly \n with parliaments around the world to support the development of \n effective, independent, and responsive legislative \n institutions.\n (2) HDP approved a legislative strengthening partnership \n with Burma in 2016 and organized the first congressional \n delegation to meet with the new civilian-led government, led by \n State Counselor Aung San Suu Kyi, and civil society leaders in \n May 2016.\n (3) On February 2, 2021, the U.S. 
Department of State \n assessed that Daw Aung San Suu Kyi, the leader of Burma's \n ruling party, and President Win Myint, the duly elected head of \n government, were deposed in a military coup on February 1, \n 2021.\n (4) As part of the military coup, the Burmese military \n declared martial law, suspended the civilian-led government, \n and detained newly elected Members of Parliament in the \n capitol, Naypyidaw, thereby usurping the role of the \n democratically elected government and parliament.\n\nSEC. 3. SENSE OF CONGRESS.\n\n It is the sense of Congress that--\n (1) due to the Burmese military's seizure of government \n through the detention of State Counsellor Aung San Suu Kyi, \n President Win Myint, and other government leaders, Burma is not \n represented by a democratically elected government;\n (2) the inability of newly elected Members of Parliament to \n begin their official mandate due to the Burmese military's \n actions directly threatens the democratic trajectory of Burma's \n Parliament, and thereby the country;\n (3) the will and determination of those duly-elected \n Members of Parliament who are taking it upon themselves to \n continue serving as representatives of the people through \n alternative methods of communicating and convening should be \n lauded; and\n (4) by preventing the Parliament from completing its work, \n the Burmese military has rendered impossible and effectively \n nullified the international collaborative relationships that \n have supported and strengthened the institution, including the \n Burmese parliament's partnership with HDP.\n\nSEC. 4. 
STATEMENT OF POLICY.\n\n It is the policy of the United States to--\n (1) engage with the Association of Southeast Asian Nations \n (ASEAN) and ASEAN member states to--\n (A) condemn the military coup in Burma;\n (B) urge the unconditional release of detained \n democratically elected leaders and civil society \n members; and\n (C) support a return to Burma's democratic \n transition; and\n (2) instruct, as appropriate, representatives of the United \n States Government to use the voice, vote, and influence of the \n United States at the United Nations to hold accountable those \n responsible for the military coup in Burma.\n\nSEC. 5. REPORT.\n\n Not later than 90 days after the date of the enactment of this Act, \nthe Secretary of State shall submit to the Committee on Foreign Affairs \nand the Committee on Appropriations of the House of Representatives and \nthe Committee on Foreign Relations and the Committee on Appropriations \nof the Senate a report on the military coup in Burma, including a \ndescription of efforts to implement the policy specified in section 4.\n <all>\n</pre></body></html>\n"
decoding text from bytes to string#
The scraped content comes back as bytes (notice the b at the start of the output above). We will “decode” the bytes into strings so that we can eventually save them as text.
# Use type() to see what kind of data we are working with.
# list type
type(sample)
list
# within the list, bytes type
type(sample[0])
bytes
# turn bytes into string using decode()
decoded = []
for item in sample:
    decoded.append(item.decode('utf-8'))
type(decoded[0])
str
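To make the bytes-to-string step a bit more concrete, here is a small sketch on made-up inputs. The first value is a shortened stand-in for a scraped page, and the second is a deliberately broken byte sequence that I invented to show what happens when the bytes are not valid UTF-8. The `errors='replace'` option is a standard argument to decode(), but whether you want it depends on how much you care about losing a character here and there.

```python
# a shortened stand-in for one scraped page (bytes, note the leading b)
raw = b"<html><body><pre>\n[Congressional Bills 117th Congress]\n</pre></body></html>\n"
text = raw.decode('utf-8')
print(type(raw).__name__, '->', type(text).__name__)  # bytes -> str

# a made-up example of bytes that are NOT valid UTF-8
damaged = b'caf\xe9'
try:
    damaged.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    # by default, decode() refuses to guess and raises an error
    strict_failed = True

# errors='replace' swaps the undecodable byte for the U+FFFD placeholder
repaired = damaged.decode('utf-8', errors='replace')
```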
decoded[0]
"<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n<DOC>\n\n\n\n\n\n\n117th CONGRESS\n 1st Session\n H. R. 1112\n\n To require a report on the military coup in Burma, and for other \n purposes.\n\n\n_______________________________________________________________________\n\n\n IN THE HOUSE OF REPRESENTATIVES\n\n February 18, 2021\n\n Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n Buchanan) introduced the following bill; which was referred to the \n Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n A BILL\n\n\n \n To require a report on the military coup in Burma, and for other \n purposes.\n\n Be it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled,\n\nSECTION 1. SHORT TITLE.\n\n This Act may be cited as the ``Protect Democracy in Burma Act of \n2021''.\n\nSEC. 2. FINDINGS.\n\n Congress finds the following:\n (1) On March 14, 2005, the House of Representatives agreed \n to H. Res. 135, which established the House Democracy \n Assistance Commission (later changed to the House Democracy \n Partnership, hereafter referred to as ``HDP'') to work directly \n with parliaments around the world to support the development of \n effective, independent, and responsive legislative \n institutions.\n (2) HDP approved a legislative strengthening partnership \n with Burma in 2016 and organized the first congressional \n delegation to meet with the new civilian-led government, led by \n State Counselor Aung San Suu Kyi, and civil society leaders in \n May 2016.\n (3) On February 2, 2021, the U.S. 
Department of State \n assessed that Daw Aung San Suu Kyi, the leader of Burma's \n ruling party, and President Win Myint, the duly elected head of \n government, were deposed in a military coup on February 1, \n 2021.\n (4) As part of the military coup, the Burmese military \n declared martial law, suspended the civilian-led government, \n and detained newly elected Members of Parliament in the \n capitol, Naypyidaw, thereby usurping the role of the \n democratically elected government and parliament.\n\nSEC. 3. SENSE OF CONGRESS.\n\n It is the sense of Congress that--\n (1) due to the Burmese military's seizure of government \n through the detention of State Counsellor Aung San Suu Kyi, \n President Win Myint, and other government leaders, Burma is not \n represented by a democratically elected government;\n (2) the inability of newly elected Members of Parliament to \n begin their official mandate due to the Burmese military's \n actions directly threatens the democratic trajectory of Burma's \n Parliament, and thereby the country;\n (3) the will and determination of those duly-elected \n Members of Parliament who are taking it upon themselves to \n continue serving as representatives of the people through \n alternative methods of communicating and convening should be \n lauded; and\n (4) by preventing the Parliament from completing its work, \n the Burmese military has rendered impossible and effectively \n nullified the international collaborative relationships that \n have supported and strengthened the institution, including the \n Burmese parliament's partnership with HDP.\n\nSEC. 4. 
STATEMENT OF POLICY.\n\n It is the policy of the United States to--\n (1) engage with the Association of Southeast Asian Nations \n (ASEAN) and ASEAN member states to--\n (A) condemn the military coup in Burma;\n (B) urge the unconditional release of detained \n democratically elected leaders and civil society \n members; and\n (C) support a return to Burma's democratic \n transition; and\n (2) instruct, as appropriate, representatives of the United \n States Government to use the voice, vote, and influence of the \n United States at the United Nations to hold accountable those \n responsible for the military coup in Burma.\n\nSEC. 5. REPORT.\n\n Not later than 90 days after the date of the enactment of this Act, \nthe Secretary of State shall submit to the Committee on Foreign Affairs \nand the Committee on Appropriations of the House of Representatives and \nthe Committee on Foreign Relations and the Committee on Appropriations \nof the Senate a report on the military coup in Burma, including a \ndescription of efforts to implement the policy specified in section 4.\n <all>\n</pre></body></html>\n"
# write all of the decoded bills into a single text file
with open('sample.txt', 'w', encoding='utf-8') as f:
    for item in decoded:
        f.write(item)
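Writing everything into one file means the bills run together. If you would rather keep them apart, one option is to save each bill to its own file. This is only a sketch: the function name `save_bills`, the folder argument, and the `hr{number}.txt` naming scheme are assumptions I made for illustration.

```python
from pathlib import Path

def save_bills(texts, numbers, folder):
    """Write each bill text to its own file, named after its bill number."""
    folder = Path(folder)
    # create the folder if it does not exist yet
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in zip(numbers, texts):
        path = folder / f'hr{number}.txt'
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths

# usage, assuming `decoded` and `numbers` from above:
# save_bills(decoded, numbers[:10], 'bills')
```

Because zip() pairs items by position, this only works if the texts and the numbers are in the same order, which they are when both come from the same slice of the list.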