scraping with bs4
Now that we have a sense of the bs4 syntax and a list of the HTML elements that we want to scrape from the https://translegislation.com/ website, we can write some code that extracts those elements and saves them in a structured format, like a spreadsheet.
Before doing all that, we will import the libraries we need and create our soup object (which holds our website content).
import requests
from bs4 import BeautifulSoup
import lxml
site = requests.get('https://translegislation.com/bills/2023/passed')
html_code = site.content
soup = BeautifulSoup(html_code, 'lxml')
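A request can fail quietly (an error page still comes back as HTML), so it may be worth checking the response before parsing it. The sketch below is one hedged way to do that. The helper names `fetch_html` and `make_soup` are hypothetical, the 10-second timeout is an arbitrary choice, and the demo uses Python's built-in 'html.parser' on a hand-written page so that it runs offline and without lxml.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and raise an error if the server reports a failure."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.content


def make_soup(html_code, parser="html.parser"):
    """Parse raw HTML into a BeautifulSoup object."""
    return BeautifulSoup(html_code, parser)


# Offline demonstration on a tiny hand-written page (not the real website).
demo_soup = make_soup(b"<html><body><h3>AL HB261</h3></body></html>")
print(demo_soup.h3.text)
```

With a network connection, `make_soup(fetch_html('https://translegislation.com/bills/2023/passed'), 'lxml')` would mirror the cell above.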
Now that we have loaded up the necessary libraries and soup object, we can extract the data we want. Like all good programmers, we will break our task up into a number of steps:
isolate the bill_cards data from the rest of the webpage
pick out the information we want from the bill cards
process the information from the bill cards into the format we want
save that information to a csv file
Each of these steps itself contains smaller steps, which we will figure out as we go along. Let’s begin with the first step.
step 1: isolate the bill cards data from the rest of the page
First, create a new object called bill_cards, which lets us narrow down the parts of the website that we want to scrape.
# to get the element and class for the cards, use the inspector
bill_cards = soup.find_all('div', class_ ='css-4rck61')
# checking our list by printing just the first three items
bill_cards[:3]
[<div class="css-4rck61"><style data-emotion="css 1dvz6tu">.css-1dvz6tu{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:baseline;-webkit-box-align:baseline;-ms-flex-align:baseline;align-items:baseline;-webkit-box-pack:justify;-webkit-justify-content:space-between;justify-content:space-between;-webkit-box-flex-wrap:wrap;-webkit-flex-wrap:wrap;-ms-flex-wrap:wrap;flex-wrap:wrap;gap:var(--chakra-space-2);margin-left:inherit;margin-right:inherit;margin-bottom:var(--chakra-space-2);}</style><div class="css-1dvz6tu"><style data-emotion="css wd7aku">.css-wd7aku{font-weight:var(--chakra-fontWeights-semibold);letter-spacing:var(--chakra-letterSpacings-wide);margin-bottom:var(--chakra-space-2);}</style><div class="css-wd7aku"><style data-emotion="css 1vygpf9">.css-1vygpf9{font-family:var(--chakra-fonts-heading);font-weight:var(--chakra-fontWeights-bold);font-size:var(--chakra-fontSizes-2xl);line-height:1.33;color:#181818;text-align:left;margin-bottom:var(--chakra-space-1);}@media screen and (min-width: 48em){.css-1vygpf9{font-size:var(--chakra-fontSizes-3xl);line-height:1.2;}}</style><h3 class="chakra-heading css-1vygpf9"><style data-emotion="css f4h6uy">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/HB261">AL<!-- --> <!-- -->HB261</a></h3><style data-emotion="css 
bu60l4">.css-bu60l4{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#b55202;color:var(--chakra-colors-white);}.css-bu60l4:focus,.css-bu60l4[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class="css-bu60l4">SPORTS</span></div><style data-emotion="css bcf15j">.css-bcf15j{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.25rem;min-width:1.25rem;font-size:var(--chakra-fontSizes-xs);-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);border-radius:0px;background:var(--chakra-colors-red-100);color:var(--chakra-colors-gray-800);}.css-bcf15j:focus,.css-bcf15j[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class="css-bcf15j">PASSED</span></div><style data-emotion="css 
bp9bt3">.css-bp9bt3{font-family:var(--chakra-fonts-heading);font-weight:var(--chakra-fontWeights-semibold);font-size:var(--chakra-fontSizes-xl);line-height:1.2;margin-left:inherit;margin-right:inherit;overflow:hidden;text-overflow:ellipsis;display:-webkit-box;-webkit-box-orient:vertical;-webkit-line-clamp:var(--chakra-line-clamp);--chakra-line-clamp:3;margin-bottom:var(--chakra-space-2);}</style><h2 class="chakra-heading css-bp9bt3">Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.</h2><div class="css-bxak8j"></div><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/HB261"><style data-emotion="css 1952nyr">.css-1952nyr{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;-webkit-appearance:none;-moz-appearance:none;-ms-appearance:none;appearance:none;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;-webkit-box-pack:center;-ms-flex-pack:center;-webkit-justify-content:center;justify-content:center;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;position:relative;white-space:nowrap;vertical-align:middle;outline:2px solid 
transparent;outline-offset:2px;width:auto;line-height:1.2;border-radius:0px;font-weight:var(--chakra-fontWeights-semibold);transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-normal);height:var(--chakra-sizes-10);min-width:var(--chakra-sizes-10);font-size:var(--chakra-fontSizes-md);-webkit-padding-start:var(--chakra-space-4);padding-inline-start:var(--chakra-space-4);-webkit-padding-end:var(--chakra-space-4);padding-inline-end:var(--chakra-space-4);background:var(--chakra-colors-gray-100);border:1px solid #181818;background-color:var(--chakra-colors-white);}.css-1952nyr:focus,.css-1952nyr[data-focus]{box-shadow:var(--chakra-shadows-outline);}.css-1952nyr[disabled],.css-1952nyr[aria-disabled=true],.css-1952nyr[data-disabled]{opacity:0.4;cursor:not-allowed;box-shadow:var(--chakra-shadows-none);}.css-1952nyr:hover,.css-1952nyr[data-hover]{background:var(--chakra-colors-gray-200);}.css-1952nyr:hover[disabled],.css-1952nyr[data-hover][disabled],.css-1952nyr:hover[aria-disabled=true],.css-1952nyr[data-hover][aria-disabled=true],.css-1952nyr:hover[data-disabled],.css-1952nyr[data-hover][data-disabled]{background:var(--chakra-colors-gray-100);}.css-1952nyr:active,.css-1952nyr[data-active]{background:var(--chakra-colors-gray-300);}</style><button class="chakra-button css-1952nyr" type="button">View Bill</button></a></div>,
<div class="css-4rck61"><div class="css-1dvz6tu"><div class="css-wd7aku"><h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/SB261">AL<!-- --> <!-- -->SB261</a></h3><style data-emotion="css iyw6hm">.css-iyw6hm{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#3E3F30;color:var(--chakra-colors-white);}.css-iyw6hm:focus,.css-iyw6hm[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class="css-iyw6hm">OTHER</span></div><span class="css-bcf15j">PASSED</span></div><h2 class="chakra-heading css-bp9bt3">Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to authorize the Attorney General to 
investigate and enforce this act; and to provide definitions.</h2><div class="css-bxak8j"></div><a class="chakra-link css-f4h6uy" href="/bills/2023/AL/SB261"><button class="chakra-button css-1952nyr" type="button">View Bill</button></a></div>,
<div class="css-4rck61"><div class="css-1dvz6tu"><div class="css-wd7aku"><h3 class="chakra-heading css-1vygpf9"><a class="chakra-link css-f4h6uy" href="/bills/2023/AR/HB1156">AR<!-- --> <!-- -->HB1156</a></h3><style data-emotion="css bvx26t">.css-bvx26t{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#A33469;color:var(--chakra-colors-white);}.css-bvx26t:focus,.css-bvx26t[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class="css-bvx26t">BATHROOM</span></div><span class="css-bcf15j">PASSED</span></div><h2 class="chakra-heading css-bp9bt3">Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex.</h2><div class="css-bxak8j"></div><a class="chakra-link css-f4h6uy" href="/bills/2023/AR/HB1156"><button class="chakra-button css-1952nyr" type="button">View Bill</button></a></div>]
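The real output is hard to read because the cards carry a lot of inline styling. To see what find_all returns, here is a minimal sketch on a hand-written snippet. The HTML below is a simplified stand-in, not the site's actual markup, although the class name is copied from the cell above.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: two bill cards and one unrelated div.
sample_html = """
<div class="css-4rck61"><h3>AL HB261</h3><span>SPORTS</span></div>
<div class="css-4rck61"><h3>AL SB261</h3><span>OTHER</span></div>
<div class="footer">not a bill card</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every tag that matches.
sample_cards = sample_soup.find_all("div", class_="css-4rck61")

print(len(sample_cards))        # how many divs matched
print(sample_cards[0].h3.text)  # text of the first card's <h3>
```

Class names like css-4rck61 appear to be generated automatically (note the data-emotion attributes in the output above), so they may change over time. If find_all ever returns an empty list, it is worth re-checking the class name in the inspector.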
step 2: pick out information from each bill card
Everything that we need is contained within the object bill_cards. Now, we use the inspector to get the elements and attributes for the items within bill_cards, like:
bill title
bill category
bill description
link to bill
# get bill title
soup.h3.text
'AL HB261'
# get bill caption
soup.find('div', class_ ='css-4rck61').h2.text
'Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.'
# get bill category
soup.find('div', class_='css-4rck61').span.text
'SPORTS'
# get bill description (if any)
soup.find('div', class_ ='css-4rck61').p.text
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[8], line 3
1 # get bill description (if any)
----> 3 soup.find('div', class_ ='css-4rck61').p.text
AttributeError: 'NoneType' object has no attribute 'text'
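The error above happens because this card has no p tag: `.p` returns None, and None has no `.text` attribute. One hedged way to guard against this is to check for the tag before asking for its text. In the sketch below, `text_or_empty` is a hypothetical helper and the HTML is a simplified stand-in for a real card.

```python
from bs4 import BeautifulSoup

card_html = '<div class="css-4rck61"><h3>AL HB261</h3><h2>Caption text</h2></div>'
card = BeautifulSoup(card_html, "html.parser").div


def text_or_empty(tag):
    """Return the tag's text, or an empty string when the tag is missing (None)."""
    return tag.text if tag is not None else ""


print(repr(text_or_empty(card.h2)))  # the <h2> exists, so we get its text
print(repr(text_or_empty(card.p)))   # there is no <p>, so card.p is None
```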
# get link extension
soup.find('div', class_ ='css-4rck61').a['href']
'/bills/2023/AL/HB261'
Because we now have the code for the relevant HTML elements, we can extract them and save them. To do that, we will write a loop that goes through each item in our bill_cards, gets the relevant HTML element, and saves it to a variable. Our loop will go through each bill card, one by one, and pull out the title, caption, category, description, and link.
Note: loops are ways of programmatically going through a dataset and doing something to each item in the dataset, like extracting it. Read more about loops in the intro workshop.
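As a quick refresher, here is a minimal loop over a plain Python list. The bill names are copied from the cards above; the variable names are just for illustration.

```python
bill_names = ["AL HB261", "AL SB261", "AR HB1156"]

state_codes = []
for name in bill_names:    # name takes each value in the list, one at a time
    state_code = name[:2]  # the first two characters, e.g. "AL"
    state_codes.append(state_code)

print(state_codes)
```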
Below, I will explain the code logic by writing it out in “pseudo-code” in the comments. Pseudo-code is a cross between normal language and programming language that is useful for explaining and working out how to write the actual programming code in Python.
# for each card in bill_cards:
#     get the title in h3.text
#     get the caption in h2.text
#     get the category in span.text
#     get the description in p.text (if any)
#     get the link from the href of the a tag, class "chakra-link"
# runs the loop on the bill cards
for item in bill_cards[:10]: # only the first ten cards, just to check if it is working
    print(item.h3.text) # title
    print(item.h2.text) # caption
    print(item.span.text) # category
    print(item.p.text if item.p is not None else '') # description (if any); empty when there is no p tag
    print(item.a['href']) # link extension, to be added to https://translegislation.com
AL HB261
Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.
SPORTS
/bills/2023/AL/HB261
AL SB261
Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to authorize the Attorney General to investigate and enforce this act; and to provide definitions.
OTHER
/bills/2023/AL/SB261
AR HB1156
Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex.
BATHROOM
/bills/2023/AR/HB1156
AR HB1468
To Create The Given Name Act; And To Prohibit Requiring Employees Of Public Schools And State-supported Institutions Of Higher Education To Use A Person's Preferred Pronoun, Name, Or Title Without Parental Consent.
EDUCATION
/bills/2023/AR/HB1468
AR HB1615
To Create The Conscience Protection Act; And To Amend The Religious Freedom Restoration Act.
OTHER
/bills/2023/AR/HB1615
AR SB125
Concerning Free Speech Rights At State-supported Institutions Of Higher Education.
EDUCATION
/bills/2023/AR/SB125
AR SB199
Concerning Medical Malpractice And Gender Transition In Minors; And To Create The Protecting Minors From Medical Malpractice Act Of 2023.
HEALTHCARE
/bills/2023/AR/SB199
AR SB270
To Amend The Criminal Offense Of Sexual Indecency With A Child.
BATHROOM
/bills/2023/AR/SB270
AR SB294
To Create The Learns Act; To Amend Various Provisions Of The Arkansas Code As They Relate To Early Childhood Through Grade Twelve Education In The State Of Arkansas; And To Declare An Emergency.
BATHROOM
/bills/2023/AR/SB294
AR SB43
To Add Certain Restrictions To An Adult-oriented Performance; And To Define An Adult-oriented Performance.
OTHER
/bills/2023/AR/SB43
It worked! Now, the next step is to assign a variable for each item. This allows us to save the data to the variable name, and later, to add it to a list.
for item in bill_cards[:10]:
    title = item.h3.text
    caption = item.h2.text
    category = item.find('span').text
    description = item.p.text if item.p is not None else '' # empty string when a card has no p tag
    link = 'https://translegislation.com' + item.a['href'] # root of the site + link extension
    print(title, caption, category, description, link)
AL HB261 Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act. SPORTS  https://translegislation.com/bills/2023/AL/HB261
AL SB261 Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to authorize the Attorney General to investigate and enforce this act; and to provide definitions. OTHER  https://translegislation.com/bills/2023/AL/SB261
AR HB1156 Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex. BATHROOM  https://translegislation.com/bills/2023/AR/HB1156
AR HB1468 To Create The Given Name Act; And To Prohibit Requiring Employees Of Public Schools And State-supported Institutions Of Higher Education To Use A Person's Preferred Pronoun, Name, Or Title Without Parental Consent. EDUCATION  https://translegislation.com/bills/2023/AR/HB1468
AR HB1615 To Create The Conscience Protection Act; And To Amend The Religious Freedom Restoration Act. OTHER  https://translegislation.com/bills/2023/AR/HB1615
AR SB125 Concerning Free Speech Rights At State-supported Institutions Of Higher Education. EDUCATION  https://translegislation.com/bills/2023/AR/SB125
AR SB199 Concerning Medical Malpractice And Gender Transition In Minors; And To Create The Protecting Minors From Medical Malpractice Act Of 2023. HEALTHCARE  https://translegislation.com/bills/2023/AR/SB199
AR SB270 To Amend The Criminal Offense Of Sexual Indecency With A Child. BATHROOM  https://translegislation.com/bills/2023/AR/SB270
AR SB294 To Create The Learns Act; To Amend Various Provisions Of The Arkansas Code As They Relate To Early Childhood Through Grade Twelve Education In The State Of Arkansas; And To Declare An Emergency. BATHROOM  https://translegislation.com/bills/2023/AR/SB294
AR SB43 To Add Certain Restrictions To An Adult-oriented Performance; And To Define An Adult-oriented Performance. OTHER  https://translegislation.com/bills/2023/AR/SB43
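Joining a site root and a link extension by hand is easy to get slightly wrong (a doubled slash, or a leftover piece of the listing page's path). A hedged alternative is `urljoin` from Python's standard library, which resolves a relative link against a base URL the way a browser would. The variable names below are just for illustration.

```python
from urllib.parse import urljoin

root = "https://translegislation.com"
extension = "/bills/2023/AL/HB261"

# Resolve the relative link against the site root.
full_url = urljoin(root, extension)
print(full_url)

# Because the extension starts with "/", it replaces any path on the base URL,
# so joining against the listing page gives the same result.
from_listing = urljoin("https://translegislation.com/bills/2023/passed", extension)
print(from_listing)
```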
It works! Now let’s save it to lists.
# a bunch of empty lists where we will dump our data
titles = []
captions = []
categories = []
descriptions = []
# our for loop that saves each item we want from the bill_cards
for item in bill_cards:
    title = item.h3.text
    caption = item.h2.text
    category = item.find('span').text
    description = item.p.text if item.p is not None else '' # empty string when a card has no p tag
    # adding the items to the empty lists
    titles.append(title)
    captions.append(caption)
    categories.append(category)
    descriptions.append(description)
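Separate lists work well, but they can drift out of sync if one append is skipped. A hedged alternative is to build one dictionary per card, so that all the fields for a bill stay together. In this sketch, `parse_card` is a hypothetical helper and the HTML is a simplified stand-in for the real cards.

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="css-4rck61"><h3>AL HB261</h3><span>SPORTS</span><h2>First caption</h2></div>
<div class="css-4rck61"><h3>AR SB43</h3><span>OTHER</span><h2>Second caption</h2><p>Some notes</p></div>
"""
sample_cards = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="css-4rck61")


def parse_card(card):
    """Collect the fields of one bill card into a dictionary."""
    description_tag = card.p  # None when the card has no <p>
    return {
        "title": card.h3.text,
        "caption": card.h2.text,
        "category": card.find("span").text,
        "description": description_tag.text if description_tag is not None else "",
    }


rows = [parse_card(card) for card in sample_cards]
print(rows[0])
```

A list of dictionaries like `rows` can also be passed straight to `pd.DataFrame`, with the dictionary keys becoming the column names.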
step 3: processing information about the bills
Before saving our dataset to a spreadsheet, we are going to do a bit more data processing and gathering. This will enable us to make a more robust dataset at the end. Here, we are going to do two things:
split the title column into state and title
get the link directly to the bill page on LegiScan
Like the previous sections, I’m going to use comments to write some pseudo-code that separates out the steps of the larger task. This is good practice for all programmers.
# first, we will split the bill name into two variables, state and title
# this will make things more clean when we add it to our spreadsheet
for item in bill_cards[:10]:
    state, title = item.h3.text.split(' ')
    print(state, title)
AL HB261
AL SB261
AR HB1156
AR HB1468
AR HB1615
AR SB125
AR SB199
AR SB270
AR SB294
AR SB43
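To see what split is doing, and where it could trip us up, here is a small sketch on plain strings. Unpacking into exactly two names only works when the split produces exactly two pieces; the three-word name below is a made-up example, not a real bill.

```python
bill_name = "AL HB261"

# split(' ') cuts the string at every space and returns a list of the pieces.
parts = bill_name.split(" ")
print(parts)

# maxsplit=1 stops after the first space, so a longer (hypothetical) name
# still unpacks cleanly into two variables.
state, title = "AL HB261 Extra".split(" ", maxsplit=1)
print(state, "|", title)
```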
## now, we will get the link to the state bill, in the following steps:
## first, get the link extension from each bill card and add it to the root URL
## then, save these full URLs to a list, called "urls"
## then, for each URL, make a soup
## finally, for each soup, get the link to the state bill and add it to a new list, called "legiscan_links"
for item in bill_cards[:10]:
    url = 'https://translegislation.com' + item.a['href'] # the extension already starts with a slash
    print(url)
https://translegislation.com/bills/2023/AL/HB261
https://translegislation.com/bills/2023/AL/SB261
https://translegislation.com/bills/2023/AR/HB1156
https://translegislation.com/bills/2023/AR/HB1468
https://translegislation.com/bills/2023/AR/HB1615
https://translegislation.com/bills/2023/AR/SB125
https://translegislation.com/bills/2023/AR/SB199
https://translegislation.com/bills/2023/AR/SB270
https://translegislation.com/bills/2023/AR/SB294
https://translegislation.com/bills/2023/AR/SB43
urls = []
for item in bill_cards:
    url = 'https://translegislation.com' + item.a['href']
    urls.append(url)
urls[:10]
['https://translegislation.com/bills/2023/AL/HB261',
'https://translegislation.com/bills/2023/AL/SB261',
'https://translegislation.com/bills/2023/AR/HB1156',
'https://translegislation.com/bills/2023/AR/HB1468',
'https://translegislation.com/bills/2023/AR/HB1615',
'https://translegislation.com/bills/2023/AR/SB125',
'https://translegislation.com/bills/2023/AR/SB199',
'https://translegislation.com/bills/2023/AR/SB270',
'https://translegislation.com/bills/2023/AR/SB294',
'https://translegislation.com/bills/2023/AR/SB43']
# making a soup object of *every* page that is linked
# this may take several seconds
soups = []
for url in urls:
    bill_site = requests.get(url)
    bill_html = bill_site.content
    bill_soup = BeautifulSoup(bill_html, 'lxml') # new names, so we don't overwrite our original soup
    soups.append(bill_soup)
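A loop like this sends many requests in quick succession, and a single failed request stops the whole loop. The sketch below is one hedged way to pause between requests and to record failures instead of crashing. The names `fetch_all`, `fetch`, and `fake_fetch` are hypothetical, the one-second default delay is an arbitrary choice, and the demo uses a fake fetch function with example.com URLs so that it runs offline.

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    Failed fetches are stored as None so the results stay aligned with urls.
    """
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as error:
            print(f"could not fetch {url}: {error}")
            results.append(None)
        time.sleep(delay)  # short pause so we do not flood the server
    return results


def fake_fetch(url):
    """Stand-in for a real download; fails on purpose for one URL."""
    if url.endswith("missing"):
        raise ValueError("page not found")
    return f"<html>{url}</html>"


pages = fetch_all(["https://example.com/a", "https://example.com/missing"], fake_fetch, delay=0)
print(pages)
```

In the notebook, `fetch` could be a small function that calls `requests.get(url, timeout=10)` and returns its `.content`.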
legiscan_links = []
for bill_soup in soups:
    # get the link to the state bill on LegiScan
    anchor_tag = bill_soup.find('a', class_='chakra-link css-oga2ct')
    link = anchor_tag['href']
    legiscan_links.append(link)
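The class name css-oga2ct looks auto-generated, so it may change, and find returns None when nothing matches, which makes `anchor_tag['href']` raise a TypeError. A hedged alternative is to look for the link by its destination instead. This sketch assumes the link we want contains legiscan.com in its href, which is consistent with the links in our final table but not something I have confirmed against the page's markup. The HTML is a simplified stand-in and `find_legiscan_link` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

page_html = """
<a class="chakra-link" href="/bills/2023/AL">Back to Alabama</a>
<a class="chakra-link" href="https://legiscan.com/AL/text/HB261/id/2817698">Read the bill</a>
"""
page_soup = BeautifulSoup(page_html, "html.parser")


def find_legiscan_link(page):
    """Return the first href containing 'legiscan.com', or None if there is none."""
    # a[href*="..."] is a CSS selector matching <a> tags whose href contains the text.
    anchor_tag = page.select_one('a[href*="legiscan.com"]')
    return anchor_tag["href"] if anchor_tag is not None else None


print(find_legiscan_link(page_soup))
print(find_legiscan_link(BeautifulSoup("<p>no links here</p>", "html.parser")))
```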
step 4: saving our data to a CSV
This is the final step. First, we will import two libraries for working with tabular data, pandas and csv.
Then, we will add each of our lists into a “DataFrame” (the pandas term for a tabular type of object), where they will appear as separate columns. Finally, we will save our DataFrame as a .csv file.
# importing the necessary libraries
import pandas as pd
import csv
# creating empty lists to hold all of our data
states = []
titles = []
captions = []
categories = []
descriptions = []
# extracting the data from the bill cards
for item in bill_cards:
    state, title = item.h3.text.split(' ') # adding the extra step to split the bill name into state and title items
    caption = item.h2.text
    category = item.find('span').text
    description = item.p.text if item.p is not None else '' # empty string when a card has no p tag
    # adding the items to the empty lists
    states.append(state)
    titles.append(title)
    captions.append(caption)
    categories.append(category)
    descriptions.append(description)
# remember that "legiscan_links" is already saved as a list, so we don't have to create it here
# creating a dataframe, with separate columns to hold each of our lists
df = pd.DataFrame(
    {'state': states,
     'title': titles,
     'caption': captions,
     'category': categories,
     'description': descriptions,
     'legiscan link': legiscan_links
    })
# checking the first 5 lines of the dataframe
df.head()
| | state | title | caption | category | description | legiscan link |
|---|---|---|---|---|---|---|
| 0 | AL | HB261 | Relating to two-year and four-year public inst... | SPORTS | | https://legiscan.com/AL/text/HB261/id/2817698 |
| 1 | AL | SB261 | Relating to public contracts; to prohibit gove... | OTHER | | https://legiscan.com/AL/text/SB261/id/2821857 |
| 2 | AR | HB1156 | Concerning A Public School District Or Open-en... | BATHROOM | | https://legiscan.com/AR/text/HB1156/id/2756961 |
| 3 | AR | HB1468 | To Create The Given Name Act; And To Prohibit ... | EDUCATION | | https://legiscan.com/AR/text/HB1468/id/2781770 |
| 4 | AR | HB1615 | To Create The Conscience Protection Act; And T... | OTHER | | https://legiscan.com/AR/text/HB1615/id/2781807 |
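When a DataFrame is built from a dictionary of lists, pandas expects every list to have the same length and raises a ValueError otherwise. If the lists were ever built from different subsets of the cards, a quick length check beforehand can make the problem easier to spot. The sketch below uses plain Python and made-up placeholder values.

```python
columns = {
    "state": ["AL", "AL", "AR"],
    "title": ["HB261", "SB261", "HB1156"],
    "legiscan link": ["link-1", "link-2"],  # one item short, on purpose
}

# Record how many items each column holds.
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)

# More than one distinct length means the lists are out of sync.
if len(set(lengths.values())) > 1:
    print("the lists have different lengths, so pd.DataFrame would raise a ValueError")
```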
# saving the dataframe as a csv file
df.to_csv('bill_data.csv')
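By default, to_csv also writes the DataFrame's row index as an unnamed first column; passing `index=False` leaves it out. Also note that to_csv does not rely on the csv module we imported. For comparison, here is a hedged sketch of how a few rows could be written with the standard library alone. The rows are a small sample of our data, and io.StringIO stands in for a real file so that the example leaves nothing on disk.

```python
import csv
import io

rows = [
    {"state": "AL", "title": "HB261", "category": "SPORTS"},
    {"state": "AR", "title": "HB1156", "category": "BATHROOM"},
]

# io.StringIO behaves like an open text file, but lives in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["state", "title", "category"], lineterminator="\n")
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```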
And that’s all! If you are on Google Colab, check your sidebar under the “files” tab. You should see a .csv file containing the data we’ve scraped from the translegislation.com website. Well done!
In the next section, we will look at an API method for getting legislative data, and save that data to a CSV file. In that activity, you’ll see the differences in handling data across web scraping and API methods.