{ "cells": [ { "cell_type": "markdown", "id": "0deef6c7", "metadata": {}, "source": [ "# scraping with `bs4`\n", "Now that we have a sense of the `bs4` syntax and a list of the HTML elements that we want to scrape from the `https://translegislation.com/` website, we can write some code that extracts those elements and saves them in a structured format, like a spreadsheet.\n", "\n", "Before doing all that, we will import the libraries we need and create our `soup` object, which holds the website content." ] }, { "cell_type": "code", "execution_count": 1, "id": "2419132e", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import lxml" ] }, { "cell_type": "code", "execution_count": 2, "id": "ea364855", "metadata": {}, "outputs": [], "source": [ "# download the page that lists the bills passed in 2023\n", "site = requests.get('https://translegislation.com/bills/2023/passed')\n", "html_code = site.content\n", "# parse the raw HTML with the lxml parser\n", "soup = BeautifulSoup(html_code, 'lxml')" ] }, { "cell_type": "markdown", "id": "6dcf736e", "metadata": {}, "source": [ "Now that we have loaded the necessary libraries and created the `soup` object, we can extract the data we want. Like all good programmers, we will break our task up into a number of steps:\n", "1. isolate the bill cards data from the rest of the webpage\n", "2. pick out the information we want from the bill cards\n", "3. process the information from the bill cards into the format we want\n", "4. save that information to a CSV file\n", "\n", "Each of these steps itself contains smaller steps, which we will figure out as we go along. Let's begin with the first step." ] }, { "cell_type": "markdown", "id": "94726fbd", "metadata": {}, "source": [ "## step 1: isolate the bill cards data from the rest of the page\n", "First, create a new object called `bill_cards`, which narrows the page down to the parts that we want to scrape."
] }, { "cell_type": "code", "execution_count": null, "id": "87f41013", "metadata": {}, "outputs": [], "source": [ "# use the browser's inspector to find the element and class of the bill cards\n", "\n", "bill_cards = soup.find_all('div', class_='css-4rck61')" ] }, { "cell_type": "code", "execution_count": 36, "id": "126f6ccc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
| | state | title | caption | category | description | legiscan link |
|---|---|---|---|---|---|---|
| 0 | AL | HB261 | Relating to two-year and four-year public inst... | SPORTS | | https://legiscan.com/AL/text/HB261/id/2817698 |
| 1 | AL | SB261 | Relating to public contracts; to prohibit gove... | OTHER | | https://legiscan.com/AL/text/SB261/id/2821857 |
| 2 | AR | HB1156 | Concerning A Public School District Or Open-en... | BATHROOM | | https://legiscan.com/AR/text/HB1156/id/2756961 |
| 3 | AR | HB1468 | To Create The Given Name Act; And To Prohibit ... | EDUCATION | | https://legiscan.com/AR/text/HB1468/id/2781770 |
| 4 | AR | HB1615 | To Create The Conscience Protection Act; And T... | OTHER | | https://legiscan.com/AR/text/HB1615/id/2781807 |
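
To make step 4 (saving to a CSV file) concrete, here is a minimal sketch that writes rows shaped like the table above. It rests on a few assumptions: it uses the standard-library `csv` module rather than whatever tool the notebook itself uses, the helper name `write_bills_csv` is hypothetical, and it writes to an in-memory buffer instead of a real file so that it runs anywhere. The two sample rows are copied from the table, with their captions left truncated as they appear there.

```python
import csv
import io

# Column names taken from the table above.
FIELDNAMES = ["state", "title", "caption", "category", "description", "legiscan link"]

# Two rows copied from the table; the description column is empty there.
rows = [
    {
        "state": "AL",
        "title": "HB261",
        "caption": "Relating to two-year and four-year public inst...",
        "category": "SPORTS",
        "description": "",
        "legiscan link": "https://legiscan.com/AL/text/HB261/id/2817698",
    },
    {
        "state": "AR",
        "title": "HB1156",
        "caption": "Concerning A Public School District Or Open-en...",
        "category": "BATHROOM",
        "description": "",
        "legiscan link": "https://legiscan.com/AR/text/HB1156/id/2756961",
    },
]


def write_bills_csv(rows, handle):
    """Write a header line plus one CSV line per bill (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO(newline="")
write_bills_csv(rows, buffer)

# Read the data back to check that it round-trips.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["title"], read_back[1]["state"])
```

To write an actual file, the same helper could be given a handle from `open('bills.csv', 'w', newline='')`, where `bills.csv` is just a placeholder name.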