{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0deef6c7",
   "metadata": {},
   "source": [
    "# scraping with `bs4`\n",
    "Now that we have a sense of the `bs4` syntax and a list of the HTML elements that we want to scrape from the `https://translegislation.com/` website, we can write some code that extracts those elements and saves them in a structured format, like a spreadsheet.\n",
    "\n",
    "Before doing all that, we will import the libraries we need and create our `soup` object, which holds the website's content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2419132e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import lxml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ea364855",
   "metadata": {},
   "outputs": [],
   "source": [
    "site = requests.get('https://translegislation.com/bills/2023/passed')\n",
    "html_code = site.content\n",
    "soup = BeautifulSoup(html_code, 'lxml')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dcf736e",
   "metadata": {},
   "source": [
    "Now that we have loaded the necessary libraries and created the `soup` object, we can extract the data we want. Like all good programmers, we will break our task up into a number of steps:\n",
    "1. isolate the bill cards data from the rest of the webpage\n",
    "2. pick out the information we want from the bill cards\n",
    "3. process the information from the bill cards into the format we want\n",
    "4. save that information to a csv file\n",
    "\n",
    "Each of these steps itself contains smaller steps, which we will figure out as we go along. Let's begin with the first step."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94726fbd",
   "metadata": {},
   "source": [
    "## step 1: isolate the bill cards data from the rest of the page\n",
    "First, we create a new object called `bill_cards`: a list of every `div` element that holds one bill card. This narrows the page down to the parts that we want to scrape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87f41013",
   "metadata": {},
   "outputs": [],
   "source": [
    "# to get the element and class for the cards, use the browser's inspector\n",
    "\n",
    "bill_cards = soup.find_all('div', class_='css-4rck61')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "126f6ccc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<div class=\"css-4rck61\"><style data-emotion=\"css 1dvz6tu\">.css-1dvz6tu{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:baseline;-webkit-box-align:baseline;-ms-flex-align:baseline;align-items:baseline;-webkit-box-pack:justify;-webkit-justify-content:space-between;justify-content:space-between;-webkit-box-flex-wrap:wrap;-webkit-flex-wrap:wrap;-ms-flex-wrap:wrap;flex-wrap:wrap;gap:var(--chakra-space-2);margin-left:inherit;margin-right:inherit;margin-bottom:var(--chakra-space-2);}</style><div class=\"css-1dvz6tu\"><style data-emotion=\"css wd7aku\">.css-wd7aku{font-weight:var(--chakra-fontWeights-semibold);letter-spacing:var(--chakra-letterSpacings-wide);margin-bottom:var(--chakra-space-2);}</style><div class=\"css-wd7aku\"><style data-emotion=\"css 1vygpf9\">.css-1vygpf9{font-family:var(--chakra-fonts-heading);font-weight:var(--chakra-fontWeights-bold);font-size:var(--chakra-fontSizes-2xl);line-height:1.33;color:#181818;text-align:left;margin-bottom:var(--chakra-space-1);}@media screen and (min-width: 48em){.css-1vygpf9{font-size:var(--chakra-fontSizes-3xl);line-height:1.2;}}</style><h3 class=\"chakra-heading css-1vygpf9\"><style data-emotion=\"css f4h6uy\">.css-f4h6uy{transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-fast);transition-timing-function:var(--chakra-transition-easing-ease-out);cursor:pointer;-webkit-text-decoration:none;text-decoration:none;outline:2px solid transparent;outline-offset:2px;color:inherit;}.css-f4h6uy:hover,.css-f4h6uy[data-hover]{-webkit-text-decoration:underline;text-decoration:underline;}.css-f4h6uy:focus,.css-f4h6uy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AL/HB261\">AL<!-- --> <!-- -->HB261</a></h3><style data-emotion=\"css 
bu60l4\">.css-bu60l4{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#b55202;color:var(--chakra-colors-white);}.css-bu60l4:focus,.css-bu60l4[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class=\"css-bu60l4\">SPORTS</span></div><style data-emotion=\"css bcf15j\">.css-bcf15j{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.25rem;min-width:1.25rem;font-size:var(--chakra-fontSizes-xs);-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);border-radius:0px;background:var(--chakra-colors-red-100);color:var(--chakra-colors-gray-800);}.css-bcf15j:focus,.css-bcf15j[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class=\"css-bcf15j\">PASSED</span></div><style data-emotion=\"css 
bp9bt3\">.css-bp9bt3{font-family:var(--chakra-fonts-heading);font-weight:var(--chakra-fontWeights-semibold);font-size:var(--chakra-fontSizes-xl);line-height:1.2;margin-left:inherit;margin-right:inherit;overflow:hidden;text-overflow:ellipsis;display:-webkit-box;-webkit-box-orient:vertical;-webkit-line-clamp:var(--chakra-line-clamp);--chakra-line-clamp:3;margin-bottom:var(--chakra-space-2);}</style><h2 class=\"chakra-heading css-bp9bt3\">Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.</h2><style data-emotion=\"css 1anmcl7\">.css-1anmcl7{overflow:hidden;text-overflow:ellipsis;display:-webkit-box;-webkit-box-orient:vertical;-webkit-line-clamp:var(--chakra-line-clamp);--chakra-line-clamp:5;margin-left:inherit;margin-right:inherit;margin-bottom:var(--chakra-space-4);}</style><p class=\"chakra-text css-1anmcl7\"></p><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AL/HB261\"><style data-emotion=\"css 
1952nyr\">.css-1952nyr{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;-webkit-appearance:none;-moz-appearance:none;-ms-appearance:none;appearance:none;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;-webkit-box-pack:center;-ms-flex-pack:center;-webkit-justify-content:center;justify-content:center;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;position:relative;white-space:nowrap;vertical-align:middle;outline:2px solid transparent;outline-offset:2px;width:auto;line-height:1.2;border-radius:0px;font-weight:var(--chakra-fontWeights-semibold);transition-property:var(--chakra-transition-property-common);transition-duration:var(--chakra-transition-duration-normal);height:var(--chakra-sizes-10);min-width:var(--chakra-sizes-10);font-size:var(--chakra-fontSizes-md);-webkit-padding-start:var(--chakra-space-4);padding-inline-start:var(--chakra-space-4);-webkit-padding-end:var(--chakra-space-4);padding-inline-end:var(--chakra-space-4);background:var(--chakra-colors-gray-100);border:1px solid #181818;background-color:var(--chakra-colors-white);}.css-1952nyr:focus,.css-1952nyr[data-focus]{box-shadow:var(--chakra-shadows-outline);}.css-1952nyr[disabled],.css-1952nyr[aria-disabled=true],.css-1952nyr[data-disabled]{opacity:0.4;cursor:not-allowed;box-shadow:var(--chakra-shadows-none);}.css-1952nyr:hover,.css-1952nyr[data-hover]{background:var(--chakra-colors-gray-200);}.css-1952nyr:hover[disabled],.css-1952nyr[data-hover][disabled],.css-1952nyr:hover[aria-disabled=true],.css-1952nyr[data-hover][aria-disabled=true],.css-1952nyr:hover[data-disabled],.css-1952nyr[data-hover][data-disabled]{background:var(--chakra-colors-gray-100);}.css-1952nyr:active,.css-1952nyr[data-active]{background:var(--chakra-colors-gray-300);}</style><button class=\"chakra-button css-1952nyr\" type=\"button\">View Bill</button></a></div>,\n",
       " <div class=\"css-4rck61\"><div class=\"css-1dvz6tu\"><div class=\"css-wd7aku\"><h3 class=\"chakra-heading css-1vygpf9\"><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AL/SB261\">AL<!-- --> <!-- -->SB261</a></h3><style data-emotion=\"css 8im0hy\">.css-8im0hy{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#607d8b;color:var(--chakra-colors-white);}.css-8im0hy:focus,.css-8im0hy[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class=\"css-8im0hy\">OTHER</span></div><span class=\"css-bcf15j\">PASSED</span></div><h2 class=\"chakra-heading css-bp9bt3\">Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to 
authorize the Attorney General to investigate and enforce this act; and to provide definitions.</h2><p class=\"chakra-text css-1anmcl7\"></p><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AL/SB261\"><button class=\"chakra-button css-1952nyr\" type=\"button\">View Bill</button></a></div>,\n",
       " <div class=\"css-4rck61\"><div class=\"css-1dvz6tu\"><div class=\"css-wd7aku\"><h3 class=\"chakra-heading css-1vygpf9\"><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AR/HB1156\">AR<!-- --> <!-- -->HB1156</a></h3><style data-emotion=\"css bvx26t\">.css-bvx26t{display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;vertical-align:top;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;max-width:100%;font-weight:var(--chakra-fontWeights-medium);line-height:1.2;outline:2px solid transparent;outline-offset:2px;min-height:1.5rem;min-width:1.5rem;font-size:var(--chakra-fontSizes-sm);border-radius:0px;-webkit-padding-start:var(--chakra-space-2);padding-inline-start:var(--chakra-space-2);-webkit-padding-end:var(--chakra-space-2);padding-inline-end:var(--chakra-space-2);background:#A33469;color:var(--chakra-colors-white);}.css-bvx26t:focus,.css-bvx26t[data-focus]{box-shadow:var(--chakra-shadows-outline);}</style><span class=\"css-bvx26t\">BATHROOM</span></div><span class=\"css-bcf15j\">PASSED</span></div><h2 class=\"chakra-heading css-bp9bt3\">Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex.</h2><p class=\"chakra-text css-1anmcl7\"></p><a class=\"chakra-link css-f4h6uy\" href=\"/bills/2023/AR/HB1156\"><button class=\"chakra-button css-1952nyr\" type=\"button\">View Bill</button></a></div>]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# checking our list by printing just the first three items\n",
    "\n",
    "bill_cards[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "726926e3",
   "metadata": {},
   "source": [
    "## step 2: pick out information from each bill card\n",
    "Everything that we need is contained within the object, `bill_cards`. Now, we use the inspector to get the elements and attributes for the items within `bill_cards`, like:\n",
    "- bill title\n",
    "- bill category\n",
    "- bill description\n",
    "- link to bill"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "db345235",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'AL HB261'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get bill title\n",
    "\n",
    "soup.h3.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "98436908",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get bill caption\n",
    "\n",
    "soup.find('div', class_='css-4rck61').h2.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6cd4bbdc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'SPORTS'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get bill category\n",
    "\n",
    "soup.find('div', class_='css-4rck61').span.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "72446853",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "''"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get bill description (if any)\n",
    "\n",
    "soup.find('div', class_='css-4rck61').p.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ab9104da",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/bills/2023/AL/HB261'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get link extension\n",
    "\n",
    "soup.find('div', class_='css-4rck61').a['href']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "519807f5",
   "metadata": {},
   "source": [
    "Now that we have the code for the relevant HTML elements, we can extract them and save them. To do that, we will write a loop that goes through each item in `bill_cards`, one by one, pulls out the title, caption, category, description, and link, and saves each one to a variable. \n",
    "\n",
    "*Note: loops are ways of programmatically going through a dataset and doing something to each item in the dataset, like extracting it. Read more about [loops in the intro workshop](../intro/loops.ipynb).*\n",
    "\n",
    "Below, I explain the code logic by writing it out in \"pseudo-code\" in the comments. Pseudo-code is a cross between normal language and programming language that is useful for explaining and working out how to write the actual code in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "46bbc6ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# for each card in bill_cards:\n",
    "#     get the title in h3.text\n",
    "#     get the caption in h2.text\n",
    "#     get the category in span.text\n",
    "#     get the description in p.text (if any)\n",
    "#     get the link from the href attribute of the a tag, class \"chakra-link\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "95a1c269",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AL HB261\n",
      "Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act.\n",
      "SPORTS\n",
      "\n",
      "/bills/2023/AL/HB261\n",
      "AL SB261\n",
      "Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to authorize the Attorney General to investigate and enforce this act; and to provide definitions.\n",
      "OTHER\n",
      "\n",
      "/bills/2023/AL/SB261\n",
      "AR HB1156\n",
      "Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex.\n",
      "BATHROOM\n",
      "\n",
      "/bills/2023/AR/HB1156\n",
      "AR HB1468\n",
      "To Create The Given Name Act; And To Prohibit Requiring Employees Of Public Schools And State-supported Institutions Of Higher Education To Use A Person's Preferred Pronoun, Name, Or Title Without Parental Consent.\n",
      "EDUCATION\n",
      "\n",
      "/bills/2023/AR/HB1468\n",
      "AR HB1615\n",
      "To Create The Conscience Protection Act; And To Amend The Religious Freedom Restoration Act.\n",
      "OTHER\n",
      "\n",
      "/bills/2023/AR/HB1615\n",
      "AR SB125\n",
      "Concerning Free Speech Rights At State-supported Institutions Of Higher Education.\n",
      "EDUCATION\n",
      "\n",
      "/bills/2023/AR/SB125\n",
      "AR SB199\n",
      "Concerning Medical Malpractice And Gender Transition In Minors; And To Create The Protecting Minors From Medical Malpractice Act Of 2023.\n",
      "HEALTHCARE\n",
      "\n",
      "/bills/2023/AR/SB199\n",
      "AR SB270\n",
      "To Amend The Criminal Offense Of Sexual Indecency With A Child.\n",
      "BATHROOM\n",
      "\n",
      "/bills/2023/AR/SB270\n",
      "AR SB294\n",
      "To Create The Learns Act; To Amend Various Provisions Of The Arkansas Code As They Relate To Early Childhood Through Grade Twelve Education In The State Of Arkansas; And To Declare An Emergency.\n",
      "BATHROOM\n",
      "\n",
      "/bills/2023/AR/SB294\n",
      "AR SB43\n",
      "To Add Certain Restrictions To An Adult-oriented Performance; And To Define An Adult-oriented Performance.\n",
      "OTHER\n",
      "\n",
      "/bills/2023/AR/SB43\n"
     ]
    }
   ],
   "source": [
    "# runs the loop on the bill cards\n",
    "\n",
    "for item in bill_cards[:10]: # only the first ten cards, just to check if it is working\n",
    "    print(item.h3.text) # title\n",
    "    print(item.h2.text) # caption\n",
    "    print(item.span.text) # category\n",
    "    print(item.p.text) # description (if any)\n",
    "    print(item.a['href']) # link extension; the root https://translegislation.com still needs to be added in front"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b38a118",
   "metadata": {},
   "source": [
    "It worked! The next step is to assign each item to a variable. This lets us refer to the data by name and, later, add it to a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "912fe161",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AL HB261 Relating to two-year and four-year public institutions of higher education; to amend Section 16-1-52, Code of Alabama 1975, to prohibit a biological male from participating on an athletic team or sport designated for females; to prohibit a biological female from participating on an athletic team or sport designated for males; to prohibit adverse action against a public K-12 school or public two-year or four-year institution of higher education for complying with this act; to prohibit adverse action or retaliation against a student who reports a violation of this act; and to provide a remedy for any student who suffers harm or is directy deprived of an athletic opportunity as a result of a violation of this act. SPORTS  https://translegislation.com/bills/2023/AL/HB261\n",
      "AL SB261 Relating to public contracts; to prohibit governmental entities from entering into certain contracts with companies that boycott businesses because the business engages in certain sectors or does not meet certain environmental or corporate governance standards or does not facilitate certain activities; to provide that no company in the state shall be required by a governmental entity, nor penalized by a governmental entity for declining to engage in economic boycotts or other actions that further social, political, or ideological interests; to require the Attorney General to take actions to prevent federal laws or actions from penalizing, inflicting harm on, limiting commercial relations with, or changing or limiting the activities of companies or residents of the state based on the furtherance of economic boycott criteria; and to authorize the Attorney General to investigate and enforce this act; and to provide definitions. OTHER  https://translegislation.com/bills/2023/AL/SB261\n",
      "AR HB1156 Concerning A Public School District Or Open-enrollment Public Charter School Policy Relating To A Public School Student's Sex. BATHROOM  https://translegislation.com/bills/2023/AR/HB1156\n",
      "AR HB1468 To Create The Given Name Act; And To Prohibit Requiring Employees Of Public Schools And State-supported Institutions Of Higher Education To Use A Person's Preferred Pronoun, Name, Or Title Without Parental Consent. EDUCATION  https://translegislation.com/bills/2023/AR/HB1468\n",
      "AR HB1615 To Create The Conscience Protection Act; And To Amend The Religious Freedom Restoration Act. OTHER  https://translegislation.com/bills/2023/AR/HB1615\n",
      "AR SB125 Concerning Free Speech Rights At State-supported Institutions Of Higher Education. EDUCATION  https://translegislation.com/bills/2023/AR/SB125\n",
      "AR SB199 Concerning Medical Malpractice And Gender Transition In Minors; And To Create The Protecting Minors From Medical Malpractice Act Of 2023. HEALTHCARE  https://translegislation.com/bills/2023/AR/SB199\n",
      "AR SB270 To Amend The Criminal Offense Of Sexual Indecency With A Child. BATHROOM  https://translegislation.com/bills/2023/AR/SB270\n",
      "AR SB294 To Create The Learns Act; To Amend Various Provisions Of The Arkansas Code As They Relate To Early Childhood Through Grade Twelve Education In The State Of Arkansas; And To Declare An Emergency. BATHROOM  https://translegislation.com/bills/2023/AR/SB294\n",
      "AR SB43 To Add Certain Restrictions To An Adult-oriented Performance; And To Define An Adult-oriented Performance. OTHER  https://translegislation.com/bills/2023/AR/SB43\n"
     ]
    }
   ],
   "source": [
    "for item in bill_cards[:10]:\n",
    "    title = item.h3.text\n",
    "    caption = item.h2.text\n",
    "    category = item.find('span').text\n",
    "    description = item.p.text\n",
    "    link = 'https://translegislation.com' + item.a['href'] # root + link extension gives the full URL\n",
    "    print(title, caption, category, description, link)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e480df5",
   "metadata": {},
   "source": [
    "It works! Now let's save it to lists. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10e11cb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# a bunch of empty lists where we will dump our data\n",
    "titles = []\n",
    "captions = []\n",
    "categories = []\n",
    "descriptions = []\n",
    "\n",
    "# our for loop that saves each item we want from the bill_cards\n",
    "for item in bill_cards:\n",
    "    title = item.h3.text\n",
    "    caption = item.h2.text\n",
    "    category = item.find('span').text\n",
    "    description = item.p.text\n",
    "    \n",
    "    # adding the items to the empty lists\n",
    "    titles.append(title)\n",
    "    captions.append(caption)\n",
    "    categories.append(category)\n",
    "    descriptions.append(description)"
   ]
  },
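  {
   "cell_type": "markdown",
   "id": "b7d2e4a1",
   "metadata": {},
   "source": [
    "To experiment with this kind of loop without requesting the live site, here is a minimal sketch that runs the same pattern on a small, made-up HTML snippet. The snippet, the class name `card`, and the `sample_` variable names are assumptions for illustration only; they are not the site's real markup. The sketch uses Python's built-in `html.parser`, so it does not need `lxml`.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# a made-up, simplified stand-in for two bill cards (not the site's real markup)\n",
    "sample_html = '''\n",
    "<div class='card'><h3><a href='/bills/2023/AL/HB261'>AL HB261</a></h3><span>SPORTS</span></div>\n",
    "<div class='card'><h3><a href='/bills/2023/AR/SB43'>AR SB43</a></h3><span>OTHER</span></div>\n",
    "'''\n",
    "\n",
    "sample_soup = BeautifulSoup(sample_html, 'html.parser')\n",
    "sample_cards = sample_soup.find_all('div', class_='card')\n",
    "\n",
    "# empty lists, filled one card at a time\n",
    "sample_titles = []\n",
    "sample_categories = []\n",
    "for card in sample_cards:\n",
    "    sample_titles.append(card.h3.text)\n",
    "    sample_categories.append(card.span.text)\n",
    "\n",
    "print(sample_titles)\n",
    "print(sample_categories)\n",
    "```\n",
    "\n",
    "Because the snippet is so small, it is easy to check by eye that each list ends up with one entry per card, in the same order as the cards."
   ]
  },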
  {
   "cell_type": "markdown",
   "id": "933f4fc8",
   "metadata": {},
   "source": [
    "## step 3: processing information about the bills\n",
    "Before saving our dataset to a spreadsheet, we are going to do a bit more data processing and gathering. This will give us a more complete dataset at the end. Here, we are going to do two things:\n",
    "\n",
    "1. split the title column into state and title\n",
    "2. get the link directly to the bill page on LegiScan\n",
    "\n",
    "As in the previous sections, I'm going to use comments to write some pseudo-code that separates out the steps of the larger task. This is good practice for all programmers. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "385087f5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AL HB261\n",
      "AL SB261\n",
      "AR HB1156\n",
      "AR HB1468\n",
      "AR HB1615\n",
      "AR SB125\n",
      "AR SB199\n",
      "AR SB270\n",
      "AR SB294\n",
      "AR SB43\n"
     ]
    }
   ],
   "source": [
    "# first, we will split the bill name into two variables, state and title \n",
    "# this will make things more clean when we add it to our spreadsheet\n",
    "\n",
    "for item in bill_cards[:10]:\n",
    "    state, title = item.h3.text.split(' ')\n",
    "    print(state, title)"
   ]
  },
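  {
   "cell_type": "markdown",
   "id": "c3f8a2d6",
   "metadata": {},
   "source": [
    "`split(' ')` works in the cell above because every bill name contains exactly one space. As a slightly more defensive sketch, `split(maxsplit=1)` splits only at the first run of whitespace, so unpacking into two variables still works if the second part ever contains a space. The first two names in `bill_names` are copied from the output above; the third is a made-up example.\n",
    "\n",
    "```python\n",
    "# two names from the output above, plus one made-up name with an extra space\n",
    "bill_names = ['AL HB261', 'AR SB43', 'XX HB 100']\n",
    "\n",
    "for name in bill_names:\n",
    "    # maxsplit=1 splits only once, at the first run of whitespace\n",
    "    state, title = name.split(maxsplit=1)\n",
    "    print(repr(state), repr(title))\n",
    "\n",
    "# 'AL' 'HB261'\n",
    "# 'AR' 'SB43'\n",
    "# 'XX' 'HB 100'\n",
    "```\n",
    "\n",
    "With `split(' ')`, the made-up third name would produce three pieces and the two-variable unpacking would raise a `ValueError`."
   ]
  },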
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9c2b7eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# now, we will get the link to the state bill on LegiScan, in the following steps:\n",
    "\n",
    "# first, add each bill card's link extension to the root, saving the results in a list called \"urls\"\n",
    "# then, for each URL, make a soup\n",
    "# then, for each soup, get the link to the state bill\n",
    "# finally, add each link to a new list, called \"legiscan_links\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "9051983f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://translegislation.com/bills/2023/AL/HB261\n",
      "https://translegislation.com/bills/2023/AL/SB261\n",
      "https://translegislation.com/bills/2023/AR/HB1156\n",
      "https://translegislation.com/bills/2023/AR/HB1468\n",
      "https://translegislation.com/bills/2023/AR/HB1615\n",
      "https://translegislation.com/bills/2023/AR/SB125\n",
      "https://translegislation.com/bills/2023/AR/SB199\n",
      "https://translegislation.com/bills/2023/AR/SB270\n",
      "https://translegislation.com/bills/2023/AR/SB294\n",
      "https://translegislation.com/bills/2023/AR/SB43\n"
     ]
    }
   ],
   "source": [
    "for item in bill_cards[:10]:\n",
    "    url = 'https://translegislation.com' + item.a['href'] # the href already starts with a slash\n",
    "    print(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "92bd2f85",
   "metadata": {},
   "outputs": [],
   "source": [
    "urls = []\n",
    "for item in bill_cards:\n",
    "    url = 'https://translegislation.com' + item.a['href']\n",
    "    urls.append(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "b6d9222e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://translegislation.com/bills/2023/AL/HB261',\n",
       " 'https://translegislation.com/bills/2023/AL/SB261',\n",
       " 'https://translegislation.com/bills/2023/AR/HB1156',\n",
       " 'https://translegislation.com/bills/2023/AR/HB1468',\n",
       " 'https://translegislation.com/bills/2023/AR/HB1615',\n",
       " 'https://translegislation.com/bills/2023/AR/SB125',\n",
       " 'https://translegislation.com/bills/2023/AR/SB199',\n",
       " 'https://translegislation.com/bills/2023/AR/SB270',\n",
       " 'https://translegislation.com/bills/2023/AR/SB294',\n",
       " 'https://translegislation.com/bills/2023/AR/SB43']"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urls[:10]"
   ]
  },
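  {
   "cell_type": "markdown",
   "id": "c41d7e2a",
   "metadata": {},
   "source": [
    "Building URLs by adding strings together makes it easy to end up with a missing or doubled slash between the two parts. As an optional alternative, here is a minimal sketch that uses `urljoin` from Python's built-in `urllib.parse` module. The names `base_url` and `sample_hrefs` are made up for this example, and the sample paths assume that each `href` on the page starts with a `/`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b90e5f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urljoin\n",
    "\n",
    "# hypothetical names for this sketch; the paths stand in for the values of item.a['href']\n",
    "base_url = 'https://translegislation.com/'\n",
    "sample_hrefs = ['/bills/2023/AL/HB261', '/bills/2023/AR/SB43']\n",
    "\n",
    "# urljoin works out the slashes for us, whether or not base_url ends with one\n",
    "joined_urls = [urljoin(base_url, href) for href in sample_hrefs]\n",
    "print(joined_urls)"
   ]
  },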
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "edd8695f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# making a soup object of *every* page that is linked\n",
    "# this may take several seconds\n",
    "\n",
    "soups = []\n",
    "for item in urls:\n",
    "    site = requests.get(item)\n",
    "    html_code = site.content\n",
    "    soup = BeautifulSoup(html_code, 'lxml')\n",
    "    soups.append(soup)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "e68204d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "legiscan_links = []\n",
    "for item in soups:\n",
    "    # get the url for state url\n",
    "    anchor_tag = item.find('a', class_='chakra-link css-oga2ct')\n",
    "    link = anchor_tag['href']\n",
    "    legiscan_links.append(link)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d12cc727",
   "metadata": {},
   "source": [
    "## 3. saving our data to a CSV\n",
    "This is the final step. First, we will import two libraries for working with tabular data `pandas` and `csv`. \n",
    "\n",
    "Then, we will add each of our lists into the \"DataFrame\" (the `pandas` term for a tabular type of object), where they will appear as separate columns. Finally, we will save our DataFrame as a .csv file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "6f35aa77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# importing the necessary libraries\n",
    "\n",
    "import pandas as pd\n",
    "import csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "ad4614dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# creating empty lists to hold all of our data\n",
    "\n",
    "states = []\n",
    "titles = []\n",
    "captions = []\n",
    "categories = []\n",
    "descriptions = []\n",
    "\n",
    "# extracting the data from the bill cards\n",
    "for item in bill_cards:\n",
    "    state, title = item.h3.text.split(' ') # adding the extra step to split the bill name into state and title items\n",
    "    caption = item.h2.text\n",
    "    category = item.find('span').text\n",
    "    description = item.p.text\n",
    "    \n",
    "    # adding the items to the empty lists\n",
    "    states.append(state)\n",
    "    titles.append(title)\n",
    "    captions.append(caption)\n",
    "    categories.append(category)\n",
    "    descriptions.append(description)\n",
    "    # remember that \"legiscan_links\" is already saved as a list, so we don't have to create it here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "65381e18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# creating a dataframe, with separate columns to hold each of our lists\n",
    "df = pd.DataFrame(\n",
    "    {'state': states,\n",
    "     'title': titles,\n",
    "     'caption': captions,\n",
    "     'category': categories,\n",
    "     'description': descriptions,\n",
    "     'legiscan link': legiscan_links\n",
    "    })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "6003a39d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>state</th>\n",
       "      <th>title</th>\n",
       "      <th>caption</th>\n",
       "      <th>category</th>\n",
       "      <th>description</th>\n",
       "      <th>legiscan link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AL</td>\n",
       "      <td>HB261</td>\n",
       "      <td>Relating to two-year and four-year public inst...</td>\n",
       "      <td>SPORTS</td>\n",
       "      <td></td>\n",
       "      <td>https://legiscan.com/AL/text/HB261/id/2817698</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AL</td>\n",
       "      <td>SB261</td>\n",
       "      <td>Relating to public contracts; to prohibit gove...</td>\n",
       "      <td>OTHER</td>\n",
       "      <td></td>\n",
       "      <td>https://legiscan.com/AL/text/SB261/id/2821857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AR</td>\n",
       "      <td>HB1156</td>\n",
       "      <td>Concerning A Public School District Or Open-en...</td>\n",
       "      <td>BATHROOM</td>\n",
       "      <td></td>\n",
       "      <td>https://legiscan.com/AR/text/HB1156/id/2756961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AR</td>\n",
       "      <td>HB1468</td>\n",
       "      <td>To Create The Given Name Act; And To Prohibit ...</td>\n",
       "      <td>EDUCATION</td>\n",
       "      <td></td>\n",
       "      <td>https://legiscan.com/AR/text/HB1468/id/2781770</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AR</td>\n",
       "      <td>HB1615</td>\n",
       "      <td>To Create The Conscience Protection Act; And T...</td>\n",
       "      <td>OTHER</td>\n",
       "      <td></td>\n",
       "      <td>https://legiscan.com/AR/text/HB1615/id/2781807</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  state   title                                            caption   category  \\\n",
       "0    AL   HB261  Relating to two-year and four-year public inst...     SPORTS   \n",
       "1    AL   SB261  Relating to public contracts; to prohibit gove...      OTHER   \n",
       "2    AR  HB1156  Concerning A Public School District Or Open-en...   BATHROOM   \n",
       "3    AR  HB1468  To Create The Given Name Act; And To Prohibit ...  EDUCATION   \n",
       "4    AR  HB1615  To Create The Conscience Protection Act; And T...      OTHER   \n",
       "\n",
       "  description                                   legiscan link  \n",
       "0               https://legiscan.com/AL/text/HB261/id/2817698  \n",
       "1               https://legiscan.com/AL/text/SB261/id/2821857  \n",
       "2              https://legiscan.com/AR/text/HB1156/id/2756961  \n",
       "3              https://legiscan.com/AR/text/HB1468/id/2781770  \n",
       "4              https://legiscan.com/AR/text/HB1615/id/2781807  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# checking the first 5 lines of the dataframe\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "6f63d6e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# saving the dataframe as a csv file\n",
    "\n",
    "df.to_csv('bill_data.csv')"
   ]
  },
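  {
   "cell_type": "markdown",
   "id": "e5a08b16",
   "metadata": {},
   "source": [
    "We imported `csv` earlier but let `pandas` do the saving. For comparison, here is a minimal sketch of writing rows with Python's built-in `csv` module instead. The names `sample_rows`, `buffer`, and `csv_text` are made up for this example, the two rows are copied from the `df.head()` output above, and the text is written to an in-memory buffer rather than to a real file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d7c4f9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import io\n",
    "\n",
    "# two sample rows copied from the dataframe preview (hypothetical variable name)\n",
    "sample_rows = [\n",
    "    {'state': 'AL', 'title': 'HB261', 'category': 'SPORTS'},\n",
    "    {'state': 'AL', 'title': 'SB261', 'category': 'OTHER'},\n",
    "]\n",
    "\n",
    "# an in-memory text buffer stands in for a file here\n",
    "# to write a real file, use a file opened with open(..., 'w', newline='') in place of buffer\n",
    "buffer = io.StringIO()\n",
    "writer = csv.DictWriter(buffer, fieldnames=['state', 'title', 'category'])\n",
    "writer.writeheader()            # first line: the column names\n",
    "writer.writerows(sample_rows)   # one line per dictionary\n",
    "\n",
    "csv_text = buffer.getvalue()\n",
    "print(csv_text)"
   ]
  },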
  {
   "cell_type": "markdown",
   "id": "e367833f",
   "metadata": {},
   "source": [
    "And that's all! If you are on google colab, check your sidebar under the \"files\" tab. You should see a .csv file containing the data we've scraped from the `translegislation.com` website. Well done!\n",
    "\n",
    "In the next section, we will look at an API method for getting legislative data, and save that data to a CSV file. In that activity, you'll see the differences in handling data acrossn web scraping and API methods."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}