chapter overview

learning objectives

This chapter explores approaches for programmatically extracting information from websites using web scraping and APIs. Participants will practice gathering data from the web about politics, art, news, and more: for example, scraping data about congressional legislation and querying the Metropolitan Museum of Art API. They will work primarily with the requests and bs4 Python libraries for gathering data and the pandas library for organizing and saving that data. They will also learn principles of object-oriented programming that help in gathering and parsing HTML and JSON data.

This chapter introduces:

  • an overview of HTML and how to use web browser developer tools

  • the differences between web scraping and APIs, and which methods are appropriate for specific use cases

  • ethical and legal considerations for data gathering, use, and publication

  • Python libraries for data gathering and simple analysis, like bs4, requests, and pandas

  • object-oriented programming syntax for parsing text-based data

These lessons will not make you an expert at web scraping or APIs, but they will create a solid foundation for you to begin gathering data from the web, even if you are a total beginner.
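As a rough preview of the difference between scraping HTML and reading JSON from an API, here is a minimal sketch. The chapter itself uses bs4 and requests; this sketch swaps in the standard library's html.parser and json modules so that it runs without extra installs or an internet connection. The HTML snippet, the JSON payload, and the BillParser class are all hypothetical stand-ins, not real data from Congress or the Metropolitan Museum of Art.

```python
import json
from html.parser import HTMLParser

# Hypothetical HTML standing in for a scraped web page (not real data).
SAMPLE_HTML = """
<ul>
  <li class="bill">H.R. 1: Example Act</li>
  <li class="bill">S. 2: Sample Resolution</li>
</ul>
"""

# Hypothetical JSON standing in for an API response (not real data).
SAMPLE_JSON = '{"objectID": 1, "title": "Example Painting", "artist": "Unknown"}'


class BillParser(HTMLParser):
    """Collect the text of every <li class="bill"> element."""

    def __init__(self):
        super().__init__()
        self.in_bill = False  # True while we are inside a matching <li>
        self.bills = []       # text collected so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "bill") in attrs:
            self.in_bill = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_bill = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_bill and text:
            self.bills.append(text)


# Scraping: the data is buried in markup, so we have to dig it out.
parser = BillParser()
parser.feed(SAMPLE_HTML)
parser.close()
bills = parser.bills

# API: the data already arrives structured, so one call turns it into a dict.
record = json.loads(SAMPLE_JSON)

print(bills)
print(record["title"])
```

The scraping half needs a class with custom methods just to find two list items, while the API half is a single function call. That contrast is a large part of why an API, when one exists, is usually the easier route.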

Python environments

There are many ways to use Python. For this workshop, we will be using Jupyter Notebooks, installed through the Anaconda distribution of Python. This option is convenient because it creates a “local” version of Python directly on your computer, which means you can use it in multiple ways and without an internet connection.

If you cannot install Python on your computer, you can use Google Colab, a browser-based tool for running Python code. Like Google Docs, Google Colab provides a collaborative environment, hosted on Google's cloud, for authoring content. Whereas most Python environments require installations (some of which can be really complicated), Google Colab comes with Python pre-installed in its cloud environment, so new users can jump right into programming.
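Whichever environment you choose, it can help to confirm that it is ready before starting. The following is a minimal sketch, not an official setup check: the helper name missing_packages is hypothetical, and testing for the google.colab module is a common heuristic for detecting Colab rather than a documented guarantee.

```python
import importlib.util
import sys

# Libraries named in this chapter; "bs4" is the import name for Beautiful Soup.
REQUIRED = ["requests", "bs4", "pandas"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Heuristic (an assumption): Colab kernels load the google.colab module on startup.
in_colab = "google.colab" in sys.modules

print("Python version:", sys.version.split()[0])
print("Running in Colab:", in_colab)
print("Missing packages:", missing_packages(REQUIRED))
```

If the last line prints an empty list, all three libraries can be imported. Otherwise, the listed packages need to be installed before working through the lessons.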