cleaning our text

In this section we will clean the bill text that we scraped in the previous section. We will do this in the following steps:

  1. get our text (if we don’t already have it loaded up), either by scraping it again or by loading the file we saved in the previous section.

  2. inspect our text to identify elements that we want to clean

  3. write loops to remove these elements from the text

  4. learn about functions so we can write one to clean our text in an automatic way

  5. keep improving the function to clean more and more elements

# run the lines below to load up the text from the course website

import requests
source = requests.get('https://bit.ly/transgender_text')
text = source.content
text[:100]
b'<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n'
text = text.decode('utf-8')
# alternatively, uncomment the bottom four lines to load it from your own space
# notice that the data is already in a string format.

# load = open('sample.txt')
# loaded_text = load.read()
# load.close()
# loaded_text[:100]
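
Why do we need `decode` at all? The scraped content arrives as bytes (notice the `b` in front of the output), while the file we read from disk is already a string. Here is a minimal sketch of the difference, using a short made-up stand-in (`raw`) instead of the real download:

```python
# A small stand-in for the bytes that requests gives us in source.content.
raw = b'<html><body><pre>\n[Congressional Bills 117th Congress]\n'

print(type(raw))        # <class 'bytes'>

# decode() turns the bytes into a regular string using the UTF-8 encoding.
decoded = raw.decode('utf-8')

print(type(decoded))    # <class 'str'>
print(decoded[:17])     # <html><body><pre>
```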

inspecting our text

Remember slicing? Take some slices of the text to see what elements we want to clean. Come up with a list of things that we want to remove.

text[:1000]
'<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n  1st Session\n                                H. R. 1112\n\n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n\n_______________________________________________________________________\n\n\n                    IN THE HOUSE OF REPRESENTATIVES\n\n                           February 18, 2021\n\n    Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n  Buchanan) introduced the following bill; which was referred to the \n                      Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n                                 A BILL\n\n\n \n   To require a report on the military coup in Burma, and for other \n                               purposes.\n\n    Be it enacted by the Senate and House of Representatives of the'
text[3000:4000]
"due to the Burmese military's \n        actions directly threatens the democratic trajectory of Burma's \n        Parliament, and thereby the country;\n            (3) the will and determination of those duly-elected \n        Members of Parliament who are taking it upon themselves to \n        continue serving as representatives of the people through \n        alternative methods of communicating and convening should be \n        lauded; and\n            (4) by preventing the Parliament from completing its work, \n        the Burmese military has rendered impossible and effectively \n        nullified the international collaborative relationships that \n        have supported and strengthened the institution, including the \n        Burmese parliament's partnership with HDP.\n\nSEC. 4. STATEMENT OF POLICY.\n\n    It is the policy of the United States to--\n            (1) engage with the Association of Southeast Asian Nations \n        (ASEAN) and ASEAN member states to--\n                    (A) condem"

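Slicing works well for a first look, but it only shows one window of the text at a time. As a rough complement, we could count the characters that are not letters, digits, or single spaces to get candidates for our list. The sketch below runs on a short made-up `sample` rather than the full bill, and the name `odd_characters` is just for illustration:

```python
from collections import Counter

# A short sample in the same shape as the bill text (made up for this example).
sample = '<html><body><pre>\n[Congressional Bills 117th Congress]\n____\n\n   A BILL\n'

# Count every character that is not a letter, a digit, or a single space.
odd_characters = Counter(ch for ch in sample if not ch.isalnum() and ch != ' ')

# Show the most frequent candidates; repr() makes '\n' visible.
for character, count in odd_characters.most_common(5):
    print(repr(character), count)
```

Swapping `sample` for `text` would give the same kind of tally for the whole bill.
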
looping through the text to replace() it

These are the elements we want to remove, along with the large runs of blank space:

\n
/n
\\n
_
[
]
<html><body><pre>
<html><body><pre>

When you have a lot of items to remove at once, it’s best to put them into a list. Then we can write a loop that goes through each item in the “take out” list to see if it’s in the text data. If it is, we will replace that item with an empty string, which removes it from the text.

to_take_out = ['\n', '/n', '\\n', '_', '[', ']', '<html><body><pre>', '<html><body><pre>', '  ']
for item in to_take_out:
    if item in text:
        # here is a complicated line of code: 
        # we are replacing the item with nothing, indicated by two quotes 
        # then we are saving those results to "text", effectively overwriting
        # the variable. 
        text = text.replace(item, '')
text[:1000]
"Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeH.R. 1112 Introduced in House (IH)&lt;DOC&gt;117th CONGRESS1st SessionH. R. 1112 To require a report on the military coup in Burma, and for otherpurposes.IN THE HOUSE OF REPRESENTATIVES February 18, 2021Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs A BILLTo require a report on the military coup in Burma, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1. SHORT TITLE.This Act may be cited as the ``Protect Democracy in Burma Act of 2021''.SEC. 2. FINDINGS.Congress finds the following:(1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work di"

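To see exactly what the loop does, it can help to run it on a tiny string first. The sketch below uses a made-up `sample` and a shortened take-out list. It shows two things: the `if` check is optional, because `replace()` simply leaves the text alone when the item is not there, and removing `\n` without putting anything in its place glues neighboring words together (this is why the output above contains "CongressFrom" and "otherpurposes").

```python
sample = '[From the U.S.]\nfor other \n   purposes.\n'

to_take_out = ['\n', '_', '[', ']', '  ']

# Same loop as above, on a tiny sample so the effect is easy to see.
looped = sample
for item in to_take_out:
    if item in looped:
        looped = looped.replace(item, '')

# Without the "if" check: replace() does nothing when the item is absent.
unchecked = sample
for item in to_take_out:
    unchecked = unchecked.replace(item, '')

print(looped)                # From the U.S.for otherpurposes.
print(looped == unchecked)   # True
```
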
function to automate cleaning

Let’s say we want to do this to many bits of text, not just one. We could automate the work by writing a function that can run on as many texts as we want.

Functions have two key components: the definition and the call. You first define the function and what it does, then you “call” it to get it to work on a particular piece of data.

Let’s start with the definition. First, you name the function, and include parentheses for your parameters (more on this in a moment). Then, in the body of the definition, you write whatever Python code you want to execute for that function. Finally, you have a return statement that hands the result back, or “returns” it, to wherever the function was called.

def add(x,y):
    answer = x + y
    return answer

Then we call the function.

add(5, 10021)
10026

The basic idea is that the input data, whatever data you want the function to work with, goes inside the parentheses when you call the function. Within the body of the function definition, that input data (known formally as “arguments”) gets assigned to the variables named in the definition (known formally as “parameters”).

This makes functions portable, so to speak, as you can write one, then call it using whatever input data that you like.
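
Here is a short illustration of that portability, reusing the same `add` definition with different inputs. The input values are made up for the example:

```python
def add(x, y):
    answer = x + y
    return answer

# x and y take on whatever values are passed in the call.
total = add(2, 3)
print(total)                    # 5
print(add(1.5, 2.5))            # 4.0
print(add('bill ', 'text'))     # bill text

# The returned value can be saved and passed into another call.
print(add(total, 10))           # 15
```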

What would a function for our text cleaner look like?

# remove everything in the "take out" list by writing a
# loop that replaces each item with an empty string, ''
def clean_up(data):
    to_take_out = ['\n', '/n', '\\n', '_', '[', ']', '<html><body><pre>', '<html><body><pre>', '  ']
    for item in to_take_out:
        if item in data:
            data = data.replace(item, '')
    return data
cleaned = clean_up(text)
cleaned[:1000]
"Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeH.R. 1112 Introduced in House (IH)&lt;DOC&gt;117th CONGRESS1st SessionH. R. 1112 To require a report on the military coup in Burma, and for otherpurposes.IN THE HOUSE OF REPRESENTATIVES February 18, 2021Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs A BILLTo require a report on the military coup in Burma, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1. SHORT TITLE.This Act may be cited as the ``Protect Democracy in Burma Act of 2021''.SEC. 2. FINDINGS.Congress finds the following:(1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work di"

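There is still room to improve: words are glued together where line breaks used to be, and `&lt;DOC&gt;` is still in the output. One possible next version is sketched below. It is only an illustration, not the definitive cleaner: the name `clean_up_spaced` is made up, it swaps the take-out list for the standard `html` and `re` modules, and it assumes that anything between `<` and `>` is a tag we don’t want, which may not hold for every text.

```python
import html
import re

def clean_up_spaced(data):
    # Hypothetical variant of clean_up; the name and steps are illustrative.
    # Turn HTML entities such as &lt; and &gt; back into < and >.
    data = html.unescape(data)
    # Replace anything that looks like a tag, e.g. <html>, <pre>, <DOC>.
    data = re.sub(r'<[^>]+>', ' ', data)
    # Replace brackets and underscores with a space instead of nothing.
    for item in ['[', ']', '_']:
        data = data.replace(item, ' ')
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    data = re.sub(r'\s+', ' ', data)
    return data.strip()

# A short sample in the same shape as the start of the bill text.
sample = '<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n\n&lt;DOC&gt;\n\n117th CONGRESS\n  1st Session\n'
print(clean_up_spaced(sample))
```

Because every removed item is replaced with a space and the extra spaces are collapsed at the end, the words stay separated.
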
Finally, we save our text. And that’s it!

with open('clean_sample.txt', 'w') as f:
    f.write(cleaned)
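
If you want to double-check that the save worked, one option is to read the file back and compare it with what you wrote. The sketch below does this in a temporary folder with a made-up stand-in string (`cleaned_sample`), so it does not touch your real `clean_sample.txt`:

```python
import os
import tempfile

cleaned_sample = 'Congressional Bills 117th Congress'  # stand-in for `cleaned`

# Write into a temporary folder so no real file gets overwritten.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'clean_sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(cleaned_sample)
    # Read the file back and compare it with what we wrote.
    with open(path, encoding='utf-8') as f:
        reloaded = f.read()

print(reloaded == cleaned_sample)   # True
```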