cleaning our text

In this section we will clean the bill text that we scraped in the last section. We will do this in the following steps:

  1. get our text (if we don’t already have it loaded up), either by scraping it or by loading the file saved in the previous section.

  2. inspect our text to identify elements that we want to clean

  3. write loops to remove these elements from the text

  4. learn about functions so we can write one to clean our text in an automatic way

  5. keep improving the function to clean more and more elements

Then, at the end, we will look at other tools (ChatGPT and OpenRefine) to help with the cleaning process.

# run the lines below to load up the text from the course website

import requests
source = requests.get('https://bit.ly/transgender_text')
text = source.content
text[:100]
b'<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n'
type(text)
bytes
text = text.decode('utf-8')
# alternatively, uncomment the bottom four lines to load it from your own space
# notice that the data is already in a string format.

# load = open('sample.txt')
# loaded_text = load.read()
# load.close()
# loaded_text[:100]
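
The decode step matters because the scraped content arrives as bytes, while the cleaning that follows works on strings. Here is a minimal sketch of the difference, using a made-up bytes value (`raw`) so that it runs without a network connection. The variable names are illustrative assumptions, not part of the lesson's code.

```python
# `raw` is a made-up stand-in for source.content, which is a bytes object
raw = b'<html><body><pre>\n[Congressional Bills 117th Congress]\n'

# decode() turns the bytes into a regular Python string
as_string = raw.decode('utf-8')
print(type(raw).__name__, type(as_string).__name__)

# once decoded, string methods such as replace() accept ordinary string arguments
without_newlines = as_string.replace('\n', '')
print(without_newlines)
```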

inspecting our text

Remember slicing? Take some slices of the text to see what elements we want to clean. Come up with a list of things that we want to remove.
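
As a rough sketch of how to inspect a slice, the example below uses a short made-up string (`sample`) rather than the real bill text. It shows the slice in its raw form with escape codes such as `\n`, prints it with the line breaks rendered, and counts how often a candidate element occurs. All names here are illustrative assumptions.

```python
# `sample` is a made-up stand-in for the scraped bill text
sample = '<html><body><pre>\n[Congressional Bills 117th Congress]\n\n117th CONGRESS\n  1st Session\n'

# repr() shows the escape codes, the same way a notebook displays a bare slice
raw_view = repr(sample[:40])
print(raw_view)

# print() renders each \n as a real line break
print(sample[:40])

# count how often a candidate element occurs before deciding to remove it
newline_count = sample.count('\n')
bracket_count = sample.count('[')
print(newline_count, bracket_count)
```

Counting occurrences first is one way to check that an element is worth adding to the list of things to remove.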

text[:1000]
'<html><body><pre>\n[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n\n\n\n\n\n117th CONGRESS\n 1st Session\n H. R. 1112\n\n To require a report on the military coup in Burma, and for other \n purposes.\n\n\n_______________________________________________________________________\n\n\n IN THE HOUSE OF REPRESENTATIVES\n\n February 18, 2021\n\n Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. \n Buchanan) introduced the following bill; which was referred to the \n Committee on Foreign Affairs\n\n_______________________________________________________________________\n\n A BILL\n\n\n \n To require a report on the military coup in Burma, and for other \n purposes.\n\n Be it enacted by the Senate and House of Representatives of the'
text[3000:4000]
"due to the Burmese military's \n actions directly threatens the democratic trajectory of Burma's \n Parliament, and thereby the country;\n (3) the will and determination of those duly-elected \n Members of Parliament who are taking it upon themselves to \n continue serving as representatives of the people through \n alternative methods of communicating and convening should be \n lauded; and\n (4) by preventing the Parliament from completing its work, \n the Burmese military has rendered impossible and effectively \n nullified the international collaborative relationships that \n have supported and strengthened the institution, including the \n Burmese parliament's partnership with HDP.\n\nSEC. 4. STATEMENT OF POLICY.\n\n It is the policy of the United States to--\n (1) engage with the Association of Southeast Asian Nations \n (ASEAN) and ASEAN member states to--\n (A) condem"

looping through the text to replace() it

These are the elements we want to clean, as well as the large blank spaces:

\n
/n
\\n
_
[
]
<html><body><pre>
<html><body><pre>

When you have a lot of items to remove at once, it’s best to put them into a list. Then we can write a loop that goes through each item in the “take out” list to see if it’s in the text data. If it is, we will replace that item with an empty string, which removes it from the text.

to_take_out = ['\n', '/n', '\\n', '_', '[', ']', '<html><body><pre>', '<html><body><pre>', '  ']
for item in to_take_out:
    if item in text:
        # here is a complicated line of code: 
        # we are replacing the item with nothing, indicated by two quotes 
        # then we are saving those results to "text", effectively overwriting
        # the variable. 
        text = text.replace(item, '')
text[:1000]
"Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeH.R. 1112 Introduced in House (IH)&lt;DOC&gt;117th CONGRESS1st SessionH. R. 1112 To require a report on the military coup in Burma, and for otherpurposes.IN THE HOUSE OF REPRESENTATIVES February 18, 2021Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs A BILLTo require a report on the military coup in Burma, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1. SHORT TITLE.This Act may be cited as the ``Protect Democracy in Burma Act of 2021''.SEC. 2. FINDINGS.Congress finds the following:(1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work di"

challenge: replacing with spaces

Notice that now some of the words in our dataset are joined with other words. That’s because we took out the new lines. How could we change the code so that we keep spaces between words?
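
One possible answer to the challenge, sketched on a made-up sample: replace each unwanted item with a single space instead of an empty string, then collapse the runs of spaces that this leaves behind. The `sample` string and the shortened `to_take_out` list are illustrative assumptions, not the lesson's dataset.

```python
# made-up sample standing in for the bill text
sample = '[Congressional Bills 117th Congress]\n[From the U.S. Government Publishing Office]\n  1st Session\n'

to_take_out = ['\n', '_', '[', ']']
spaced = sample
for item in to_take_out:
    # replace with one space so neighbouring words stay separated
    spaced = spaced.replace(item, ' ')

# split() with no arguments breaks on any run of whitespace,
# and ' '.join() glues the words back together with single spaces
spaced = ' '.join(spaced.split())
print(spaced)
```

Collapsing whitespace at the end also means the two-space item no longer needs to be in the list.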

function to automate cleaning

Let’s say we want to do this to many bits of text, not just one. We could automate the work by writing a function that can run on as many texts as we want.

Functions have two key components: the definition and the call. You first define the function and what it does, then you “call” it to get it to work on a particular piece of data.

Let’s start with the definition. First, you name the function and include parentheses for your parameters (more on this in a moment). Then, in the body of the definition, you write whatever Python code you want the function to execute. Finally, a return statement hands the result back, or “returns” it, to the code that called the function.

def add(x,y):
    answer = x + y
    return answer

Then we call the function.

add(5, 10021)
10026

The basic idea is that the input data, whatever data you want the function to work with, goes inside the parentheses when you call the function. Within the body of the function definition, each input value (formally an “argument”) gets assigned to the matching variable named in the definition (formally a “parameter”).

This makes functions portable, so to speak, as you can write one, then call it using whatever input data that you like.
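
To illustrate that portability, here is a small sketch that reuses the same `add` definition with two different kinds of input. The variable names `total` and `joined` are illustrative assumptions.

```python
def add(x, y):
    # x and y are the names used inside the function for whatever is passed in
    answer = x + y
    return answer

# 5 and 10021 are the actual values handed to the function in this call
total = add(5, 10021)

# the same definition works on other inputs, including strings
joined = add('Burma ', 'Act')

print(total, joined)
```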

What would a function for our text cleaner look like?

# remove all the items in the "take out" list by writing a
# loop that replaces each item with an empty string, ''
def clean_up(data):
    to_take_out = ['\n', '/n', '\\n', '_', '[', ']', '<html><body><pre>', '<html><body><pre>', '  ']
    for item in to_take_out:
        if item in data:
            data = data.replace(item, '')
    return data
cleaned = clean_up(text)
cleaned[:1000]
"Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeH.R. 1112 Introduced in House (IH)&lt;DOC&gt;117th CONGRESS1st SessionH. R. 1112 To require a report on the military coup in Burma, and for otherpurposes.IN THE HOUSE OF REPRESENTATIVES February 18, 2021Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs A BILLTo require a report on the military coup in Burma, and for otherpurposes.Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,SECTION 1. SHORT TITLE.This Act may be cited as the ``Protect Democracy in Burma Act of 2021''.SEC. 2. FINDINGS.Congress finds the following:(1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work di"

Finally, we save our text. And that’s it!

with open('clean_sample.txt', 'w') as f:
    f.write(cleaned)
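
The last step of our plan was to keep improving the function. The cleaned output above still contains the entity `&lt;DOC&gt;`, and some words are joined together. One possible next iteration is sketched below. The name `clean_up_v2`, its optional `to_take_out` parameter, and the `sample` string are hypothetical additions for illustration only.

```python
from html import unescape

def clean_up_v2(data, to_take_out=None):
    """Hypothetical next iteration of clean_up(): same loop, plus entity decoding."""
    if to_take_out is None:
        to_take_out = ['\n', '_', '[', ']', '<html><body><pre>']
    for item in to_take_out:
        # replace with a space so words on either side are not joined
        data = data.replace(item, ' ')
    # turn HTML entities such as &lt; and &gt; back into < and >
    data = unescape(data)
    # collapse any run of whitespace into a single space
    return ' '.join(data.split())

# made-up sample standing in for the bill text
sample = '<html><body><pre>\n[H.R. 1112 Introduced in House (IH)]\n\n&lt;DOC&gt;\n\n117th CONGRESS\n'
cleaned_sample = clean_up_v2(sample)
print(cleaned_sample)
```

Making the list a parameter means the same function can be called with a different "take out" list for a different kind of text.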

cleaning with chatgpt

prompt: clean all of the blank spaces and extraneous characters from this text with python.

import re
from html import unescape

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Unescape HTML entities
    text = unescape(text)
    
    # Remove extra blank lines and whitespace
    text = re.sub(r'\n\s*\n', '\n', text)  # Remove multiple newlines
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize spaces

    return text
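# note: run this on the freshly loaded text (re-run the loading cell first);
# the loop above already overwrote `text`, and the output below still shows
# the brackets and underscores that the loop would have removed
text = clean_text(text)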
text[:10000]
"[Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 1112 Introduced in House (IH)] <DOC> 117th CONGRESS 1st Session H. R. 1112 To require a report on the military coup in Burma, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES February 18, 2021 Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs _______________________________________________________________________ A BILL To require a report on the military coup in Burma, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Protect Democracy in Burma Act of 2021''. SEC. 2. FINDINGS. Congress finds the following: (1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work directly with parliaments around the world to support the development of effective, independent, and responsive legislative institutions. (2) HDP approved a legislative strengthening partnership with Burma in 2016 and organized the first congressional delegation to meet with the new civilian-led government, led by State Counselor Aung San Suu Kyi, and civil society leaders in May 2016. (3) On February 2, 2021, the U.S. Department of State assessed that Daw Aung San Suu Kyi, the leader of Burma's ruling party, and President Win Myint, the duly elected head of government, were deposed in a military coup on February 1, 2021. 
(4) As part of the military coup, the Burmese military declared martial law, suspended the civilian-led government, and detained newly elected Members of Parliament in the capitol, Naypyidaw, thereby usurping the role of the democratically elected government and parliament. SEC. 3. SENSE OF CONGRESS. It is the sense of Congress that-- (1) due to the Burmese military's seizure of government through the detention of State Counsellor Aung San Suu Kyi, President Win Myint, and other government leaders, Burma is not represented by a democratically elected government; (2) the inability of newly elected Members of Parliament to begin their official mandate due to the Burmese military's actions directly threatens the democratic trajectory of Burma's Parliament, and thereby the country; (3) the will and determination of those duly-elected Members of Parliament who are taking it upon themselves to continue serving as representatives of the people through alternative methods of communicating and convening should be lauded; and (4) by preventing the Parliament from completing its work, the Burmese military has rendered impossible and effectively nullified the international collaborative relationships that have supported and strengthened the institution, including the Burmese parliament's partnership with HDP. SEC. 4. STATEMENT OF POLICY. It is the policy of the United States to-- (1) engage with the Association of Southeast Asian Nations (ASEAN) and ASEAN member states to-- (A) condemn the military coup in Burma; (B) urge the unconditional release of detained democratically elected leaders and civil society members; and (C) support a return to Burma's democratic transition; and (2) instruct, as appropriate, representatives of the United States Government to use the voice, vote, and influence of the United States at the United Nations to hold accountable those responsible for the military coup in Burma. SEC. 5. REPORT. 
Not later than 90 days after the date of the enactment of this Act, the Secretary of State shall submit to the Committee on Foreign Affairs and the Committee on Appropriations of the House of Representatives and the Committee on Foreign Relations and the Committee on Appropriations of the Senate a report on the military coup in Burma, including a description of efforts to implement the policy specified in section 4. <all> [Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 435 Introduced in House (IH)] <DOC> 117th CONGRESS 1st Session H. R. 435 To exclude from tax certain payments of Federal pandemic unemployment compensation, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES January 21, 2021 Ms. Velazquez (for herself, Ms. Norton, Ms. Williams of Georgia, Mr. Smith of Washington, Ms. Jayapal, Mr. Meeks, Mr. Welch, Ms. Schakowsky, Ms. Kaptur, Ms. Tlaib, Ms. Wild, Ms. Newman, Ms. Lee of California, Ms. Meng, and Mr. Nadler) introduced the following bill; which was referred to the Committee on Ways and Means _______________________________________________________________________ A BILL To exclude from tax certain payments of Federal pandemic unemployment compensation, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Excluding Pandemic Unemployment Compensation from Income Act''. SEC. 2. CERTAIN PAYMENTS OF FEDERAL PANDEMIC UNEMPLOYMENT COMPENSATION NOT INCLUDIBLE IN GROSS INCOME OR TAKEN INTO ACCOUNT IN DETERMINING CERTAIN MEANS-TESTED BENEFITS. 
(a) Exclusion From Gross Income.--For purposes of the Internal Revenue Code of 1986, gross income shall not include the amount specified in section 2104(b)(3) of the CARES Act to the extent such amount is received by the taxpayer pursuant to section 2102, 2104, or 2107 of such Act. (b) Disregarded in the Administration of Federal Programs and Federally Assisted Programs.--For purposes of section 6409 of the Internal Revenue Code of 1986, any amount excluded from gross income under subsection (a) shall be treated in the same manner as a refund under such Code. (c) Effective Dates.-- (1) Exclusion.--Subsection (a) shall apply to taxable years ending after the date of the enactment of the CARES Act. (2) Disregard.--Subsection (b) shall apply to amounts received after the date of the enactment of the CARES Act. <all> [Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 886 Introduced in House (IH)] <DOC> 117th CONGRESS 1st Session H. R. 886 To provide funds through the Social Services Block Grant program for diaper assistance. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES February 5, 2021 Ms. Lee of California (for herself, Mr. Rush, Mr. Carson, Ms. Meng, Mr. Sires, and Mrs. Hayes) introduced the following bill; which was referred to the Committee on Ways and Means _______________________________________________________________________ A BILL To provide funds through the Social Services Block Grant program for diaper assistance. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``COVID-19 Diaper Assistance Act''. SEC. 2. FINDINGS. Congress finds the following: (1) Infants need between 6 to 12 diapers a day. Infants and toddlers lacking access to clean diapers have a greater risk for health complications, which require costly and difficult to access medical care. 
(2) Prior to the COVID-19 pandemic, surveys indicated that 1 in 3 families in the United States with young children could not afford an adequate supply of diapers to keep their child clean, dry, and healthy. (3) Low-income families pay an even higher than average price for diapers because they do not have access to money saving alternatives such as bulk buying or online shopping that can reduce costs. Low-income families spend about 14 percent of their income on diapers. (4) Clean diaper access presents a financial and stressful burden on families' dependent on child care services. Research indicates that when diaper need is a barrier to child care, parents miss an average of 4 workdays a month. (5) The COVID-19 pandemic has further raised demand for diapers and strained diaper banks' ability to meet low-income families' needs. Prior to the COVID-19 pandemic, nonprofit diaper bank distribution only met 5 to 6 percent of diaper need. Since the start of the COVID-19 pandemic, diaper banks around the country have experienced double, triple, or greater increase in demand for diapers due to the pandemic and economic shutdown. Nonprofits alone cannot fully address this public health crisis that impacts 1 in 3 families in the United States. SEC. 3. TARGETED FUNDING FOR STATES FOR DIAPER ASSISTANCE THROUGH THE SOCIAL SERVICES BLOCK GRANT PROGRAM. (a) Increase in Funding for Social Services Block Grant Program.-- (1) In general.--The amount specified in subsection (c) of section 2003 of the Social Security Act (42 U.S.C. 1397b) for purposes of subsections (a) and (b) of such section is deemed to be $1,900,000,000 for fiscal year 2021, of which $200,000,000 shall be obligated by States in accordance with subsection (b). (2) Appropriation.--Out of any money in the Treasury of the United States not otherwise appropriated, there are appropriated $200,000,000, which shall be available for payments under section 2002 of the Social Security Act (42 U.S.C. 
1397a), which shall remain available for expenditure through December 31, 2022. (3) Deadline for distribution of funds.--Within 45 days after the date of enactment of this Act, the Secretary of Health and Human Services shall distribute the funds made available by this subsection, which shall be made available to States on an emergency basis for immediate obligation and expenditure. (4) Submission of revised pre-expenditure report.--Within 90 days after a State receives funds made available by this subsection, the State shall submit to the Secretary of Health and Human "
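
To see what each step of `clean_text` contributes, the sketch below applies the same three operations one at a time to a short made-up sample. The `sample` string and the intermediate variable names are illustrative assumptions.

```python
import re
from html import unescape

# made-up sample standing in for the raw bill text
sample = '<html><body><pre>\n[H.R. 1112]\n\n&lt;DOC&gt;\n\n\n117th CONGRESS\n   1st Session\n'

# step 1: the non-greedy pattern <.*?> matches each tag separately
no_tags = re.sub(r'<.*?>', '', sample)

# step 2: entities such as &lt;DOC&gt; become literal <DOC>
# (this happens after step 1, so the decoded <DOC> is not stripped as a tag)
decoded = unescape(no_tags)

# step 3: \s+ matches any run of spaces, tabs, or newlines
normalized = re.sub(r'\s+', ' ', decoded).strip()

print(repr(no_tags))
print(repr(decoded))
print(normalized)
```

The order of steps 1 and 2 explains why `<DOC>` and `<all>` survive in the output above: they only become tag-like after the tags have already been removed.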

follow-up prompt:

This text is a collection of legislative bills. Now write me some code that separates the long string of text into individual bills.

def split_bills(text):
    # Define a regex pattern to detect the start of a new bill
    bill_pattern = re.compile(r'(?=\d{3}th CONGRESS\s+\d{1,2}st Session\s+H\. R\. \d+)', re.MULTILINE)
    
    # Split the text using the pattern
    bills = bill_pattern.split(text)
    
    # Remove empty entries and strip whitespace
    bills = [bill.strip() for bill in bills if bill.strip()]
    
    return bills
bills = split_bills(text)
type(bills)
list
len(bills)
11
bills[0]
'[Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 1112 Introduced in House (IH)] <DOC>'
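
Notice that `bills[0]` holds only a header. The pattern splits at `117th CONGRESS`, so each header ends up attached to the end of the previous piece. One possible adjustment, sketched below, is to split just before the `[Congressional Bills ...]` header instead. The function name `split_on_header` and the shortened `sample` text are hypothetical. The pattern assumes every header reads like the ones in this dataset (a number followed by `th Congress`), and splitting on a zero-width match assumes Python 3.7 or later.

```python
import re

def split_on_header(text):
    """Hypothetical alternative to split_bills(): split before each bill header."""
    # (?=...) is a lookahead: it finds the position just before the header
    # without consuming it, so the header stays attached to its own bill
    # (splitting on an empty match like this needs Python 3.7+)
    header = re.compile(r'(?=\[Congressional Bills \d+th Congress\])')
    pieces = header.split(text)
    # drop empty pieces and trim surrounding whitespace
    return [piece.strip() for piece in pieces if piece.strip()]

# shortened, made-up sample in the same shape as the cleaned text
sample = (
    '[Congressional Bills 117th Congress] [H.R. 1112 Introduced in House (IH)] '
    '<DOC> 117th CONGRESS 1st Session H. R. 1112 To require a report. <all> '
    '[Congressional Bills 117th Congress] [H.R. 435 Introduced in House (IH)] '
    '<DOC> 117th CONGRESS 1st Session H. R. 435 To exclude from tax. <all>'
)
sample_bills = split_on_header(sample)
print(len(sample_bills))
print(sample_bills[1][:60])
```

With this approach each piece starts with its own header and ends with `<all>`, so the number of pieces should match the number of bills.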
bills[1]
"117th CONGRESS 1st Session H. R. 1112 To require a report on the military coup in Burma, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES February 18, 2021 Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs _______________________________________________________________________ A BILL To require a report on the military coup in Burma, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Protect Democracy in Burma Act of 2021''. SEC. 2. FINDINGS. Congress finds the following: (1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work directly with parliaments around the world to support the development of effective, independent, and responsive legislative institutions. (2) HDP approved a legislative strengthening partnership with Burma in 2016 and organized the first congressional delegation to meet with the new civilian-led government, led by State Counselor Aung San Suu Kyi, and civil society leaders in May 2016. (3) On February 2, 2021, the U.S. Department of State assessed that Daw Aung San Suu Kyi, the leader of Burma's ruling party, and President Win Myint, the duly elected head of government, were deposed in a military coup on February 1, 2021. (4) As part of the military coup, the Burmese military declared martial law, suspended the civilian-led government, and detained newly elected Members of Parliament in the capitol, Naypyidaw, thereby usurping the role of the democratically elected government and parliament. SEC. 3. SENSE OF CONGRESS. 
It is the sense of Congress that-- (1) due to the Burmese military's seizure of government through the detention of State Counsellor Aung San Suu Kyi, President Win Myint, and other government leaders, Burma is not represented by a democratically elected government; (2) the inability of newly elected Members of Parliament to begin their official mandate due to the Burmese military's actions directly threatens the democratic trajectory of Burma's Parliament, and thereby the country; (3) the will and determination of those duly-elected Members of Parliament who are taking it upon themselves to continue serving as representatives of the people through alternative methods of communicating and convening should be lauded; and (4) by preventing the Parliament from completing its work, the Burmese military has rendered impossible and effectively nullified the international collaborative relationships that have supported and strengthened the institution, including the Burmese parliament's partnership with HDP. SEC. 4. STATEMENT OF POLICY. It is the policy of the United States to-- (1) engage with the Association of Southeast Asian Nations (ASEAN) and ASEAN member states to-- (A) condemn the military coup in Burma; (B) urge the unconditional release of detained democratically elected leaders and civil society members; and (C) support a return to Burma's democratic transition; and (2) instruct, as appropriate, representatives of the United States Government to use the voice, vote, and influence of the United States at the United Nations to hold accountable those responsible for the military coup in Burma. SEC. 5. REPORT. 
Not later than 90 days after the date of the enactment of this Act, the Secretary of State shall submit to the Committee on Foreign Affairs and the Committee on Appropriations of the House of Representatives and the Committee on Foreign Relations and the Committee on Appropriations of the Senate a report on the military coup in Burma, including a description of efforts to implement the policy specified in section 4. <all> [Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 435 Introduced in House (IH)] <DOC>"
bills[2]
"117th CONGRESS 1st Session H. R. 435 To exclude from tax certain payments of Federal pandemic unemployment compensation, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES January 21, 2021 Ms. Velazquez (for herself, Ms. Norton, Ms. Williams of Georgia, Mr. Smith of Washington, Ms. Jayapal, Mr. Meeks, Mr. Welch, Ms. Schakowsky, Ms. Kaptur, Ms. Tlaib, Ms. Wild, Ms. Newman, Ms. Lee of California, Ms. Meng, and Mr. Nadler) introduced the following bill; which was referred to the Committee on Ways and Means _______________________________________________________________________ A BILL To exclude from tax certain payments of Federal pandemic unemployment compensation, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Excluding Pandemic Unemployment Compensation from Income Act''. SEC. 2. CERTAIN PAYMENTS OF FEDERAL PANDEMIC UNEMPLOYMENT COMPENSATION NOT INCLUDIBLE IN GROSS INCOME OR TAKEN INTO ACCOUNT IN DETERMINING CERTAIN MEANS-TESTED BENEFITS. (a) Exclusion From Gross Income.--For purposes of the Internal Revenue Code of 1986, gross income shall not include the amount specified in section 2104(b)(3) of the CARES Act to the extent such amount is received by the taxpayer pursuant to section 2102, 2104, or 2107 of such Act. (b) Disregarded in the Administration of Federal Programs and Federally Assisted Programs.--For purposes of section 6409 of the Internal Revenue Code of 1986, any amount excluded from gross income under subsection (a) shall be treated in the same manner as a refund under such Code. (c) Effective Dates.-- (1) Exclusion.--Subsection (a) shall apply to taxable years ending after the date of the enactment of the CARES Act. (2) Disregard.--Subsection (b) shall apply to amounts received after the date of the enactment of the CARES Act. 
<all> [Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 886 Introduced in House (IH)] <DOC>"
bills[5]
"117th CONGRESS 1st Session H. R. 269 To enhance the safety of ports of entry in the United States, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES January 11, 2021 Mr. Vela introduced the following bill; which was referred to the Committee on Transportation and Infrastructure _______________________________________________________________________ A BILL To enhance the safety of ports of entry in the United States, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Putting Our Resources Toward Security (PORTS) Act''. SEC. 2. IMPROVING PORTS OF ENTRY FOR BORDER SECURITY AND OTHER PURPOSES. (1) In general.--There are authorized to be appropriated to the Administrator of the General Services Administration $2,000,000,000 for each of fiscal years 2022 through 2024 to make improvements to existing ports of entry in the United States to improve border security and for other purposes. (2) Priority.--In making improvements described in subsection (a), the Administrator of the General Services Administration, in coordination with the Secretary of Homeland Security, shall give priority to the ports of entry that the Administrator determines are in most need of repair to improve border security and for other purposes in accordance with port of entry infrastructure assessment studies required in section 603 of title VI, division E, of the Consolidated Appropriations Act of 2008 (Public Law 110-161). <all> [Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 144 Introduced in House (IH)] <DOC>"
bills[-1]
"117th CONGRESS 1st Session H. R. 154 To provide regulatory relief for Black and community banks, to codify the Minority Bank Deposit Program, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES January 4, 2021 Mr. Rush (for himself, Mr. Carson, Ms. Clarke of New York, Mr. Cooper, Mr. Khanna, and Mr. Welch) introduced the following bill; which was referred to the Committee on Financial Services, and in addition to the Committee on Ways and Means, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee concerned _______________________________________________________________________ A BILL To provide regulatory relief for Black and community banks, to codify the Minority Bank Deposit Program, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Reenergized Economic Sustainability for Community and Urban Entities Act for Black and Community Banks'' or the ``RESCUE Act for Black and Community Banks''. SEC. 2. REGULATION OF BLACK AND COMMUNITY BANKS. (a) Office of Black and Community Banks.-- (1) Establishment.--There is established within the Office of the Comptroller of the Currency an office to be known as the ``Office of Black and Community Banks''. (2) Supervision and examination of black banks and community banks.--The Comptroller of the Currency, acting through the Office of Black and Community Banks, shall supervise and examine Black banks and community banks. 
(3) Regulatory relief.-- (A) In general.--The Comptroller shall issue regulations to partially or completely exempt Black banks and community banks from Federal banking statutes and regulations, to the extent the Comptroller determines it appropriate without endangering the safety and soundness of such banks. (B) Treatment of manual underwriting.--For purposes of risk-based capital requirements for Black banks and community banks, the Comptroller shall issue regulations to assign a lower level of risk to loans that are issued by such banks using manual underwriting, in recognition of the individualized scrutiny provided by manual underwriting. (C) Encouraging small-dollar lending.--The Comptroller shall issue regulations to encourage affordable small-dollar lending by Black banks and community banks by providing regulatory flexibility with respect to such lending. (b) Regulatory Relief Under the Securities Laws.-- (1) Investment products.--With respect to investment products sold by a Black bank or a community bank (or an affiliate of such bank) to individuals in the community in which such bank is located, the Securities and Exchange Commission shall issue regulations to partially or completely exempt the bank from the securities laws and regulations issued under the securities laws, to the extent the Commission determines it appropriate without endangering the protection of investors. (2) Securities.-- (A) In general.--The Securities and Exchange Commission shall issue regulations to reduce the regulatory burden applicable to Black banks and community banks-- (i) under the amendments made by the Jumpstart Our Business Startups Act; (ii) issuing mortgage-backed securities; and (iii) issuing securities backed by loans guaranteed by the Small Business Act. (B) Crowdfunding exemption.--Section 4A of the Securities Act of 1933 (15 U.S.C. 77d-1) shall not apply to Black banks or community banks. 
(c) Conforming Change to Definition of Appropriate Federal Banking Agency.--Section 3(q)(1) of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)(1)) is amended-- (1) in subparagraph (B), by striking ``and'' at the end; (2) in subparagraph (C), by adding ``and'' at the end; and (3) by adding at the end the following: ``(D) notwithstanding paragraphs (2) and (3), any Black bank or community bank (as such terms are defined under section 5 of the RESCUE Act for Black and Community Banks);''. SEC. 3. CODIFICATION OF THE MINORITY BANK DEPOSIT PROGRAM. (a) In General.-- (1) In general.--Section 1204 of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 1811 note) is amended to read as follows: ``SEC. 1204. EXPANSION OF USE OF MINORITY BANKS, WOMEN'S BANKS, AND LOW-INCOME CREDIT UNIONS. ``(a) Minority Bank Deposit Program.-- ``(1) Establishment.--There is established a program to be known as the `Minority Bank Deposit Program' to expand the use of minority banks, women's banks, and low-income credit unions. ``(2) Administration.--The Secretary of the Treasury, acting through the Fiscal Service, shall-- ``(A) on application by a depository institution or credit union, certify whether such depository institution or credit union is a minority bank, women's bank, or low-income credit union; ``(B) maintain and publish a list of all depository institutions and credit unions that have been certified pursuant to subparagraph (A); ``(C) periodically distribute the list described in subparagraph (B) to-- ``(i) all Federal departments and agencies; ``(ii) interested State and local governments; and ``(iii) interested private sector companies; and ``(D) support the creation of ratings, online Black bank resources, and database products, including online lending and investment facilities. 
``(3) Inclusion of certain entities on list.--A depository institution or credit union that, on the date of the enactment of this section, has a current certification from the Secretary of the Treasury stating that such depository institution or credit union is a minority bank, women's bank, or low-income credit union shall be included on the list described under paragraph (2)(B). ``(b) Expanded Use Among Federal Departments and Agencies.-- ``(1) In general.--Not later than 1 year after the establishment of the program described in subsection (a), the head of each Federal department or agency shall develop and implement standards and procedures to ensure, to the maximum extent possible as permitted by law, the use of minority banks, women's banks, and low-income credit unions to serve the financial needs of each such department or agency. ``(2) Minimum requirement.--Notwithstanding paragraph (1), the head of each Federal department or agency shall ensure that at least 10 percent of the financial needs of each such department or agency are met by the use of minority banks, women's banks, and low-income credit unions. ``(3) Report to congress.--Not later than 2 years after the establishment of the program described in subsection (a), and annually thereafter, the head of each Federal department or agency shall submit to Congress a report on the actions taken to increase the use of minority banks, women's banks, and low- income credit unions to serve the financial needs of each such department or agency. ``(c) Definitions.--For purposes of this section: ``(1) Credit union.--The term `credit union' has the meaning given the term `insured credit union' in section 101 of the Federal Credit Union Act (12 U.S.C. 1752). ``(2) Depository institution.--The term `depository institution' has the meaning given the term `insured depository institution' in section 3 of the Federal Deposit Insurance Act (12 U.S.C. 1813). 
``(3) Low-income credit union.--The term `low-income credit union' means any entity described in section 19(b)(1)(A)(iv) of the Federal Reserve Act. ``(4) Minority.--The term `minority' means any Black American, Native American, Hispanic American, or Asian American. ``(5) Minority bank.--The term `minority bank' means any bank described in clause (i), (ii), or (iii) of section 19(b)(1)(A) of the Federal Reserve Act for which-- ``(A) more than 50 percent of the outstanding shares of which are held by 1 or more minority individuals; ``(B) the majority of the directors on the board of directors of which are minority individuals; and ``(C) a significant percentage of senior management positions of which are held by minority individuals. ``(6) Women's bank.--The term `women's bank' means any bank described in clause (i), (ii), or (iii) of section 19(b)(1)(A) of the Federal Reserve Act for which-- ``(A) more than 50 percent of the outstanding shares of which are held by 1 or more women; ``(B) the majority of the directors on the board of directors of which are women; and ``(C) a significant percentage of senior management positions of which are held by women.''. (2) Conforming amendments.--The following provisions are amended by striking ``1204(c)(3)'' and inserting ``1204(c)'': (A) Section 808(b)(3) of the Community Reinvestment Act of 1977 (12 U.S.C. 2907(b)(3)). (B) Section 40(g)(1)(B) of the Federal Deposit Insurance Act (12 U.S.C. 1831q(g)(1)(B)). (C) Section 704B(h)(4) of the Equal Credit Opportunity Act (15 U.S.C. 1691c-2(h)(4)). (b) Amendments to the Community Reinvestment Act.--Section 804(b) of the Community Reinvestment Act of 1977 (12 U.S.C. 
2903(b)) is amended to read as follows: ``(b) Cooperation With Minority Banks, Women's Banks, and Low- Income Credit Unions Considered.-- ``(1) In general.--In assessing and taking into account, under subsection (a), the record of a financial institution, the appropriate Federal financial supervisory agency shall consider as a factor capital investment, loan participation, and other ventures undertaken by the institution in cooperation with minority banks, women's banks, community development financial institutions, and low-income credit unions provided that these activities help meet the credit needs of local communities in which such institutions and credit unions are chartered. ``(2) Definitions.-- ``(A) FIRREA definitions.--The terms `low-income credit union', `minority bank', and `women's bank' have the meanings given such terms, respectively, in section 1204(c) of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 1811 note). ``(B) Community development financial institution.--The term `community development financial institution' has the meaning given in section 103(5) of the Riegle Community Development and Regulatory Improvement Act of 1994 (12 U.S.C. 4702(5)).''. 
(c) Considerations When Assessing Financial Inclusion for Federally Chartered Financial Institutions.-- (1) In general.--In assessing and taking into account the record of a federally chartered financial institution under any financial inclusion assessment process created by the Comptroller of the Currency in any rule relating to the chartering of a financial institution, the Comptroller shall consider as a factor capital investment, loan participation, and other ventures undertaken by the bank in cooperation with Black banks, women's banks, community development financial institutions, and low-income credit unions, provided that these activities help meet the financial needs of local communities in which the federally chartered financial institution provides financial products or services. (2) Definitions.--For purposes of this section: (A) Community development financial institution.-- The term ``community development financial institution'' has the meaning given in section 103(5) of the Riegle Community Development and Regulatory Improvement Act of 1994 (12 U.S.C. 4702(5)). (B) Financial inclusion assessment process.--The term ``financial inclusion assessment process'' means any process relating to the chartering of a financial institution whereby the Comptroller of the Currency assesses and takes into account the financial institution's record of meeting the financial needs of the bank's entire community, including low- and moderate-income neighborhoods, consistent with the safe and sound operation of such bank. (C) Financial product or service.--The term ``financial product or service'' has the meaning given such term in section 1002 of the Dodd-Frank Wall Street Reform and Consumer Protection Act (12 U.S.C. 5481). (D) FIRREA definitions.--The terms ``low-income credit union'' and ``women's bank'' have the meanings given such terms, respectively, in section 1204(c) of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 
1811 note). SEC. 4. GAO STUDIES. (a) New Markets Tax Credit Study.--The Comptroller General of the United States shall carry out a study on the award of the new markets tax credit by-- (1) surveying communities and specifically talking to Black banks, community banks, and CDFIs that wish to receive the tax credit about why they are not receiving the tax credit; (2) determining where the tax credit money actually went and what it was used for; and (3) to the extent possible, using a case study approach. (b) Lower-Value Home Mortgage Loan Study.--The Comptroller General of the United States shall carry out a study on mortgage loans with a principal amount of $100,000 or less, including-- (1) who is making such loans currently; (2) how communities are encouraging such loans; (3) what changes could encourage banks and other persons to provide more such loans; and (4) any statutory or regulatory changes that the Comptroller believes may be needed to encourage more such loans. (c) Blockchain Study.--The Comptroller General of the United States shall carry out a study on blockchain technology and whether such technology could be used to increase investment by lower-income individuals in start-ups and other crowd-funded companies. SEC. 5. DEFINITIONS. For purposes of this Act: (1) Black bank.--The term ``Black bank'' means an insured depository institution-- (A) more than 50 percent of the ownership or control of which is held by 1 or more Black individuals; and (B) more than 50 percent of the net profit or loss of which accrues to 1 or more Black individuals. (2) CDFI.--The term ``CDFI'' has the meaning given the term ``community development financial institution'' under section 103 of the Community Development Banking and Financial Institutions Act of 1994. (3) Community bank.--The term ``community bank'' means an insured depository institution with less than $100,000,000 in consolidated assets. 
(4) Comptroller.--The term ``Comptroller'' means the Comptroller of the Currency, except when used in the context of the Comptroller General of the United States. (5) Insured credit union.--The term ``insured credit union'' has the meaning given such term under section 101 of the Federal Credit Union Act. (6) Insured depository institution.--The term ``insured depository institution''-- (A) has the meaning given such term under section 3 of the Federal Deposit Insurance Act; and (B) includes an insured credit union. (7) Securities laws.--The term ``securities laws'' has the meaning given such term under section 3 of the Securities Exchange Act of 1934 (15 U.S.C. 78c). <all>"

follow-up prompt:

You split the first bill incorrectly. Make sure the first bill's header information is included with the body of the bill, like on the other bills.

import re

def split_bills(text):
    # Regular expression to detect the start of a new bill.
    # Note: this pattern requires literal newlines ("\n") around each header line.
    bill_start_pattern = re.compile(r"\n\[Congressional Bills \d+th Congress\]\n.*?\n.*?Introduced in House \(IH\)\]\n")
    
    # Find all start positions of bills
    matches = list(bill_start_pattern.finditer(text))
    
    bills = []
    for i in range(len(matches)):
        start = matches[i].start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        bills.append(text[start:end].strip())
    
    return bills
bills = split_bills(text)
bills
[]
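Why the empty list? A likely reason is that our cleaned text no longer contains the newline characters this pattern insists on: in the outputs in this section, the three header lines are separated by single spaces. The sketch below checks that idea on a small hypothetical `sample` string (not the real data), and compares the pattern above with a whitespace-tolerant variant, `flexible_pattern`, which is our own assumption and not part of ChatGPT's answer.

```python
import re

# Hypothetical sample: a header whose newlines have been collapsed to spaces,
# mirroring the cleaned text shown in the outputs of this section.
sample = (
    "[Congressional Bills 117th Congress] "
    "[From the U.S. Government Publishing Office] "
    "[H.R. 1112 Introduced in House (IH)] <DOC> 117th CONGRESS"
)

# The pattern from split_bills above: it requires literal newline characters.
newline_pattern = re.compile(
    r"\n\[Congressional Bills \d+th Congress\]\n.*?\n.*?Introduced in House \(IH\)\]\n"
)

# A whitespace-tolerant variant (an assumption): \s+ matches spaces or newlines.
flexible_pattern = re.compile(
    r"\[Congressional Bills \d+th Congress\]\s+\[.*?\]\s+\[.*?Introduced in House \(IH\)\]"
)

newline_hits = newline_pattern.findall(sample)
flexible_hits = flexible_pattern.findall(sample)

# Compare how many headers each pattern finds in the sample.
print(len(newline_hits), len(flexible_hits))
```

Running the same two `findall` calls on your own `text` is a quick way to see whether missing newlines are the culprit before going back to ChatGPT.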

follow-up prompt:

Now I’m getting nothing for bills

import re

def split_bills(text):
    # Regular expression to detect the start of a new bill
    bill_start_pattern = re.compile(r"(\[Congressional Bills \d+th Congress\].*?Introduced in House \(IH\)\])", re.DOTALL)

    # Split based on detected bill headers
    parts = bill_start_pattern.split(text)
    
    # Reconstruct bills ensuring the header stays with the content
    bills = []
    for i in range(1, len(parts), 2):
        header = parts[i].strip()
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        bills.append(f"{header}\n\n{body}")

    return bills

bills = split_bills(text)
bills[0]
"[Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 1112 Introduced in House (IH)]\n\n<DOC> 117th CONGRESS 1st Session H. R. 1112 To require a report on the military coup in Burma, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES February 18, 2021 Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs _______________________________________________________________________ A BILL To require a report on the military coup in Burma, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Protect Democracy in Burma Act of 2021''. SEC. 2. FINDINGS. Congress finds the following: (1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work directly with parliaments around the world to support the development of effective, independent, and responsive legislative institutions. (2) HDP approved a legislative strengthening partnership with Burma in 2016 and organized the first congressional delegation to meet with the new civilian-led government, led by State Counselor Aung San Suu Kyi, and civil society leaders in May 2016. (3) On February 2, 2021, the U.S. Department of State assessed that Daw Aung San Suu Kyi, the leader of Burma's ruling party, and President Win Myint, the duly elected head of government, were deposed in a military coup on February 1, 2021. 
(4) As part of the military coup, the Burmese military declared martial law, suspended the civilian-led government, and detained newly elected Members of Parliament in the capitol, Naypyidaw, thereby usurping the role of the democratically elected government and parliament. SEC. 3. SENSE OF CONGRESS. It is the sense of Congress that-- (1) due to the Burmese military's seizure of government through the detention of State Counsellor Aung San Suu Kyi, President Win Myint, and other government leaders, Burma is not represented by a democratically elected government; (2) the inability of newly elected Members of Parliament to begin their official mandate due to the Burmese military's actions directly threatens the democratic trajectory of Burma's Parliament, and thereby the country; (3) the will and determination of those duly-elected Members of Parliament who are taking it upon themselves to continue serving as representatives of the people through alternative methods of communicating and convening should be lauded; and (4) by preventing the Parliament from completing its work, the Burmese military has rendered impossible and effectively nullified the international collaborative relationships that have supported and strengthened the institution, including the Burmese parliament's partnership with HDP. SEC. 4. STATEMENT OF POLICY. It is the policy of the United States to-- (1) engage with the Association of Southeast Asian Nations (ASEAN) and ASEAN member states to-- (A) condemn the military coup in Burma; (B) urge the unconditional release of detained democratically elected leaders and civil society members; and (C) support a return to Burma's democratic transition; and (2) instruct, as appropriate, representatives of the United States Government to use the voice, vote, and influence of the United States at the United Nations to hold accountable those responsible for the military coup in Burma. SEC. 5. REPORT. 
Not later than 90 days after the date of the enactment of this Act, the Secretary of State shall submit to the Committee on Foreign Affairs and the Committee on Appropriations of the House of Representatives and the Committee on Foreign Relations and the Committee on Appropriations of the Senate a report on the military coup in Burma, including a description of efforts to implement the policy specified in section 4. <all>"
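The version above splits on the header and then glues each header back onto its body. As a rough alternative, we could split just before each header with a zero-width lookahead, so the header is never removed in the first place. This is a minimal sketch, not ChatGPT's answer: the function name `split_bills_lookahead` and the `sample` string are hypothetical, and splitting on an empty match assumes Python 3.7 or later.

```python
import re

def split_bills_lookahead(text):
    """Split text into bills, keeping each header attached to its body."""
    # (?=...) is a lookahead: it marks the position right before a header
    # without consuming any characters, so nothing is lost in the split.
    header_start = re.compile(r"(?=\[Congressional Bills \d+th Congress\])")
    parts = [part.strip() for part in header_start.split(text)]
    # Drop anything that comes before the first header (e.g. leftover HTML).
    return [part for part in parts if part.startswith("[Congressional Bills")]

# Hypothetical two-bill sample in the same whitespace-collapsed style.
sample = (
    "<html> [Congressional Bills 117th Congress] "
    "[H.R. 1 Introduced in House (IH)] body one <all> "
    "[Congressional Bills 117th Congress] "
    "[H.R. 2 Introduced in House (IH)] body two <all>"
)

sample_bills = split_bills_lookahead(sample)
print(len(sample_bills))
```

Unlike the split-and-rebuild approach, this keeps the original spacing between header and body, and it does not depend on the "Introduced in House (IH)" wording.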

follow-up prompt:

Extract metadata for title, number, session, and plain text


def extract_metadata(bill_text):
    metadata = {}
    
    # Extract the Congress (e.g. "117th Congress"); note that it is stored under 'session_number'
    bill_number_match = re.search(r"\[Congressional Bills (\d+th Congress)\]", bill_text)
    metadata['session_number'] = bill_number_match.group(1) if bill_number_match else "Unknown"
    
    # Extract title (this pattern relies on newlines, so it falls back to "Unknown" when they are missing)
    title_match = re.search(r"\n(.*?)\n.*?Introduced in House", bill_text)
    metadata['title'] = title_match.group(1).strip() if title_match else "Unknown"
    
    # Extract bill number from text header
    bill_id_match = re.search(r"(H\.R\. \d+|S\. \d+)", bill_text)
    metadata['bill_number'] = bill_id_match.group(1) if bill_id_match else "Unknown"
    
    # Store the plain text content
    metadata['text'] = bill_text.strip()
    
    return metadata
extract_metadata(bills[0])
{'session_number': '117th Congress', 'title': 'Unknown', 'bill_number': 'H.R. 1112', 'text': "[Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 1112 Introduced in House (IH)]\n\n<DOC> 117th CONGRESS 1st Session H. R. 1112 To require a report on the military coup in Burma, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES February 18, 2021 Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs _______________________________________________________________________ A BILL To require a report on the military coup in Burma, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Protect Democracy in Burma Act of 2021''. SEC. 2. FINDINGS. Congress finds the following: (1) On March 14, 2005, the House of Representatives agreed to H. Res. 135, which established the House Democracy Assistance Commission (later changed to the House Democracy Partnership, hereafter referred to as ``HDP'') to work directly with parliaments around the world to support the development of effective, independent, and responsive legislative institutions. (2) HDP approved a legislative strengthening partnership with Burma in 2016 and organized the first congressional delegation to meet with the new civilian-led government, led by State Counselor Aung San Suu Kyi, and civil society leaders in May 2016. (3) On February 2, 2021, the U.S. Department of State assessed that Daw Aung San Suu Kyi, the leader of Burma's ruling party, and President Win Myint, the duly elected head of government, were deposed in a military coup on February 1, 2021. 
(4) As part of the military coup, the Burmese military declared martial law, suspended the civilian-led government, and detained newly elected Members of Parliament in the capitol, Naypyidaw, thereby usurping the role of the democratically elected government and parliament. SEC. 3. SENSE OF CONGRESS. It is the sense of Congress that-- (1) due to the Burmese military's seizure of government through the detention of State Counsellor Aung San Suu Kyi, President Win Myint, and other government leaders, Burma is not represented by a democratically elected government; (2) the inability of newly elected Members of Parliament to begin their official mandate due to the Burmese military's actions directly threatens the democratic trajectory of Burma's Parliament, and thereby the country; (3) the will and determination of those duly-elected Members of Parliament who are taking it upon themselves to continue serving as representatives of the people through alternative methods of communicating and convening should be lauded; and (4) by preventing the Parliament from completing its work, the Burmese military has rendered impossible and effectively nullified the international collaborative relationships that have supported and strengthened the institution, including the Burmese parliament's partnership with HDP. SEC. 4. STATEMENT OF POLICY. It is the policy of the United States to-- (1) engage with the Association of Southeast Asian Nations (ASEAN) and ASEAN member states to-- (A) condemn the military coup in Burma; (B) urge the unconditional release of detained democratically elected leaders and civil society members; and (C) support a return to Burma's democratic transition; and (2) instruct, as appropriate, representatives of the United States Government to use the voice, vote, and influence of the United States at the United Nations to hold accountable those responsible for the military coup in Burma. SEC. 5. REPORT. 
Not later than 90 days after the date of the enactment of this Act, the Secretary of State shall submit to the Committee on Foreign Affairs and the Committee on Appropriations of the House of Representatives and the Committee on Foreign Relations and the Committee on Appropriations of the Senate a report on the military coup in Burma, including a description of efforts to implement the policy specified in section 4. <all>"}
extract_metadata(bills[-1])
{'session_number': '117th Congress', 'title': 'Unknown', 'bill_number': 'H.R. 154', 'text': "[Congressional Bills 117th Congress] [From the U.S. Government Publishing Office] [H.R. 154 Introduced in House (IH)]\n\n<DOC> 117th CONGRESS 1st Session H. R. 154 To provide regulatory relief for Black and community banks, to codify the Minority Bank Deposit Program, and for other purposes. _______________________________________________________________________ IN THE HOUSE OF REPRESENTATIVES January 4, 2021 Mr. Rush (for himself, Mr. Carson, Ms. Clarke of New York, Mr. Cooper, Mr. Khanna, and Mr. Welch) introduced the following bill; which was referred to the Committee on Financial Services, and in addition to the Committee on Ways and Means, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee concerned _______________________________________________________________________ A BILL To provide regulatory relief for Black and community banks, to codify the Minority Bank Deposit Program, and for other purposes. Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled, SECTION 1. SHORT TITLE. This Act may be cited as the ``Reenergized Economic Sustainability for Community and Urban Entities Act for Black and Community Banks'' or the ``RESCUE Act for Black and Community Banks''. SEC. 2. REGULATION OF BLACK AND COMMUNITY BANKS. (a) Office of Black and Community Banks.-- (1) Establishment.--There is established within the Office of the Comptroller of the Currency an office to be known as the ``Office of Black and Community Banks''. (2) Supervision and examination of black banks and community banks.--The Comptroller of the Currency, acting through the Office of Black and Community Banks, shall supervise and examine Black banks and community banks. 
(3) Regulatory relief.-- (A) In general.--The Comptroller shall issue regulations to partially or completely exempt Black banks and community banks from Federal banking statutes and regulations, to the extent the Comptroller determines it appropriate without endangering the safety and soundness of such banks. (B) Treatment of manual underwriting.--For purposes of risk-based capital requirements for Black banks and community banks, the Comptroller shall issue regulations to assign a lower level of risk to loans that are issued by such banks using manual underwriting, in recognition of the individualized scrutiny provided by manual underwriting. (C) Encouraging small-dollar lending.--The Comptroller shall issue regulations to encourage affordable small-dollar lending by Black banks and community banks by providing regulatory flexibility with respect to such lending. (b) Regulatory Relief Under the Securities Laws.-- (1) Investment products.--With respect to investment products sold by a Black bank or a community bank (or an affiliate of such bank) to individuals in the community in which such bank is located, the Securities and Exchange Commission shall issue regulations to partially or completely exempt the bank from the securities laws and regulations issued under the securities laws, to the extent the Commission determines it appropriate without endangering the protection of investors. (2) Securities.-- (A) In general.--The Securities and Exchange Commission shall issue regulations to reduce the regulatory burden applicable to Black banks and community banks-- (i) under the amendments made by the Jumpstart Our Business Startups Act; (ii) issuing mortgage-backed securities; and (iii) issuing securities backed by loans guaranteed by the Small Business Act. (B) Crowdfunding exemption.--Section 4A of the Securities Act of 1933 (15 U.S.C. 77d-1) shall not apply to Black banks or community banks. 
(c) Conforming Change to Definition of Appropriate Federal Banking Agency.--Section 3(q)(1) of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)(1)) is amended-- (1) in subparagraph (B), by striking ``and'' at the end; (2) in subparagraph (C), by adding ``and'' at the end; and (3) by adding at the end the following: ``(D) notwithstanding paragraphs (2) and (3), any Black bank or community bank (as such terms are defined under section 5 of the RESCUE Act for Black and Community Banks);''. SEC. 3. CODIFICATION OF THE MINORITY BANK DEPOSIT PROGRAM. (a) In General.-- (1) In general.--Section 1204 of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 1811 note) is amended to read as follows: ``SEC. 1204. EXPANSION OF USE OF MINORITY BANKS, WOMEN'S BANKS, AND LOW-INCOME CREDIT UNIONS. ``(a) Minority Bank Deposit Program.-- ``(1) Establishment.--There is established a program to be known as the `Minority Bank Deposit Program' to expand the use of minority banks, women's banks, and low-income credit unions. ``(2) Administration.--The Secretary of the Treasury, acting through the Fiscal Service, shall-- ``(A) on application by a depository institution or credit union, certify whether such depository institution or credit union is a minority bank, women's bank, or low-income credit union; ``(B) maintain and publish a list of all depository institutions and credit unions that have been certified pursuant to subparagraph (A); ``(C) periodically distribute the list described in subparagraph (B) to-- ``(i) all Federal departments and agencies; ``(ii) interested State and local governments; and ``(iii) interested private sector companies; and ``(D) support the creation of ratings, online Black bank resources, and database products, including online lending and investment facilities. 
``(3) Inclusion of certain entities on list.--A depository institution or credit union that, on the date of the enactment of this section, has a current certification from the Secretary of the Treasury stating that such depository institution or credit union is a minority bank, women's bank, or low-income credit union shall be included on the list described under paragraph (2)(B). ``(b) Expanded Use Among Federal Departments and Agencies.-- ``(1) In general.--Not later than 1 year after the establishment of the program described in subsection (a), the head of each Federal department or agency shall develop and implement standards and procedures to ensure, to the maximum extent possible as permitted by law, the use of minority banks, women's banks, and low-income credit unions to serve the financial needs of each such department or agency. ``(2) Minimum requirement.--Notwithstanding paragraph (1), the head of each Federal department or agency shall ensure that at least 10 percent of the financial needs of each such department or agency are met by the use of minority banks, women's banks, and low-income credit unions. ``(3) Report to congress.--Not later than 2 years after the establishment of the program described in subsection (a), and annually thereafter, the head of each Federal department or agency shall submit to Congress a report on the actions taken to increase the use of minority banks, women's banks, and low- income credit unions to serve the financial needs of each such department or agency. ``(c) Definitions.--For purposes of this section: ``(1) Credit union.--The term `credit union' has the meaning given the term `insured credit union' in section 101 of the Federal Credit Union Act (12 U.S.C. 1752). ``(2) Depository institution.--The term `depository institution' has the meaning given the term `insured depository institution' in section 3 of the Federal Deposit Insurance Act (12 U.S.C. 1813). 
``(3) Low-income credit union.--The term `low-income credit union' means any entity described in section 19(b)(1)(A)(iv) of the Federal Reserve Act. ``(4) Minority.--The term `minority' means any Black American, Native American, Hispanic American, or Asian American. ``(5) Minority bank.--The term `minority bank' means any bank described in clause (i), (ii), or (iii) of section 19(b)(1)(A) of the Federal Reserve Act for which-- ``(A) more than 50 percent of the outstanding shares of which are held by 1 or more minority individuals; ``(B) the majority of the directors on the board of directors of which are minority individuals; and ``(C) a significant percentage of senior management positions of which are held by minority individuals. ``(6) Women's bank.--The term `women's bank' means any bank described in clause (i), (ii), or (iii) of section 19(b)(1)(A) of the Federal Reserve Act for which-- ``(A) more than 50 percent of the outstanding shares of which are held by 1 or more women; ``(B) the majority of the directors on the board of directors of which are women; and ``(C) a significant percentage of senior management positions of which are held by women.''. (2) Conforming amendments.--The following provisions are amended by striking ``1204(c)(3)'' and inserting ``1204(c)'': (A) Section 808(b)(3) of the Community Reinvestment Act of 1977 (12 U.S.C. 2907(b)(3)). (B) Section 40(g)(1)(B) of the Federal Deposit Insurance Act (12 U.S.C. 1831q(g)(1)(B)). (C) Section 704B(h)(4) of the Equal Credit Opportunity Act (15 U.S.C. 1691c-2(h)(4)). (b) Amendments to the Community Reinvestment Act.--Section 804(b) of the Community Reinvestment Act of 1977 (12 U.S.C. 
2903(b)) is amended to read as follows: ``(b) Cooperation With Minority Banks, Women's Banks, and Low- Income Credit Unions Considered.-- ``(1) In general.--In assessing and taking into account, under subsection (a), the record of a financial institution, the appropriate Federal financial supervisory agency shall consider as a factor capital investment, loan participation, and other ventures undertaken by the institution in cooperation with minority banks, women's banks, community development financial institutions, and low-income credit unions provided that these activities help meet the credit needs of local communities in which such institutions and credit unions are chartered. ``(2) Definitions.-- ``(A) FIRREA definitions.--The terms `low-income credit union', `minority bank', and `women's bank' have the meanings given such terms, respectively, in section 1204(c) of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 1811 note). ``(B) Community development financial institution.--The term `community development financial institution' has the meaning given in section 103(5) of the Riegle Community Development and Regulatory Improvement Act of 1994 (12 U.S.C. 4702(5)).''. 
(c) Considerations When Assessing Financial Inclusion for Federally Chartered Financial Institutions.-- (1) In general.--In assessing and taking into account the record of a federally chartered financial institution under any financial inclusion assessment process created by the Comptroller of the Currency in any rule relating to the chartering of a financial institution, the Comptroller shall consider as a factor capital investment, loan participation, and other ventures undertaken by the bank in cooperation with Black banks, women's banks, community development financial institutions, and low-income credit unions, provided that these activities help meet the financial needs of local communities in which the federally chartered financial institution provides financial products or services. (2) Definitions.--For purposes of this section: (A) Community development financial institution.-- The term ``community development financial institution'' has the meaning given in section 103(5) of the Riegle Community Development and Regulatory Improvement Act of 1994 (12 U.S.C. 4702(5)). (B) Financial inclusion assessment process.--The term ``financial inclusion assessment process'' means any process relating to the chartering of a financial institution whereby the Comptroller of the Currency assesses and takes into account the financial institution's record of meeting the financial needs of the bank's entire community, including low- and moderate-income neighborhoods, consistent with the safe and sound operation of such bank. (C) Financial product or service.--The term ``financial product or service'' has the meaning given such term in section 1002 of the Dodd-Frank Wall Street Reform and Consumer Protection Act (12 U.S.C. 5481). (D) FIRREA definitions.--The terms ``low-income credit union'' and ``women's bank'' have the meanings given such terms, respectively, in section 1204(c) of the Financial Institutions Reform, Recovery, and Enforcement Act of 1989 (12 U.S.C. 
1811 note). SEC. 4. GAO STUDIES. (a) New Markets Tax Credit Study.--The Comptroller General of the United States shall carry out a study on the award of the new markets tax credit by-- (1) surveying communities and specifically talking to Black banks, community banks, and CDFIs that wish to receive the tax credit about why they are not receiving the tax credit; (2) determining where the tax credit money actually went and what it was used for; and (3) to the extent possible, using a case study approach. (b) Lower-Value Home Mortgage Loan Study.--The Comptroller General of the United States shall carry out a study on mortgage loans with a principal amount of $100,000 or less, including-- (1) who is making such loans currently; (2) how communities are encouraging such loans; (3) what changes could encourage banks and other persons to provide more such loans; and (4) any statutory or regulatory changes that the Comptroller believes may be needed to encourage more such loans. (c) Blockchain Study.--The Comptroller General of the United States shall carry out a study on blockchain technology and whether such technology could be used to increase investment by lower-income individuals in start-ups and other crowd-funded companies. SEC. 5. DEFINITIONS. For purposes of this Act: (1) Black bank.--The term ``Black bank'' means an insured depository institution-- (A) more than 50 percent of the ownership or control of which is held by 1 or more Black individuals; and (B) more than 50 percent of the net profit or loss of which accrues to 1 or more Black individuals. (2) CDFI.--The term ``CDFI'' has the meaning given the term ``community development financial institution'' under section 103 of the Community Development Banking and Financial Institutions Act of 1994. (3) Community bank.--The term ``community bank'' means an insured depository institution with less than $100,000,000 in consolidated assets. 
(4) Comptroller.--The term ``Comptroller'' means the Comptroller of the Currency, except when used in the context of the Comptroller General of the United States. (5) Insured credit union.--The term ``insured credit union'' has the meaning given such term under section 101 of the Federal Credit Union Act. (6) Insured depository institution.--The term ``insured depository institution''-- (A) has the meaning given such term under section 3 of the Federal Deposit Insurance Act; and (B) includes an insured credit union. (7) Securities laws.--The term ``securities laws'' has the meaning given such term under section 3 of the Securities Exchange Act of 1934 (15 U.S.C. 78c). <all>"}
bills_metadata = [extract_metadata(bill) for bill in bills]
type(bills_metadata)
list
import pandas as pd

df = pd.DataFrame(bills_metadata)
df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   session_number  10 non-null     object
 1   title           10 non-null     object
 2   bill_number     10 non-null     object
 3   text            10 non-null     object
dtypes: object(4)
memory usage: 452.0+ bytes
df['text']
0    [Congressional Bills 117th Congress] [From the...
1    [Congressional Bills 117th Congress] [From the...
2    [Congressional Bills 117th Congress] [From the...
3    [Congressional Bills 117th Congress] [From the...
4    [Congressional Bills 117th Congress] [From the...
5    [Congressional Bills 117th Congress] [From the...
6    [Congressional Bills 117th Congress] [From the...
7    [Congressional Bills 117th Congress] [From the...
8    [Congressional Bills 117th Congress] [From the...
9    [Congressional Bills 117th Congress] [From the...
Name: text, dtype: object
# Function to extract metadata from a bill
def extract_metadata(bill_text):
    bill_number_match = re.search(r"H\.R\. \d+", bill_text)
    
    # Find the third number occurrence (bill number and other numbers)
    number_matches = re.findall(r"\d+", bill_text)
    if len(number_matches) >= 3:
        # Find text after the third occurrence of numbers, stopping at the first underscore
        title_start = bill_text.find(number_matches[2]) + len(number_matches[2])  # After the third number
        title_end = bill_text.find("_", title_start)  # Stop at the first underscore
        title = bill_text[title_start:title_end].strip() if title_end != -1 else bill_text[title_start:].strip()
    else:
        title = "No Title Found"
    
    session_match = re.search(r"\[Congressional Bills (\d+th Congress)\]", bill_text)
    
    bill_number = bill_number_match.group(0) if bill_number_match else "Unknown"
    session = session_match.group(1) if session_match else "Unknown Session"
    
    return {
        "Bill Number": bill_number,
        "Title": title,
        "Session": session,
        "Full Text": bill_text
    }
bills_metadata = [extract_metadata(bill) for bill in bills]
df = pd.DataFrame(bills_metadata)
df

follow-up prompt

To get the title, get the text immediately following the third group of consecutive numbers, up until the first underscore

# Function to extract metadata from a bill
def extract_metadata(bill_text):
    bill_number_match = re.search(r"H\.R\. \d+", bill_text)
    
    # Find all groups of consecutive numbers
    number_matches = re.findall(r"\d+", bill_text)
    
    if len(number_matches) >= 3:
        # Find the third group of consecutive numbers and extract text following it
        third_number_position = bill_text.find(number_matches[2]) + len(number_matches[2])  # After the third number
        title_start = third_number_position
        
        # Stop at the first underscore
        title_end = bill_text.find("_", title_start)
        
        # If underscore is found, take text until the underscore, otherwise take all remaining text
        title = bill_text[title_start:title_end].strip() if title_end != -1 else bill_text[title_start:].strip()
    else:
        title = "No Title Found"
    
    # Extract session (if present)
    session_match = re.search(r"\[Congressional Bills (\d+th Congress)\]", bill_text)
    
    bill_number = bill_number_match.group(0) if bill_number_match else "Unknown"
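    session = session_match.group(1) if session_match else "Unknown Session"
    
    # Return the extracted fields as a dictionary, one per bill
    return {
        "Bill Number": bill_number,
        "Title": title,
        "Session": session,
        "Full Text": bill_text
    }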
bills_metadata = [extract_metadata(bill) for bill in bills]
df = pd.DataFrame(bills_metadata)
df
df.to_csv('trans.csv')
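
One caveat about the function above: bill_text.find(number_matches[2]) returns the first place that digit string appears in the bill, which can be earlier than the third group of numbers. In the sample bill, the third group is 117, and 117 also appears at the very start of the text. A minimal sketch of an alternative, assuming we want the position of the third match itself, uses re.finditer, whose match objects carry their own positions. The function name text_after_third_number and the sample string are made up for illustration.

```python
import re

def text_after_third_number(bill_text):
    """Return the text between the third run of digits and the next underscore."""
    # finditer yields match objects, so each match knows where it sits in the text
    matches = list(re.finditer(r"\d+", bill_text))
    if len(matches) < 3:
        return "No Title Found"
    start = matches[2].end()          # position right after the third number
    end = bill_text.find("_", start)  # first underscore after that position
    # If there is no underscore, keep everything to the end of the text
    return (bill_text[start:end] if end != -1 else bill_text[start:]).strip()

# Hypothetical, shortened bill header used only for illustration
sample = ("[Congressional Bills 117th Congress] [H.R. 1112 Introduced in House (IH)] "
          "117th CONGRESS 1st Session H. R. 1112 To require a report. ____")
print(text_after_third_number(sample))
```

Even with the position fixed, the result still begins with leftover header text (th CONGRESS 1st Session H. R. 1112), so the title needs further cleaning either way.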

cleaning tabular data with OpenRefine

OpenRefine is an open source GUI tool for working with tabular data. It is especially useful for cleaning and organizing spreadsheet data.

We are going to use this tool to clean the “Title” column in our trans.csv dataset.

First, download and install the app. Download here.

After installing, when you open the app, it will open in your browser, like Jupyter Notebook.

Second, we will import our dataset. On the application page, click the “browse” button, select your data source, and press “next”. On the following page, you’ll see a number of options for customizing your project. We don’t have to change anything, so press “Create Project” on the right.

There is so much you can do with OpenRefine, particularly with Facets and Filters, which let you edit the spreadsheet by applying changes to patterns across cells rather than to individual cells. Let’s say you want to take out some unwanted HTML elements that appear in your text, or you want to regularize the capitalization or spelling of a certain word or phrase. With facets and filters, you can easily find the pattern and fix it. See more in this tutorial video.

Finally, we will do a bit of cleaning.

cleaning the “title” column

In our case, we are going to work on just one column, the “Title” column. We want to delete the text that comes before the title.

Go to the column that says “Title,” open the dropdown menu, hover over “edit column,” and select “split into several columns.”

Split into several columns option

Then, make sure the “by separator” option is selected, which it should be by default. Next, choose your “separator” in the text box below. A separator is the element that tells OpenRefine where to split the column. For our dataset, we can use the phrase “Session H. R. 1112” to split our columns. That way, we can get just the title in its own column, and get rid of the other (original) column.

Because each bill has its own number, a fixed phrase only works for one row, so we need to generate some regex. For this, I went back to ChatGPT and asked it to generate some regex that finds the pattern. I had to combine these results with what I found on a regex generator in order to get the code to work.

chatgpt regex

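Using that, I generated the string, Session\s[A-Za-z]\.\s[A-Za-z]\.\s\d{2,4}, and input it into the split cells window, making sure to select the regex tickbox. The pattern matches the word “Session”, a space, two letters that are each followed by a period and a space, and then a number of two to four digits.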

splitting cells by regex expression

After this, I got rid of the leftover column holding the text before the title, renamed the column I did keep to “Title,” and did a “transform” (selecting “common transforms”) to trim leading and trailing whitespace. Removing and renaming are under “edit column” on the dropdown; the common transforms are under “edit cells.”

And that’s it! Now my dataset is much cleaner.

title column cleaned up
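
For comparison with the OpenRefine steps, the same split can be sketched in Python. This is a minimal sketch that assumes the Title values look like the sample string below, which is made up from the bill header shown earlier. The helper name strip_title_prefix is hypothetical.

```python
import re

# Same pattern that was entered in the OpenRefine split window
PATTERN = re.compile(r"Session\s[A-Za-z]\.\s[A-Za-z]\.\s\d{2,4}")

def strip_title_prefix(raw_title):
    """Drop everything up to and including 'Session H. R. <number>'."""
    # Split once on the pattern; the last piece is the title itself.
    # If the pattern is not found, the whole string comes back unchanged.
    parts = PATTERN.split(raw_title, maxsplit=1)
    return parts[-1].strip()

# Hypothetical uncleaned title, for illustration only
raw = ("th Congress] [From the U.S. Government Publishing Office] "
       "[H.R. 1112 Introduced in House (IH)] 117th CONGRESS 1st Session H. R. 1112 "
       "To require a report on the military coup in Burma, and for other purposes.")
print(strip_title_prefix(raw))
```

To apply it to the whole column, something like df['Title'] = df['Title'].apply(strip_title_prefix) should work, assuming df is the DataFrame built above.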