Machine Learning#

Machine Learning (ML) is a broad area of Artificial Intelligence (AI) that encompasses natural language processing (NLP), computer vision, speech recognition, and more. In this workshop, we are going to focus on ML methods for NLP, specifically for text generation.

Before moving to text generation, however, it’s useful to get a sense of how this process works under the hood.

Word Vectors#

How does a text generation tool like ChatGPT work? How does it know what to respond when someone asks it a question? More specifically, how does it know what language to generate, what words follow other words?

The answer is that it learns by prediction. It processes massive amounts of text, and from that processing, it gleans a sense of what words tend to follow other words.

This is all possible thanks to “word vectors,” which are language in a quantified form. Technically speaking, word vectors are representations of words in graphical space. Each word is represented by a series of numbers that together make up its coordinates on a graph. Practically, each of those numbers represents the word’s relationship to another word in the dataset.

A vector for the words “cat” and “dog” might look like the following:

| word | tiger | cute | bones | wolf |
|------|-------|------|-------|------|
| cat  | .90   | .99  | .40   | .35  |
| dog  | .35   | .99  | .85   | .90  |

Use the code below to explore word vectors using the gensim library and a dataset trained on Twitter data, glove-twitter-25.

import gensim
from gensim import downloader

# Download and load pre-trained GloVe word vectors trained on Twitter data
# (25 dimensions per word).
glove_vectors = gensim.downloader.load('glove-twitter-25')

# Show the ten words whose vectors are closest to the vector for "woman".
glove_vectors.most_similar('woman')

[('child', 0.9371739029884338),
 ('mother', 0.9214695692062378),
 ('whose', 0.917497456073761),
 ('called', 0.9146499633789062),
 ('person', 0.913553774356842),
 ('wife', 0.9088310599327087),
 ('being', 0.9037442803382874),
 ('father', 0.9028053283691406),
 ('guy', 0.9026350975036621),
 ('known', 0.8997253179550171)]
glove_vectors.most_similar('man')

[('was', 0.9065622687339783),
 ('i', 0.8880172371864319),
 ('he', 0.887438178062439),
 ('bad', 0.8846145272254944),
 ('even', 0.8832387924194336),
 ('be', 0.8784030079841614),
 ('we', 0.8764979243278503),
 ('not', 0.8764553666114807),
 ('had', 0.8762108683586121),
 ('glad', 0.8758710622787476)]
glove_vectors.most_similar('protest')

[('protests', 0.9241024851799011),
 ('forces', 0.9001613259315491),
 ('afghanistan', 0.8905416131019592),
 ('activists', 0.8872407078742981),
 ('troops', 0.880148708820343),
 ('protesters', 0.8785053491592407),
 ('violence', 0.8769642114639282),
 ('parliament', 0.8767853379249573),
 ('prison', 0.8743768930435181),
 ('opposition', 0.8693628311157227)]
glove_vectors.most_similar('princeton')

[('cornell', 0.8837954998016357),
 ('warren', 0.872944712638855),
 ('emory', 0.8666537404060364),
 ('quincy', 0.863002359867096),
 ('dudley', 0.8600769639015198),
 ('dayton', 0.8584739565849304),
 ('carson', 0.8520109057426453),
 ('savannah', 0.8516344428062439),
 ('pearson', 0.8490176200866699),
 ('trump', 0.8488551378250122)]
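We can also peek at the raw numbers behind these results. A quick sketch, using the same glove_vectors object loaded above: each word in this model is stored as a vector of 25 numbers.

# The raw 25-dimensional vector that represents the word "cat".
glove_vectors['cat']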

king - man + woman = queen#

Why do this to words? Why represent them as numbers? So we can do math! We can do things like linear algebra. This lets us calculate which words have similar meanings based on how close they are to each other in graphical space. For example, we can compute cosine similarity: measuring how closely two vectors point in the same direction gives us a sense of their semantic similarity. Word vectors open up a whole world of math, algebra, calculus, that we can do with language.
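As a rough sketch of what that math looks like in practice (assuming the glove_vectors object loaded above), gensim’s similarity method reports the cosine similarity between two word vectors. We would expect related words like “cat” and “dog” to score higher than unrelated ones, though the exact numbers depend on the model.

# Cosine similarity between two word vectors: values closer to 1 mean
# the vectors point in a more similar direction.
print(glove_vectors.similarity('cat', 'dog'))
print(glove_vectors.similarity('cat', 'democracy'))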

There’s a famous formula that represents this concept.

vector("King") - vector("Man") + vector("Woman") = vector("Queen")

This formula becomes a bit more interesting when we realize that it is the formula that introduced the power of word vectors to the world (see the famous paper Word2Vec). So the assumptions it plays on must be deeply embedded across society. What exactly is being calculated when we subtract “man” and add “woman”? What are the implied assumptions about gender here?
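We can try the arithmetic ourselves. A minimal sketch using gensim’s built-in vector arithmetic (with the glove_vectors object loaded above; a small Twitter-trained model may not reproduce the classic result as cleanly as the larger models used in the Word2Vec paper):

# king - man + woman: add the "king" and "woman" vectors, subtract "man",
# and list the words whose vectors are closest to the result.
glove_vectors.most_similar(positive=['king', 'woman'], negative=['man'])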

Learn more about word vectors in the excellent explanation by Jay Alammar.

Attention Mechanism#

After word vectors, the second big development is the “attention mechanism.”

Attention is a key component of the “transformer” model architecture, which was developed in 2017/2018. See the Attention Is All You Need paper that introduced the concept.

Attention means that context matters: it is taken as input to the calculations. Before attention, neural networks only took into account the words preceding a given word. With attention, networks can take the context, the words that surround a word, into their calculations.
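To make this concrete, here is a minimal numpy sketch of scaled dot-product attention, the core calculation from the Attention Is All You Need paper. The three-word “sentence” and its vectors are made up for illustration, and real models also apply learned projections to produce the queries, keys, and values.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare every query against every key to get attention scores.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each word's scores into weights that sum to 1.
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all the value vectors, so every
    # word's new representation reflects its surrounding context.
    return weights @ V

# Toy "sentence" of three words, each represented by a 4-dimensional vector.
X = np.random.rand(3, 4)
print(scaled_dot_product_attention(X, X, X))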

One of the first models built on the transformer architecture was BERT, which stands for Bidirectional Encoder Representations from Transformers. Developed by Google and released open source under the Apache 2.0 license, one of the most permissive licenses, this first-generation transformer model inspired many descendants that are still popular today.

When we move to the 🤗 website, we will see firsthand many variations of BERT, the berts!
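As a quick taste of what these models can do, here is a brief sketch using the Hugging Face transformers library (assuming it is installed and that the bert-base-uncased model is downloaded from the 🤗 Hub). BERT was trained to fill in masked words, so we can ask it to guess a hidden word:

from transformers import pipeline

# Load BERT's masked-word prediction pipeline.
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# Ask the model to predict the word hidden behind [MASK].
unmasker("The cat sat on the [MASK].")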

Training and Fine-Tuning#

The “berts” are what I call the models that have been trained and/or fine-tuned on the original BERT base model.

What is the difference between training and fine-tuning?

  • training is the creation of a “base” model. It requires lots, LOTS of data, gigabytes of data, and compute power. It takes weeks, sometimes longer.

  • fine-tuning is taking a base model, which has already been trained (like BERT) and training it further, with a much smaller dataset that is focused on a specific topic. It involves customizing the model to work for a particular topic or kind of data.

    • For example, FinBERT is a version of BERT fine-tuned for sentiment analysis of financial data (see the sketch below).
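Here is a minimal sketch of what using such a fine-tuned model can look like, assuming the Hugging Face transformers library is installed and the ProsusAI/finbert model is available on the 🤗 Hub (the example sentence is made up):

from transformers import pipeline

# Load FinBERT, a version of BERT fine-tuned on financial text.
classifier = pipeline('sentiment-analysis', model='ProsusAI/finbert')

# Classify the sentiment of a made-up financial headline.
classifier("The company reported record earnings and the stock rallied.")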

Why am I saying all this?#

To demystify these tools. They are not magic, they are not intuitive, possibly not even “intelligent”; they can just do a lot of math.

Also, to understand variations in models and their performance. We are going to engage with different language models in this workshop, in all shapes and sizes, and you’ll see some of these tools acting differently than what you’re used to with ChatGPT or more polished AI applications. It helps to know a bit about how they work and about their major developments to understand this ever more complicated ecosystem.