Hugging Face 🤗 platform

🤗 is an AI research and development company based in Brooklyn, New York City. The company hosts and develops a platform for machine learning, which provides compute and collaborative spaces for AI models, datasets, and more.

It is like a GitHub for ML, if GitHub had additional “hubs” for things besides code (such as datasets, papers, and apps).

“models hub”

We will start with the “models hub,” which contains AI models created by 🤗 users. Users train and/or fine-tune models, then publish them on the hub for others to use.

The navigation goes from left to right: on the left side there are tasks, such as text classification; on the right side there are the models that match.

Filter results by “most downloaded”, and try to guess what the model does just by looking at the name.

Then, filter results by “most liked”:

Now, filter by task and by size. Then, filter by whether inference is available. Choose one of the results that looks interesting to you, and practice running inference on its page for a few minutes. Do you notice anything about the results?
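The same filtering and sorting can also be expressed as a query against the Hub's HTTP API rather than by clicking through the page. The sketch below only builds the query URL. It assumes the public `https://huggingface.co/api/models` endpoint accepts `filter`, `search`, `sort`, `direction`, and `limit` parameters; the helper name `build_model_search_url` is hypothetical and not part of any library.

```python
from urllib.parse import urlencode

# Assumed public endpoint for listing models on the Hub.
HUB_MODELS_API = "https://huggingface.co/api/models"


def build_model_search_url(task=None, search=None, sort="downloads", limit=5):
    """Build a model-listing URL, sorted in descending order.

    `task` is a task tag such as "text-classification";
    `search` is free text matched against model names.
    """
    params = {"sort": sort, "direction": -1, "limit": limit}
    if task:
        params["filter"] = task
    if search:
        params["search"] = search
    return f"{HUB_MODELS_API}?{urlencode(params)}"


# The five most-downloaded text classification models.
url = build_model_search_url(task="text-classification")
print(url)
```

Opening the printed URL in a browser should return a JSON list of matching models. Changing `sort` to `"likes"` would mirror the “most liked” view, assuming the endpoint accepts that value.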

Go back to the search, and type gpt-neo-125. Click on the gpt-neo-125m result. This is a model developed by EleutherAI, a non-profit research lab. It is part of a larger family of models named “gpt-neo”, with the model size (number of parameters) at the end of the name.
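To make the naming pattern concrete, here is a minimal sketch that splits a model identifier into its owner, its family name, and an approximate parameter count taken from the size suffix. The function `parse_repo_id` is hypothetical, and the convention it relies on (a trailing `m` or `b` for millions or billions) is informal, so many models on the hub will not follow it.

```python
import re


def parse_repo_id(repo_id):
    """Split "owner/name-SIZE" into owner, family, and approximate size."""
    owner, _, name = repo_id.partition("/")
    if not name:  # identifier has no "owner/" prefix
        owner, name = None, owner

    # Look for a trailing size such as "125m" or "1.3b" (case-insensitive).
    match = re.search(r"(\d+(?:\.\d+)?)([mb])$", name.lower())
    if match is None:
        return {"owner": owner, "family": name, "approx_params": None}

    scale = 1_000_000 if match.group(2) == "m" else 1_000_000_000
    approx_params = round(float(match.group(1)) * scale)
    family = name[: match.start()].rstrip("-_")
    return {"owner": owner, "family": family, "approx_params": approx_params}


print(parse_repo_id("EleutherAI/gpt-neo-125m"))
print(parse_repo_id("bert-base-uncased"))
```

For names without a recognizable size suffix, the sketch returns `None` for the size instead of guessing.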

Here, we can see the model card, which contains important information about the model. There are a couple of things to note here:

In the header and the right-hand panel:

Now, look at the main text on the page. Here you will find context and information for the model, including training and inference information.

“datasets hub”

Besides models, 🤗 offers Datasets. These datasets are used to fine-tune (and also train and evaluate) models. We are going to take a little peek into this part of the platform.

Where does most of the data used to train these models come from? From scraping the internet. Look at the c4 dataset, for example. This dataset is based on Common Crawl, a very large open web-crawling initiative. C4 stands for “Colossal Clean Crawled Corpus” (hence the four Cs). It cleans the Common Crawl data using, among other steps, a “badwords filter” based on the List of Dirty, Naughty, Obscene, and Otherwise Bad Words.
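The idea behind a blocklist filter can be sketched in a few lines: drop every document that contains any word from the list. This is a toy illustration, not the actual C4 pipeline. The blocklist is a mild placeholder, the function names are hypothetical, and only single whole words are matched, whereas a real list may also contain multi-word phrases.

```python
import re


def contains_blocked_word(text, blocked):
    """Return True if any whole word in `text` is in the `blocked` set."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return not tokens.isdisjoint(blocked)


def filter_documents(documents, blocklist):
    """Keep only the documents that contain no blocked word."""
    blocked = {word.lower() for word in blocklist}
    return [doc for doc in documents if not contains_blocked_word(doc, blocked)]


# Placeholder blocklist and documents, for illustration only.
toy_blocklist = ["darn", "heck"]
docs = [
    "The weather is nice today.",
    "Well, heck, that did not work.",
    "A heckler interrupted the talk.",
]

kept = filter_documents(docs, toy_blocklist)
print(kept)
```

Note that the whole document is discarded, not just the offending word. The third document survives because “heckler” is a different word from “heck”; a filter based on substring matching would have removed it, which hints at why cleaning at scale is hard to get right.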

It is very difficult to clean this much data at scale. That is a major issue for ML model training, and a major opportunity for anyone who wants to get into the field.

In the next section, we will look at how to use these models by running “inference” directly in Python.