How to Collect Data to Train a GPT-3 Model

Introduction

 

GPT-3 is a pre-trained language model. In this blog, we describe how you can train the GPT-3 model on a custom dataset. The first step is to collect the data for training the GPT-3 model from various sources, which involves extracting data with the Wikipedia API, filtering the Wikipedia pages and splitting them into sections by headings, and exploring the resulting data.

GPT-3, also known as the Generative Pre-trained Transformer, is a language model that produces human-like text as output. Natural language processing (NLP) has become very popular, and GPT-3 is now frequently used to generate poetry, stories, and other kinds of text. The developers of GPT-3 trained the language model on data from a variety of languages. Beyond language tasks, GPT-3 also performs reasoning tasks such as arithmetic.


Libraries Required

 

  1. Pandas
  2. Wikipedia
  3. Numpy
  4. Transformers 
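
All four libraries are available on PyPI; assuming a standard Python environment, you can install them in one step:

# Install the required libraries (prefix with "!" when running inside a notebook)
pip install pandas wikipedia numpy transformers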

 

Collect Data (Data Extraction)

 

First, install the wikipedia package with the command “!pip install wikipedia” if you are working in a Python notebook (from a terminal, drop the “!”). Next, gather the titles related to the topic on which you need to train the GPT-3 model. For each title, we request the corresponding Wikipedia page, and then we recursively find all the pages linked from the titles in the list.
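
As a quick sanity check, the wikipedia package provides a search function that returns candidate page titles for a query. A minimal sketch (the query string is just an illustration):

# Search Wikipedia for candidate page titles (the query is illustrative)
import wikipedia

candidate_titles = wikipedia.search("2020 Summer Olympics", results=10)

# print the list of matching titles
print(candidate_titles)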

Let us consider an example: if you want to train your model on the Olympic Games 2020, you can do it with the following code.

 

Sample Code to Collect Data

 

In the code below, we create a function that filters Wikipedia page titles by the keywords passed to it, so that we only collect data relevant to training the GPT-3 model.

 

# Importing required libraries
import pandas as pd
import wikipedia

# Creating a function filter_olympic_2020_titles that takes a list of titles
def filter_olympic_2020_titles(titles):
    # keep only the titles that contain both "2020" and "olympi"
    titles = [title for title in titles if '2020' in title and 'olympi' in title.lower()]

    # return the filtered list of titles
    return titles
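
To see the filter in action, here is a quick usage example; the sample titles are made up, and only those containing both "2020" and "olympi" survive:

# Example usage of filter_olympic_2020_titles (sample titles are hypothetical)
sample_titles = ["2020 Summer Olympics", "2016 Summer Olympics", "Athletics at the 2020 Summer Olympics", "Tokyo"]
print(filter_olympic_2020_titles(sample_titles))
# expected output: ['2020 Summer Olympics', 'Athletics at the 2020 Summer Olympics']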

Next, we create a function to fetch the Wikipedia page for the title provided to it as a parameter.

 

# creating a function named get_wiki_page with title as parameter
def get_wiki_page(title):
    # first, try to return the Wikipedia page for the given title
    try:
        return wikipedia.page(title)

    # on a DisambiguationError, fall back to the first suggested option
    except wikipedia.exceptions.DisambiguationError as e:
        return wikipedia.page(e.options[0])

    # if the page does not exist, return None
    except wikipedia.exceptions.PageError:
        return None
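
A quick check of the function (the title is just an example) looks like this:

# Example usage of get_wiki_page (the title is illustrative)
page = get_wiki_page("2020 Summer Olympics")

# if a page was found, print its canonical title and how many pages it links to
if page is not None:
    print(page.title)
    print(len(page.links))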

 

Finally, we recursively collect all the pages related to the title passed as a parameter. For that, we create the recursively_find_all_pages function, which takes a list of titles and a set of the titles seen so far as parameters.

 

# creating recursively_find_all_pages with titles (a list) and titles_so_far (a set)
def recursively_find_all_pages(titles, titles_so_far=None):

    # default to None to avoid Python's mutable-default-argument pitfall
    if titles_so_far is None:
        titles_so_far = set()

    # creating an empty list to hold the collected pages
    all_pages = []

    # subtract titles_so_far from titles so the same page is not fetched twice
    titles = list(set(titles) - titles_so_far)

    # calling filter_olympic_2020_titles to keep only matching titles
    titles = filter_olympic_2020_titles(titles)

    # record the titles we are about to search in the titles_so_far set
    titles_so_far.update(titles)

    # iterating through the list of titles
    for title in titles:

        # calling get_wiki_page function with title as parameter
        page = get_wiki_page(title)

        # skip titles whose page could not be fetched
        if page is None:
            continue

        # append the page to the all_pages list
        all_pages.append(page)

        # recursively collect the pages linked from this page
        new_pages = recursively_find_all_pages(page.links, titles_so_far)

        # iterate through the newly found pages
        for pg in new_pages:

            # only add pages whose title we have not collected yet
            if pg.title not in [p.title for p in all_pages]:

                # append that page to all_pages
                all_pages.append(pg)

        # record this page's links in the titles_so_far set
        titles_so_far.update(page.links)

    # finally return all the collected pages
    return all_pages

# calling recursively_find_all_pages with "2020 Summer Olympics" as the starting title
pages = recursively_find_all_pages(["2020 Summer Olympics"])
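
Once the pages are collected, a simple way to explore them is to load the titles and raw text into a pandas DataFrame. This is only a sketch; the column names are our own choice:

# Load the collected pages into a DataFrame for exploration (column names are our choice)
df = pd.DataFrame({
    "title": [p.title for p in pages],
    "content": [p.content for p in pages],
})

# print how many pages were collected and preview the first few titles
print(len(df))
print(df["title"].head())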

 

After reading this blog post, please read the follow-up post on how to filter the Wikipedia pages and split them into sections, which continues this tutorial.

 
