How to Collect Data to Train a GPT-3 Model
Introduction
GPT-3 is a pre-trained model. In this blog post, we describe how you can train the GPT-3 model on a custom dataset. The first step is to collect the training data, which involves extracting data with the Wikipedia API, filtering the Wikipedia pages and splitting them into sections by headings, and exploring the data.
GPT-3, short for Generative Pre-trained Transformer 3, is a language model that produces human-like text as output. Natural language processing (NLP) has become very popular, and GPT-3 is now frequently used to generate poetry, stories, and other kinds of text. The developers of GPT-3 trained the model on data from many languages. Beyond language tasks, GPT-3 can also perform reasoning tasks such as arithmetic.
Libraries Required
- pandas
- wikipedia
- numpy
- transformers
Collect Data (Data Extraction)
First, install the wikipedia package with the command “!pip install wikipedia” if you are working in a Python notebook (or “pip install wikipedia” from the command line). Next, get the titles related to the topic you want to train the GPT-3 model on. For each title, we fetch the corresponding Wikipedia page, and then recursively find all the pages linked from the titles in the list.
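For instance, candidate titles for a topic can be pulled with the package's built-in search helper. The snippet below is a minimal sketch; the query string and result count are illustrative placeholders for whatever topic you choose.

# the wikipedia package exposes a search helper that returns matching page titles
import wikipedia

# NOTE: the query below is an illustrative placeholder, not part of the tutorial's pipeline
candidate_titles = wikipedia.search("2020 Summer Olympics", results=10)
# each entry is a page title string that can later be passed to wikipedia.page
print(candidate_titles)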
Let us consider an example: if you want to train your model on the 2020 Olympic Games, you can do it with the following code.
Sample Code to Collect Data
In the code below, we create a function that filters Wikipedia page titles based on the keywords passed to it, so that we only collect data relevant to training the GPT-3 model.
# Importing required libraries
import pandas as pd
import wikipedia

# Creating a function filter_olympic_2020_titles passing titles as parameter
def filter_olympic_2020_titles(titles):
    # filtering the titles with keyword
    titles = [title for title in titles if '2020' in title and 'olympi' in title.lower()]
    # we are returning the list of titles
    return titles
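As a quick sanity check, here is how the filter behaves on a small hand-made list of titles (the sample titles are made up purely for illustration):

# sample titles are hypothetical, purely for demonstration
sample_titles = [
    "2020 Summer Olympics",
    "2016 Summer Olympics",
    "Olympic Games",
    "Athletics at the 2020 Summer Olympics",
]
print(filter_olympic_2020_titles(sample_titles))
# only titles containing both '2020' and 'olympi' survive:
# ['2020 Summer Olympics', 'Athletics at the 2020 Summer Olympics']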
Next, we create a function that fetches the Wikipedia page for the title provided to it as a parameter.
# creating a function named get_wiki_page with title as parameter
def get_wiki_page(title):
    # we first try to return the wikipedia page with the title provided
    try:
        return wikipedia.page(title)
    # if there is a DisambiguationError, we return wikipedia.page with e.options[0] as parameter
    except wikipedia.exceptions.DisambiguationError as e:
        return wikipedia.page(e.options[0])
    # if there is a page error, then we return None
    except wikipedia.exceptions.PageError:
        return None
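For example, fetching a single page and inspecting a few of its attributes might look like this (title, content, and links are attributes exposed by the wikipedia package's page objects):

# fetch one page and inspect it
page = get_wiki_page("2020 Summer Olympics")
if page is not None:
    # title of the resolved page
    print(page.title)
    # first 200 characters of the article text
    print(page.content[:200])
    # number of outgoing links that the recursive step will follow
    print(len(page.links))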
Finally, we recursively find all the pages related to the title passed as a parameter. For that, we create the recursively_find_all_pages function with titles and titles_so_far as parameters.
# creating recursively_find_all_pages with titles, which is a list, and titles_so_far, which is a set
def recursively_find_all_pages(titles, titles_so_far=set()):
    # creating an empty list
    all_pages = []
    # subtracting titles_so_far from titles so that the same web page does not repeat
    titles = list(set(titles) - titles_so_far)
    # calling the filter_olympic_2020_titles function
    titles = filter_olympic_2020_titles(titles)
    # updating the searched titles in the titles_so_far set
    titles_so_far.update(titles)
    # iterating through the list of titles
    for title in titles:
        # calling the get_wiki_page function with title as parameter
        page = get_wiki_page(title)
        # if the page is empty, skip it
        if page is None:
            continue
        # append the page to the all_pages list
        all_pages.append(page)
        # recursively call the function on the page's links
        new_pages = recursively_find_all_pages(page.links, titles_so_far)
        # iterate through the new pages
        for pg in new_pages:
            # if the page title is not already in all_pages
            if pg.title not in [p.title for p in all_pages]:
                # append that page to all_pages
                all_pages.append(pg)
        # update the page links into the titles_so_far set
        titles_so_far.update(page.links)
    # finally return all the pages
    return all_pages

# calling the recursively_find_all_pages function with "2020 Summer Olympics" as the title
pages = recursively_find_all_pages(["2020 Summer Olympics"])
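With the pages collected, a quick way to explore the data (the "Exploring the data" step mentioned in the introduction) is to load the titles and article lengths into a pandas DataFrame. This is a minimal sketch using the pages list built above; the column names are just illustrative choices:

# build a small DataFrame to explore the collected pages
df = pd.DataFrame({
    'title': [p.title for p in pages],
    'characters': [len(p.content) for p in pages],
})
# how many pages were collected
print(len(df))
# show the longest articles first
print(df.sort_values('characters', ascending=False).head())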
As a continuation of this post, please read our follow-up on how to filter and split Wikipedia pages.