How to Split and Filter Wikipedia pages
Introduction
In the previous blog, we described the steps for extracting the data. After extraction, we have to filter the Wikipedia pages and modify them to fit our requirements. We ensure that each section is no longer than an upper token limit and no shorter than a lower token limit, and we remove the references and other low-information sections. At the end, we obtain well-structured data in CSV file format. In this blog, we discuss how to split and filter Wikipedia pages for GPT-3 training.
Libraries Required
- Pandas
- NumPy
- Transformers
- NLTK
Implementation to Split and Filter Wikipedia pages
To filter the Wikipedia pages, proceed as follows:
- Count the number of tokens in a string using GPT2TokenizerFast.from_pretrained("gpt2"), a pretrained tokenizer that encodes the text.
- Set an upper threshold, i.e. the maximum number of tokens you wish to keep, and reduce any long text to at most that many tokens by cutting it at a sentence end.
- Create a list of headings you want to exclude from the pages, for example: 'See also', 'References', 'External links', 'Further reading', 'Footnotes', 'Bibliography' and so on.
- Find all headings and the corresponding contents in the page, and discard the headings you want to exclude.
- Create a dataset and filter out any sections with fewer than 40 tokens. The dataset columns include title, heading, content, and tokens.
As an example, let us set the upper limit to 500 tokens and the lower limit to 40 tokens.
Sample Code to Filter Wikipedia pages
We import all the required libraries and create a function to count the number of tokens in a string.
# Importing libraries
import re
from typing import Set

# !pip install transformers to install the transformers library
from transformers import GPT2TokenizerFast
import numpy as np

# !pip install nltk to install the nltk library
# (you may also need to run nltk.download('punkt') once so that sent_tokenize works)
from nltk.tokenize import sent_tokenize

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# count_tokens takes a string and returns the number of GPT-2 tokens it encodes to
def count_tokens(text: str) -> int:
    # encode the text and return the length of the resulting token list
    return len(tokenizer.encode(text))
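As a quick sanity check, you can call count_tokens on any string. The sample sentence below is only an illustration and not from the original post:

# Hypothetical usage example: count the GPT-2 tokens in a short sentence
sample = "Wikipedia is a free online encyclopedia."
print(count_tokens(sample))  # prints the number of GPT-2 tokens in the sentence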
Next, we create the reduce_long function, which reduces a long text to a maximum of `max_len` tokens by potentially cutting at a sentence end.
# reduce_long takes the long text, an optional precomputed token count, and a maximum length
def reduce_long(
    long_text: str, long_text_tokens: bool = False, max_len: int = 590
) -> str:
    # if the token count was not passed in, compute it with count_tokens
    if not long_text_tokens:
        long_text_tokens = count_tokens(long_text)
    # if the text is longer than the maximum length, cut it at a sentence end
    if long_text_tokens > max_len:
        # split the text into sentences, replacing newlines with spaces
        sentences = sent_tokenize(long_text.replace("\n", " "))
        # initialize ntokens with 0
        ntokens = 0
        # iterate through the list of sentences
        for i, sentence in enumerate(sentences):
            # add the tokens of this sentence (plus one for the separator)
            ntokens += 1 + count_tokens(sentence)
            # once ntokens exceeds the maximum length, return the sentences collected
            # before the limit was crossed (dropping the last one), joined back together
            # and ended with a period
            if ntokens > max_len:
                return ". ".join(sentences[:i][:-1]) + "."
    return long_text

# this is the list of discard categories, i.e. headings whose sections we want to remove
discard_categories = ['See also', 'References', 'External links', 'Further reading',
    "Footnotes", "Bibliography", "Sources", "Citations", "Literature",
    "Notes and references", "Photo gallery", "Works cited", "Photos",
    "Gallery", "Notes", "References and sources", "References and notes",]
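For example, a long page body can be trimmed to roughly 500 tokens as shown below. The call is illustrative and not from the original post; long_article is an assumed variable holding text extracted in the previous blog:

# Hypothetical usage: trim a long text to at most 500 tokens
short_article = reduce_long(long_article, max_len=500)
# the result is cut at a sentence boundary so that it stays under the token limit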
Finally, extract the sections of a Wikipedia page, discarding the references and other low-information sections, store all the information in a data frame, and save it as a CSV file.
# extract_sections takes the wiki text, the page title, a maximum section length and the
# discard categories, and returns a list of (title, heading, content, tokens) tuples
def extract_sections(
    wiki_text: str,
    title: str,
    max_len: int = 1500,
    discard_categories: Set[str] = discard_categories,
) -> str:
    # if the wiki text is empty, return an empty list
    if len(wiki_text) == 0:
        return []

    # find all headings and the corresponding contents
    headings = re.findall("==+ .* ==+", wiki_text)
    # replace every heading with the marker "==+ !! ==+"
    for heading in headings:
        wiki_text = wiki_text.replace(heading, "==+ !! ==+")
    # split the text wherever the marker is found
    contents = wiki_text.split("==+ !! ==+")
    # remove the surrounding white space
    contents = [c.strip() for c in contents]
    assert len(headings) == len(contents) - 1

    # the first block (before any heading) is the page summary
    cont = contents.pop(0).strip()
    outputs = [(title, "Summary", cont, count_tokens(cont) + 4)]

    # discard the discard categories, accounting for a tree structure
    max_level = 100
    keep_group_level = max_level
    remove_group_level = max_level
    nheadings, ncontents = [], []
    for heading, content in zip(headings, contents):
        # the heading text without the leading and trailing "=" groups
        plain_heading = " ".join(heading.split(" ")[1:-1])
        # the number of "=" signs gives the nesting level of the heading
        num_equals = len(heading.split(" ")[0])
        if num_equals <= keep_group_level:
            keep_group_level = max_level
        # skip sub-headings of a section that is being removed
        if num_equals > remove_group_level:
            if num_equals <= keep_group_level:
                continue
        keep_group_level = max_level
        # if the plain heading is in the discard category list, mark its level and skip it
        if plain_heading in discard_categories:
            remove_group_level = num_equals
            keep_group_level = max_level
            continue
        # otherwise keep the cleaned heading and its content
        nheadings.append(heading.replace("=", "").strip())
        ncontents.append(content)
        remove_group_level = max_level

    # count the tokens of each section
    ncontent_ntokens = [
        count_tokens(c)
        + 3
        + count_tokens(" ".join(h.split(" ")[1:-1]))
        - (1 if len(c) == 0 else 0)
        for h, c in zip(nheadings, ncontents)
    ]

    # create a tuple of (title, section_name, content, number of tokens),
    # shortening any section that exceeds max_len with reduce_long
    outputs += [
        (title, h, c, t)
        if t < max_len
        else (title, h, reduce_long(c, max_len), count_tokens(reduce_long(c, max_len)))
        for h, c, t in zip(nheadings, ncontents, ncontent_ntokens)
    ]
    return outputs
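The original post does not show the final assembly step in code, so here is a minimal sketch of it, assuming the previous blog left you with a list named pages of (title, wiki_text) pairs. The names pages, res, df and the output file name are illustrative assumptions, not part of the original code. We run extract_sections over every page, keep only the sections with more than 40 tokens, and save the result as a CSV file with the title, heading, content, and tokens columns.

# A minimal sketch of the final assembly step (variable names are assumptions)
# 'pages' is assumed to be a list of (title, wiki_text) pairs from the previous blog
import pandas as pd

res = []
for page_title, page_text in pages:
    # extract the usable sections of each page
    res += extract_sections(page_text, page_title)

# build the dataset with the title, heading, content and tokens columns
df = pd.DataFrame(res, columns=["title", "heading", "content", "tokens"])
# filter out any sections with 40 tokens or fewer
df = df[df.tokens > 40]
# save the well-framed data in CSV file format (assumed file name)
df.to_csv("wikipedia_sections.csv", index=False)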
As a continuation of this blog post, please read how to train the GPT-3 model using this data.