How to Split and Filter Wikipedia pages

Introduction

In the previous blog, we described the steps to extract the data. After extraction, we filter the Wikipedia pages and reshape them to fit our requirements: each section must stay within an upper token limit and contain more than a minimum number of tokens. We also remove the references and other low-information sections. Finally, we save the well-structured data in CSV file format. In this blog, we discuss how to split and filter Wikipedia pages for GPT-3 training.


Libraries Required

  1. Pandas
  2. Numpy
  3. Transformers
  4. nltk


Implementation to Split and Filter Wikipedia pages 

To filter the Wikipedia pages, first count the number of tokens in a string using the tokenizer loaded with GPT2TokenizerFast.from_pretrained("gpt2"), which encodes text into GPT-2 tokens. Set an upper threshold, i.e. the maximum number of tokens you want per section. Reduce any long text to at most that many tokens, cutting at a sentence end where possible. Create a list of headings to exclude from the pages, for example 'See also', 'References', 'External links', 'Further reading', 'Footnotes', 'Bibliography' and so on. Find all headings and their corresponding contents on each page, and discard the headings you want to exclude. Finally, build a dataset with the columns title, heading, content, and tokens, and filter out any section with fewer than 40 tokens. In the code below, sections longer than the maximum length (1,500 tokens by default) are shortened at a sentence boundary, and sections with fewer than 40 tokens are discarded.

Sample Code to Filter Wikipedia pages

We import all the required libraries and create a function to count the number of tokens in a string.

# Importing libraries

import re
from typing import Set

# !pip install transformers to install the transformers library.
from transformers import GPT2TokenizerFast

import numpy as np

# !pip install nltk to install the nltk library
# (sent_tokenize also requires the punkt sentence model: nltk.download('punkt'))
from nltk.tokenize import sent_tokenize

# load the GPT-2 tokenizer, which encodes text into GPT-2 tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# count_tokens takes a string and returns the number of GPT-2 tokens it contains
def count_tokens(text: str) -> int:

    # encode the text and return the length of the resulting token list
    return len(tokenizer.encode(text))
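
As a quick check, we can call count_tokens on any string; the sentence below is just an arbitrary example.

# Example usage (illustrative string, not from the dataset)
print(count_tokens("Wikipedia pages can be split into sections and filtered by token count."))
# prints the number of GPT-2 tokens in the string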


Next, we create the reduce_long function, which reduces a long text to a maximum of `max_len` tokens, cutting at a sentence end where possible.

# reduce_long takes the text, an optional precomputed token count, and the maximum length in tokens
def reduce_long(
    long_text: str, long_text_tokens: bool = False, max_len: int = 590
) -> str:

    # if the token count was not passed in, compute it with count_tokens
    if not long_text_tokens:
        long_text_tokens = count_tokens(long_text)

    # only shorten the text if it exceeds the maximum length
    if long_text_tokens > max_len:

        # split the text into sentences, replacing newlines with spaces first
        sentences = sent_tokenize(long_text.replace("\n", " "))

        # running token count
        ntokens = 0

        # iterate through the sentences, accumulating tokens
        for i, sentence in enumerate(sentences):

            # add the sentence's tokens plus one for the separator
            ntokens += 1 + count_tokens(sentence)

            # once the limit is exceeded, return the sentences accumulated so far
            if ntokens > max_len:
                return ". ".join(sentences[:i][:-1]) + "."

    return long_text
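
To see reduce_long in action, we can pass it an artificially long piece of text; the filler sentence below is only for illustration.

# Example usage (placeholder text, not from the dataset)
long_text = " ".join(["This sentence is filler used only for illustration."] * 300)
shortened = reduce_long(long_text, max_len=590)
print(count_tokens(long_text), count_tokens(shortened))
# the second count should be far smaller than the first and roughly within the 590-token cap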

# discard_categories lists the section headings we want to exclude from every page.

discard_categories = ['See also', 'References', 'External links', 'Further reading', "Footnotes",
    "Bibliography", "Sources", "Citations", "Literature", "Footnotes", "Notes and references",
    "Photo gallery", "Works cited", "Photos", "Gallery", "Notes", "References and sources",
    "References and notes",]

Finally, we extract the sections of each Wikipedia page, discarding the references and other low-information sections, store all the information in a data frame, and save it as a CSV file.

# extract_sections takes the wiki text, the page title, the maximum section length and the discard categories
def extract_sections(
    wiki_text: str, title: str, max_len: int = 1500,
    discard_categories: Set[str] = discard_categories,) -> list:

    # if the wiki text is empty, return an empty list
    if len(wiki_text) == 0:
        return []

    # find all headings (wrapped in '==' markers) and their corresponding contents
    headings = re.findall("==+ .* ==+", wiki_text)

    # iterate through headings
    for heading in headings:

        # replace the headings with "==+ !! ==+"
        wiki_text = wiki_text.replace(heading, "==+ !! ==+")

    # split the text when "==+ !! ==+" is found
    contents = wiki_text.split("==+ !! ==+")
    
    # strip leading and trailing whitespace from each chunk
    contents = [c.strip() for c in contents]

    # the text before the first heading means there is one more content chunk than headings
    assert len(headings) == len(contents) - 1

    # the first chunk is the page summary; it becomes the first output row
    cont = contents.pop(0).strip()
    outputs = [(title, "Summary", cont, count_tokens(cont)+4)]
    
    # discard the discard categories, accounting for a tree structure
    max_level = 100
    
    # keep_group_level / remove_group_level track the heading depth at which
    # we last decided to keep or discard a group of sections
    keep_group_level = max_level
    remove_group_level = max_level
    nheadings, ncontents = [], []
    for heading, content in zip(headings, contents):
        
        # the heading text without the surrounding '=' markers
        plain_heading = " ".join(heading.split(" ")[1:-1])

        # the number of '=' signs gives the depth of the heading in the tree
        num_equals = len(heading.split(" ")[0])

        # a heading at or above the current keep level resets the keep level
        if num_equals <= keep_group_level:
            keep_group_level = max_level

        # headings deeper than a discarded heading belong to the discarded section
        if num_equals > remove_group_level:
            if (
                num_equals <= keep_group_level
            ):
                # skip sub-sections of a discarded section
                continue
        keep_group_level = max_level

        # if the heading is in the discard list, drop it and remember its depth
        if plain_heading in discard_categories:

            # sub-headings deeper than this level will also be removed
            remove_group_level = num_equals
            keep_group_level = max_level
            continue

        # keep this section: store the heading without the '=' markers
        nheadings.append(heading.replace("=", "").strip())

        # store the corresponding content and reset the remove level
        ncontents.append(content)
        remove_group_level = max_level

    # count the tokens of each section
    ncontent_ntokens = [
        count_tokens(c)
        + 3
        + count_tokens(" ".join(h.split(" ")[1:-1]))
        - (1 if len(c) == 0 else 0)
        for h, c in zip(nheadings, ncontents)
    ]

    # Create a tuple of (title, section_name, content, number of tokens); sections longer than max_len are shortened with reduce_long
    outputs += [(title, h, c, t) if t<max_len 
                else (title, h, reduce_long(c, max_len), count_tokens(reduce_long(c,max_len))) 
                    for h, c, t in zip(nheadings, ncontents, ncontent_ntokens)]
    
    return outputs
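
With extract_sections in place, the sections from all pages can be collected into a pandas data frame, filtered, and saved as CSV. The sketch below is a minimal example: it assumes a hypothetical pages list of (title, wiki_text) pairs produced by the extraction step from the previous blog, and the output file name is only an illustration. It also applies the 40-token lower limit mentioned above.

# Building the dataset (a minimal sketch).
# 'pages' is assumed to be a list of (title, wiki_text) pairs from the extraction step.
import pandas as pd

res = []
for title, wiki_text in pages:
    res += extract_sections(wiki_text, title)

# create the data frame with the columns title, heading, content and tokens
df = pd.DataFrame(res, columns=["title", "heading", "content", "tokens"])

# filter out any sections with fewer than 40 tokens and drop duplicate sections
df = df[df.tokens > 40]
df = df.drop_duplicates(["title", "heading"])
df = df.reset_index(drop=True)

# save the well-framed data in CSV format (example file name)
df.to_csv("wikipedia_sections.csv", index=False)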


As a continuation of this blog post, please read how to train the GPT-3 model using this data.
