Generating Synthetic Data With LLMs For Fine-tuning
Generating synthetic data with LLMs removes the need to manually collect, annotate, and review large datasets, and the resulting data can be used to fine-tune smaller language models for fast local inference.
Imagine staring down an 80TB compressed dataset. That’s exactly where I found myself when I decided to explore the Common Crawl dataset. My first task? To classify the intent of each webpage. Simple enough in theory, but when you’re dealing with a dataset that makes most hard drives weep, the challenges stack up fast.
The Common Crawl Conundrum
For those unfamiliar, Common Crawl is a nonprofit that creates and maintains an open repository of web crawl data that can be accessed and analyzed by anyone. It’s a goldmine of information, but it’s also a behemoth that demands respect — and a solid game plan.
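For context, the crawl is distributed as gzipped WARC archives. Below is a minimal sketch of what streaming pages out of a single archive looks like, using the warcio library (my choice for illustration, not something prescribed by Common Crawl):

# Illustrative only: one way to stream HTML records out of a single
# Common Crawl WARC file using the warcio library.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path):
    """Yield (url, html) pairs for HTML responses in a WARC file."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            content_type = record.http_headers.get_header('Content-Type') or ''
            if 'text/html' not in content_type:
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            html = record.content_stream().read().decode('utf-8', errors='replace')
            yield url, html

Multiply that loop across tens of thousands of archive files per crawl and the scale of the problem becomes clear.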
Calling the OpenAI or Anthropic APIs for every page was not viable: it would be far too slow and far too costly.
The challenges:
- 80TB+ of compressed data to process
- Significant latency for each API call
- Costs that would make even the most generous budget blush
- Rate limits
I needed a solution that was fast, cost-effective, and could scale to handle the enormity of Common Crawl. What if I could create a model that had the smarts of GPT-4 but could run locally at lightning speed?
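To put rough numbers on that (every figure here is an assumption for illustration, not a measurement): a crawl of this size contains on the order of billions of pages, so even optimistic per-page costs blow up quickly.

# Back-of-the-envelope only -- every number below is an assumption.
pages = 3_000_000_000        # assumed page count for a crawl of this scale
cost_per_call = 0.002        # assumed cost in USD per hosted-LLM classification call
seconds_per_call = 1.0       # assumed average latency per API call

api_cost = pages * cost_per_call                  # ~$6,000,000
serial_days = pages * seconds_per_call / 86_400   # ~35,000 days if run one call at a time

print(f"API cost: ~${api_cost:,.0f}; serial runtime: ~{serial_days:,.0f} days")

However you shuffle the parallelism, those per-call figures are what a local model lets you walk away from.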
The Lightbulb Moment: BERT Meets GPT-4
I decided to combine the power of GPT-4o for generating high-quality synthetic data with the efficiency of a fine-tuned BERT model.
The plan was simple:
1. Use GPT-4 to generate a diverse, high-quality dataset of synthetic web pages
2. Fine-tune a BERT model on this synthetic data
3. Deploy the fine-tuned model to classify the Common Crawl dataset locally
This approach offered several advantages:
- No need for API calls during classification, drastically reducing latency
- A one-time cost for data generation and training, rather than per-page API fees
- The ability to process the entire dataset on local hardware, solving scalability issues
In this tutorial, I’m going to walk you through this process step by step. Whether you’re grappling with your own massive dataset or just curious about pushing the boundaries of what’s possible with AI and big data, this guide is for you.
Let’s get started!
The Solution: GPT-4 Meets BERT
By leveraging the latest advancements in GPT-4, we can generate synthetic HTML data that’s not only realistic but also diverse. This approach allows us to create a custom model for classifying web pages without the need for a large, manually labeled dataset. Let me walk you through the process I’ve developed.
Step 1: Generating Synthetic HTML Data with GPT-4o
First things first, let’s set up our environment. You’ll need to install the OpenAI library:
pip install openai
We’re going to use GPT-4 to generate our synthetic HTML data.
import random
import json
import os
import logging
from openai import OpenAI
from datetime import datetime
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Add your key to env vars
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
def generate_synthetic_html(page_type):
    prompt = f"""
    Generate an HTML page for a {page_type} website. Make it realistic and diverse.
    Include appropriate tags, headings, and content based on the page type.
    For an informational page, include a main question and its answer.
    For a navigational page, include a navigation menu and some basic content.
    For a commercial page, include product information and a call-to-action.
    For a transactional page, include a form or a clear action for the user to take.
    Return the result as a JSON object with two keys: 'html' for the HTML content, and 'summary' for a brief description of the page.
    """
    logging.info(f"Generating synthetic HTML for {page_type} page")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a web developer creating diverse and realistic HTML pages."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=3000,
        n=1,
        response_format={"type": "json_object"},
        stop=None,
        temperature=0.8
    )
    content = response.choices[0].message.content
    logging.info(content)
    return json.loads(content)

def generate_dataset(num_samples):
    page_types = ["informational", "navigational", "commercial", "transactional"]
    dataset = []
    for i in range(num_samples):
        page_type = random.choice(page_types)
        logging.info(f"Generating sample {i+1}/{num_samples}")
        result = generate_synthetic_html(page_type)
        dataset.append((result['html'], page_type, result['summary']))
    return dataset

def write_dataset_to_file(dataset, filename):
    logging.info(f"Writing dataset to file: {filename}")
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)
    logging.info(f"Dataset successfully written to {filename}")

def main():
    num_samples = 10  # Replace with your desired count
    logging.info(f"Starting synthetic data generation for {num_samples} samples")
    dataset = generate_dataset(num_samples)
    # Create a timestamp for the filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"synthetic_dataset_{timestamp}.json"
    write_dataset_to_file(dataset, filename)
    logging.info("Synthetic data generation complete")

if __name__ == "__main__":
    main()
This script uses GPT-4 to create HTML content based on different page types, giving us a rich, diverse dataset to work with.
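When num_samples gets large, two things will eventually bite: the API's rate limits and the occasional response that isn't valid JSON (json.loads will raise). The script above doesn't guard against either; a small wrapper along these lines (my own sketch, not part of the original script) is one way to handle both:

import time

def generate_with_retries(page_type, max_attempts=3, backoff_seconds=5):
    """Call generate_synthetic_html, retrying on API errors or malformed JSON."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = generate_synthetic_html(page_type)
            # Basic sanity check on the expected keys before accepting the sample
            if 'html' in result and 'summary' in result:
                return result
            logging.warning("Response missing expected keys, retrying")
        except Exception as e:  # rate-limit errors, timeouts, json.JSONDecodeError, ...
            logging.warning(f"Attempt {attempt} failed: {e}")
        time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RuntimeError(f"Could not generate a valid {page_type} sample after {max_attempts} attempts")

Swapping generate_synthetic_html(page_type) for generate_with_retries(page_type) inside generate_dataset leaves the rest of the script unchanged.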
Step 2: Preparing the Data for BERT
Now that we have our synthetic data, it’s time to prepare it for BERT. Here’s where the magic happens:
pip install transformers[torch] torch datasets
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import json
import glob
import os
# Load the most recent dataset
def load_most_recent_dataset():
    list_of_files = glob.glob('synthetic_dataset_*.json')
    if not list_of_files:
        raise FileNotFoundError("No synthetic dataset files found.")
    latest_file = max(list_of_files, key=os.path.getctime)
    with open(latest_file, 'r', encoding='utf-8') as f:
        return json.load(f)
# Load the dataset
dataset = load_most_recent_dataset()
# Unpack our dataset
html_texts, labels, summaries = zip(*dataset)
label_map = {"informational": 0, "navigational": 1, "commercial": 2, "transactional": 3}
encoded_labels = [label_map[label] for label in labels]
# Create a Hugging Face Dataset
hf_dataset = Dataset.from_dict({
"text": html_texts,
"summary": summaries,
"label": encoded_labels
})
# Now, let's tokenize our data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    combined_text = [f"{html}\n\nSummary: {summary}" for html, summary in zip(examples["text"], examples["summary"])]
    # bert-base-uncased can handle at most 512 tokens, so truncate to that limit
    tokenized = tokenizer(combined_text, padding="max_length", truncation=True, max_length=512)
    tokenized["labels"] = examples["label"]
    return tokenized
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, remove_columns=hf_dataset.column_names)
# Split into training and validation sets
train_val_dataset = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_val_dataset["train"]
val_dataset = train_val_dataset["test"]
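One practical caveat before training: bert-base-uncased can only attend to 512 tokens (which is why max_length is capped at 512 above), and raw HTML spends most of that budget on tags and attributes. If the model underperforms, one option, sketched below with BeautifulSoup (my own addition, not part of the pipeline above), is to reduce pages to their visible text before tokenizing:

from bs4 import BeautifulSoup

def html_to_text(html):
    """Reduce an HTML document to its visible text so more of the page
    fits inside BERT's 512-token window."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop script/style blocks, which carry no classification signal
    for tag in soup(['script', 'style']):
        tag.decompose()
    return ' '.join(soup.get_text(separator=' ').split())

# Example: apply before building the Hugging Face Dataset
# html_texts = [html_to_text(html) for html in html_texts]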
Step 3: Fine-tuning BERT
Here’s where it all comes together. We’re going to fine-tune BERT on our synthetic dataset:
# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    remove_unused_columns=False,
)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator
)
# Fine-tune the model
trainer.train()
# Save our fine-tuned model
model.save_pretrained("./fine_tuned_bert_page_intent")
tokenizer.save_pretrained("./fine_tuned_bert_page_intent")
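Once the model and tokenizer are saved, classification needs no API calls at all. Here is a minimal local inference sketch, assuming the save directory above and the same label map from Step 2:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned artifacts saved above
model = BertForSequenceClassification.from_pretrained("./fine_tuned_bert_page_intent")
tokenizer = BertTokenizer.from_pretrained("./fine_tuned_bert_page_intent")
model.eval()

id_to_label = {0: "informational", 1: "navigational", 2: "commercial", 3: "transactional"}

def classify_page(text):
    """Return the predicted intent label for a single page's text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return id_to_label[int(logits.argmax(dim=-1))]

print(classify_page("<html><body><h1>Buy our new running shoes today!</h1></body></html>"))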
The Result: A BERT Model That Understands Web Pages
After running this process, we end up with a BERT model fine-tuned on high-quality, diverse synthetic data. This model has the potential to classify web pages as informational, navigational, commercial, or transactional, entirely on local hardware.
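Before pointing the model at the real crawl, it's worth at least checking accuracy on the held-out synthetic split. A quick sketch, reusing the trainer and val_dataset objects from Step 3 (the number you get will depend entirely on your generated data):

import numpy as np

# Predict on the validation split held out in Step 2
predictions = trainer.predict(val_dataset)
preds = np.argmax(predictions.predictions, axis=-1)
accuracy = (preds == predictions.label_ids).mean()
print(f"Validation accuracy on synthetic data: {accuracy:.2%}")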
If you liked the content, be sure to follow me. You can also find me on Twitter, and feel free to reach out.