
AI Generated Text Detection Synopsis

The document discusses two metrics for evaluating AI generated text: perplexity and burstiness. Perplexity measures how well a language model can predict the next word, while burstiness measures the variance in sentence lengths and structures. Using both metrics together provides a more comprehensive approach for detecting AI-generated text.


KEY FEATURES

- PERPLEXITY
- BURSTINESS

PERPLEXITY:
Perplexity is a measure used to evaluate the performance of language models. It refers to how well the model is able to predict the next word in a sequence of words. As you'll probably know by now, AI-generated text is procedurally generated, i.e. word by word: at each step, the model selects the next word from a set of K weighted candidate options.

Perplexity is based on the concept of entropy, the amount of chaos or randomness in a system. A lower perplexity score indicates that the language model is better at predicting the next word likely to occur in a given sequence, while a higher perplexity score indicates that the model is less accurate. In short, the lower the perplexity, the more predictable the text is, which indicates better generalization and performance.
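
In standard notation, perplexity is the exponentiated average negative log-likelihood that the model assigns to the N tokens of a text W:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_1, \ldots, w_{i-1}\right)\right)$$

The lower the value, the less "surprised" the model is by each successive word.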

As a really rough example, how do you think this sentence should end?
"I picked up the kids and dropped them off at…"
A language model with high perplexity might propose "icicle", "pensive", or "luminous" as answers. Those words don't make sense; it's word salad. Somewhere in the middle might be "the President's birthday party". It's highly unlikely but… I guess it might be plausible, on rare occasions? But a language model with low perplexity would propose something ordinary and expected, like "school" or "soccer practice".
For writing generic content that's intended to be standard or ordinary, lower perplexity is the safest bet.
BURSTINESS:

Burstiness measures how predictable a piece of content is by the homogeneity of the length and structure of sentences throughout the text. In some ways, burstiness is to phrases what perplexity is to words.

Whereas perplexity is the randomness or complexity of the word usage, burstiness is the variance of the sentences: their lengths, structures, and tempos. Real people tend to write in bursts and lulls: we naturally switch things up and write long sentences or short ones; we might get interested in a topic and run on, propelled by our own verbal momentum. Like I just did.

AI is more robotic: uniform and regular. It has a steady tempo, compared to our
creative spontaneity. We humans get carried away and improvise; that’s what
captures the reader’s attention and encourages them to keep reading.
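
As a minimal sketch, burstiness can be approximated as the variance of sentence lengths (the implementation later in this document works in the same spirit, applying np.var to per-line perplexity scores; the sentences below are illustrative):

python
import numpy as np

# Illustrative sentences with very different lengths
sentences = [
    "Short one.",
    "A much longer sentence that rambles on and on, propelled by its own verbal momentum.",
    "Back to short.",
]

# Variance of words-per-sentence: higher variance = burstier, more human-like
lengths = [len(s.split()) for s in sentences]
print(round(float(np.var(lengths)), 2))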
APPROACH:

A combined approach that leverages multiple metrics, specifically integrating perplexity and burstiness measures, offers a more comprehensive and robust method for detecting AI-generated text. By simultaneously evaluating different statistical properties of the text, this approach aims to enhance the accuracy of identification and classification of anomalies that may arise from automated text generation processes.

Utilizing Multiple Metrics:

In the context of AI detection, the term "metrics" refers to quantifiable measures that capture various aspects of the statistical characteristics of text data. Perplexity and burstiness are two such metrics that, when used in combination, contribute to a more nuanced and effective detection strategy.

Perplexity:
Perplexity is a metric commonly employed in natural language processing and
language modeling. It gauges the predictive power of a language model by
assessing how well it anticipates the occurrence of the next word in a sequence.
Lower perplexity scores indicate that the model finds the text easy to predict word by word; in a detection setting, this suggests a higher likelihood that the content was AI-generated.

Burstiness:
Burstiness, on the other hand, delves into the temporal distribution of terms in a
given text. A bursty pattern emerges when certain terms or phrases exhibit
clusters of consecutive repetitions. While burstiness is not inherently indicative
of AI generation, it can be a characteristic observed in specific types of
automated content creation.

Combined Approach Benefits:

The synergy between perplexity and burstiness metrics enables a more nuanced understanding of the underlying patterns within a body of text. While perplexity focuses on the coherence and predictability of the language model, burstiness sheds light on the occurrence of repetitive clusters, a characteristic that may be associated with certain AI generation methods.

Detection Method:

Establishing a robust detection method involves setting thresholds or ranges for both perplexity and burstiness metrics. These thresholds act as benchmarks against which the metrics of a given text sample are compared. Text samples falling outside these predefined thresholds are flagged as potential anomalies, indicating a deviation from expected statistical patterns.
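
As a minimal sketch, here is a thresholding rule in the spirit of the implementation later in this document, which treats mean perplexity at or below 55 as AI and between 55 and 80 as mixed authorship:

python
AI_THRESHOLD = 55        # at or below: likely AI-generated
HUMAN_AI_THRESHOLD = 80  # between the two thresholds: mixed authorship

def classify(perplexity):
    """Map a mean perplexity score to a predicted label."""
    if perplexity <= AI_THRESHOLD:
        return "AI"
    elif perplexity <= HUMAN_AI_THRESHOLD:
        return "Human + AI"
    return "Human"

print(classify(42))   # AI
print(classify(70))   # Human + AI
print(classify(95))   # Human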
PACKAGES

- NUMPY
- TRANSFORMERS
- TORCH
- RE
- STREAMLIT
- PDFPLUMBER
- PANDAS
- DOCUMENT (python-docx)

NUMPY:

NumPy, short for Numerical Python, is a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays. NumPy is a cornerstone library in the Python ecosystem, especially in fields like data science, machine learning, and numerical analysis.

Key features of NumPy include:

Arrays: NumPy's main object is the numpy.ndarray, a multi-dimensional array that can hold elements of the same data type. These arrays are more efficient for numerical operations than Python lists.
Mathematical Functions: NumPy provides a wide range of mathematical
functions that can be applied element-wise to arrays. These functions include
basic operations, linear algebra, statistical operations, and more.

Broadcasting: NumPy enables operations on arrays of different shapes and sizes through broadcasting, which automatically adjusts the dimensions to perform element-wise operations.

Random Module: The numpy.random module allows the generation of random numbers and samples. This is useful for tasks like simulating data or initializing parameters in machine learning models.

Integration with other Libraries: NumPy integrates well with other scientific
computing libraries, such as SciPy (Scientific Python), Matplotlib (plotting
library), and Pandas (data manipulation library).

Here's a simple example (a minimal sketch with illustrative values) demonstrating the creation of a NumPy array and some basic operations:
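
python
import numpy as np

# A minimal sketch; the array contents are illustrative
a = np.array([1, 2, 3, 4])
b = np.array([[1, 2], [3, 4]])

# Element-wise arithmetic
print(a * 2)               # [2 4 6 8]

# Basic statistics
print(a.mean(), a.sum())   # 2.5 10

# Broadcasting: a row vector is stretched across each row of the matrix
print(b + np.array([100, 200]))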

NumPy's efficiency, versatility, and ease of use make it a fundamental tool for numerical and scientific computing in Python.
TRANSFORMERS:

The "transformers" package in the context of natural language processing and machine learning is the Hugging Face Transformers library. This library is a popular open-source package that provides pre-trained models for natural language understanding and generation tasks, based on transformer architectures.

Key features of the Hugging Face Transformers library include:

Pre-trained Models: The library offers a wide range of pre-trained transformer models for tasks like text classification, named entity recognition, question-answering, translation, and more. Models like BERT, GPT, and T5 are included.

Model Pipelines: It provides easy-to-use pipelines for common NLP tasks, allowing users to quickly apply pre-trained models without dealing with the complexities of model implementation.

Tokenization: The library includes tokenizers for various transformer models, making it straightforward to encode and decode text for model input and output.

Fine-Tuning: Users can fine-tune pre-trained models on their specific tasks using the library's interfaces, adapting the models to domain-specific data.

Here's a simple example of using the Hugging Face Transformers library:
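
A minimal sketch using the pipeline API:

python
from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
# (the model choice is the library's default, not specified by this document)
classifier = pipeline("sentiment-analysis")

result = classifier("I love using pre-trained transformer models!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]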

Make sure to install the library using:

pip install transformers
TORCH:

The "torch" Python package typically refers to PyTorch, which is an open-


source machine learning library developed by Facebook. PyTorch is widely
used for various machine learning tasks, including deep learning, neural
network implementations, and scientific computing. It is known for its dynamic
computational graph, ease of use, and flexibility.

Key features of PyTorch include:

Dynamic Computational Graph: PyTorch uses dynamic computational graphs, which allows for more flexibility in model construction and easier debugging. This is in contrast to static computational graphs used by some other deep learning frameworks.

Tensors: PyTorch introduces the torch.Tensor data type, which is a multi-dimensional matrix similar to NumPy arrays. Tensors are the fundamental building blocks for creating neural networks.

Autograd: PyTorch's autograd (automatic differentiation) system allows for the automatic computation of gradients during backpropagation. This simplifies the training of neural networks.

Neural Network Module: The torch.nn module provides tools for building and
training neural networks. It includes predefined layers, loss functions, and
optimization algorithms.
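
As a minimal sketch of tensors and autograd working together (the values are illustrative):

python
import torch

# A tensor with gradient tracking enabled
x = torch.tensor([2.0, 3.0], requires_grad=True)

# A simple scalar function of x: y = sum(x^2)
y = (x ** 2).sum()

# Autograd computes dy/dx = 2x during backpropagation
y.backward()
print(x.grad)  # tensor([4., 6.])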
To use PyTorch, you can install it via:

pip install torch

PyTorch is a powerful framework for deep learning research and application development, offering a balance between flexibility and performance.
RE:

In Python, "re" is the regular expression module, which is part of the standard library. The re module provides support for regular expressions, allowing you to search, match, and manipulate strings based on specific patterns. Regular expressions are powerful tools for string manipulation and text processing.

Here's a simple example of using the re module to search for a pattern in a string:

python
import re

# Placeholder address (the original example email was obfuscated in this copy)
text = "Hello, my email is jane.doe@example.com. Please contact me."

# Search for an email address pattern
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
match = re.search(pattern, text)

if match:
    print("Email found:", match.group())
else:
    print("No email found.")

In this example, the regular expression pattern is designed to match email addresses.
STREAMLIT:

Streamlit is a Python library that allows you to create web applications for data
science and machine learning projects with minimal effort. It's designed to
simplify the process of turning data scripts into interactive web apps quickly.
Streamlit is particularly popular for its simplicity and ease of use.

Here are some key features and concepts related to Streamlit:

Simple Syntax: Streamlit scripts are typically concise and easy to understand.
You can create interactive apps with just a few lines of Python code.
Widgets: Streamlit provides various widgets that you can use to create
interactive elements in your app, such as sliders, buttons, and text inputs. These
widgets help users interact with your data or model.
Live Updates: Streamlit apps are reactive, meaning they automatically update
when the user interacts with widgets or when underlying data changes. You
don't need to explicitly manage the update process.
Data Integration: You can easily integrate data visualizations created with
libraries like Matplotlib, Plotly, or Altair into your Streamlit app.
Deployment: Deploying Streamlit apps is straightforward. You can deploy them
on platforms like Streamlit Sharing, Heroku, or AWS.
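
Here is a minimal sketch of a Streamlit app (the widget labels and defaults are illustrative):

python
import streamlit as st

st.title("Hello, Streamlit")

# Interactive widgets: the script re-runs automatically on each change
name = st.text_input("Your name", "world")
count = st.slider("Number of exclamation marks", 0, 5, 1)

st.write(f"Hello, {name}" + "!" * count)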

To run this Streamlit app, you'll need to have Streamlit installed. You can install
it using:

pip install streamlit

Then, save the script (e.g. as your_script.py) and run:

streamlit run your_script.py
This will launch a local development server, and you can view your app in a
web browser.
PDFPLUMBER:

pdfplumber is a powerful Python library designed for extracting information from PDF documents. Built on top of pdfminer.six, it serves as a high-level interface, making PDF data extraction more accessible for developers. In this comprehensive exploration, we'll delve into the key features, use cases, and practical examples of pdfplumber.

Key Features:

Text Extraction:

pdfplumber simplifies the extraction of text from PDFs using the extract_text() method. It efficiently handles various text layouts, making it suitable for a wide range of documents.

python
import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    print(text)
Table Extraction:

Extracting tabular data from PDFs is streamlined with pdfplumber. The extract_table() method simplifies the extraction process, providing a structured representation of tabular content.

python
with pdfplumber.open('table_example.pdf') as pdf:
    table_page = pdf.pages[0]
    table = table_page.extract_table()
    print(table)
Image Extraction:

pdfplumber facilitates image extraction, allowing users to retrieve images embedded in PDF documents. This can be useful in scenarios where visual content is crucial. The page.images property exposes metadata for each image; one way to save an image (a sketch, since pdfplumber does not hand back decoded image objects directly) is to crop the page to the image's bounding box and render it:

python
with pdfplumber.open('image_example.pdf') as pdf:
    image_page = pdf.pages[0]
    for i, img in enumerate(image_page.images):
        # Crop the page to the image's bounding box and render it to a PNG
        bbox = (img['x0'], img['top'], img['x1'], img['bottom'])
        image_page.crop(bbox).to_image(resolution=150).save(f'image_{i}.png')
Page Navigation:

The library provides intuitive methods for navigating through pages, enabling users to access and extract content from specific pages or page ranges.

python
with pdfplumber.open('example.pdf') as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text()
        print(f"Page {i + 1}:\n{text}\n")
Customizable Parsing:
pdfplumber offers flexibility in parsing PDFs. Users can customize extraction behavior by adjusting parameters, such as the character-grouping tolerances used for text extraction or table area definitions. For example, the x_tolerance and y_tolerance parameters control how characters are grouped into words and lines (the values below are illustrative):

python
with pdfplumber.open('custom_example.pdf') as pdf:
    custom_page = pdf.pages[0]
    # Tighter tolerances change how characters are grouped into words and lines
    custom_text = custom_page.extract_text(x_tolerance=1, y_tolerance=1)
    print(custom_text)
Use Cases:

Data Analysis:

For data analysts and scientists, pdfplumber is a valuable tool for extracting
relevant information from PDF reports, research papers, or financial statements.
Its ability to handle diverse document structures makes it versatile for data
extraction tasks.

Document Parsing:

Legal and business professionals often deal with PDF documents containing
crucial information. pdfplumber simplifies the process of extracting text and
tables, aiding in the automated parsing of legal documents, contracts, and
financial reports.
Automated Reporting:

In industries where automated reporting is essential, pdfplumber can be integrated into workflows to extract data from PDFs and generate structured reports. This is particularly useful for streamlining repetitive tasks.
Web Scraping:
Researchers and developers involved in web scraping may encounter PDF
documents containing relevant data. pdfplumber facilitates the extraction of text
and tables, enabling the integration of PDF content into web-based applications.
Practical Examples:

Extracting Text and Tables:

Suppose we have a PDF document containing text and a table. We can use pdfplumber to extract both.

python
with pdfplumber.open('text_and_table.pdf') as pdf:
    # Extract text from the first page
    text_page = pdf.pages[0]
    extracted_text = text_page.extract_text()
    print("Extracted Text:\n", extracted_text)

    # Extract table from the second page
    table_page = pdf.pages[1]
    extracted_table = table_page.extract_table()
    print("\nExtracted Table:\n", extracted_table)
Installation:

To use pdfplumber, you need to install it using:

pip install pdfplumber


PANDAS:

Pandas is a powerful and widely-used open-source data manipulation and analysis library for Python. Built on top of the NumPy library, Pandas provides data structures and functions needed to manipulate and analyze structured data seamlessly. The two primary data structures in Pandas are Series and DataFrame.

A Series is a one-dimensional array that can hold any data type. It is similar to a
column in a spreadsheet or a single-column table in a database. A DataFrame is
a two-dimensional table with labeled axes (rows and columns), akin to a
spreadsheet or SQL table. DataFrames are especially useful for handling
heterogeneous and tabular data.
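
As a minimal sketch of the two structures (the values are illustrative):

python
import pandas as pd

# A Series: a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # 20

# A DataFrame: a two-dimensional labeled table
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "score": [85, 92],
})
print(df[df["score"] > 90])  # filter rows by a condition
print(df["score"].mean())    # 88.5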

Pandas excels at handling data preprocessing tasks, such as cleaning, merging, reshaping, and aggregating data. It integrates well with other libraries in the Python data science ecosystem, making it a fundamental tool for tasks like exploratory data analysis, statistical analysis, and machine learning.

Pandas supports reading and writing data in various formats, including CSV,
Excel, SQL databases, and more. It offers a robust set of functionalities for
handling missing data, time series data, and categorical data.

Overall, Pandas is an essential tool for data professionals, analysts, and scientists, providing an efficient and intuitive way to manipulate, analyze, and visualize tabular data in Python. Its versatility and ease of use contribute to its widespread adoption in both industry and academia for data-centric tasks.
DOCUMENT:

To work with documents in the DOCX format in Python, the python-docx library is commonly used. This library allows you to create, modify, and extract information from Word documents (.docx files). Here's a brief guide on using python-docx to work with DOCX documents:

Installation:
You can install the python-docx library using the following command:

bash
pip install python-docx
Example: Reading Text from a DOCX Document:

python
from docx import Document

# Open the DOCX file
doc = Document('example.docx')

# Extract text from paragraphs
for paragraph in doc.paragraphs:
    print(paragraph.text)

In this example, we open a DOCX file called 'example.docx' and iterate through its paragraphs, printing the text of each paragraph.

Example: Creating a New DOCX Document:

python
from docx import Document

# Create a new document
new_doc = Document()

# Add a heading
new_doc.add_heading('My Document Heading', level=1)

# Add paragraphs
new_doc.add_paragraph('This is the first paragraph.')
new_doc.add_paragraph('This is the second paragraph.')

# Save the document
new_doc.save('new_document.docx')
CODE:

import numpy as np
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch
import re
import streamlit as st
import pdfplumber
import pandas as pd
import base64
from docx import Document
import streamlit.components.v1 as components

# Define the device, model, and tokenizer
device = "cpu"
# device = "mps"   # for Apple Silicon devices
# device = "cuda"  # for CUDA-supported devices

model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

max_length = 1024
stride = 256
ai_perplexity_threshold = 55
human_ai_perplexity_threshold = 80

def get_perplexity(sentence):
    """
    Calculate the perplexity of a given sentence using the GPT-2 model.
    """
    # Encode the sentence using the tokenizer
    input_ids = tokenizer.encode(
        sentence,
        add_special_tokens=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    ).to(device)

    total_nll = 0
    total_tokens = 0

    for start_pos in range(0, input_ids.shape[1], stride):
        # Determine the end position of the current sequence
        end_pos = min(start_pos + max_length, input_ids.shape[1])
        target_len = end_pos - start_pos

        # Create target_ids by detaching input_ids and filling non-target tokens with -100
        target_ids = input_ids[:, start_pos:end_pos].detach()
        target_ids[:, :-target_len].fill_(-100)

        # Compute the negative log likelihood loss
        outputs = model(input_ids[:, start_pos:end_pos], labels=target_ids)
        neg_log_likelihood = outputs.loss * target_len

        total_nll += neg_log_likelihood.sum()
        total_tokens += target_len

    if total_tokens == 0:
        # Assign infinite perplexity as a default value
        perplexity = float('inf')
    else:
        perplexity = round(float(torch.exp(total_nll / total_tokens)), 2)

    return perplexity

def analyze_text(sentence):
    """
    Analyze the given text and determine the perplexity and label of the text.
    """
    results = {}

    # Count the total number of valid characters in the sentence
    total_valid_char = sum(len(x)
                           for x in re.findall(r"[a-zA-Z0-9]+", sentence))

    if total_valid_char < 200:
        results["Label"] = -1
        results["Output"] = "Insufficient Content"
        results["Percent_ai"] = "-"
        results["Perplexity"] = "-"
        results["Burstiness"] = "-"

        return results

    # Split the sentence into lines based on punctuation and newlines
    lines = re.split(r'(?<=[.?!][ \[\(])|(?<=\n)\s*', sentence)
    lines = [line for line in lines if re.search(
        r"[a-zA-Z0-9]+", line) is not None]

    perplexities = []
    total_characters = 0
    ai_characters = 0
    for line in lines:
        total_characters += len(line)
        perplexity = get_perplexity(line)
        perplexities.append(perplexity)
        if perplexity < ai_perplexity_threshold:
            ai_characters += len(line)

    results["Percent_ai"] = str(
        round((ai_characters / total_characters) * 100, 2)) + "%"
    results["Perplexity"] = round(sum(perplexities) / len(perplexities), 2)
    results["Burstiness"] = round(np.var(perplexities), 2)

    if results["Perplexity"] <= ai_perplexity_threshold:
        results["Label"] = 0
        results["Output"] = "AI"
    elif results["Perplexity"] <= human_ai_perplexity_threshold:
        results["Label"] = 1
        results["Output"] = "Human + AI"
    else:
        results["Label"] = 2
        results["Output"] = "Human"

    return results

def process_text_file(file):
    """
    Process the input text file (PDF or Word) and analyze the content.
    """
    if file.type == "application/pdf":
        with pdfplumber.open(file) as pdf:
            text = ""
            for page in pdf.pages:
                extracted_text = page.extract_text()
                text += extracted_text if extracted_text is not None else ""
    elif file.type == ("application/vnd.openxmlformats-"
                       "officedocument.wordprocessingml.document"):
        doc = Document(file)
        text = ""
        for para in doc.paragraphs:
            text += para.text
    else:
        st.error("Unsupported file format. Please upload a PDF or Word document.")
        return

    results = analyze_text(text)
    return results

def main():
    st.set_page_config(page_title='ChatGPT - AI-powered text analysis')
    st.title("CheckGPT")
    st.write("CheckGPT is an AI-powered text analysis tool that predicts "
             "content generated by AI by evaluating the perplexity and "
             "burstiness scores of a GPT model, and provides insights for "
             "investigating text authenticity.")
    st.write("", unsafe_allow_html=True)

    # Create an empty placeholder for the uploaded files
    uploaded_files_placeholder = st.empty()

    results_list = []

    # Process the files only when the "Start" button is pressed
    uploaded_files = uploaded_files_placeholder.file_uploader(
        "Upload PDF or Word documents", type=["pdf", "docx"],
        accept_multiple_files=True)

    # Create a button to start processing
    start_button = st.button("Start Checking")

    st.markdown(
        """
        <style>
        .footer {
            position: fixed;
            bottom: 0;
            left: 0;
            width: 100%;
            text-align: center;
            padding: 10px;
            background-color: #0A2742;
            color: white;
        }
        </style>
        """,
        unsafe_allow_html=True
    )

    if start_button:
        with st.spinner("Processing..."):
            for uploaded_file in uploaded_files:
                results = process_text_file(uploaded_file)
                results["file_name"] = uploaded_file.name
                results_list.append(results)

        if results_list:
            df = pd.DataFrame(results_list)
            df = df[["file_name", "Percent_ai",
                     "Perplexity", "Burstiness", "Output"]]
            df = df.astype(str)
            df = df.rename(columns={"file_name": "File Name",
                                    "Percent_ai": "Predicted AI percent",
                                    "Perplexity": "Perplexity Score",
                                    "Output": "Predicted Output"})
            st.write("Results:")

            # Apply conditional formatting to the "Output" cell only
            df_styled = df.style.applymap(
                lambda value: "color: grey" if value == "Insufficient Content" else
                "color: green" if value == "Human" else
                "color: DarkOrange" if value == "Human + AI" else
                "color: red",
                subset=["Predicted Output"]
            )

            st.dataframe(df_styled)

            # Add a button to download the results as a CSV file
            csv_data = df.to_csv(index=False)
            b64 = base64.b64encode(csv_data.encode()).decode()
            href = (f'<a href="data:file/csv;base64,{b64}" '
                    f'download="results.csv">Download CSV</a>')
            st.markdown(href, unsafe_allow_html=True)

            # Display the description of columns and disclaimer
            st.markdown(
                """
                <div class="small-text">
                <strong>Column Descriptions:</strong><br>
                - <strong>Predicted AI percent:</strong> Percentage of the text predicted to be generated by AI.<br>
                - <strong>Perplexity Score:</strong> Measurement of the model's confidence in generating the text.<br>
                - <strong>Burstiness:</strong> Measurement of variation in perplexity scores for the analyzed text.<br>
                - <strong>Predicted Output:</strong> The predicted label for the text: 'AI', 'Human + AI', 'Human', or 'Insufficient Content'.<br><br>

                <strong>Disclaimer:</strong><br>
                These results are generated by an AI model and may not be 100% accurate. Please use them for investigation purposes and exercise caution when making decisions based on the results.
                </div>
                """,
                unsafe_allow_html=True
            )

if __name__ == "__main__":
    main()
RESULTS:
Screenshots of the application:
- Layout of Page
- Selection of Files
- Start Checking
- Final Output