AI Generated Text Detection Synopsis
PERPLEXITY
BURSTINESS
PERPLEXITY :
Perplexity is a measure used to evaluate the performance of language models. It
refers to how well the model is able to predict the next word in a sequence of
words. As you’ll probably know by now, AI-generated text is produced
sequentially, i.e. word by word: at each step the model picks the next word from
the K most probable candidates, weighted by their probabilities.
As a really rough example, how do you think this sentence should end?
“I picked up the kids and dropped them off at…”
A language model with high perplexity might propose “icicle”, “pensive”, or
“luminous” as answers. Those words don’t make sense; it’s word salad.
Somewhere in the middle might be “the President’s birthday party”. It’s highly
unlikely but… I guess it might be plausible, on rare occasions?
But a language model with low perplexity would propose the ordinary,
predictable ending that most readers expect.
For writing generic content that’s intended to be standard or ordinary, lower
perplexity is the safest bet.
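The word-by-word selection described in this section can be sketched in a few lines. This is only an illustration, not how any particular model is implemented: the candidate endings and their weights are made-up numbers, and `sample_next_word` is a hypothetical helper name.

```python
import random


def sample_next_word(candidates, k, rng):
    """Pick one word from the k highest-weighted candidates.

    `candidates` maps each word to an illustrative (made-up) weight.
    """
    # Keep only the k most probable options
    top_k = sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:k]
    words = [word for word, _ in top_k]
    weights = [weight for _, weight in top_k]
    # Draw one word at random, in proportion to its weight
    return rng.choices(words, weights=weights, k=1)[0]


# Illustrative weights only; a real model would supply these values
candidates = {"school": 0.55, "daycare": 0.25, "practice": 0.15, "icicle": 0.05}
rng = random.Random(0)  # fixed seed so the run is repeatable
next_word = sample_next_word(candidates, k=3, rng=rng)
print(next_word)
```

With `k=3` the word-salad option "icicle" can never be chosen, because it falls outside the three most probable candidates. A smaller K makes the output more predictable, which is the same thing as lower perplexity.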
BURSTINESS :
Burstiness describes how much a text varies from one sentence to the next. AI is
more robotic: uniform and regular. It has a steady tempo, compared to our
creative spontaneity. We humans get carried away and improvise; that’s what
captures the reader’s attention and encourages them to keep reading.
APPROACH :
Perplexity:
Perplexity is a metric commonly employed in natural language processing and
language modeling. It gauges the predictive power of a language model by
assessing how well it anticipates the occurrence of the next word in a sequence.
A lower perplexity score means the model found the text easier to predict, and
this project treats such highly predictable text as more likely to be
AI-generated.
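As a worked illustration of the metric, perplexity is the exponential of the average negative log-probability that the model assigns to each token. The sketch below assumes the per-token probabilities are already available. The numbers are invented for illustration, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) for a list of token probabilities."""
    if not token_probs:
        return float("inf")  # nothing to score
    negative_log_probs = [-math.log(p) for p in token_probs]
    return math.exp(sum(negative_log_probs) / len(negative_log_probs))


# Made-up probabilities: the model is fairly sure of every token in the first
# sequence and often surprised by the tokens in the second
predictable = [0.5, 0.5, 0.5, 0.5]
surprising = [0.05, 0.1, 0.02, 0.2]

print(round(perplexity(predictable), 2))
print(round(perplexity(surprising), 2))
```

A sequence in which every token has probability 0.5 has a perplexity of 2: the model is, on average, as uncertain as if it were choosing between two equally likely words.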
Burstiness:
Burstiness, on the other hand, looks at how terms are distributed across a
given text. A bursty pattern emerges when certain terms or phrases appear in
clusters of repetitions close together. While burstiness is not inherently
indicative of AI generation, it can be a characteristic observed in specific
types of automated content creation.
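The script later in this document does not count repeated terms directly. It uses the variance of the per-sentence perplexity scores as its burstiness value. The sketch below shows that calculation with the standard library. The perplexity values are invented for illustration, and `burstiness` is a hypothetical helper name.

```python
import statistics


def burstiness(perplexities):
    """Population variance of per-sentence perplexities (the quantity numpy.var returns by default)."""
    if not perplexities:
        return 0.0
    return statistics.pvariance(perplexities)


# Made-up scores: evenly predictable sentences versus a mix of plain and surprising ones
uniform_text = [20.0, 21.0, 19.0, 20.0]
varied_text = [12.0, 95.0, 30.0, 150.0]

print(round(burstiness(uniform_text), 2))
print(round(burstiness(varied_text), 2))
```

Under this definition a steady, uniform text scores close to zero, while a text whose sentences swing between predictable and surprising scores much higher.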
Detection Method:
NUMPY
TRANSFORMERS
TORCH
RE
STREAMLIT
PDFPLUMBER
PANDAS
DOCUMENT
NUMPY:
Integration with other Libraries: NumPy integrates well with other scientific
computing libraries, such as SciPy (Scientific Python), Matplotlib (plotting
library), and Pandas (data manipulation library).
TORCH:
Neural Network Module: The torch.nn module provides tools for building and
training neural networks. It includes predefined layers and loss functions;
optimization algorithms are provided by the separate torch.optim module.
To use PyTorch, you can install it via:
bash
pip install torch
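RE:
The re module provides support for regular expressions, allowing you to search,
match, and manipulate strings based on specific patterns. Regular expressions
are powerful tools for string manipulation and text processing.
python
import re
# Sample text and pattern are illustrative; the original values were lost
text = "Contact us at support@example.com for details."
pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
match = re.search(pattern, text)
if match:
    print("Email found:", match.group())
else:
    print("No email found.")
In this example, the regular expression pattern is designed to match email
addresses.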
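In a detector like the one described here, a regular expression is also a convenient way to break a document into sentences before each one is scored. The pattern below is an assumption chosen for illustration, not necessarily the pattern the original script used, and `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, dropping empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


sample = "AI text is uniform. Human text varies!  Does this matter? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

The lookbehind `(?<=[.!?])` keeps the punctuation attached to each sentence, so only the whitespace between sentences is consumed by the split. A simple rule like this will also split after abbreviations such as "e.g.", so it is a rough first pass only.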
STREAMLIT:
Streamlit is a Python library that allows you to create web applications for data
science and machine learning projects with minimal effort. It's designed to
simplify the process of turning data scripts into interactive web apps quickly.
Streamlit is particularly popular for its simplicity and ease of use.
Simple Syntax: Streamlit scripts are typically concise and easy to understand.
You can create interactive apps with just a few lines of Python code.
Widgets: Streamlit provides various widgets that you can use to create
interactive elements in your app, such as sliders, buttons, and text inputs. These
widgets help users interact with your data or model.
Live Updates: Streamlit apps are reactive, meaning they automatically update
when the user interacts with widgets or when underlying data changes. You
don't need to explicitly manage the update process.
Data Integration: You can easily integrate data visualizations created with
libraries like Matplotlib, Plotly, or Altair into your Streamlit app.
Deployment: Deploying Streamlit apps is straightforward. You can deploy them
on platforms like Streamlit Community Cloud (formerly Streamlit Sharing),
Heroku, or AWS.
To run a Streamlit app, you'll need to have Streamlit installed. You can install
it using:
bash
pip install streamlit
PDFPLUMBER:
Key Features:
Text Extraction:
pdfplumber simplifies the extraction of text from PDFs using the extract_text()
method. It efficiently handles various text layouts, making it suitable for a wide
range of documents.
python
import pdfplumber
The library provides intuitive methods for navigating through pages, enabling
users to access and extract content from specific pages or page ranges.
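Page navigation can be sketched as follows. This is a minimal example under stated assumptions: `join_page_texts` and `extract_page_range` are hypothetical helper names, "report.pdf" is a placeholder file name, and pdfplumber is imported inside the function so that the text-joining helper can be tried without the library installed.

```python
def join_page_texts(page_texts):
    """Join per-page strings, treating pages with no extractable text (None) as empty."""
    return "\n".join(text if text is not None else "" for text in page_texts)


def extract_page_range(path, first_page, last_page):
    """Return the text of pages first_page..last_page (1-based, inclusive)."""
    import pdfplumber  # third-party; imported lazily, see the note above

    with pdfplumber.open(path) as pdf:
        # pdf.pages is a list, so ordinary slicing selects a page range
        selected = pdf.pages[first_page - 1:last_page]
        return join_page_texts(page.extract_text() for page in selected)


# Placeholder file name; uncomment to run against a real PDF:
# text = extract_page_range("report.pdf", first_page=2, last_page=4)

print(join_page_texts(["Page one", None, "Page three"]))
```

Pages that contain only images may yield no text at all, which is why the helper substitutes an empty string for `None`.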
Data Analysis:
For data analysts and scientists, pdfplumber is a valuable tool for extracting
relevant information from PDF reports, research papers, or financial statements.
Its ability to handle diverse document structures makes it versatile for data
extraction tasks.
Document Parsing:
Legal and business professionals often deal with PDF documents containing
crucial information. pdfplumber simplifies the process of extracting text and
tables, aiding in the automated parsing of legal documents, contracts, and
financial reports.
Automated Reporting:
Suppose we have a PDF document containing text and a table. We can use
pdfplumber to extract both.
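A minimal sketch of that extraction, under stated assumptions: `extract_text_and_table` and `table_to_records` are hypothetical helper names, only the first page is read, and the first row of the table is assumed to be its header. pdfplumber is imported inside the function so that the row-conversion helper can be tried on its own.

```python
def table_to_records(table):
    """Turn rows of cells (first row = header) into a list of dicts; None cells become ''."""
    if not table:
        return []
    header = [cell if cell is not None else "" for cell in table[0]]
    return [
        {name: (cell if cell is not None else "") for name, cell in zip(header, row)}
        for row in table[1:]
    ]


def extract_text_and_table(path):
    """Return (text, records) for the first page of the PDF at `path`."""
    import pdfplumber  # third-party; imported lazily, see the note above

    with pdfplumber.open(path) as pdf:
        first_page = pdf.pages[0]
        text = first_page.extract_text() or ""
        table = first_page.extract_table()  # None when no table is detected
    return text, table_to_records(table)


# Made-up rows in the shape extract_table() returns: a list of rows of cell strings
rows = [["Item", "Price"], ["Pen", "1.50"], ["Ink", None]]
print(table_to_records(rows))
```

Converting the rows to dictionaries keyed by the header makes it straightforward to pass the result on to a Pandas DataFrame.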
Extracting images from a PDF document and saving them as separate files.
PANDAS:
A Series is a one-dimensional array that can hold any data type. It is similar to a
column in a spreadsheet or a single-column table in a database. A DataFrame is
a two-dimensional table with labeled axes (rows and columns), akin to a
spreadsheet or SQL table. DataFrames are especially useful for handling
heterogeneous and tabular data.
Pandas supports reading and writing data in various formats, including CSV,
Excel, SQL databases, and more. It offers a robust set of functionalities for
handling missing data, time series data, and categorical data.
DOCUMENT:
Installation:
You can install the python-docx library using the following command:
bash
pip install python-docx
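Example: Reading Text from a DOCX Document:
python
from docx import Document
# "example.docx" is a placeholder path; the original reading snippet was lost
doc = Document("example.docx")
print("\n".join(paragraph.text for paragraph in doc.paragraphs))
Example: Creating a DOCX Document:
python
new_doc = Document()  # assumed: the original definition of new_doc is missing
# Add a heading
new_doc.add_heading('My Document Heading', level=1)
# Add paragraphs
new_doc.add_paragraph('This is the first paragraph.')
new_doc.add_paragraph('This is the second paragraph.')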
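CODE:
python
import numpy as np
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch
import re
import streamlit as st
import pdfplumber
import pandas as pd
import base64
from docx import Document
import streamlit.components.v1 as components

# Run on the GPU when one is available; `device` is used below but was never defined
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
max_length = 1024  # GPT-2 context window, in tokens
stride = 256  # step between successive scoring windows
ai_perplexity_threshold = 55  # average perplexity at or below this is labeled "AI"
human_ai_perplexity_threshold = 80  # at or below this is labeled "Human + AI"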
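def get_perplexity(sentence):
    """
    Calculate the perplexity of a given sentence using the GPT-2 model.
    """
    # Encode the sentence using the tokenizer
    input_ids = tokenizer.encode(
        sentence,
        add_special_tokens=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    ).to(device)
    total_nll = 0
    total_tokens = 0
    # NOTE: the scoring loop is missing from the source listing. The loop below is
    # a reconstruction (a standard strided evaluation), not the original code.
    seq_len = input_ids.size(1)
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # tokens that have not been scored yet
        window = input_ids[:, begin:end]
        target_ids = window.clone()
        target_ids[:, :-target_len] = -100  # ignore tokens that only give context
        with torch.no_grad():
            neg_log_likelihood = model(window, labels=target_ids).loss * target_len
        total_nll += neg_log_likelihood.sum()
        total_tokens += target_len
        prev_end = end
        if end == seq_len:
            break
    if total_tokens == 0:
        # Assign infinity perplexity as a default value
        perplexity = float('inf')
    else:
        perplexity = round(float(torch.exp(total_nll / total_tokens)), 2)
    return perplexity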
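def analyze_text(sentence):
    """
    Analyze the given text and determine the perplexity and label of the text.
    """
    results = {}
    # NOTE: the splitting and scoring steps are missing from the source listing.
    # The lines down to ai_characters are an assumed reconstruction: split on
    # sentence-ending punctuation, score each piece, and count the characters in
    # pieces whose perplexity is at or below the AI threshold.
    lines = [s for s in re.split(r"(?<=[.?!])\s+", sentence) if re.search(r"\w", s)]
    perplexities = [get_perplexity(line) for line in lines]
    if not perplexities:
        return results
    total_characters = sum(len(line) for line in lines)
    ai_characters = sum(
        len(line) for line, ppl in zip(lines, perplexities)
        if ppl <= ai_perplexity_threshold
    )
    results["Percent_ai"] = str(
        round((ai_characters/total_characters)*100, 2))+"%"
    results["Perplexity"] = round(sum(perplexities) / len(perplexities), 2)
    results["Burstiness"] = round(np.var(perplexities), 2)
    if results["Perplexity"] <= ai_perplexity_threshold:
        results["Label"] = 0
        results["Output"] = "AI"
    elif results["Perplexity"] <= human_ai_perplexity_threshold:
        results["Label"] = 1
        results["Output"] = "Human + AI"
    else:
        results["Label"] = 2
        results["Output"] = "Human"
    return results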
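def process_text_file(file):
    """
    Process the input text file (PDF or Word) and analyze the content.
    """
    text = ""
    if file.type == "application/pdf":
        with pdfplumber.open(file) as pdf:
            for page in pdf.pages:
                extracted_text = page.extract_text()
                text += extracted_text if extracted_text is not None else ""
    else:
        # Assumed branch: the Word handling is missing from the source listing.
        # Any non-PDF upload is treated as a .docx file and read with python-docx.
        text = "\n".join(p.text for p in Document(file).paragraphs)
    results = analyze_text(text)
    return results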
def main():
    st.set_page_config(page_title='ChatGPT - AI-powered text analysis')
    st.title("CheckGPT")
    st.write(
        "CheckGPT is an AI-powered text analysis tool that predicts the "
        "content generated by AI by evaluating the perplexity and burstiness scores of "
        "GPT model, and provides insights for investigating text authenticity."
    )
    st.write("", unsafe_allow_html=True)
    # Create an empty placeholder for the uploaded files
    uploaded_files_placeholder = st.empty()
    results_list = []
    # The HTML below is kept flush left so Markdown does not treat it as a code block
    st.markdown(
        """
<style>
.footer {
  position: fixed;
  bottom: 0;
  left: 0;
  width: 100%;
  text-align: center;
  padding: 10px;
  background-color: #0A2742;
  color: white;
}
</style>
""",
        unsafe_allow_html=True
    )
    # NOTE: the uploader and button definitions are missing from the source listing;
    # the two widgets below, including their label text, are assumed.
    uploaded_files = uploaded_files_placeholder.file_uploader(
        "Upload files", type=["pdf", "docx"], accept_multiple_files=True)
    start_button = st.button("Start Checking")
    if start_button:
        with st.spinner("Processing..."):
            for uploaded_file in uploaded_files:
                results = process_text_file(uploaded_file)
                results["file_name"] = uploaded_file.name
                results_list.append(results)
    if results_list:
        df = pd.DataFrame(results_list)
        df = df[["file_name", "Percent_ai",
                 "Perplexity", "Burstiness", "Output"]]
        df = df.astype(str)
        df = df.rename(columns={"file_name": "File Name",
                                "Percent_ai": "Predicted AI percent",
                                "Perplexity": "Perplexity Score",
                                "Output": "Predicted Output"})
        # NOTE: the styling step that produced df_styled is missing from the
        # source listing; the plain DataFrame is shown instead.
        df_styled = df
        st.write("Results:")
        st.dataframe(df_styled)
    # The opening of this call was lost; the footer class comes from the CSS above
    st.markdown(
        """
<div class="footer">
<strong>Disclaimer:</strong><br>
These results are generated by an AI model and may not be 100%
accurate. Please use them for investigation purposes and exercise caution when
making decisions based on the results.
</div>
""",
        unsafe_allow_html=True
    )


if __name__ == "__main__":
    main()
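To see how the two thresholds turn an average perplexity score into a label, the rule from analyze_text can be exercised on its own. The sketch below copies the threshold values from the listing above. `label_for` is a hypothetical helper name and the sample scores are made up.

```python
AI_PERPLEXITY_THRESHOLD = 55        # same value as ai_perplexity_threshold
HUMAN_AI_PERPLEXITY_THRESHOLD = 80  # same value as human_ai_perplexity_threshold


def label_for(average_perplexity):
    """Return (label, output) using the same comparisons as analyze_text."""
    if average_perplexity <= AI_PERPLEXITY_THRESHOLD:
        return 0, "AI"
    if average_perplexity <= HUMAN_AI_PERPLEXITY_THRESHOLD:
        return 1, "Human + AI"
    return 2, "Human"


# Made-up average perplexity scores, including one that sits exactly on a threshold
for score in (42.0, 55.0, 67.3, 120.5):
    print(score, label_for(score))
```

Both comparisons are inclusive, so a score of exactly 55 is labeled "AI" and a score of exactly 80 is labeled "Human + AI".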
RESULTS:
Layout of Page
Selection of Files
Start Checking
Final Output