Evaluation Challenges in Using
Generative AI for Science &
Technical Content
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Bradley Allen, Fina Polat, Xue Li, Daniel Daza
SemTech4STLD Workshop - ESWC 2025
Outline
• A use case & where we are today
• The challenges of evaluation for information extraction and knowledge
graph construction
• Some routes forward & maybe a bold idea
Using AI to
Study Standards
• Provenance working group:
• 8820 public emails,
• 666 issues,
• 600 wiki pages,
• 6000 Mercurial commits,
• 152 teleconferences
Standards are hard
The rationale of PROV. L Moreau, P Groth, J Cheney, T Lebo, S Miles
Web Semantics: Science, Services and Agents on the World Wide Web 35, 235-257
Standards are digital
Standard development leaves digital traces
New tools to analyze standards development
https://github.com/glasgow-ipl/ietfdata
https://github.com/datactive/bigbang
Nick Doty et al. https://github.com/IETF-Hackathon/ietf111-project-presentations/blob/main/ietf111-hackathon-bigbang.pdf
Questions one might like to ask
• Understand the content of email messages and their rhetorical
structure. (e.g. arguments were put forward but constantly ignored)
• Recover technical considerations and rationales behind the choices
made and ultimately documented in a standard
• More fine-grained quantitative and qualitative analysis
From: Michael Welzl, Stephan Oepen, Cezary Jaskula, Carsten Griwodz, and Safiqul Islam. 2021. Collaboration
in the IETF: an initial analysis of two decades in email discussions. SIGCOMM Comput. Commun. Rev. 51, 3
(July 2021), 29–32. DOI: https://doi.org/10.1145/3477482.3477488
Example uses of AI for standards analysis
From EUROCAE ED 133: FLIGHT OBJECT INTEROPERABILITY SPECIFICATION
Recognising entities in conversations
Predicting the success of a standard
Stephen McQuistin, Mladen Karan, Prashant Khare, Colin Perkins, Gareth Tyson, Matthew Purver, Patrick Healey, Waleed Iqbal, Junaid Qadir, and
Ignacio Castro. 2021. Characterising the IETF through the lens of RFC deployment. In Proceedings of the 21st ACM Internet Measurement
Conference (IMC '21). Association for Computing Machinery, New York, NY, USA, 137–149. DOI: https://doi.org/10.1145/3487552.3487821
Intelligent Interventions
Develop new natural language processing and machine learning
techniques to understand what’s going on within standards
development:
• How are people, organizations, topics, documents, priorities,
requirements, etc… connected?
• What are people and standards actually talking about?
Based on this understanding, develop intelligent tools to better
integrate public values.
Challenges in using AI for Standards Analysis
Email threads
https://lists.w3.org/Archives/Public/
● Long form conversations;
● Change of speaker;
● Lexical ambiguity;
● Specialized domain;
● Informal structures;
● Extensions across sessions;
● Lack of annotated data
● Complex entities
● Multiple perspectives
● Dynamic analyses
Xue Li, Sara Magliacane, and Paul Groth. 2021. The Challenges of Cross-Document Coreference
Resolution for Email. In Proceedings of the 11th on Knowledge Capture Conference
(K-CAP '21). Association for Computing Machinery, New York, NY, USA, 273–276.
DOI: https://doi.org/10.1145/3460210.3493573
Methods for building databases
of information from standards conversations
1 – knowledge graphs
Decoder-only representative large language models.
Source: S. Pan et al., Unifying Large Language Models and Knowledge Graphs: A Roadmap
https://arxiv.org/abs/2306.08302
LLMs and Generative AI
- Sustainability
- Security / Resilience
- Connecting the Unconnected
Evaluation
Challenges
The tale of SlotGAN
Daniel Daza, Michael Cochez, and Paul Groth. 2022. SlotGAN: Detecting Mentions in Text via Adversarial Distant
Learning. In Proceedings of the Sixth Workshop on Structured Prediction for NLP, pages 32–39, Dublin, Ireland.
Association for Computational Linguistics.
Relation Extraction & Instruction Tuning
Do Instruction-tuned Large Language Models Help with Relation Extraction?
Xue Li, Fina Polat and Paul Groth. LM-AKBC Workshop at ISWC 2023
https://ceur-ws.org/Vol-3577/paper15.pdf
Results on REBEL dataset
Results on Post-Hoc Human Eval
Can we preserve relation extraction performance
while preserving in-context capabilities?
Method: Instruction-tune the Dolly LLM with
LoRA using a relation extraction dataset
(REBEL)
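Instruction tuning on a relation extraction dataset first needs each annotated sentence recast as an instruction/response pair. The sketch below shows one way that conversion might look. The field names (`instruction`, `context`, `response`), the triple layout, and the example sentence are illustrative assumptions, not the format used in the paper or the actual REBEL linearization.

```python
import json


def to_instruction_record(text, triples):
    """Turn one sentence and its gold triples into an instruction/response pair.

    The record layout is a hypothetical choice made for this sketch.
    """
    # One triple per line keeps the target easy to parse back out of model output.
    response = "\n".join(f"({s}; {p}; {o})" for s, p, o in triples)
    return {
        "instruction": "Extract all (subject; relation; object) triples from the text.",
        "context": text,
        "response": response,
    }


# Made-up example sentence and triple, for illustration only.
record = to_instruction_record(
    "Amsterdam is the capital of the Netherlands.",
    [("Amsterdam", "capital of", "Netherlands")],
)
print(json.dumps(record, indent=2))
```

Records like this would then be fed to whatever fine-tuning setup is in use; the LoRA step itself is not shown here.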
▫ Prompt Engineering techniques:
▿ Zero-shot, one-shot, few-shot
▿ RAG - Retrieval Augmented Generation
▿ CoT - Chain of Thought
▿ CoT self consistency
▿ ReAct - Reasoning (e.g. chain-of-thought prompting) and Acting
(e.g. action plan generation)
▫ Polat F, Tiddi I, Groth P. Testing prompt engineering methods for knowledge
extraction from text. Semantic Web. 2025;16(2). doi:10.3233/SW-243719
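To make the difference between some of these techniques concrete, here is a minimal sketch of how zero-, one- and few-shot prompts, a chain-of-thought cue, and self-consistency voting might be assembled. The task wording, the `build_prompt` and `self_consistent` helpers, and the answer normalisation are assumptions made for illustration; they are not the prompts or code from the cited study.

```python
from collections import Counter

# Hypothetical task instruction for this sketch.
TASK = "Extract (subject; relation; object) triples from the text."


def build_prompt(text, demonstrations=(), chain_of_thought=False):
    """Build a prompt with zero, one, or several worked demonstrations."""
    parts = [TASK]
    for demo_text, demo_triples in demonstrations:
        parts.append(f"Text: {demo_text}\nTriples: {demo_triples}")
    # A chain-of-thought prompt asks for reasoning before the answer.
    suffix = "Let's think step by step." if chain_of_thought else "Triples:"
    parts.append(f"Text: {text}\n{suffix}")
    return "\n\n".join(parts)


def self_consistent(answers):
    """Return the most frequent answer among several sampled completions."""
    normalised = [a.strip().lower() for a in answers]
    return Counter(normalised).most_common(1)[0][0]


zero_shot = build_prompt("PROV-O is an ontology.")
few_shot = build_prompt(
    "PROV-O is an ontology.",
    demonstrations=[("A is part of B.", "(A; part of; B)"),
                    ("C wrote D.", "(C; author of; D)")],
)
```

With self-consistency, the same chain-of-thought prompt is sampled several times and `self_consistent` keeps the answer that most samples agree on.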
Test and compare Prompt Engineering for Knowledge Extraction
Open Information Extraction
Performance on RED-FM
Ontology Based Triple Assessment
Ontology Based
Assessment
Impressions
• Results appear to be really good qualitatively
• Annotation quality is varied
• Challenges in agreement
• Large scale is often automated
• Is everything in domain?
Routes
Forward
More complex tasks
User studies
E Papadopoulou. Retrieval Augmented Generation of Tabular Answers at Query
Time using Pre-trained Large Language Models. (2023)
https://scripties.uba.uva.nl/search?id=record_53599
LLMs as judges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. 2024.
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges. In Proceedings of
the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045,
Miami, Florida, USA. Association for Computational Linguistics.
LLMs as judges
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, and Ion Stoica. 2023. Judging
LLM-as-a-judge with MT-bench and Chatbot Arena. In
Proceedings of the 37th International Conference on
Neural Information Processing Systems (NIPS '23).
Curran Associates Inc., Red Hook, NY, USA, Article
2020, 46595–46623.
Agreement
Problem statement
• We will focus on how LLMs can be used to
support the evaluation of class membership
relations in a KG
• Class membership represents
classification schemes
• Classification schemes
• Crucial to knowledge infrastructures
• Implications for social policy and scientific
consensus
• Class membership is important for data
governance
• "providing a set of mappings from a
representation language to agreed-upon
concepts in the real world" [Khatri and Brown]
Allen, B.P., Groth, P.T. (2025). Evaluating Class Membership Relations in Knowledge Graphs Using
Large Language Models. In: Meroño Peñuela, A., et al. The Semantic Web: ESWC 2024 Satellite
Events. ESWC 2024. Lecture Notes in Computer Science, vol 15344. Springer, Cham.
https://doi.org/10.1007/978-3-031-78952-6_2
Class membership relation evaluation
by an LLM
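[Figure: pipeline diagram with the labels: domain knowledge in natural language (corpus C); pre-training; knowledge graph G; sampling; (e, instance-of, c); decision. The diagram also carries a formula that did not survive extraction intact: "= arg max L(T | (e, instance-of, o))".]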
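The evaluation set-up on this slide, where a class membership triple sampled from the KG is handed to an LLM that returns a decision, might be sketched as follows. The prompt wording, the `Answer: yes/no` convention, the helper names, and the stand-in `fake_llm` function are all hypothetical; the cited paper's actual prompts and classifier are not reproduced here.

```python
def build_membership_prompt(entity, description, cls, definition):
    """Ask whether `entity` is an instance of `cls`, requesting a rationale."""
    return (
        f"Class: {cls}\nDefinition: {definition}\n\n"
        f"Entity: {entity}\nDescription: {description}\n\n"
        "Is the entity an instance of the class? Give a short rationale, "
        "then end with a line 'Answer: yes' or 'Answer: no'."
    )


def parse_decision(completion):
    """Return (is_member, rationale); is_member is None if no answer line is found."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    # Scan from the end, since the answer line is expected to come last.
    for i in range(len(lines) - 1, -1, -1):
        low = lines[i].lower()
        if low.startswith("answer:"):
            verdict = low.split(":", 1)[1].strip()
            rationale = " ".join(lines[:i])
            if verdict.startswith("yes"):
                return True, rationale
            if verdict.startswith("no"):
                return False, rationale
    return None, " ".join(lines)


def evaluate_triple(triple, description, definition, llm):
    """Evaluate one (e, instance-of, c) triple with any callable prompt -> text."""
    entity, _, cls = triple
    prompt = build_membership_prompt(entity, description, cls, definition)
    return parse_decision(llm(prompt))


def fake_llm(prompt):
    # Stand-in for a real model call; always returns the same canned completion.
    return "The description calls it an OWL ontology.\nAnswer: yes"


decision, rationale = evaluate_triple(
    ("PROV-O", "instance-of", "ontology"),
    description="An OWL ontology for representing provenance.",
    definition="A formal specification of concepts and their relations.",
    llm=fake_llm,
)
```

Keeping the rationale alongside the decision is what makes a later error analysis possible, because a reviewer can see why the model disagreed with the KG.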
Performance metrics
• Classifiers can exhibit good alignment with KGs (Q1)
• One LLM was in moderate agreement (κ > 0.60) with Wikidata
• Four were in moderate agreement with CaLiGraph
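Agreement between a classifier's decisions and the labels in a KG is reported here as Cohen's kappa. As a reminder of what that statistic measures, the sketch below computes it from two label sequences in plain Python. The function name and the toy label lists are illustrative; they are not data from the study.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items on which both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters use a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels: 1 = "is a member of the class", 0 = "is not".
kg_labels = [1, 1, 1, 0]
llm_labels = [1, 1, 0, 0]
kappa = cohen_kappa(kg_labels, llm_labels)
```

Because kappa discounts agreement expected by chance, two sources can agree on most items and still score low when one label dominates.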
Error analysis results
• Error analysis based on review by one of the authors
• Review false negatives (FNs) and false positives (FPs) with their rationales and assign each error to the LLM or the KG
• LLM errors: incorrect reasoning, missing data
• KG errors: missing relation, incorrect relation
• Error analysis performed for gpt-4-0125-preview
• Classifiers can detect missing or incorrect relations (Q2)
• 40.9% of errors were due to problems with the KG
• 29.1% of errors were due to missing or insufficient data in the entity description
• 30.0% of errors were due to incorrect reasoning by the LLM
• Pairwise human-KG and human-LLM agreement differed between the KGs
• Human showed fair agreement with Wikidata and no agreement with the classifier
• Human showed slight agreement with the classifier and no agreement with CaLiGraph
Agents as Peers
• Rationales
• Based on provenance and
evidence
• Consensus formation
• Encoding consensus as sharable
knowledge (graphs)
Conclusion
• Gen AI allows for impressive capabilities for Scientific & Legal Content
• How do we know the results are good?
• Standard evaluations
• Approaches: complex tasks, user feedback, LLMs as judges
• consensus among peers - science!
Paul Groth | @pgroth | pgroth.com | indelab.org

Evaluation Challenges in Using Generative AI for Science & Technical Content

  • 1. Evaluation Challenges in Using Generative AI for Science & Technical Content Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Thanks to Bradley Allen, Fina Polat, Xue Li, Daniel Daza SemTech4STLD Workshop - ESWC 2025
  • 2. Outline • A use case & where we are today • The challenges of evaluation in for information extraction and knowledge graph construction • Some routes forward & maybe a bold idea
  • 3. Using AI to Study Standards
  • 4. • Provenance working group: • 8820 public emails, • 666 issues, • 600 wiki pages, • 6000 mercurial commits • 152 teleconferences Standards are hard The rationale of PROVL Moreau, P Groth, J Cheney, T Lebo, S Miles Web Semantics: Science, Services and Agents on the World Wide Web 35, 235-257
  • 7. New tools to analyze standards development https://github.com/glasgow-ipl/ietfdata https://github.com/datactive/bigbang
  • 8. Nick Doty et al. https://github.com/IETF-Hackathon/ietf111-project-presentations/blob/main/ietf111-hackathon-bigbang.pdf
  • 9. Questions one might like to ask • Understand the content of email messages and their rhetorical structure. (e.g. arguments were put forward but constantly ignored) • Recover technical considerations and rationales behind the choices made and ultimately documented in a standard • More fine-grained quantitative and qualitative analysis From: Michael Welzl, Stephan Oepen, Cezary Jaskula, Carsten Griwodz, and Safiqul Islam. 2021. Collaboration in the IETF: an initial analysis of two decades in email discussions. SIGCOMM Comput. Commun. Rev. 51, 3 (July 2021), 29–32. DOI:https://doi.org/10.1145/3477482.3477488
  • 10. Example uses of AI for standards analysis
  • 11. From EUROCAE ED 133: FLIGHT OBJECT INTEROPERABILITY SPECIFICATION Recognising entities in conversations
  • 12. Predicting the success of a standard Stephen McQuistin, Mladen Karan, Prashant Khare, Colin Perkins, Gareth Tyson, Matthew Purver, Patrick Healey, Waleed Iqbal, Junaid Qadir, and Ignacio Castro. 2021. Characterising the IETF through the lens of RFC deployment. In <i>Proceedings of the 21st ACM Internet Measurement Conference</i> (<i>IMC '21</i>). Association for Computing Machinery, New York, NY, USA, 137–149. DOI:https://doi.org/ 10.1145/3487552.3487821
  • 13. Intelligent Interventions Develop new natural language processing and machine learning techniques to understand what’s going on within standards development: • How are people, organizations, topics, documents, priorities, requirements, etc… connected? • What are people and standards actually talking about? Based on this understanding, develop intelligent tools to better integrate public values.
  • 14. Challenges in using AI for Standards Analysis 14 Email threads https://lists.w3.org/Archives/Public/ ● Long form conversations; ● Change of speaker; ● Lexical ambiguity; ● Specialized domain; ● Informal structures; ● Extensions across sessions; ● Lack of annotated data ● Complex entities ● Multiple perspectivies ● Dynamic analyses 15 Xue Li, Sara Magliacane, and Paul Groth. 2021. The Challenges of Cross-Document Coreference Resolution for Email. In <i>Proceedings of the 11th on Knowledge Capture Conference</i> (<i>K-CAP '21</i>). Association for Computing Machinery, New York, NY, USA, 273–276. DOI:https://doi.org/10.1145/3460210.3493573
  • 15. Methods for building databases of information from standards conversations 1 – knowledge graphs 1
  • 16. Decoder-only representative large language models. Source: S. Pan et al., Unifying Large Language Models and Knowledge Graphs: A Roadmap https://arxiv.org/abs/2306.08302 LLMs and Generative AI
  • 20. - Sustainability - Security / Resilience - Connecting the Unconnected
  • 22. The tale of SlotGan Daniel Daza, Michael Cochez, and Paul Groth. 2022. SlotGAN: Detecting Mentions in Text via Adversarial Distant Learning. In Proceedings of the Sixth Workshop on Structured Prediction for NLP, pages 32–39, Dublin, Ireland. Association for Computational Linguistics.
  • 23. Relation Extraction & Instruction Tuning Do Instruction-tuned Large Language Models Help with Relation Extraction? Xue Li, Fina Polat and Paul Groth. LM-AKBC Workshop at ISWC 2023 https://ceur-ws.org/Vol-3577/paper15.pdf Results on REBEL dataset Results on Post-Hoc Human Eval Can we preserve relation extraction performance while preserving in-context capabilities? Method: Instruction Tune Dolly LLM with LORA using a relation extraction dataset (REBEL)
  • 24. ▫ Prompt Engineering techniques: ▿ Zero-shot, one-shot, few-shot ▿ RAG - Retrieval Augmented Generation ▿ CoT - Chain of Thought ▿ CoT self-consistency ▿ ReAct - Reasoning (e.g. chain-of-thought prompting) and Acting (e.g. action plan generation) ▫ Polat F, Tiddi I, Groth P. Testing prompt engineering methods for knowledge extraction from text. Semantic Web. 2025;16(2). doi:10.3233/SW-243719 05.06.24 Test and compare Prompt Engineering for Knowledge Extraction
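The first and fourth techniques on this slide can be sketched in a few lines. This is a minimal illustration, not the templates from the cited paper: the instruction wording, the `build_prompt` and `self_consistent_answer` names, and the example sentences are all assumptions. The number of worked examples passed in determines whether the prompt is zero-shot, one-shot, or few-shot, and self-consistency is shown as a majority vote over the final answers of several sampled completions.

```python
from collections import Counter

# Hypothetical instruction wording, chosen only for illustration.
INSTRUCTION = "Extract (subject, relation, object) triples from the text."


def build_prompt(text, examples=()):
    """Build an extraction prompt.

    Zero-shot when `examples` is empty, one-shot with one example,
    few-shot with several. Each example is a (text, triples) pair.
    """
    parts = [INSTRUCTION]
    for example_text, example_triples in examples:
        parts.append(f"Text: {example_text}\nTriples: {example_triples}")
    # The final block is left open for the model to complete.
    parts.append(f"Text: {text}\nTriples:")
    return "\n\n".join(parts)


def self_consistent_answer(sampled_answers):
    """CoT self-consistency: keep the most frequent final answer."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(sampled_answers).most_common(1)[0][0]


zero_shot = build_prompt("PROV-O is a W3C Recommendation.")
one_shot = build_prompt(
    "PROV-O is a W3C Recommendation.",
    examples=[("Paris is the capital of France.", "(Paris, capital of, France)")],
)
voted = self_consistent_answer(["(a, r, b)", "(a, r, c)", "(a, r, b)"])
```

In practice the sampled answers would come from repeated model calls at a non-zero temperature; here they are hard-coded strings so the sketch runs without any model.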
  • 27. Ontology Based Triple Assessment
  • 29. Impressions • Results appear to be really good qualitatively • Annotation quality is varied • Challenges in agreement • Large scale is often automated • Is everything in domain?
  • 32. User studies E Papadopoulou. Retrieval Augmented Generation of Tabular Answers at Query Time using Pre-trained Large Language Models. (2023) https://scripties.uba.uva.nl/search?id=record_53599
  • 33. LLMs as judges Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. 2024. Leveraging Large Language Models for NLG Evaluation: Advances and Challenges. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045, Miami, Florida, USA. Association for Computational Linguistics.
  • 34. LLMs as judges Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 2020, 46595–46623.
  • 36. Problem statement • We will focus on how LLMs can be used to support the evaluation of class membership relations in a KG • Class membership represents classification schemes • Classification schemes • Crucial to knowledge infrastructures • Implications for social policy and scientific consensus • Class membership is important for data governance • "providing a set of mappings from a representation language to agreed-upon concepts in the real world" [Khatri and Brown] Allen, B.P., Groth, P.T. (2025). Evaluating Class Membership Relations in Knowledge Graphs Using Large Language Models. In: Meroño Peñuela, A., et al. The Semantic Web: ESWC 2024 Satellite Events. ESWC 2024. Lecture Notes in Computer Science, vol 15344. Springer, Cham. https://doi.org/10.1007/978-3-031-78952-6_2
  • 37. Class membership relation evaluation by an LLM [Diagram labels: domain knowledge in natural language; corpus C; arg max L(T | (e, instance-of, o)); knowledge graph G; pre-training; sampling; (e, instance-of, c); decision]
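The sampling and decision steps named on this slide can be sketched as below. This is an assumption-laden illustration rather than the authors' pipeline: the graph is a plain list of triples, `sample_membership_triples` and `flag_disagreements` are hypothetical helper names, and `stub_classifier` stands in for an LLM call so that the example runs offline.

```python
import random


def sample_membership_triples(graph, k, seed=0):
    """Sample up to k (entity, 'instance-of', class) triples from the graph."""
    triples = [t for t in graph if t[1] == "instance-of"]
    return random.Random(seed).sample(triples, min(k, len(triples)))


def flag_disagreements(triples, classify):
    """Return the sampled triples that the classifier rejects.

    `classify(entity, cls)` returns True when it judges the entity to be a
    member of the class. Every sampled triple is asserted by the KG, so a
    rejection is a KG/classifier disagreement worth a human review.
    """
    return [(e, p, c) for (e, p, c) in triples if not classify(e, c)]


def stub_classifier(entity, cls):
    """Hypothetical stand-in for an LLM decision; not a real model."""
    known = {("Amsterdam", "city"), ("PROV-O", "ontology")}
    return (entity, cls) in known


# Toy graph with one deliberately wrong class assertion.
toy_graph = [
    ("Amsterdam", "instance-of", "city"),
    ("PROV-O", "instance-of", "ontology"),
    ("PROV-O", "instance-of", "city"),
    ("Amsterdam", "located-in", "Netherlands"),
]
sampled = sample_membership_triples(toy_graph, k=3)
flagged = flag_disagreements(sampled, stub_classifier)
```

A disagreement does not say which side is wrong; that judgment is left to a reviewer.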
  • 38. Performance metrics • Classifiers can exhibit good alignment with KGs (Q1) • One LLM was in moderate agreement (κ > 0.60) with Wikidata • Four were in moderate agreement with CaLiGraph
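The κ reported here is an agreement statistic; assuming it is Cohen's kappa over binary membership decisions, it can be computed as in the sketch below. The function name and the toy label lists are illustrative assumptions and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: True means "entity is an instance of the class".
kg_labels = [True, True, True, True, False, False, False, False, True, False]
llm_labels = [True, True, True, False, False, False, False, True, True, False]
kappa = cohen_kappa(kg_labels, llm_labels)
```

On this toy data the two label lists agree on 8 of 10 items and chance agreement is 0.5, so kappa is approximately 0.6.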
  • 39. Error analysis results • Error analysis based on review by one of the authors • Reviewed FNs and FPs with their rationales and assigned each error to the LLM or the KG • LLM errors: incorrect reasoning, missing data • KG errors: missing relation, incorrect relation • Error analysis performed for gpt-4-0125-preview • Classifiers can detect missing or incorrect relations (Q2) • 40.9% of errors were due to problems with the KG • 29.1% of errors were due to missing or insufficient data in the entity description • 30.0% of errors were due to incorrect reasoning by the LLM • Pairwise human-KG and human-LLM agreement differed between the KGs • Human showed fair agreement with Wikidata and no agreement with the classifier • Human showed slight agreement with the classifier and no agreement with CaLiGraph
  • 40. Agents as Peers • Rationales • Based on provenance and evidence • Consensus formation • Encoding consensus as sharable knowledge (graphs)
  • 41. Conclusion • Gen AI allows for impressive capabilities for Scientific & Legal Content • How do we know the results are good? • Standard evaluations • Approaches: complex tasks, user feedback, LLMs as judges • consensus among peers - science! Paul Groth | @pgroth | pgroth.com | indelab.org