0% found this document useful (0 votes)

45 views22 pages

Beyond Fixed Taxonomies - Zero-Shot Classification and Automated Category Consolidation - by Aimen Louafi - Inside Doctrine - Nov, 2024 - Medium

Uploaded by

陳賢明

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

45 views22 pages

Beyond Fixed Taxonomies - Zero-Shot Classification and Automated Category Consolidation - by Aimen Louafi - Inside Doctrine - Nov, 2024 - Medium

Uploaded by

陳賢明

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Get unlimited access to the best of Medium for less than $1/week. Become a member

Open in app

Beyond FixedSearch
Taxonomies: Zero-shot 44

Classification and Automated Category

Consolidation
Aimen Louafi · Follow
Published in Inside Doctrine
11 min read · 2 days ago

Listen Share More

by Aïmen Louafi and Julien Perrin

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 1/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

In today’s corporate landscape, the analysis of annual/shareholders general

meetings minutes (referred to as AGM) represents a critical yet complex task in
corporate governance and legal compliance. These documents, often containing
hundreds of resolutions across multiple meetings and companies, hold valuable
insights into corporate decision-making, governance practices, and strategic
directions. However, extracting and categorizing this information systematically has
remained a significant challenge — until now.

This blog post presents a systematic approach to analyzing Annual General Meeting
(AGM) resolutions through three key stages. We begin by examining the context and
inherent challenges of processing thousands of diverse legal documents. We then
explore automated classification techniques for categorizing resolutions. Finally, we
address the critical challenge of consolidating similar categories while preserving
important legal distinctions, testing and comparing various approaches to find an
optimal solution.

Understanding the Context

Annual general meeting minutes (AGMs) contain various types of resolutions, from
routine matters like dividend approvals to complex strategic decisions.

They serve multiple purposes:

Legal documentation of corporate decisions

Historical record of governance practices

Reference material for future corporate actions or for drafting

Source of insights for corporate governance research

Here’s a non exhaustive list of potential resolutions that can be found:

Approval of stock option plans for employees

Approval of financial statements

Appointment or reappointment of directors

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 2/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

The ability to effectively classify and analyze resolutions within these documents
can unlock valuable insights for legal professionals, corporate governance
researchers, and compliance officers. Traditional approaches have relied heavily on
manual classification using predefined taxonomies, but this method has proven
increasingly inadequate in the face of evolving corporate practices and the growing
volume of documents.

A Multi-faceted Challenge
1. Volume and Variety
Managing thousands of documents across various companies introduces a
significant challenge. These documents come with diverse writing styles,
formatting, and varying levels of detail and complexity. Additionally, the process of
digitizing physical documents often results in OCR-related errors, further
complicating the task.

2. Categorical Complexity
Classifying these documents presents its own set of obstacles. Categories frequently
overlap — for instance, a single clause might pertain to both financial decisions and
risk management. Moreover, interpretations are often context-dependent, with
corporate practices constantly evolving, resulting in new and emerging clause
types. Variations also arise based on region, industry, or company type, and some
documents don’t explicitly mention categories, ruling out a straightforward
extractive task. Adding to the complexity, the potential number of categories
remains unknown and can fluctuate over time.

This complexity — stemming from the vast volume, variety, and categorical
ambiguity — makes traditional rule-based or classification tasks approaches
inadequate. The overlapping categories, context-dependent interpretations, and
evolving nature of corporate practices demand a more flexible and adaptive
solution. This is where Large Language Models (LLMs) come in. Their ability to
understand nuanced language, generalize across varied contexts, and learn from
vast amounts of unstructured data makes them uniquely suited to tackle these
challenges, enabling more accurate classification and insight extraction from
diverse documents.

Advantages of This Approach

Flexibility: Eliminates the need for maintaining or updating a rigid category
system.

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 3/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Adaptability: Capable of handling new types of resolutions as corporate

governance practices evolve.

Nuanced Understanding: Accurately distinguishes subtle differences between

similar resolutions.

Scalability: Efficiently processes large volumes of resolutions at scale.

Error Correction: Mitigates errors introduced during OCR processing.

Models Evaluated
We conducted an extensive experiment analyzing 1,000,000 AGM resolution clauses
using various LLM models. Our approach utilized a carefully crafted prompt
designed to extract decision types while maintaining consistency and
generalization.

We evaluated various models, both closed ones and open sources one. For open
sourced ones, we leveraged vLLM with run_batch to optimize the throughput and
cost.

We tested three different models, each with distinct characteristics, matching our
costs constraints:

Models tested

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 4/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Sample outputs for the three models

Key Findings
After analyzing the results, GPT-4o-mini emerged as the optimal choice for several
reasons:

Category Consolidation: The model showed superior ability to group similar

decisions under consistent categories, resulting in fewer overall category variations
compared to other models. On the contrary, Mistral-7B yielded very precise
categories, but there were many duplicate ones. LLAMA3–7B yielded more outright
wrong categories.

Discriminative Power: Despite its tendency to consolidate, it maintained excellent

discrimination ability for crucial details, particularly in:

Bylaws modifications

Capital increase procedures

Cost-Effectiveness: While slightly more expensive per token, its better

categorization accuracy and consistency provided better value for the investment.

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 5/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Surprisingly, although it’s a close sourced model, it is on par with serving and
batching requests on open-sources models.

The Problem of Category Consolidation

One of the most significant challenges in using LLMs for zero-shot classification is
managing the diversity of generated categories. After extracting the categories with
an LLM, we ended with a lot of duplicate categories, like :

“Director Appointment”

“Director Nomination”

“New Executive Leadership Selection”

This is because the models might:

Create semantically similar but differently named categories

Generate categories at different levels of abstraction

Produce overlapping or nested classifications

Suggest context-specific categories that need broader alignment

These categories, while technically correct, require post-processing consolidation to

create a coherent and useful taxonomy.

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 6/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

The top 10 most frequent categories we get with the raw LLM output, many duplicates “Délégation de
pouvoir”

Exploring Different Deduplication Approaches

K-Means Clustering
Our first approach utilized K-Means clustering on sentence embeddings generated
from the category names. This method groups categories based on their semantic
similarity in the embedding space.

We explored various embeddings from the MTEB Leaderboard in French (a popular

embedding leaderboard): mostly gte-Qwen2–1.5B-instruct, KaLM-embedding-
multilingual-mini-v1, bilingual-embedding-base.

We processed the 200,000 distinct categories generated by the model by computing

embeddings for each category. Since the embeddings were trained using cosine
similarity but K-means clustering operates on Euclidean distance, we normalized
the vectors using L2 normalization before applying K-means.

However, remember that we have no idea what is the optimal number of clusters
(and there could be new ones emerging). Hence, we need to dynamically compute
https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 7/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

the optimal K number of clusters.

For this, the scientific litterature’s consensus seems to be that the elbow method is a
terrible criteria. So we experimented with using:

Silhouette Score

For a point i :

where:

a(i) = average distance between point i and all other points in its cluster

b(i) = min {average distance between i and all points in other clusters}

Bayesian Information Criterion (BIC)

where:

L = likelihood of the data

k = number of free parameters = K(d + 1)

n = number of data points

K = number of clusters

d = number of dimensions

Akaike Information Criterion (AIC)

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 8/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

where:

k = number of free parameters = K(d + 1)

L = likelihood of the data

K = number of clusters

d = number of dimensions

Indeed, we have very different optimal number of clusters for elbow and Silhouette (same X-axis scale)

With the three metrics and all embeddings vectors, we ended up with a very high
number of optimal clusters (around 40,000).

Some of the centroids we get from clustering (K=20000)

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 9/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

We encountered several significant challenges:

1. Computational Performance: The high dimensionality of the embedding

vectors combined with the large number of data points made K-means
clustering computationally intensive.

2. Curse of Dimensionality: K-means clustering is known to perform poorly with

high-dimensional data, a phenomenon known as the curse of dimensionality.

3. Cluster Quality: Manual inspection of the clusters revealed mixed results. While
some clusters were coherent, others contained overly diverse elements. This
inconsistency stems from K-means’ inability to incorporate custom thresholds
for precision and recall (most clusters had at least one false positive).

4. Business vs. Statistical Optimization: A crucial insight emerged: the statistically

optimal number of clusters often doesn’t align with business requirements. For
example, one cluster contained resolutions referencing different legal articles
(such as “Modifying XX regarding article R-111 of the commerce code” and
“Modifying XX regarding article R-99 of the commerce code”). While these items
are semantically similar, from a business perspective, they should remain
distinct.

This experience highlighted the importance of balancing mathematical

optimization with domain-specific requirements when designing clustering
solutions.

Latent Dirichlet Allocation (LDA)

While semantically similar sentences can have very different meanings, dense
representations make it hard to extract clear business insights since they aggregate
information into non-human-readable formats. This led us to adopt Latent Dirichlet
Allocation (LDA), a statistical approach that analyzes documents by finding both
word distributions across topics and topic distributions within documents. LDA
effectively separates distinct topics while grouping related words, making the
results easy to interpret through lists of the most probable words per topic.

Our implementation starts with text preprocessing, followed by vocabulary

computation and training of the LDA model.

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 10/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

# Preprocessing
nlp = spacy.load("fr_core_news_sm", disable=["ner"])
docs = nlp.pipe(categories, n_process=-1)
cleaned_categories = [clean_text(doc=doc) for doc in docs]

# Vectorize the text

vectorizer = CountVectorizer(
ngram_range=(1, 1),
stop_words=stop_words,
max_features=10000,
)
X = vectorizer.fit_transform(cleaned_categories)
# Fit the LDA model
lda = LatentDirichletAllocation(n_components=n_topics)
Y = lda.fit_transform(X)

Clusters look quite promising:

Topic #0: [‘société’, ‘siège’, ‘social’, ‘modification’, ‘adresse’, ‘statut’] Modification de

l’emplacement du siège social dans les statuts de la Société. Modification de l’adresse du
siège social

Topic #1: [‘société’, ‘commissaire’, ‘compte’, ‘nomination’, ‘exercice’, ‘mandat’]

Nomination d’une société Commissaire aux Comptes Titulaire. Nomination de la société
commissaire aux comptes pour un mandat de six exercices.

Topic #2: [‘réserver’, ‘salarier’, ‘capital’, ‘décision’, ‘augmentation’, ‘social’] Décision sur
une augmentation de capital réservée aux salariés et accordant un délai à la présidence
pour mettre en place un plan d’épargne entreprise. Délégation à la présidence pour
effectuer une augmentation du capital social réservée aux salariés adhérents d’un plan
d’épargne entreprise.

However, we faced many challenges:

The curse of dimensionality presents a significant challenge for LDA,

particularly due to our limited vocabulary derived from short, domain-specific
sentences. We observed notable performance degradation when scaling beyond
approximately 500 topics — a critical limitation given our need to categorize
diverse clauses into precise categories (bear in mind that the optimal number of
clusters in the previous method was 40,000).

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 11/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Statistical models are highly sensitive to the vocabulary on which they are
based, making careful, extensive analysis essential for optimal performance.
Currently, non-discriminating words frequently appear and are sometimes
plagued by ambiguities, as seen in cases where critical terms like ‘non’ are
missing, leading to opposing meanings.

Topic: [‘réduction’, ‘capital’, ‘social’, ‘perte’, ‘motiver’, ‘non’] Décision de réduction du

capital social non motivée par des pertes avec conditions suspensives. Réduction du capital
social par réabsorption de pertes.

Because topic compatibility with categories is represented by a probability

distribution, assigning a single topic to each category is challenging. A category
may be associated with multiple topics, complicating our goal of deduplication.

Some topics are not relevant business-wise:

Topic : [‘alsace’, ‘triperies’, ‘reunies’, ‘boyauderies’, ‘approbation’, ‘société’] Dissolution

sans liquidation de la société TRIPERIES BOYAUDERIES REUNIES D’ALSACE, suite à
une fusion par voie d’absorption, sans augmentation de capital. Approbation de l’adhésion
de la SEM NovaRhéna au GIE EPL SUD ALSACE.

Although LDA is heavily penalized by the problem’s dimensionality, limiting its

ability to meet our objective, it could still serve as a valuable tool for extracting high-
level concepts within our categories. This could help users identify related
categories or enhance our search engine in future iterations.

Paraphrase Detection
Our analysis revealed that at its core, this was fundamentally a paraphrase
detection challenge rather than a general semantic similarity problem. This
realization was crucial because category consolidation should only occur when two
descriptions are true paraphrases of each other, not merely when they share
semantic proximity.

In the legal domain, this distinction is particularly critical — terms that appear
semantically similar may carry significantly different legal implications. We
therefore prioritized precision over recall in our approach, recognizing that false
positives in deduplication (incorrectly merging distinct categories) could be more
problematic than false negatives (failing to merge true duplicates). This conservative

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 12/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

approach helps preserve the nuanced distinctions in legal terminology, where slight
variations in phrasing often reflect meaningful differences in corporate governance.

For paraphrase detection, we’re back with the MTEB benchmark, but in the Pair
Classification leaderboard. We chose
paraphrase-multilingual-MiniLM-L12-v2, paraphrase-multilingual-mpnet-base-v2,
sentence_croissant_alpha_v0.2 as our candidate models.

However, computing all the cosine similarity pairs can be too time consuming: as
this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of
sentences. Instead, we leveraged the paraphrase_mining module from
SentenceTransformers, which is optimized for this task through chunking.

In the end, we fetch pairs of similar sentences with a similarity score. Using these
scores, we build a graph where edges connect categories exceeding a carefully
calibrated similarity score threshold. Categories within each connected component
of this graph are then consolidated into clusters, with the most frequently occurring
description selected as the representative category.

import networkx as nx
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

sentences = [...]
model = SentenceTransformer("all-MiniLM-L6-v2")
paraphrases = paraphrase_mining(model, sentences)
G = nx.Graph()
# Adding all similarity pairs with score >= threshold
for score, i, j in paraphrases:
if score >= THRESHOLD:
G.add_edge(i, j, weight=score)
clusters = list(nx.connected_components(G))

To ensure accuracy, we conducted a rigorous manual evaluation of the paraphrase

detection results on a representative subset of categories. This evaluation enabled
us to identify an optimal similarity threshold that maximizes precision — ensuring
reliable category consolidation while preventing the merging of legally distinct
categories. The threshold was specifically tuned to balance the competing needs of

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 13/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

maintaining legal precision and achieving meaningful consolidation of truly

duplicate categories.

Some similar sentence retrieved

The paraphrase-multilingual-mpnet-base-v2 model proved superior for our use case,

as it successfully preserved crucial legal distinctions in resolution texts. Unlike the
other tested models, it correctly distinguished between resolutions referencing
different legal articles — a critical requirement for legal document analysis.

With this approach, we successfully consolidated our initial category set by

reducing its size by 35%, limiting redundancy while maintaining the integrity of
legally distinct categories. This consolidation dramatically reduces manual
reconciliation time, improves accuracy in compliance monitoring, simplifies the
search experience and enables automated cross-subsidiary analysis that was
previously impractical.

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 14/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

New top 10 resolutions categories after our deduplication process

Final thoughts
The integration of zero-shot classification with intelligent category deduplication
represents a significant step forward in legal document analysis. As LLMs continue
to evolve, we anticipate further improvements in both accuracy and efficiency.
However, the key lesson from our work remains clear: successful application of AI
in specialized domains requires careful attention to domain-specific requirements
and constraints.

The next steps involve exploring alternative clustering methods, such as

hierarchical clustering, to better group similar categories. Additionally, fine-tuning
the LLM for this specific task will enhance its performance. We also aim to leverage
the LLM directly in the deduplication process, allowing for more accurate and
efficient identification of redundant categories.

Engineering Llm Clustering Lda NLP

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 15/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Written by Aimen Louafi

0 Followers · Writer for Inside Doctrine

Aimen Louafi in Inside Doctrine

Comprehensive Analysis of OCR Solutions for High-Volume French

Documents Processing: Performance…
Imagine a world where navigating through volumes of detailed corporate documents is as easy
as a simple keyword search, or combing through…

Apr 24 301

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 16/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Ben Riou in Inside Doctrine

Upgrading EKS to 1.25

TL; DR: We just completed our Kubernetes Upgrade Campaign on EKS to 1.25. If you want to
reproduce it at home (or at work…), here are a few…

Feb 28, 2023 36 2

Philippe Chadenier in Inside Doctrine

Indexing and aggregating lawyers blogs posts at scale

How we handle thousands of lawyers’ blogs using the News-Please library

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 17/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Oct 9 185

Jérémie Uzan in Inside Doctrine

How We Published Our Design System: A Journey into Continuous

Integration
At Doctrine, our path to publishing our Design System has been an adventure full of lessons. It
wasn’t a decision we made overnight, but…

Oct 22 139

See all from Aimen Louafi

See all from Inside Doctrine

Recommended from Medium

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 18/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Deepak in Top Python Libraries

Building a Python Web Scraper with Data Analysis, Visualization, and

Automation
In today’s data-driven world, the ability to gather, analyze, and present real-time data is
invaluable. This project will guide you through…

6d ago 150 2

León Andrés M. in The Quantastic Journal

AI on the Podium: Revolution or Mistake in Stockholm?

About the controversy of the 2024 Nobel Prize in Physics

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 19/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

5d ago 468 4

Lists

Natural Language Processing

1798 stories · 1408 saves

The New Chatbots: ChatGPT, Bard, and Beyond

12 stories · 496 saves

Leadership
61 stories · 478 saves

Leadership upgrades
7 stories · 109 saves

Tomaz Bratanic in Towards Data Science

Building Knowledge Graphs with LLM Graph Transformer

A deep dive into LangChain’s implementation of graph construction with LLMs

2d ago 364 3

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 20/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Abdur Rahman in Stackademic

Python is No More The King of Data Science

5 Reasons Why Python is Losing Its Crown

Oct 23 3.1K 19

Gabriel Melo in Qantev

Fine-Tuning Donut Transformer For Document Classification

Document classification is a machine learning problem in which, given a document file as input,
one receives its class as output. This task…

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 21/22
2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Aug 8 189 1

Datadrifters

LitServe: FastAPI on Steroids for Serving AI Models — Tutorial with Llama

3.2 Vision
I recently tried an open-source gem called LitServe, no more wrestling with serving AI models.

3d ago 191

See more recommendations

https://medium.com/doctrine/beyond-fixed-taxonomies-zero-shot-classification-and-automated-category-consolidation-06517c1319ae 22/22

A Multi-Stage Framework With Taxonomy-Guided Reasoning For Occupation Classification Using Large Language Models
No ratings yet
A Multi-Stage Framework With Taxonomy-Guided Reasoning For Occupation Classification Using Large Language Models
29 pages
Eu Tax Sus
No ratings yet
Eu Tax Sus
20 pages
Fennemore
No ratings yet
Fennemore
24 pages
Proposed Attribute-Based LAA
No ratings yet
Proposed Attribute-Based LAA
7 pages
A Machine Learning Approach To Workflow Management
No ratings yet
A Machine Learning Approach To Workflow Management
12 pages
Credit Genie
No ratings yet
Credit Genie
23 pages
Hack RX
No ratings yet
Hack RX
15 pages
Untapped Global
No ratings yet
Untapped Global
24 pages
ALM First
No ratings yet
ALM First
24 pages
Data Science Document Processing & Structuring Project
No ratings yet
Data Science Document Processing & Structuring Project
6 pages
Solve Unstructured Data EmergeGen
No ratings yet
Solve Unstructured Data EmergeGen
21 pages
2013NICE4149
No ratings yet
2013NICE4149
229 pages
LLM Document Processing System
No ratings yet
LLM Document Processing System
18 pages
Rag Semi Structured
No ratings yet
Rag Semi Structured
20 pages
Bajaj Finasiv
No ratings yet
Bajaj Finasiv
10 pages
Conclusions
No ratings yet
Conclusions
22 pages
Azure AI Fundamentals Exam Prep
No ratings yet
Azure AI Fundamentals Exam Prep
69 pages
Microsoft - Ai 900.VFeb 2024.by .VCEplus.110q
No ratings yet
Microsoft - Ai 900.VFeb 2024.by .VCEplus.110q
69 pages
Ai 900
No ratings yet
Ai 900
5 pages
Peter Ansell Thesis
No ratings yet
Peter Ansell Thesis
221 pages
Research Paper Outline AI RPA Workflows v1
No ratings yet
Research Paper Outline AI RPA Workflows v1
10 pages
Microsoft Azure AI-900
No ratings yet
Microsoft Azure AI-900
37 pages
Unstructured Data Mining Guide
No ratings yet
Unstructured Data Mining Guide
12 pages
Towards AI Search Paradigm
No ratings yet
Towards AI Search Paradigm
63 pages
04 - Text Classification
No ratings yet
04 - Text Classification
22 pages
Gen AI Use Cases
No ratings yet
Gen AI Use Cases
43 pages
Project Scope
No ratings yet
Project Scope
5 pages
AutoML for B2B Contract Negotiation
No ratings yet
AutoML for B2B Contract Negotiation
2 pages
PHP & LLM Integration with Laravel
No ratings yet
PHP & LLM Integration with Laravel
49 pages
New Methods For Metadata Extraction From Scientific Literature
No ratings yet
New Methods For Metadata Extraction From Scientific Literature
175 pages
AI Multi-Agent Workflow with LangChain
No ratings yet
AI Multi-Agent Workflow with LangChain
13 pages
How To Enhance AI Chatbots With Real-Time Data From Bright Data Using OpenAI and LangChain - by Victor Yakubu - Jan, 2025 - Python in Plain English
No ratings yet
How To Enhance AI Chatbots With Real-Time Data From Bright Data Using OpenAI and LangChain - by Victor Yakubu - Jan, 2025 - Python in Plain English
22 pages
Streamlit Chatbot with LangChain LLMs
No ratings yet
Streamlit Chatbot with LangChain LLMs
15 pages
Beyond Skills - Unlocking The Full Potential of Data Scientists. - by Eric Colson - Oct, 2024 - Towards Data Science
No ratings yet
Beyond Skills - Unlocking The Full Potential of Data Scientists. - by Eric Colson - Oct, 2024 - Towards Data Science
19 pages
GCVE Protected Deployment Best Practices
No ratings yet
GCVE Protected Deployment Best Practices
20 pages
Building Audiobooks Using The Open-Source XTTS-V2 Model - by Jaimon Jacob - Oct, 2024 - Medium
No ratings yet
Building Audiobooks Using The Open-Source XTTS-V2 Model - by Jaimon Jacob - Oct, 2024 - Medium
14 pages
Graph-Based RAG with Neo4j
No ratings yet
Graph-Based RAG with Neo4j
13 pages
Build Your Own Memory-Powered Chatbot With Google Generative AI, LangChain, and Gradio - by Vinod Pillai - Nov, 2024 - Medium
No ratings yet
Build Your Own Memory-Powered Chatbot With Google Generative AI, LangChain, and Gradio - by Vinod Pillai - Nov, 2024 - Medium
13 pages
Azure AI Speech Service - Breaking Language Barriers With Video Translation - by Tajinder Singh - Nov, 2024 - Medium
No ratings yet
Azure AI Speech Service - Breaking Language Barriers With Video Translation - by Tajinder Singh - Nov, 2024 - Medium
16 pages
Build Your Multimodal RAG System
No ratings yet
Build Your Multimodal RAG System
19 pages
Fine-Tuning Embedding Models - Achieving More With Less - by Nilesh Raghuvanshi - Nov, 2024 - Towards AI
No ratings yet
Fine-Tuning Embedding Models - Achieving More With Less - by Nilesh Raghuvanshi - Nov, 2024 - Towards AI
20 pages
BigQuery Change Data Capture (CDC) Using Pub - Sub - by Ajith Urimajalu - Google Cloud - Community - Sep, 2024 - Medium
No ratings yet
BigQuery Change Data Capture (CDC) Using Pub - Sub - by Ajith Urimajalu - Google Cloud - Community - Sep, 2024 - Medium
15 pages
7 Notebook LM Uses You'll Wish You Knew Sooner - by Woyera - Oct, 2024 - Medium
No ratings yet
7 Notebook LM Uses You'll Wish You Knew Sooner - by Woyera - Oct, 2024 - Medium
10 pages
GraphRAG vs. Traditional RAG Insights
No ratings yet
GraphRAG vs. Traditional RAG Insights
26 pages
Chrome AI: Local Gemini Nano Setup
No ratings yet
Chrome AI: Local Gemini Nano Setup
19 pages
Effective Prompt Engineering For LLMs - A Developer's Guide To Advanced AI Techniques - by Pankaj - Nov, 2024 - Medium
No ratings yet
Effective Prompt Engineering For LLMs - A Developer's Guide To Advanced AI Techniques - by Pankaj - Nov, 2024 - Medium
16 pages
AI-Driven Environmental Monitoring and Conservation - by Preeti - Nov, 2024 - Medium
No ratings yet
AI-Driven Environmental Monitoring and Conservation - by Preeti - Nov, 2024 - Medium
23 pages
Legal-BERT Fine-Tuning Guide
No ratings yet
Legal-BERT Fine-Tuning Guide
27 pages
Best LLM 2024 - Top Models For Speed, Accuracy, and Price - Medium
No ratings yet
Best LLM 2024 - Top Models For Speed, Accuracy, and Price - Medium
17 pages
AI's Role in Quantum System Simulation
No ratings yet
AI's Role in Quantum System Simulation
12 pages
A Step-By-Step Guide To Building AI Agents With LangGraph - by Alannaelga - Coinmonks - Nov, 2024 - Medium
100% (1)
A Step-By-Step Guide To Building AI Agents With LangGraph - by Alannaelga - Coinmonks - Nov, 2024 - Medium
32 pages
Build Whatsapp Chatbot With Flask and Open Source LLM - LLAMA3? - by Mayankchugh Jobathk - Medium
No ratings yet
Build Whatsapp Chatbot With Flask and Open Source LLM - LLAMA3? - by Mayankchugh Jobathk - Medium
23 pages
Building A Telegram Bot in 2024 With Python - by Erich Hohenstein - Level Up Coding
No ratings yet
Building A Telegram Bot in 2024 With Python - by Erich Hohenstein - Level Up Coding
15 pages
Building A Smart Travel Agent With LangGraph and OpenAI - by Abhinav Kumar - Artificial Intelligence in Plain English
No ratings yet
Building A Smart Travel Agent With LangGraph and OpenAI - by Abhinav Kumar - Artificial Intelligence in Plain English
14 pages
A Simple Way To Organize Your Styles & Themes in Flutter - by Leonidas Kanellopoulos - Sep, 2024 - Medium
No ratings yet
A Simple Way To Organize Your Styles & Themes in Flutter - by Leonidas Kanellopoulos - Sep, 2024 - Medium
19 pages
A Small Step Towards Reproducing OpenAI O1 - Progress Report On The Steiner Open Source Models - by Yichao 'Peak' Ji - Oct, 2024 - Medium
No ratings yet
A Small Step Towards Reproducing OpenAI O1 - Progress Report On The Steiner Open Source Models - by Yichao 'Peak' Ji - Oct, 2024 - Medium
16 pages
SAP2000 Frame Analysis Tutorial
No ratings yet
SAP2000 Frame Analysis Tutorial
21 pages
AI-Based Literature Reviews: A Topic Modeling Approach: Manoj Kumar Verma and Mayank Yuvaraj
No ratings yet
AI-Based Literature Reviews: A Topic Modeling Approach: Manoj Kumar Verma and Mayank Yuvaraj
8 pages
Smart Tourism Destination A Critical Reflection
No ratings yet
Smart Tourism Destination A Critical Reflection
18 pages
SageMaker Algorithms Guide
No ratings yet
SageMaker Algorithms Guide
20 pages
L 0019290643 PDF
No ratings yet
L 0019290643 PDF
30 pages
Overview of Topic Modeling Techniques
No ratings yet
Overview of Topic Modeling Techniques
27 pages
Social Media Big Data Analytics For Demand Forecas
No ratings yet
Social Media Big Data Analytics For Demand Forecas
18 pages
1 s2.0 S2405851324000370 Main
No ratings yet
1 s2.0 S2405851324000370 Main
16 pages
Presentation Topic
No ratings yet
Presentation Topic
4 pages
Advancing Fake News Detection Hybrid Deep Learning With FastText and Explainable AI
No ratings yet
Advancing Fake News Detection Hybrid Deep Learning With FastText and Explainable AI
19 pages
Data Miningof Public Opinion An Overview
No ratings yet
Data Miningof Public Opinion An Overview
12 pages
Wade 200622083
No ratings yet
Wade 200622083
153 pages
Yla Anttila Et Al 2021 Topic Modeling For Frame Analysis A Study of Media Debates On Climate Change in India and Usa
No ratings yet
Yla Anttila Et Al 2021 Topic Modeling For Frame Analysis A Study of Media Debates On Climate Change in India and Usa
22 pages
Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details This Is An Early Draft. Your Feedbacks Are Highly Appreciated
No ratings yet
Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details This Is An Early Draft. Your Feedbacks Are Highly Appreciated
17 pages
Mining Free Text Medical Notes
No ratings yet
Mining Free Text Medical Notes
8 pages
Cross Model by DR - Zafar
No ratings yet
Cross Model by DR - Zafar
4 pages
GPU优化机器学习算法
No ratings yet
GPU优化机器学习算法
63 pages
Research Paper FINAL
No ratings yet
Research Paper FINAL
15 pages
Detection
No ratings yet
Detection
15 pages
Technical Documenetflix Technicalnt
No ratings yet
Technical Documenetflix Technicalnt
15 pages
Ust Graduate School Thesis Template
100% (3)
Ust Graduate School Thesis Template
7 pages
Report NLP
No ratings yet
Report NLP
25 pages
NLP Record
No ratings yet
NLP Record
15 pages
Artificial Intelligence For Quality of Life Study A Systematic Literature Review
No ratings yet
Artificial Intelligence For Quality of Life Study A Systematic Literature Review
30 pages
Advances in Information Technology in Civil and Building Engineering
No ratings yet
Advances in Information Technology in Civil and Building Engineering
446 pages
Elhassan Elboraey Resume - SWE
No ratings yet
Elhassan Elboraey Resume - SWE
1 page
Personality Prediction Based On Twitter
No ratings yet
Personality Prediction Based On Twitter
6 pages
Community Detection
No ratings yet
Community Detection
72 pages
Tanev & Sieklicki, 2025, TM As Semantic Technology, Applsci-15-03253
No ratings yet
Tanev & Sieklicki, 2025, TM As Semantic Technology, Applsci-15-03253
23 pages
Embedded Topic Models Explained
No ratings yet
Embedded Topic Models Explained
12 pages
(Ebooks PDF) Download Computational Methods and Data Engineering: Proceedings of ICMDE 2020, Volume 1 Vijendra Singh Full Chapters
100% (3)
(Ebooks PDF) Download Computational Methods and Data Engineering: Proceedings of ICMDE 2020, Volume 1 Vijendra Singh Full Chapters
55 pages

Beyond Fixed Taxonomies - Zero-Shot Classification and Automated Category Consolidation - by Aimen Louafi - Inside Doctrine - Nov, 2024 - Medium

Uploaded by

Beyond Fixed Taxonomies - Zero-Shot Classification and Automated Category Consolidation - by Aimen Louafi - Inside Doctrine - Nov, 2024 - Medium

Uploaded by

2024/11/7 晚上11:04 Beyond Fixed Taxonomies: Zero-shot Classification and Automated Category Consolidation | by Aimen Louafi | Inside Doctr…

Classification and Automated Category

Listen Share More

by Aïmen Louafi and Julien Perrin

In today’s corporate landscape, the analysis of annual/shareholders general

Understanding the Context

They serve multiple purposes:

Legal documentation of corporate decisions

Historical record of governance practices

Reference material for future corporate actions or for drafting

Source of insights for corporate governance research

Here’s a non exhaustive list of potential resolutions that can be found:

Approval of stock option plans for employees

Approval of financial statements

Appointment or reappointment of directors

Advantages of This Approach

Adaptability: Capable of handling new types of resolutions as corporate

Nuanced Understanding: Accurately distinguishes subtle differences between

Scalability: Efficiently processes large volumes of resolutions at scale.

Error Correction: Mitigates errors introduced during OCR processing.

Sample outputs for the three models

Category Consolidation: The model showed superior ability to group similar

Discriminative Power: Despite its tendency to consolidate, it maintained excellent

Capital increase procedures

Cost-Effectiveness: While slightly more expensive per token, its better

The Problem of Category Consolidation

“New Executive Leadership Selection”

This is because the models might:

Create semantically similar but differently named categories

Generate categories at different levels of abstraction

Produce overlapping or nested classifications

Suggest context-specific categories that need broader alignment

These categories, while technically correct, require post-processing consolidation to

Exploring Different Deduplication Approaches

We explored various embeddings from the MTEB Leaderboard in French (a popular

We processed the 200,000 distinct categories generated by the model by computing

the optimal K number of clusters.

Bayesian Information Criterion (BIC)

L = likelihood of the data

k = number of free parameters = K(d + 1)

n = number of data points

Akaike Information Criterion (AIC)

k = number of free parameters = K(d + 1)

L = likelihood of the data

Some of the centroids we get from clustering (K=20000)

We encountered several significant challenges:

1. Computational Performance: The high dimensionality of the embedding

2. Curse of Dimensionality: K-means clustering is known to perform poorly with

4. Business vs. Statistical Optimization: A crucial insight emerged: the statistically

This experience highlighted the importance of balancing mathematical

Latent Dirichlet Allocation (LDA)

Our implementation starts with text preprocessing, followed by vocabulary

# Vectorize the text

Clusters look quite promising:

Topic #0: [‘société’, ‘siège’, ‘social’, ‘modification’, ‘adresse’, ‘statut’] Modification de

Topic #1: [‘société’, ‘commissaire’, ‘compte’, ‘nomination’, ‘exercice’, ‘mandat’]

However, we faced many challenges:

The curse of dimensionality presents a significant challenge for LDA,

Topic: [‘réduction’, ‘capital’, ‘social’, ‘perte’, ‘motiver’, ‘non’] Décision de réduction du

Because topic compatibility with categories is represented by a probability

Some topics are not relevant business-wise:

Topic : [‘alsace’, ‘triperies’, ‘reunies’, ‘boyauderies’, ‘approbation’, ‘société’] Dissolution

Although LDA is heavily penalized by the problem’s dimensionality, limiting its

To ensure accuracy, we conducted a rigorous manual evaluation of the paraphrase

maintaining legal precision and achieving meaningful consolidation of truly

Some similar sentence retrieved

The paraphrase-multilingual-mpnet-base-v2 model proved superior for our use case,

With this approach, we successfully consolidated our initial category set by

New top 10 resolutions categories after our deduplication process

The next steps involve exploring alternative clustering methods, such as

Engineering Llm Clustering Lda NLP

Written by Aimen Louafi

More from Aimen Louafi and Inside Doctrine

Aimen Louafi in Inside Doctrine

Comprehensive Analysis of OCR Solutions for High-Volume French

Ben Riou in Inside Doctrine