DeepSeek OCR (github.com/deepseek-ai)
772 points by pierre 12 hours ago | 203 comments
krackers 12 hours ago | next [–]
The paper is more interesting than just another VLM for OCR; they get into compression and related ideas. E.g. there is this quote:
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision
tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless
OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic coding does compared to Huffman codes)?
And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence
between information loss in the textual domain and the image domain.
reply
miki123211 8 hours ago | parent | next [–]
Text tokens are quantized and represent subword units; vision tokens only exist in the embedding space.
The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector
embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct
the "context", a matrix where each row is a vector taken from that lookup table.
Transmitting text token sequences can be relatively efficient: you just transmit the token IDs themselves[1]. They're small integers (~100k possible token IDs is typical for large models). Transmitting the actual embedding matrix would be far less efficient, as embeddings often consist of thousands of floating-point numbers.
Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural-network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token IDs and no lookup table; we go straight from image data to token embeddings.
This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings
themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes
more bytes.
You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you
have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector
itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually
much larger.
There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most
tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently.
You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.
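To make that concrete, here is a toy sketch (sizes are illustrative, not DeepSeek's actual code) of how the two paths end up as rows of the same context matrix:

    import torch
    import torch.nn as nn

    d_model = 1024          # embedding width shared by both modalities
    vocab_size = 100_000    # ~100k token IDs, typical for a large model
    patch = 16              # one vision token per 16x16 RGB patch

    text_embedding = nn.Embedding(vocab_size, d_model)        # lookup table: id -> vector
    patch_projection = nn.Linear(3 * patch * patch, d_model)  # pixels -> vector, no vocabulary

    token_ids = torch.tensor([[17, 4092, 311, 88721]])        # cheap to transmit: one small int each
    text_tokens = text_embedding(token_ids)                   # shape (1, 4, 1024)

    patches = torch.rand(1, 8, 3 * patch * patch)             # 8 flattened image patches
    vision_tokens = patch_projection(patches)                 # shape (1, 8, 1024), straight to embeddings

    context = torch.cat([text_tokens, vision_tokens], dim=1)  # what the transformer attends over
    print(context.shape)                                      # torch.Size([1, 12, 1024])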
reply
rco8786 7 hours ago | root | parent | next [–]
Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There are only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.
reply
jerf 4 hours ago | root | parent | next [–]
Textual language is really, really amazing if you sit down and think about what it does versus the
resources it consumes to do it.
It's a common pastime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not that they couldn't possibly be improved in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of, and appreciation for, the impressive nature of the competition, you're setting yourself up to fail.
reply
freeqaz 5 hours ago | root | parent | prev | next [–]
There is also a tradeoff between different vocabulary sizes (how many entries exist in the token ->
embedding lookup table) that inform the current shape of tokenizers and LLMs. (Below is my semi-
armchair stance, but you can read more in depth here[0][1].)
If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd need more tokens to represent most content. (And attention compute scales non-linearly with context length, roughly n^2.) This would also be a bit more 'fuzzy' in terms of teaching the LLM what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, so it's more ambiguous for the LLM.
On the flip side: what if you had one entry in the tokenizer's vocabulary for every word that exists? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has computational tradeoffs: when you calculate the probability of the 'next' token via softmax, you have to run it over every entry in the vocabulary, and certain layers within the LLM get larger (more memory + compute required for each token, basically).
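A rough sketch of the compute/memory side of that tradeoff (sizes here are illustrative, not any particular model's):

    # the LM head is a d_model x vocab_size matrix, and softmax runs over every column
    d_model = 4096
    for vocab_size in (256, 100_000, 1_000_000):   # char-level, typical BPE, "one token per word"
        weights = d_model * vocab_size
        print(f"vocab={vocab_size:>9,}  lm-head weights={weights:>13,}  (~{weights * 2 / 1e9:.1f} GB in bf16)")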
Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run
into specific tokens that only appear a handful of times in the training data and the model is never able to
fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific
example being somebody's username on the internet.)
Fun fact: these rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is where my interest comes from, as somebody who works in AI security.)
As LLMs have improved, models have pushed towards the largest vocabulary they can get away with
without hurting performance. This is about where my knowledge on the subject ends, but there have
been many analyses done to try to compute the optimal vocabulary size. (See the links below)
One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start
trying to represent 'higher order' concepts without using human vocabulary for them. One example
being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8) or directly against the
final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a
more dense, almost compressed latent space that the large LLM can understand).
It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP
call to a model running on your laptop to say 'hey, go through all of the code and give me the general
vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just
directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code
based on regex like it does today...
This immediately makes the model's inner state (even more) opaque to outside analysis, though. It's like why using gRPC as the protocol for your JavaScript front-end sucks: humans can't debug it anymore without extra tooling. JSON is verbose as hell, but it's simple, and I can debug my REST API with just the network inspector. I don't need access to the underlying Protobuf files to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P
Exciting times!
0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...
1: https://arxiv.org/html/2407.13623v1
2: https://en.wikipedia.org/wiki/Glitch_token
3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
reply
rco8786 4 hours ago | root | parent | next [–]
Again, super interesting thanks!
> One area that I have been spending a lot of time thinking about is what Tokenization looks like if
we start trying to represent 'higher order' concepts without using human vocabulary for them. One
example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8)
I've had similar ideas in the past. High level languages that humans write are designed for
humans. What does an "LLM native" programming language look like? And, to your point about
protobufs vs JSON, how does a human debug it when the LLM gets stuck?
> It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make
an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me
the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer
tokens than just directly uploading all of the code, and it _feels_ like it would be better than
uploading chunks of code based on regex like it does today...
That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven
way. Claude will do semantic search through YAML frontmatter to determine what skill might be
useful in a given context, then load that entire skill file into context to execute it. Your idea here is
similar, use a small local model to summarize each file (basically dynamically generate that YAML
front matter), feed those into the larger model's context, and then it can choose which file(s) it
cares about based on that.
reply
lubesGordi 2 hours ago | root | parent | prev | next [–]
So in terms of OCR, does the neural network 'map' the words into an embedding directly, or is it getting a
bunch of words like "Hamlet's monologue" and mapping that to an embedding? Basically what I'm asking is if
the neural network image encoder is essentially doing OCR 'internally' when it is coming up with the embedding
(if that makes any sense).
reply
isaacfung 2 hours ago | root | parent | prev | next [–]
Some models use vector quantized variational autoencoders to discretize images into sequences of discrete
symbols from a fixed codebook.
https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...
reply
ttul 4 hours ago | root | parent | prev | next [–]
This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for
display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-
dimensional information into concepts again.
So to me it’s not a surprise that you can transform the two-dimensional representation of the same information
into concepts again without losing much.
The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s
intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very
long context is to provide it with an image representation instead of text tokens.
reply
miki123211 3 hours ago | root | parent | next [–]
Text is actually one-dimensional, writing is two-dimensional.
To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship
between character 15 of line 1 and character 15 of line 2.
For a vision model (which considers text as squiggles, not UTF-8 codepoints), such a relationship does exist.
reply
storus 1 hour ago | root | parent | prev | next [–]
That's not really true; the latest autoregressive image models create a codebook of patches that are then encoded as tokens, and the image is assembled out of them.
reply
jph00 4 hours ago | root | parent | prev | next [–]
Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors.
There has been some success in that direction in diffusion models for instance.
reply
HarHarVeryFunny 5 hours ago | parent | prev | next [–]
I don't know if there is any common practice among multi-modal input "LLM"s as to how they encode image inputs -
convert them into "vision tokens", but it's basically going to come down to splitting the image into a grid of regions
and encoding those.
I'm not sure there's any information-theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest image resolution/grid you can get away with and still capture enough detail to accurately perform OCR.
It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically
done.
reply
ssivark 3 hours ago | parent | prev | next [–]
Surely the appropriate ratio depends on the resolution of each character, relative to the size of the vision token patch?
That is the only way the number of text tokens needed to describe the output of OCR can be independent of the
resolution of the image (as it should).
reply
runeblaze 10 hours ago | parent | prev | next [–]
Each text token is usually a subword unit, but in VLMs the visual tokens live in semantic space. Semantic space obviously compresses much more than subword slices.
Disclaimer: not an expert, off the top of my head.
reply
looobay 12 hours ago | parent | prev | next [–]
LLMs are compute-heavy, with attention scaling quadratically (in compute) with the number of tokens. They are trying to compress text tokens into visual tokens with their VLM.
Maybe they would render text to an image before tokenizing to reduce the compute cost.
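A back-of-the-envelope illustration of why that helps (token counts made up; only the ratio matters):

    # self-attention cost grows roughly with the square of sequence length
    text_tokens = 5_000
    vision_tokens = 500                            # the ~10x compression the paper reports
    print(text_tokens ** 2 / vision_tokens ** 2)   # 100.0 -> ~100x less attention compute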
reply
krackers 12 hours ago | root | parent | next [–]
But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits (or more) as its representation in text tokens? You're changing representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity/compressibility you can take advantage of in the domain you transform to, right?
So I guess my question is: where is the juice being squeezed from? Why does the vision token representation end up being more efficient than text tokens?
reply
HarHarVeryFunny 2 hours ago | root | parent | next [–]
A text token generally represents a portion of a single word, while a vision token represents a portion of
the entire page, which may include multiple words. This is where the "compression factor" comes from.
The number of bits to represent a text or vision token is the same, since they are both represented as
embeddings of a fixed number of dimensions defined by the Transformer (maybe a few thousand for a
large SOTA model).
Whether a vision token actually contains enough information to accurately extract (OCR) all the text from that portion of the image depends on how many pixels that vision token represents and how many words were present in that area of the image. It's just like considering images of the same page of text at different resolutions - a 1024x1024 image vs a 64x64 one, etc. As the resolution decreases, so does OCR accuracy; at some point the resolution is insufficient, the words become a blurry mess, and accuracy collapses.
This is what DeepSeek are reporting - OCR accuracy when you try to use a single vision token to represent, say, 10 text tokens vs 20 text tokens. The vision token may have enough resolution to represent 10 tokens well, but not enough for 20.
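A quick worked example of that, assuming one vision token per 16x16 patch (the patch size is an assumption here, and real encoders like DeepSeek's add a further token-compression stage on top):

    words_on_page = 500                  # a fairly dense page of prose
    for side in (1024, 512, 256, 64):    # the same page rendered at different resolutions
        tokens = (side // 16) ** 2       # 4096, 1024, 256, 16 vision tokens
        print(f"{side}x{side}: {tokens:>4} tokens, ~{words_on_page / tokens:.2f} words per token")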
reply
f33d5173 11 hours ago | root | parent | prev | next [–]
Vision is how humans see text, so text must have built-in adaptations to protect against visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.
reply
fxtentacle 7 hours ago | root | parent | next [–]
That also works purely on text and it's the trick I used in my German speech recognition engine (
https://arxiv.org/abs/2206.12693 ).
"I'm studying at Oxford Univ" has basically no loss in meaning even though "University" was
truncated to less than half its characters.
reply
UltraSane 3 hours ago | root | parent | next [–]
This is like how many CLIs accept the shortest unique version of commands.
reply
ffsm8 8 hours ago | root | parent | prev | next [–]
Is that really factual/true?
Lots of words have multiple meanings and can mean different things even if used in the same
sentence/context just from the interpretation of the person reading it.
Heck, it'd argue that most (not all) dayjob conflicts are down to such differences in interpretation
/miscommunications
reply
psb217 10 hours ago | root | parent | prev | next [–]
The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from
a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision
tokens can convey significantly more bits per token than text tokens. This allows them to pack the
content of multiple text tokens into a single vision token.
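A rough information-budget comparison (dimension and precision are illustrative):

    import math

    vocab = 100_000
    bits_per_text_token = math.log2(vocab)            # ~16.6 bits to identify one text token
    dims, bits_per_dim = 1024, 16                     # a bf16 embedding vector
    raw_bits_per_vision_token = dims * bits_per_dim   # 16,384 bits of raw capacity
    print(bits_per_text_token, raw_bits_per_vision_token)
    # Even if only a small fraction of that capacity survives training, one vision token
    # can in principle carry the ~166 bits needed to pin down 10 text tokens.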
reply
imjonse 11 hours ago | root | parent | prev | next [–]
I wonder if text written using Chinese characters is more compatible with such vision-centric compression than Latin text.
reply
looobay 12 hours ago | root | parent | prev | next [–]
Vision tokens are a good compression medium because with one vision token you have one vector of N elements, whereas with textual tokens you have M vectors of N elements - one vision token represents many pixels (and possibly multiple words). This is why it's a good compression medium for compute.
It will never be as precise as textual tokens, but it can be really good, as they show in the paper.
reply
krackers 11 hours ago | root | parent | next [–]
>with one vision token you have one vector of N elements, but with textual tokens you have M
vectors of N elements
Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple
vision tokens. So assuming that the embedding size of the vision token and text token is the same
`d` (which I think has to be the case for multimodal models), then wouldn't the fair comparison
be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence
in terms of text tokens? I don't see how you could see a priori that x << y (especially by a factor
of 10 as quoted in the paper).
That said, if I do experimentally try this by shrinking this very comment down to the smallest font
size I can read it at, then seeing how many 16x16 tokens it takes, you can fit more text than I
expected in each "vision token". So I can maybe buy that x is at least not greater than y. But it
can't be as simple as "each vision token can cover more text", since that only enables better
compression if the encoder can actually uncover some sort of redundancy within each token. (And
presumably the type of redundancy it uncovers probably isn't something that "classical"
compression techniques can exploit, otherwise it seems like it would have been tried by now?).
reply
looobay 11 hours ago | root | parent | next [–]
You should read page 6 of the paper (and page 5 for the architecture breakdown); they show that they compress the vision tokens with convolution to keep strong semantic understanding while using a small number of tokens.
But I think it's still experimental.
reply
numpad0 10 hours ago | root | parent | prev | next [–]
just a hunch but like, from something to do with Unicode?
reply
simonw 1 hour ago | prev | next [–]
I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by
running Claude Code as root in a new Docker container and having it figure it out. Notes here:
https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...
Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image:
https://static.simonwillison.net/static/2025/ft.jpeg
reply
jjcm 1 hour ago | parent | next [–]
Looks like this did a really solid job, with the exception of the paragraph directly below the quote. It hallucinated some filler there and bridged it with the next column.
Thanks for running the test quickly!
reply
throwaway314155 1 hour ago | parent | prev | next [–]
> by running Claude Code as root in a new Docker container
How do you get the "as root" part of that to work?
(sorry if it's explained in your article)
reply
simonw 32 minutes ago | root | parent | next [–]
Run it on a root account and do:
IS_SANDBOX=1 claude --dangerously-skip-permissions
reply
breadislove 7 hours ago | prev | next [–]
For everyone wondering how good this and other benchmarks are:
- the OmniAI benchmark is bad
- Instead check OmniDocBench[1] out
- Mistral OCR is far, far behind most open-source OCR models and even further behind Gemini
- End to End OCR is still extremely tricky
- composed pipelines work better (layout detection -> reading order -> OCR every element)
- complex table parsing is still extremely difficult
[1]: https://github.com/opendatalab/OmniDocBench
reply
cheema33 1 minute ago | parent | next [–]
> the OmniAI benchmark is bad
According to Omni OCR benchmark, Omni OCR is the best OCR. I am sure you all will find no issues with these
findings.
reply
hakunin 5 hours ago | parent | prev | next [–]
Wish someone benchmarked Apple Vision Framework against these others. It's built into most Apple devices, but
people don't know you can actually harness it to do fast, good quality OCR for you (and go a few extra steps to
produce searchable pdfs, which is my typical use case). I'm very curious where it would fall in the benchmarks.
reply
graeme 3 hours ago | root | parent | next [–]
Interesting. How do you harness it for that purpose? I've found apple ocr to be very good.
reply
hakunin 2 hours ago | root | parent | next [–]
The short answer is a tool like OwlOCR (which also has CLI support). The long answer is that there are
tools on github (I created the stars list: https://github.com/stars/maxim/lists/apple-vision-framework/)
that try to use the framework for various things. I’m also trying to build an ffi-based Ruby gem that
provides convenient access in Ruby to the framework’s functionality.
reply
wahnfrieden 5 hours ago | root | parent | prev | next [–]
It is unusable trash for languages with any vertical writing such as Japanese. It simply doesn’t work.
reply
thekid314 5 hours ago | root | parent | next [–]
Yeah, and fails quickly at anything handwritten.
reply
hakunin 4 hours ago | root | parent | next [–]
I mostly OCR English, so Japanese (as mentioned by parent) wouldn't be an issue for me, but I do
care about handwriting. See, these insights are super helpful. If only there was, say, a benchmark
to show these.
My main question really is: what are practical OCR tools that I can string together on my MacBook
Pro M1 Max w/ 64GB Ram to maximize OCR quality for lots of mail and schoolwork coming into my
house, all mostly in English.
I use ScanSnap Manager with its built in OCR tools, but that's probably super outdated by now.
Apple Vision does way better job than that. I heard people say also that Apple Vision is better than
Tesseract. But is there something better still that's also practical to run in a scripted environment
on my machine?
reply
CaptainOfCoit 5 hours ago | root | parent | prev | next [–]
Yeah, if it was cross-platform maybe more people would be curious about it, but something that can only run on
~10% of the hardware people have doesn't make it very attractive to even begin to spend time on Apple-
exclusive stuff.
reply
ch1234 4 hours ago | root | parent | next [–]
But you can have an apple device deployed in your stack to handle the OCR, right? I get on-device is a
hardware limitation for many, but if you have an apple device in your stack, can’t you leverage this?
reply
CaptainOfCoit 3 hours ago | root | parent | next [–]
Yeah, but handling macOS as infrastructure capacity sucks; Apple really doesn't want you to, so tooling is almost non-existent. I've set up CI/CD stacks before that needed macOS builders, and they're always the most cumbersome machines to manage as infrastructure.
reply
coder543 2 hours ago | root | parent | next [–]
AWS literally lets you deploy Macs as EC2 instances, which I believe includes all of AWS's
usual EBS storage and disk imaging features.
reply
CaptainOfCoit 2 hours ago | root | parent | next [–]
Alright, so now the easy thing is done, now how do you actually manage them, keep
them running and do introspection without resorting to SSH or even remote desktop?
reply
coder543 2 hours ago | root | parent | next [–]
How do you manage any EC2 instance “without resorting to SSH”? Even for
Linux EC2 instances, the right answer is often tools like Ansible, which do still
use SSH under the hood.
reply
CaptainOfCoit 57 minutes ago | root | parent | next [–]
You usually provision them via images, which they then either install from or boot from directly. Not to mention there is no end of infrastructure software that works for at least Linux, sometimes Windows, and only seldom macOS.
reply
coder543 52 minutes ago | root | parent | next [–]
I specifically mentioned the imaging capability of EBS for Mac,
which you dismissed as the easy part. Now you’re claiming that is
the main thing? Well, good news!
And yes, Ansible (among other tools) can be used to manage
macOS.
This discussion doesn’t seem productive. You have a preconceived viewpoint, and you’re not actually considering the problem or even doing 5 seconds of googling.
Managing a Mac fleet on AWS isn’t a real problem. If Apple’s OCR
framework were significantly above the competition, it could
easily be used. I would like to see benchmarks of it, as the other
person was also asking for.
reply
hakunin 4 hours ago | root | parent | prev | next [–]
10% of hardware is an insanely vast amount, no?
reply
CaptainOfCoit 3 hours ago | root | parent | next [–]
Well, it's 90% less than what everyone else uses, so even if the total number is big, relatively it
has a small user-base.
reply
hakunin 3 hours ago | root | parent | next [–]
I don’t think 10% of anything would be considered relatively small, even if we're talking about just 10 items: there are literally only 10 items, and this 1 has the rare quality of being among them. Let alone billions of devices. Unless you want to reduce it to a tautology and, instead of answering "why isn't it benchmarked", just go for "10 is smaller than 90, so I'm right".
My point is, I don’t think any comparative benchmark would ever exclude something based
on “oh it’s just 10%, who cares.” I think the issue is more that Apple Vision Framework is
not well known as an OCR option, but maybe it’s starting to change.
And another part of the irony is that Apple’s framework probably gets way more real world
usage in practice than most of the tools in that benchmark.
reply
CaptainOfCoit 2 hours ago | root | parent | next [–]
The initial wish was that more people cared about Apple Vision Framework, I'm
merely claiming that since most people don't actually have Apple hardware, they're
avoiding Apple technology as it commonly only runs on Apple hardware.
So I'm not saying it should be excluded because it can only be used by relatively few people, but I was trying to communicate that I kind of get why not so many people care about it and why it gets forgotten, since most people wouldn't be able to run it even if they wanted to.
Instead, something like DeepSeek OCR can be deployed on any of the three major OSes (assuming there are implementations of the architecture available), so of course it gets a lot more attention and will be included in way more benchmarks.
reply
hakunin 2 hours ago | root | parent | next [–]
I get what you're saying, I'm just disagreeing with your thought process. By
that logic benchmarks would also not include the LLMs that they did, since
most people wouldn't be able to run those either (it takes expensive
hardware). In fact, more people would probably be able to run Vision
framework than those LLMs, for cheaper (Vision is even on iPhones). I'm more
inclined to agree if you say "maybe people just don't like Apple". :)
reply
ellisd 10 hours ago | prev | next [–]
The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting
OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis.
https://annas-archive.org/blog/duxiu-exclusive.html
reply
bluecoconut 4 hours ago | parent | next [–]
A previous paper from DeepSeek mentioned Anna’s Archive.
> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions
of K-12 education exam questions. https://arxiv.org/abs/2403.05525 DeepSeek-VL paper
reply
ikamm 7 hours ago | parent | prev | next [–]
Why do they need to grant access for people to use copies of books they don’t own?
reply
JohnLocke4 6 hours ago | root | parent | next [–]
Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR-scans from
the people they choose to share it with. This is to improve their existing service by making the content of books
(and not just their title/tags) searchable.
As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.
reply
ikamm 4 hours ago | root | parent | next [–]
Fair enough, it just seems like they're painting an even bigger target on their backs by restricting access
to copyrighted material they don't own the rights to
reply
throawayonthe 9 hours ago | parent | prev | next [–]
hahaha also immediately thought of this, wonder when the ocr'd dataset would be getting released
reply
singularfutur 7 hours ago | parent | prev | next [–]
Yes it means they will never release their dataset :(
reply
dev1ycan 6 hours ago | parent | prev | next [–]
Oh great, so now Anna's Archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use. Meta torrenting 70TB from Library Genesis wasn't enough.
reply
sigmoid10 5 hours ago | root | parent | next [–]
Seems like they are doing fine:
https://open-slum.org
reply
c0balt 5 hours ago | root | parent | prev | next [–]
It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to
satisfy the request safely.
reply
dumpsterkid 4 hours ago | prev | next [–]
I haven't fired this up yet to try, but I've been evaluating and working with quite a few different VLMs - from the small Granite, Qwen, etc. models up to the larger VLMs available - to see if we can fully replace traditional OCR in our system, and I've been disappointed so far. Our system takes documents from customers and supplies them back as normalized documents (i.e. rasterized multi-page bitmaps) marked up as they've requested. However, in our use case we need accurate coordinates of data down to the letter/word level, and in my experience the positional outputs from these VLMs are either wildly inconsistent, completely hallucinated, or so vague that they don't allow us to target anything with any kind of accuracy or granularity.
Our solution so far has been to stick to using Tesseract with good clean-up routines and then augmenting/fixing up the output using the VLM OCR text where we don't have structured source document data available.
It could be that we just have a very niche use case and it doesn't matter to most people. I'm sure if you just want a text dump or a restructured Markdown/HTML representation of documents these VLMs work well, but the number of articles and comments I've seen claiming that these models have 'solved' OCR just seems counter to our experience.
reply
kamranjon 56 minutes ago | parent | next [–]
Have you tried moondream yet[1]? The moondream 3 preview model[2], according to the blog post[3], appears to outperform many frontier models on VLM tasks and does so with a relatively small footprint.
[1] https://moondream.ai/
[2] https://huggingface.co/moondream/moondream3-preview
[3] https://moondream.ai/blog/moondream-3-preview
reply
sampton 3 hours ago | parent | prev | next [–]
You can train a CNN to find bounding boxes of text first, then run a VLM on each box.
reply
yoran 11 hours ago | prev | next [–]
How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-
us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
reply
ozgune 11 hours ago | parent | next [–]
OmniAI has a benchmark that compares LLMs to cloud OCR services.
https://getomni.ai/blog/ocr-benchmark (Feb 2025)
Please note that LLMs have progressed at a rapid pace since February. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use case.
reply
cheema33 3 minutes ago | root | parent | next [–]
Omni OCR team says that according to their own benchmark, the best OCR is the Omni OCR. I am quite
surprised.
reply
CaptainOfCoit 7 hours ago | root | parent | prev | next [–]
Magistral-Small-2509 is pretty neat as well for its size; it has reasoning + multimodality, which helps in some cases where context isn't immediately clear or there are a few missing spots.
reply
stopyellingatme 51 minutes ago | parent | prev | next [–]
Not sure about the others, but we use Azure AI Document Intelligence and it's working well for our resume parsing system. It took a good bit of tuning, but we haven't had to touch it for almost a year now.
reply
daemonologist 5 hours ago | parent | prev | next [–]
My base expectation is that the proprietary OCR models will continue to win on real-world documents, and my guess is
that this is because they have access to a lot of good private training data. These public models are trained on arxiv
and e-books and stuff, which doesn't necessarily translate to typical business documents.
As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across
the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)
reply
numpad0 8 hours ago | parent | prev | next [–]
Classical OCR still probably makes undesirable su6stıtutìons in CJK, since there are far too many similar characters, even some absurd ones that are only distinguishable under a microscope or by looking at binary representations. LLMs are better constrained to valid sequences of characters, and so they would be more accurate.
Or at least that kind of thing would motivate them to re-implement OCR with LLM.
reply
fluoridation 5 hours ago | root | parent | next [–]
Huh... Would it work to have some kind of error checking model that corrected common OCR errors? That
seems like it should be relatively easy.
reply
colonCapitalDee 1 hour ago | root | parent | next [–]
It's harder than it first seems. The root problem is that for text like "hallo", correcting to "hello" may be
fixing an error or introducing an error. In general, the more aggressive your error correction, the more
errors you inadvertently introduce. You can try and make a judgement based on context ("hallo, how are
you?"), which certainly helps, but it's only a mitigation. Light error correction is common and effective,
but you can't push it to a full solution. The only way to fully solve this problem is to look at the entire
document at once so you have maximum context available, and this is what non-traditional OCR attempts
to do.
reply
fluoridation 1 hour ago | root | parent | next [–]
Okay, but there are way more common errors that should be easy to fix: "He11o", "Emest Herningway", incorrect diacritics like the other person mentioned, etc.
reply
sandblast 11 hours ago | parent | prev | next [–]
Not sure why you're being downvoted, I'm also curious.
reply
make3 5 hours ago | parent | prev | next [–]
Aren't all of these multimodal LLM approaches, just open vs. closed ones?
reply
pietz 10 hours ago | prev | next [–]
My impression is that OCR is basically solved at this point.
The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.
I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and
asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around
$0.20 for 1000 pages in batch mode and the results have been great.
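A minimal sketch of that setup with the google-generativeai client (the model name, prompt, and file name are just placeholders, and true batch mode goes through the separate batch API):

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.5-flash-lite")

    page = Image.open("page_001.png")   # one rendered page per request
    prompt = ("Extract all text from this page as Markdown. "
              "Preserve headings, lists and tables; do not add commentary.")
    response = model.generate_content([prompt, page])
    print(response.text)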
I'd be interested to hear where OCR still struggles today.
reply
cahaya 9 hours ago | parent | next [–]
Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on those things.
Curious to hear which OCR/LLM excels at these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip
I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist
reply
CaptainOfCoit 9 hours ago | root | parent | next [–]
> Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML:
But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly
changes from "Can take letters in images and make into digital text" to "Can replicate anything seen on a
screen", the problem-space gets too big.
For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
reply
kmacdough 5 hours ago | root | parent | next [–]
> But that's something else, that's no longer just OCR ("Optical Character Recognition").
Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal
semantics.
It is a fair question whether the OCR-inspired approach is the correct approach for more complex
structured documents where wider context may be important. But saying it's "not OCR" doesn't seem
meaningful from a technical perspective. It's an extension of the same goal to convert images of
documents into the most accurate and useful digitized form with the least manual intervention.
reply
CaptainOfCoit 5 hours ago | root | parent | next [–]
Personally I think it's a meaningful distinction between "Can extract text" VS "Can extract text and
structure". It is true that some OCR systems can handle trying to replicate the structure, but still
today I think that's the exception, not the norm.
Not to mention it's helpful to separate the two because there is such a big difference in the
difficulty of the tasks.
reply
eeixlk 9 hours ago | root | parent | prev | next [–]
htceaad t nofdnsy lyruuieo sieerrr t owcope?
reply
pietz 9 hours ago | root | parent | prev | next [–]
I threw in the first image/table into Gemini 2.5 Pro letting it choose the output format and it looks like it
extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I
didn't specify preferences.
reply
carschno 10 hours ago | parent | prev | next [–]
Technically not OCR, but HTR (handwritten text recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.
reply
mormegil 10 hours ago | root | parent | next [–]
This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to
read, I turned to LLMs, hearing they are great in OCR. It’s been… terrible. The LLM will transcribe the record
without problems, the output seems completely correct, a typical text of a vital record. Just… the transcribed
text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old
vital record transcription – even though the transcribed text is far from perfect, many letters and words are
recognized incorrectly, it helps me a lot with the more difficult records.
reply
pietz 9 hours ago | root | parent | prev | next [–]
We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I
am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I
can't read that" it just made up something.
reply
CraigRood 7 hours ago | root | parent | next [–]
I have a thought that whilst LLM providers can say "Sorry", there is little incentive to, and it would expose the reality that they are not very accurate, nor can they be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should refer to the user rather than guessing.
reply
sramam 10 hours ago | root | parent | prev | next [–]
Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?
I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots, and it has worked really well.
reply
lazide 8 hours ago | root | parent | next [–]
Often, the review LLM will also say everything is okay when it’s made up too.
reply
raincole 10 hours ago | parent | prev | next [–]
If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.
(I'm not being snarky. It's acceptable in some cases.)
reply
jakewins 10 hours ago | root | parent | next [–]
But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up
plausible looking text instead of text riddled with errors, which makes it much harder to catch the mistakes, in
fairness
reply
rkagerer 8 hours ago | root | parent | next [–]
Good libraries gave results with embedded confidence levels for each unit recognized.
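Tesseract, for example, exposes a per-word confidence you can threshold (a minimal pytesseract sketch; the cutoff of 60 is arbitrary):

    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(Image.open("page.png"),
                                     output_type=pytesseract.Output.DICT)
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) < 60:   # flag low-confidence words for review
            print(word, conf)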
reply
wahnfrieden 5 hours ago | root | parent | prev | next [–]
Existing OCR doesn’t skip over entire (legible) paragraphs or hallucinate entire sentences.
reply
criddell 4 hours ago | root | parent | next [–]
I usually run the image(s) through more than one converter then compare the results. They all
have problems, but the parts they agree on are usually correct.
reply
KoolKat23 3 hours ago | root | parent | prev | next [–]
This must be some older/smaller model.
reply
Davidzheng 5 hours ago | root | parent | prev | next [–]
rarely happens to me using LLMs to transcribe pdfs
reply
KoolKat23 3 hours ago | root | parent | prev | next [–]
These days it does just that; it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).
Blotchy text and a particular typeface can make 6's look like 8's; to the non-discerning eye a human would think it's an 8, then zoom in and see it's a 6.
Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.
reply
red75prime 9 hours ago | root | parent | prev | next [–]
Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
reply
wahnfrieden 5 hours ago | root | parent | prev | next [–]
Do any LLM OCRs give bounding boxes anyway? Per character and per block.
reply
kelvinjps10 54 minutes ago | root | parent | next [–]
Gemini does, but it's not as good as Google Vision, and the format is different. Here's the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...
Also, Simon Willison made a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
I hope this capability improves so I can use only the Gemini API.
reply
peter-m80 10 hours ago | parent | prev | next [–]
No way is it solved. Try running OCR over a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines, and from time to time I try to OCR them with state-of-the-art mechanisms. All of them require a lot of human intervention.
reply
pietz 9 hours ago | root | parent | next [–]
Could you provide an example that fails? I'm interested in this.
reply
jmkni 10 hours ago | root | parent | prev | next [–]
do you have an example of a particularly tricky one?
reply
ekianjo 10 hours ago | root | parent | next [–]
Just try old ads; you will see how hard it gets.
reply
constantinum 4 hours ago | root | parent | prev | next [–]
I use LLMWhisperer[1] for OCR'ing old magazine ads. It preserves the layout and context. Example: https://postimg.cc/ts3vT7kG
[1] https://pg.llmwhisperer.unstract.com/
reply
Gazoche 3 hours ago | parent | prev | next [–]
There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on
your problem domain.
Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well, if your use case is, say, digitizing old printed novels, then yeah, it's probably good enough.
But what if your documents are personal records with millions of names, to insert into some administrative database? At ~10 characters per name, roughly 1 in 100 people will have their name misspelled. Oops.
reply
kbumsik 10 hours ago | parent | prev | next [–]
> My impression is that OCR is basically solved at this point.
Not really, in my experience. They especially still struggle with table format detection.
reply
coulix 10 hours ago | root | parent | next [–]
This.
Any complex parent-table / spanned-cell relationship still has low accuracy.
Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus 3.1, or Gemini 2.5 Pro to produce an HTML table.
They will fail.
reply
pietz 9 hours ago | root | parent | next [–]
Maybe my imagination is limited or our documents aren't complex enough, but are we talking about
realistic written documents? I'm sure you can take a screenshot of a very complex spreadsheet and it
fails, but in that case you already have the data in structured form anyway, no?
reply
daemonologist 4 hours ago | root | parent | next [–]
Now if someone mails or faxes you that spreadsheet? You're screwed.
Spreadsheets are not the biggest problem though, as they have a reliable 2-dimensional grid - at
worst some cells will be combined. The form layouts and n-dimensional table structures you can
find on medical and insurance documents are truly unhinged. I've seen documents that I struggled
to interpret.
reply
KoolKat23 3 hours ago | root | parent | next [–]
To be fair, this is problematic for humans too. My old insurer outright rejected things like
that stating it's not legible.
(I imagine it also had the benefit of reducing fraud/errors).
In this day and age, it's probably easier/better to change the process around that as there's
little excuse for such shit quality input. I understand this isn't always possible though.
reply
kbumsik 8 hours ago | root | parent | prev | next [–]
> realistic written documents?
Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.
I have seen so many mistakes when looking at the result closely.
Here is a DEF 14A filing from Salesforce. You can print it to a PDF and then try converting it.
https://www.sec.gov/Archives/edgar/data/1108524/000110852425...
reply
grosswait 6 hours ago | root | parent | next [–]
Historical filings are still a problem, but hasn’t the SEC required filing in an XML format since
the end of 2024?
reply
richardlblair 5 hours ago | root | parent | next [–]
It's not really about SEC filings, though. While we folks on HN would never think of hard copies of invoices, much of the world still operates this way.
As mentioned above, I have about 200 construction invoices. They are all formatted in a way that doesn't make sense. Most fail both OCR and OpenAI.
reply
KoolKat23 3 hours ago | root | parent | next [–]
OpenAI has unusably low image DPI. Try Gemini.
reply
bobsmooth 10 hours ago | root | parent | prev | next [–]
Maybe I misunderstood the assignment but it seems to work for me.
https://chatgpt.com/share/68f5f9ba-d448-8005-86d2-c3fbae028b...
Edit: Just caught a mistake, transcribed one of the prices incorrectly.
reply
kbumsik 10 hours ago | root | parent | next [–]
Right, I wouldn't use a VLM for full table detection, because they tend to make mistakes with the numbers in tables...
reply
richardlblair 5 hours ago | root | parent | prev | next [–]
I mentioned this when the new Qwen model dropped - I have a stack of construction invoices that fail with both OCR and OpenAI.
It's a hard (and very interesting) problem space.
reply
6gvONxR4sf7o 1 hour ago | parent | prev | next [–]
OCR for printed documents is super robust, but handwriting, low-res input, and aligned recognition (not just image to "hello world" but also "h is here in space, e is here in space, ...") are all still well behind "basically solved."
reply
burpsnard 7 hours ago | parent | prev | next [–]
I've only used Tesseract 'recreationally', but I tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable; I was shocked at how bad it was. It heavily relies on language models of character sequences, so it's pretty useless on 'line noise'.
reply
cle 3 hours ago | parent | prev | next [–]
That will not work with many of the world's most important documents because of information density. For example,
dense tables or tables with lots of row/col spans, or complex forms with checkboxes, complex real-world formatting
and features like strikethroughs, etc.
To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information
density threshold of the model, given the task.
This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element
can fit within the information density limit. A really simple example is a table that spans hundreds of pages. Solving that
generally is an open problem.
reply
kelvinjps10 1 hour ago | parent | prev | next [–]
Google Vision is still better than Gemini at OCR, for example at getting bounding boxes.
reply
blindriver 15 minutes ago | parent | prev | next [–]
I attempted OCR using all of the open source models available about 3 months ago, including Llama 4. These were
pngs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90%
accurate. Using OpenAI or Gemini produced much better results but the open source models were really bad.
reply
robotswantdata 9 hours ago | parent | prev | next [–]
VLMs suck at complex layouts, and there is a high risk of hallucination. Never use them alone for contracts or health data.
reply
simlevesque 5 hours ago | parent | prev | next [–]
> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
Can you explain more about your setup ? I have a quarter million pages I want to OCR.
reply
KoolKat23 4 hours ago | parent | prev | next [–]
I agree, Gemini 2.5 models are excellent.
The fuss around old fashioned OCR seemed strange to me initially considering the above, but I selfishly forgot to
consider addressing compute/offline requirements.
It would also be nice for there to be a good competitor.
reply
vintermann 9 hours ago | parent | prev | next [–]
OCR of printed text may be one thing, but handwriting OCR (a.k.a. HTR) is very, very far from solved. It's actually hard to find a practical task that general historical HTR is good enough to do usefully, even for state-of-the-art models.
reply
Davidzheng 5 hours ago | parent | prev | next [–]
I think it'd be good to have an end-to-end PDF-to-LaTeX converter for old math papers. Almost all models still struggle with commutative diagrams, especially very complicated ones.
reply
darkwater 10 hours ago | parent | prev | next [–]
So, the mug with inspirational text says "Bountiful Potential"?
reply
llm_nerd 7 hours ago | parent | prev | next [–]
Complex documents are where OCR struggles mightily. If you have a simple document with paragraphs of text, sure, OCR is pretty much solved. If you have a complex layout with figures and graphs and supporting images and asides and captions and so on (basically any paper, or even trade documents), it absolutely falls apart.
And general-purpose LLMs are heinous at OCR. If you are having success with Flash Lite, your documents must be incredibly simple.
There have been enormous advances in OCR over the past 6 months, so the SOTA is a moving, rapidly advancing target.
reply
sbinnee 8 hours ago | parent | prev | next [–]
Maybe for English. Other languages are very much not solved.
reply
baobun 7 hours ago | parent | prev | next [–]
Chinese, especially handwritten.
reply
constantinum 4 hours ago | parent | prev | next [–]
Why PDF parsing is Hell[1]:
Fixed layout and lack of semantic structure in PDFs.
Non-linear text flow due to columns, sidebars, or images.
Position-based text without contextual or relational markers.
Absence of standard structure tags (like in HTML).
Scanned or image-based PDFs requiring OCR.
Preprocessing needs for scanned PDFs (noise, rotation, skew).
Extracting tables from unstructured or visually complex layouts.
Multi-column and fancy layouts breaking semantic text order.
Background images and watermarks interfering with text extraction.
Handwritten text recognition challenges.
[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
reply
foofoo12 6 hours ago | prev | next [–]
How does it compare to Tesseract? https://github.com/tesseract-ocr/tesseract
I use ocrmypdf (which uses Tesseract). Runs locally and is absolutely fantastic. https://ocrmypdf.readthedocs.io/en/latest/
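The basic usage is a one-liner; via its Python API it looks roughly like this (most CLI flags map to keyword arguments):

    import ocrmypdf

    # adds a searchable text layer to a scanned PDF, running Tesseract under the hood
    ocrmypdf.ocr("input.pdf", "output.pdf", deskew=True)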
reply
utopiah 5 hours ago | parent | next [–]
Indeed, it seems the default comparison is against LLM/VLM-based alternatives, as if they somehow "solved" the problem. But IMHO, even if accuracy goes from (totally made-up numbers) 80% with Tesseract to 95% with this or Qwen or whatever, if it takes 100x the disk space for containers or a CUDA stack, plus dedicated hardware (e.g. a GPU with 16GB of VRAM), then it's such a trade-off that it should be weighed carefully.
reply
modeless 4 hours ago | prev | next [–]
Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM
improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of
academic papers as a data source.
Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for
LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline
for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use
case and it's not the main focus.
reply
edtechdev 3 hours ago | prev | next [–]
I tried this out on huggingface, and it has the same issue as every other multimodal AI OCR option (including MinerU,
olmOCR, Gemini, ChatGPT, ...). It ignores pictures, charts, and other visual elements in a document, even though the models
are pretty good at describing images and charts by themselves. What this means is that you can't use these tools yet to
create fully accessible alternatives to PDFs.
reply
mediaman 3 hours ago | parent | next [–]
I have a lot of success asking models such as Gemini to OCR the text, and then to describe any images in the
document, including charts. I have it format the sections with XML-ish tags. This also works for tables.
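Roughly the kind of prompt I mean, as a sketch; the tag names are arbitrary ones I picked, not anything the model requires:
    # Hypothetical prompt template; the XML-ish tags just give the model a consistent scheme to follow.
    OCR_PROMPT = """
    Transcribe all text in this document verbatim inside <text> tags, preserving reading order.
    For every figure or chart, insert an <image_description> tag describing what it shows.
    Render tables inside <table> tags, one row per line, with cells separated by "|".
    """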
reply
allanren 3 hours ago | prev | next [–]
It says the conversion can reduce size with heavy compression, which basically makes the image blurry but still retains
the important information.
This is indeed amazing. It's actually how humans try to understand and remember things: visually! And when memory fades,
the image gets blurrier.
Not sure if those closed-source multimodal models are already using this method.
reply
rsp1984 4 hours ago | prev | next [–]
Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why
it's a big deal?
It's very hard to guess from the github and paper. For example, there is OCR in the title but the abstract and readme.md talk
about context compression for LLMs, which I find confusing. Somebody care to explain the link and provide some high-level
context?
reply
schopra909 4 hours ago | prev | next [–]
It's not clear to me what the bottleneck is for OCR to "100%" work with LLMs.
In my work we do a lot of stuff with image understanding and captioning (not OCR). There, object identification and
description work great, since all the models use a CLIP-like visual backbone. But it falls apart when you ask about
nuances like left/right or counting (reasoning kind of improves the latter, but it's too expensive to matter IMO).
For our tasks, it's clear that there's more fundamental research that needs to be done on vision understanding to push past
CLIP. That would really improve LLMs for our use cases.
reply
neves 6 hours ago | prev | next [–]
I see that the project uses conda for development. Is it still a good tool now that pip also installs binaries?
reply
modeless 4 hours ago | parent | next [–]
No. Everyone should be using uv instead.
reply
tidbeck 6 hours ago | prev | next [–]
How does this compare to https://huggingface.co/ibm-granite/granite-docling-258M in performance and how they work?
reply
bugglebeetle 5 hours ago | parent | next [–]
The Granite Docling models are unfortunately quite far below SOTA. dots-ocr and PaddleOCR were the best here.
reply
dlowe24 2 hours ago | prev | next [–]
The only model I've found so far that extracts table data with OCR is dots.ocr. Models that came after it have not done a
good job. I'm interested in testing this new model.
reply
shepardrtc 4 hours ago | prev | next [–]
How does this fare on the Vidore benchmark?
https://huggingface.co/spaces/vidore/vidore-leaderboard
reply
2big2fail_47 8 hours ago | prev | next [–]
I find it interesting that there are all these independent AI OCR projects but still no commercial offering. Is it still too
inaccurate, too complex, or simply too expensive?
reply
Annatar01 8 hours ago | parent | next [–]
I don't know, but maybe existing commercial OCR is still on top, and also using ML. I recently tried a free trial for
OCR/reading Sütterlin, and it was a weird feeling being so outclassed at reading.
reply
rsolva 7 hours ago | parent | prev | next [–]
Mistral offers their OCR commercially through their API and in their Chat services, at least.
https://mistral.ai/news/mistral-ocr
reply
simlevesque 5 hours ago | parent | prev | next [–]
https://cloud.google.com/document-ai
reply
aleinin 3 hours ago | parent | prev | next [–]
One that I’ve seen recently is https://reducto.ai It appears to be an OCR wrapper.
reply
daemonologist 4 hours ago | parent | prev | next [–]
There are commercial OCR offerings from the big cloud providers (plus, like, Adobe). In my experience they generally
outperform anything open-weights, although there's been a lot of improvement in VLMs in the past year or two.
reply
Eisenstein 8 hours ago | parent | prev | next [–]
It is because the AI is not actually doing OCR. It is giving an interpretation of what the text in an image is by ingesting
vision tokens and mapping them onto text tokens.
So you either have to be fine with a lot of uncertainty as to the accuracy of that interpretation or you have to wait for
an LLM that can do it in a completely reproducible way every time.
reply
piker 12 hours ago | prev | next [–]
This looks really cool for prototyping and playing around.
It seems to me, though, that if one is building a modern application that needs to get image segmentation and/or text
recognition right, there are better APIs available than natural language. It seems like a lot of effort to build a
production-scale CV application only to weigh it down with all of an LLM's shortcomings. Not a field I'm familiar with, but I
would assume that this doesn't produce state-of-the-art results; that would change the analysis.
reply
CheeseFromLidl 8 hours ago | parent | next [–]
As a hobby photographer, I organise everything for speedy retrieval, but this would be amazing for searching my collection.
reply
randomNumber7 12 hours ago | parent | prev | next [–]
Imagine you build an image segmentation model for, e.g., a specific industrial application.
With this LLM approach you can at least create your training data from the raw images using natural language.
reply
piker 12 hours ago | root | parent | next [–]
That does make sense
reply
CloseChoice 11 hours ago | prev | next [–]
It's DeepSeek, so one can expect an open-source license, but for anyone (like me) who wants to see that explicitly, since it's
not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
TLDR: It's MIT licensed
reply
AndroTux 11 hours ago | parent | next [–]
> since it's not obvious in the GitHub repo
Literally says MIT license on the right sidebar and in the readme tab and in the file called LICENSE
reply
maxloh 10 hours ago | parent | prev | next [–]
Model weights are MIT too: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
reply
mrasong 10 hours ago | prev | next [–]
Kinda reminds me of PaddleOCR.
Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more
convenient!
reply
pzo 9 hours ago | parent | next [–]
iOS already has both an on-device text detector and a document scanner in the Apple Vision API. Hard to say how good they
are compared to LLM-based solutions. Similarly, Google has had ML Kit with on-device OCR for many years.
reply
k_sze 12 hours ago | prev | next [–]
It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful
ones.
reply
daemonologist 4 hours ago | parent | next [–]
I think maybe to distinguish their dynamic resolution approach from the t-shirt sizes, which have a fixed input.
(Although I don't know why "Gundam")
reply
ammar_x 9 hours ago | prev | next [–]
Language support is not mentioned in the repo. But from the paper, it offers extensive multilingual support (nearly 100
languages) which is good, but I need to test it to see how it compares to Gemini and Mistral OCR.
reply
zacmps 6 hours ago | parent | next [–]
I suspect the number of languages it can do with reasonable accuracy is actually much smaller, probably <15.
reply
loaderchips 6 hours ago | prev | next [–]
Great work, guys. How about replacing the global encoder with a Mamba (state-space) vision backbone to eliminate the
O(n²) attention bottleneck, enabling linear-complexity encoding of high-resolution documents? Pair this with a
non-autoregressive (non-AR) decoder, such as Mask-Predict or iterative refinement, that generates all output tokens in parallel
instead of sequentially. Together, this creates a fully parallelizable vision-to-text pipeline; the combination addresses both
major bottlenecks in DeepSeek-OCR.
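For the non-AR half, here's a minimal sketch of the kind of Mask-Predict decoding loop I mean; the decoder callable, its output shape, and the linear re-masking schedule are placeholder assumptions on my part, not anything from DeepSeek-OCR:
    import torch

    def mask_predict_decode(decoder, memory, tgt_len, mask_id, iterations=4):
        # Start with every target position masked, predict all positions in parallel,
        # then iteratively re-mask and re-predict the least confident ones.
        tokens = torch.full((tgt_len,), mask_id, dtype=torch.long)
        confidence = torch.zeros(tgt_len)

        for t in range(1, iterations + 1):
            logits = decoder(tokens, memory)      # assumed shape: [tgt_len, vocab_size]
            probs = logits.softmax(dim=-1)
            new_conf, new_tokens = probs.max(dim=-1)

            masked = tokens == mask_id
            tokens = torch.where(masked, new_tokens, tokens)
            confidence = torch.where(masked, new_conf, confidence)

            if t == iterations:
                break
            # Linear schedule: re-mask fewer low-confidence positions each pass.
            n_mask = int(tgt_len * (iterations - t) / iterations)
            if n_mask > 0:
                remask = confidence.topk(n_mask, largest=False).indices
                tokens[remask] = mask_id
                confidence[remask] = 0.0
        return tokens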
reply
loaderchips 6 hours ago | parent | next [–]
Not sure why I'm getting downvoted. I'd love to have a technical discussion on the validity of my suggestions.
reply
hank2000 5 hours ago | prev | next [–]
Have y'all seen Tensorlake? I'm curious how this compares to a model custom-built for the problem. My guess is it can be as
good. But can it be as efficient?
Disclaimer: I do not work for Tensorlake, but I know the folks behind it.
reply
x______________ 11 hours ago | prev | next [–]
>先天下之忧而忧
How is this an example of a prompt?
Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."
Can anyone shed some light on this saying or why it's in the article?
reply
raincole 11 hours ago | parent | next [–]
It's a very famous (classical) Chinese phrase.
Neither translation captures the meaning well, though. It means: "worry before the rest of the world (notices that they
have something to) worry." The next part is 後天下之樂而樂 ("be happy only after the rest of the world is happy.")
I don't know why it's a prompt example.
reply
jdthedisciple 10 hours ago | root | parent | next [–]
Sibling comment has the second part as
后天下之乐而乐
which one is correct?
reply
raincole 10 hours ago | root | parent | next [–]
Traditional vs Simplified Chinese.
There are two (modern) "spellings" of written Chinese. Basically colour vs color.
reply
Y_Y 9 hours ago | root | parent | prev | next [–]
It depends on who you think is the rightful successor to the Qing dynasty
reply
SequoiaHope 11 hours ago | parent | prev | next [–]
Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses
the Confucian ideal of selfless concern for people and society.
reply
fspeech 11 hours ago | parent | prev | next [–]
Google is closer. This is from a famous essay expressing the author's desire to bear the burden for the world. The essay is
岳阳楼记 by 范仲淹, written in 1046: https://zh.wikisource.org/zh-hans/%E5%B2%B3%E9%99%BD%E6%A8%9...
reply
gudzpoz 11 hours ago | parent | prev | next [–]
This clause is usually used together with the next sentence in the original poem:
> 先天下之忧而忧,后天下之乐而乐
> (put the world's worries before yours, and put your happiness after the world's)
Edit: this translation is wrong; raincole's is definitely better.
Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the
model should be able to complete the whole sentence pair. The paper also mentions this:
> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.
So I believe it is just a text-only demonstration.
reply
jdthedisciple 10 hours ago | root | parent | next [–]
Sibling comment has the second part as
後天下之樂而樂
Which one is correct?
reply
numpad0 9 hours ago | root | parent | next [–]
a) 后天下之乐而乐
b) 後天下之樂而樂
c) 後天下之楽而楽
a) is clearly Simplified Chinese from a sibling comment, b) is Traditional copied from your comment, and
c) is as I just typed it in my own language. Unicode Hanzi/Kanji are a mess: characters can be the same
or different, in appearance or in binary, depending on the intended variant, language, font, system,
keyboard, distance between Earth and Alpha Centauri, etc.
reply
jdthedisciple 6 hours ago | root | parent | next [–]
Fascinating! That's exactly why I asked, so thank you.
Do people usually recognize all variants as valid and legible? Or does any particular set of
letters/symbols prevail in practice?
reply
numpad0 4 hours ago | root | parent | next [–]
Traditional kinds are usually recognizable, but I'd be unsure or straight up wrong about most
Simplified versions. Overall proportions and small details often feel "wrong" for both as well
due to cultures converging at different points.
reply
hank2000 5 hours ago | root | parent | prev | next [–]
Very location dependent. But when you learn to write the characters you understand the
variants differently. They look like random strokes to an untrained eye. But they’re not. I’m
not sure if that makes sense.
Take a lowercase "a" in English, for example. This font writes it differently than a child would, or than
it looks in cursive, or probably than you would write it. But you recognize all of them and don't really
think about it.
reply
singularity2001 9 hours ago | prev | next [–]
Instead of downloading a specific OCR model how would one fare just downloading the currently best multi-modal foundation
model? And what would that be at less than 30 GB?
reply
empressplay 12 hours ago | prev | next [–]
This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up,
but this looks like it can properly identify columns and track the flow accurately (and extract images!) It appears it can
convert magazine layouts to markdown too
reply
farseer 12 hours ago | prev | next [–]
How good is this compared to most commercial OCR software?
reply
ozim 12 hours ago | parent | next [–]
Any vision model is better than commercial OCR software.
reply
Etheryte 11 hours ago | root | parent | next [–]
I'm not really sure that's an accurate summary of the state of the art; [0] is a better overview. In short, SOTA
multi-modal LLMs are the best option for handwriting; nearly anything is good at printed text; and for printed media,
specialty models from the hyperscalers are slightly better than multi-modal LLMs.
[0] https://research.aimultiple.com/ocr-accuracy/
reply
ozim 11 hours ago | root | parent | next [–]
I see it confirms what I wrote: the state of the art is "not using Tesseract anymore," and I think a bunch of
commercial solutions are stuck with Tesseract.
reply
ares623 10 hours ago | root | parent | next [–]
I assume Tesseract has the advantage of being able to give a confidence score?
reply
dragonwriter 4 hours ago | root | parent | prev | next [–]
Since “commercial OCR software” includes VLM-based commercial offerings, that's clearly not correct.
reply
brightUiso 11 hours ago | prev | next [–]
A bit of education, please: what does it do?
reply
bugglebeetle 12 hours ago | prev | next [–]
Looks great, but looking at the benchmark, I can't help but think about how crazy good dots-ocr is as a model. Too bad they're
not as open as the DeepSeek team, because it's so good and I'd love to know how it was trained.
reply
rfoo 12 hours ago | parent | next [–]
If you look, you'll notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p
reply
bugglebeetle 2 hours ago | root | parent | next [–]
Oh you’re right! Good catch!
reply
bethekind 12 hours ago | parent | prev | next [–]
Did we read the same graph? DeepSeek Gundam at 200 dpi appeared to get similar performance to dots-ocr, but with fewer
tokens needed. The x-axis is inverted, descending with distance from the origin.
reply
tinyhouse 7 hours ago | prev | next [–]
OCR is not a great name for these models. While they can do traditional OCR, such as digitizing a scanned PDF,
they do so much more.
reply
intalentive 3 hours ago | parent | next [–]
> they do so much more
I'm not familiar. What else are they good for?
reply
tinyhouse 2 hours ago | root | parent | next [–]
They can take something like an image of a graph and provide a description of it. From my understanding,
these are multimodal models with reasoning capabilities.
reply
joshstrange 2 hours ago | prev [–]
> [2025/x/x] We release DeepSeek-OCR, a model to investigate the role of vision encoders from an LLM-centric viewpoint.
So close but it should be 2025/X/XX as "X" = 10 in Roman Numerals /s
Jokes aside, this is really neat and I'm looking forward to getting this running. For most OCR-type stuff I just use AWS
Textract since I need it so rarely and that service does a decent job. I really like how well this model seems to extract
images/figures as well from the original document.
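For reference, the simple synchronous Textract path is only a few lines with boto3 (a sketch; detect_document_text only returns plain lines/words, so tables and forms would need analyze_document instead):
    import boto3

    textract = boto3.client("textract")

    # Synchronous text detection on a single-page image (PNG/JPEG) or one-page PDF.
    with open("scanned_page.png", "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})

    # LINE blocks carry the recognized text in reading order.
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    print("\n".join(lines))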
reply