Language Detection: Should we allow results that are less than "und" to be included? #51
Comments
Hmm, this is tricky. To flesh out your example more so that it sums to 1, let's say we have
In this case the current algorithm gives
The root cause, of course, is that we're aggregating two cases into
I don't really know what the most useful outcome would be here for developers. Some possibilities:
I welcome ideas. |
Yes.
=>
I'm starting to feel that we need to accept that sometimes |
Adding i18n labels and summoning @aphillips in case he has any thoughts. I posted a Twitter poll on the ordering at https://x.com/domenic/status/1912036204163555338 . |
I'm currently not a user of whatever it is you are building; just following your invitation to voice my thoughts on the topic. My gut feeling is: the list should be ordered by
I can rationalize my gut feeling with several points:
Semantics
Semantically it is not a tuple you are returning; in a tuple like react's
By placing
{
  detected: [ { language: 'es', confidence: 0.15 }, { language: 'en', confidence: 0.06 } ],
  undetermined: 0.1,
}
Robustness w.r.t. Future Change
If you now decide the contract should be: "the last item in the list is always
const result = translationAPI.detectLanguage("Lorem ipsum ...")
const undetermined = result[result.length - 1]
const detectedLanguages = result.slice(0, -1)
...but if you now decide the contract should be: "the items are sorted by confidence,
const result = translationAPI.detectLanguage("Lorem ipsum ...")
const undetermined = result.find( res => res.language === 'und' )
const detectedLanguages = result.filter( res => res !== undetermined )
Now should you decide in the future that when the language can actually be detected definitely you might want to omit the |
I brought this up in the I18N call of 2025-04-17, where we discussed the problem at length (the notes are incomplete). The point of language detection is positive detection. That is, nobody builds an The value ascribed to Thought experiment: suppose your detector can do many languages, but has no detector for any of the languages using the Adlam script. If you hand it some Fula that has Latin-script inclusions ("𞤃𞤭 𞤴𞤭𞤴𞤭𞥅 𞤫 𞤼𞤫𞤤𞤫 Obama 𞤴𞤢𞤸𞤭𞥅𞤲𞤮 𞤊𞤢𞤪𞤢𞤴𞤧𞤫" <= I saw on television that Obama went to France). What you'd like to occur is that
So... I don't know if putting |
Thanks all for the discussion. I'm currently thinking we should put it in the middle of the list, but there's more to discuss for sure, including possibly changing the return value shape similar to @teetotum's suggestion.
I am sympathetic to this. The problem is that, at least with our current spec strategies, attempting to get rid of the ones after it will work poorly. The constraint that we're placing ourselves under is that the total sum of all detection results (including But consider what would happen if we took the original example (#51 (comment)) and tried to get rid of the results below
This seems subpar to me as we've lost potentially useful information. It's even worse with a slight modification of the second example (#51 (comment)):
To avoid this comedy of errors, while still avoiding including results below
Not by the implementation/spec, but perhaps you meant by web developer code?
As you can see from the above, the answer is basically yes: we're ascribing it two meanings, one of which is "the language detector doesn't know the language" and the other is "it was a very low-probability detection result". |
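To make the two meanings concrete, here is a minimal sketch of how a sum-to-1 constraint pushes low-probability results into the "und" bucket. This is not the spec's algorithm: the helper name `foldLowConfidenceIntoUnd`, the `threshold` value, and the input numbers are all made up for illustration, and the `language`/`confidence` field names simply follow the snippets earlier in this thread.

```javascript
// Illustrative only: a hypothetical helper, not the spec's algorithm.
// `threshold` is an assumed cutoff chosen for this example.
function foldLowConfidenceIntoUnd(rawResults, threshold) {
  const kept = [];
  let undConfidence = 0;
  for (const { language, confidence } of rawResults) {
    if (language === "und" || confidence < threshold) {
      // Both "detector does not know" and "too unlikely to report" land here.
      undConfidence += confidence;
    } else {
      kept.push({ language, confidence });
    }
  }
  // Highest confidence first, then the aggregated "und" entry at the end.
  kept.sort((a, b) => b.confidence - a.confidence);
  kept.push({ language: "und", confidence: undConfidence });
  return kept;
}

// Made-up input whose confidences sum to 1.
const rawExample = [
  { language: "es", confidence: 0.5 },
  { language: "en", confidence: 0.3 },
  { language: "fr", confidence: 0.05 },
  { language: "de", confidence: 0.05 },
  { language: "und", confidence: 0.1 },
];
const folded = foldLowConfidenceIntoUnd(rawExample, 0.1);
console.log(folded);
```

With this input, the returned "und" entry mixes the detector's own uncertainty with the dropped "fr" and "de" results, so a caller can no longer tell the two apart.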
After researching other language detection APIs, I think this might be the right solution. I cannot find any indications that other APIs have imposed this constraint on themselves. |
A lot of the examples in this thread already don't sum to 1 and I came to this issue this morning to point out that there should not be a hidden assumption that they sum up, only to find you'd gotten there first. One design that detectors can use is separate parallel state machines, where each language produces a score independent of the others (this is especially helpful when using different detection strategies for different languages, e.g. n-gram for alphabetic scripts but char frequency for CJK, see for example the old chardet code). Note in the multi-round example that the scores above If everything adds up to
That's what I thought. Going back to my Adlam detector example, if you had 100 language detectors (but none for an Adlam script language) and each produce a score of Just to reemphasize:
I think |
Another question to add to the discussion is should we ever exclude |
Thanks again @aphillips. At this point I've come around to
instead of our current approach of also including low-probability results. We can do this by relaxing the sum-to-1 constraint, and so we just throw out the low-probability results. I've drafted a version of this at #52. Note that currently low-probability is defined as:
It sounds like from your
you might not agree with the first criterion here. I agree your example is an interesting one. We could even go a step further and omit the |
IMO it should not. It feels a bit cleaner if |
@domenic noted:
That's not an
Actually, I agree with the first criterion more than the second one. The probabilities emitted by a language detector can be somewhat arbitrary. Is a detection with 2% probability really any more interesting than 1%? Is 3%? 5%? How do these numbers get assigned? If they are at all algorithmic, the more interesting signal is the "not me" signal from each. |
@nathanmemmott pointed out in another channel that this could result in returning an empty array. That seems pretty confusing for developers. So at this point I think #52 (throw away low-probability languages, keep und) is the best path forward.
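A rough sketch of the "throw away low-probability languages, keep und" idea, for readers who want to see the shape of it. It is not the text of #52: the function name `dropLowProbabilityKeepUnd` is hypothetical, and the low-probability test is passed in as a caller-supplied predicate because the actual criteria are not reproduced here. The 0.05 cutoff in the usage lines is an arbitrary assumption.

```javascript
// Hypothetical helper, not the drafted spec text.
// `isLowProbability` stands in for whatever criteria the spec ends up using.
function dropLowProbabilityKeepUnd(results, isLowProbability) {
  // "und" is always kept, so the output is non-empty whenever "und" is present.
  return results.filter(
    (r) => r.language === "und" || !isLowProbability(r)
  );
}

// Made-up results; the 0.05 cutoff is an assumption for this example only.
const detected = [
  { language: "es", confidence: 0.6 },
  { language: "en", confidence: 0.15 },
  { language: "fr", confidence: 0.02 },
  { language: "und", confidence: 0.1 },
];
const trimmed = dropLowProbabilityKeepUnd(detected, (r) => r.confidence < 0.05);
console.log(trimmed);
```

Because nothing is folded back into "und", the remaining confidences no longer sum to 1, which is the relaxation discussed above.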
Good distinction! Sadly Google Chrome's current language detection model doesn't have I wonder if the specification should discuss this distinction more. |
"und" could outrank another result. For example, if we have:
It would result in:
Not sure if that's necessarily undesirable. If not, another question is whether und should still be last. Being last may still be desirable even if und outranks other results.
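The two orderings in question can be compared side by side. This is only a sketch with made-up numbers; `sortByConfidence` and `sortWithUndLast` are hypothetical names, not part of any API.

```javascript
// Option A (hypothetical): strict ordering by confidence, "und" treated like any entry.
function sortByConfidence(results) {
  return [...results].sort((a, b) => b.confidence - a.confidence);
}

// Option B (hypothetical): languages ordered by confidence, "und" pinned to the end.
function sortWithUndLast(results) {
  const languages = results.filter((r) => r.language !== "und");
  const und = results.filter((r) => r.language === "und");
  return [...sortByConfidence(languages), ...und];
}

// Made-up results in which "und" outranks one detected language.
const ranked = [
  { language: "fr", confidence: 0.2 },
  { language: "und", confidence: 0.3 },
  { language: "en", confidence: 0.5 },
];
const orderA = sortByConfidence(ranked).map((r) => r.language);
const orderB = sortWithUndLast(ranked).map((r) => r.language);
console.log(orderA, orderB);
```

Option A lets callers stop reading at the first entry they distrust; option B lets callers always find "und" at a fixed position.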