Language Detection: Should we allow results that are less than "und" to be included? #51

Closed
nathanmemmott opened this issue Apr 14, 2025 · 13 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@nathanmemmott

"und" could outrank another result. For example, if we have:

es: 0.15
unknown: 0.1
en: 0.06

It would result in:

es: 0.15
und: 0.16

I'm not sure that's necessarily undesirable. If it isn't, the next question is whether und should still come last. Then again, maybe keeping it last is desirable even when it outranks other results.

@domenic commented Apr 15, 2025

Hmm, this is tricky.

To flesh out your example more so that it sums to 1, let's say we have

it: 0.69
es: 0.15
unknown: 0.1
en: 0.06

In this case the current algorithm gives

it: 0.69
es: 0.15
und: 0.16

The root cause, of course, is that we're aggregating two cases into und:

  1. The implementation believes that there is a 10% probability of the language being different from all the possible languages it knows how to detect

  2. The implementation believes that there is a 6% probability of it being English, but we say that, since this is even lower confidence than (1), it's not useful to give this information to developers.

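A minimal sketch of that aggregation, as I understand it (names and shapes are illustrative, not spec text):

function aggregateUnd(raw) {
  // Roll "unknown", plus every language scoring below it, into one "und" entry.
  const unknown = raw.unknown ?? 0;
  const results = [];
  let und = unknown; // case 1: the detector's own "I don't know" probability
  for (const [language, confidence] of Object.entries(raw)) {
    if (language === 'unknown') continue;
    if (confidence < unknown) {
      und += confidence; // case 2: even less confidence than "unknown" itself
    } else {
      results.push({ language, confidence });
    }
  }
  results.push({ language: 'und', confidence: und });
  return results;
}

// aggregateUnd({ it: 0.69, es: 0.15, unknown: 0.1, en: 0.06 })
// yields it: 0.69, es: 0.15, und: 0.16, matching the above.
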
I don't really know what the most useful outcome would be here for developers. Some possibilities:

  • It's fine as-is, maybe with some reordering like you mention.
  • Stop treating things below the unknown value as unimportant. (Can the same problem just reoccur because of the cumulative confidence cutoff clause, which also causes us to lump some very low-probability languages into und?)
  • ... something else? ...

I welcome ideas.

@domenic commented Apr 15, 2025

  • (Can the same problem just reoccur because of the cumulative confidence cutoff clause, which also causes us to lump some very low-probability languages into und?)

Yes.

it: 0.50
es: 0.20
fr: 0.10
pt: 0.09
ro: 0.04
de: 0.03
pl: 0.02
zh: 0.007
nl: 0.003
... a bunch of even smaller results ...

=>

it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
ro:  0.04
de:  0.03
pl:  0.02
zh:  0.007
nl:  0.003
und: 0.01

I'm starting to feel that we need to accept that sometimes und will be larger than some entries in the list. The remaining question is then whether we should order strictly by likelihood, or always put und last.
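
For concreteness, the cumulative-cutoff clause behaves roughly like this (again only a sketch; results is sorted by descending confidence):

function applyCumulativeCutoff(results) {
  // Keep results until 99% cumulative probability is covered, then lump
  // everything remaining into a single trailing "und" entry.
  const kept = [];
  let cumulative = 0;
  for (const result of results) {
    if (cumulative >= 0.99) break;
    kept.push(result);
    cumulative += result.confidence;
  }
  kept.push({ language: 'und', confidence: 1 - cumulative }); // ≈ 0.01 above
  return kept;
}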

@domenic commented Apr 15, 2025

Adding i18n labels and summoning @aphillips in case he has any thoughts. I posted a Twitter poll on the ordering at https://x.com/domenic/status/1912036204163555338 .

@domenic added the i18n-tracker label Apr 15, 2025
@teetotum

I'm currently not a user of whatever it is you are building; just following your invitation to voice my thoughts on the topic.

My gut feeling is: the list should be ordered by confidence/probability and und should not have a fixed position in the result list.

I can rationalize my gut feeling with several points:

Semantics

Semantically it is not a tuple you are returning; in a tuple like React's const [state, setState] = useState() the position of items is fixed and does carry semantic meaning.
Semantically you are returning a list; the order in a list is either arbitrary and irrelevant, or it follows some ordering predicate.

By placing und at the end you would mix up tuple and list semantics. I would argue in this case und has no business being part of the list but should be placed next to it in the result data structure:

{
  detected: [ { language: 'es', confidence: 0.15 }, { language: 'en', confidence: 0.06 } ],
  undetermined: 0.1,
}

Robustness w.r.t. Future Change

If you decide the contract should be "the last item in the list is always und", users of your API would happily write (ugly) code like this:

const result = translationAPI.detectLanguage("Lorem ipsum ...")
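// relies on position: assumes und is always the last entry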
const undetermined = result[result.length - 1]
const detectedLanguages = result.slice(0, -1)

...but if instead you decide the contract should be "the items are sorted by confidence; und is part of the list but can be in any position", users of your API would write:

const result = translationAPI.detectLanguage("Lorem ipsum ...")
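// position-independent: find und wherever it appears in the list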
const undetermined = result.find( res => res.language === 'und' )
const detectedLanguages = result.filter( res => res !== undetermined )

Now suppose you later decide that when the language can actually be detected definitively, the und item should be omitted. In the former case this breaking change would silently produce wrong results; in the latter case it would surface as .find() returning undefined, which I think would lead to faster discovery that a breaking change happened, and an easier fix. In fact, it might even give you the idea now to say the contract should be: "the items are sorted by confidence; und might be part of the list (in any position) but is not guaranteed to be present".

@aphillips

I brought this up in the I18N call of 2025-04-17, where we discussed the problem at length (the notes are incomplete).

The point of language detection is positive detection. That is, nobody builds an und detector. Instead they build a detector (or suite of detectors) for some list of languages (which is always less than the 7K+ languages in the world). When detecting languages that share some elements of the writing system (script, punctuation, etc.), you'll get some signal from "everyone" because there will be n-grams or cognates or something that matches all of them. Many single-language detectors produce a "not me" score (at least internally).

The value ascribed to und then is usually an aggregate of detectors saying "probably not me" or some kind of user-assigned cutoff. To @domenic's point, this means there will probably be languages below the cutoff threshold. It's not clear to me what utility there is in looking at languages below the cutoff for most end-user applications. Getting the full list, however, might be of interest to people developing detectors or tuning performance?

Thought experiment: suppose your detector can do many languages, but has no detector for any of the languages using the Adlam script. If you hand it some Fula that has Latin-script inclusions ("𞤃𞤭 𞤴𞤭𞤴𞤭𞥅 𞤫 𞤼𞤫𞤤𞤫 Obama 𞤴𞤢𞤸𞤭𞥅𞤲𞤮 𞤊𞤢𞤪𞤢𞤴𞤧𞤫" <= I saw on television that Obama went to France). What you'd like to occur is that und has the highest score, but there's at least that one word that the English (and Swahili and French and...) detector knows. So:

und: 0.95
en: 0.03
sw: 0.02
fr: 0.01
de: 0.0095
etc: ...  // yes, it's a valid tag but it means Etchemin

So... I don't know if putting und at the end or mid-list makes more sense. If it's mid-list, is there any point to the values after it? If it's at the end, did you already select a language that has lower confidence? Is there danger of it being truncated from the results? Is there some other meaning ascribed to und?

@domenic commented Apr 21, 2025

Thanks all for the discussion. I'm currently thinking we should put it in the middle of the list, but there's more to discuss for sure, including possibly changing the return value shape similar to @teetotum's suggestion.

If it's mid-list, is there any point to the values after it?

I am sympathetic to this. The problem is that, at least with our current spec strategies, attempting to get rid of the ones after it will work poorly.

The constraint that we're placing ourselves under is that the total sum of all detection results (including und) is 1. So this means we lump all lower-confidence results into und.

But consider what would happen if we took the original example (#51 (comment)) and tried to get rid of the results below und:

// Raw results
it: 0.69
es: 0.15
unknown: 0.1
en: 0.06

// Current algorithm: roll "unknown" + anything below it into "und"
it: 0.69
es: 0.15
und: 0.16

// New proposed step: get rid of results below "und", rolling them into "und"
it: 0.69
und: 0.31

This seems subpar to me as we've lost potentially useful information. It's even worse with a slight modification of the second example (#51 (comment)):

// Raw results:
it: 0.50
es: 0.20
fr: 0.10
pt: 0.09
ro: 0.04
de: 0.031
pl: 0.019
zh: 0.007
nl: 0.003
... a bunch of even smaller results ...

// Current algorithm: once we reach 0.99 cumulative, roll everything else into "und"
it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
ro:  0.04
de:  0.031
pl:  0.019
zh:  0.007
nl:  0.003
und: 0.01

// New proposed step: get rid of results below "und", rolling them into "und"
it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
ro:  0.04
de:  0.031
pl:  0.019
und: 0.02

// New proposed step round 2
it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
ro:  0.04
de:  0.031
und: 0.039

// New proposed step round 3
it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
ro:  0.04
und: 0.07

// New proposed step round 4
it:  0.50
es:  0.20
fr:  0.10
pt:  0.09
und: 0.11

// ... etc ...
it: 0.50
und: 0.50

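The cascade falls out mechanically if the proposed step is written as a loop that runs until nothing scores below und (a sketch; results is sorted descending, with the und entry last):

function dropBelowUnd(results) {
  const und = results.pop(); // the trailing "und" entry
  // Keep absorbing the lowest-confidence language while it scores below
  // "und"; in the worst case this consumes the entire tail of the list.
  while (results.length > 0 &&
         results[results.length - 1].confidence < und.confidence) {
    und.confidence += results.pop().confidence;
  }
  results.push(und);
  return results;
}
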
To avoid this comedy of errors, while still leaving out results below und's confidence, we'd need to do one of the following:

  • Remove the constraint on summing to 1, so that instead of rolling low-probability values into und we can just drop them.
  • Distinguish "unknown" (the raw language detector doesn't know what the language is) from "low probability and lumped together". Not sure which of those would be und, and what the other one would be. (Maybe this would involve a different structure like { unknown, residual, results }, somewhat similar to #51 (comment).)
  • Change our strategies for omitting low-probability results to something else that avoids this failure mode.

Is there danger of it being truncated from the results?

Not by the implementation/spec, but perhaps you meant by web developer code?

Is there some other meaning ascribed to und?

As you can see from the above, the answer is basically yes: we're ascribing it two meanings, one of which is "the language detector doesn't know the language" and the other is "it was a very low-probability detection result".

@domenic commented Apr 21, 2025

  • Remove the constraint on summing to 1, so that instead of rolling low-probability values into und we can just drop them.

After researching other language detection APIs, I think this might be the right solution. I cannot find any indications that other APIs have imposed this constraint on themselves.

@aphillips

  • Remove the constraint on summing to 1, so that instead of rolling low-probability values into und we can just drop them.

After researching other language detection APIs, I think this might be the right solution. I cannot find any indications that other APIs have imposed this constraint on themselves.

A lot of the examples in this thread already don't sum to 1, and I came to this issue this morning to point out that there should not be a hidden assumption that they sum to 1, only to find you'd gotten there first.

One design that detectors can use is separate parallel state machines, where each language produces a score independent of the others (this is especially helpful when using different detection strategies for different languages, e.g. n-gram for alphabetic scripts but char frequency for CJK, see for example the old chardet code).
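
A toy sketch of that parallel design (the scorers here are deliberately silly stand-ins for real models):

// Each language gets its own independent scorer; nothing forces the scores
// to sum to 1 across languages, and each scorer can effectively say "not me".
const scorers = {
  en: text => /\b(the|and|is)\b/i.test(text) ? 0.8 : 0.02, // n-gram-style cue
  zh: text => /[\u4e00-\u9fff]/.test(text) ? 0.9 : 0.01,   // char-frequency-style cue
};

function detectAll(text) {
  return Object.entries(scorers)
    .map(([language, score]) => ({ language, confidence: score(text) }))
    .sort((a, b) => b.confidence - a.confidence);
}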

Note in the multi-round example that the scores above und (such as for it) don't change or improve as the detector subsumes languages into und. (That example would be better if the score for it were 0.49, since eventually und overtakes it and "wins"--which makes detection useless in cases where surety is split more evenly.)

If everything adds up to 1, then detectors probably should rescore after a group of languages are crossed out.

Is there some other meaning ascribed to und?

As you can see from the above, the answer is basically yes: we're ascribing it two meanings, one of which is "the language detector doesn't know the language" and the other is "it was a very low-probability detection result".

That's what I thought.

Going back to my Adlam detector example, if you had 100 language detectors (but none for an Adlam-script language) and each produced a score of 0.01, you'd have an und of 0.00 when you'd like it to be more like 0.95.

Just to reemphasize:

...nobody builds an und detector

I think und should only mean "the probability that the detector doesn't know the language", since "it was a very low-probability detection result" is already conveyed by the entries in the list. When you see und in the list, the items below it are probably not the language, but having them in the list could potentially tell you something about the text (such as what to do with an inclusion, like the "Obama" in my example).

@nathanmemmott

Another question to add to the discussion: should we ever exclude expectedInputLanguages from the results? For example, if expectedInputLanguages is ['en', 'es'], should await detector.detect('Bonjour!') have 'en' and 'es' in its results even if they're zero?

@domenic commented Apr 22, 2025

Thanks again @aphillips. At this point I've come around to

und should only mean "the probability that the detector doesn't know the language"

instead of our current approach of also including low-probability results. We can do this by relaxing the sum-to-1 constraint, and so we just throw out the low-probability results.

I've drafted a version of this at #52.

Note that currently low-probability is defined as follows (see the sketch after this list):

  • Either below the detector's reported probability of not knowing the language at all; or
  • Contributes <1% to the cumulative probability
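
Combined, the two criteria amount to something like this (my paraphrase, not the draft's normative text; results is sorted descending and unknown is the detector's raw unknown probability):

function filterLowProbability(results, unknown) {
  const kept = [];
  let cumulative = 0;
  for (const { language, confidence } of results) {
    // Drop (rather than roll into "und") anything below the detector's own
    // unknown probability, and anything past 99% cumulative coverage.
    if (confidence < unknown || cumulative >= 0.99) break;
    kept.push({ language, confidence });
    cumulative += confidence;
  }
  // "und" carries only the raw unknown probability, so the confidences no
  // longer need to sum to 1.
  kept.push({ language: 'und', confidence: unknown });
  return kept;
}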

It sounds like from your

When you see und in the list, the items below it are probably not the language, but having them in the list could potentially tell you something about the text (such as what to do with an inclusion, like the "Obama" in my example)

you might not agree with the first criterion here. I agree your example is an interesting one.

We could even go a step further and omit the und entry entirely. This loses a bit of information for web developers, but maybe just steers them into more robust coding patterns. @etiennenoel is scared that web developers will be trying to build und detectors, e.g. to detect computer code or math formulas. Any thoughts on that option?

@domenic commented Apr 22, 2025

Another question to add to the discussion: should we ever exclude expectedInputLanguages from the results? For example, if expectedInputLanguages is ['en', 'es'], should await detector.detect('Bonjour!') have 'en' and 'es' in its results even if they're zero?

IMO it should not. It feels a bit cleaner if expectedInputLanguages only impacts creation time. But, I am open to changing this if we have some developer feedback...
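
For illustration, under that answer (the creation call and result shape here are assumed from this thread's examples, not settled API):

const detector = await LanguageDetector.create({ expectedInputLanguages: ['en', 'es'] });
const results = await detector.detect('Bonjour!');
// 'fr' can still appear with high confidence; 'en' and 'es' are not forced
// into the results merely because they were expected at creation time.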

@aphillips

@domenic noted:

@etiennenoel is scared that web developers will be trying to build und detectors, e.g. to detect computer code or math formulas. Any thoughts on that option?

That's not an und detector. That's a zxx detector--and a very valuable tool a zxx detector is in the detection of language, since non-linguistic material clogs up, confuses, and fuzzes the results of actual language detection, so you'll want to suppress runs of it from what is being processed.

It sounds like from your

When you see und in the list, the items below it are probably not the language, but having them in the list could potentially tell you something about the text (such as what to do with an inclusion, like the "Obama" in my example)

you might not agree with the first criterion here.

Actually, I agree with the first criterion more than the second one. The probabilities emitted by a language detector can be somewhat arbitrary. Is a detection with 2% probability really any more interesting than 1%? Is 3%? 5%? How do these numbers get assigned? If they are at all algorithmic, the more interesting signal is the "not me" signal from each.

@domenic commented Apr 23, 2025

We could even go a step further and omit the und entry entirely. This loses a bit of information for web developers, but maybe just steers them into more robust coding patterns. @etiennenoel is scared that web developers will be trying to build und detectors, e.g. to detect computer code or math formulas. Any thoughts on that option?

@nathanmemmott pointed out in another channel that this could result in returning an empty array. That seems pretty confusing for developers.

So at this point I think #52 (throw away low-probability languages, keep und) is the best path forward.

That's not an und detector. That's a zxx detector--and a very valuable tool a zxx detector is in the detection of language, since non-linguistic material clogs up, confuses, and fuzzes the results of actual language detection, so you'll want to suppress runs of it from what is being processed.

Good distinction! Sadly Google Chrome's current language detection model doesn't have zxx detection support... Maybe we can talk to the team that built it to ask for an improvement in that regard.

I wonder if the specification should discuss this distinction more.
