Using gpt-4.1 with structured outputs, I occasionally (about 1 in 10 requests) get long runs of repeated \n\n\n\n added to my structured output. The structured output BEFORE the \n starts is valid, but a huge amount of \n is appended (around 100k characters!), neatly ordered in groups.
In some integrations I've seen it output the structured output multiple times, with or without differing data, as a way to (undesirably) provide JSONL instead of using a simple array as intended. So, the main issue here appears to be that the model is allowed to continue outputting after it has completed its object, and being constrained to valid JSON symbols isn't enough to prevent these issues.
As a workaround, the Chat Completions API has parameters for presence and frequency penalties. Using either may encourage the model to terminate its output when it's finished.
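A minimal sketch of that workaround with the openai Python SDK; the schema, prompt, and penalty values below are illustrative assumptions, not tuned recommendations:

from openai import OpenAI

client = OpenAI()

summary_schema = {  # hypothetical schema, just for illustration
    "name": "summary",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {"summary": {"type": "string"}},
        "required": ["summary"],
        "additionalProperties": False,
    },
}

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Summarize the report as JSON."}],
    response_format={"type": "json_schema", "json_schema": summary_schema},
    presence_penalty=0.5,   # discourage tokens that have already appeared at all
    frequency_penalty=0.5,  # penalize tokens in proportion to how often they repeat
)
print(completion.choices[0].message.content)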
We're aware that GPT-4.1 (and occasionally GPT-4o) can add a long string of blank lines or extra array items after the correct JSON when you stream a structured-output response. Engineering is on it. Until the fix ships, you can avoid most cases by turning off streaming or by parsing only the first complete JSON object and discarding anything that follows.
Thanks for the note - much appreciated.
I'm not (ever) using streaming, so it is not related to streaming.
It also takes a very long time for the OpenAI API to come back with the response in those cases, and the responses are 100k bytes long (filling all the way up to the max 32k tokens I defined). I guess I can add some code in the exception handler to retry. It's not that trivial since the sequences are not always the same - I guess I'll ask Codex.
*** humbled *** Actually it WAS easy to work around, so I will post the problem and the solution here for anyone who runs into this problem (in Python) and needs a workaround.
Turns out (thank you, o4-mini-high) that there is also just raw_decode and you do:
import json
decoder = json.JSONDecoder()
parsed, end = decoder.raw_decode(self.response)  # `end` is the index where the valid JSON stops; everything after it is the garbage
HOWEVER this of course ONLY works when proper JSON is returned in the first place. Which is not always the case. So I am still VERY MUCH waiting on a fix for this.
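Since raw_decode only salvages the leading object when that prefix actually parses, a slightly more defensive pattern is to retry the request when even the prefix is broken. A sketch only; fetch_response is a hypothetical callable that performs the API call and returns the raw response text:

import json

def parse_or_retry(fetch_response, max_attempts=3):
    decoder = json.JSONDecoder()
    last_error = None
    for _ in range(max_attempts):
        text = fetch_response()
        try:
            obj, _end = decoder.raw_decode(text.lstrip())
            return obj  # valid prefix found; trailing \n/\t garbage is ignored
        except json.JSONDecodeError as exc:
            last_error = exc  # even the prefix was invalid; try the request again
    raise last_error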
We do have a repeatable (goes wrong every time) query BTW if you're interested.
There is still the issue of being charged for all those redundant \n and \t\t output tokens being generated, correct?
Also, when using the Chat Completions API, would setting the frequency_penalty to some positive number (the default is 0) help address the cost problem by truncating the response early, since in theory every new \n or \t\t would be penalized? What do you think?
The alternative for the Responses API could perhaps be setting the max_output_tokens value to a conservative number that's well below the maximum limit for the model you're using, depending on the use case. Maybe that could work as well?
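Something like this sketch against the Responses API, assuming the json_schema text-format shape; the 512-token cap and the schema are illustrative assumptions sized for a small output:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Summarize the report as JSON.",
    max_output_tokens=512,  # well below the model maximum, just above the longest valid answer
    text={
        "format": {
            "type": "json_schema",
            "name": "summary",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
                "additionalProperties": False,
            },
        }
    },
)
print(response.output_text)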
Yes, it is not "THE" solution, since even with the extras removed the JSON is not always valid. Also, the bad queries take forever to produce. But this "fix" makes the problem about 50% less problematic by my estimate.
This should be counterable with logit_bias on Chat Completions. You could find all the tab combos that OpenAI trained the AI models on as token numbers (and NOBODY wants tab-indented multi-line JSON...), and harshly demote them - making only single-line JSON possible, without whitespace.
This in-string tuning could be done at the same time OpenAI is enforcing a context-free grammar and then releasing the AI model into a string where it can write these bad characters. Tabs are possible in a JSON string, but highly unlikely to be desired in any use case, as JSON itself is the data structure, not table data in a string.
Then, after coming up with a long list of things the AI keeps trying to write (JSON structure, but within the JSON data) and killing them off in regular interactions, and trying json_object mode, try it on your over-specified non-strict (non-enforced) JSON schema...
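A rough sketch of that logit_bias idea; the o200k_base encoding and the list of candidate whitespace strings are assumptions, and note the caveat just below that logit_bias may be broken when combined with temperature or top_p:

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for gpt-4.1-class models

bias = {}
for s in ["\n", "\n\n", "\t", "\t\t", "\n\t", " \n"]:  # guess at common whitespace-run strings
    for tok in enc.encode(s):
        bias[str(tok)] = -100  # harshly demote

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Return the answer as compact single-line JSON."}],
    response_format={"type": "json_object"},
    logit_bias=bias,
)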
Unfortunately, OpenAI also messed up the way that logit_bias is supposed to work. It is completely broken and without effect if you use either temperature or top_p.
Then they messed with logprobs and with delivering them for examination of the precise production within functions or structured output, leaving you to infer token numbers and token strings yourself.
Even being able to promote special tokens (so the model is more likely to finish instead of going into a loop of multiple outputs) is blocked.
Doesn't matter, Responses is completely feature-less. You can't even add a crapload of tabs as a stop sequence.
So: bad models, broken API parameters violating the API reference, and then... a bad endpoint, "Responses", completely blocking any such self-service.
@OpenAI_Support I also observe this very frequently when using structured outputs with both o3 and o4-mini. These tabs and newlines get added for 5-10 minutes before resolving.
Might the fix you are talking about also help resolve things with these reasoning models (rather than just GPT-4.1 as you mentioned)?
This topic is about the "Responses" API endpoint.
stop is not a supported parameter. As I just said above:
What would the API be validating? It is more likely you get a 500 server error from the JSON not finishing while under context-free grammar enforcement, whereas with a stream, you at least receive something you can deal with - and see that the issue is all the escaped tabs and newlines being written.
With a stream, you can close() instead of letting the model run up a 16k token bill.
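For example, a rough sketch on Chat Completions (where stream=True returns a closeable stream object) of bailing out once the tail of the buffer is nothing but whitespace; the 50-character threshold is an arbitrary assumption:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Return the answer as JSON."}],
    response_format={"type": "json_object"},
    stream=True,
)

buffer = ""
for chunk in stream:
    if not chunk.choices:
        continue
    buffer += chunk.choices[0].delta.content or ""
    # if the buffer now ends in a long run of pure whitespace, stop paying for it
    if len(buffer) - len(buffer.rstrip()) > 50:
        stream.close()
        break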
The simple fact is: the API should be emitting a special token of ChatML format to end the output, the context-free grammar must never block the stop token, and that should be a built-in stop sequence that is always caught by the API. One of those is not happening and must be fixed.
Provide real logprobs without the special tokens stripped out, plus a logit_bias for Responses that also works when sampling parameters are included (which is currently broken on Chat Completions) and that takes special token numbers; then a developer can actually see a misbehaving model or a generation continuing past where termination was predicted - or not.
internal prompt: <|im_start|>assistant (token 200006)
generation: <|im_sep|>Sure, I can write JSON.<|im_end|> (tokens 200008, 200007)
or generation: to=image_gen.text2im<|im_sep|>\n{"prompt"...
stop: [200007]
(token strings are not decoded)
Now you have nothing proprietary left to hide and can turn on a good version of logprobs that starts from token 0 and doesn't shut off.
Thanks for the response.
Since I'm using the Chat Completions API, I can try to add a stop sequence.
But turning off streaming is not an option for me. And I'm not sure if it's possible to trim the response when I make an API call using this:
async with self.async_openai_client.beta.chat.completions.stream(
    model=self._llm.model,
    temperature=self._llm.temperature,
    response_format=output_model_cls,
    messages=messages,
    stop=["}\n"],
) as stream:
    async for event in stream:
        if (
            event.type == "content.delta"
            and event.delta is not None
        ):
            # accumulate event.delta into the partial output here
            ...
When the issue occurs, it hangs for at least 5 minutes until the exception openai.LengthFinishReasonError is raised, and therefore I cannot even access the (partial?) output at all. "strict": true is always set when I use response_format with json_schema, by the way.
Hi @OpenAI_Support
Adding stop sequence ["}\n"] does not work for me.
Would appreciate if there is an actual fix soon as the issue affects our customers.
You can send the parameter max_tokens so that the termination of sequence generation occurs earlier than simply the model's maximum output or maximum context window. Budget just beyond the maximum length a valid output would ever produce.
For a stop sequence on Chat Completions, the suggestion by OpenAI staff is also bad, as a stop sequence is stripped from the output besides terminating. Removing ["}\n"] from the tail would break the JSON and leave it unclosed. There is no separate âstopâ finish_reason for your own stop sequence vs the internal one.
Instead, you can set "stop" to ["\n\n\n", "\t\t"] to catch the start of a linefeed or tab loop - whitespace that should never be in AI-written JSON.
You can break any repetitive pattern by increasing frequency_penalty. Every token produced demotes that same token in the future. This also can change the production from a sequence to something else, hopefully something that terminates output.
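Putting those suggestions together on Chat Completions might look like this sketch; the token budget, stop strings, and penalty value are illustrative, not tuned:

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Return the answer as JSON."}],
    response_format={"type": "json_object"},
    max_tokens=1024,                # budget just beyond the longest valid output
    stop=["\n\n\n", "\t\t"],        # catch the start of a linefeed or tab loop
    frequency_penalty=0.3,          # demote tokens the more often they repeat
)
print(completion.choices[0].message.content)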
The SDK library's event-producing beta method hasn't been maintained. You can code from scratch with httpx, making your JSON REST requests yourself; if it fails, it is because of you and your issue-handling, not because of a library you can do nothing about.
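A bare-bones sketch of that, calling the Chat Completions REST endpoint directly with httpx; the model, payload, and timeout are placeholders:

import os
import httpx

payload = {
    "model": "gpt-4.1",
    "messages": [{"role": "user", "content": "Return the answer as JSON."}],
    "response_format": {"type": "json_object"},
    "max_tokens": 1024,
}
headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

with httpx.Client(timeout=120.0) as http:
    r = http.post("https://api.openai.com/v1/chat/completions", json=payload, headers=headers)
    r.raise_for_status()
    content = r.json()["choices"][0]["message"]["content"]
print(content)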