encoding/xml: XML CDATA section could be joined together with regular characters #12611

pgundlach · 2015-09-14T11:14:15Z

go version go1.5 darwin/amd64

One thing I stumbled across yesterday (not a real bug, but a minor nuisance from a user's perspective perhaps):

package main

import (
    "encoding/xml"
    "fmt"
    "strings"
)

func main() {
    src := `<root>a<![CDATA[b]]>c</root>`
    r := strings.NewReader(src)

    dec := xml.NewDecoder(r)
    for {
        tok, err := dec.Token()
        if err != nil {
            fmt.Println(err)
            break
        }
        fmt.Printf("%#v\n", tok)
    }
}

gives

xml.StartElement{Name:xml.Name{Space:"", Local:"root"}, Attr:[]xml.Attr{}}
xml.CharData{0x61}
xml.CharData{0x62}
xml.CharData{0x63}
xml.EndElement{Name:xml.Name{Space:"", Local:"root"}}
EOF

I would expect one xml.CharData{} token instead:

xml.StartElement{Name:xml.Name{Space:"", Local:"root"}, Attr:[]xml.Attr{}}
xml.CharData{0x61, 0x62, 0x63}
xml.EndElement{Name:xml.Name{Space:"", Local:"root"}}
EOF

While I understand the source of the three tokens, I would expect one as the user (= me) is unable to distinguish between a CDATA node and a regular text node.


lly-c232733 · 2023-12-22T20:24:45Z

This is also a problem if the XML element contains indented children and a CDATA section.

example:

<parent>
<![CDATA[Description of parent]]>
<child ID="1"></child>
<child ID="2"></child>
<child ID="3"></child>
</parent>

",cdata" of parent ends up being:
\nDescription of parent\n\n\n\n

The workaround is to use ",innerxml" and implement custom MarshalXML/UnmarshalXML methods for your data type.

pgundlach · 2023-12-23T09:58:42Z

This is also a problem if the XML element contains indented children and a CDATA section.
[...]
",cdata" of parent ends up being: \nDescription of parent\n\n\n\n

Two comments from me:

1. I believe this is expected: the string value of parent is what you write.
2. This is unrelated to the report (joining adjacent CDATA sections).

lly-c232733 · 2023-12-23T15:05:10Z

You said:

I would expect one as the user (= me) is unable to distinguish between a CDATA node and a regular text node.

I agree. However I can't distinguish cdata sections using unmarshal, and I can't distinguish cdata sections using token.
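
One possible way to tell them apart at the token level is to compare each token with the raw input behind it. This is only a sketch: `cdataFlags` is a hypothetical helper, and it assumes the caller still holds the exact input string so that `Decoder.InputOffset` values can be used as indexes into it.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// cdataFlags is a hypothetical helper. For every CharData token in src it
// reports whether the raw input behind that token starts with "<![CDATA[".
func cdataFlags(src string) ([]bool, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var flags []bool
	for {
		start := dec.InputOffset() // end of the previous token
		tok, err := dec.Token()
		if err == io.EOF {
			return flags, nil
		}
		if err != nil {
			return nil, err
		}
		if _, ok := tok.(xml.CharData); ok {
			raw := src[start:dec.InputOffset()]
			flags = append(flags, strings.HasPrefix(raw, "<![CDATA["))
		}
	}
}

func main() {
	flags, err := cdataFlags(`<root>a<![CDATA[b]]>c</root>`)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(flags)
}
```

This does not help with Unmarshal, which never exposes the individual tokens to the struct being filled.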

This behavior is not to spec:

When I simply write back what I read, using Unmarshal and then Marshal, these kinds of CDATA sections pick up a ton of extra newlines (in addition to the regular indentation ones), all wrapped in CDATA. Aka crazy output.

Very much against section 2.11 of the spec:
https://www.w3.org/TR/xml/#sec-line-ends

lly-c232733 · 2023-12-23T15:34:40Z

Example crazy output using ",cdata":

Input of 'Unmarshal'

<parent>
<![CDATA[Description of parent]]>
<child ID="1"></child>
<child ID="2"></child>
<child ID="3"></child>
</parent>

Output of 'MarshalIndent'

<parent>
<![CDATA[
Description of parent



]]>
<child ID="1"></child>
<child ID="2"></child>
<child ID="3"></child>
</parent>

pgundlach · 2023-12-23T16:13:16Z

I still believe this is a different issue, related to but not the same as this one. I am not the author of the XML package, so I can't give an authoritative answer. Perhaps you should post code in a new bug report which shows the behaviour? That makes it easier to reproduce the problem. Your “crazy output” seems crazy to me, too. I think you have hit a bug, while my issue is just a nuisance.

lly-c232733 · 2023-12-23T16:15:47Z

Fair enough, thanks for your feedback, and have a Merry Christmas

garysferrao · 2025-04-25T14:18:28Z

i was going to make a new bug report because of a bug with the comment node parsing, but i found this issue still open.
i tried to marshal and unmarshal an XML string, but the tree is not maintained correctly for comments and CDATA, at least compared with the Mozilla JavaScript implementation of the parser.

existing bug:

CDATA node is joined together with adjacent text nodes

i wanted to add the following ~~bugs~~:

multiple CDATA nodes are joined into one CDATA node
multiple comment nodes are joined into one comment node

~~i believe that maybe these nodes are merged because it's provided as a convenience.~~ (there were other use-cases where it was merged, e.g.: [1].)

upon some more looking at the XML specification, it looks like the Go XML parser behaves according to specification.

https://www.w3.org/TR/REC-xml/#sec-comments

an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments.

https://www.w3.org/TR/REC-xml/#syntax

Definition: All text that is not markup constitutes the character data of the document.

https://www.w3.org/TR/REC-xml/#dt-cdsection

Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.

although, i wonder whether the Go parser would allow getting the actual XML tree, like how the Mozilla Firefox JS engine does:

https://go.dev/play/p/2FnjtqHdKnC

package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName  xml.Name    `xml:"PERSON"`
	Comment1 xml.Comment `xml:",comment"`
	Name     struct {
		XMLName xml.Name `xml:"NAME"`
		CData1  string   `xml:",cdata"`
		Example struct {
			XMLName xml.Name `xml:"EXAMPLE"`
		}
		CData2 string `xml:",cdata"`
	}
	Comment2 xml.Comment `xml:",comment"`
}

func main() {
	var d string = `<PERSON>
	<!-- comment1 -->
	<NAME>
		<![CDATA[John1]]>
		<EXAMPLE />
		<![CDATA[Doe]]>
	</NAME>
	<!-- comment2 -->
</PERSON>`
	fmt.Printf("input XML document: %s\n", d)

	var unmarshalledData Person
	var err error
	err = xml.Unmarshal([]byte(d), &unmarshalledData)
	if err != nil {
		fmt.Println("error unmarshalling XML:", err)
		return
	}
	fmt.Printf("got unmarshalled data: %+v\n", unmarshalledData)

	var output []byte
	output, err = xml.Marshal(unmarshalledData)
	if err != nil {
		fmt.Println("error marshalling XML:", err)
		return
	}
	fmt.Printf("then marshalled the unmarshalled data: %s\n", output)

	var expectedData Person = Person{
		XMLName: xml.Name{
			Local: "PERSON",
		},
		Comment1: xml.Comment(" comment1 "),
		Name: struct {
			XMLName xml.Name `xml:"NAME"`
			CData1  string   `xml:",cdata"`
			Example struct {
				XMLName xml.Name `xml:"EXAMPLE"`
			}
			CData2 string `xml:",cdata"`
		}{
			XMLName: xml.Name{
				Local: "NAME",
			},
			CData1: "John",
			Example: struct {
				XMLName xml.Name `xml:"EXAMPLE"`
			}{
				XMLName: xml.Name{
					Local: "EXAMPLE",
				},
			},
			CData2: "Doe",
		},
		Comment2: xml.Comment(" comment2 "),
	}
	fmt.Printf("expected unmarshalled data (ignore text nodes because not specified in `Person` struct): %+v\n", expectedData)

	output, err = xml.Marshal(expectedData)
	if err != nil {
		fmt.Println("error marshalling XML:", err)
		return
	}
	fmt.Printf("then marshalled the expected unmarshalled data: %s\n", output)
}

// Output:
// input XML document: <PERSON>
// 	<!-- comment1 -->
// 	<NAME>
// 		<![CDATA[John1]]>
// 		<EXAMPLE />
// 		<![CDATA[Doe]]>
// 	</NAME>
// 	<!-- comment2 -->
// </PERSON>
// got unmarshalled data: {XMLName:{Space: Local:PERSON} Comment1:[32 99 111 109 109 101 110 116 49 32 32 99 111 109 109 101 110 116 50 32] Name:{XMLName:{Space: Local:NAME} CData1:
// 		John1
//
// 		Doe
// 	 Example:{XMLName:{Space: Local:EXAMPLE}} CData2:} Comment2:[]}
// then marshalled the unmarshalled data: <PERSON><!-- comment1  comment2 --><NAME><![CDATA[
// 		John1
//
// 		Doe
// 	]]><EXAMPLE></EXAMPLE></NAME></PERSON>
// expected unmarshalled data (ignore text nodes because not specified in `Person` struct): {XMLName:{Space: Local:PERSON} Comment1:[32 99 111 109 109 101 110 116 49 32] Name:{XMLName:{Space: Local:NAME} CData1:John Example:{XMLName:{Space: Local:EXAMPLE}} CData2:Doe} Comment2:[32 99 111 109 109 101 110 116 50 32]}
// then marshalled the expected unmarshalled data: <PERSON><!-- comment1 --><NAME><![CDATA[John]]><EXAMPLE></EXAMPLE><![CDATA[Doe]]></NAME><!-- comment2 --></PERSON>

here's what the JavaScript engine of Mozilla Firefox 137 gave me:

const xmlStr = `<PERSON>
	<!-- comment1 -->
	<NAME>
		<![CDATA[John1]]>
		<EXAMPLE />
		<![CDATA[Doe]]>
	</NAME>
	<!-- comment2 -->
</PERSON>` 
const parser = new DOMParser();
const doc = parser.parseFromString(xmlStr, "application/xml");
// print the name of the root element or error message
const errorNode = doc.querySelector("parsererror");
if (errorNode) {
	console.error("error while parsing", errorNode);
} else {
	console.log(doc.childNodes[0].childNodes);
	console.log(doc.childNodes[0].childNodes[3].childNodes);
}
// Output:
// NodeList(7) [ #text, <!--  comment1  -->, #text, NAME, #text, <!--  comment2  -->, #text ]
// NodeList(7) [ #text, CDATASection, #text, EXAMPLE, #text, CDATASection, #text ]
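
For comparison, a similar child list can be read from the Go token stream rather than from Unmarshal. This is a rough sketch: `childKinds` is a hypothetical helper and the "#text"/"#comment" labels are chosen only to mirror the NodeList output above. CDATA sections still show up as plain "#text" here, since the decoder reports them as CharData.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// childKinds is a hypothetical helper. It returns one label per direct
// child node of the document's root element, in document order.
func childKinds(src string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var kinds []string
	depth := 0 // number of currently open elements
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth == 1 {
				kinds = append(kinds, t.Name.Local)
			}
			depth++
		case xml.EndElement:
			depth--
		case xml.CharData:
			if depth == 1 {
				kinds = append(kinds, "#text")
			}
		case xml.Comment:
			if depth == 1 {
				kinds = append(kinds, "#comment")
			}
		}
	}
}

func main() {
	const doc = "<PERSON>\n\t<!-- comment1 -->\n\t<NAME>\n\t\t<![CDATA[John1]]>\n\t\t<EXAMPLE />\n\t\t<![CDATA[Doe]]>\n\t</NAME>\n\t<!-- comment2 -->\n</PERSON>"
	kinds, err := childKinds(doc)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(kinds), kinds)
}
```

At this level the two comments stay separate nodes; the joining described above happens when Unmarshal collects them into a single ",comment" field.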

ianlancetaylor changed the title ~~XML CDATA section could be joined together with regular characters~~ encoding/xml: XML CDATA section could be joined together with regular characters Sep 14, 2015

ianlancetaylor added this to the Unplanned milestone Sep 14, 2015

seankhliao added the NeedsInvestigation label Jul 13, 2024
