
encoding/xml: XML CDATA section could be joined together with regular characters #12611

Open
pgundlach opened this issue Sep 14, 2015 · 7 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@pgundlach
Contributor

go version go1.5 darwin/amd64

One thing I stumbled across yesterday (not a real bug, but a minor nuisance from a user's perspective perhaps):

package main

import (
    "encoding/xml"
    "fmt"
    "strings"
)

func main() {
    src := `<root>a<![CDATA[b]]>c</root>`
    r := strings.NewReader(src)

    dec := xml.NewDecoder(r)
    for {
        tok, err := dec.Token()
        if err != nil {
            fmt.Println(err)
            break
        }
        fmt.Printf("%#v\n", tok)
    }
}

gives

xml.StartElement{Name:xml.Name{Space:"", Local:"root"}, Attr:[]xml.Attr{}}
xml.CharData{0x61}
xml.CharData{0x62}
xml.CharData{0x63}
xml.EndElement{Name:xml.Name{Space:"", Local:"root"}}
EOF

I would expect one xml.CharData{} token instead:

xml.StartElement{Name:xml.Name{Space:"", Local:"root"}, Attr:[]xml.Attr{}}
xml.CharData{0x61, 0x62, 0x63}
xml.EndElement{Name:xml.Name{Space:"", Local:"root"}}
EOF

While I understand the source of the three tokens, I would expect one as the user (= me) is unable to distinguish between a CDATA node and a regular text node.

@ianlancetaylor ianlancetaylor changed the title XML CDATA section could be joined together with regular characters encoding/xml: XML CDATA section could be joined together with regular characters Sep 14, 2015
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Sep 14, 2015
@lly-c232733

lly-c232733 commented Dec 22, 2023

This is also a problem if the XML element contains indented children and a CDATA section.

example:

<parent>
<![CDATA[Description of parent]]>
<child ID="1"></child>
<child ID="2"></child>
<child ID="3"></child>
</parent>

",cdata" of parent ends up being:
\nDescription of parent\n\n\n\n

The workaround is to use ",innerxml" and implement custom MarshalXML/UnmarshalXML methods for your data type.

@pgundlach
Contributor Author

This is also a problem if the XML element contains indented children and a CDATA section.
[...]
",cdata" of parent ends up being: \nDescription of parent\n\n\n\n

Two comments from me:

  1. I believe this is expected: the string value of parent is what you write
  2. This is unrelated to the report (joining adjacent CDATA sections)

@lly-c232733

lly-c232733 commented Dec 23, 2023

You said:

I would expect one as the user (= me) is unable to distinguish between a CDATA node and a regular text node.

I agree. However I can't distinguish cdata sections using unmarshal, and I can't distinguish cdata sections using token.

This behavior is not to spec:

When I simply unmarshal a document like this and then marshal it again, the CDATA section picks up a ton of extra newlines (in addition to the regular indentation ones), all wrapped in CDATA. Aka crazy output.

This goes very much against section 2.11 of the spec:
https://www.w3.org/TR/xml/#sec-line-ends

@lly-c232733

Example crazy output using ",cdata":

Input of 'Unmarshal'

<parent>
<![CDATA[Description of parent]]>
<child ID="1"></child>
<child ID="2"></child>
<child ID="3"></child>
</parent>

Output of 'MarshalIndent'

<parent>
<![CDATA[
Description of parent



]]>
<child ID=1></child>
<child ID=2></child>
<child ID=3></child>
</parent>

@pgundlach
Contributor Author

I still believe this is a different issue: related to this one, but not the same. I am not the author of the XML package, so I can't give an authoritative answer. Perhaps you should post code that shows the behaviour in a new bug report? That makes it easier to reproduce the problem. Your “crazy output” seems crazy to me, too. I think you have hit a bug, while my issue is just a nuisance.

@lly-c232733

Fair enough, thanks for your feedback, and have a Merry Christmas

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 13, 2024
@garysferrao

garysferrao commented Apr 25, 2025

i was going to make a new bug report because of a bug with the comment node parsing, but i found this issue still open.
i tried to unmarshal and then marshal an XML string, but the tree is not maintained correctly for comments and CDATA, at least compared with what the XML parser in Mozilla's JavaScript engine produces.

existing bug:

  • CDATA node is joined together with adjacent text nodes

i wanted to add the following bugs:

  • multiple CDATA nodes are joined into one CDATA node
  • multiple comment nodes are joined into one comment node

i believe that maybe these nodes are merged because it's provided as a convenience. (there were other use-cases where it was merged, e.g.: [1].)

upon some more looking at the XML specification, it looks like the Go XML parser behaves according to specification.

an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments.

Definition: All text that is not markup constitutes the character data of the document.

Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.

although, i wonder whether the Go parser would allow getting the actual XML tree, like how the Mozilla Firefox JS engine does:

https://go.dev/play/p/2FnjtqHdKnC

package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName  xml.Name    `xml:"PERSON"`
	Comment1 xml.Comment `xml:",comment"`
	Name     struct {
		XMLName xml.Name `xml:"NAME"`
		CData1  string   `xml:",cdata"`
		Example struct {
			XMLName xml.Name `xml:"EXAMPLE"`
		}
		CData2 string `xml:",cdata"`
	}
	Comment2 xml.Comment `xml:",comment"`
}

func main() {
	var d string = `<PERSON>
	<!-- comment1 -->
	<NAME>
		<![CDATA[John1]]>
		<EXAMPLE />
		<![CDATA[Doe]]>
	</NAME>
	<!-- comment2 -->
</PERSON>`
	fmt.Printf("input XML document: %s\n", d)

	var unmarshalledData Person
	var err error
	err = xml.Unmarshal([]byte(d), &unmarshalledData)
	if err != nil {
		fmt.Println("error unmarshalling XML:", err)
		return
	}
	fmt.Printf("got unmarshalled data: %+v\n", unmarshalledData)

	var output []byte
	output, err = xml.Marshal(unmarshalledData)
	if err != nil {
		fmt.Println("error marshalling XML:", err)
		return
	}
	fmt.Printf("then marshalled the unmarshalled data: %s\n", output)

	var expectedData Person = Person{
		XMLName: xml.Name{
			Local: "PERSON",
		},
		Comment1: xml.Comment(" comment1 "),
		Name: struct {
			XMLName xml.Name `xml:"NAME"`
			CData1  string   `xml:",cdata"`
			Example struct {
				XMLName xml.Name `xml:"EXAMPLE"`
			}
			CData2 string `xml:",cdata"`
		}{
			XMLName: xml.Name{
				Local: "NAME",
			},
			CData1: "John",
			Example: struct {
				XMLName xml.Name `xml:"EXAMPLE"`
			}{
				XMLName: xml.Name{
					Local: "EXAMPLE",
				},
			},
			CData2: "Doe",
		},
		Comment2: xml.Comment(" comment2 "),
	}
	fmt.Printf("expected unmarshalled data (ignore text nodes because not specified in `Person` struct): %+v\n", expectedData)

	output, err = xml.Marshal(expectedData)
	if err != nil {
		fmt.Println("error marshalling XML:", err)
		return
	}
	fmt.Printf("then marshalled the expected unmarshalled data: %s\n", output)
}

// Output:
// input XML document: <PERSON>
// 	<!-- comment1 -->
// 	<NAME>
// 		<![CDATA[John1]]>
// 		<EXAMPLE />
// 		<![CDATA[Doe]]>
// 	</NAME>
// 	<!-- comment2 -->
// </PERSON>
// got unmarshalled data: {XMLName:{Space: Local:PERSON} Comment1:[32 99 111 109 109 101 110 116 49 32 32 99 111 109 109 101 110 116 50 32] Name:{XMLName:{Space: Local:NAME} CData1:
// 		John1
//
// 		Doe
// 	 Example:{XMLName:{Space: Local:EXAMPLE}} CData2:} Comment2:[]}
// then marshalled the unmarshalled data: <PERSON><!-- comment1  comment2 --><NAME><![CDATA[
// 		John1
//
// 		Doe
// 	]]><EXAMPLE></EXAMPLE></NAME></PERSON>
// expected unmarshalled data (ignore text nodes because not specified in `Person` struct): {XMLName:{Space: Local:PERSON} Comment1:[32 99 111 109 109 101 110 116 49 32] Name:{XMLName:{Space: Local:NAME} CData1:John Example:{XMLName:{Space: Local:EXAMPLE}} CData2:Doe} Comment2:[32 99 111 109 109 101 110 116 50 32]}
// then marshalled the expected unmarshalled data: <PERSON><!-- comment1 --><NAME><![CDATA[John]]><EXAMPLE></EXAMPLE><![CDATA[Doe]]></NAME><!-- comment2 --></PERSON>

here's what the JavaScript engine of Mozilla Firefox 137 gave me:

const xmlStr = `<PERSON>
	<!-- comment1 -->
	<NAME>
		<![CDATA[John1]]>
		<EXAMPLE />
		<![CDATA[Doe]]>
	</NAME>
	<!-- comment2 -->
</PERSON>` 
const parser = new DOMParser();
const doc = parser.parseFromString(xmlStr, "application/xml");
// print the name of the root element or error message
const errorNode = doc.querySelector("parsererror");
if (errorNode) {
	console.error("error while parsing", errorNode);
} else {
	console.log(doc.childNodes[0].childNodes);
	console.log(doc.childNodes[0].childNodes[3].childNodes);
}
// Output:
// NodeList(7) [ #text, <!--  comment1  -->, #text, NAME, #text, <!--  comment2  -->, #text ]
// NodeList(7) [ #text, CDATASection, #text, EXAMPLE, #text, CDATASection, #text ]

5 participants