Python: Port and extend XXE modeling #6112

jorgectf · 2021-06-19T17:09:52Z

This PR introduces the modeling of the following XML parsing-related libraries and specific methods:

XML Parsers:
  xml.etree.ElementTree.XMLParser() - no longer expands entities
  lxml.etree.XMLParser() - no_network=True huge_tree=False resolve_entities=True
  lxml.etree.get_default_parser() - takes no options; uses the defaults listed above
  xml.sax.make_parser() - parser.setFeature(xml.sax.handler.feature_external_ges, True)

XML Parsing:
  string:
    xml.etree.ElementTree.fromstring(list)
    xml.etree.ElementTree.XML
    lxml.etree.fromstring(list)
    lxml.etree.XML
    xmltodict.parse - disable_entities=True

  file StringIO(), BytesIO(b):
    xml.etree.ElementTree.parse
    lxml.etree.parse
    xml.dom.(mini|pull)dom.parse(String)

jorgectf · 2021-06-19T22:41:45Z

I have thought about directly changing the current code, but I will be writing into experimental because of a dilemma that has just come up.

codeql/python/ql/src/semmle/python/Concepts.qll

Lines 112 to 140 in 26a04d6

    
    /**
     * A data-flow node that decodes data from a binary or textual format. This
     * is intended to include deserialization, unmarshalling, decoding, unpickling,
     * decompressing, decrypting, parsing etc.
     *
     * A decoding (automatically) preserves taint from input to output. However, it can
     * also be a problem in itself, for example if it allows code execution or could result
     * in denial-of-service.
     *
     * Extend this class to refine existing API models. If you want to model new APIs,
     * extend `Decoding::Range` instead.
     */
    class Decoding extends DataFlow::Node {
      Decoding::Range range;

      Decoding() { this = range }

      /** Holds if this call may execute code embedded in its input. */
      predicate mayExecuteInput() { range.mayExecuteInput() }

      /** Gets an input that is decoded by this function. */
      DataFlow::Node getAnInput() { result = range.getAnInput() }

      /** Gets the output that contains the decoded data produced by this function. */
      DataFlow::Node getOutput() { result = range.getOutput() }

      /** Gets an identifier for the format this function decodes from, such as "JSON". */
      string getFormat() { result = range.getFormat() }
    }

Should we treat XXE as a deserialization? If so, according to Concepts.qll (L114), the only way to look for sinks in taint configs is mayExecuteInput() (L130). However, an XXE won't execute code/commands (unless PHP's expect wrapper is loaded) but can be dangerous (SSRF, DoS).

Taking into account that changing mayExecuteInput() to mayBeDangerous() is a bit ambiguous, I guess I will create an XXE Concept for now and leave the issue for the pros 😎.
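
To illustrate the dilemma outside QL, here is a rough Python analogy (not CodeQL code, and not how the library is implemented). `Decoding` below has a single boolean flag, much like `mayExecuteInput()`, while the hypothetical `XmlParsing` class names each kind of weakness separately. All class names, field names, and instances are assumptions invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoding:
    """Hypothetical stand-in for a concept with one boolean danger flag."""
    fmt: str
    may_execute_input: bool


@dataclass(frozen=True)
class XmlParsing:
    """Hypothetical XML-specific concept that records named weakness kinds."""
    api: str
    vulnerable_kinds: frozenset = frozenset()

    def vulnerable_to(self, kind):
        """Return True if this parsing call is modeled as exposed to `kind`."""
        return kind in self.vulnerable_kinds


# pickle can execute code embedded in its input, so the boolean fits it well.
pickle_load = Decoding(fmt="pickle", may_execute_input=True)

# An XXE-prone call does not execute code, so the boolean would stay False
# and the sink would be missed; a named kind can still express the problem.
xml_as_decoding = Decoding(fmt="XML", may_execute_input=False)
xml_as_parsing = XmlParsing(api="some_xml_parse_call", vulnerable_kinds=frozenset({"XXE"}))
```

With only the boolean available, a taint configuration has nothing to select the XML call by, which is the motivation for a separate concept.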

jorgectf · 2021-07-24T00:34:46Z

This query is ready for code review 😃


RasmusWL

As we discussed privately, instead of typing out lengthy replies, I would simply do the changed modeling myself... that is ready now in jorgectf#9.

That PR I made IS quite a mouthful. I think there are interesting things for you to learn from it, but I can also understand that it could take some time for you to process. If you have not had time to review it within 1 week, I think I will just merge this PR and apply these commits on top, so we can get your good work closer to being part of the default query suite (unless you object to this 1 week).

RasmusWL · 2022-03-03T13:59:50Z

python/ql/test/experimental/query-tests/Security/CWE-611/xml_etree.py

+@app.route("/xml_etree_fromstring-lxml_etree_XMLParser")
+def xml_parser_2():
+    xml_content = request.args['xml_content']
+
+    parser = lxml.etree.XMLParser()
+    return xml.etree.ElementTree.fromstring(xml_content, parser=parser).text
+
+@app.route("/xml_etree_fromstring-lxml_get_default_parser")
+def xml_parser_3():
+    xml_content = request.args['xml_content']
+
+    parser = lxml.etree.get_default_parser()
+    return xml.etree.ElementTree.fromstring(xml_content, parser=parser).text
+
+@app.route("/xml_etree_fromstring-lxml_get_default_parser")
+def xml_parser_4():
+    xml_content = request.args['xml_content']
+
+    parser = xml.sax.make_parser()
+    parser.setFeature(xml.sax.handler.feature_external_ges, True)
+    return xml.etree.ElementTree.fromstring(xml_content, parser=parser).text


Have you seen anyone use xml.etree with a parser from a different package in any real code? If not, I would consider this use case a bit too obscure, and not have any tests for it.

Not really, I just tried all parsing modules with all parsers and noted which worked. Using a different parser than the one from the package being used doesn't really make sense, I'm fine removing this use case :)

RasmusWL · 2022-03-03T14:01:02Z

python/ql/src/experimental/semmle/python/frameworks/Xml.qll

+    override DataFlow::Node getAnInput() { none() }
+
+    override predicate vulnerable(string kind) {
+      kind = "XXE" and not this.getArgByName("resolve_entities").asExpr() = any(False f)


I see that you already re-wrote this, but consider:

lxml.etree.XMLParser(resolve_entities=True)

The first predicate will hold for such a call (meaning we treat it as vulnerable to XXE), but since the resolve_entities keyword argument is present, the second predicate will not hold for such a call (meaning we treat it as safe for XXE). So there is a subtle difference. (and having tests for all such cases really helps to get such things right 😉)

    predicate works() {
      not this.getArgByName("resolve_entities").getALocalSource().asExpr() = any(False f)
    }

    predicate doesNotWork() {
      not (
        exists(this.getArgByName("resolve_entities"))
        or
        this.getArgByName("resolve_entities").asExpr() = any(False f)
      )
    }

I ended up writing this as

    (
      // resolve_entities has default True
      not exists(this.getArgByName("resolve_entities"))
      or
      this.getArgByName("resolve_entities").getALocalSource().asExpr() = any(True f)
    )
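
The same distinction can be sketched in plain Python with the `ast` module. This is only an analogy to the QL above, not the actual modeling: it looks at literal keyword values and does no local-flow tracking. The function names are hypothetical.

```python
import ast


def _resolve_entities_kw(call_source):
    """Return the AST node passed as resolve_entities=..., or None if absent."""
    call = ast.parse(call_source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single call expression")
    for kw in call.keywords:
        if kw.arg == "resolve_entities":
            return kw.value
    return None


def _is_literal(node, value):
    """True if `node` is the literal constant `value` (True or False)."""
    return isinstance(node, ast.Constant) and node.value is value


def final_condition(call_source):
    """Keyword absent (default True) or explicitly the literal True."""
    kw = _resolve_entities_kw(call_source)
    return kw is None or _is_literal(kw, True)


def does_not_work(call_source):
    """Mirror of doesNotWork(): not (keyword exists or keyword is False)."""
    kw = _resolve_entities_kw(call_source)
    return not (kw is not None or _is_literal(kw, False))
```

For `lxml.etree.XMLParser(resolve_entities=True)`, `final_condition` returns `True` while `does_not_work` returns `False`, because the latter rejects any call where the keyword is present at all.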

RasmusWL · 2022-03-03T14:01:14Z

python/ql/src/experimental/semmle/python/frameworks/Xml.qll

+    predicate vulnerable(DataFlow::Node n, string kind) {
+      exists(API::Node handler, API::Node feature |
+        handler = API::moduleImport("xml").getMember("sax").getMember("handler") and
+        DataFlow::exprNode(trackSaxFeature(this, feature).asExpr())
+            .(DataFlow::LocalSourceNode)
+            .flowsTo(n)
+      |
+        kind = ["XXE", "DTD retrieval"] and
+        feature = handler.getMember("feature_external_ges")
+      )
+    }


I've rewritten this

RasmusWL · 2022-03-03T14:01:26Z

python/ql/src/experimental/semmle/python/frameworks/Xml.qll

+      exists(DataFlow::MethodCallNode parse, API::Node handler, API::Node feature |
+        handler = API::moduleImport("xml").getMember("sax").getMember("handler") and
+        parse.calls(trackSaxFeature(this, feature), "parse") and
+        parse.getArg(0) = this.getAnInput() // enough to avoid FPs?


I've rewritten this

RasmusWL · 2022-03-03T14:02:04Z

python/ql/src/experimental/semmle/python/security/dataflow/XmlInjection.qll

+  predicate xmlInjectionVulnerable(DataFlow::PathNode source, DataFlow::PathNode sink, string kind) {
+    xmlInjection(source, sink) and
+    (
+      xmlParsingInputAsVulnerableSink(sink.getNode(), kind) or
+      xmlParserInputAsVulnerableSink(sink.getNode(), kind)
+    )
+  }


I have written a solution for this

RasmusWL · 2022-03-03T18:37:17Z

python/ql/src/experimental/semmle/python/frameworks/Xml.qll

+   * * `getAnInput()`'s result would be `foo`.
+   * * `vulnerable(kind)`'s `kind` would be `Billion Laughs` and `Quadratic Blowup`.
+   */
+  private class XMLRPCServer extends DataFlow::CallCfgNode, XML::XMLParser::Range {


I ended up writing a separate query for this.
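
As background on the `Billion Laughs` kind mentioned in the QLdoc above, the sketch below builds a deliberately tiny nested-entity document and parses it with the standard library, to show how a short input expands into a longer text node. The function name and parameters are hypothetical, and the sizes are kept small on purpose; this is an illustration of the expansion mechanism, not part of the query or its tests.

```python
import xml.etree.ElementTree as ET


def nested_entity_document(levels, fanout):
    """Build a document whose entity e<levels> expands to fanout**levels copies of "ha"."""
    decls = ['<!ENTITY e0 "ha">']
    for i in range(1, levels + 1):
        # Each level references the previous level `fanout` times.
        refs = f"&e{i - 1};" * fanout
        decls.append(f'<!ENTITY e{i} "{refs}">')
    return f'<!DOCTYPE r [{"".join(decls)}]><r>&e{levels};</r>'


# Two levels with a fan-out of three: 3 ** 2 = 9 copies of "ha".
doc = nested_entity_document(levels=2, fanout=3)
expanded = ET.fromstring(doc).text
```

The expanded length grows exponentially with the number of levels, while the document itself grows only linearly, which is what makes the real attack a denial-of-service concern.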


RasmusWL · 2022-03-04T16:24:20Z

Cheers 👍 I think this should be good to go now, but will need tests to be ✔️ first. Will probably merge it by Monday 👍


jorgectf · 2022-03-14T11:41:52Z

This PR has been an amazing ride @RasmusWL, thank you!

Empty commit

0e61558

github-actions bot added the Python label Jun 19, 2021

jorgectf changed the title ~~Python: Port and extend unsafe deserialization modeling~~ Python: Port and extend XXE modeling Jun 19, 2021

jorgectf added 7 commits June 22, 2021 16:41

Upload main structure and initial tests

78deec8

Move tests to test/

b9fa57f

Add XMLParser concept

c3b3bde

Add partial modeling

d475d52

Format tests

11f4c1c

Write (String|Bytes)IO additional taint step

b5e10b6

Finish modeling

068150b

jorgectf marked this pull request as ready for review July 22, 2021 17:35

jorgectf requested a review from a team as a code owner July 22, 2021 17:35

jorgectf added 3 commits July 24, 2021 01:23

Polish documentation

0d2646f

Polish tests

61e873d

Write qldocs

b83b31c

jorgectf added 2 commits July 25, 2021 01:51

Fix undetected tests

1dd77f1

Add .expected

93c8529

RasmusWL self-assigned this Aug 25, 2021

jorgectf marked this pull request as draft August 25, 2021 14:14

Fix references' link anchor

48bca5b

jorgectf mentioned this pull request Aug 25, 2021

[Python]: CWE-611: XXE github/securitylab#424

Closed

1 task

jorgectf marked this pull request as ready for review August 25, 2021 15:18

m-y-mo reviewed Aug 26, 2021

python/ql/src/experimental/Security/CWE-611/XXE.qlref

m-y-mo reviewed Aug 26, 2021

python/ql/src/experimental/Security/CWE-611/XXE.qlref

Update .qlref

21da603

m-y-mo reviewed Sep 8, 2021

python/ql/src/experimental/Security/CWE-611/XXE.qlref

jorgectf and others added 2 commits September 9, 2021 19:06

Extend .qlref

61a81b6

Merge branch 'main' into jorgectf/python/deserialization

67fddda

RasmusWL added 12 commits March 3, 2022 21:18

Python: Handle more functions and kw-args

3278793

Python: Update XmlEntityInjection.expected

f72f673

I had forgotten about this, but better late than never... also added a small representative test

Python: Support feed method of lxml/xml.etree Parsers

33ebcdf

Python: Add test for XMLPullParser

46238d5

But handling this in a nice way will require some restructuring

Python: Restructure overall XML modeling

de0e67f

Python: Align QLdocs of XML modeling

a033b71

Python: Restructure modeling of xml.etree parsers

c0a2c25

Python: Restructure lxml modeling

c0a6f9f

and handle parser being passed as positional argument

Python: Minor fixup of qldoc

df8e0fc

Python: Remove XMLParser concept

837daaa

Python: Minor qldoc improvement

0d69dc8

Python: Rename vulnerable predicate => vulnerableTo

3f6c55e

RasmusWL mentioned this pull request Mar 3, 2022

XXE: Changes for review jorgectf/codeql#9

Merged

RasmusWL requested changes Mar 3, 2022


jorgectf and others added 7 commits March 4, 2022 01:02

Apply suggestions from code review

683c2fa

Python: Apply suggestions from code review

3cd165d

Co-authored-by: Jorge <[email protected]>

Python: huge_tree tests were wrong

d6cbfec

Nice spotted @jorgectf!

Python: Fix huge_tree modeling

f0131af

Python: Add conditional assignment check for sax parser

1a9620a

Python: Fix typo in set_default_parser

ef045a6

Merge pull request #9 from RasmusWL/WIP

5552834

Rasmus' rewrite of github#6112 See github#6112 (review)

jorgectf requested a review from RasmusWL March 4, 2022 16:22

RasmusWL added 2 commits March 8, 2022 11:15

Merge branch 'main' into jorgectf/python/deserialization

6b14c1d

Python: Resolve name conflict over XML module

0e9da4a

Not the prettiest solution... but it works ¯\_(ツ)_/¯

RasmusWL force-pushed the jorgectf/python/deserialization branch from 44c9443 to 0e9da4a Compare March 9, 2022 10:06

RasmusWL approved these changes Mar 14, 2022


RasmusWL merged commit 2f4a22c into github:main Mar 14, 2022

jorgectf deleted the jorgectf/python/deserialization branch March 14, 2022 11:41
