DotNetSlackers: ASP.NET News for lazy Developers

Tuesday, July 30, 2019

Returning XML Encoded String in .NET

XML isn't as fashionable as it once was, but there is still plenty of XML-based configuration and data floating around. Just recently I was working on a conversion routine that has to generate XML formatted templates, and one thing I needed was a simple way to generate a properly encoded XML string.

In most cases you will want to use a proper XML processor, whether it's an XmlDocument, XmlWriter or LINQ to XML, to generate your XML. When you use those features, the data conversion from string (and most other types) is built in and mostly automatic.

However, in this case I have a large block of mostly static XML text, and creating the entire document using structured XML documents feels like overkill when really I just need to inject a few simple values.

So in this case I'm looking for a way to format values as XML, and for most types the XmlConvert static class works well.

Should be easy, right? Well...


The XmlConvert class works well, except for string conversions, which it doesn't support. XmlConvert.ToString() converts just about any of the common base types into properly XML formatted content, with the exception of string.
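As a rough illustration of that gap, the sketch below formats a few non-string values with XmlConvert.ToString(). The class name and the sample values are made up for the example. The overloads cover types such as bool, double and DateTime, but none of them takes a string and escapes markup characters.

```csharp
using System;
using System.Xml;

public static class XmlConvertDemo
{
    public static void Main()
    {
        // Base types are converted to their XML Schema lexical forms.
        Console.WriteLine(XmlConvert.ToString(true));   // true (lowercase, unlike bool.ToString())
        Console.WriteLine(XmlConvert.ToString(1.5));    // 1.5 (culture independent)

        var date = new DateTime(2019, 7, 30, 0, 0, 0, DateTimeKind.Utc);
        Console.WriteLine(XmlConvert.ToString(date, XmlDateTimeSerializationMode.Utc)); // 2019-07-30T00:00:00Z

        // There is no overload that accepts a string, so passing text such as
        // "a & b" does not compile, and nothing here escapes &, < or >.
    }
}
```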


Reading an encoded XML Value

There are a number of different ways that you can generate XML output and all of them basically involve creating some sort of XML structure and reading the value out of the 'rendered' document.
The most concise way I've found (on StackOverflow from Jon Skeet, with modifications to return just the content) is the following:
public static string XmlString(string text)
{
    return new XElement("t", text).LastNode.ToString();
}
The XElement returns the entire XML fragment, while LastNode is the text node which contains the actual node's content.
You can call XmlString() with:
void Main()
{
    XmlString("Brackets & stuff <doc> and \"quotes\" and more 'quotes'.").Dump();
}
which produces:
Brackets &amp; stuff &lt;doc&gt; and "quotes" and more 'quotes'.
But hold on - this doesn't take into account attributes which require some additional encoding for quotes and control characters. So a little more work is required for the wrapper:
public static string XmlString(string text, bool isAttribute = false)
{
 if (string.IsNullOrEmpty(text))
  return text;
  
 if (!isAttribute)
  return new XElement("t", text).LastNode.ToString();

 return new XAttribute("__n",text)
         .ToString().Substring(5).TrimEnd('\"'); 
}
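To show how a wrapper like this might be used for the template scenario described earlier, here is a minimal sketch. The template text, the {title} and {body} placeholders and the Render() helper are hypothetical names invented for the example. The wrapper is repeated so the sample runs on its own, and the result is parsed back to check that the original values survive the round trip.

```csharp
using System;
using System.Xml.Linq;

public static class TemplateDemo
{
    // Same approach as the wrapper above, repeated so the sample is self-contained.
    public static string XmlString(string text, bool isAttribute = false)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        if (!isAttribute)
            return new XElement("t", text).LastNode.ToString();

        // XAttribute.ToString() yields __n="value": drop the 5 leading
        // characters (__n=") and the closing quote.
        return new XAttribute("__n", text).ToString().Substring(5).TrimEnd('"');
    }

    // Hypothetical template and placeholder names, for illustration only.
    public static string Render(string title, string body)
    {
        const string template = "<note title=\"{title}\"><body>{body}</body></note>";
        return template
            .Replace("{title}", XmlString(title, isAttribute: true))
            .Replace("{body}", XmlString(body));
    }

    public static void Main()
    {
        string xml = Render("Tom & \"Jerry\"", "1 < 2 & 3 > 2");
        Console.WriteLine(xml);

        // Parsing the result back returns the original, unencoded values.
        XElement note = XElement.Parse(xml);
        Console.WriteLine(note.Attribute("title").Value);
        Console.WriteLine(note.Element("body").Value);
    }
}
```

Parsing the generated text back is a cheap sanity check that the injected values did not break the document.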
If you don't want to use LINQ to XML you can use an XML Document instead.
private static XmlDocument _xmlDoc;

public string XmlString(string text)
{
 _xmlDoc = _xmlDoc ?? new XmlDocument();
 var el = _xmlDoc.CreateElement("t");
 el.InnerText = text;
 return el.InnerXml;
}
Note that using XmlDocument is considerably slower than XElement even with the document caching used above.

System.Security.SecurityElement.Escape()?

SecurityElement.Escape() is a built-in CLR function that performs XML encoding. It's a single function so it's easy to call, but it always encodes all quotes, with no option to turn that off. This is OK, but it can result in extra characters if you're encoding for XML elements, since only attributes need quotes encoded.
The function is also considerably slower than the other mechanisms mentioned here.
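A minimal sketch of the call, using the same sample string as the earlier examples (the class name is made up). Note that both kinds of quotes come back encoded even though the text could be going into an element:

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    public static void Main()
    {
        string input = "Brackets & stuff <doc> and \"quotes\" and more 'quotes'.";

        // Escape() replaces <, >, &, " and ' unconditionally.
        Console.WriteLine(SecurityElement.Escape(input));
        // Brackets &amp; stuff &lt;doc&gt; and &quot;quotes&quot; and more &apos;quotes&apos;.
    }
}
```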

Just Code

If you don't want to deal with adding a reference to LINQ to XML or even System.Xml, you can also create a simple code routine. XML encoding really only escapes 5 characters (3 if you're encoding for elements), and the routine throws for illegal control characters below CHR(32), with the exception of tabs, carriage returns and line feeds.
The simple code to do this looks like this:
/// <summary>
/// Turns a string into a properly XML Encoded string.
/// Uses simple string replacement.
///
/// Also see XmlUtils.XmlString() which uses XElement
/// to handle additional extended characters.
/// </summary>
/// <param name="text">Plain text to convert to XML Encoded string</param>
/// <param name="isAttribute">
/// If true encodes single and double quotes, CRLF and tabs.
/// When embedding element values quotes don't need to be encoded.
/// When embedding attributes quotes need to be encoded.
/// </param>
/// <returns>XML encoded string</returns>
/// <exception cref="InvalidOperationException">Invalid character in XML string</exception>
public static string XmlString(string text, bool isAttribute = false)
{
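    // Nothing to encode; this also avoids a NullReferenceException on text.Length
    if (string.IsNullOrEmpty(text))
        return text;

    var sb = new StringBuilder(text.Length);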

    foreach (var chr in text)
    {
        if (chr == '<')
            sb.Append("&lt;");
        else if (chr == '>')
            sb.Append("&gt;");
        else if (chr == '&')
            sb.Append("&amp;");

        // special handling for quotes
        else if (isAttribute && chr == '\"')
            sb.Append("&quot;");
        else if (isAttribute && chr == '\'')
            sb.Append("&apos;");

        // Legal sub-chr32 characters
        else if (chr == '\n')
            sb.Append(isAttribute ? "&#xA;" : "\n");
        else if (chr == '\r')
            sb.Append(isAttribute ? "&#xD;" : "\r");
        else if (chr == '\t')
            sb.Append(isAttribute ? "&#x9;" : "\t");

        else
        {
            if (chr < 32)
                throw new InvalidOperationException("Invalid character in Xml String. Chr " +
                                                    Convert.ToInt16(chr) + " is illegal.");
            sb.Append(chr);
        }
    }

    return sb.ToString();
}

Attributes vs. Elements

Notice that the functions above optionally support attribute encoding. Attributes need to be encoded differently than elements.
That's because XML Elements are not required to have quotes encoded because there are no string delimiters to worry about in an XML element. This is legal XML:
<doc>This a "quoted" string. So is 'this'!</doc>
However, if you are generating content for an XML Attribute, you do need to encode quotes because the quotes are the delimiter for the attribute. Makes sense, right?
<doc note="This a &quot;quoted&quot; string. So is &apos;this&apos;!" />
Actually, the &apos; is not required in this example because the attribute delimiter is ". So this is actually more correct:
<doc note="This a &quot;quoted&quot; string. So is 'this'!" />
However, both are valid XML. The string function above will encode single and double quotes when the isAttribute parameter is set to true to handle setting attribute values.
In addition, literal carriage returns, line feeds and tabs inside an attribute value are normalized to spaces when the XML is parsed, so to preserve them they need to be encoded as character references:
CR: &#xD; LF: &#xA; Tab: &#x9;
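The effect is easy to see by parsing an attribute both ways. The sketch below is only an illustration (the class and helper names are made up) and relies on the attribute value normalization that XElement.Parse() applies when reading: literal whitespace characters come back as plain spaces, while the character references keep the original tab and line feed.

```csharp
using System;
using System.Xml.Linq;

public static class AttributeWhitespaceDemo
{
    // Parses a small document and returns the value of its note attribute.
    public static string ReadNote(string xml) =>
        XElement.Parse(xml).Attribute("note").Value;

    public static void Main()
    {
        // Literal tab and line feed inside the attribute value.
        string literal = ReadNote("<doc note=\"a\tb\nc\" />");

        // The same characters written as character references.
        string encoded = ReadNote("<doc note=\"a&#x9;b&#xA;c\" />");

        Console.WriteLine(literal == "a b c");    // True: whitespace became spaces
        Console.WriteLine(encoded == "a\tb\nc");  // True: tab and line feed preserved
    }
}
```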
The following LINQPad code demonstrates what XML is generated for values by Elements and Attributes:
void Main()
{
 var doc = new XmlDocument();
 doc.LoadXml("<d><t>This is &amp; a \"test\" and a 'tested' test</t></d>");
 doc.OuterXml.Dump();
 
 var s = "This is &amp; a \"test\" and a 'tested' test</t></d> with breaks \r\n and \t tabs</root>";
 
 var node = doc.CreateElement("d2");
 node.InnerText = s;
 doc.DocumentElement.AppendChild(node);
 var attr = doc.CreateAttribute("note", s);
 node.Attributes.Append(attr);

 doc.OuterXml.Dump();
}
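The document looks like this. Note that the second argument to CreateAttribute() is a namespace URI rather than the attribute value, which is why the string shows up in the generated xmlns:d2p1 declaration while note itself is empty: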
<d>
    <t>This is &amp; a "test" and a 'tested' test</t>
    <d2 d2p1:note="" 
        xmlns:d2p1="This is &amp;amp; a &quot;test&quot; and a 'tested' test&lt;/t&gt;&lt;/d&gt;
        with breaks &#xD;&#xA; and    tabs&lt;/root&gt;">This is &amp;amp; a "test" and a 'tested'
        test&lt;/t&gt;&lt;/d&gt; with breaks 
        and    tabs&lt;/root&gt;</d2>
</d>
(attribute is a single line - linebreaks added for readability)
Bottom line: Elements don't require quotes, line breaks and tabs to be encoded, but attributes do.

Performance

This falls into the premature optimization bucket, but I was curious how well each of these mechanisms would perform relative to each other. It would seem that XElement and especially XmlDocument would be very slow, as they process the element as an XML document/fragment that has to be loaded and parsed.
I was very surprised to find that the fastest and most consistent solution in various sizes of text was XElement which was faster than my bare bones string implementation especially for larger strings. For small amounts of text (under a few hundred characters) the string and XElement implementations were roughly the same, but as strings get larger XElement started to become considerably faster.
As an aside, the custom string version also runs considerably faster in Release Mode (in LINQPad, run with Optimizations On) than in debug mode. In debug mode performance was about 3-4x slower. Yikes.
Not surprisingly XmlDocument, even the cached version, was slower. With small strings it was roughly 50% slower, and with larger strings many times slower, falling further behind as the string size grows.
Surprisingly slowest of them all was SecurityElement.Escape() which was nearly twice as slow as the XmlDocument approach.
Whatever XElement is doing to encode the element, it's very efficient, and it's built into the framework and maintained by Microsoft, so I would recommend that solution unless you want to avoid the XML assembly references. In that case the custom string solution performs just as well with smaller strings and reasonably close with large strings.
Take all of these numbers with a grain of salt - all of them are pretty fast for one off parsing and unless you're using manual XML encoding strings in loops or large batches, the perf difference is not of concern here.
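For a quick feel of the relative cost, a rough timing harness along these lines can be used. It is only a sketch: the input string, the iteration count and the Time() helper are arbitrary choices made for this example, it compares just two of the approaches, and a Stopwatch loop like this is far less rigorous than a proper benchmarking tool.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Security;
using System.Xml.Linq;

public static class EncodeTiming
{
    // Runs the encoder repeatedly and returns the elapsed milliseconds.
    public static long Time(Func<string, string> encode, string input, int iterations)
    {
        encode(input); // warm up so JIT compilation is not part of the measurement

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            encode(input);
        sw.Stop();

        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Arbitrary test data: a few thousand characters with markup to escape.
        string input = string.Concat(Enumerable.Repeat("Brackets & <doc> \"quotes\" ", 200));
        const int iterations = 10000;

        long xelement = Time(s => new XElement("t", s).LastNode.ToString(), input, iterations);
        long security = Time(SecurityElement.Escape, input, iterations);

        Console.WriteLine("XElement:        " + xelement + " ms");
        Console.WriteLine("SecurityElement: " + security + " ms");
    }
}
```

Run it in Release mode and vary the input size, since the relative results described above depend on both.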
If you want to play around with the different approaches, here's a Gist that you can load into LINQPad that you can just run:

Summary

XML string encoding is something you hopefully won't have to do much of, but it's one thing I've tripped over enough times to take the time to write it up here. Again, in most cases my recommendation is to write strings using some sort of official XML parser (XmlDocument or XDocument/XElement). But in the few cases where you just need to jam a couple of values into a large document, nothing beats simple string replacement in the document for simplicity and easy maintenance, and that's the one edge use case where a function like XmlString() makes sense.