Parsing XML with DataSet
The fifth and final method we will use to parse an XML file into memory uses the DataSet class. The example code is shown in Listing Nine.
Listing Nine: Parsing XML using DataSet
using System;
using System.Xml;
using System.Data;
using CommonLib; // Suite class definition
using InfoLib;   // DisplayInfo() method

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      DataSet ds = new DataSet();
      ds.ReadXml("..\\..\\..\\..\\testCases.xml");
      InfoLib.DataSetInfo.DisplayInfo(ds); // show table, column, relation names

      CommonLib.Suite s = new CommonLib.Suite();
      foreach (DataRow row in ds.Tables["testcase"].Rows)
      {
        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = row["id"].ToString();
        tc.kind = row["kind"].ToString();
        tc.expected = row["expected"].ToString();

        DataRow[] children = row.GetChildRows("testcase_inputs"); // relation name
        tc.arg1 = (children[0]["arg1"]).ToString(); // there is only 1 row in children
        tc.arg2 = (children[0]["arg2"]).ToString();

        s.items.Add(tc);
      }
      s.Display();
    } // Main()
  } // class Class1
} // ns
We start by reading the XML file directly into a System.Data.DataSet object using the ReadXml() method. A DataSet object can be thought of as an in-memory relational database. The XML data ends up in two tables, "testcase" and "inputs," that are linked by a relation named "testcase_inputs." The key to using this DataSet technique is knowing how ReadXml() maps the XML data onto the tables, columns, and relations of the DataSet object.
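As a minimal sketch of that mapping, the following program loads a small XML string and lists the table and relation names that ReadXml() infers. The XML shown is an assumption: a hypothetical sample shaped like the test case data the text describes (a "suite" root, "testcase" elements with id and kind attributes, and a nested "inputs" element). The class name InferenceDemo and the Load() helper are illustrative only.

```csharp
using System;
using System.Data;
using System.IO;

public static class InferenceDemo
{
    // Hypothetical sample data; the real testCases.xml may differ.
    public const string SampleXml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
</suite>";

    // Reads the XML text into a new DataSet. No schema is supplied,
    // so ReadXml() infers tables, columns, and relations from the data.
    public static DataSet Load(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }

    public static void Main()
    {
        DataSet ds = Load(SampleXml);

        foreach (DataTable dt in ds.Tables)
            Console.WriteLine("table: " + dt.TableName);

        foreach (DataRelation rel in ds.Relations)
            Console.WriteLine("relation: " + rel.RelationName + " ("
                + rel.ParentTable.TableName + " -> " + rel.ChildTable.TableName + ")");
    }
}
```

For this sample the inferred structure should contain the "testcase" and "inputs" tables and the "testcase_inputs" relation named in the text.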
Although we could create a custom DataSet object with completely known characteristics, it is much quicker to let the ReadXml() method do the work and then examine the result. I wrote a helper function, DisplayInfo(), that accepts a DataSet as an argument and displays the information we need to extract the data from the DataSet's tables.
To keep the main parse program uncluttered, I put DisplayInfo() into a class library named "InfoLib." The code is shown in Listing Ten. The output from running the parse program is shown in Figure 5.
Listing Ten: Code to display DataSet information
using System;
using System.Data;

namespace InfoLib
{
  public class DataSetInfo
  {
    public static void DisplayInfo(DataSet ds) // names of tables, columns, relations in ds
    {
      foreach (DataTable dt in ds.Tables)
      {
        Console.WriteLine("\n===============================================");
        Console.WriteLine("Table = " + dt.TableName + "\n");
        foreach (DataColumn dc in dt.Columns)
        {
          Console.Write("{0,-14}", dc.ColumnName);
        }
        Console.WriteLine("\n-----------------------------------------------");
        foreach (DataRow dr in dt.Rows)
        {
          foreach (object data in dr.ItemArray)
          {
            Console.Write("{0,-14}", data.ToString());
          }
          Console.WriteLine();
        }
        Console.WriteLine("===============================================");
      } // foreach DataTable

      foreach (DataRelation dr in ds.Relations)
      {
        Console.WriteLine("\n\nRelations:");
        Console.WriteLine(dr.RelationName + "\n\n");
      }
    } // DisplayInfo()
  } // class DataSetInfo
} // ns InfoLib
Figure 5 Output from the DataSet technique
The first table, "testcase," holds the data that is one level deep from the XML root: id, kind, and expected. The second table, "inputs," holds data that is two levels deep: arg1 and arg2. As a rule of thumb, if your XML file is nested n levels deep in this way, ReadXml() will generate n tables, roughly one for each level of container elements.
Extracting the data from the parent "testcase" table is easy. We just iterate through each row of the table and access each value by column name. To get the data from the child table "inputs," we get an array of rows using the GetChildRows() method:
DataRow[] children = row.GetChildRows("testcase_inputs"); // relation name
Because each <testcase> node has only one <inputs> child node, the children array will have only one row.
The trickiest aspect of this technique is to extract the child data:
tc.arg1 = (children[0]["arg1"]).ToString(); // there is only 1 row in children
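Indexing children[0] directly relies on every test case having exactly one <inputs> element. As a hedged sketch of a more defensive variant, the helper below checks the length of the array first. The GetInput() helper, the ChildRowDemo class, and the sample XML (whose second test case deliberately omits <inputs>) are all hypothetical and are not part of the article's CommonLib or InfoLib code.

```csharp
using System;
using System.Data;
using System.IO;

public static class ChildRowDemo
{
    // Hypothetical data: the second test case has no <inputs> element.
    public const string SampleXml = @"<suite>
  <testcase id='001' kind='bvt'>
    <inputs><arg1>4</arg1><arg2>7</arg2></inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id='002' kind='func'>
    <expected>0.00</expected>
  </testcase>
</suite>";

    public static DataSet Load(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }

    // Returns the named column from the first <inputs> child row of a
    // testcase row, or null when the test case has no <inputs> child.
    public static string GetInput(DataRow testcaseRow, string columnName)
    {
        DataRow[] children = testcaseRow.GetChildRows("testcase_inputs"); // relation name
        if (children.Length == 0)
            return null;
        return children[0][columnName].ToString();
    }

    public static void Main()
    {
        DataSet ds = Load(SampleXml);
        foreach (DataRow row in ds.Tables["testcase"].Rows)
        {
            string arg1 = GetInput(row, "arg1") ?? "(none)";
            Console.WriteLine(row["id"] + ": arg1 = " + arg1);
        }
    }
}
```

Whether a missing <inputs> element should yield null, a default value, or an exception depends on how strict the test harness needs to be; null is only one reasonable choice.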
Using the DataSet class to parse an XML file has a very relational database feel. Compared with other techniques in this article, it operates at a middle level of abstraction. The ReadXml() method hides a lot of details, but you must still traverse relational tables.
Using DataSet to parse XML files is particularly appropriate when your application program is already using ADO.NET classes, so that you maintain a consistent look and feel. Using a DataSet object has high overhead and would not be a good choice if performance is an issue. Because each level of an XML file generates a table, DataSet would also not be a good choice if your XML file is deeply nested.
Further Discussion
There are several related issues not yet covered: namespaces, generalization, error handling, validation, filtering, and performance. In the context of parsing XML data files, XML namespaces are a mechanism to prevent name clashes. Each of the techniques we've used can deal with namespaces. The MSDN Library will give you all the information you need to handle XML files with namespaces.
The techniques we have seen were not written to be particularly general. If you have a different XML structure, you will have to write different code. There is always a trade-off between writing code for a specific situation and making the code more generalized.
The code in this article does not have any error handling. Parsing XML files is quite error prone and in a production scenario, you would need to add lots of try-catch blocks to create a robust parser.
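As one hedged illustration of what such error handling might look like for the DataSet technique, the sketch below wraps ReadXml() in a try-catch block and also checks that the expected table exists before it is used. The TryLoad() helper and the SafeLoadDemo class are hypothetical names, and the sketch reads from a TextReader rather than a file so that it stays self-contained; a file-based version would also need to handle I/O failures.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class SafeLoadDemo
{
    // Tries to load XML into a DataSet. Returns false and sets 'error'
    // if the XML is malformed or contains no "testcase" table.
    public static bool TryLoad(TextReader source, out DataSet ds, out string error)
    {
        ds = new DataSet();
        error = null;
        try
        {
            ds.ReadXml(source);
        }
        catch (XmlException ex)
        {
            // Raised when the input is not well-formed XML.
            error = "Malformed XML at line " + ex.LineNumber + ": " + ex.Message;
            return false;
        }

        if (!ds.Tables.Contains("testcase"))
        {
            error = "No testcase elements were found.";
            return false;
        }
        return true;
    }

    public static void Main()
    {
        // Deliberately broken input: <testcase> is never closed.
        string bad = "<suite><testcase id='001'></suite>";

        DataSet ds;
        string error;
        if (TryLoad(new StringReader(bad), out ds, out error))
            Console.WriteLine("Loaded " + ds.Tables["testcase"].Rows.Count + " test cases");
        else
            Console.WriteLine("Could not parse: " + error);
    }
}
```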
Additionally, I didn't address XML validation with schema files, but once again, in a production environment you would need to generate XML schema files and validate your XML data files against them before attempting to parse. It is possible to add validation to your parsing code, but I recommend validating before parsing.
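A validate-before-parse step might be sketched as follows. This is an assumption-laden example rather than the article's own approach: it uses the XmlReaderSettings and XmlSchemaSet classes, which arrived in later versions of the .NET Framework than the one the listings target, and both the inline schema and the sample data are hypothetical. The Validate() helper simply collects validation messages; an empty list means the data conforms to the schema.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidateDemo
{
    // Hypothetical schema for the assumed test case file structure.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='suite'>
    <xs:complexType><xs:sequence>
      <xs:element name='testcase' maxOccurs='unbounded'>
        <xs:complexType>
          <xs:sequence>
            <xs:element name='inputs'>
              <xs:complexType><xs:sequence>
                <xs:element name='arg1' type='xs:string'/>
                <xs:element name='arg2' type='xs:string'/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name='expected' type='xs:string'/>
          </xs:sequence>
          <xs:attribute name='id' type='xs:string' use='required'/>
          <xs:attribute name='kind' type='xs:string' use='required'/>
        </xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns the list of schema validation errors (empty if valid).
    // Malformed XML still throws XmlException; this only checks structure.
    public static List<string> Validate(string xml, string xsd)
    {
        List<string> problems = new List<string>();

        XmlSchemaSet schemas = new XmlSchemaSet();
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(xsd)))
        {
            schemas.Add(null, schemaReader); // null: use the schema's own targetNamespace
        }

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler += (sender, e) => problems.Add(e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // reading the whole document drives validation
        }
        return problems;
    }

    public static void Main()
    {
        // Hypothetical invalid data: <expected> is missing.
        string xml = "<suite><testcase id='001' kind='bvt'>"
                   + "<inputs><arg1>4</arg1><arg2>7</arg2></inputs></testcase></suite>";
        foreach (string problem in Validate(xml, Xsd))
            Console.WriteLine(problem);
    }
}
```

Running the validation pass first means the parsing code can assume the structure it expects, which is the reason for validating before parsing rather than during it.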
In every example, we have read all the XML data into memory. In many cases, you will want to filter and just read in some data. All the techniques in this article can be modified to provide front-end filtering. The XPathDocument class has especially nice filtering capabilities by way of XPath syntax.
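To make the filtering idea concrete, here is a small sketch that uses an XPath expression to select only the test cases of one kind. The sample XML, the XPathFilterDemo class, and the IdsOfKind() helper are hypothetical, and the helper assumes the kind value is a trusted string containing no quote characters, because it is concatenated directly into the XPath expression.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathFilterDemo
{
    // Hypothetical sample data with two kinds of test case.
    public const string SampleXml = @"<suite>
  <testcase id='001' kind='bvt'><expected>11.00</expected></testcase>
  <testcase id='002' kind='func'><expected>20.00</expected></testcase>
  <testcase id='003' kind='bvt'><expected>5.00</expected></testcase>
</suite>";

    // Returns the id attribute of every testcase whose kind attribute matches.
    public static List<string> IdsOfKind(string xml, string kind)
    {
        List<string> ids = new List<string>();
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // The predicate [@kind='...'] does the filtering.
        XPathNodeIterator it = nav.Select("/suite/testcase[@kind='" + kind + "']");
        while (it.MoveNext())
        {
            ids.Add(it.Current.GetAttribute("id", ""));
        }
        return ids;
    }

    public static void Main()
    {
        foreach (string id in IdsOfKind(SampleXml, "bvt"))
            Console.WriteLine(id);
    }
}
```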
If performance is an issue, usually in the case where you are parsing many small XML files, you will have to run some timing measurements to determine whether your chosen technique is fast enough. Performance is too tricky to allow many general statements, and the only way to know if your performance is acceptable is to try your code. As a guideline, however, XmlTextReader has the best performance characteristics.
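A rough timing harness along these lines could be sketched as follows. Everything here is illustrative: the TimingDemo class, the Time() helper, the sample XML, and the iteration count are assumptions, and the Stopwatch class it relies on comes from a later .NET Framework version than the article's listings. The numbers it prints depend entirely on the machine and the data, so it is a way to take measurements, not a benchmark result.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Xml;

public static class TimingDemo
{
    // Hypothetical small input, standing in for one of many small files.
    public const string SampleXml = @"<suite>
  <testcase id='001' kind='bvt'><expected>11.00</expected></testcase>
  <testcase id='002' kind='func'><expected>20.00</expected></testcase>
</suite>";

    // Counts <testcase> elements with a forward-only XmlTextReader.
    public static int CountWithReader(string xml)
    {
        int count = 0;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "testcase")
                    count++;
            }
        }
        return count;
    }

    // Counts test cases by loading everything into a DataSet.
    public static int CountWithDataSet(string xml)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        return ds.Tables["testcase"].Rows.Count;
    }

    // Runs 'parse' repeatedly and returns the total elapsed time.
    public static TimeSpan Time(Action parse, int iterations)
    {
        parse(); // warm-up call so one-time startup cost is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            parse();
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int iterations = 1000;
        Console.WriteLine("XmlTextReader: "
            + Time(() => CountWithReader(SampleXml), iterations).TotalMilliseconds + " ms");
        Console.WriteLine("DataSet:       "
            + Time(() => CountWithDataSet(SampleXml), iterations).TotalMilliseconds + " ms");
    }
}
```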
A Key Skill
XML data files are a key component of Microsoft's .NET developer environment. The ability to parse data from XML files into memory is a key skill in a .NET setting. Each of the five techniques, based on the XmlTextReader, XmlDocument, XPathDocument, XmlSerializer, and DataSet classes, is significantly different in terms of coding mechanics, coding mind set, and scenarios for usage. The .NET Framework gives you great flexibility in parsing XML data files and makes this essential task much easier and less error prone than using non-.NET techniques.
References
XML in .NET Overview, http://msdn.microsoft.com/msdnmag/issues/01/01/xml/xml.asp
Consume XML C# app, http://msdn.microsoft.com/library/en-us/vcedit/html/vcwlkVisualCApplicationsConsumingXMLData.asp
XML Schema, http://msdn.microsoft.com/msdnmag/issues/02/04/xml/xml0204.asp
XML Namespaces, http://msdn.microsoft.com/msdnmag/issues/01/07/xml/default.aspx
Dr. James McCaffrey works for Volt Information Sciences Inc. where he manages technical training for software engineers working at Microsoft's Redmond, WA campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search.