TechnicalArchitectureWorx

The (Unofficial) ITWorx Technical Architecture Blog


Parsing a Word Document in C#

Posted by archworx on May 10, 2007


1) Add a reference to the Microsoft Word Object Library (Interop.Word.dll) to your project

2) Save the Word document as XML programmatically (shown below)

// DocumentPath and XmlPath are assumed to be object variables holding the source
// document path and the target XML path (the interop methods take arguments as ref object).
Word.Application WordApp = new Word.ApplicationClass();
object NullObject = System.Reflection.Missing.Value;
object FalseValue = false;
object TrueValue = true;

// Target format: WordML (Word XML)
object Format = (object)Word.WdSaveFormat.wdFormatXML;

Word.Document Document = WordApp.Documents.Open(ref DocumentPath, ref NullObject, ref FalseValue, ref NullObject,
    ref NullObject, ref NullObject, ref TrueValue, ref NullObject, ref NullObject, ref NullObject,
    ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject);

// Accept all the revisions made to the document first.
Document.AcceptAllRevisions();

// Save the document in WordML format.
Document.SaveAs(ref XmlPath, ref Format, ref NullObject, ref NullObject, ref NullObject, ref NullObject,
    ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject,
    ref NullObject, ref NullObject, ref NullObject, ref NullObject);

// Close the Word document.
Document.Close(ref TrueValue, ref NullObject, ref NullObject);

// Quit Word so that no WINWORD.EXE process is left running.
WordApp.Quit(ref FalseValue, ref NullObject, ref NullObject);

3) Parse the saved XML file using XSLT style sheets

Note that when the Word document is saved as XML, the output is in WordML format.

This is a simple example of a style sheet that gets all the sentences in the document that are preceded by a certain tag, here [#]. This was used for parsing a Requirements Document.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:wx="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

<xsl:template match="/">
<doc>
<xsl:for-each select="//w:p/w:r">
  <xsl:if test="w:t!=''">
    <xsl:if test="contains(w:t, '[#]')">
      <!-- check if the text after the ~ is a requirement by checking that the text is preceded by a number -->
      <xsl:choose>
        <xsl:when test="starts-with(translate(w:t/parent::node()/parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
          <requirment>
            <xsl:choose>
              <xsl:when test="not(string-length(translate(w:t,' ','')) &gt; 3)">
                <reqValue>
                  <xsl:value-of select="following-sibling::*[1]"/>
                </reqValue>
              </xsl:when>
              <xsl:otherwise>
                <reqValue>
                  <xsl:value-of select="w:t"/>
                </reqValue>
              </xsl:otherwise>
            </xsl:choose>
            <reqNum>
              <xsl:value-of select="parent::node()/w:pPr/w:listPr/*/@*"/>
            </reqNum>
            <!--</xsl:if>-->
            <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
            <!-- whereas in case of text, the whole string returns and it can't be converted to a number, result is NaN -->
          </requirment>
        </xsl:when>
        <xsl:otherwise>
          <xsl:if test="starts-with(translate(parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
            <requirment>
              <xsl:choose>
                <xsl:when test="not(string-length(translate(w:t,' ','')) &gt; 3)">
                  <reqValue>
                    <xsl:value-of select="following-sibling::*[1]"/>
                  </reqValue>
                </xsl:when>
                <xsl:otherwise>
                  <reqValue>
                    <xsl:value-of select="w:t"/>
                  </reqValue>
                </xsl:otherwise>
              </xsl:choose>
              <!--<xsl:if test="starts-with(translate(parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">-->
              <reqNum>
                <xsl:value-of select="parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*"/>
              </reqNum>
              <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
              <!-- whereas in case of text, the whole string returns and it can't be converted to a number, result is NaN -->
            </requirment>
          </xsl:if>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:if>
</xsl:for-each>
</doc>
</xsl:template>
</xsl:stylesheet>
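
To try the tag-filtering idea without Word installed, here is a minimal sketch that runs a much-reduced version of the style sheet against a small in-memory WordML fragment. The class name TagFilterDemo, the sample fragment, and the reduced style sheet (it keeps only the [#] test and drops the list-number checks) are assumptions made for illustration, not part of the original post.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TagFilterDemo
{
    // Reduced style sheet (illustrative): one <reqValue> per run whose text contains "[#]".
    public const string ReducedXslt = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""
    exclude-result-prefixes=""w"">
  <xsl:output omit-xml-declaration=""yes""/>
  <xsl:template match=""/"">
    <doc>
      <xsl:for-each select=""//w:p/w:r"">
        <xsl:if test=""contains(w:t, '[#]')"">
          <reqValue><xsl:value-of select=""w:t""/></reqValue>
        </xsl:if>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical, hand-written WordML fragment; real Word output is far more verbose.
    public const string SampleWordMl = @"<w:wordDocument xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml"">
  <w:body>
    <w:p><w:r><w:t>[#] The system shall log in users.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Plain paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>";

    // Applies the reduced style sheet to a WordML string and returns the result XML.
    public static string Run(string wordMl)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xsltReader = XmlReader.Create(new StringReader(ReducedXslt)))
        {
            transform.Load(xsltReader);
        }

        StringWriter output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(wordMl)))
        using (XmlWriter writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Run(SampleWordMl));
    }
}
```

Only the run that carries the [#] tag ends up in the output; the untagged paragraph is skipped. The full style sheet above additionally checks the list numbering before accepting a run as a requirement.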

4) Link the XSLT style sheet to the XML file saved programmatically

// Requires: using System; using System.Data; using System.Xml; using System.Xml.XPath; using System.Xml.Xsl;

// XmlPath is the path of the Word document saved in XML format.
XPathDocument XpathDoc = new XPathDocument(XmlPath.ToString());
XslCompiledTransform TransformXml = new XslCompiledTransform();

try
{
    DataSet Ds = new DataSet();

    // Load the style sheet (virtual path).
    TransformXml.Load(StyleSheetPath);

    // ResultXmlPath is the path of the resulting XML file after the transformation.
    XmlTextWriter Writer = new XmlTextWriter(ResultXmlPath, null);

    // Transform the XML file using the XSLT style sheet.
    TransformXml.Transform(XpathDoc, Writer);
    Writer.Close();

    // Read the resulting XML file (after the style sheet is applied) into a DataSet.
    Ds.ReadXml(ResultXmlPath);

    log.Debug("Dataset Count: " + Ds.Tables[0].Rows.Count);
}
catch (Exception ex)
{
    log.Error("DocumentToXml : " + ex.Message);
}
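
To make the last step more concrete, the following is a minimal sketch of how DataSet.ReadXml turns the transformed XML into rows. The class name ResultDataSetDemo and the sample XML are assumptions for illustration; the element names doc, requirment, reqValue and reqNum are the ones the style sheet emits.

```csharp
using System;
using System.Data;
using System.IO;

public static class ResultDataSetDemo
{
    // Hypothetical transformation result with two requirements.
    public const string SampleResult =
        "<doc>" +
        "<requirment><reqValue>[#] First requirement</reqValue><reqNum>1</reqNum></requirment>" +
        "<requirment><reqValue>[#] Second requirement</reqValue><reqNum>2</reqNum></requirment>" +
        "</doc>";

    // Loads the result XML and returns the table inferred from the <requirment> elements.
    public static DataTable Load(string resultXml)
    {
        DataSet ds = new DataSet();
        // With no schema supplied, ReadXml infers one: <doc> becomes the DataSet,
        // each repeated <requirment> becomes a row, and its children become columns.
        ds.ReadXml(new StringReader(resultXml));
        return ds.Tables["requirment"];
    }

    public static void Main()
    {
        DataTable requirements = Load(SampleResult);
        foreach (DataRow row in requirements.Rows)
        {
            Console.WriteLine(row["reqNum"] + ": " + row["reqValue"]);
        }
    }
}
```

Under these assumptions, Ds.Tables[0] in the code above would be the requirment table, so its row count is the number of requirements found in the document.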


Posted in .NET 2.0, Radwa Nada