TechnicalArchitectureWorx

The (Unofficial) ITWorx Technical Architecture Blog

Parsing Word Document in C#

Posted by archworx on May 10, 2007


1) Add a reference to the Microsoft Word Object Library (Interop.Word.dll) to your project

2) Save the Word document as XML programmatically (shown below)

//DocumentPath and XmlPath are object variables holding the path of the Word document
//and the path of the XML file to create; the interop methods take every argument by ref.
Word.Application WordApp = new Word.ApplicationClass();
object NullObject = System.Reflection.Missing.Value;
object FalseValue = false;
object TrueValue = true;

//Save format: Word XML (WordML)
object Format = (object)Word.WdSaveFormat.wdFormatXML;

Word.Document Document;
Document = WordApp.Documents.Open(ref DocumentPath, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref TrueValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject);

//Accept all the revisions done on the document first.
Document.AcceptAllRevisions();

//Save document in WordML format.
Document.SaveAs(ref XmlPath, ref Format, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject);

//Close Word Document
Document.Close(ref TrueValue, ref NullObject, ref NullObject);

//Quit Word so that no WINWORD.EXE process is left running
WordApp.Quit(ref NullObject, ref NullObject, ref NullObject);

3) Parse the saved XML file using XSLT style sheets

Note that when the Word document is saved as XML, the output is in WordML format.

This is a simple example of a style sheet that extracts every sentence in the document that is preceded by a certain tag, here [#]. It was used to parse a requirements document.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:wx="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <xsl:template match="/">
    <doc>
      <xsl:for-each select="//w:p/w:r">
        <xsl:if test="w:t!=''">
          <xsl:if test="contains(w:t, '[#]')">
            <!-- check whether the text after the tag is a requirement by checking that it is preceded by a number -->
            <xsl:choose>
              <xsl:when test="starts-with(translate(w:t/parent::node()/parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
                <requirment>
                  <xsl:choose>
                    <xsl:when test="not(string-length(translate(w:t,' ',''))&gt;3)">
                      <reqValue>
                        <xsl:value-of select="following-sibling::*[1]"/>
                      </reqValue>
                    </xsl:when>
                    <xsl:otherwise>
                      <reqValue>
                        <xsl:value-of select="w:t"/>
                      </reqValue>
                    </xsl:otherwise>
                  </xsl:choose>
                  <reqNum>
                    <xsl:value-of select="parent::node()/w:pPr/w:listPr/*/@*"/>
                  </reqNum>
                  <!--</xsl:if>-->
                  <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
                  <!-- whereas in case of text, the whole string is returned and it can't be converted to a number; the result is NaN -->
                </requirment>
              </xsl:when>
              <xsl:otherwise>
                <xsl:if test="starts-with(translate(parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
                  <requirment>
                    <xsl:choose>
                      <xsl:when test="not(string-length(translate(w:t,' ',''))&gt;3)">
                        <reqValue>
                          <xsl:value-of select="following-sibling::*[1]"/>
                        </reqValue>
                      </xsl:when>
                      <xsl:otherwise>
                        <reqValue>
                          <xsl:value-of select="w:t"/>
                        </reqValue>
                      </xsl:otherwise>
                    </xsl:choose>
                    <!--<xsl:if test="starts-with(translate(parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">-->
                    <reqNum>
                      <xsl:value-of select="parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*"/>
                    </reqNum>
                    <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
                    <!-- whereas in case of text, the whole string is returned and it can't be converted to a number; the result is NaN -->
                  </requirment>
                </xsl:if>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:if>
        </xsl:if>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>
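
The core selection in the style sheet (every run under //w:p/w:r whose w:t contains the tag) can also be tried in isolation from C#. The following is only a minimal sketch for experimenting with the XPath, not part of the original solution: the class name TaggedRunFinder and the method FindTaggedText are hypothetical, the namespace URI is the one bound to the w prefix in the style sheet, and the list-number checks are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class TaggedRunFinder
{
    // Namespace bound to the "w" prefix in the style sheet above.
    public const string WordMlNamespace =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the text of every run (w:r) inside a paragraph (w:p)
    // whose first w:t child contains the given tag.
    public static List<string> FindTaggedText(string wordMl, string tag)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(wordMl);

        // XPath needs the prefix-to-namespace mapping to be supplied explicitly.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", WordMlNamespace);

        List<string> result = new List<string>();
        foreach (XmlNode run in doc.SelectNodes("//w:p/w:r", ns))
        {
            XmlNode text = run.SelectSingleNode("w:t", ns);
            if (text != null && text.InnerText.Contains(tag))
            {
                result.Add(text.InnerText);
            }
        }
        return result;
    }
}
```

Calling TaggedRunFinder.FindTaggedText(xml, "[#]") on the contents of the saved WordML file returns only the tagged runs, which is a quick way to check that the tag is being found before debugging the rest of the style sheet.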

4) Apply the XSLT style sheet to the saved XML file programmatically

//Requires: using System; using System.Data; using System.Xml; using System.Xml.XPath; using System.Xml.Xsl;

//The XmlPath is the path of the Word document in the XML format
XPathDocument XpathDoc = new XPathDocument(XmlPath.ToString());
XslCompiledTransform TransformXml = new XslCompiledTransform();
try
{
    DataSet Ds = new DataSet();

    //load the style sheet (virtual path)
    TransformXml.Load(StyleSheetPath);

    //The ResultXmlPath is the path of the resultant XML file after the transformation
    XmlTextWriter Writer = new XmlTextWriter(ResultXmlPath, null);

    //transform the xml file using xsl sheets:
    TransformXml.Transform(XpathDoc, Writer);
    Writer.Close();

    //Read the resultant XML file (after style sheet) into a dataset
    Ds.ReadXml(ResultXmlPath);
    log.Debug("Dataset Count: " + Ds.Tables[0].Rows.Count);
}
catch (Exception ex)
{
    log.Error("DocumentToXml : " + ex.Message);
}
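
The same load, transform and read sequence can be exercised without any files, which is handy when testing a style sheet. What follows is a hedged sketch rather than the code used above: the class InMemoryTransform and its methods Apply and ToDataSet are hypothetical names, and the style sheet and input are assumed to be passed in as strings.

```csharp
using System.Data;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, for illustration only.
public static class InMemoryTransform
{
    // Applies an XSLT style sheet (given as a string) to an XML string
    // and returns the transformed output as a string.
    public static string Apply(string styleSheet, string inputXml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(styleSheet)))
        {
            transform.Load(styleReader);
        }

        using (XmlReader inputReader = XmlReader.Create(new StringReader(inputXml)))
        using (StringWriter output = new StringWriter())
        {
            // No XSLT arguments are needed, so null is passed for them.
            transform.Transform(inputReader, null, output);
            return output.ToString();
        }
    }

    // Loads transformed XML into a DataSet, letting ReadXml infer the schema:
    // each repeated child element with its own children becomes a row.
    public static DataSet ToDataSet(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }
}
```

With output shaped like the style sheet above produces (a doc root holding requirment elements, each with reqValue and reqNum children), schema inference should yield one table whose rows are the requirements, which is why the logging line reads Ds.Tables[0].Rows.Count.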

6 Responses to “Parsing Word Document in C#”

  1. Essam said

    This is great work, actually, but I wonder why you did it in this hybrid way. Can't you just use the Office APIs, implement it as a Word add-in, or use VSTO?

    Word 2003 has very good support for XML, and documents can now be based on an XML Schema, so they would look tidy and be easy to program.

  2. Radwa said

    Well, I did a lot of research on using VSTO to extract certain statements from a Word document but didn't reach a useful result. So I decided to do it my way, which is pretty hybrid, as you called it. For me it was actually the easiest way at that time. If you have any useful material covering different ways to do the job with better performance and less effort, please let us know :)

