Parsing Word Document in C#

Posted by archworx on May 10, 2007

1) Add a reference to the Microsoft Word Object Library (Interop.Word.dll) to your project

2) Save the Word document as XML programmatically (shown below)

// DocumentPath and XmlPath are object variables (they are passed by ref) holding
// the path of the source Word document and the path of the XML file to create.
Word.Application WordApp = new Word.ApplicationClass();
object NullObject = System.Reflection.Missing.Value;
object FalseValue = false;
object TrueValue = true;

// Target format: Word XML (WordML)
object Format = (object)Word.WdSaveFormat.wdFormatXML;

// Open the document: ReadOnly = false, Revert = true, Visible = false
Word.Document Document = WordApp.Documents.Open(ref DocumentPath, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref TrueValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject);

// Accept all the revisions made to the document first.
Document.AcceptAllRevisions();

// Save the document in WordML format.
Document.SaveAs(ref XmlPath, ref Format, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref FalseValue, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject, ref NullObject);

// Close the Word document.
Document.Close(ref TrueValue, ref NullObject, ref NullObject);

// Quit Word once no more documents need converting, so no WINWORD.EXE process is left running.
WordApp.Quit(ref FalseValue, ref NullObject, ref NullObject);

3) Parse the saved XML file using XSLT style sheets

Note that when the Word document is saved as XML, the output is in WordML format.

This is a simple example of a style sheet that extracts every sentence in the document that is preceded by a certain tag, here [#]. It was used for parsing a requirements document.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<xsl:template match="/">
<doc>
<xsl:for-each select="//w:p/w:r">
  <xsl:if test="w:t!=''">
  <xsl:if test="contains(w:t, '[#]')">
    <!-- check if the text after the tag is a requirement by checking that the text is preceded by a number -->
    <xsl:choose>
      <xsl:when test="starts-with(translate(w:t/parent::node()/parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
        <requirment>
          <xsl:choose>
            <xsl:when test="not(string-length(translate(w:t,' ',''))>3)">
              <reqValue>
                <xsl:value-of select="following-sibling::*[1]"/>
              </reqValue>
            </xsl:when>
            <xsl:otherwise>
              <reqValue>
                <xsl:value-of select="w:t"/>
              </reqValue>
            </xsl:otherwise>
          </xsl:choose>
          <reqNum>
            <xsl:value-of select="parent::node()/w:pPr/w:listPr/*/@*">
            </xsl:value-of>
          </reqNum>
          <!--</xsl:if>-->
          <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
          <!-- whereas in case of text, the whole string is returned and it can't be converted to a number, so the result is NaN -->
        </requirment>
      </xsl:when>
      <xsl:otherwise>
        <xsl:if test="starts-with(translate(parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">
          <requirment>
            <xsl:choose>
              <xsl:when test="not(string-length(translate(w:t,' ',''))>3)">
                <reqValue>
                  <xsl:value-of select="following-sibling::*[1]"/>
                </reqValue>
              </xsl:when>
              <xsl:otherwise>
                <reqValue>
                  <xsl:value-of select="w:t"/>
                </reqValue>
              </xsl:otherwise>
            </xsl:choose>
            <!--<xsl:if test="starts-with(translate(parent::node()/w:pPr/w:listPr/*/@*, '0123456789', '9999999999'), '9')">-->
            <reqNum>
              <xsl:value-of select="parent::node()/parent::node()/parent::node()/preceding-sibling::*[1]/w:tc/w:p/w:pPr/w:listPr/*/@*">
              </xsl:value-of>
            </reqNum>
            <!-- in case of a section number, the substring-before function returns an actual number, so it can be converted to a number -->
            <!-- whereas in case of text, the whole string is returned and it can't be converted to a number, so the result is NaN -->
          </requirment>
        </xsl:if>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:if>
  </xsl:if>
</xsl:for-each>
</doc>
</xsl:template>
</xsl:stylesheet>
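To see which runs the contains(w:t, '[#]') test picks up, the same selection can be tried directly from C# on a small fragment. This is only a sketch: the class name TaggedRunFinder, the method FindTaggedRuns and the two-paragraph sample are made up for illustration, and a real WordML file carries far more markup than this.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TaggedRunFinder
{
    // WordprocessingML 2003 namespace, the URI the style sheet binds to the w prefix.
    public const string WordMlNamespace = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hypothetical sample: one tagged sentence and one untagged sentence.
    public const string Sample =
        "<w:wordDocument xmlns:w=\"" + WordMlNamespace + "\"><w:body>" +
        "<w:p><w:r><w:t>[#] The system shall log every login.</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Background discussion.</w:t></w:r></w:p>" +
        "</w:body></w:wordDocument>";

    // Returns the text of every run (w:r inside w:p) whose w:t child contains the tag.
    public static List<string> FindTaggedRuns(string wordMl, string tag)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(wordMl);

        // The w prefix has to be registered before it can be used in XPath.
        XmlNamespaceManager namespaces = new XmlNamespaceManager(document.NameTable);
        namespaces.AddNamespace("w", WordMlNamespace);

        List<string> matches = new List<string>();
        foreach (XmlNode run in document.SelectNodes("//w:p/w:r", namespaces))
        {
            XmlNode text = run.SelectSingleNode("w:t", namespaces);
            if (text != null && text.InnerText.Contains(tag))
            {
                matches.Add(text.InnerText);
            }
        }
        return matches;
    }

    public static void Main()
    {
        foreach (string line in FindTaggedRuns(Sample, "[#]"))
        {
            Console.WriteLine(line); // prints "[#] The system shall log every login."
        }
    }
}
```

The tag is compared in C# rather than embedded in the XPath string, which avoids quoting problems if the tag ever contains an apostrophe.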

4) Apply the XSLT style sheet to the XML file saved programmatically

// Requires: using System; using System.Data; using System.Xml; using System.Xml.XPath; using System.Xml.Xsl;

// XmlPath is the path of the Word document saved in XML format
XPathDocument XpathDoc = new XPathDocument(XmlPath.ToString());
XslCompiledTransform TransformXml = new XslCompiledTransform();

try
{
    DataSet Ds = new DataSet();

    // Load the style sheet (virtual path)
    TransformXml.Load(StyleSheetPath);

    // ResultXmlPath is the path of the XML file produced by the transformation
    XmlTextWriter Writer = new XmlTextWriter(ResultXmlPath, null);

    // Transform the XML file using the style sheet
    TransformXml.Transform(XpathDoc, Writer);
    Writer.Close();

    // Read the resulting XML file (after the style sheet is applied) into a DataSet
    Ds.ReadXml(ResultXmlPath);

    log.Debug("Dataset Count: " + Ds.Tables[0].Rows.Count);
}
catch (Exception ex)
{
    log.Error("DocumentToXml : " + ex.Message);
}
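The same transform-then-load sequence can be tried without touching the file system, which makes it easier to see what ends up in the DataSet. The sketch below is not the code used in the project: TransformSketch, TransformToDataSet, the cut-down style sheet and the sample input are all hypothetical, and the style sheet only keeps the [#] test from the full one above.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TransformSketch
{
    private const string WordMl = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hypothetical cut-down style sheet: one <requirment> per run containing [#].
    public const string SampleStyleSheet =
        "<xsl:stylesheet version=\"1.0\"" +
        " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"" +
        " xmlns:w=\"" + WordMl + "\" exclude-result-prefixes=\"w\">" +
        "<xsl:template match=\"/\"><doc>" +
        "<xsl:for-each select=\"//w:p/w:r[contains(w:t, '[#]')]\">" +
        "<requirment><reqValue><xsl:value-of select=\"w:t\"/></reqValue></requirment>" +
        "</xsl:for-each></doc></xsl:template></xsl:stylesheet>";

    // Hypothetical input with a single tagged sentence.
    public const string SampleInput =
        "<w:wordDocument xmlns:w=\"" + WordMl + "\"><w:body>" +
        "<w:p><w:r><w:t>[#] The system shall log every login.</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Background discussion.</w:t></w:r></w:p>" +
        "</w:body></w:wordDocument>";

    // Applies a style sheet to an XML string and loads the result into a DataSet.
    public static DataSet TransformToDataSet(string inputXml, string styleSheet)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader styleSheetReader = XmlReader.Create(new StringReader(styleSheet)))
        {
            transform.Load(styleSheetReader);
        }

        StringWriter output = new StringWriter();
        using (XmlReader inputReader = XmlReader.Create(new StringReader(inputXml)))
        using (XmlWriter writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(inputReader, writer);
        }

        // The writer is disposed (and flushed) before the text is read back.
        DataSet result = new DataSet();
        result.ReadXml(new StringReader(output.ToString()));
        return result;
    }

    public static void Main()
    {
        DataSet ds = TransformToDataSet(SampleInput, SampleStyleSheet);
        if (ds.Tables.Count == 0)
        {
            Console.WriteLine("No requirements found.");
            return;
        }
        foreach (DataRow row in ds.Tables[0].Rows)
        {
            Console.WriteLine(row["reqValue"]);
        }
    }
}
```

The check on ds.Tables.Count is there because a result with no requirment elements may leave the DataSet without any tables, in which case indexing Tables[0] would throw.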

This entry was posted on May 10, 2007 at 5:29 pm and is filed under .NET 2.0, Radwa Nada.

6 Responses to “Parsing Word Document in C#”

Essam said

June 7, 2007 at 7:56 am
this is a great work actually, but I wonder why you did it all this hybrid way.. can’t u just use Office APIs or implement that as a Word Add-in or use VSTO ?

Word 2003 has a very good support for XML, and Documents can now be based on an XML Schema so they would look pretty and programmed easy.

Radwa said

June 11, 2007 at 2:01 pm
Well, I did a lot of research on using VSTO for extracting certain statements from a Word document but didn't reach a useful result. So I decided to do it my way, which is pretty hybrid as u named it… But for me it was actually the easiest way at that time. If u have any useful material covering different ways to do the job with better performance and less effort, please let us know :)

- asim said
  
  April 17, 2012 at 11:31 am
  Radwa, I am facing a big problem. Can u help me?

TechnicalArchitectureWorx

The (Unofficial) ITWorx Technical Architecture Blog
