Filter for WordML vs Filter for XML vs Filter for Text

Discussion in 'microsoft.public.sqlserver.fulltext' started by Brian, Apr 3, 2007.

  1. Brian

    Brian Guest

    We currently are saving WordML (Word 2003's XML format rather than binary
    format) docs into a column of our SQL Server 2005 database and full text
    indexing on it. Works okay, but not great. For example, all the XML tags
    are indexed... so you find many words that are not in the document (from the
    user's perspective).

    The fix for that is clear... use an XML filter. (Though how to do that is
    not so clear... we tried moving the WordML from a Text column to an XML
    column, but it gives errors on illegal characters in the Word doc.)

    However, it would seem that even an XML filter will not do nearly as well as
    a filter designed for WordML.

    Can anyone point me to a WordML full-text-search filter for SQL Server
    2005?

    Thanks,
    Brian
     
    Brian, Apr 3, 2007
    #1
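To illustrate the gap the original post describes between "all the XML tags are indexed" and what the user actually sees, here is a minimal Python sketch that pulls only the visible text runs out of a WordML document. It assumes the Word 2003 WordprocessingML namespace and that visible text lives in `w:t` elements inside `w:p` paragraphs; the function name `wordml_visible_text` and the sample document are made up for illustration. It is not an iFilter and does not show how SQL Server's own filters behave.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordprocessingML (WordML) documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def wordml_visible_text(xml_bytes: bytes) -> str:
    """Return the text runs (w:t) of a WordML document, one paragraph per line.

    Hypothetical helper: it ignores headers, footers, fields and other
    content a real WordML-aware filter would have to handle.
    """
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Join every text run in the paragraph, skipping empty runs.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Tiny hand-written sample, not output from Word itself.
sample = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    b"<w:body>"
    b"<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    b"<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    b"</w:body></w:wordDocument>"
)

if __name__ == "__main__":
    print(wordml_visible_text(sample))
```

Text extracted this way could be stored in a separate column and indexed as plain text, which is one possible workaround if no WordML-specific filter turns up.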

  2. Have you tried storing them as .doc files in varbinary(max) or image columns
    and indexing them with the Word iFilter?
     
    Hilary Cotter, Apr 3, 2007
    #2

  3. Brian

    Brian Guest


    No, for our app we need to keep them as .xml files, not .doc files.
    But we could put the Word .xml file in a varbinary column if there is a
    WordML iFilter out there that would act on the .xml file.

    Thanks,

    Brian
     
    Brian, Apr 3, 2007
    #3
  4. denistc

    denistc Guest

    Hi Brian,
    have you tried storing your WordML files in a column of type XML? That
    should invoke the XML filter.
    Best regards,
    -Denis.
     
    denistc, Apr 14, 2007
    #4
  5. Brian

    Brian Guest


    Yes, though we've had trouble with some WordML files being rejected.
    We've narrowed the issue on that front... any WordML file that has "UTF-8"
    in the docheader gets rejected... not sure why though. Also not sure why
    Word is creating some files with that encoding. Simply removing the tag
    from the docheader (without any other re-encoding) seems to work fine, oddly
    enough.

    We're a bit nervous to move to the XML column until we understand what
    conditions might cause XML to reject a WordML doc coming out of Word...
    because without that understanding, we can't be sure it won't happen to our
    customers.

    Thanks for the suggestion,

    Brian
     
    Brian, Apr 16, 2007
    #5
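The workaround described in post #5, removing the "UTF-8" marker from the document header, can be sketched in Python as below. One possible explanation, not confirmed anywhere in this thread, is that the declared encoding disagrees with the encoding of the string actually handed to the database; dropping the declaration would then avoid the mismatch. The function name `strip_encoding_declaration` is hypothetical, and the sketch assumes the "docheader" is the standard XML declaration at the very start of the file.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration,
# e.g. <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
_ENCODING_DECL = re.compile(
    r"""^(<\?xml\b[^?]*?)\s+encoding\s*=\s*(["'])[^"']*\2""",
    re.IGNORECASE,
)


def strip_encoding_declaration(xml_text: str) -> str:
    """Remove only the encoding="..." part of the XML declaration.

    The rest of the declaration and the document body are left untouched.
    Text without a leading XML declaration is returned unchanged.
    """
    return _ENCODING_DECL.sub(r"\1", xml_text, count=1)


if __name__ == "__main__":
    original = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'
    print(strip_encoding_declaration(original))
```

Decoding the file's bytes to a Unicode string first (for example with `data.decode("utf-8")`) and only then stripping the declaration keeps the characters intact, whereas editing the header of a file that is not really UTF-8 would only hide the underlying problem.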
  6. Mike C#

    Mike C# Guest

    Haven't done much with WordML myself, but assuming the WordML file has
    "UTF-8" in the docheader, I'd be interested to know if it actually is a
    UTF-8 file? Or is it possible it is being saved with an incorrect BOM or
    invalid (non-UTF-8-encoded) characters in the doc? One of those two would
    be my first guess. If so, that might be a bug that needs to be reported to
    MS. You might try opening that WordML file in a hex editor to verify what
    is actually being stored.
     
    Mike C#, May 7, 2007
    #6
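As an alternative to the hex editor suggested in post #6, the same two checks (which BOM is present, and whether the bytes really are valid UTF-8) can be scripted. This is a minimal sketch; the function name `inspect_encoding` and the file path in the usage comment are hypothetical, and it only distinguishes UTF-8 and UTF-16 BOMs.

```python
import codecs


def inspect_encoding(data: bytes) -> dict:
    """Report the BOM found (if any) and whether the bytes decode as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        bom = "utf-8"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom = "utf-16"
    else:
        bom = None

    try:
        data.decode("utf-8")
        valid, bad_offset = True, None
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first undecodable byte.
        valid, bad_offset = False, exc.start

    return {"bom": bom, "valid_utf8": valid, "first_bad_offset": bad_offset}


if __name__ == "__main__":
    # Usage with a real file (hypothetical path):
    #   with open("rejected.xml", "rb") as f:
    #       print(inspect_encoding(f.read()))
    print(inspect_encoding(b"\xef\xbb\xbf<doc/>"))
    print(inspect_encoding(b"<a>caf\xe9</a>"))
```

A file that declares UTF-8 but reports `valid_utf8` as false, or carries a UTF-16 BOM, would point to the mismatch suspected in post #6.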