Class WordExtractor

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public final class WordExtractor
    extends POIOLE2TextExtractor
    Extracts the text from a Word document. Prefer getParagraphText() or getText() unless you have a strong reason to use one of the other methods.
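As a minimal usage sketch, assuming Apache POI's HWPF (scratchpad) module is on the classpath and that this class is imported from org.apache.poi.hwpf.extractor, the two recommended methods might be called as follows. The helper class WordExtractorUsage and its method names are hypothetical; only WordExtractor, its InputStream constructor, getText() and getParagraphText() come from this page.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper class, for illustration only.
final class WordExtractorUsage {

    // Returns the whole document as a single string via getText().
    static String readAll(InputStream in) throws IOException {
        // WordExtractor implements Closeable, so try-with-resources closes it.
        try (WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    // Returns one string per paragraph via getParagraphText().
    static String[] readParagraphs(InputStream in) throws IOException {
        try (WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getParagraphText();
        }
    }
}
```

A caller would typically pass a stream opened on a .doc file, for example one obtained from java.nio.file.Files.newInputStream.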
    • Constructor Detail

      • WordExtractor

        public WordExtractor(java.io.InputStream is)
                      throws java.io.IOException
        Creates a new WordExtractor from an input stream.
        Parameters:
        is - InputStream containing the Word file
        Throws:
        java.io.IOException
      • WordExtractor

        public WordExtractor(POIFSFileSystem fs)
                      throws java.io.IOException
        Creates a new WordExtractor from a POIFS file system.
        Parameters:
        fs - POIFSFileSystem containing the Word file
        Throws:
        java.io.IOException
      • WordExtractor

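        public WordExtractor(DirectoryNode dir)
                      throws java.io.IOException
        Creates a new WordExtractor from a directory node.
        Parameters:
        dir - DirectoryNode containing the Word file
        Throws:
        java.io.IOException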
      • WordExtractor

        public WordExtractor(HWPFDocument doc)
        Creates a new WordExtractor from an already loaded document.
        Parameters:
        doc - The HWPFDocument to extract from
    • Method Detail

      • main

        public static void main(java.lang.String[] args)
                         throws java.io.IOException
        Command-line entry point, so that the extractor can be run directly.
        Throws:
        java.io.IOException
      • getParagraphText

        public java.lang.String[] getParagraphText()
        Gets the text from the Word file as an array with one String per paragraph.
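Callers often post-process the returned array before using it. The sketch below is plain Java and assumes that entries may carry trailing line terminators or be empty; that assumption, and the names ParagraphCleanup and nonEmptyParagraphs, are illustrative and not taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; works on any String[] such as the one getParagraphText() returns.
final class ParagraphCleanup {

    // Strips trailing CR/LF characters from each entry and drops entries that end up empty.
    static List<String> nonEmptyParagraphs(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null) {
                continue; // defensive: skip missing entries
            }
            int end = paragraph.length();
            while (end > 0
                    && (paragraph.charAt(end - 1) == '\r' || paragraph.charAt(end - 1) == '\n')) {
                end--;
            }
            if (end > 0) {
                result.add(paragraph.substring(0, end));
            }
        }
        return result;
    }
}
```

With an extractor instance, this might be used as ParagraphCleanup.nonEmptyParagraphs(extractor.getParagraphText()).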
      • getFootnoteText

        public java.lang.String[] getFootnoteText()
      • getMainTextboxText

        public java.lang.String[] getMainTextboxText()
      • getEndnoteText

        public java.lang.String[] getEndnoteText()
      • getCommentsText

        public java.lang.String[] getCommentsText()
      • getHeaderText

        @Deprecated
        public java.lang.String getHeaderText()
        Deprecated since 3.8 beta 4.
        Grabs the text from the headers.
      • getFooterText

        @Deprecated
        public java.lang.String getFooterText()
        Deprecated since 3.8 beta 4.
        Grabs the text from the footers.
      • getTextFromPieces

        public java.lang.String getTextFromPieces()
        Grabs the text directly from the text pieces. The result may include stray non-text content, but this method still works when the mapping from text pieces to paragraphs is broken. It is also fast.
      • getText

        public java.lang.String getText()
        Grabs the text using the WordToTextConverter. The result should not include stray non-text content, but this method is slower than getTextFromPieces().
        Specified by:
        getText in class POITextExtractor
        Returns:
        All the text from the document
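The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the cleaner method first and use the more tolerant one only if it fails. The sketch below is plain Java that takes both operations as suppliers, so it does not depend on the library. The names TextFallback and extract are hypothetical, and so is the assumption that a failure in the primary path surfaces as a RuntimeException.

```java
import java.util.function.Supplier;

// Hypothetical helper illustrating a primary/fallback extraction strategy.
final class TextFallback {

    // Returns the primary result, or the fallback result if the primary throws a RuntimeException.
    static String extract(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Assumption: a damaged document makes the primary path throw
            // rather than silently return bad text.
            return fallback.get();
        }
    }
}
```

With an extractor instance, this might be called as TextFallback.extract(extractor::getText, extractor::getTextFromPieces).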
      • stripFields

        public static java.lang.String stripFields(java.lang.String text)
        Removes any fields (e.g. macros, page markers) from the string.
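To make the notion of a field concrete: in the Word binary format a field is bracketed by the control characters 0x13 (begin) and 0x15 (end), with an optional 0x14 separating the field instruction from its displayed result. The sketch below is a simplified illustration of stripping simple, non-nested fields while keeping their displayed result. It is not the library's implementation, the names FieldStripSketch and stripSimpleFields are hypothetical, and nested fields may be treated differently by stripFields.

```java
// Hypothetical, simplified illustration; not the library's stripFields implementation.
final class FieldStripSketch {

    private static final char FIELD_BEGIN = '\u0013';     // starts the field instruction
    private static final char FIELD_SEPARATOR = '\u0014'; // instruction ends, displayed result begins
    private static final char FIELD_END = '\u0015';       // closes the field

    // Drops field marker characters and field instructions, keeping ordinary text
    // and the displayed result of each field.
    static String stripSimpleFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inInstruction = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction = true;
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inInstruction = false;
            } else if (!inInstruction) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

In real code, call WordExtractor.stripFields(text) rather than a hand-written routine like this one.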