Class TesseractOCRParser

  • All Implemented Interfaces:
    java.io.Serializable, Initializable, Parser

    public class TesseractOCRParser
    extends AbstractParser
    implements Initializable
    TesseractOCRParser is powered by the tesseract-ocr engine. To enable this parser, create a TesseractOCRConfig object and pass it through a ParseContext. Tesseract-ocr must be installed and either be on the system path, or the path to its root folder must be provided:

    TesseractOCRConfig config = new TesseractOCRConfig();
    // Needed only if tesseract is not on the system path
    config.setTesseractPath(tesseractFolder);
    parseContext.set(TesseractOCRConfig.class, config);
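
    The path requirement above can be checked from the calling code before the parser is configured. The following is a minimal sketch that uses only the JDK and is not part of Tika: the class name TesseractLocator and the method findFolder are hypothetical, and the executable names tesseract and tesseract.exe are assumptions about a typical installation. It scans a PATH-style string and returns the folder that holds the executable, which could then be passed to setTesseractPath.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika class.
public class TesseractLocator {

    /**
     * Returns the first folder listed in pathValue that contains an
     * executable file with one of the given names, or empty if none does.
     */
    public static Optional<Path> findFolder(String pathValue, String... executableNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : executableNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate.getParent());
                    }
                } catch (InvalidPathException e) {
                    // Malformed PATH entry: skip it and keep looking.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Executable names are assumptions; adjust for the local installation.
        Optional<Path> folder = findFolder(System.getenv("PATH"), "tesseract", "tesseract.exe");
        if (folder.isPresent()) {
            System.out.println("tesseract found in " + folder.get());
        } else {
            System.out.println("tesseract not on the system path; provide its folder explicitly");
        }
    }
}
```

    If the lookup returns empty, the caller would need to supply the folder through config.setTesseractPath, as in the snippet above.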

    See Also:
    Serialized Form
    • Constructor Detail

      • TesseractOCRParser

        public TesseractOCRParser()
    • Method Detail

      • getSupportedTypes

        public java.util.Set<MediaType> getSupportedTypes(ParseContext context)
        Description copied from interface: Parser
        Returns the set of media types supported by this parser when used with the given parse context.
        Specified by:
        getSupportedTypes in interface Parser
        Parameters:
        context - parse context
        Returns:
        immutable set of media types
      • parse

        public void parse(java.awt.Image image,
                          org.xml.sax.ContentHandler handler,
                          Metadata metadata,
                          ParseContext context)
                   throws java.io.IOException,
                          org.xml.sax.SAXException,
                          TikaException
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
        TikaException
      • parse

        public void parse(java.io.InputStream stream,
                          org.xml.sax.ContentHandler handler,
                          Metadata metadata,
                          ParseContext parseContext)
                   throws java.io.IOException,
                          org.xml.sax.SAXException,
                          TikaException
        Description copied from interface: Parser
        Parses a document stream into a sequence of XHTML SAX events. Fills in related document metadata in the given metadata object.

        The given document stream is consumed but not closed by this method. The responsibility to close the stream remains on the caller.

        Information about the parsing context can be passed in the context parameter. See the parser implementations for the kinds of context information they expect.

        Specified by:
        parse in interface Parser
        Parameters:
        stream - the document stream (input)
        handler - handler for the XHTML SAX events (output)
        metadata - document metadata (input and output)
        parseContext - parse context
        Throws:
        java.io.IOException - if the document stream could not be read
        org.xml.sax.SAXException - if the SAX events could not be processed
        TikaException - if the document could not be parsed
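
        The stream ownership rule described above (the method consumes the stream, the caller closes it) can be illustrated with a short JDK-only sketch. Nothing here is Tika API: CallerOwnsStream, TrackingStream, consume, and demo are hypothetical names, and consume merely stands in for a parse call that reads to the end of the stream without closing it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration, not a Tika class.
public class CallerOwnsStream {

    /** Wrapper that records whether close() has been called. */
    public static class TrackingStream extends FilterInputStream {
        public boolean closed = false;

        public TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse call: reads the stream to the end, never closes it. */
    public static long consume(InputStream stream) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** Returns { closed right after consume, closed after the try block }. */
    public static boolean[] demo() {
        TrackingStream tracked = new TrackingStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        boolean closedAfterConsume = false;
        // The caller owns the stream, so the caller wraps it in try-with-resources.
        try (InputStream stream = tracked) {
            consume(stream);
            closedAfterConsume = tracked.closed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] {closedAfterConsume, tracked.closed};
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("closed after consume: " + result[0]);
        System.out.println("closed after try block: " + result[1]);
    }
}
```

        In calling code, the same try-with-resources pattern would wrap the stream passed to parse, so the stream is released even if parsing throws.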
      • parseInline

        public void parseInline(java.io.InputStream stream,
                                XHTMLContentHandler xhtml,
                                ParseContext parseContext,
                                TesseractOCRConfig config)
                         throws java.io.IOException,
                                org.xml.sax.SAXException,
                                TikaException
        Use this to parse content without starting a new document. It appends SAX events to xhtml without re-emitting the metadata, the body start element, etc.
        Parameters:
        stream - the input stream to parse
        xhtml - the handler to which the SAX events are appended
        parseContext - parse context
        config - TesseractOCRConfig to use for this parse
        Throws:
        java.io.IOException
        org.xml.sax.SAXException
        TikaException
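
        The difference between a full parse and an inline parse, as described above, comes down to which SAX events are emitted. The sketch below uses only the JDK SAX classes and is not Tika code: InlineEventsSketch, Recorder, emitInline, emitDocument, and demo are hypothetical names, and the use of a div element per piece of content is an assumption made for illustration. The inline method emits only content events into an already open document, while the document method adds the surrounding start and end events once.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not a Tika class.
public class InlineEventsSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Handler that records the name of every event it receives. */
    public static class Recorder extends DefaultHandler {
        public final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("<" + local + ">");
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.add("</" + local + ">");
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }
    }

    /** Inline style: content events only, no document or body events. */
    public static void emitInline(ContentHandler handler, String text) throws SAXException {
        handler.startElement(XHTML, "div", "div", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "div", "div");
    }

    /** Document style: opens the document and body once, then delegates inline. */
    public static void emitDocument(ContentHandler handler, String... texts) throws SAXException {
        handler.startDocument();
        handler.startElement(XHTML, "body", "body", new AttributesImpl());
        for (String text : texts) {
            emitInline(handler, text);
        }
        handler.endElement(XHTML, "body", "body");
        handler.endDocument();
    }

    /** Emits two pieces of content into one document and returns the recorded events. */
    public static List<String> demo() {
        Recorder recorder = new Recorder();
        try {
            emitDocument(recorder, "page one", "page two");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

        This shape is useful when one parser embeds OCR output for several images inside a document it has already started, since each document may contain only one pair of start and end document events.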
      • setTesseractPath

        @Field
        public void setTesseractPath(java.lang.String tesseractPath)
      • setTessdataPath

        @Field
        public void setTessdataPath(java.lang.String tessdataPath)
      • setLanguage

        @Field
        public void setLanguage(java.lang.String language)
      • setPageSegMode

        @Field
        public void setPageSegMode(java.lang.String pageSegMode)
      • setMaxFileSizeToOcr

        @Field
        public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
      • setMinFileSizeToOcr

        @Field
        public void setMinFileSizeToOcr(long minFileSizeToOcr)
      • setTimeout

        @Field
        public void setTimeout(int timeout)
      • setOutputType

        @Field
        public void setOutputType(java.lang.String outputType)
      • setPreserveInterwordSpacing

        @Field
        public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
      • setEnableImageProcessing

        @Field
        public void setEnableImageProcessing(int enableImageProcessing)
      • setImageMagickPath

        @Field
        public void setImageMagickPath(java.lang.String imageMagickPath)
      • setDensity

        @Field
        public void setDensity(int density)
      • setDepth

        @Field
        public void setDepth(int depth)
      • setColorspace

        @Field
        public void setColorspace(java.lang.String colorspace)
      • setFilter

        @Field
        public void setFilter(java.lang.String filter)
      • setResize

        @Field
        public void setResize(int resize)
      • setApplyRotation

        @Field
        public void setApplyRotation(boolean applyRotation)