Class ForkParser

  • All Implemented Interfaces:
    java.io.Closeable, java.io.Serializable, java.lang.AutoCloseable, Parser

    public class ForkParser
    extends AbstractParser
    implements java.io.Closeable
    See Also:
    Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      ForkParser()  
      ForkParser​(java.lang.ClassLoader loader)  
      ForkParser​(java.lang.ClassLoader loader, Parser parser)  
      ForkParser​(java.nio.file.Path tikaBin, ParserFactoryFactory factoryFactory)
      Use this constructor if you have a directory containing, say, tika-app.jar and you want the child process/server to build a parser and run it from that directory, so that you can keep all of those dependencies out of your client code.
      ForkParser​(java.nio.file.Path tikaBin, ParserFactoryFactory parserFactoryFactory, java.lang.ClassLoader classLoader)
      EXPERT
    • Method Summary

      All Methods Instance Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      void close()  
      java.lang.String getJavaCommand()
      Deprecated.
      since 1.8
      java.util.List<java.lang.String> getJavaCommandAsList()
      Returns the command used to start the forked server process.
      int getPoolSize()
      Returns the size of the process pool.
      java.util.Set<MediaType> getSupportedTypes​(ParseContext context)
      Returns the set of media types supported by this parser when used with the given parse context.
      void parse​(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata, ParseContext context)
      This sends the objects to the server for parsing, and the server via the proxies acts on the handler as if it were updating it directly.
      void setJavaCommand​(java.lang.String java)
      Deprecated.
      since 1.8
      void setJavaCommand​(java.util.List<java.lang.String> java)
      Sets the command used to start the forked server process.
      void setMaxFilesProcessedPerServer​(int maxFilesProcessedPerClient)
      If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted.
      void setPoolSize​(int poolSize)
      Sets the size of the process pool.
      void setServerParseTimeoutMillis​(long serverParseTimeoutMillis)
      The maximum amount of time allowed for the server to try to parse a file.
      void setServerPulseMillis​(long serverPulseMillis)
      The amount of time in milliseconds that the server should wait before checking to see if the parse has timed out or if the wait has timed out. The default is 5 seconds.
      void setServerWaitTimeoutMillis​(long serverWaitTimeoutMillis)
      The maximum amount of time allowed for the server to wait for a new request to parse a file.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • ForkParser

        public ForkParser​(java.nio.file.Path tikaBin,
                          ParserFactoryFactory factoryFactory)
        Use this constructor if you have a directory containing, say, tika-app.jar and you want the child process/server to build a parser and run it from that directory, so that you can keep all of those dependencies out of your client code.
        Parameters:
        tikaBin - directory containing tika-app.jar or a similar full jar that includes tika-core and all desired parsers and dependencies
        factoryFactory - the factory to use to generate the parser factory in the child process/server
      • ForkParser

        public ForkParser​(java.nio.file.Path tikaBin,
                          ParserFactoryFactory parserFactoryFactory,
                          java.lang.ClassLoader classLoader)
        EXPERT
        Parameters:
        tikaBin - directory containing tika-app.jar or a similar full jar that includes tika-core and all desired parsers and dependencies
        parserFactoryFactory - the factory to use to generate the parser factory in the child process/server
        classLoader - the class loader to use for all classes besides the parser in the child process/server
      • ForkParser

        public ForkParser​(java.lang.ClassLoader loader,
                          Parser parser)
        Parameters:
        loader - The ClassLoader to use
        parser - the parser to delegate to. This cannot be another ForkParser.
      • ForkParser

        public ForkParser​(java.lang.ClassLoader loader)
      • ForkParser

        public ForkParser()
    • Method Detail

      • getPoolSize

        public int getPoolSize()
        Returns the size of the process pool.
        Returns:
        process pool size
      • setPoolSize

        public void setPoolSize​(int poolSize)
        Sets the size of the process pool.
        Parameters:
        poolSize - process pool size
      • getJavaCommand

        @Deprecated
        public java.lang.String getJavaCommand()
        Deprecated.
        since 1.8
        Returns the command used to start the forked server process.
        Returns:
        java command line
        See Also:
        getJavaCommandAsList()
      • getJavaCommandAsList

        public java.util.List<java.lang.String> getJavaCommandAsList()
        Returns the command used to start the forked server process.

        The returned list is unmodifiable.

        Returns:
        java command line args
      • setJavaCommand

        public void setJavaCommand​(java.util.List<java.lang.String> java)
        Sets the command used to start the forked server process. The arguments "-jar" and "/path/to/bootstrap.jar" or "-cp" and "/path/to/tika_bin" are appended to the given command when starting the process. The default setting is {"java", "-Xmx32m"}.

        Creates a defensive copy.

        Parameters:
        java - java command line
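The behaviour described above (defensive copy on set, unmodifiable view on get, launcher arguments appended at start time) can be modelled with the standard library alone. The following is a minimal sketch, not Tika's implementation: the class name `JavaCommandSketch` and the helper `fullCommand` are hypothetical, and only the `-jar` variant of the appended arguments is shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in that mimics the documented contract of
// setJavaCommand(List) / getJavaCommandAsList(); it is not ForkParser itself.
final class JavaCommandSketch {
    // Documented default setting: {"java", "-Xmx32m"}
    private List<String> java = new ArrayList<>(List.of("java", "-Xmx32m"));

    void setJavaCommand(List<String> command) {
        // Defensive copy: later changes to the caller's list do not leak in.
        this.java = new ArrayList<>(command);
    }

    List<String> getJavaCommandAsList() {
        // Unmodifiable view: callers cannot change the stored command.
        return Collections.unmodifiableList(java);
    }

    // Hypothetical helper: appends "-jar" and the bootstrap jar path,
    // as the documentation says happens when the process is started.
    List<String> fullCommand(String bootstrapJar) {
        List<String> full = new ArrayList<>(java);
        full.add("-jar");
        full.add(bootstrapJar);
        return full;
    }

    public static void main(String[] args) {
        JavaCommandSketch sketch = new JavaCommandSketch();
        sketch.setJavaCommand(List.of("java", "-Xmx256m"));
        // prints [java, -Xmx256m, -jar, /path/to/bootstrap.jar]
        System.out.println(sketch.fullCommand("/path/to/bootstrap.jar"));
    }
}
```

Because each list element is passed through as one argument, an element may contain spaces (for example a JVM path under a directory with a space in its name) without being split.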
      • setJavaCommand

        @Deprecated
        public void setJavaCommand​(java.lang.String java)
        Deprecated.
        since 1.8
        Sets the command used to start the forked server process. The given command line is split on whitespace and the arguments "-jar" and "/path/to/bootstrap.jar" are appended to it when starting the process. The default setting is "java -Xmx32m".
        Parameters:
        java - java command line
        See Also:
        setJavaCommand(List)
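To make the whitespace splitting of the deprecated String variant concrete, here is a small stdlib-only sketch. It imitates the documented behaviour and is not taken from Tika's source; the class `LegacyJavaCommandSketch`, the method `toCommand`, and the example paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the deprecated setJavaCommand(String) behaviour.
final class LegacyJavaCommandSketch {
    // Splits the command line on whitespace, then appends "-jar" and the jar path.
    static List<String> toCommand(String java, String bootstrapJar) {
        List<String> command = new ArrayList<>(Arrays.asList(java.trim().split("\\s+")));
        command.add("-jar");
        command.add(bootstrapJar);
        return command;
    }

    public static void main(String[] args) {
        // prints [java, -Xmx32m, -jar, /path/to/bootstrap.jar]
        System.out.println(toCommand("java -Xmx32m", "/path/to/bootstrap.jar"));
        // A path containing a space is broken into two arguments:
        // prints [/opt/my, java/bin/java, -Xmx32m, -jar, /path/to/bootstrap.jar]
        System.out.println(toCommand("/opt/my java/bin/java -Xmx32m", "/path/to/bootstrap.jar"));
    }
}
```

The second call shows a practical consequence of splitting on whitespace: a JVM path that contains a space cannot be expressed with the String form, whereas setJavaCommand(List) passes each element through unchanged.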
      • getSupportedTypes

        public java.util.Set<MediaType> getSupportedTypes​(ParseContext context)
        Description copied from interface: Parser
        Returns the set of media types supported by this parser when used with the given parse context.
        Specified by:
        getSupportedTypes in interface Parser
        Parameters:
        context - parse context
        Returns:
        immutable set of media types
      • close

        public void close()
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
      • setServerPulseMillis

        public void setServerPulseMillis​(long serverPulseMillis)
        The amount of time in milliseconds that the server should wait before checking to see if the parse has timed out or if the wait has timed out. The default is 5 seconds.
        Parameters:
        serverPulseMillis - milliseconds to sleep before checking if there has been any activity
      • setServerParseTimeoutMillis

        public void setServerParseTimeoutMillis​(long serverParseTimeoutMillis)
        The maximum amount of time allowed for the server to try to parse a file. If more than this time elapses, the server shuts down, and the ForkParser throws an exception.
        Parameters:
        serverParseTimeoutMillis - maximum time, in milliseconds, that the server may spend parsing a single file
      • setServerWaitTimeoutMillis

        public void setServerWaitTimeoutMillis​(long serverWaitTimeoutMillis)
        The maximum amount of time allowed for the server to wait for a new request to parse a file. The server will shut down after this amount of time, and a new server will have to be started by a new client.
        Parameters:
        serverWaitTimeoutMillis - maximum time, in milliseconds, that the server may wait for a new parse request
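The three timing settings (pulse, parse timeout, wait timeout) interact: the server only notices a timeout when it next wakes up to check. The sketch below is a simplified model of that interaction under stated assumptions, not Tika's server code. The class `ServerWatchdogSketch`, its methods, and the timeout values in `main` are hypothetical; only the 5 second pulse matches a default stated in this page.

```java
// Hypothetical model of a server that wakes up every pulseMillis and checks
// whether the current parse, or the current idle wait, has run too long.
final class ServerWatchdogSketch {
    enum Status { OK, PARSE_TIMED_OUT, WAIT_TIMED_OUT }

    private final long pulseMillis;
    private final long parseTimeoutMillis;
    private final long waitTimeoutMillis;

    ServerWatchdogSketch(long pulseMillis, long parseTimeoutMillis, long waitTimeoutMillis) {
        if (pulseMillis <= 0) {
            throw new IllegalArgumentException("pulseMillis must be positive");
        }
        this.pulseMillis = pulseMillis;
        this.parseTimeoutMillis = parseTimeoutMillis;
        this.waitTimeoutMillis = waitTimeoutMillis;
    }

    // elapsedMillis is time since the parse began (parsing == true)
    // or since the server became idle (parsing == false).
    Status check(boolean parsing, long elapsedMillis) {
        if (parsing && elapsedMillis > parseTimeoutMillis) {
            return Status.PARSE_TIMED_OUT;
        }
        if (!parsing && elapsedMillis > waitTimeoutMillis) {
            return Status.WAIT_TIMED_OUT;
        }
        return Status.OK;
    }

    // Simulates pulse-by-pulse checking and returns the elapsed time at which
    // the timeout is first observed. No real sleeping takes place.
    long detectionDelayMillis(boolean parsing) {
        long elapsed = 0;
        while (check(parsing, elapsed) == Status.OK) {
            elapsed += pulseMillis; // one simulated sleep between checks
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // 5 s pulse (the documented default); both timeouts are made-up values.
        ServerWatchdogSketch watchdog = new ServerWatchdogSketch(5_000, 12_000, 60_000);
        System.out.println(watchdog.detectionDelayMillis(true));  // prints 15000
        System.out.println(watchdog.detectionDelayMillis(false)); // prints 65000
    }
}
```

In this model a timeout can be detected up to one pulse late, so a shorter pulse gives tighter timeouts at the cost of more frequent wake-ups.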
      • setMaxFilesProcessedPerServer

        public void setMaxFilesProcessedPerServer​(int maxFilesProcessedPerClient)
        If there is a slowly building memory leak in one of the parsers, it is useful to set a limit on the number of files processed by a server before it is shut down and restarted. The default value is -1.
        Parameters:
        maxFilesProcessedPerClient - maximum number of files that a server can handle before the parser shuts down a client and creates a new process. If set to -1, the server is never restarted because of the number of files handled.
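The restart rule can be pictured as a simple counter. This is an illustrative sketch rather than Tika's implementation: the class `RestartPolicySketch` and its method are hypothetical, and treating every value below 1 (not only the documented -1) as "no limit" is an assumption made here.

```java
// Hypothetical model of the "restart after N files" rule.
final class RestartPolicySketch {
    private final int maxFilesProcessedPerServer;
    private int filesProcessed;

    RestartPolicySketch(int maxFilesProcessedPerServer) {
        this.maxFilesProcessedPerServer = maxFilesProcessedPerServer;
    }

    // Records one processed file and reports whether the server should be
    // shut down and replaced before the next file.
    boolean recordFileAndCheckRestart() {
        filesProcessed++;
        if (maxFilesProcessedPerServer < 1) {
            // -1 is the documented "never restart" value; treating 0 and other
            // negative values the same way is an assumption of this sketch.
            return false;
        }
        if (filesProcessed >= maxFilesProcessedPerServer) {
            filesProcessed = 0; // the replacement server starts counting afresh
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RestartPolicySketch policy = new RestartPolicySketch(3);
        for (int i = 1; i <= 4; i++) {
            // prints: 1 false, 2 false, 3 true, 4 false
            System.out.println(i + " " + policy.recordFileAndCheckRestart());
        }
    }
}
```

A small limit bounds how much leaked memory a single server can accumulate, at the price of paying the process start-up cost more often.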