Class TextStatistics


  • public class TextStatistics
    extends java.lang.Object
    Utility class for computing a histogram of the bytes seen in a stream.
    Since:
    Apache Tika 1.2
    • Constructor Summary

      Constructors 
      Constructor Description
      TextStatistics()  
    • Method Summary

      Modifier and Type Method Description
      void addData(byte[] buffer, int offset, int length)
      int count()
      Returns the total number of bytes seen so far.
      int count(int b)
      Returns the number of occurrences of the given byte.
      int countControl()
      Counts control characters (i.e. < 0x20, excluding tab, CR, LF, form feed and escape).
      int countEightBit()
      Counts eight-bit characters, i.e. bytes with their highest bit set.
      int countSafeAscii()
      Counts "safe" (i.e. seven-bit non-control) ASCII characters.
      boolean isMostlyAscii()
      Checks whether at least one byte was seen and whether the bytes seen were mostly plain text (i.e. fewer than 2% control characters and more than 90% in the ASCII range).
      boolean looksLikeUTF8()
      Checks whether the observed byte stream looks like UTF-8 encoded text.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextStatistics

        public TextStatistics()
    • Method Detail

      • addData

        public void addData(byte[] buffer,
                            int offset,
                            int length)
      • isMostlyAscii

        public boolean isMostlyAscii()
        Checks whether at least one byte was seen and whether the bytes seen were mostly plain text (i.e. fewer than 2% control characters and more than 90% in the ASCII range).
        Returns:
        true if the seen bytes were mostly safe ASCII, false otherwise
        See Also:
        TIKA-483, TIKA-688
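
        The two thresholds can be read as integer comparisons over the histogram counts. The following is a minimal sketch, not the Tika source: the class name MostlyAsciiSketch, its parameters, and the reading of "ASCII range" as the safe ASCII count are assumptions made for illustration.

```java
// Hypothetical illustration of the documented thresholds; not the Tika implementation.
class MostlyAsciiSketch {

    /**
     * @param total     number of bytes seen
     * @param control   number of control bytes (as counted by countControl())
     * @param safeAscii number of safe ASCII bytes (assumed meaning of "ASCII range")
     */
    static boolean isMostlyAscii(int total, int control, int safeAscii) {
        // Widen to long so that large counts cannot overflow when multiplied.
        return total > 0                           // at least one byte was seen
                && control * 100L < total * 2L     // fewer than 2% control bytes
                && safeAscii * 100L > total * 90L; // more than 90% safe ASCII bytes
    }
}
```

        Under these assumptions, 100 bytes with 1 control byte and 95 safe ASCII bytes pass the check, while an empty stream never does.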
      • looksLikeUTF8

        public boolean looksLikeUTF8()
        Checks whether the observed byte stream looks like UTF-8 encoded text.
        Returns:
        true if the seen bytes look like UTF-8, false otherwise
        Since:
        Apache Tika 1.3
      • count

        public int count()
        Returns the total number of bytes seen so far.
        Returns:
        count of all bytes
      • count

        public int count(int b)
        Returns the number of occurrences of the given byte.
        Parameters:
        b - byte
        Returns:
        count of the given byte
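
        This page gives no description for addData. Read together with the two count methods, a byte histogram of this shape could look like the sketch below. The class ByteHistogramSketch is hypothetical; in particular, treating offset and length as the usual Java buffer range and masking b to 0..255 are assumptions, not documented behavior.

```java
// Hypothetical stand-in for illustration; not the Tika source.
class ByteHistogramSketch {
    private final int[] counts = new int[256]; // one bucket per unsigned byte value
    private int total = 0;                     // number of bytes added so far

    /** Adds buffer[offset] .. buffer[offset + length - 1] to the histogram (assumed semantics). */
    void addData(byte[] buffer, int offset, int length) {
        for (int i = 0; i < length; i++) {
            // Java bytes are signed, so mask to get an index in 0..255.
            counts[buffer[offset + i] & 0xff]++;
            total++;
        }
    }

    /** Total number of bytes added. */
    int count() {
        return total;
    }

    /** Number of occurrences of the byte value b (masked to 0..255 by assumption). */
    int count(int b) {
        return counts[b & 0xff];
    }
}
```

        Keeping one counter per byte value means every other statistic on this page can be derived by summing buckets, without storing the stream itself.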
      • countControl

        public int countControl()
        Counts control characters (i.e. < 0x20, excluding tab, CR, LF, form feed and escape).

        This definition of control characters is based on section 4 of the "Content-Type Processing Model" Internet-draft (draft-abarth-mime-sniff-01).

         +-------------------------+
         | Binary data byte ranges |
         +-------------------------+
         | 0x00 -- 0x08            |
         | 0x0B                    |
         | 0x0E -- 0x1A            |
         | 0x1C -- 0x1F            |
         +-------------------------+
         
        Returns:
        count of control characters
        See Also:
        TIKA-154
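
        The table translates directly into a range check. The following is a minimal sketch, assuming bytes are compared as unsigned values; ControlByteSketch and its methods are hypothetical names, not part of Tika.

```java
// Hypothetical illustration of the byte ranges in the table; not the Tika implementation.
class ControlByteSketch {

    /** True if the unsigned value of b falls in one of the four listed binary ranges. */
    static boolean isControl(int b) {
        int v = b & 0xff; // treat the byte as unsigned
        return v <= 0x08                     // 0x00 -- 0x08
                || v == 0x0B                 // 0x0B
                || (v >= 0x0E && v <= 0x1A)  // 0x0E -- 0x1A
                || (v >= 0x1C && v <= 0x1F); // 0x1C -- 0x1F
    }

    /** Counts the bytes in data that fall in the binary ranges. */
    static int countControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (isControl(b)) {
                n++;
            }
        }
        return n;
    }
}
```

        Note that vertical tab (0x0B) is on the binary side of the table, so 27 of the 32 values below 0x20 count as control characters.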
      • countSafeAscii

        public int countSafeAscii()
        Counts "safe" (i.e. seven-bit non-control) ASCII characters.
        Returns:
        count of safe ASCII characters
        See Also:
        countControl()
      • countEightBit

        public int countEightBit()
        Counts eight-bit characters, i.e. bytes with their highest bit set (0x80 to 0xFF).
        Returns:
        count of eight bit characters
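
        Taken together, the three counters suggest a three-way split of every byte. The sketch below tallies a buffer that way. It assumes that "safe" ASCII means every seven-bit byte outside the binary ranges listed for countControl(), so that the three counts add up to the total; the class ByteClassSketch is hypothetical.

```java
// Hypothetical illustration; not the Tika implementation.
class ByteClassSketch {
    int control;   // bytes in the binary ranges listed under countControl()
    int safeAscii; // remaining seven-bit bytes (assumption)
    int eightBit;  // bytes with their highest bit set

    /** Classifies each byte of data into exactly one of the three counters. */
    void add(byte[] data) {
        for (byte raw : data) {
            int v = raw & 0xff; // unsigned value 0..255
            if ((v & 0x80) != 0) {
                eightBit++;
            } else if (v <= 0x08 || v == 0x0B
                    || (v >= 0x0E && v <= 0x1A)
                    || (v >= 0x1C && v <= 0x1F)) {
                control++;
            } else {
                safeAscii++;
            }
        }
    }
}
```

        For example, the UTF-8 bytes of "café" followed by a line feed and a NUL byte would split into two eight-bit bytes (the encoded é), one control byte (NUL) and four safe ASCII bytes.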