Skip to content
/ tika Public

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

License

Notifications You must be signed in to change notification settings

apache/tika

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10,066 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Welcome to Apache Tika https://tika.apache.org/

license Jenkins Jenkins tests Maven Central

Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika is a project of the Apache Software Foundation.

Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.

Quick Start

Parse a file in Java:

import org.apache.tika.Tika;

Tika tika = new Tika();
String text = tika.parseToString(new File("document.pdf"));
System.out.println(text);

From the command line:

java -jar tika-app-*.jar --text document.pdf

Maven dependency:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>4.x.y</version>
</dependency>

Getting Started

Pre-built binaries of Apache Tika standalone applications are available from https://tika.apache.org/download.html . Pre-built binaries of all the Tika jars can be fetched from Maven Central or your favourite Maven mirror.

Tika 2.X and support for Java 8 reached End of Life (EOL) in April, 2025. See Tika Roadmap 2.x, 3.x and beyond.

Tika is based on Java 17 and uses the Maven 3 build system. N.B. Docker is used for tests in tika-integration-tests. If Docker is not installed, those tests are skipped.

To build Tika from source, use the following command in the main directory:

./mvnw clean install

The Maven wrapper (mvnw) is included in the repository and will automatically download the correct Maven version if needed. On Windows, use mvnw.cmd instead.

The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:

java -jar tika-app/target/tika-app-*.jar --help

To build a specific project (for example, tika-server-standard):

./mvnw clean install -am -pl :tika-server-standard

If the ossindex-maven-plugin is causing the build to fail because a dependency has now been discovered to have a vulnerability:

./mvnw clean install -Dossindex.skip

Faster Builds

Fast profile - Use -Pfast to skip tests, checkstyle, and spotless:

./mvnw clean install -Pfast

Parallel builds - Add -T1C to build with 1 thread per CPU core:

./mvnw clean install -Pfast -T1C

Maven Daemon (mvnd) - Keeps a warm JVM running for 2-3x faster rebuilds:

# Install: https://github.com/apache/maven-mvnd
# macOS: brew install mvndaemon/tap/mvnd

# Use exactly like mvn
mvnd clean install -Pfast
mvnd test -pl :tika-core

Combine both for maximum speed during development:

mvnd clean install -Pfast -T1C

Reproducible Builds

Apache Tika supports reproducible builds. This means that building the same source code with the same JDK version should produce byte-for-byte identical artifacts, regardless of the build machine or time.

Key configuration:

  • project.build.outputTimestamp is set in tika-parent/pom.xml
  • All Maven plugins are configured to produce deterministic output

To verify the build plan supports reproducibility:

./mvnw artifact:check-buildplan

To verify two builds produce identical artifacts:

./mvnw clean install -DskipTests
mv ~/.m2/repository/org/apache/tika tika-build-1
./mvnw clean install -DskipTests
diff -r tika-build-1 ~/.m2/repository/org/apache/tika

Maven Dependencies

Apache Tika provides Bill of Material (BOM) artifact to align Tika module versions and simplify version management. To avoid convergence errors in your own project, import this bom or Tika's parent pom.xml in your dependency management section.

If you use Apache Maven:

<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-bom</artifactId>
       <version>4.x.y</version>
       <type>pom</type>
       <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <!-- version not required since BOM included -->
    </dependency>
  </dependencies>
</project>

For Gradle:

dependencies {
  implementation(platform("org.apache.tika:tika-bom:4.x.y"))

  // version not required since bom (platform in Gradle terms)
  implementation("org.apache.tika:tika-parsers-standard-package")
}

Migrating to 4.x

TBD

Contributing

See CONTRIBUTING.md and https://tika.apache.org/contribute.html

contributors

Building from a Specific Tag

Let's assume that you want to build the 3.0.1 tag:

0. Download and install hub.github.com
1. git clone https://github.com/apache/tika.git
2. cd tika
3. git checkout 3.0.1
4. ./mvnw clean install

If a new vulnerability has been discovered between the date of the tag and the date you are building the tag, you may need to build with:

4. ./mvnw clean install -Dossindex.skip

If a local test is not working in your environment, please notify the project at dev@tika.apache.org. As an immediate workaround, you can turn off individual tests with e.g.:

4. ./mvnw clean install -Dossindex.skip -Dtest=\!UnpackerResourceTest#testPDFImages

License (see also LICENSE.txt)

Collective work: Copyright 2011 The Apache Software Foundation.

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file.

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle.

Mailing Lists

Subscribe by sending a message to {list}-subscribe@tika.apache.org.

Issue Tracker

https://issues.apache.org/jira/browse/TIKA

Security

See SECURITY.md and https://tika.apache.org/security.html

About

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

No packages published

Contributors 151