GitHub - skyformat99/tispark: TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV

What is TiSpark?

TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer the complex OLAP queries. It takes advantages of both the Spark platform and the distributed TiKV cluster, at the same time, seamlessly glues to TiDB, the distributed OLTP database, to provide a Hybrid Transactional/Analytical Processing (HTAP) to serve as a one-stop solution for online transactions and analysis.

TiSpark Architecture

TiSpark integrates with Spark Catalyst Engine deeply. It provides precise control of the computing, which allows Spark read data from TiKV efficiently. It also supports index seek, which improves the performance of the point query execution significantly.
It utilizes several strategies to push down the computing to reduce the size of dataset handling by Spark SQL, which accelerates the query execution. It also uses the TiDB built-in statistical information for the query plan optimization.
From the data integration point of view, TiSpark + TiDB provides a solution runs both transaction and analysis directly on the same platform without building and maintaining any ETLs. It simplifies the system architecture and reduces the cost of maintenance.
In addition, you can deploy and utilize tools from the Spark ecosystem for further data processing and manipulation on TiDB. For example, using TiSpark for data analysis and ETL; retrieving data from TiKV as a machine learning data source; generating reports from the scheduling system and so on.

TiSpark depends on the existence of TiKV clusters and PDs. It also needs to setup and use Spark clustering platform.

A thin layer of TiSpark. Most of the logic is inside tikv-java-client library. https://github.com/pingcap/tikv-client-lib-java

Uses as below

./bin/spark-shell --jars /wherever-it-is/tispark-0.1.0-SNAPSHOT-jar-with-dependencies.jar


import org.apache.spark.sql.TiContext

// suppose your TiDB's Placement Driver is 127.0.0.1:2379
val ti = new TiContext(spark, List("127.0.0.1:" + 2379)) 

// Mapping all TiDB tables from database tpch as Spark SQL tables
ti.tidbMapDatabase("tpch")

spark.sql("select count(*) from lineitem").show

Quick start

Read the Quick Start.

License

TiSpark is under the Apache 2.0 license. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
bin		bin
docs		docs
src/main/scala		src/main/scala
tikv-client-lib-java @ 3df8930		tikv-client-lib-java @ 3df8930
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is TiSpark?

TiSpark Architecture

Quick start

Follow us

Twitter

Mailing list

License

About

Releases

Packages

Languages

skyformat99/tispark

Folders and files

Latest commit

History

Repository files navigation

What is TiSpark?

TiSpark Architecture

Quick start

Follow us

Twitter

Mailing list

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages