1/8/2020

Cassandra Data Modeling Best Practices for Efficient JOIN Operations on Cassandra Tables in the Spark Layer

by John Doe

The integration of Apache Cassandra and Apache Spark is one of the emerging trends in the big data world today. Together, these two products offer several advantages. Much has already been said about Cassandra and Spark integration, and there are several enterprise-grade offerings in the marketplace today.

This article aims to provide a few modeling suggestions for when you need to join two or more Cassandra tables using Spark. The ability to join Cassandra tables in Spark gives you several data modeling advantages: support for ETL/ELT processes, the ability to balance data redundancy against query flexibility, and data analysis using the Spark DataFrame API and Spark SQL.

Apache Spark is a distributed processing framework with a SQL engine that can join data from several sources, such as Hadoop files, Hive, Cassandra, and JDBC/ODBC data sources. The list of supported sources keeps growing.

Cassandra is a popular NoSQL database widely used in OLTP applications. Its query language, CQL, looks similar to SQL, but it is not quite the same.

While traditional relational data sources store their data in row format, Cassandra stores its data in partitions of rows within column families (tables, in CQL terms). The arrangement of data inside a partition is very similar to a pivoted spreadsheet.
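To make the partition layout a little more concrete, here is a minimal sketch in plain Python. It does not use Cassandra or any driver; the table shape (a partition key `sensor_id` with a clustering column `reading_time`) and the sample rows are hypothetical and only stand in for how rows that share a partition key end up side by side, ordered by the clustering column.

```python
from collections import defaultdict

# Hypothetical rows of a table whose primary key is ((sensor_id), reading_time):
# sensor_id is the partition key, reading_time is the clustering column.
rows = [
    ("s1", "2020-01-01T00:00", 20.5),
    ("s2", "2020-01-01T00:00", 18.0),
    ("s1", "2020-01-01T00:10", 21.0),
    ("s1", "2020-01-01T00:05", 20.7),
]


def to_partitions(rows):
    """Group rows by partition key and order each partition by clustering key."""
    partitions = defaultdict(dict)
    for sensor_id, reading_time, value in rows:
        partitions[sensor_id][reading_time] = value
    # Sorting the clustering keys mimics the ordered, "pivoted" layout
    # of cells inside a single partition.
    return {pk: dict(sorted(cells.items())) for pk, cells in partitions.items()}


partitions = to_partitions(rows)
for partition_key, cells in partitions.items():
    print(partition_key, cells)
```

Each partition reads like one wide spreadsheet row: the partition key on the left, and one cell per clustering value to its right.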

The denormalized data model has been heavily emphasized until now because of Cassandra’s inability to join tables. This is still the case for pure Cassandra-based OLTP applications. However, new options are opening up for enterprises that plan to integrate Apache Spark and Cassandra.
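As a rough illustration of the alternative to denormalizing, the sketch below keeps two small tables separate and joins them on a shared key at read time. It is plain Python, not the Spark or Cassandra API: the `customers` and `orders` tables, their columns, and the `inner_join` helper are all hypothetical, and a real Spark job would perform the equivalent join in a distributed fashion.

```python
# Hypothetical normalized tables, each modeled as a list of row dicts.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "order_id": 100, "total": 25.0},
    {"customer_id": 1, "order_id": 101, "total": 40.0},
    {"customer_id": 3, "order_id": 102, "total": 10.0},
]


def inner_join(left, right, key):
    """Hash join: index the left rows by key, then probe with each right row."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    joined = []
    for right_row in right:
        # Rows without a matching key on the other side are dropped (inner join).
        for left_row in index.get(right_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


result = inner_join(customers, orders, "customer_id")
for row in result:
    print(row)
```

With a join available in the processing layer, the customer name does not have to be copied into every order row, which is the kind of trade-off between redundancy and query flexibility this article is concerned with.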

Since this is a heavy topic, I want to release it in multiple installments. First, here is an explanation of how Spark SQL works.

https://intelligentinsight.wordpress.com/2016/07/05/optimizing-spark-sql-join-statements-for-high-performance/

Here is the link for Cassandra modeling best practices for Spark SQL joins.

https://intelligentinsight.wordpress.com/2016/07/09/cassandra-data-modeling-principles-for-spark-sql-joins/
