An Introduction to Cassandra

akhil anand
5 min read · Jan 6, 2023

Overview

Cassandra is a NoSQL database management system designed to handle large amounts of data. It was developed at Facebook and later released as open source, and it is built for fast read and write operations. Any discussion of a distributed NoSQL database starts with the CAP theorem, which states that a distributed system cannot guarantee consistency, availability, and partition tolerance all at once.

CAP Theorem

C:- Consistency
What does consistency mean?
Suppose User A and User B fire the same query at the same time. If the database is not consistent, the two users can get different results. A simple query for reference:

SELECT COUNT(DISTINCT userid)
FROM users;
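
To make the idea concrete, here is a minimal Python sketch (a toy model, not Cassandra code) of two replicas of a users table. The names replica_1, replica_2 and count_distinct_users are illustrative assumptions. A write reaches only one replica at first, so two users running the same count get different answers until the replicas are back in sync.

```python
# Toy model: each replica is just a list of user ids (hypothetical data).
replica_1 = ["u1", "u2", "u3"]
replica_2 = ["u1", "u2", "u3"]


def count_distinct_users(replica):
    """Rough equivalent of the count query above, run against one replica."""
    return len(set(replica))


# A new user is written, but so far only replica_1 has received the write.
replica_1.append("u4")

# User A is served by replica_1, User B by replica_2: same query, different results.
result_a = count_distinct_users(replica_1)
result_b = count_distinct_users(replica_2)
print(result_a, result_b)  # 4 3

# Once the write is propagated to replica_2, both users see the same answer.
replica_2.append("u4")
print(count_distinct_users(replica_1) == count_distinct_users(replica_2))  # True
```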

A:- Availability
The system should be available 24x7, which also means it must be fault-tolerant: a single point of failure should not bring down the whole operation. Cassandra is designed this way. Every node can serve read and write requests, so the cluster keeps working when an individual node fails.

P:- Partition Tolerance
Suppose a cluster has multiple nodes connected over a network. If the network link between some of the nodes breaks, the system should still operate and serve requests. Cassandra is partition tolerant in this sense.

Cassandra does not support joins and offers only limited aggregate operations, so it is not meant for analytical workloads. It is used to build real-time data pipelines where data arrives at very high velocity.

Let’s Dig Deeper into Cassandra

Because Cassandra is a NoSQL database, its schema is flexible: a column can be added to a table at any time, and existing rows do not need a value for it. Cassandra stores data row-wise, and each row is uniquely identified by its primary key.

1. Keyspace:- A keyspace is the top-level data container in Cassandra, similar to a schema in a relational database. A keyspace holds the core objects, e.g. column families (tables), rows indexed by their primary key, and data types, and it defines how the data is replicated: the data centers involved and their replication factors.

2. Keys in Cassandra

Partition Key:- It determines which partition a row belongs to and therefore on which node the data is placed.
Suppose the user fires a write query in a multi-node cluster (see the figure below). The partition key is hashed, and the resulting hash value decides which node the data is placed on.

Image: Cassandra Cluster
/* Simple example to assign a partition key */
CREATE TABLE application_log (
    id int,
    user_name varchar,
    user_age int,
    user_login timestamp,
    user_ordered varchar,
    PRIMARY KEY (user_name)
);
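
The hashing step described above can be sketched in Python. This is a toy model under stated assumptions: the four-node RING, the 0..255 token space, and the helper names token_for and node_for are hypothetical, and MD5 stands in for the partitioner's real hash function.

```python
import hashlib
from bisect import bisect_left

# Hypothetical ring: token space 0..255, each node owns the range ending at its token.
RING = [(63, "node-A"), (127, "node-B"), (191, "node-C"), (255, "node-D")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash the partition key to a token (MD5 used as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return digest[0]  # the first byte gives a token in 0..255


def node_for(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    index = bisect_left(TOKENS, token_for(partition_key))
    return RING[index % len(RING)][1]


# The same user_name always hashes to the same token, hence the same node.
for name in ["alice", "bob", "carol"]:
    print(name, token_for(name), node_for(name))
```

Because the token depends only on the partition key, every row with the same user_name lands on the same node, and a read for that user can be routed there directly.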

Composite Partition Key:- When a single column is not enough to identify a partition, the partition key is formed from more than one column. In CQL the partition key columns are wrapped in an extra pair of parentheses.

/* Simple example to assign a composite partition key */
CREATE TABLE application_log (
    id int,
    user_name varchar,
    user_age int,
    user_login timestamp,
    user_ordered varchar,
    PRIMARY KEY ((user_name, user_age))
);
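
As a rough illustration of what a composite partition key changes, the Python sketch below hashes both columns together, so rows share a partition only when both values match. The partitions dictionary, the sample rows, and the MD5 stand-in hash are assumptions made for the example.

```python
import hashlib


def token_for(*partition_columns) -> int:
    """Hash all partition key columns together (MD5 used as a stand-in hash)."""
    joined = ":".join(str(value) for value in partition_columns)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Hypothetical rows: (user_name, user_age, user_ordered).
rows = [("alice", 30, "book"), ("alice", 30, "pen"), ("alice", 31, "lamp")]

# Group rows by the token of the composite key (user_name, user_age).
partitions = {}
for user_name, user_age, user_ordered in rows:
    partitions.setdefault(token_for(user_name, user_age), []).append(user_ordered)

print(len(partitions))                      # number of distinct partitions
print(partitions[token_for("alice", 30)])   # rows sharing the ("alice", 30) partition
```

The two rows for ("alice", 30) end up together, while ("alice", 31) gets its own partition even though the user_name is the same.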

Clustering Key:- Suppose a user has written data to the Cassandra cluster and wants the latest data on top and the oldest at the bottom. The clustering key makes this possible: it defines the ascending or descending order in which rows are stored within a partition.

/* Simple example to assign a clustering key */
CREATE TABLE application_log (
    id int,
    user_name varchar,
    user_age int,
    user_login timestamp,
    user_ordered varchar,
    PRIMARY KEY ((user_name, user_age), user_ordered)
)
WITH CLUSTERING ORDER BY (user_ordered DESC);
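
The effect of a descending clustering order can be pictured with a small Python sketch. It is only a model of the idea: the table dictionary and the insert helper are hypothetical, and the partition is re-sorted on every insert for simplicity.

```python
from collections import defaultdict

# Hypothetical model: partition key -> list of clustering values in that partition.
table = defaultdict(list)


def insert(user_name, user_age, user_ordered):
    """Add a row and keep its partition sorted by the clustering key, descending."""
    partition = table[(user_name, user_age)]
    partition.append(user_ordered)
    partition.sort(reverse=True)


insert("alice", 30, "book")
insert("alice", 30, "pen")
insert("alice", 30, "lamp")

# Rows come back already ordered, with no sorting needed at read time.
print(table[("alice", 30)])  # ['pen', 'lamp', 'book']
```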

Primary Key:- The primary key consists of the partition key (one or more columns) followed by zero or more clustering keys. It uniquely identifies each row, which is what read and write operations rely on.
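
A minimal sketch of what "uniquely identifies each row" means in practice, assuming a toy in-memory table keyed by the full primary key. The rows dictionary and the write and read helpers are hypothetical. Writing twice with the same primary key leaves a single row holding the latest values.

```python
# Hypothetical table keyed by the full primary key:
# ((user_name, user_age), user_ordered) -> remaining columns.
rows = {}


def write(user_name, user_age, user_ordered, user_login):
    primary_key = ((user_name, user_age), user_ordered)
    rows[primary_key] = {"user_login": user_login}


def read(user_name, user_age, user_ordered):
    return rows.get(((user_name, user_age), user_ordered))


write("alice", 30, "book", "2023-01-06 10:00")
write("alice", 30, "pen", "2023-01-06 11:00")
write("alice", 30, "book", "2023-01-06 12:00")  # same primary key: replaces the first row

print(len(rows))                  # 2
print(read("alice", 30, "book"))  # {'user_login': '2023-01-06 12:00'}
```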

3. Data Replication in Cassandra

Cassandra keeps replicas of the data to ensure availability and fault tolerance. Suppose the user sends a read query to a node and that node has crashed. If the data lived only on that node, the user could not read it, and the system would not be fault-tolerant. To avoid this, whenever the user fires a write query the Cassandra cluster stores copies of the data on different nodes, as many as the replication factor specifies.
e.g:- Replication Factor = 3 means three copies of the data are stored on three different nodes.
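
The crash scenario above can be sketched in a few lines of Python. The cluster state (NODE_UP, DATA) and the read helper are hypothetical; the point is only that with three replicas a read still succeeds when one node is down, while with a single copy it fails.

```python
# Hypothetical cluster state: which nodes are up, and what each replica stores.
NODE_UP = {"node-A": False, "node-B": True, "node-C": True}
DATA = {
    "node-A": {"alice": "book"},
    "node-B": {"alice": "book"},
    "node-C": {"alice": "book"},
}


def read(key, replicas):
    """Try each replica in turn and return the value from the first live one."""
    for node in replicas:
        if NODE_UP[node]:
            return DATA[node][key]
    raise RuntimeError("no live replica holds the key")


# Replication factor 3: node-A is down, but node-B answers.
print(read("alice", ["node-A", "node-B", "node-C"]))  # book

# Replication factor 1: the only copy is on the crashed node, so the read fails.
try:
    read("alice", ["node-A"])
except RuntimeError as error:
    print("read failed:", error)
```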


Simple Strategy:- The simple strategy (SimpleStrategy) is used when the Cassandra cluster has a single data center.
Suppose a user fires a write request. The coordinator node directs the request to the node that owns the relevant hash value, and the remaining replicas are created on the next nodes in the clockwise direction around the ring.

Image: Simple Replication Strategy
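
The clockwise placement can be sketched as follows. The six-node RING and the function name simple_strategy_replicas are assumptions for illustration; the owner node is taken as given rather than computed from a hash.

```python
# Hypothetical ring of six nodes listed in clockwise order.
RING = ["node-A", "node-B", "node-C", "node-D", "node-E", "node-F"]


def simple_strategy_replicas(owner: str, replication_factor: int) -> list:
    """Return the owner plus the next nodes clockwise until RF copies are placed."""
    if replication_factor > len(RING):
        raise ValueError("replication factor cannot exceed the number of nodes")
    start = RING.index(owner)
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]


# The walk wraps around the end of the ring back to the first node.
print(simple_strategy_replicas("node-E", 3))  # ['node-E', 'node-F', 'node-A']
```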

Network Topology:- The network topology strategy (NetworkTopologyStrategy) is used when the Cassandra cluster has more than one data center. Each data center has its own replication factor, so for the same write one data center can keep two replicas while the other keeps four, or vice versa.
In the network topology strategy, replicas are placed one data center at a time: once the replicas for one data center are complete, replication continues in the next data center.

Image: Network topology Replication Strategy
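
A simplified sketch of per-data-center replication, using the two-and-four example from the text. The data center names, node names, and the start_index parameter are hypothetical, and the model deliberately ignores finer placement details of a real cluster.

```python
# Hypothetical cluster: each data center has its own ring (clockwise order).
DATA_CENTERS = {
    "dc1": ["dc1-A", "dc1-B", "dc1-C", "dc1-D"],
    "dc2": ["dc2-A", "dc2-B", "dc2-C", "dc2-D", "dc2-E"],
}
# Per-data-center replication factors: two replicas in dc1, four in dc2.
REPLICATION = {"dc1": 2, "dc2": 4}


def network_topology_replicas(start_index: int) -> dict:
    """Place replicas data center by data center, walking each ring clockwise."""
    placement = {}
    for dc, ring in DATA_CENTERS.items():
        rf = REPLICATION[dc]
        placement[dc] = [ring[(start_index + i) % len(ring)] for i in range(rf)]
    return placement


print(network_topology_replicas(3))
# {'dc1': ['dc1-D', 'dc1-A'], 'dc2': ['dc2-D', 'dc2-E', 'dc2-A', 'dc2-B']}
```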

4. Coordinator Node:- The coordinator node receives a read or write request and routes it to the appropriate node, based on the hash value range assigned to each node.
5. Commit Log:- Every write operation is first recorded in a log, so if the system crashes later the commit log helps to recover the data.
6. Memtable:- After the write is logged, the data is stored in the memtable, a structure held in RAM (in memory).
7. SSTable (Sorted String Table):- Once the data in the memtable reaches a threshold size, it is flushed to disk as an SSTable.
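
Items 5 to 7 together form the write path, which the following Python sketch models in memory. ToyNode, memtable_limit, and the recover method are hypothetical names, and a row-count threshold stands in for the real size-based one. It is a sketch of the sequence, not of Cassandra's storage engine.

```python
class ToyNode:
    """Toy write path: commit log -> memtable -> SSTable (all kept in memory here)."""

    def __init__(self, memtable_limit: int = 3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (row count)
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory key -> value
        self.sstables = []    # each flush produces one immutable, sorted table

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write for crash recovery
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. threshold reached: flush

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs to be replayed

    def recover(self):
        """After a crash the memtable is lost; replay the commit log to rebuild it."""
        self.memtable = dict(self.commit_log)


node = ToyNode(memtable_limit=3)
for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    node.write(key, value)

print(node.sstables)  # [(('a', 1), ('b', 2), ('c', 3))]
print(node.memtable)  # {'d': 4}

node.memtable = {}    # simulate a crash that wipes RAM
node.recover()
print(node.memtable)  # {'d': 4}
```

The third write fills the memtable and triggers a flush into a sorted table, while the fourth write lives only in memory and in the commit log, which is why it can be rebuilt after the simulated crash.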

Conclusion

If you are interested in more data-related content you can visit here. If you have any suggestions or are stuck somewhere, kindly comment below.
