Maximum capacity per node (node density)
DataStax recommends up to 2 TB of data per node with appropriate free disk space for the compaction strategy being used. |
Exceeding the recommended node density may have the following effects:
Compactions may get behind depending on write throughput, hardware, and compaction strategy.
Streaming operations such as bootstrap, repair, and replacing nodes will take longer to complete.
Higher capacity nodes work best with low to moderate write throughput and no indexing.
One special case is for time-series data. Use the TimeWindowCompactionStrategy to scale larger than these limits if:
The time-series data is written once and never updated.
The data has a clustering column that is time based.
Read queries cover specific time-bounded ranges of data rather than its full history.
The TWCS windows are configured appropriately.
Vector Search node capacity
See Capacity Planning for Vector Search.
Per-node data density factors
Node density is highly dependent on the environment.Determining node density depends on many factors, including:
Data frequency change and access frequency.
SLAs (service-level agreements) and ability to handle outages.
Data compression.
Compaction strategy: These strategies control which SSTables are chosen for compaction and how the compacted rows are sorted into new SSTables.Each strategy has its own strengths.To determine your compaction strategy, see compaction strategy.
Network performance: remote links likely limit storage bandwidth and increase latency.
Replication factor: See About Data Distribution and Replication.
Free disk space
Disk space depends on usage, so it is important to understand the mechanism.
The database writes data to disk when appending data to the commit log for durability and when flushing memtables to SSTable data files for persistent storage.The commit log has a different access pattern (read/writes ratio) than the pattern for reading data from SSTables.This is more important for spinning disks than for SSDs.
SSTables are periodically compacted.Compaction improves performance by merging and rewriting data and discarding old data.However, depending on the type of compaction and size of the compactions, disk utilization and data directory volume temporarily increase during compaction.For this reason, be sure to leave an adequate amount of free disk space available on a node.
The following table provides guidelines for the minimum disk space requirements based on the compaction strategy:
Compaction strategy | Minimum requirements |
---|---|
| Worst case: 20% of free disk space at the default configuration. |
| Sum of all the SSTables compacting must be smaller than the remaining disk space. Worst case: 50% of free disk space.This scenario can occur in a manual compaction where all SSTables are merged into one giant SSTable. |
| Generally 10%. Worse case: 50% when the Level 0 backlog exceeds 32 SSTables (LCS uses STCS for Level 0). |
| TWCS requirements are similar to STCS. TWCS requires a maximum disk space overhead of approximately 50% of the total size of SSTables in the last created bucket. To ensure adequate disk space, determine the size of the largest bucket or window ever generated and calculate 50% of that size. |
Cassandra periodically merges SSTables and discards old data to keep the database healthy through a process called compaction.See How data is maintained for more information.
Estimating usable disk capacity
To estimate how much data your nodes can hold, calculate the usable disk capacity per node and then multiply that by the number of nodes in your cluster.
Start with the raw capacity of the physical disks:
raw_capacity = disk_size * number_of_data_disks
Calculate the usable disk space accounting for file system formatting overhead (roughly 10 percent):
formatted_disk_space = (raw_capacity * 0.9)
Calculate the recommended working disk capacity:
usable_disk_space = formatted_disk_space * (0.5 to 0.8)
During normal operations, the database routinely requires disk capacity for compaction and repair operations.For optimal performance and cluster health, DataStax recommends not filling your disks, but running at 50% to 80% capacity. |
Capacity and I/O
When choosing disks for your nodes, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate).
Some workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).
Number of disks - HDD
DataStax recommends using at least two disks per node: one for the commit log and the other for the data directories.
At a minimum, the commit log should be on its own partition.
Commit log disk - HDD
The disk need not be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).
Commit log disk - SSD
Unlike spinning disks, there is less of a penalty for sharing commit logs and data directories on SSD than there is on HDD.
DataStax recommends separating commit logs and data for highest performance and resiliency.
Data disks
Use one or more disks per node and make sure they are large enough for the data volume and fast enough to satisfy both (1) reads that are not cached in memory, and (2) to keep up with compaction.
RAID on data disks
It is generally not necessary to use a Redundant Array of Independent Disks (RAID) for the following reasons:
Data is replicated across the cluster based on the replication factor you have chosen.
Cassandra includes a JBOD (just-a-bunch-of-disks) feature for disk management.Because the database responds according to your availability/consistency requirements to a disk failure either by stopping the affected node or by denylisting the failed drive, you can deploy nodes with large disk arrays without the overhead of RAID 10.You can configure the database to stop the affected node or denylist the drive according to your availability/consistency requirements.Also see Recovering from a single disk failure using JBOD.
RAID on the commit log disk
Generally RAID
is not needed for the commit log disk.Replication adequately prevents data loss.If you need extra redundancy, use RAID 1
.
Extended file systems
DataStax recommends deploying on XFS
or ext4
.On ext2
or ext3
, the maximum file size is 2TB
even using a 64-bit kernel.
Because the database can use almost half your disk space for a single file when using SizeTieredCompactionStrategy
(STCS), use XFS when using large disks, particularly if using a 32-bit kernel.XFS file size limits are 16TB
max on a 32-bit kernel, and essentially unlimited on 64-bit.