Cassandra Developer Certification Study Notes – Data Modeling

1. Conceptual Data Model
Entity type hierarchy
  • Entity, relationships, attributes, keys, cardinality constraints
  • Transitivity – multi-level subtype in IsA relationship
  • Inheritance – attribute inheritance in IsA relationship
  • Disjointness constraint – IsA relationship that you specify when relationship is disjoint
    • video is a full video
    • video is a trailer
    • video cannot be both
  • Covering constraint – the union of all subtype entities equals the set of entities of the main type
    • Using the disjointness constraint example…
      • Full video + Trailer covers all of the possible roles for a video
      • Not covering means other video types like TV show and movies exist
2. Application Workflow Model
  • tasks and causal dependencies form a graph
  • access patterns determine how data is accessed – know what queries you will run
3. Mapping Conceptual to Logical
Process to design logical model
  • use top down approach
  • can be algorithmically defined
  • effective in the long run
Query driven data modeling
  • Uses conceptual data model, application workflow, and access patterns as inputs to generate logical data model
  • Described using the Chebotko diagram notation
    • visual diagram for tables and access patterns = schema
    • documents the logical and physical data model
Logical diagram
Physical diagram
Data modeling principles
  • know your data
  • know your queries
    • partition per query – ideal
    • partition+ per query – acceptable
      • ex. find movies that match one of multiple genres
    • table scan – anti-pattern
    • multi-table – anti-pattern
      • ex. retrieve all data in a database
  • nest data – denormalization
    • organizes multiple entities into a single partition
    • supports partition per query access
    • mechanisms
      • clustering columns
      • collection columns
      • user-defined type columns
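The three nesting mechanisms above can be sketched in CQL (the schema below is illustrative, not from the notes):

```sql
-- hypothetical table: all videos for one user nested into a single partition
CREATE TYPE IF NOT EXISTS video_encoding (
  encoding TEXT,
  height INT,
  width INT
);

CREATE TABLE IF NOT EXISTS videos_by_user (
  user_id UUID,                      -- partition key
  added_date TIMESTAMP,              -- clustering column nests rows
  video_id TIMEUUID,                 -- clustering column
  title TEXT,
  tags SET<TEXT>,                    -- collection column
  encoding FROZEN<video_encoding>,   -- user-defined type column
  PRIMARY KEY (user_id, added_date, video_id)
);
```

A single SELECT on user_id then retrieves everything in one partition, which is what partition-per-query access means.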
  • duplicate data – denormalization
    • better than join
    • join on write
Mapping rules for the query driven methodology
  • ensure that logical data model is correct
  • each query has a corresponding table
  • tables are designed to allow query to execute properly
  • table returns data in correct order
  • Mapping Rule 1: entities and relationships
    • entity and relationship types map to tables
    • entity and relationships map to partition or rows
    • partition may have data about one or more entities and relationships
    • attributes are represented by columns
  • Mapping Rule 2: equality search attributes
  • Mapping Rule 3: inequality search attributes
  • Mapping Rule 4: ordering attributes
  • Mapping Rule 5: key attributes
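As a sketch of rules 2–5 (the table and query are hypothetical): for a query such as "find videos in a given genre, released after a given year, newest first", the equality attribute becomes the partition key, the inequality attribute a clustering column, and ordering is set with CLUSTERING ORDER BY:

```sql
CREATE TABLE IF NOT EXISTS videos_by_genre (
  genre TEXT,          -- rule 2: equality search attribute -> partition key
  release_year INT,    -- rule 3: inequality search attribute -> clustering column
  video_id TIMEUUID,   -- rule 5: key attribute completes the primary key
  title TEXT,
  PRIMARY KEY ((genre), release_year, video_id)
) WITH CLUSTERING ORDER BY (release_year DESC, video_id DESC);  -- rule 4: ordering

SELECT title FROM videos_by_genre
WHERE genre = 'comedy' AND release_year > 2010;
```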

Cassandra Developer Certification Study Notes – CQL

What is it?

  • tabular presentation of results
  • selecting returns result set in order defined by clustering primary key
  • every write gets a timestamp; the data with the latest timestamp wins at read time – this is how conflicting writes are resolved when scaling out with more nodes
  • terminologies
    • table = Thrift:column family; called a column family because a partition is just a sorted key-value map, not a relational table
    • table is a set of partitions and a partition may be single or multiple row = Thrift:row (skinny or wide)
      • skinny: simple partition
      • wide: partition with multiple hits
    • partition key uniquely identifies a partition and may be simple or composite = Thrift:row key
    • column uniquely identifies a cell in a partition, and may be regular or clustering = Thrift:column key
    • primary key is comprised of a partition key and clustering columns if any; and uniquely identifies a row in both partition and table
    • composite keys
      • composite partition key: PRIMARY KEY ((title, year))
      • one partition key with composite clustering column declarations: PRIMARY KEY (track_title, performer, year, album_title)
      • composite partition key with composite clustering columns
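The three primary key shapes can be written out as full CQL (hypothetical tables):

```sql
-- composite partition key: title and year together locate the partition
CREATE TABLE IF NOT EXISTS videos (
  title TEXT, year INT, runtime INT,
  PRIMARY KEY ((title, year))
);

-- simple partition key with composite clustering columns
CREATE TABLE IF NOT EXISTS tracks_by_title (
  track_title TEXT, performer TEXT, year INT, album_title TEXT,
  PRIMARY KEY (track_title, performer, year, album_title)
);

-- composite partition key with composite clustering columns
CREATE TABLE IF NOT EXISTS tracks_by_album (
  album_title TEXT, album_year INT, track_number INT, track_title TEXT,
  PRIMARY KEY ((album_title, album_year), track_number, track_title)
);
```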
    • keyspace: top level namespace where you set replication
  • size of a column family is limited only by the size of the cluster
  • data from one row must fit on one node
  • maximum columns per row is 2 billion
    • in practice – up to 100,000
  • maximum data size per cell (column value) is 2GB
    • in practice – 100MB
  • if you want to change your primary key, you have to create a new table. you cannot modify primary key.
  • ORDER BY can only reverse the clustering order defined on the table, and only when traversing rows within a single partition
Shell command:
  • cqlsh
    • select * from performer limit 3; – shows rows
  • ccm node1 cli
    • list performer limit 3 – shows partitions with values in hexadecimals
column types:
  • can declare static: the column stores one shared value per partition rather than one per row; cannot be part of the primary key
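A minimal sketch of a static column (hypothetical schema):

```sql
-- 'balance' is stored once per partition and shared by all rows in it
CREATE TABLE IF NOT EXISTS purchases_by_user (
  user_id UUID,
  purchase_id TIMEUUID,
  balance DECIMAL STATIC,  -- one value per user_id partition
  amount DECIMAL,
  PRIMARY KEY (user_id, purchase_id)
);
```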
  • counter
    • performance (a read is required before update) and accuracy considerations (counter update is not an idempotent operation)
    • can become a problem during retries – the client cannot safely retry when a counter update times out, since a retry may double-count
  • set: guarantees uniqueness; ordered by value
  • list: maintains sequence and appends; map: key-value
  • user-defined types
    • CREATE TYPE track (album_title TEXT, album_year INT, track_title TEXT);
    • alter and drop available
  • tuple: holds fixed length sets of typed positional fields
    • convenient alternative to user defined types
    • up to 32768 fields but generally only use a few
    • useful when prototyping
    • must use “frozen” keyword in C*2.1
    • tuples can be nested in other tuples
    • CREATE TABLE user (
        id INT PRIMARY KEY,  -- key column assumed; the original snippet was partial
        equalizer frozen<tuple<float, int, float>>
      );
  • frozen: indicates that values must be written in their entirety for user-defined and tuple types
  • select clauses: where, order by, limit
secondary indexes
  • searching on a secondary index alone requires querying essentially all nodes (scatter-gather operation)
  • when do you want it?
    • searching convenience
    • use with low-cardinality columns – columns that may contain relatively small number of distinct values – result set will have more manageable size
    • don’t use:
      • on high cardinality columns
      • on counter column tables
      • on frequently updated or deleted columns
      • to look for a row in a large partition unless narrowly queried – e.g. search on both partition key and indexed column
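A sketch of a secondary index on a low-cardinality column (names are illustrative):

```sql
CREATE TABLE IF NOT EXISTS users (
  id UUID PRIMARY KEY,
  name TEXT,
  state TEXT  -- low cardinality: ~50 distinct values
);

CREATE INDEX IF NOT EXISTS users_state_idx ON users (state);

-- without the index this query would be rejected;
-- with it, all nodes may still be consulted (scatter-gather)
SELECT name FROM users WHERE state = 'TX';
```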
timeuuid
  • generated using time (60 bits), a clock sequence number (14 bits), and the MAC address (48 bits)
  • now() generates TIMEUUID
  • dateOf() extracts the embedded time
  • can be ordered by time using TIMEUUID
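These functions come together in a time-series sketch (hypothetical table):

```sql
CREATE TABLE IF NOT EXISTS events_by_sensor (
  sensor_id UUID,
  event_id TIMEUUID,   -- embeds the creation time; orders rows chronologically
  reading DOUBLE,
  PRIMARY KEY (sensor_id, event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

-- now() generates a TIMEUUID at write time
INSERT INTO events_by_sensor (sensor_id, event_id, reading)
VALUES (uuid(), now(), 20.4);

-- dateOf() extracts the embedded timestamp on read
SELECT dateOf(event_id) AS at, reading
FROM events_by_sensor
WHERE sensor_id = 99051fe9-6a9c-46c2-b949-38ef78858dd0;
```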
insert
  • requires the primary key
  • atomic – all values of a row are inserted or none
  • isolated – two inserts with the same values in primary key columns don't interfere – they just get executed one after another
  • light-weight transactions via “IF NOT EXISTS” to replace the behavior of overwriting – Paxos consensus algorithm that requires multiple roundtrips
  • TTL available (INSERT INTO blah (foo) VALUES (bar) USING TTL 86400)
  • collection
    • set: insert into band (name, members) values ('the beatles', {'paul', 'john', 'george'}); – no dupes
    • list: insert into song (id, title, songwriters) values ('123', 'Let It Be', ['John', 'Paul']); – dupes allowed and ordered
  • tuple: insert into user (id, equalizer) values (1, (3.0, 6, 7.0)); – values must match tuple<float, int, float>
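Lightweight transactions and TTL on insert look like this (assuming a simple users table):

```sql
-- LWT: insert only if the row does not already exist; Paxos makes this
-- slower than a plain insert (multiple round trips)
INSERT INTO users (id, name) VALUES (123, 'Ringo')
IF NOT EXISTS;   -- result includes [applied] = false if the row existed

-- TTL: the inserted values expire after 86400 seconds (one day)
INSERT INTO users (id, name) VALUES (124, 'George')
USING TTL 86400;
```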
update
  • requires the full primary key to identify the row
  • PK cannot be updated
  • a new value is added if not exist (upsert)
  • atomic
  • isolated
  • use “IF” to verify the value for columns before execution
  • TTL available (UPDATE blah USING TTL 30…)
  • collection
    • set: update band set members = members + {'pete'} where name = 'the beatles';
    • set: update band set members = members - {'pete'} where name = 'the beatles';
    • list: update song set songwriters = songwriters + ['Paul'] where…; // append
    • list: update song set songwriters = ['Paul'] + songwriters where…; // prepend
    • list: update song set songwriters[3] = 'Ringo' where…; // replace by index
    • list: update song set songwriters = songwriters - ['Patrick'] where…; // remove all occurrences
  • udt: use a JSON-style literal – insert into table1 (udf1) values ({blah: 'foo', ice: 'cream'}); // upsert works too
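A conditional update with IF, per the bullet above (hypothetical users table):

```sql
-- applied only if the current column value matches the condition
UPDATE users
SET name = 'Richard Starkey'
WHERE id = 123
IF name = 'Ringo';   -- result includes [applied] = false when the check fails
```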
delete
  • you can delete a single column (or cell) with DELETE column_name FROM …
  • inserting a null is the same as deleting that cell
  • TRUNCATE table_name – deletes all records in a table; if the default auto snapshot setting is on, a snapshot is taken before the truncate so you can easily recover
  • set
  • list: delete songwriters[0], songwriters[1] from song where id = blah
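Putting the delete forms together (hypothetical song table with an INT id and a duration column):

```sql
DELETE duration FROM song WHERE id = 123;        -- a single cell
DELETE songwriters[0] FROM song WHERE id = 123;  -- one list element
DELETE FROM song WHERE id = 123;                 -- the whole row
TRUNCATE song;                                   -- every row in the table
```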
batch
  • combines multiple insert, update, and delete statements into a single logical operation
  • atomic operation – all statements get the same exact timestamp and the whole group is guaranteed to complete (including any lightweight transaction result)
  • no batch isolation – other transactions can read and write data being affected by a partially executed batch
  • no rollback
  • statements are not run in order; they all get the same timestamp, so you cannot put statements that depend on each other in one batch
  • provides atomicity but, unlike RDBMS transactions, no isolation or rollback
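A minimal logged batch illustrating the points above (hypothetical song table):

```sql
BEGIN BATCH
  INSERT INTO song (id, title) VALUES (1, 'Let It Be');
  UPDATE song SET title = 'Hey Jude' WHERE id = 2;
  DELETE FROM song WHERE id = 3;
APPLY BATCH;
-- all three statements get the same timestamp and the group completes as a
-- whole, but readers may observe a partially applied batch (no isolation)
```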
select
  • count(*)
  • IN is only allowed on the last clustering column of a primary key or partition key
  • to retrieve one partition, values for all partition key columns must be specified
  • to retrieve a subset of rows in a partition, values for all partition key columns and one or more but not all clustering columns must be specified
  • range search
    • <, <=, >=, > are allowed on only one clustering column in a predicate, and never on the partition key
    • slice of a partition – PRIMARY KEY (track_title, minutes, seconds);
    • allows certain kinds of full table scan – on clustering column but not on regular fields
    • preceding clustering columns must be restricted to make it work – with CLUSTERING ORDER BY (performer, year, title), you must specify performer to search on year, and both performer and year to search on title
    • LIMIT clause recommended
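A partition-slice range query against the PRIMARY KEY (track_title, minutes, seconds) example:

```sql
-- equality on the partition key, range on the first clustering column
SELECT * FROM track
WHERE track_title = 'Yesterday'
  AND minutes >= 2 AND minutes < 3   -- both bounds on one clustering column
LIMIT 100;                           -- LIMIT recommended on range queries
```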
  • secondary index
    • only applies to equality matches but not on range
    • set list map – search for value
      • create index on user(preferences)
      • select id from user where preferences CONTAINS ‘Rock’;
    • map – search for key
      • select title from album where tracks CONTAINS KEY 20;
      • can’t create index on both value and key
    • UDTs and tuples
      • the column is treated as a blob and you must search on all fields
      • select * from track where song = {title: 'beatles', year: 1964};
      • select * from user where equalizer = (1.0, 2.0, 3.0);
  • dateOf()
  • now()
  • minTimeuuid(), maxTimeuuid(): the min/maxTimeuuid example selects all rows where the timeuuid column, t, is strictly later than 2013-01-01 00:05+0000 but strictly earlier than 2013-02-02 10:00+0000. t >= maxTimeuuid('2013-01-01 00:05+0000') does not select a timeuuid generated exactly at 2013-01-01 00:05+0000 and is essentially equivalent to t > maxTimeuuid('2013-01-01 00:05+0000').
  • unixTimestampOf() – more precision than dateOf()
  • typeAsBlob(), blobAsType() – conversion
  • token() – token access // only used for traversing the entire token range across every node
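token() is typically used to walk the whole token range, e.g. for an export job (sketch against a hypothetical song table):

```sql
-- first page starts at the lowest Murmur3 token
SELECT id, title, token(id) FROM song
WHERE token(id) > -9223372036854775808
LIMIT 1000;

-- each subsequent page resumes after the last token seen:
-- SELECT ... WHERE token(id) > <last_token> LIMIT 1000;
```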
Data modeling – query centric design
  • Conceptual data model – the same as in traditional data modeling
    • technology independent unified view of data
    • entity-relationship model (ER diagram), dimensional model, etc
  • Logical data model
    • Unique for Cassandra
    • Data is viewed and organized into column families/tables
    • Column family diagrams / Chebotko diagram
      • K: partition key
      • C: clustering column
      • S: static column
      • IDX: index column


  • Physical data model
    • Unique for Cassandra
    • CQL definitions
    • considerations
      • natural or surrogate keys? – prefer natural
      • are write conflicts (overwrite) possible?
      • data type?
      • how large are partitions?
      • how much data duplication is required?
      • are client side joins required and at what cost?
      • are data consistency anomalies possible?
      • how to enable transactions and data aggregation?


  • Framework
    • defines transition between models
      • query driven methodology between conceptual and logical
      • formal analysis and validation between logical and physical
    • approach
      • modeling rules
      • mapping patterns
      • schema optimization
SOURCE
  • executes a file containing CQL statements
  • SOURCE './myscript.cql';

Cassandra Developer Certification Study Notes – Compaction

Delete in Cassandra is a write of a delete marker (tombstone)
  1. a tombstone is applied to the table in Memtable
  2. subsequent queries treat this column as deleted
  3. at the next Memtable flush, the tombstone passes to the new SSTable
  4. at each compaction, tombstone columns older than gc_grace_seconds (10 days by default) are evicted from the newly compacted SSTable
    1. gc_grace_seconds gives nodes that are down at the time up to 10 days to come back and learn about the delete
    2. zombie columns – a deleted column can be resurrected when a node misses the tombstone past gc_grace_seconds – the cure is 'nodetool repair'
Efficient because:
  • SSTables are sorted by partition key
  • no random IO required
Necessary because:
  • SSTables are immutable, so updates fragment data over time
  • deletes are writes and must be cleared periodically
  • during compaction
    • increased disk IO
    • reduced off-cache read performance
  • after compaction
    • read performance increases as fewer SSTables are read for off-cache reads
    • disk utilization drops
  • size-tiered (default) – compaction triggers as the number of SSTables reaches a threshold
    • compacts set number of similarly sized SSTables to a large SSTable
    • fast to complete each compaction b/c relatively few SSTables are compacted at once
    • inconsistent read latency
    • requires significant free disk space (up to the size of the largest CQL table, since SSTables are rewritten during compaction)
    • preferable for write-heavy and time-series data apps
    • full compaction (compacts all or specified SSTables into one) is not recommended for production use
      • uses considerable disk IO and disk space
      • once an SSTable grows very large, it takes a long time for other SSTables to reach a similar size, so that data is compacted with increasing latency
  • leveled – uniform size SSTables organized and compacted by successive levels
    • uses less disk (10%) and the read latency is more consistent and predictable than size-tiered – but uses double the disk IO during writes (should use SSD)
  • date-tiered – data written within a certain time window is saved together
    • ideally suited for time-series data
  • ALTER TABLE … WITH compaction = {'class': 'SizeTieredCompactionStrategy'};

Cassandra Developer Certification Study Notes – Read Path

  • in-memory
    • MemTable: serves data as part of the merge process
    • Row Cache: stores recently read rows (optional)
      • enabled in CQL – CREATE TABLE … WITH caching = {'keys': 'ALL', 'rows_per_partition': '1'}
        • ALL
        • n – cache the first n rows for a partition key
        • NONE – default
      • row_cache_size_in_mb – max row cache size, set to 0 to disable (0 by default)
    • Bloom Filters: reports if a partition key may be in its corresponding SSTable (check if it’s not there)
      • each SSTable has a Bloom Filter on disk; used from off-heap memory
      • false positives are possible, false negatives are not
      • larger tables have higher possibility of false positives
      • can trade off memory to get better accuracy via the table property bloom_filter_fp_chance
        • 0.0: no false positive; greatest memory use
        • 0.1: max recommended setting, diminishes returns when it’s higher
        • 1.0: bloom filter disabled
        • default setting depends on compaction strategy
          • 0.01: size-tiered compaction (STCS)
          • 0.1: leveled compaction (LCS)
    • Key Caches: maps recently read partition keys to specific SSTable offsets
      • enabled in CQL – CREATE TABLE … WITH caching = {'keys': 'ALL', 'rows_per_partition': '1'}
        • ALL – default
        • NONE
      • key_cache_size_in_mb
      • counter_cache_size_in_mb: counter cache is similar to key cache
    • Partition Summaries: sampling from partition index
      • used to locate a key’s approximate location on the full index
      • default sample ratio is 1 per 128 partition keys in the index
      • CREATE TABLE… WITH min_index_interval (default 128)
      • CREATE TABLE… WITH max_index_interval (default 2048)
      • held in memory (off-heap), not on disk
  • disk
    • Partition Indexes: sorted partition keys mapped to their SSTable offsets
    • SSTables: static files periodically flushed from Memtable
Read path flow
  • Best case scenario: hit the row cache – the row cache is optional because it typically duplicates the OS file system cache (both go to RAM)
  • Second best case scenario: hit key cache
  • Worst case scenario: don’t hit key cache and goes through partition summary and partition index
Tracing
  • 'tracing on' in cqlsh
  • tells you where the bottleneck in a query is, which nodes are serving what, and how fast

Cassandra Developer Certification Study Notes – Write Path

Cassandra is a log structured engine
  • Memtables – in-memory tables, with indexes, corresponding to CQL tables – mutable, so changes can happen in place
  • CommitLog – append-only log, replayed to restore downed node’s Memtables
  • SSTable – Memtable snapshots periodically flushed to disk, cleaning heap – immutable
  • Compaction – periodic process to merge and streamline SSTable
When a node receives a request – no disk seek necessary during write as everything gets written sequentially
  1. appends to the CommitLog and to the Memtable for this record’s target CQL table
  2. Periodically, Memtable flush to SSTables, clearing JVM heap and CommitLog
  3. Periodically, Compaction runs to merge and streamline SSTable
  • commitlog_total_space_in_mb (default: 1024)
  • commitlog_segment_size_in_mb (default: 32)
  • commitlog_directory – best practice is to locate on its own disk to minimize write head movement or on SSD
  • commitlog_sync (default: periodic) – batch is the alternative
    • batch: commitlog_sync_batch_window_in_ms (default: 50) – how long to wait for more writes before fsync
    • periodic: commitlog_sync_period_in_ms (default: 10000) – how long to wait between fsync of log to disk – 10 second latency is ok considering the replications on other nodes
  • memtable flushes when
    • memtable_total_space_in_mb is reached (default 25% of heap)
    • commitlog_total_space_in_mb is reached
    • “nodetool flush” is issued
SSTable – sorted string table
  • an immutable file of sorted partitions
  • written to disk through sequential IO
  • contains the state of a Memtable when flushed
  • CQL table = corresponding Memtables + SSTables
  • periodically compacted from many to one
  • structure of an SSTable
    • partition index
    • partition summary
Data directories
  • created by keyspace and table_name-table_id – …/data/keyspace/tablename-tableid
  • Memtable flush or compaction generates the following:
    • -CompressionInfo.db: metadata for Data file compression
    • -Data.db: base SSTable
    • -Filter.db: SSTable partition keys Bloom Filter to optimize reads
    • -Index.db: index for this SSTable
    • -Statistics.db:stats for this SSTable
    • -Summary.db: sampling from index file
    • -TOC.txt: component list for this SSTable
  • sstable2json: exports SSTable in JSON

Cassandra Developer Certification Study Notes – Cluster

Consistency level
  • Can vary for each request
  • Levels
    • any
    • one
    • quorum (RF / 2) + 1
    • all
Consistent hashing
  • Partition – storage location for a row (or a group of rows sharing a partition key)
  • Token – int value generated by hashing algorithm – identifies the location of partition in the cluster
    • value range of -2^63 to 2^63 - 1
  • Murmur3Partitioner (default, best practice) – uniform distribution based on Murmur 3 hash
  • RandomPartitioner – uniform distribution based on md5 hash
  • ByteOrderedPartitioner (legacy only) – lexical distribution based on key bytes
Virtual nodes (vnodes)
  • Multiple smaller primary range segments (virtual nodes) owned by each machine instead of one large range
  • Available in Cassandra 1.2+
  • default is 256 vnodes (tokens) per machine
  • not applicable to nodes that combine Solr and Hadoop workloads
  • advantages
    • when new node is added, the tokens aren’t unbalanced
    • machines bootstrap faster b/c token ranges are distributed
    • virtual node failure impact is spread across the cluster
    • token range assignment is automated
  • You can have cluster use vnodes and regular nodes across DCs but not within a single DC.
  • Enabled in cassandra.yaml: num_tokens – setting it to a value bigger than 1
Multiple data centers
  • Advantages
    • Disaster recovery
    • Replicate data closer to the user for less latency
    • Workload segregation – can create node for reporting that doesn’t impact end users
  • Organization
    • node
    • rack: logical grouping of physically related nodes
    • data center: set of racks
  • Gossip is used to communicate cluster topology
    • once per second, each node contacts 1 to 3 others, requesting and sharing updates
    • node states (heart beats), node locations
    • when a node joins a cluster, it gossips with seed nodes that are specified in cassandra.yaml
      • assign the same seed node to each node in a data center
      • if more than one DC, include seed node from each DC
  • The snitch
    • informs the partitioner of each node's rack and DC topology
    • enables replication that avoids duplication in a rack – to prevent data loss risk from all replicas residing in a single rack
    • cassandra.yaml: endpoint_snitch: SimpleSnitch (various snitches available – typically GossipingPropertyFileSnitch used but Ec2Snitch available for AWS users, etc)
    • dc
    • rack
  • keyspace
    • replication factor: how many copies of each partition to make
    • replication strategy: on which node should each replica get placed
      • SimpleStrategy (for learning only) – one factor for entire cluster. 3 is the recommended minimum. creates replica on nodes subsequent to the primary range node based on token.
      • NetworkTopologyStrategy (enables multiple DCs) – separate factor for each data center in cluster. distributes replicas across racks and data centers
        WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc-east': 2, 'dc-west': 3}

        • doesn't simply create replicas on nodes subsequent to the primary range node based on token; it distributes replicas across racks so not all replicas reside in a single rack
        • for replication across DCs, a "remote coordinator" is picked in the remote DC and a single replication message is sent instead of multiple when several replicas need to be copied to that DC
        • ideally, have a number of racks equal to or evenly divisible by the replication factor (when the replication factor is 3, have 3 racks, or 9, 27, etc)
    • all partitions are replicas, there are no “originals”
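Keyspace definitions for the two strategies might look like this (keyspace and DC names are illustrative):

```sql
-- single replication factor for the whole cluster; learning/dev only
CREATE KEYSPACE IF NOT EXISTS killrvideo_dev
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- multi-DC: a separate replication factor per data center
CREATE KEYSPACE IF NOT EXISTS killrvideo
WITH REPLICATION = {'class': 'NetworkTopologyStrategy',
                    'dc-east': 2, 'dc-west': 3};
```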
  • Hinted handoff
    • recovery mechanism for writes targeting offline nodes (known to be down or fails to acknowledge)
    • the coordinator can store a hinted handoff in “system.hints” table and it’s replayed when the target comes back online
    • could be applied to a whole DC going down, too
    • configurable in cassandra.yaml
      • hinted_handoff_enabled (default: true)
      • max_hint_window_in_ms (default: 3 hours): after this consecutive outage period, hints are no longer generated until the target node comes back online; a node offline for longer must be made consistent with "repair" or other operations
  • “nodetool getendpoints mykeyspace mytable $token_id”: token id can be looked up by “select token(id) from mytable”
Consistency level
  • write: how many nodes must acknowledge they received and wrote the request?
  • read: how many nodes must acknowledge by sending their most recent copy of the data?
  • tuning consistency
    • write all, read one – for system doing many reads
    • write quorum, read quorum
    • write one, read all – for system doing many writes
    • immediate consistency formula: (nodes_written + nodes_read) > replication_factor
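In cqlsh the consistency level is set per session and applies to each subsequent request (sketch; the users table is hypothetical):

```sql
CONSISTENCY QUORUM;  -- (RF / 2) + 1 replicas must respond

SELECT name FROM users WHERE id = 123;   -- quorum read

CONSISTENCY ONE;
INSERT INTO users (id, name) VALUES (124, 'Paul');  -- acked by one replica
```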
  • Cassandra tries to provide consistency at read time
  • As part of a read, a "digest query" is sent to replica nodes and nodes with stale data are updated
  • digest query: returns a hash to check the current data state
  • read_repair_chance: set as table property value between 0 and 1, to set the probability with which read repairs should be invoked on non-quorum reads – fixes data for the next query
  • nodetool repair
    • manual repair with "nodetool repair" is the last line of defense – you might want to run it every 7 days
    • repair = synchronizing data
    • when to run it
      • recovering failed node
      • bringing a down node back up
      • periodically on nodes with infrequent read data
      • periodically on nodes with write or delete activity
    • if run periodically, do so at least every gc_grace_seconds
      • tombstone garbage collection periods (default: 864000 = 10 days)
      • tombstone is a marker placed on a deleted column within a partition. failure to repair within the period can lead to deleted column resurrection
    • incremental repair – only repairs new data – data that was introduced to the system since the last repair
      • tools/bin/sstablemetadata and sstablerepairedset used for enabling incremental repair if you were using 2.0. no need if you start w/ 2.1.
      • recommended if using leveled compaction
      • run daily
      • avoid anticompaction by not using a partitioner range or subrange

Cassandra Developer Certification Study Notes – Tools

nodetool: command-line cluster management utility
  • status: display cluster state, load, host id, token
  • info: display node memory use, disk load, uptime, etc
  • ring: display node status and cluster ring state
cassandra-stress: benchmarks your cassandra instance
Cassandra Cluster Manager (CCM): Creates multiple nodes on local machine for dev and testing
Auto snapshot