Apache Cassandra
via Docker
Chris Ballance
Diligent Corporation
@ballance
What we’ll cover
Docker Fundamentals
CAP Theorem
Cassandra Fundamentals
Demo of Cassandra deployed with Docker
Docker
Hosting & Coordination
Docker Hub
“Git for Virtual Machines”
https://hub.docker.com/
Available Docker Images
• Ubuntu, CentOS, Debian, Fedora
• Cassandra, MySQL, MongoDB
• Node, Java, Erlang, Ruby, Rails
• WordPress
• Redis
• Hipache, NGINX
• Create your own!
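A minimal sketch of finding and running one of these images (image names as published on Docker Hub; tags omitted, so Docker pulls :latest):

$ docker search cassandra     # find images on Docker Hub
$ docker pull ubuntu          # download an image locally
$ docker run -it ubuntu bash  # start a container from the image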
CAP THEOREM
Project Delivery Triangle
Good / Fast / Cheap (choose two)
CAP Theorem
Consistent / Available / Partition Tolerant (choose two)
Partition Tolerance is a myth!
We can guarantee Consistency, but not Availability: we may time out or fail to return anything.
We can guarantee Availability, but our data may not be Consistent with other nodes.
Our business needs will drive whether we choose Consistency or Availability.
Implementations
AP (Available + Partition Tolerant): Cassandra, CouchDB
CP (Consistent + Partition Tolerant): MongoDB, Bigtable (GFS)
CA (Consistent + Available): RDBMS, e.g. SQL Server
A contrived example for Consistency
(Two nodes, A and B, each starting with state S0.)
User writes S1 to Node B.
User queries Node A, but Nodes B & A have not yet synced.
The query is blocked until B syncs with A.
Once B syncs with A, the query on A is unblocked and returns S1 as expected.
* The query could potentially time out while waiting.
A contrived example for Availability
(The same two nodes, A and B, each starting with state S0.)
User writes S1 to Node B.
User queries Node A for S1.
The query immediately returns the current state of A (still S0), which is not consistent with B.
A later query of A yields the S1 previously written to B. Eventual consistency has been achieved.
Patterns that help applications cope with eventual consistency:
Idempotence
Event Sourcing
Local Caching
Queueing
Cassandra
History
Developed for Facebook Inbox Search
Created by one of the authors of Amazon Dynamo
Released as Open Source in 2008
Became an Apache Incubator project in 2009
Graduated to an Apache Top-Level Project in 2010
Who is using Cassandra
in production today?
Twitter
Netflix
Credit Suisse
Cisco
Many more…
http://PlanetCassandra.org/companies/
Benefits
Linear Scalability
Data sets can be larger than available memory
Multi-master
Built-in support for handling multiple data centers
Decentralized & Distributed - No single point of failure
Integrated caching
Consistency options can be tuned through configuration
Supports MapReduce
Familiar query syntax - CQL (Cassandra Query Language); see the sketch after this list
Designed for sparse loading of loosely typed data
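A minimal sketch of CQL's SQL-like syntax; the demo keyspace and users table here are made up for illustration:

cqlsh> CREATE KEYSPACE demo
   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE demo.users (id uuid PRIMARY KEY, name text);
cqlsh> INSERT INTO demo.users (id, name) VALUES (uuid(), 'chris');
cqlsh> SELECT name FROM demo.users;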
Linear Scalability (chart)
Challenges
JOINs are not supported! (not unique to Cassandra)
Not for financial data (eventual consistency)
Tooling is not yet as mature as Oracle's or SQL Server's
Why NoSQL?
Multiple persistence strategies
Use the right tool for the job. Sometimes more than one tool is appropriate, and NoSQL can work in conjunction with SQL solutions. For example, you might keep storing transactions in a relational database while building your new user social-graph data in a NoSQL store. Can't decide on a local database for a mobile app? Sometimes the best route is to just persist the whole thing as JSON on disk until you need something faster.
Total Cost of Ownership (TCO)
Let's face it, SQL Server is expensive. If you need it and can fully justify the cost, it might be right for you. But as we've discussed, an RDBMS can be a crutch and a default choice for persistence layers. Defaults are just that: defaults. They're a catch-all that is rarely the best choice unless you're solving a generic problem.
Write (consistency two)
The client sends the write to a coordinator node.
The coordinator forwards the write to the replicas that own that data.
One replica confirms; another times out.
The coordinator sends the write to another replica, which also confirms.
With two confirmations, consistency level TWO is satisfied and the coordinator reports success to the client.
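A minimal cqlsh sketch of a write at consistency level TWO (CONSISTENCY is a cqlsh shell command; demo.users is the hypothetical table from the CQL sketch earlier):

cqlsh> CONSISTENCY TWO;
Consistency level set to TWO.
cqlsh> INSERT INTO demo.users (id, name) VALUES (uuid(), 'ballance');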
Read (consistency two)
The client sends the read to a coordinator node.
The coordinator queries two replicas, as required by consistency level TWO.
One replica turns out to hold inconsistent (stale) data.
The coordinator resolves the conflict in favor of the most recent version, returns it to the client, and the read succeeds.
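And the matching read at consistency level TWO, again a sketch against the hypothetical demo.users table:

cqlsh> CONSISTENCY TWO;
Consistency level set to TWO.
cqlsh> SELECT id, name FROM demo.users;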
On-premises + Cloud
Consistency Modes
ALL - Every replica must acknowledge the data
QUORUM - A majority of replicas must acknowledge the data
ONE - At least one replica must acknowledge the data
TWO - At least two replicas must acknowledge the data
THREE - At least three replicas must acknowledge the data
ANY - Any node may accept the write, even just as a hint (writes only)
EACH_QUORUM - A quorum of replicas in each datacenter must acknowledge
LOCAL_QUORUM - A quorum of replicas in the coordinator's datacenter must acknowledge
*Quorum = (replication_factor / 2) + 1, using integer division
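Plugging numbers into the quorum formula (integer division, so fractions are dropped):
replication_factor = 3 → quorum = (3 / 2) + 1 = 1 + 1 = 2
replication_factor = 6 → quorum = (6 / 2) + 1 = 3 + 1 = 4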
Calculating Consistency
R + W > N guarantees strong consistency: every read overlaps at least one replica that acknowledged the latest write
R —> Read consistency level (number of replicas that must answer a read)
W —> Write consistency level (number of replicas that must acknowledge a write)
N —> Number of replicas of the data (replication factor)
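A worked example, assuming N = 3 replicas:
QUORUM writes + QUORUM reads: W + R = 2 + 2 = 4 > 3, so every read overlaps the latest write (strong consistency)
ONE writes + ONE reads: W + R = 1 + 1 = 2 ≤ 3, so a read can miss the latest write (eventual consistency)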
Data Replication
SimpleStrategy - Single data center
NetworkTopologyStrategy - Recommended strategy for multiple data centers; provides Cassandra with info about the location of nodes by rack and datacenter
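A minimal sketch of both strategies in CQL (the keyspace names and the datacenter names DC1/DC2 are made up):

cqlsh> CREATE KEYSPACE ks_one_dc
   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> CREATE KEYSPACE ks_two_dc
   ... WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};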
Deploying Cassandra with Docker
Demo
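A minimal sketch of what the demo walks through, assuming the official cassandra image on Docker Hub (the container names cass1/cass2 are arbitrary):

$ docker pull cassandra:3.0
$ docker run --name cass1 -d cassandra:3.0
$ docker run --name cass2 -d \
    -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' cass1)" \
    cassandra:3.0
$ docker exec -it cass1 nodetool status   # both nodes should report UN (Up/Normal)
$ docker exec -it cass1 cqlsh             # open a CQL shell against the cluster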
Additional Resources
Apache Cassandra
http://cassandra.apache.org/
Docker
https://www.docker.com/what-docker
Docker Hub
https://hub.docker.com/explore/
Cassandra for Developers - Paul O’Fallon
https://www.pluralsight.com/courses/cassandra-developers
DBA’s Guide to NoSQL: Apache Cassandra (Free eBook)
http://is.gd/CassandraFreeEbook
Cassandra 3.0 - DataStax PDF
http://docs.datastax.com/en/cassandra/3.0/pdf/cassandra30.pdf
Questions?


Editor's Notes

  • #7 Before I get into CAP theorem, I’d like to start off with another very similar triangle you’re already familiar with…
  • #10 CAP Theorem is similar. We have three competing interests
  • #11 Choose two
  • #12 Consistency - Always reads the most recent write - One version of the truth - But, if nodes are partitioned, we cannot guarantee consistency. - We can wait until we’re consistent, but in the meantime we may timeout and cannot continue.
  • #13 Every (unfailed) node always returns query results Eventually consistent No errors or timeouts, but data may be stale If partitioned, a node can return its current state, but it may be an old version of the truth
  • #14 Are all nodes always consistent? What is the lag time between node updates? One version of the truth Transactions are atomic
  • #15 -Be dramatic… PARTITION TOLERANCE IS A MYTH!!1! -This is just the nature of networks: nothing is guaranteed. -So really, we just have two competing interests, Consistency and Availability. -Choosing which to favor is not an entirely technical decision; it's something to discuss with your stakeholders, and it depends on the type of app you’re building.
  • #16 One choice would be to favor Consistency. While this isn’t as popular with most workflows, it has its place.
  • #17 -Another choice would be to favor Availability. You’ll find this is very common in social media, and media in general.
  • #20 Here’s an example where we’re favoring consistency. We have a stick guy with data to write, and two nodes to write to and read from.
  • #21 S1 is going to be our payload, and we send it to node B.
  • #22 After we’ve sent our write (S1) to node B, we query Node A for the value of S. We might not even know that we’re being routed to node A for the read when we just wrote to node B. This can be even more confusing to even savvy users if this redirection is fully transparent behind a load balancer (and it typically is).
  • #23 Stick man is not a patient fellow and he’s tapping his foot, waiting for his query to return. He’s probably just going to assume the system is broken after a few seconds…
  • #24  Finally, Node A is able to synchronize with Node B to get S1 and return it to Stick guy. There’s a real risk of starving Stick guy, and he’s already looking mighty thin.
  • #25 Now let’s look at a different approach. This time, we will favor Availability over Consistency.
  • #26 Our buddy, Stick guy, again sends S1 to Node B.
  • #27 Now he queries Node A.
  • #28 He immediately gets a reply, but it is stale data, since Nodes A & B have not yet synchronized. We don’t have to wait for the query to return, but we do get stale data that is older than what we have already sent out.
  • #29 On the later query, stick guy gets back the value S1 that he is expecting, consistent with what he previously sent to Node B.