System Design Notes

Requirements clarification

  1. Users/customers
  • Who will use the system?
  • How will the system be used?
  2. Scale - how to handle a growing amount of data
  • How many read queries per second?
  • How much data is queried per second?
  • How many video views are processed per second?
  • Can there be spikes in traffic?
  3. Performance - how fast
  • What is the expected write-to-read data delay?
  • What is the expected p99 latency for read queries?
  4. Cost - budget for the system
  • Should the design minimize the cost of development?
  • Should the design minimize the cost of maintenance?

Functional requirements - API

  1. Start with specific function names
  2. Make several iterations to generalize the API, as in the diagram below and the sketch that follows.

(diagram: functional_requirements-api)
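
A minimal sketch of those iterations (the function and parameter names here are illustrative, not taken from the diagram):

```python
# Iteration 1: start with a very specific function.
def count_view_event(video_id: str) -> None: ...

# Iteration 2: generalize the event type (view, like, share, ...).
def count_event(video_id: str, event_type: str) -> None: ...

# Iteration 3: generalize the operation (count, sum, average, ...).
def process_event(video_id: str, event_type: str, operation: str) -> None: ...

# Iteration 4: accept batches of events to reduce request overhead.
def process_events(events: list[dict]) -> None: ...
```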

Non-functional requirements

  1. Scalable (tens of thousands of video views per second)
  2. Highly performant (few tens of milliseconds to return total view counts for a video)
  3. Highly available (survives hardware/network failures, no single point of failure)
  4. Consistency
  5. Cost

Architecture

  1. Starting point

(diagram: start-architecture)

  2. Data

What we store

Individual event vs aggregated data in real-time

  • An individual event includes a timestamp, country, device type, operating system, etc.
  • When we aggregate data, we calculate a total count per time interval (e.g. 1 minute) and lose the details of each individual event (see the sketch after this list). Question - why do we have to lose detail for aggregated data? Is it done through Kinesis/Kafka?
| | Individual events | Aggregated data |
| --- | --- | --- |
| Pros | (1) Fast writes (2) Can slice and dice the data however we need (3) Can recalculate numbers if needed | (1) Fast reads (2) Data is ready for decision making |
| Cons | (1) Slow reads (2) Costly at large scale (many events) | (1) Can query only the way the data was aggregated (Question - why?) (2) Requires a data aggregation pipeline: we need to pre-aggregate data in memory before storing it in the database (3) Hard or even impossible to fix errors (why?) |
  • Ask the interviewer about the expected data delay: time between when the event happened and when it was processed. If the latency is no more than several minutes, we must aggregate data on the fly (stream data processing). If several hours are okay, then we can store raw events and process them in the background (batch data processing).
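
A minimal sketch of on-the-fly aggregation, assuming events are plain dicts (the field names and values are illustrative). Any attribute that is not part of the aggregation key (country, device) is discarded, which is exactly the detail lost by pre-aggregation:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Each raw event carries full detail.
events = [
    {"video_id": "v1", "timestamp": 1700000000, "country": "US", "device": "mobile"},
    {"video_id": "v1", "timestamp": 1700000030, "country": "DE", "device": "desktop"},
    {"video_id": "v1", "timestamp": 1700000075, "country": "US", "device": "mobile"},
]

# Aggregate to a count per (video, minute); country/device are dropped.
per_minute = defaultdict(int)
for e in events:
    minute = e["timestamp"] // 60 * 60  # truncate to the start of the minute
    per_minute[(e["video_id"], minute)] += 1

for (video, minute), count in sorted(per_minute.items()):
    print(video, datetime.fromtimestamp(minute, tz=timezone.utc), count)
```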

Where we store

  • Questions to ask
  1. How to scale writes?
  2. How to scale reads?
  3. How to make both writes and reads fast?
  4. How not to lose data in case of hardware faults and network partitions?
  5. How to achieve strong consistency? What are the tradeoffs?
  6. How to recover data in case of an outage?
  7. How to ensure data security?
  8. How to make it extensible for data model changes in the future?
  9. Where to run (cloud vs. on-premises data centers)?
  10. Cost of the solution?

SQL database (MySQL)

(diagram: mysql)

Scalability (Answer - partitioning) / Reliability (Answer - replicate data)

  • Processing service - the service that stores data in the database
  • Query service - the service that retrieves data from the database
  • Cluster proxy - knows about all databases and routes traffic to the correct shard (see the routing sketch after this list). With this layer, the processing and query services don't have to know everything about the database.
  • Configuration service inside the proxy: the proxy needs to know when a shard dies or becomes unavailable due to a network partition, and when a new shard has been added to the database cluster. Therefore, the configuration service health-checks all the database shards.
  • Shard proxy: caches query results, monitors database instance health, publishes metrics, and terminates queries that take too long to return
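
A minimal sketch of the cluster proxy's routing decision, assuming simple hash-based sharding on the video id (the shard names are placeholders; a real proxy would learn the live shard list from the configuration service):

```python
import hashlib

class ClusterProxy:
    """Route a query to a shard by hashing the video id."""

    def __init__(self, shards: list[str]):
        self.shards = shards  # e.g. ["db-shard-0", "db-shard-1", ...]

    def shard_for(self, video_id: str) -> str:
        digest = hashlib.md5(video_id.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

proxy = ClusterProxy(["db-shard-0", "db-shard-1", "db-shard-2"])
print(proxy.shard_for("video-123"))  # the same video always routes to the same shard
```

Note that plain modulo hashing moves most keys when shards are added or removed; consistent hashing is the usual fix.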

Availability

  • Master shard - writes go through the master shard
  • Read replicas (followers) - reads go through both the read replicas and the master shard
  • Put master shards and read replicas in different data centers, so that if a whole data center goes down, we still have a copy of the data available.
  • Eventual consistency - with the leader/follower paradigm, different users may see different total counts for a video. This inconsistency is temporary; over time all writes propagate to the replicas.

Consistency

  • Eventual consistency

Speed

  • Keep things in memory
  • Minimize disk reads

NoSQL database (Cassandra)

  • Cassandra's read and write throughputs increase linearly as new machines are added.
  • MongoDB is a document-oriented database and uses leader-based replication.
  • HBase is a column-oriented database and uses leader-based replication.

Availability

(diagram: quorum)

  • Gossip protocol - instead of a configuration service that knows everything about all nodes, each node exchanges state with at most 3 other nodes at a time; information about every node soon propagates through the cluster.
  • Quorum - the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system.
  • Quorum reads/writes - query 3 out of 5 nodes and use their votes to decide which data is the latest to read, or to confirm a write (see the sketch below).
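
A toy illustration of a quorum read (the replica contents and versions are made up): with N = 5 replicas, a write quorum W = 3 and a read quorum R = 3, every read quorum overlaps every write quorum (R + W > N), so at least one replica in the read set holds the latest value:

```python
replicas = [
    {"value": 41, "version": 6},
    {"value": 42, "version": 7},  # most recent write
    {"value": 42, "version": 7},
    {"value": 41, "version": 6},
    {"value": 40, "version": 5},
]

R = 3
responses = replicas[:R]  # pretend these 3 replicas answered first
latest = max(responses, key=lambda r: r["version"])
print(latest["value"])  # 42 -- the newest version among the quorum wins
```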

Consistency

  • Eventual consistency

Processing Service

Questions

  • How to scale?
  • How to achieve high throughput?
  • How not to lose data when processing node crashes?
  • What to do when database is unavailable or slow?

(diagram: aggregate)

  • The second option, keeping the counters in memory, is better
  • Question - is pull always better than push?

(diagram: aggregate2)

  • Checkpointing - if a queue consumer or processing node fails, another machine can pick up the work from the last checkpoint (see the sketch after this list).
  • Partitioning - events are partitioned so they can be processed in parallel on different machines.
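
A minimal sketch of the in-memory counters with periodic flush and checkpointing; the (offset, video_id) event shape and the flush interval are assumptions for illustration:

```python
from collections import defaultdict

counters = defaultdict(int)  # video_id -> views accumulated in memory
last_checkpoint = -1         # last offset whose effects are safely persisted

def flush(batch):
    print("writing aggregated counts to db:", dict(batch))  # stand-in for a DB write

def process(events):  # events: iterable of (offset, video_id) from a partition
    global last_checkpoint
    for offset, video_id in events:
        counters[video_id] += 1
        if (offset + 1) % 1000 == 0:   # every 1000 events: flush, then checkpoint
            flush(counters)
            counters.clear()
            last_checkpoint = offset   # a replacement node replays from here

process((i, "v1") for i in range(2500))
```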

Processing

Question - does streaming events involve video live streaming?

  • Partition consumer - converts the byte array into an actual object. Maintains TCP connections. A single-threaded component: if it were multi-threaded, checkpointing would become difficult and it would be hard to preserve the order of events.
  • Deduplication cache - the partition consumer uses a deduplication cache to drop duplicate events that arrive at the partition (see the sketch after this list).
  • Aggregator - aggregates view counts in an in-memory hashtable. Periodically it creates a new hashtable and sends the old one to the internal queue for further processing.
  • Internal queue - decouples aggregation of the events from processing of the events, so each can proceed at its own pace.
  • Database writer - a single-threaded or multi-threaded component that sends the aggregated data to the database.
  • Dead-letter queue (DLQ) - if the database becomes slow, the database writer can push messages to the DLQ; a separate process reads from this queue and sends the data to the database.
  • Local disk storage - an alternative to the DLQ: save undelivered messages on the disk of the processing service.
  • Embedded database - an event arrives with minimal information; to attach detailed information to it, use a local database inside the processing service. It has to be local to avoid remote calls.
  • State store - in case the in-memory aggregated data is lost, periodically save the in-memory state to durable storage. (Question - is this storage a database, or just any write to disk?)
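
A minimal sketch of the deduplication cache, assuming every event carries a unique id; recently seen ids live in a bounded cache with FIFO eviction, and duplicates are dropped before they reach the aggregator:

```python
from collections import OrderedDict

class DedupCache:
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.seen: OrderedDict[str, None] = OrderedDict()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self.seen:
            return True
        self.seen[event_id] = None
        if len(self.seen) > self.capacity:  # evict the oldest id when full
            self.seen.popitem(last=False)
        return False

cache = DedupCache()
for event_id in ["e1", "e2", "e1"]:
    print(event_id, "duplicate" if cache.is_duplicate(event_id) else "new")
```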

Data ingestion path

(diagram: Processing)

  • API Gateway - represents a single entry point into the video content delivery system.

Partitioner service client (question - is this the API gateway?)

  • Blocking vs. non-blocking IO (question - what are some use cases for blocking IO?)
  • Timeouts - connection timeout (how long a client is willing to wait for a connection to be established, e.g. 10 milliseconds) and request timeout (choose it based on the latency of the slowest 1 percent of requests)
  • Retries - exponential backoff (wait longer with every retry attempt) and jitter (adds randomness to retry intervals to spread out the load); see the sketch after this list
  • Circuit breaker pattern - track how many requests have failed and stop calling the downstream service once a threshold is crossed. It makes the system more difficult to test.
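
A minimal sketch of retries with exponential backoff and full jitter (the max_attempts, base, and cap values are illustrative):

```python
import random
import time

def call_with_retries(call, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                    # give up after the last attempt
            backoff = min(cap, base * 2 ** attempt)      # wait longer with every retry
            time.sleep(random.uniform(0, backoff))       # jitter spreads out the load

# Usage (partitioner_client.send is a hypothetical call):
# call_with_retries(lambda: partitioner_client.send(event))
```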

Load balancer

  • DNS - maintains the list of domain names and translates them to IP addresses

Partitioner service and partitions

  • Partition strategy - to avoid hot partitions (e.g. a very popular video), the partition key can include the event time; see the sketch below
  • Question - what is the client-side service discovery pattern?

(diagram: Processing)
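
A sketch of the idea, assuming md5-based partition assignment (the partition count and key scheme are illustrative): hashing on the video id alone pins a popular video to one partition, while including the event minute in the key moves its traffic across partitions over time:

```python
import hashlib
import time

NUM_PARTITIONS = 16

def partition_by_video(video_id: str) -> int:
    # A very popular video always lands on one partition -> hot partition.
    return int(hashlib.md5(video_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def partition_by_video_and_time(video_id: str, epoch_seconds: int) -> int:
    # Adding the event minute spreads one video's traffic across
    # partitions minute by minute, smoothing out the hot spot.
    key = f"{video_id}:{epoch_seconds // 60}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

now = int(time.time())
print(partition_by_video("very-popular-video"))
print(partition_by_video_and_time("very-popular-video", now))
print(partition_by_video_and_time("very-popular-video", now + 60))  # next minute, likely a different partition
```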

Rollup count

  • Per-minute counts are rolled up into hourly counts, and hourly counts into daily counts; older data is kept at coarser granularity (see the sketch below)
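
A minimal sketch of the rollup, assuming counts are keyed by (video_id, epoch-second bucket); the sample data is made up:

```python
from collections import defaultdict

minute_counts = {("v1", 1700000040): 3, ("v1", 1700000100): 5, ("v1", 1700003700): 2}

def roll_up(counts, bucket_seconds):
    rolled = defaultdict(int)
    for (video_id, ts), count in counts.items():
        # Truncate each timestamp to the start of the coarser bucket.
        rolled[(video_id, ts // bucket_seconds * bucket_seconds)] += count
    return dict(rolled)

hour_counts = roll_up(minute_counts, 3600)   # 1 minute -> 1 hour
day_counts = roll_up(hour_counts, 86400)     # 1 hour -> 1 day
print(hour_counts)
print(day_counts)
```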

Hot storage vs. Cold storage

Hot storage - fast access to frequently used data. Cold storage - slower, cheaper access to rarely used data.

Tech Stack

(diagram: Processing)