URL Shortener System Design
STEP - 1
Functional Req.
Given a Long URL, generate a unique short URL.
Given a short URL, redirect to a long URL
Custom URL (user should be able to pick a custom short link)
TTL (user should be able to specify an expiration time)
Non-Functional Req.
The system should be highly available.
Minimum latency
Non-predictable short URLs
Analytics
Design Constraints
Number of short URLs generated per second
Length of the short URL; let's start with 7
Char set in the short URL: A-Z, a-z, 0-9
STEP - 2
Microservice Design Decision
URL Shortener Microservices
All requirements can be handled by the same team
A single microservice - A depth-oriented problem
STEP - 3
Drawing logic architecture
STEP - 4A
TIERS
By default we can multiply every tier by 3X (replication)
App server tier, In-memory Tier, Storage Tier
Data Model
Short URL/unique ID, Long URL, TTL, creation time
K: short URL/unique id, V: the rest of the record (long URL, TTL, creation time)
APIs
create(Long URL) -> create(V)
read(short URL) -> read(K)
Similar to CRUD APIs
create (K,V), read(K), update(K,V), delete(K)
How to organize data
Hash Map - In memory
Row Oriented with primary key index - in storage
Algorithms - not a dumb K-V store
APIs :
create (Long URL)
Request comes to the app-server tier
App server assigns a unique integer id to the long URL
The write is sent directly to the storage tier: store unique id -> long URL there
The cache tier is bypassed on writes (write-around caching); the cache is populated on the read path
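The create path above can be sketched as follows; `storage`, `cache`, and the in-process id counter are hypothetical stand-ins for the real tiers:

```python
import itertools

# Hypothetical in-process stand-ins for the storage and in-memory tiers.
storage = {}                     # storage tier: unique id -> long URL
cache = {}                       # cache tier (untouched on the write path)
id_counter = itertools.count(1)  # unique integer id generator

def create(long_url):
    # App server assigns a unique integer id to the long URL.
    uid = next(id_counter)
    # Write-around caching: the write goes only to the storage tier;
    # the cache is populated later, on the read path.
    storage[uid] = long_url
    return uid
```

Note that after a `create` call the cache stays empty, which is the point of write-around caching: freshly written (possibly never-read) keys do not evict hot ones.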
read (short URL)
App server receives the request with the short URL
Converts it back to the unique id
Sends the lookup to the cache tier
Cache tier checks its hash map and returns the record on a cache hit
If the URL has expired or the user does not have permission, return error 401
On a cache miss, go to the storage tier and retrieve the record
Keep the retrieved K-V in the cache, evicting with LRU
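A minimal sketch of this read path, assuming hypothetical `storage`/`cache` dicts and a per-record `expires_at` timestamp derived from the TTL in the data model (LRU eviction is elided):

```python
import time

# Hypothetical stand-ins for the tiers; record 2 is already expired.
storage = {
    1: {"long_url": "https://example.com/a", "expires_at": time.time() + 3600},
    2: {"long_url": "https://example.com/b", "expires_at": time.time() - 1},
}
cache = {}

def read(uid):
    record = cache.get(uid)
    if record is None:
        # Cache miss: fall through to the storage tier.
        record = storage.get(uid)
        if record is None:
            return None, 404
        cache[uid] = record  # keep the K-V in cache (LRU eviction elsewhere)
    if record["expires_at"] < time.time():
        return None, 401     # expired link or missing permission, per the design
    return record["long_url"], 302  # redirect to the long URL
```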
Logical Block Diagram
Algorithm Encoding System
We need an encoding system to encode a long URL into a short URL
Convert a unique id to a 7-character string
The char set [A-Z, a-z, 0-9] gives 62 characters; adding '+' and '/' brings it to 64, so we can use Base64 encoding
With Base64 encoding and 7 positions, 64^7 ≈ 4 trillion unique short URLs are possible, which would suffice for our system
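A sketch of the id-to-short-URL conversion using the 64-character alphabet above; repeated division by 64 produces the base-64 digits (function names are illustrative):

```python
# 64-character alphabet from the design: A-Z, a-z, 0-9, plus '+' and '/'.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode(uid, length=7):
    # Convert the unique integer id into `length` base-64 digits.
    chars = []
    for _ in range(length):
        uid, rem = divmod(uid, 64)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode(short):
    # Reverse mapping: short string back to the unique integer id.
    uid = 0
    for c in short:
        uid = uid * 64 + ALPHABET.index(c)
    return uid
```

The round trip `decode(encode(uid))` recovers the original id, which is what the app server relies on in the read path.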
Key Generation Service - (KGS)
We can have a separate Key Generation Service (KGS) that generates random seven-letter strings beforehand and stores them in the database.
Whenever we want to shorten a URL, we will take one of the already-generated keys and use it.
Simple and fast.
KGS will make sure all the keys inserted into key-DB are unique.
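A minimal KGS sketch, using a Python set as a stand-in for the key-DB; the set's membership check is what guarantees uniqueness here (class and method names are illustrative):

```python
import secrets

# Alphabet without '+' and '/', since random keys need not be Base64-decodable.
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789")

class KeyGenerationService:
    """Pre-generates unique 7-char keys; the set stands in for the key-DB."""

    def __init__(self, batch_size=1000):
        self.available = set()
        while len(self.available) < batch_size:
            key = "".join(secrets.choice(ALPHABET) for _ in range(7))
            self.available.add(key)  # set membership guarantees uniqueness

    def take_key(self):
        # Shortening a URL just pops an already-generated key: simple and fast.
        return self.available.pop()

    def release_key(self, key):
        # A purged link's key can be returned to the pool for reuse.
        self.available.add(key)
```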
Purging Or DB Cleanup
Whenever a user tries to access an expired link, we can delete the link and return an error to the user.
A separate Cleanup service can run periodically to remove expired links from our storage and cache.
We can set a default TTL (e.g. 2 years)
After removing an expired link, we can put the key back in the key-DB to be reused.
STEP - 4B
Capacity Req - How to scale
2-3 billion short links are created every year, which works out to ~73 write qps.
20 billion clicks per month ≈ 7,700 read qps (click a short URL, redirect to the long URL).
That is roughly a hundred times the number of new mappings inserted into the hash table.
Reads : Writes ≈ 100 : 1, so this is a read-heavy system.
If more detail is needed: (
We calculated ~73 qps for writes; round up to Q = ~100 qps.
We plan to store data for at least 3 years.
3 yrs = 365 x 3 days = ~1000 days
# of seconds per day = 24 x 60 x 60 = 86,400 = ~100,000 as a roughly friendly number
# of seconds in 3 years = 1000 x 100,000 = 10^8
# of K-V pairs that will be inserted = Q x 10^8 = 10^2 x 10^8 = 10^10 = 10 billion
We need to plan for 10 billion K-V pairs stored over 3 years.
Long URL = 2048 chars = ~2 KB of space
Short URL = 7 chars = 7 bytes, so a K-V pair is still ~2 KB in total
2 KB = 2000 bytes = 2 x 10^3
Total: 10 billion (10^10) x 2 x 10^3 bytes = 2 x 10^13 bytes = 20 TB
)
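The arithmetic above, written out as code so the rounded numbers can be checked:

```python
# Back-of-the-envelope check of the capacity numbers.
writes_per_sec = 100                 # ~73 write qps rounded up to Q = 10^2
seconds_in_3_years = 1000 * 100_000  # ~1000 days x ~100,000 s/day = 10^8
pairs = writes_per_sec * seconds_in_3_years  # 10^10 = 10 billion K-V pairs
bytes_per_pair = 2 * 10**3                   # ~2 KB, dominated by the long URL
total_bytes = pairs * bytes_per_pair         # 2 x 10^13 bytes = 20 TB
machines = total_bytes // (2 * 10**12)       # 2 TB disk per commodity machine
replicas = machines * 3                      # default replication factor of 3
```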
Need to scale for storage tier? Yes
K (short URL ) + V (Long Url ) = ~ 2 KB
We can think to store data for at least 3 years.
Roughly 20 TB of data in 3 years
Rule of thumb: 1-2 TB disk space per commodity machine.
20 TB / 2 TB = 10 commodity machines each machine has 2 TB of disk space.
With the default replication factor of 3: 10 x 3 = 30 sharded & replicated storage servers
Need to scale for throughput? No
It is a simple K-V store with low response times; there is no heavy computation.
The default 3 app-tier replicas would be enough.
Need to scale for cache tier? Yes
Let's say we store 20% of the data in the cache: 20% of 20 TB = 4 TB (2 billion K-V pairs in the cache)
A commodity cache server has ~128 GB of RAM: 4 TB / 128 GB = ~30 partitions
30 x 3 replicas = 90 sharded & replicated cache servers
Need to scale API parallelization? No
Availability? Replication factor
Hotspot? No
Geo-location? No
4C Architecture Layout
App Tier :
LB: Use IP Anycast for DNS load balancing. The load balancers typically share the same IP address. When one LB fails and goes passive, the other becomes active; they continuously health-check each other.
App tier servers are stateless, so we don't need to keep server state or session details. This keeps transaction handling fast and makes the implementation straightforward.
Requests are distributed round-robin: this method cycles through the list of servers, sending each new request to the next server; when it reaches the end of the list, it starts over at the beginning.
Req1 -> Server A, Req2 -> Server B; at the end of the list, start again at A
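The round-robin behavior just described, as a minimal sketch (`RoundRobinBalancer` is an illustrative name, not a real library class):

```python
import itertools

class RoundRobinBalancer:
    """Cycles through the server list; wraps back to the start at the end."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        # Each new request goes to the next server in the rotation.
        return next(self._cycle)

lb = RoundRobinBalancer(["ServerA", "ServerB"])
# Req1 -> ServerA, Req2 -> ServerB, Req3 wraps back to ServerA
```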
Cluster manager: servers send heartbeat requests to a separate cluster manager, which informs the load balancer, so the LB always knows which servers are currently healthy.
Cache or Storage Tier:
DB layer: (K, V) -> partition/shard id -> servers
The config store maps shard id -> servers
Config store: keeps track of how many servers are active and where each shard's mapping lives
Which shards belong to which servers?
e.g. K: 4 -> Server5, Server37, Server25
This information is passed to the LB, which then routes to Server5, Server37, or Server25
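A sketch of that config-store lookup, assuming an illustrative shard count of 8 and horizontal sharding by the unique integer id:

```python
NUM_SHARDS = 8  # assumed shard count for illustration

# Config store contents: shard id -> replica servers,
# matching the example "K: 4 -> Server5, Server37, Server25".
shard_map = {
    4: ["Server5", "Server37", "Server25"],
}

def servers_for_key(uid, shard_map=shard_map, num_shards=NUM_SHARDS):
    # Horizontal sharding by key: the unique integer id picks the shard,
    # and the config store maps the shard to its replica servers.
    shard_id = uid % num_shards
    return shard_map.get(shard_id, [])
```

Both key 4 and key 12 land on shard 4, so the LB can route either request to any of that shard's three replicas.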
SHARDING
● Sharding
Horizontal sharding by key
Shard 0 → Servers A, C, E (replication factor 3)
● Placement of shards in servers
○ Consistent hashing
Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It allows us to distribute data across a cluster in such a way that will minimize reorganization when nodes are added or removed. Hence, the caching system will be easier to scale up or scale down.
In Consistent Hashing, when the hash table is resized (e.g. a new cache host is added to the system), only ‘k/n’ keys need to be remapped where ‘k’ is the total number of keys and ‘n’ is the total number of servers. Recall that in a caching system using the ‘mod’ as the hash function, all keys need to be remapped.
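A minimal consistent-hash ring with virtual nodes, illustrating the ~k/n remapping property described above (the class name and vnode count are assumptions for this sketch):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes per server."""

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash point, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Deterministic hash onto the ring's key space.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self._points, self._hash(str(key))) % len(self._points)
        return self._ring[idx][1]
```

Adding a fourth server to a three-server ring remaps only roughly a quarter of the keys, whereas `hash(key) % n` would remap nearly all of them.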