Lecture 10 Distributed File Systems
Introduction to Distributed FS
- File system that is shared by distributed clients
- Communication through shared files
- Shared data remains available for a long time (persistent)
- Distributed FS is the basic layer for distributed systems and applications
- Usually in a Client/Server model
- Clients access files and directories that server provides
- Servers allow clients to perform operations on files/directories:
- Add
- Remove
- Read
- Write
- Servers can provide different views of files to different clients (e.g. different access control attributes)
Challenges
- Transparency:
- Location
- Migration
- Replication
- Concurrency
- Flexibility:
- Servers can be added/replaced without impacting the production environment
- Support for multiple underlying file systems (abstracts away the implementation of file system on specific servers/clients).
- Dependability:
- Consistency (conflicts with replication and concurrency)
- Security (users may have different access rights on clients/network transmission)
- Fault Tolerance (server crash, availability of files)
- Performance:
- Requests may be distributed across servers
- Combining servers allows high storage capacity
- Scalability
- Handle increasing files/users
- Growth over geographic and administrative areas (how do you map users to file permissions etc.)
- Growth of storage space
- No centralised naming service (who is responsible for naming?)
- No centralised locking, no centralised file store
Client Perspective
Ideally, the client would perceive remote files the same as local files (transparency).
- Standard File Service Interface:
- File: uninterpreted sequence of bytes
- Attributes: owner, size, dates, permissions
- Protection: Access Control Lists or capabilities
- Immutable files: simplifies caching and replication
- Upload/Download model vs. Remote Access model
- Upload/Download: check out entire file, and upload changed file
- Remote Access: Client performs remote operations on file server
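To make the two models concrete, here is a minimal in-memory sketch (all class and method names are invented for illustration): in the Upload/Download model the whole file travels and is edited locally, while in the Remote Access model each individual operation is shipped to the server.

```python
class ToyServer:
    def __init__(self):
        self.files = {}                            # path -> bytes

    # Upload/Download model: whole files travel between client and server.
    def download(self, path):
        return self.files[path]

    def upload(self, path, data):
        self.files[path] = data

    # Remote Access model: individual operations are executed at the server.
    def read(self, path, offset, length):
        return self.files[path][offset:offset + length]

    def write(self, path, offset, data):
        old = self.files.get(path, b"")
        self.files[path] = old[:offset] + data + old[offset + len(data):]

server = ToyServer()
server.upload("/doc.txt", b"hello world")

# Upload/Download: check the file out, edit the local copy, upload it whole.
local = bytearray(server.download("/doc.txt"))
local[0:5] = b"HELLO"
server.upload("/doc.txt", bytes(local))

# Remote Access: the client ships the operation; the server applies it in place.
server.write("/doc.txt", 6, b"WORLD")
print(server.download("/doc.txt"))                 # b'HELLO WORLD'
```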
File Access Semantics
- UNIX Semantics:
- A `read` after a `write` returns the value just written
- When two `write`s follow in quick succession, the second persists
- Caches are needed for performance (write-through cache is expensive)
- UNIX semantics are too strong for distributed file systems (caching is too hard)
- Session Semantics:
- Essentially an Upload/Download model
- Changes to an open file are only locally visible
- When a file is closed, changes are propagated to the server (see the sketch at the end of this section)
- Merge conflicts (simultaneous writes)
- Parent/child processes can’t share file pointers if they are running on different machines
- Immutable file semantics:
- Only allowed to `create` and `read` files (can’t `write`)
- A “`write`” is achieved via `read`, `create`, and `remove`/`rename` of whole files
- Directories can be updated (move, remove, rename etc.)
- Race condition when two clients replace the same file
- How to handle readers of a file when it’s replaced?
- Atomic Transaction semantics:
- A sequence of file manipulations is executed indivisibly
- Two transactions can never interfere
- (this is the standard semantics for databases)
- Expensive to implement
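To illustrate session semantics in particular, here is a minimal, purely local sketch of write-on-close (all names are invented): each session works on a private copy, nothing is visible to others until `close()`, and when two sessions close the same file the last close silently wins, which is exactly the merge-conflict problem noted above.

```python
class Server:
    def __init__(self):
        self.files = {}                            # path -> bytes

class Session:
    """Session semantics: changes are local until close(), then propagated."""
    def __init__(self, server, path):
        self.server, self.path = server, path
        self.data = bytearray(server.files.get(path, b""))   # private working copy

    def write(self, offset, chunk):
        self.data[offset:offset + len(chunk)] = chunk          # only locally visible

    def close(self):
        self.server.files[self.path] = bytes(self.data)        # propagate on close

server = Server()
server.files["/shared"] = b"aaaa"

a = Session(server, "/shared")
b = Session(server, "/shared")
a.write(0, b"AA")
print(server.files["/shared"])    # b'aaaa' - a's change is not visible yet
a.close()
b.write(2, b"BB")
b.close()                         # b started from the old copy: a's update is lost
print(server.files["/shared"])    # b'aaBB' - last close wins (merge conflict)
```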
Server Perspective
- Design: What semantics are going to be used?
- Design depends on the use
- Unix Use (1980’s study):
- Most files are small
- Reading is much more common than writing
- Access is usually sequential
- Most files have short lifetime (e.g. temp files)
- File sharing is unusual (most processes only use a few files, and don’t share them)
- Distinct file classes with different properties exist (executables, documents etc)
- Is this still valid today?
- It depends on the use case
- There are many different use cases for distributed FS now
- Varying use cases:
- Big file system, many users
- High performance
- Fault tolerance
- Stateless vs. Stateful Servers
- Advantages of Stateless Servers
- Fault Tolerance
- No `open`/`close` calls needed
- No server space needed for tables
- No limits on number of open files
- No problems if server or client crashes
- Advantages of Stateful Servers
- Shorter request messages
- Better performance
- Read-ahead easier
- File locking is possible
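The request-size point can be made concrete with two hypothetical message shapes (neither is a real protocol): a stateless server needs every request to be self-describing, while a stateful server can hand out a short handle at `open` time and remember the rest.

```python
from dataclasses import dataclass

# Stateless server: each request carries everything needed to serve it, so the
# server keeps no per-client tables and a crash on either side loses nothing.
@dataclass
class StatelessRead:
    path: str          # full file identification in every request
    offset: int        # the client tracks its own file position
    length: int

# Stateful server: the client opens the file once and then sends a small
# handle; the server remembers the path and current offset, which also makes
# read-ahead and file locking possible.
@dataclass
class OpenRequest:
    path: str

@dataclass
class StatefulRead:
    handle: int        # refers to state the server created at open time
    length: int        # offset is implicit in the server's open-file table

print(StatelessRead("/home/alice/notes.txt", offset=4096, length=1024))
print(StatefulRead(handle=7, length=1024))
```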
- Caching
- There are three locations that caching can occur:
- Main memory of the server (easy, transparent)
- Disk of the client
- Main memory of the client (process local, kernel or dedicated cache process)
- Cache consistency:
- No UNIX semantics without centralised control
- Plain write-through is too expensive
- Alternatives: delay `write`s and agglomerate multiple `write`s
- Write-on-close, possibly with delay (file may be deleted)
- Invalid cache entries may be accessed if server is not contacted whenever a file is opened
- Replication
- Prevent data loss
- Protect system against down time of a single server
- Distribute workload
- Designs:
- Explicit replication: The client explicitly writes files to multiple servers (not transparent).
- Lazy file replication: Server automatically copies files to other servers after file is written.
- Group file replication: `write`s go simultaneously to a group of servers
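As a toy illustration of group file replication (plain function calls stand in for network messages; all names are invented), a write is sent to every server in the group, after which a read can be served by any replica:

```python
import random

class FileServer:
    def __init__(self, name):
        self.name, self.files = name, {}

    def store(self, path, data):
        self.files[path] = data

replicas = [FileServer("s1"), FileServer("s2"), FileServer("s3")]

def group_write(path, data):
    # Group file replication: the write goes to all servers in the group.
    for server in replicas:
        server.store(path, data)

def read_any(path):
    # Every replica holds the data, so reads can be spread across the group.
    return random.choice(replicas).files[path]

group_write("/etc/motd", b"hello")
print(read_any("/etc/motd"))      # b'hello', whichever replica answers
```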
Case Studies
NFS
Network File System
- First developed by Sun
- Fits nicely into the Unix idea of mount points
- Does NOT implement Unix semantics
- So even though it looks like a normal file, it isn’t
- Multiple clients and servers (single machine can be client and server)
- Stateless server (this changed in version 4 to reduce network traffic etc.)
- File locking through separate server
- No replication
- Uses remote procedure calls (RPC) for communication
- Caching: local file copies
- Consistency through polling and timestamps
- Asynchronous update of file after close
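A rough sketch of “consistency through polling and timestamps” (not the real NFS protocol, and a real client only polls every few seconds rather than on every read): the client remembers the modification time it last saw and refetches the file when the server reports a newer one.

```python
import time

class NfsServer:
    def __init__(self):
        self.data, self.mtime = {}, {}

    def write(self, path, data):
        self.data[path] = data
        self.mtime[path] = time.time()

    def getattr(self, path):          # cheap "poll": attributes only
        return self.mtime[path]

    def read(self, path):             # expensive: full data transfer
        return self.data[path]

class NfsClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}               # path -> (mtime, data)

    def read(self, path):
        remote_mtime = self.server.getattr(path)          # poll the timestamp
        cached = self.cache.get(path)
        if cached and cached[0] == remote_mtime:
            return cached[1]                              # cache still valid
        data = self.server.read(path)                     # refetch and recache
        self.cache[path] = (remote_mtime, data)
        return data

server = NfsServer()
server.write("/f", b"v1")
client = NfsClient(server)
print(client.read("/f"))   # b'v1', fetched from the server
print(client.read("/f"))   # b'v1', served from the client cache
server.write("/f", b"v2")
print(client.read("/f"))   # b'v2', timestamp changed so the cache is refreshed
```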
AFS
Andrew File System, successor is Coda
- Developed by CMU in 1980’s
- Idea was to develop a campus-wide file system (scalability was significant factor)
- Global name space for file system
- Unix API
- Gives Unix semantics for processes on one machine, but globally it uses write-on-close semantics
- Architecture:
- Client: runs a user-level process (the `venus` AFS daemon)
- Clients cache on local disk
- Group of trusted servers (vice)
- Scalability:
- Servers serve entire files (clients cache files)
- Servers invalidate cached files with callbacks (stateful servers track all client caches; sketched below)
- Clients do not validate cache (except on first use after boot)
- This means there is very little cache traffic
- Doesn’t support replication
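The callback mechanism can be sketched as follows (invented names; real AFS works with callback promises that can expire): the server records which clients cache a file and notifies them when it changes, so clients normally use their cache without contacting the server at all.

```python
class ViceServer:
    """Stateful AFS-style server: tracks which clients cache which files."""
    def __init__(self):
        self.files = {}
        self.callbacks = {}                        # path -> set of caching clients

    def fetch(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)   # register a callback
        return self.files[path]

    def store(self, path, data, writer):
        self.files[path] = data
        # Break the callback promise for every other caching client.
        for client in self.callbacks.get(path, set()) - {writer}:
            client.invalidate(path)
        self.callbacks[path] = {writer}

class VenusClient:
    """Client-side cache manager (the role of the venus daemon)."""
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, path):
        if path not in self.cache:                 # cache is valid until a callback
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        self.server.store(path, data, self)        # write-on-close, simplified

    def invalidate(self, path):                    # callback from the server
        self.cache.pop(path, None)

server = ViceServer()
server.files["/afs/doc"] = b"v1"
c1, c2 = VenusClient(server), VenusClient(server)
print(c1.read("/afs/doc"))    # b'v1', now cached on c1
c2.write("/afs/doc", b"v2")   # server breaks c1's callback
print(c1.read("/afs/doc"))    # b'v2', c1 refetches after the invalidation
```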
Coda
- Supports disconnected mobile operation of clients
- Supports replication
- Disconnection operation:
- Client updates are logged in a Client Modification Log (CML) file
- On reconnection, the CML is sent to the server (see the sketch at the end of this subsection)
- Trickle reintegration tradeoff:
- Immediate reintegration of the log puts a heavy load on servers
- Late reintegration leads to an increased risk of conflicts
- File hoarding:
- System/user can build a hoard database which is used to update frequently-used files in a hoard walk
- Conflicts are resolved automatically if possible, otherwise manual intervention is required
- Servers:
- Read/write replication is organised per volume (entire volumes are replicated)
- Group file replication (multicast remote procedure calls); `read` from any server
- Version stamps are used to recognise servers with outdated files
GFS
Google File System
- Designed for commercial/R&D applications
- Aim to support tens of clusters, with thousands of nodes each
- 300 TB+ file systems and 500 Mb/s load
- Assumptions:
- Failure occurs often
- Huge files
- Large streaming `read`s
- Small random `read`s
- Concurrent appends
- Bandwidth more important than latency
- Interface:
- No common standard like POSIX.
- Provides familiar file system interface
- Operations: `create`, `delete`, `open`, `close`, `read`, `write`
- `snapshot`: low-cost copy of an entire file (copy-on-write)
- `record append`: atomic append operation; concurrent appends are isolated
- System Design
- Files split into large (64MB) chunks
- Chunks stored on chunk servers (replicated)
- GFS master manages the name space
- Clients interact with the master to get chunk handles
- Clients interact with chunk servers for reads/writes (see the read-path sketch below)
- No explicit caching
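The read path described above can be sketched as: the client turns (file name, byte offset) into a chunk index, asks the master for that chunk’s handle and replica locations, and then fetches the bytes directly from one of the chunk servers. The 64 MB chunk size is from the lecture; every other name below is invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks

class Master:
    """Holds only metadata: (file, chunk index) -> (handle, replica locations)."""
    def __init__(self):
        self.chunks = {}               # (path, chunk_index) -> (handle, [chunkservers])

    def lookup(self, path, chunk_index):
        return self.chunks[(path, chunk_index)]

class ChunkServer:
    def __init__(self):
        self.store = {}                # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.store[handle][offset:offset + length]

def gfs_read(master, path, offset, length):
    chunk_index = offset // CHUNK_SIZE                   # which chunk holds the data
    handle, replicas = master.lookup(path, chunk_index)  # metadata from the master
    # Data itself comes from a chunk server, not from the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)

# Tiny demo with one chunk replicated on two chunk servers.
cs1, cs2 = ChunkServer(), ChunkServer()
cs1.store["h42"] = cs2.store["h42"] = b"log line 1\nlog line 2\n"
master = Master()
master.chunks[("/logs/web.log", 0)] = ("h42", [cs1, cs2])
print(gfs_read(master, "/logs/web.log", offset=11, length=10))   # b'log line 2'
```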
- GFS Master:
- Single point of failure
- Keeps data structures in memory
- Mutations logged to the operation log (replicated)
- Checkpoint state when log is too large (checkpoint is same form as memory - quick recovery)
- Locations of chunks are not stored at the master (the master periodically asks chunk servers for their lists of chunks)
- Chunk Servers
- Checksum blocks of chunks
- Checksums are verified before data is delivered, and seldom-used blocks are verified when the server is idle
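Per-block checksumming might be pictured like this: each block of a chunk gets its own checksum, and the checksum is re-verified before the block is returned to a client, so on-disk corruption is caught instead of silently propagated. The 64 KB block size and the use of CRC32 are assumptions made for the sketch.

```python
import zlib

BLOCK = 64 * 1024                         # checksum granularity within a chunk (assumed)

class ChecksummedChunk:
    def __init__(self, data):
        self.data = bytearray(data)
        # One checksum per block of the chunk.
        self.sums = [zlib.crc32(self.data[i:i + BLOCK])
                     for i in range(0, len(self.data), BLOCK)]

    def read(self, offset, length):
        # Verify every block the request touches before delivering any data.
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        for b in range(first, last + 1):
            block = self.data[b * BLOCK:(b + 1) * BLOCK]
            if zlib.crc32(block) != self.sums[b]:
                raise IOError(f"checksum mismatch in block {b}; read another replica")
        return bytes(self.data[offset:offset + length])

chunk = ChecksummedChunk(b"x" * 200000)
print(len(chunk.read(100000, 5000)))      # 5000: checksums verified, data returned
chunk.data[150000] = 0                    # simulate silent disk corruption
try:
    chunk.read(140000, 20000)
except IOError as e:
    print(e)                              # corruption detected, bad data never served
```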
- Data Mutations (`write`, `record append`, `snapshot`):
- Master grants a chunk lease to one chunk replica
- The replica holding the lease becomes the primary
- The primary defines the serial order for all mutations
- Leases typically expire after 60s (usually extended)
- If the primary fails, the master chooses another replica after the lease expires
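The lease mechanism can be sketched as follows (timing and names are simplified, invented for illustration): the master hands a 60-second lease to one replica, that primary assigns a serial number to each mutation, and the same order is applied at every replica; once the lease has expired without renewal, the master may grant it elsewhere.

```python
import itertools
import time

LEASE_SECONDS = 60

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []                       # (serial, mutation), same order everywhere

    def apply(self, serial, mutation):
        self.log.append((serial, mutation))

class Primary:
    """Replica holding the chunk lease: it defines the serial order of mutations."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.serial = itertools.count(1)
        self.lease_expiry = time.time() + LEASE_SECONDS   # the master would renew this

    def mutate(self, mutation):
        if time.time() > self.lease_expiry:
            raise RuntimeError("lease expired; the master must grant a new one")
        n = next(self.serial)               # pick the mutation's place in the order
        for replica in self.replicas:       # apply in that order on every replica
            replica.apply(n, mutation)

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
primary = Primary(replicas)
primary.mutate("write(offset=0, data=...)")
primary.mutate("record_append(data=...)")
print(replicas[2].log)    # the same ordered log appears on every replica
```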
- Evaluating GFS (after 10 years of use)
- Single Master Problem
- Too many requests (overloading)
- Single point of failure
- Solutions:
- Tune performance
- Multiple cells
- Develop distributed masters
- File Counts
- Too much metadata for single master
- Applications changed to rely on “Big Table” instead
- File size
- Smaller than expected
- Reduced block size to 1MB
- Throughput vs Latency
- Too much latency for interactive applications
- Automated master failover introduced
- Applications modified to hide latency (e.g. multi-homed model)
Other File Systems from Google
Chubby
- Lock service
- Simple FS
- Name service
- Synchronisation/consensus service
- Implements Paxos
- Architecture:
- Defines cells consisting of 5 replicas
- Master:
- Gets all client requests
- Master elected using Paxos, and given a lease
- `write`: Paxos agreement among the replicas
- `read`: performed locally by the master
- API:
- Lock services defined as path names
- Operations:
open
,close
,read
,write
,delete
,- Locks:
acquire,
release` - Events:
lock acquired
, file modified` etc.
- Locks:
- Simple leader election using Chubby:
- Everyone who wants to become master tries to open the same file for writing - only one will succeed:
- `if (open("/ls/cell/TheLeader", W)) { write(my_id); } else { wait until "/ls/cell/TheLeader" modified; leader_id = read(); }`
Colossus
- Follow up to GFS
BigTable
- Distributed, sparse, storage map (key-value data)
- Uses Chubby for consistency
- Uses GFS/Colossus for actual storage
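The “distributed, sparse storage map” can be pictured as a map from (row key, column, timestamp) to an uninterpreted byte string; the three-part key reflects the Bigtable data model, but the code below is only a single-machine toy, not how Bigtable is implemented.

```python
from collections import defaultdict

class ToyBigtable:
    """Sparse map: only cells that are actually written occupy space."""
    def __init__(self):
        self.cells = defaultdict(dict)        # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column)][timestamp] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None                       # unwritten cells simply don't exist
        return versions[max(versions)]        # newest version wins

t = ToyBigtable()
t.put("com.example/www", "contents:html", 1, b"<html>v1</html>")
t.put("com.example/www", "contents:html", 2, b"<html>v2</html>")
print(t.get("com.example/www", "contents:html"))   # b'<html>v2</html>'
print(t.get("com.example/www", "anchor:refs"))     # None: the map is sparse
```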
Megastore
- Semi-relational database (provides ACID transactions)
- Uses BigTable for storage (synchronous replication using Paxos)
- Poor write latency and throughput
Spanner
- SQL-like, structured storage
- Transactions with TrueTime (synchronous replication using Paxos)
- Optimised for low latency