My Computer Network Notes


These are the notes I took when I sped through Computer Networking: A Top-Down Approach during spring break of my sophomore year.

Computer Networking: A Top Down Approach

Chapter 1

What is the Internet?

  • a request and a reply
  • Internet: “network of networks”
    • Interconnected ISPs
  • protocols are everywhere
    • control sending, receiving of messages
    • e.g., HTTP (Web), streaming video, Skype, TCP, IP, WiFi, 4G, Ethernet
  • Internet standards
    • RFC: Request for Comments
    • IETF: Internet Engineering Task Force

What is a protocol?

  • Protocols define the format, order of messages sent and received among network entities, and actions taken on msg transmission, receipt

Network Edge

  • Network edge:
    • hosts: clients and servers
    • servers often in data centers
  • Access networks and physical media
    • How to connect end systems to edge router?
      • residential access nets
      • institutional access networks (school, company)
      • mobile access networks (WiFi, 4G/5G)
  • What to look for?
    • transmission rate (bits per second) of access network?
    • shared or dedicated access among users?
  • Access networks: cable-based access (later in chapter 6 & 7)
    • a physical cable connects multiple houses to a single cable headend
    • data, TV transmitted at different frequencies over shared cable distribution network
    • frequency division multiplexing (FDM): different channels transmitted in different frequency bands
    • HFC: hybrid fiber coax
      • asymmetric: 40 Mbps - 1.2 Gbps downstream transmission rate, 30-100 Mbps upstream transmission rate (users consume more data than they produce)
    • network of cable, fiber attaches homes to ISP router
  • Access networks: digital subscriber line (DSL)
    • voice, data transmitted at different frequencies over dedicated line to central office
    • use existing telephone line to central office DSLAM
      • data over DSL phone line goes to Internet
      • voice over DSL phone line goes to telephone net
    • asymmetric
      • 24-52 Mbps dedicated downstream transmission rate
      • 3.5-16 Mbps dedicated upstream transmission rate
  • Access networks: home networks
    • DSL coming in to/from headend or central office
    • cable or DSL modem
    • router, firewall, NAT: wired Ethernet
    • wifi wireless access point
    • WiFi access point, router, and modem are often combined in a single box
  • Wireless access networks
    • Shared wireless access network connects end system to router via base station aka “access point”
    • wireless local area networks (WLANs)
      • typically within or around building (~100ft)
      • 802.11b/g/n (WiFi): 11, 54, 450 Mbps transmission rates
      • standardized by IEEE, not IETF
    • wide-area cellular access
      • provided by mobile, cellular network operator (10’s km)
      • 10’s Mbps
      • 4G, 5G
  • Access networks: enterprise network
    • companies, universities
    • mix of wired, wireless link technologies, connecting a mix of switches and routers
      • Ethernet: wired access at 100Mbps, 1Gbps, 10Gbps
      • WiFi: wireless access points at 11, 54, 450 Mbps
  • Access networks: data center networks
    • high-bandwidth links (10s to 100s Gbps) connect hundreds to thousands of servers together, and to Internet
  • Host: sends packets of data
    • host sending function:
      • takes application message
      • breaks into smaller chunks, known as packets, of length L bits, (typically 1500 bytes) for data + header
        • the protocol defines what information goes in the header
      • transmits packet into access network at transmission rate R (which depends on the type of access network), R: bits/sec
        • link transmission rate, aka link capacity, aka link bandwidth
        • packet transmission delay = time needed to transmit L-bit packet into link = $L (bits) / R (bits/sec)$
  • Links: physical media
    • bit: propagates between transmitter/receiver pairs
    • physical link: what lies between transmitter & receiver
    • guided media: signals propagate in solid media: copper, fiber coax
    • unguided media: signals propagate freely, e.g. radio
    • twisted pair (TP): two insulated copper wires
      • category 5: 100 Mbps, 1 Gbps Ethernet
      • category 6: 10 Gbps Ethernet
    • Coaxial cable:
      • two concentric copper conductors
      • bidirectional
      • broadband
        • multiple frequency channels on cable
        • 100’s Mbps per channel
    • Fiber optic cable:
      • glass fiber carrying light pulses, each pulse a bit
      • high-speed operation:
        • high-speed point-to-point transmission (10’s-100’s Gbps)
      • low error rate:
        • repeaters spaced far apart
        • immune to electromagnetic noise
    • Wireless radio:
      • signal carried in various “bands” in electromagnetic spectrum
      • no physical “wire”
      • broadcast, “half-duplex” (sender to receiver)
      • propagation environment effects:
        • reflection
        • obstruction by objects
        • interference/noise
    • Radio link types:
      • Wireless LAN (WiFi)
        • 10-100’s Mbps, 10’s of meters
      • wide-area (e.g. 4G)
        • 10’s Mbps over ~10km
      • Bluetooth: cable replacement
        • short distances, limited rates
      • terrestrial microwave
        • point-to-point; 45 Mbps channels
      • satellite
        • up to 45 Mbps per channel
        • ~270 msec end-end delay
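The packet transmission delay formula from the host-sending notes above (L/R) is easy to sanity-check with a few lines; the 1500-byte packet and 100 Mbps link are example values I chose, not from the notes.

```python
def transmission_delay(L_bits, R_bps):
    """Time to push an L-bit packet onto a link of rate R bits/sec."""
    return L_bits / R_bps

# example: a 1500-byte packet on a 100 Mbps access link
delay = transmission_delay(1500 * 8, 100e6)
print(delay)  # 0.00012 seconds, i.e. 120 microseconds
```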

Network Core

  • mesh of interconnected routers
  • packet-switching: hosts break application-layer messages into packets
    • network forwards packets from one router to the next, across links on path from source to destination
    • Forwarding
      • aka “switching”
      • local action: move arriving packets from router’s input link to appropriate router output link
    • Routing
      • global action: determine source-destination paths taken by packets
      • routing algorithm:
        • compute local per-router forwarding table needed to realize this source-destination forward path
    • packet transmission delay: takes L/R seconds to transmit (push out) L-bit packet into link at R bps
    • store and forward: entire packet must arrive at router before it can be transmitted on next link
    • queueing: what happens when packets arrive at a router for forwarding
      • queueing occurs when work arrives faster than it can be serviced
      • Packet queueing and loss: if arrival rate (in bps) to link exceeds transmission rate (bps) of link for some period of time:
        • packets will queue, waiting to be transmitted on output link
        • packets can be dropped (lost) if memory (buffer) in router fills up
    • statistical multiplexing: the link is shared on demand among many users’ packet streams
  • circuit switching
    • end-end resources allocated to, reserved for “call” between source and destination
      • dedicated resources: no sharing
        • circuit-like (guaranteed) performance
      • circuit segment idle if not used by call (no sharing)
      • commonly used in traditional telephone networks
    • Implementation:
      • Frequency Division Multiplexing (FDM)
        • optical, electromagnetic frequencies divided into (narrow) frequency bands
        • each call allocated its own band, can transmit at max rate of that narrow band
      • Time Division Multiplexing (TDM)
        • time divided into slots
        • each call allocated periodic slot(s), can transmit at maximum rate of (wider) frequency band (only) during its time slot(s)
  • Packet switching vs circuit switching
    • Is packet switching a “slam dunk winner”?
      • great for “bursty” data – sometimes has data to send, but at other times not
      • excessive congestion possible: packet delay and loss due to buffer overflow
        • protocols needed for reliable data transfer, congestion control
      • How to provide circuit-like behavior with packet-switching?
        • complicated, covered more later
  • structure of today’s Internet
    • Global transit ISP
      • peering networks/Internet exchange point: connect different global transit ISP
    • regional ISP
    • At “center”: small # of well-connected large networks
      • “tier-1” commercial ISPs (e.g. Level 3, Sprint), national & international coverage
      • content provider networks (e.g. Google, Facebook): private network that connects its data centers to internet, often bypassing tier-1, regional ISPs
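The store-and-forward rule above (a router must receive the entire packet before pushing it onto the next link) means the end-to-end delay over a path of equal-rate links multiplies L/R by the number of links; this sketch ignores propagation and queueing delay, and the link rate and hop count are made-up example values.

```python
def store_and_forward_delay(L_bits, R_bps, num_links):
    # each of the num_links links must fully receive the L-bit packet
    # before it can be transmitted onto the next link
    return num_links * L_bits / R_bps

# example: 1500-byte packet over 3 links of 1 Mbps each
print(store_and_forward_delay(1500 * 8, 1e6, 3))  # 36 ms total
```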

Performance

  • How do packet delay and loss occur?
    • packets queue in router buffers, waiting for turn for transmission
      • queue length grows when arrival rate to link (temporarily) exceeds output link capacity
    • packet loss occurs when memory to hold queued packets fills up
  • Packet delay: 4 sources
    • nodal processing
      • check bit errors
      • determine output link
      • typically < microsecs
    • queueing delay
      • time waiting at output link for transmission
      • depends on congesion level of router
      • it’s about arrival rate of bits vs service rate of bits
    • transmission delay
      • L: packet length (bits)
      • R: link transmission rate (bps)
      • d_{trans} = L/R
    • propagation delay
      • d: length of physical link
      • s: propagation speed (~2*10^8 m/sec)
      • d_{prop} = d/s
  • Traceroute
    • provides delay measurement from source to each router along end-end Internet path towards destination. For all i:
      • sends 3 packets that will reach router i on path towards destination (with time-to-live field value of i)
      • router i will return packets to sender
      • sender measures time interval between transmission and reply
  • Throughput
    • rate (bits/time unit) at which bits are being sent from sender to receiver
      • instantaneous: rate at given point in time
      • average: rate over longer period of time
    • in practice, bottleneck link is at network edge
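The four delay sources and the bottleneck-throughput idea above can be sketched together; all the concrete numbers below (processing delay, link rate, distance) are hypothetical example values.

```python
def nodal_delay(d_proc, d_queue, L_bits, R_bps, dist_m, s=2e8):
    # total delay at one node = processing + queueing + transmission + propagation
    # transmission: L/R; propagation: d/s with s ~ 2*10^8 m/sec
    return d_proc + d_queue + L_bits / R_bps + dist_m / s

def path_throughput(link_rates_bps):
    # end-to-end throughput is limited by the slowest (bottleneck) link
    return min(link_rates_bps)

d = nodal_delay(d_proc=2e-6, d_queue=0, L_bits=1500 * 8, R_bps=10e6, dist_m=100e3)
print(d)                                      # ~1.7 ms for these example values
print(path_throughput([100e6, 10e6, 1e9]))    # bottleneck is the 10 Mbps link
```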

Layering, encapsulation

  • Software Architecture Considerations
    • It determines key properties such as
      • Scalability
        • how to handle increasing demand and amount of resources
      • Efficiency
        • how to achieve no resource waste
      • Reliability
        • how to handle component failures
      • Modularity
        • how to allow reuse of components
      • Evolvability
        • how to allow components to evolve over time
  • layer: each layer implements a service
    • via its own internal-layer actions
    • relying solely on services provided by layer below
  • Example Building Layering: Enhancement
    • Enhancement: enhancing a basic communication service with a new capability
      • Not reliable -> add seq to make it reliable
      • Not secure -> add encr/authen to make it secure
      • Just sending -> add pacing
  • Layered Internet protocol stack
    • application
      • supporting network application
      • HTTP, IMAP, SMTP, DNS
      • transport unit: messages
    • transport
      • process to process data transfer
      • The transport layer ensures communication between a process on one host and a process on another host.
      • TCP, UDP
      • take the application-layer message and add a transport-layer header H_t to create a transport-layer segment, the unit of data transferred
    • network
      • host to host
      • It’s responsible for source-to-destination or host-to-host delivery of packets across multiple networks.
      • routing of datagrams from source to destination, IP, routing protocols
      • does not provide reliability: best-effort service
      • encapsulate the transport-layer segment with a network-layer header H_n to create a network-layer datagram
    • link
      • data transfer between neighboring network elements
      • Ethernet, 802.11 (WiFi), PPP
      • link-layer header, H_l, to create frame
    • physical
      • bits “on the wire”
    • Data flows down the stack, adding more headers, and flows back up the stack and removing headers.
  • Service - says what a layer does
  • Interface - says how a higher layer accesses the service provided by a lower layer
  • Protocol - specifies how the service in a layer is implemented
  • Multiplexing
    • If a layer n has multiple entities (e.g., different applications or implementations), the lower layer (n-1) needs a multiplexing field to indicate the higher-layer entity requiring the service
  • What does the End-to-End Arguments mean?
    • The application knows the requirements best, place functionalities as high in the layer as possible
    • Think twice before implementing a functionality at a lower layer, even when you believe it will be useful to an application
    • If a higher layer can do it well, don’t do it at a lower layer – the higher the layer, the more it knows about what it needs
    • Add functionality in lower layers iff it
      • is used by and improves performance of a large number of (current and potential future) applications,
      • does not hurt (too much) other applications, and
      • does not increase (too much) complexity/overhead
    • Practical tradeoff
      • allow multiple interfaces at a lower layer (one provides the function; one does not)
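The encapsulation flow described above (each layer prepends its own header on the way down, and strips it on the way back up) can be sketched with toy 2-byte headers; the "Ht"/"Hn"/"Hl" byte strings stand in for the H_t, H_n, H_l headers from the notes and are not real protocol formats.

```python
def encapsulate(message: bytes) -> bytes:
    # data flows down the stack, each layer adding its header
    segment = b"Ht" + message      # transport layer: message -> segment
    datagram = b"Hn" + segment     # network layer: segment -> datagram
    frame = b"Hl" + datagram       # link layer: datagram -> frame
    return frame

def decapsulate(frame: bytes) -> bytes:
    # data flows back up the stack, each layer removing its header
    assert frame[:2] == b"Hl"
    datagram = frame[2:]
    assert datagram[:2] == b"Hn"
    segment = datagram[2:]
    assert segment[:2] == b"Ht"
    return segment[2:]             # original application message

print(decapsulate(encapsulate(b"GET /")))  # b'GET /'
```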

Networks under attack, Network Security

What can bad actors do? What defenses are designed and deployed? How to design architectures that are immune to attacks?

  • Internet not originally designed with (much) security in mind
  • Bad guys: packet interception
    • packet “sniffing”
      • broadcast media (shared Ethernet, wireless)
      • promiscuous network interface reads/records all packets passing by
    • IP spoofing: injection of packet with false source address
  • Bad guys: denial of service
    • attackers make resources (server, bandwidth) unavailable to legitimate traffic by overwhelming resource with bogus traffic
      • select target
      • break into hosts around the network (see botnet)
      • send packets to target from compromised hosts
  • Lines of defense
    • authentication
      • prove who you are, e.g. SIM card: hardware identity
    • confidentiality: via encryption
    • integrity checks: digital signatures prevent/detect tampering
    • access restrictions: password-protected VPNs
    • firewalls

Chapter 2: Application Layer

Overview

  • conceptual and implementation aspects of application-layer protocols
    • transport-layer service
    • client-server paradigm
    • peer-to-peer paradigm
  • learn about protocols by examining popular application-layer protocols and infrastructure
    • HTTP
    • SMTP, IMAP
    • DNS
    • video streaming systems, CDNs
  • programming network applications
    • socket API

Principles of the Application Layer

  • Client-server paradigm
    • server:
      • always-on host
      • permanent IP address
      • often in data centers, for scaling
    • clients:
      • contact, communicate with server
      • may be intermittently connected
      • may have dynamic IP addresses
      • do not communicate directly with each other
      • examples: HTTP, IMAP, FTP
  • Peer-peer architecture
    • no always-on server
    • arbitrary end systems communicate directly
    • peers request service from other peers, provide service in return to other peers
      • self scalability - new peers bring new service capacity, as well as new service demands
    • peers are intermittently connected and change IP addresses
      • complex management
    • example: P2P file sharing
  • Process communicating
    • process: program running within a host
    • within the same host, 2 processes communicate using inter-process communication (defined by OS)
    • processes in different hosts communicate by exchanging messages
    • client process: process that initiates communication
    • server process: process that waits to be contacted
    • note: applications with P2P architectures have client processes & server processes
  • Sockets
    • process sends/receives messages to/from its socket (door)
    • analogous to door
      • sending process shoves messages out door and the transport infrastructure (i.e. the layers below) will deliver the message to socket (door) at receiving process
      • receiving process receives messages through door
      • 2 sockets involved: one on each side
  • Identifier needed to receive messages
    • identifier includes
      • IP address
      • port numbers
  • An application-layer protocol defines:
    • types of messages exchanged (e.g. request, response)
    • message syntax (what fields in messages & how fields are delineated)
    • message semantics (meaning of information in fields)
    • rules for when and how processes send & respond to messages
    • open protocols:
      • defined in RFCs, everyone has access to protocol definition
      • allows for interoperability
      • e.g., HTTP, SMTP
    • proprietary protocols:
      • e.g. Skype, Zoom
  • Transport layer services
    • data integrity
      • some apps (e.g. file transfer) require 100% reliable data transfer
      • other apps (e.g. audio) can tolerate some loss
    • timing
      • some apps (e.g. Internet telephony, interactive games) require low delay to be “effective”
    • throughput
      • some apps (e.g. multimedia) require minimum amount of throughput to be “effective”
      • other apps (“elastic apps”) make use of whatever throughput they get
    • security
      • encryption, data integrity
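The client-server and socket ideas above (server waits to be contacted, client initiates, each side sends/receives through its own socket "door") can be sketched with a minimal TCP echo exchange over loopback; the echo behavior and port choice are my own example, not from the notes.

```python
import socket
import threading

def serve_once(srv: socket.socket) -> None:
    # server process: waits to be contacted, echoes one message back
    conn, _ = srv.accept()
    data = conn.recv(1024)
    conn.sendall(b"echo: " + data)
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=serve_once, args=(srv,))
t.start()

# client process: initiates communication via its own socket
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"hello")
reply = cli.recv(1024)
cli.close()
t.join()
srv.close()
print(reply)  # b'echo: hello'
```

Note there are two sockets involved, one on each side, exactly as the notes say.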

Web, HTTP

Overview

  • Web, HTTP overview
  • HTTP connections
    • TCP, “stateless”
    • persistent, non-persistent
  • HTTP messages
    • requests, responses
  • HTTP cookies
  • Web caches
  • Conditional HTTP GET
  • HTTP/2, HTTP/3
  • HTTP overview:
    • HTTP uses TCP
      • HTTP messages exchanged between browser (HTTP client) and Web server (HTTP server)
    • HTTP is “stateless”
      • server maintains no information about past client requests
  • HTTP connections: 2 types
    • Non-persistent HTTP: 2*RTT + file transmission time
    • Persistent HTTP (HTTP/1.1): 1*RTT + file transmission time
  • HTTP request messages
    • POST: data is not displayed in the URL but in the HTTP message body
    • GET (for sending data to server): data parameters are included in the URL and visible to everyone.
    • PUT: uploads new file (object) to server, completely replaces file that exists at specified URL with content in entity body of PUT HTTP request message
    • HEAD: request header
  • Maintaining user/server state: cookies
    • cookies let the server remember state across a client’s requests
    • could be used for remembering authorization, shopping carts, recommendation
  • Web caches → improve user display latency
    • Goal: satisfy client requests without involving origin server
  • Conditional GET → improve user display latency
    • Goal: don’t send object if cache has up-to-date cached version
      • no object transmission delay (or use of network resources)
    • implementation:
      • client: specify date of cached copy in HTTP request, if-modified-since: <date>
      • server: response contains no object if cached copy is up-to-date: HTTP/1.0 304 Not Modified
  • HTTP/2
    • objects divided into frames, frame transmission interleaved
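The non-persistent vs. persistent timing above can be made concrete with a small calculation; the 100 ms RTT and 20 ms transmission time below are hypothetical example values.

```python
def non_persistent_time(rtt, transmit):
    # 1 RTT for the TCP handshake + 1 RTT for request/first byte + transmission
    return 2 * rtt + transmit

def persistent_time(rtt, transmit):
    # connection already open: request/response costs 1 RTT + transmission
    return rtt + transmit

rtt, tx = 0.100, 0.020   # example: 100 ms RTT, 20 ms file transmission
print(non_persistent_time(rtt, tx))   # ~220 ms
print(persistent_time(rtt, tx))       # ~120 ms
```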

Email

Overview

  • infrastructure: user agents, servers, mailboxes
  • SMTP: simple mail transfer protocol
  • 3 major components:
    • user agents
    • mail servers
    • SMTP: simple mail transfer protocol

Domain Name System (DNS)

Overview

  • DNS structure, function
  • resolving DNS queries
  • DNS record format
  • DNS protocol messages
  • Internet hosts, routers:
    • IP address (32 bit) - used for addressing datagrams
    • “name”, e.g., cs.umass.edu - used by humans
  • Domain Name System (DNS):
    • distributed database implemented in hierarchy of many name servers
      • Root server → Top level domain (e.g. .com) → authoritative (e.g. amazon.com)
      • Root name servers
        • official contact-of-last-resort: when a name server cannot resolve a name on its own, it starts by asking a root server
      • Top-Level Domain (TLD) servers
        • responsible for .com, .org, .edu, … and all top-level country domain, e.g. .cn, .uk, .fr
        • Network solutions: authoritative registry for .com, .net TLD
        • Educause: .edu TLD
      • authoritative DNS servers
        • organization’s own DNS server(s), providing authoritative hostname to IP mappings for organization’s named hosts
        • can be maintained by organization or service provider
    • application-layer protocol: hosts, DNS servers communicate to resolve names (address/name translation)
      • note: core internet function, implemented as application-layer protocol
      • complexity at network’s edge
  • DNS services, structure
    • hostname-to-IP-address translation
    • host aliasing
      • canonical, alias names
    • mail server aliasing
    • load distribution
      • replicated Web servers: many IP addresses correspond to one name
    • Why not centralize DNS?
      • single point of failure
      • traffic volume
      • distant centralized database
      • maintenance
  • Thinking about the DNS
    • humongous distributed database
      • ~ billion records, each simple
    • handles many trillions of queries/day
      • many more reads than writes
      • performance matters: almost every Internet transaction interacts with DNS - msecs count!
    • organizationally, physically decentralized
      • millions of different organizations responsible for their records
    • “bulletproof”: reliability, security
  • Local DNS name servers
    • when host makes DNS query, it is sent to its local DNS server
      • Local DNS server returns reply, answering:
        • from its local cache of recent name-to-address translation pairs (possibly out of date!)
        • forwarding request into DNS hierarchy for resolution
  • DNS name resolution
    • Iterated query
      • contacted server replies with name of server to contact
      • “I don’t know this name, but ask this server”
    • Recursive query
      • puts burden of name resolution on contacted name server
      • heavy load at upper levels of hierarchy
  • Caching DNS information
    • once (any) name server learns mapping, it caches mapping, and immediately returns a cached mapping in response to a query
      • caching improves response time
      • cache entries timeout (disappear) after some time (TTL)
      • TLD servers typically cached in local name servers
    • cached entries may be out-of-date
      • if a named host changes IP address, the change may not be known Internet-wide until all TTLs expire!
      • best-effort name-to-address translation
  • DNS records
    • DNS: distributed database storing resource records (RR)
      • RR format: (name, value, type, ttl)
    • type=A
      • name is hostname
      • value is IP address
    • type=NS
      • name is domain (e.g. foo.com)
      • value is hostname of authoritative name server for this domain
    • type=CNAME
      • name is alias name for some “canonical” (the real) name
      • www.ibm.com is really servereast.backup2.ibm.com
      • value is canonical name
    • type=MX
      • value is name of SMTP mail server associated with name
  • DNS protocol messages
    • DNS query and reply messages, both have same format:
      • message header:
        • identification: 16 bit # for query, reply to query uses same #
      • flags:
        • query or reply
        • recursion desired
        • recursion available
        • reply is authoritative
      • name, type fields for a query
      • RRs in response to query
      • records for authoritative servers
      • additional “helpful” info that may be used
  • Getting your info into the DNS
    • example: new startup “Network Utopia”
    • register name networkutopia.com at DNS registrar (e.g., Network Solutions)
      • provide names, IP addresses of authoritative name server (primary and secondary)
      • registrar inserts NS, A RRs into .com TLD server:
        • (networkutopia.com, dns1.networkutopia.com, NS)
        • (dns1.networkutopia.com, 212.212.212.1, A)
      • create authoritative server locally with IP address 212.212.212.1
        • type A record for www.networkutopia.com
        • type MX record for networkutopia.com
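The A and CNAME record types above can be illustrated with a toy in-memory resolver over (name, type) -> value records; the networkutopia.com address comes from the example in the notes, while `www.example-alias.com` is a hypothetical alias I added, and this is nothing like a real DNS implementation.

```python
# toy RR table keyed by (name, type), as in the RR format from the notes
RRS = {
    ("www.networkutopia.com", "A"): "212.212.212.1",
    ("www.example-alias.com", "CNAME"): "www.networkutopia.com",  # hypothetical alias
}

def resolve(name, rrs=RRS):
    # follow CNAME chains until an A record yields an IP address
    while (name, "A") not in rrs:
        if (name, "CNAME") in rrs:
            name = rrs[(name, "CNAME")]   # alias -> canonical name
        else:
            return None                    # name not in this toy database
    return rrs[(name, "A")]

print(resolve("www.example-alias.com"))  # 212.212.212.1
```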

Chapter 3: Transport Layer

Introduction

  • Overview
    • understand principles behind transport layer services:
      • multiplexing, demultiplexing
      • reliable data transfer
      • flow control
      • congestion control
    • learn about Internet transport layer protocols:
      • UDP: connectionless, best-effort service
      • TCP: reliable, flow- and congestion-controlled connection-oriented transport
  • Logical communication
    • abstracts away the implementation: information from one host appears to be sent logically to another host
    • network layer deals with logical communication between hosts
    • transport layer deals with logical communication between processes (running on hosts)
      • relies on, enhances, network layer services
  • TCP: Transmission Control Protocol
    • reliable, in-order delivery
    • congestion control
    • flow control
    • connection setup
  • UDP: User Datagram Protocol
    • unreliable, unordered delivery
    • no-frills extension of “best-effort” IP
  • services not available:
    • delay guarantees
    • bandwidth guarantees
    • What is the minimum amount of services we need to provide effective communication?

Multiplexing and demultiplexing

  • demultiplexing
    • directing the received payload to the appropriate process/application
    • At receiver: use header info to deliver received segments to correct socket
  • multiplexing
    • taking those messages and funneling them down into IP
    • At sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
  • How demultiplexing works
    • host receives IP datagrams
      • each datagram has source IP address, destination IP address
      • each datagram carries one transport-layer segment
      • each segment has source, destination port number
    • host uses IP addresses & port numbers to direct segment to appropriate socket
  • Connectionless demultiplexing
    • when creating socket, must specify host-local port #
    • when creating datagram to send into UDP socket, must specify
      • destination IP address
      • destination port #
    • when receiving host receives UDP segment:
      • checks destination port # in segment
      • directs UDP segment to socket with that port #
      • IP/UDP datagrams with same dest port #, but different source IP addresses and/or source port numbers will be directed to the same socket at receiving host.
  • Connection-oriented demultiplexing
    • TCP socket identified by 4-tuple
      • source IP address
      • source port number
      • dest IP address
      • dest port number
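The connectionless demultiplexing rules above can be observed directly with two UDP sockets on loopback: the destination port number in each datagram selects which socket receives it. The messages and the use of OS-assigned ports are my own example setup.

```python
import socket

# two receiving sockets, each bound to its own (OS-assigned) port
rx1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx1.bind(("127.0.0.1", 0))
rx2.bind(("127.0.0.1", 0))
rx1.settimeout(5)
rx2.settimeout(5)
p1, p2 = rx1.getsockname()[1], rx2.getsockname()[1]

# one sender; the destination port # decides which socket gets each datagram
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"for socket 1", ("127.0.0.1", p1))
tx.sendto(b"for socket 2", ("127.0.0.1", p2))

m1, _ = rx1.recvfrom(1024)
m2, _ = rx2.recvfrom(1024)
print(m1, m2)  # each datagram demultiplexed by destination port
for s in (rx1, rx2, tx):
    s.close()
```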

Connectionless Transport: UDP

  • UDP: User Datagram Protocol
    • no need for the sender and receiver to have shared state -> connectionless
    • each UDP segment handled independently of others
    • Why is there a UDP?
      • no connection establishment (which can add RTT delay)
      • simple: no connection state at sender, receiver
      • small header size
      • no congestion control
        • UDP can blast away as fast as desired!
        • can function in the face of congestion
    • use:
      • streaming multimedia apps (loss-tolerant, rate-sensitive, so aggressive congestion control is undesirable)
      • DNS
      • SNMP
      • HTTP/3
  • UDP segment header
    • source port #, dest port #
    • length: UDP needs to know how long the application payload is
    • checksum
  • UDP checksum
    • Goal: detect errors (i.e. flipped bits) in transmitted segment
    • sender
      • treat contents of UDP segment (including UDP header fields and IP addresses) as sequence of 16-bit integers
      • checksum: addition (one’s complement sum) of segment content
      • checksum value put into UDP checksum field
    • receiver
      • compute checksum of received segment
      • check if computed checksum equals checksum field value:
        • Not equal - error detected
        • Equal - no error detected. But maybe errors nonetheless? More later…
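The one's-complement checksum described above can be sketched over an arbitrary byte string (a real UDP checksum also covers a pseudo-header with the IP addresses, which this sketch omits); the 4-byte sample segment is a made-up example.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"                 # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # wrap carry back in (one's complement add)
    return ~total & 0xFFFF              # one's complement of the sum

seg = b"\x12\x34\x56\x78"               # example segment contents
cs = internet_checksum(seg)
# receiver check: checksum over segment + checksum field is 0 if no bits flipped
print(hex(cs), internet_checksum(seg + cs.to_bytes(2, "big")))
```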

Principles of Reliable Data Transfer

  • How do 2 distributed entities reliably communicate over a channel which itself is unreliable?
  • Reliable data transfer:
    • consider only unidirectional data transfer
    • finite state machine
  • rdt1.0: reliable transfer over a reliable channel
    • underlying channel perfectly reliable
      • no bit errors
      • no loss of packets
    • separate FSMs for sender, receiver:
      • sender sends data into underlying channel
      • receiver reads data from underlying channel
  • rdt2.0: channel with bit errors
    • underlying channel may flip bits in packet
      • checksum to detect bit errors
    • the question: how to recover from errors?
      • acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
      • negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
      • sender retransmits pkt on receipt of NAK
      • stop and wait: sender sends one packet, then waits for receiver response
    • fatal flaw
      • what happens if ACK/NAK corrupted?
        • sender doesn’t know what happened at receiver!
        • can’t just retransmit: possible duplicate
    • handling duplicates
      • sender retransmits current pkt if ACK/NAK corrupted
      • sender adds sequence number to each pkt
      • receiver discards (doesn’t deliver up) duplicate pkt
  • rdt3.0: channels with errors and loss
    • New channel assumption: underlying channel can also lose packets (data, ACKs)
    • setting a timer at the sender
  • pipelining
    • sender allows multiple, “in-flight”, yet-to-be-acknowledged packets
      • range of sequence numbers must be increased
      • buffering at sender and/or receiver
  • selective repeat
    • receiver individually acknowledges all correctly received packets
      • buffers packets, as needed, for eventual in-order delivery to upper layer
    • sender times-out/retransmits individually for unACKed packets
      • sender maintains timer for each unACKed pkt
    • ACK(n) in [sendbase, sendbase+N]
      • mark packet n as received
      • if n is smallest unACKed packet, advance window base to next unACKed seq # -> core of cumulative acknowledgement
  • cumulative acknowledgement
    • Cumulative acknowledgement is a process in which the receiver sends a single acknowledgement in response to a finite number of frames received. Through this, the receiver acknowledges that it has correctly received all previous frames or packets. When the sender receives an acknowledgement for frame n, it understands correct delivery of frames n – 1, n – 2 and so on.
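The cumulative-acknowledgement rule above ("everything below the ACKed number arrived in order") can be sketched as a tiny function; the packet numbers are made-up examples.

```python
def cumulative_ack(received_seqs, base=0):
    # the cumulative ACK names the next expected sequence number:
    # all packets below it have been received in order
    next_expected = base
    while next_expected in received_seqs:
        next_expected += 1
    return next_expected

# packets 0-2 arrived in order; 4 arrived out of order and is buffered
print(cumulative_ack({0, 1, 2, 4}))  # ACK 3: sender learns 0-2 are delivered
```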

Connection-oriented transport: TCP

  • TCP: Transmission Control Protocol
    • point-to-point
      • one sender, one receiver
    • reliable, in-order byte stream
      • no “message boundaries”
    • full duplex data
      • bi-directional data flow in same connection
      • MSS: maximum segment size
    • cumulative ACKs
      • the core is in adjusting window base to the next unACKed seq #
    • pipelining
      • TCP congestion and flow control set window size
    • connection-oriented
      • handshaking (exchange of control messages) initializes sender, receiver state before data exchange
    • flow controlled
      • sender will not overwhelm receiver
      • speed match
  • TCP segment structure
    • source port #, dest port #
      • used for multiplexing & de-multiplexing
    • sequence number, 32 bit
      • counting bytes of data into bytestream (not segments!)
      • we can use this to check duplicates
      • we can use this to shift window base
      • core of pipelining
      • byte stream “number” of first byte in segment’s data
    • acknowledgement number, 32 bit
      • seq # of next expected byte from other side;
      • cumulative ACK
    • A bit: this is an ACK
    • Internet checksum
    • TCP options (variable length)
    • length (of TCP header)
    • C,E: congestion notification
    • flow control: # bytes receiver willing to accept
    • application data (variable length)
      • data sent by application into TCP socket
    • urg data pointer (not really used)
    • Note:
      • how the receiver handles out-of-order segments is up to the implementor (not specified by the TCP spec)
  • TCP round trip time, timeout
    • how to set TCP timeout value?
      • longer than Round Trip Time (RTT), but RTT varies!
      • too short: premature timeout, unnecessary retransmissions
      • too long: slow reaction to segment loss
    • how to estimate RTT?
      • SampleRTT: measured time from segment transmission until ACK receipt
        • ignore retransmissions
      • SampleRTT will vary, want estimated RTT “smoother”
        • average several recent measurements, not just current SampleRTT
      • EstimatedRTT = (1-alpha) * EstimatedRTT + alpha * SampleRTT
        • exponential weighted moving average (EWMA)
          • give data at different position different weight
        • influence of past sample decreases exponentially fast
        • typical value: alpha = 0.125
      • timeout interval: EstimatedRTT + 4 * DevRTT
        • 4*DevRTT: safety margin in case the estimatedRTT has large variation/fluctuation
        • DevRTT: EWMA of SampleRTT deviation from EstimatedRTT
          • DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
          • typically, beta = 0.25
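The estimator above can be written out directly. A minimal sketch with the typical alpha/beta values from the notes; the starting values are arbitrary:

```python
# EWMA-based TCP RTT estimation and timeout, per the formulas above:
#   EstimatedRTT = (1-alpha)*EstimatedRTT + alpha*SampleRTT
#   DevRTT       = (1-beta)*DevRTT + beta*|SampleRTT - EstimatedRTT|
#   Timeout      = EstimatedRTT + 4*DevRTT
ALPHA, BETA = 0.125, 0.25

def update_rtt(estimated_rtt, dev_rtt, sample_rtt):
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    timeout = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout

est, dev = 100.0, 10.0                 # ms; assumed starting values
est, dev, timeout = update_rtt(est, dev, sample_rtt=120.0)
```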
  • TCP Sender (simplified)
    • event: data received from application
      • create segment with seq #
      • seq # is byte-stream number of first data byte in segment
      • start timer if not already running
        • think of timer as for oldest unACKed segment
        • expiration interval: TimeOutInterval
    • event: timeout
      • retransmit segment that caused timeout
      • restart timer
    • event: ACK received
      • if ACK acknowledges previously unACKed segments
        • update what is known to be ACKed
        • start timer if there are still unACKed segments
  • TCP Receiver

    | Event at receiver | TCP receiver action |
    | ------ | ------ |
    | arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed | delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK |
    | arrival of in-order segment with expected seq #. One other segment has ACK pending | immediately send single cumulative ACK, ACKing both in-order segments |
    | arrival of out-of-order segment with higher-than-expected seq #. Gap detected | immediately send duplicate ACK, indicating seq. # of next expected byte |
    | arrival of segment that partially or completely fills gap | immediately send ACK, provided that segment starts at lower end of gap |
  • TCP fast retransmit
    • receipt of 3 duplicate ACKs indicates 3 segments received after a missing segment – lost segment is likely. So retransmit the unACKed segment with the smallest seq # and do not wait for timeout
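A minimal sketch of the duplicate-ACK counting that triggers fast retransmit; the state layout and action strings are my own, not from any real implementation:

```python
# Count duplicate ACKs at the sender; on the 3rd duplicate, retransmit
# the oldest unACKed segment without waiting for the retransmission timer.

def on_ack(state, acknum):
    """state: dict with 'last_ack' and 'dup_count'. Returns an action name."""
    if acknum == state["last_ack"]:
        state["dup_count"] += 1
        if state["dup_count"] == 3:
            return "fast-retransmit"   # resend segment starting at acknum
        return "wait"
    state["last_ack"] = acknum         # new data ACKed: reset the counter
    state["dup_count"] = 0
    return "advance-window"

s = {"last_ack": 100, "dup_count": 0}
actions = [on_ack(s, a) for a in (100, 100, 100)]   # 3 duplicates of ACK 100
```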
  • TCP flow control
    • high-level-idea: have the receiver tell the sender how much space there is left in the receiver’s buffer
    • implementation:
      • TCP receiver “advertises” free buffer space in rwnd field in TCP header
        • RcvBuffer size set via socket options (typically default is 4096 bytes)
        • many operating systems autoadjust RcvBuffer
      • sender limits amount of unACKed (“in-flight”) data to received rwnd
      • guarantees receive buffer will not overflow
  • TCP Connection Management
    • before exchanging data, sender/receiver “handshake”
      • agree to establish connection (i.e. whether the other is willing to establish connection)
      • agree on connection parameters (e.g. starting seq #)
    • 2-way handshake doesn’t always work
      • variable delays
      • retransmitted messages (e.g. req_conn(x) due to message loss)
      • message reordering
      • can’t “see” other side
    • TCP uses 3-way handshake
      • both client and server have 3 states
        • LISTEN, SYNSENT, ESTAB
      • start: client: LISTEN; server: LISTEN
        1. client → server
          • SYNbit = 1, Seq = x
          • client: LISTEN → SYNSENT
          • server (upon receipt): LISTEN → SYN RCVD
        2. server → client
          • SYNbit = 1, Seq = y
          • ACKbit = 1, ACKnum = x+1
        3. client → server (upon receipt of SYNACK)
          • client received SYNACK(x), indicating server is live; send ACK for SYNACK; this segment may contain client-to-server data
          • ACKbit = 1, ACKnum = y+1
          • client: SYNSENT → ESTAB
          • server (upon receipt of ACK(y), indicating client is live): SYN RCVD → ESTAB
    • Closing a TCP connection
      • client, server each close their side of connection
        • send TCP segment with FIN bit = 1
      • respond to received FIN with ACK
        • on receiving FIN, ACK can be combined with own FIN
      • simultaneous FIN exchanges can be handled
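In practice the handshake and teardown are done by the kernel: `connect()` sends the SYN and returns once the connection reaches ESTAB, while `accept()` completes on the server side when the handshake finishes. A minimal loopback demonstration (port 0 lets the OS pick a free port):

```python
# The 3-way handshake happens inside connect()/accept(); the application
# never sees the SYN/SYNACK/ACK segments themselves.
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: OS assigns a free port
server.listen()                        # server enters LISTEN

def accept_one():
    conn, _ = server.accept()          # returns once handshake completes
    conn.sendall(b"hello")
    conn.close()                       # sends FIN for this side

t = threading.Thread(target=accept_one)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())   # SYN -> SYNACK -> ACK
data = client.recv(5)
client.close()
t.join()
server.close()
```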

Principles of Congestion Control

  • Causes and costs of congestion: 3 scenarios
  • Approaches towards congestion control:
    • end-end congestion control
      • no explicit feedback from network
      • congestion inferred from observed loss, delay
      • approach taken by TCP
    • network-assisted congestion control
      • routers provide direct feedback to sending/receiving hosts with flows passing through congested router (could happen before any excessive congestion happens)
      • may indicate congestion level or explicitly set sending rate
      • TCP ECN, ATM, DECbit protocols
  • Congestion
    • too many sources sending too much data too fast for network to handle
    • manifestations:
      • long delays (queueing in router buffers)
      • packet loss (buffer overflow at routers)
    • note:
      • multiple source → congestion
      • flow control → between single sender and single receiver
  • Causes and costs of congestion: scenario 1
    • Simplest scenario:
      • 1 router, infinite buffer
      • input, output link capacity: R
      • 2 flows
      • no retransmissions needed
    • As arrival rate $\lambda_{in}$ approaches R/2, the throughput $\lambda_{out}$ is capped at R/2, meaning the maximum per-connection throughput is R/2.
    • large queueing delays as arrival rate approaches capacity (i.e. R/2)
  • Causes and costs of congestion: scenario 2
    • one router, finite buffer
    • sender retransmits lost, timed-out packet
      • application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
      • transport-layer input includes retransmissions: $\lambda_{in}’ \geq \lambda_{in}$
    • idealization: some perfect knowledge
      • packets can be lost (dropped at router) due to full buffers
      • sender knows when packet has been dropped: only resends if packet known to be lost (when a packet arrives at router and finds out that there are no buffer space available)
    • As the overall arrival rate approaches R/2 (full capacity), the maximum throughput $\lambda_{out}$ is less than R/2, since some capacity is spent on retransmissions.
    • Moreover, the sender can time out prematurely, sending 2 copies, both of which are delivered → lowering the maximum throughput $\lambda_{out}$ further.
    • Cost of congestion
      • (lost) more work (retransmission) for given receiver throughput
      • (time out) unneeded retransmissions: link carries multiple copies of a packet
        • decreasing maximum achievable throughput
  • Causes and costs of congestion: scenario 3
    • 4 senders
    • multi-hop paths (multiple routers)
    • timeout/retransmit
    • another cost of congestion
      • when packet dropped, any upstream transmission capacity and buffering used for that packet was wasted!
  • Summary
    • throughput can never exceed capacity
    • delay increases as capacity approached
    • loss/retransmission decreases effective throughput
    • un-needed duplicates further decreases effective throughput
    • upstream transmission capacity/buffering wasted for packets lost downstream

TCP Congestion Control

Overview:

  • “Classic” TCP: loss-based, end-end
    • additive increase, multiplicative > decrease
    • “slow” start
    • CUBIC
  • Enhanced TCPs:
    • delay-based congestion control TCP
    • explicit congestion notification
  • TCP fairness
  • TCP congestion control: AIMD: probing for bandwidth
    • Additive increase
      • increase sending rate by 1 maximum segment size every RTT until loss detected
    • Multiplicative Decrease
      • cut sending rate in half on loss detected by triple duplicate ACK (TCP Reno)
      • cut to 1 MSS (maximum segment size) when loss detected by timeout (TCP Tahoe)
    • slow start
      • when connection begins, increase rate exponentially (i.e. doubling) until 1st loss event
      • initially cwnd = 1 MSS
      • double cwnd every RTT
      • done by incrementing cwnd for every ACK received
      • switch from exponential increase to linear when cwnd gets to 1/2 of its value before timeout
    • Why AIMD?
      • distributed, asynchronous algorithm
      • optimize congested flow rates network wide
      • have desirable stability properties
    • Implementation
      • adjust/control window size cwnd
        • make LastByteSent - LastByteAcked $\leq$ cwnd
        • TCP sending behavior:
          • roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
          • TCP rate $\approx \frac{cwnd}{RTT}$ bytes/sec
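The AIMD rules above can be written as a toy cwnd update in units of MSS; the function and event names are invented for illustration:

```python
# AIMD congestion-window update: additive increase of 1 MSS per RTT,
# Reno-style halving on triple duplicate ACK, cut to 1 MSS on timeout
# (Tahoe behavior). cwnd is measured in MSS units.

def aimd_update(cwnd, event):
    if event == "ack":                     # additive increase: +1 MSS per RTT
        return cwnd + 1
    if event == "triple-dup-ack":          # multiplicative decrease (Reno)
        return max(cwnd / 2, 1)
    if event == "timeout":                 # cut to 1 MSS (Tahoe)
        return 1
    return cwnd

cwnd = 4
for ev in ("ack", "ack", "triple-dup-ack", "ack"):
    cwnd = aimd_update(cwnd, ev)           # 4 -> 5 -> 6 -> 3 -> 4
```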
  • TCP congestion control: CUBIC:
    • basically another way of incrementing the sending rate:
      • initially ramp to $W_{max}$ faster, but then approach $W_{max}$ more slowly
  • TCP congestion control: delay-based:
    • keep bottleneck link busy transmitting, but avoid high delays/buffering (i.e. keep sender-to-receiver pipe “just full enough, but no fuller”), which ensures no loss incurred
    • Implementation:
      • measured throughput = # bytes sent in last RTT interval / RTT measured
      • RTT_min: minimum observed RTT (uncongested path)
      • uncongested throughput = cwnd/RTT_min
      • if measured throughput “very close” to uncongested throughput
        • increase cwnd linearly
      • else if measured throughput “far below” uncongested throughput
        • decrease cwnd linearly
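The delay-based rule above can be sketched directly (Vegas-flavored); the “very close” and “far below” thresholds are my assumptions, not values from the notes:

```python
# Delay-based congestion control sketch: compare measured throughput
# against the uncongested throughput cwnd/RTT_min and adjust cwnd linearly.

def adjust_cwnd(cwnd, bytes_last_rtt, rtt_measured, rtt_min,
                close=0.9, far=0.5):       # thresholds are assumptions
    measured = bytes_last_rtt / rtt_measured
    uncongested = cwnd / rtt_min           # throughput if pipe "just full"
    if measured >= close * uncongested:
        return cwnd + 1                    # path looks uncongested: grow
    if measured <= far * uncongested:
        return cwnd - 1                    # queues building: back off
    return cwnd

# cwnd = 100, RTT doubled from the 0.1 s minimum -> measured far below
new = adjust_cwnd(100, bytes_last_rtt=100, rtt_measured=0.2, rtt_min=0.1)
```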
  • TCP congestion control: Explicit Congestion Notification (ECN):
    • TCP deployments often implement network-assisted congestion control:
      • 2 bits in IP header (ToS field) marked by network router to indicate congestion
        • policy to determine marking chosen by network operator
      • congestion indication carried to destination
      • destination sets ECE bit on ACK segment to notify sender of congestion
      • involves both IP (IP header ECN bit marking) and TCP (TCP header C,E bit marking)
  • Not all network apps must be fair
    • Fairness and UDP
      • multimedia app often do not use TCP,
        • do not want rate throttled by congestion control
      • instead use UDP
        • send audio/video at constant rate, tolerate packet loss
    • Fairness, parallel TCP connections
      • application can open multiple parallel connections between 2 hosts
        • web browsers do this
        • ask for more connections, higher rate
  • TCP Tahoe

BBR: Bottleneck Bandwidth and Round-trip propagation time

Overview

  • packet loss is only weakly correlated with congestion
  • Model network path
    • Dynamically estimate windowed max BW and min RTT on each ACK
  • Control sending based on the model, to…
    • Sequentially probe max BW and min RTT, to feed the model samples
    • Pace near estimated BW
    • Vary pacing rate to keep inflight near BDP
  • Seek high throughput with a small queue
    • Approaches maximum available throughput for random losses up to 15%
    • Maintains small queue independent of buffer depth
  • Core idea in BBR
    • Estimating optimal point
      • BDP = (max BW) * (min RTT)
      • Est min RTT = windowed min of RTT samples
      • Est max BW = windowed max of BW samples
  • BBR operational state
  • BBR state machine
    • BBR congestion control computes the sending rate based on the delivery rate (throughput) estimated from ACKs. In a nutshell, on each ACK, update the model of the network path:
      • bottleneck_bandwidth = windowed_max(delivered / elapsed, 10 round trips)
      • min_rtt = windowed_min(rtt, 10 seconds)
      • pacing_rate = pacing_gain * bottleneck_bandwidth
      • cwnd = max(cwnd_gain * bottleneck_bandwidth * min_rtt, 4)
    • The core algorithm does not react directly to packet losses or delays, although BBR may adjust the size of next send per ACK when loss is observed, or adjust the sending rate if it estimates there is a traffic policer, in order to keep the drop rate reasonable.

    • pacing_gain controls how fast packets are sent relative to BtlBw and is key to BBR’s ability to learn.
      • pacing_gain > 1: increases inflight and decreases packet inter-arrival time, moving the connection to the right on figure 1.
      • pacing_gain < 1: opposite
    • With gain cycling, BBR cycles through a sequence of values for the pacing_gain. It uses an eight-phase cycle with the following pacing_gain values: 5/4, 3/4, 1, 1, 1, 1, 1, 1. Each phase normally lasts for the estimated RTprop.
      • This design allows the gain cycle first to probe for more bandwidth with a pacing_gain above 1.0, then drain any resulting queue with a pacing_gain an equal distance below 1.0, and then cruise with a short queue using a pacing_gain of 1.0.
      • The average gain across all phases is 1.0 because ProbeBW aims for its average pacing rate to equal the available bandwidth and thus maintain high utilization, while maintaining a small, well-bounded queue.
      • Note that while gain cycling varies the pacing_gain value, the cwnd_gain stays constant at two, since delayed and stretched acks can strike at any time (see the section on Delayed and Stretched Acks).
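The eight-phase gain cycle can be written out directly; the BtlBw estimate below is an assumed value for illustration:

```python
# BBR ProbeBW gain cycling: pacing_gain cycles through eight phases
# (5/4, 3/4, then six phases of 1), each lasting roughly one estimated
# RTprop; the pacing rate is pacing_gain * BtlBw.

GAIN_CYCLE = [5/4, 3/4, 1, 1, 1, 1, 1, 1]    # average gain is 1.0

def pacing_rate(btl_bw, phase):
    return GAIN_CYCLE[phase % 8] * btl_bw

btl_bw = 100e6                               # assumed estimate: 100 Mbit/s
rates = [pacing_rate(btl_bw, p) for p in range(8)]
```

The probe phase (5/4) builds a small queue; the drain phase (3/4) empties it; the six cruise phases hold the pipe just full, which is why the average gain works out to exactly 1.0.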
  • BBR fairness problem
    • in a queue, a flow with larger RTT takes up the space given up by the cwnd-controlled flow with shorter RTT; therefore, the larger-RTT flow gets a greater share of the bottleneck bandwidth
  • BBR v2
    • lower packet loss rate
    • time-scale of bandwidth probing
  • What’s new in BBR v2: a summary

| | CUBIC | BBR v1 | BBR v2 |
| ------ | ------ | ------ | ------ |
| Model parameters to the state machine | N/A | throughput (max bandwidth) & min RTT | max bandwidth, RTT, max degree of aggregation, max inflight |
| Loss | Reduce cwnd by 30% on window with any loss | N/A (no explicit response to packet loss) | Explicit loss rate target 2% (i.e. want to keep packet loss under the target) |
| ECN | RFC 3168 (Classic ECN) | N/A | DCTCP-inspired ECN |
| Startup | Slow-start until RTT rises or any loss (i.e. doubling the sending rate until…) | Slow-start until tput plateaus | Slow-start until tput plateaus or ECN/loss rate > target |

  • Characterisation of BBR
    • low queueing delay, despite bloated buffers
      • WAIT, QUESTION: what is controlled here? Figure 10. As the buffer increases, the latency remains constant for BBR (of course, since the sending rate is independent of queue length) whereas the latency increases for CUBIC
      • BUT, it says nothing about arrival rate. Even though the latency could increase for CUBIC, it possibly cuts down the time spent on retransmission. So does this suggest that the decrease in transmission delay is smaller than the queueing delay, thereby suggesting better performance?
  • Questions:
    • is the congestion control algorithm only run on the sender?
    • scenarios where BBRv1 not doing well
      • bottleneck was a WiFi link, RTT is low, BBRv1 poor, BBRv2 better
      • BBRv1 causes higher packet loss rate than desired when multiple BBR flows compete with each other in a shallow buffer; improved in BBRv2
    • Programmable switches
      • ECN
      • providing richer signal to transport: presence and extent (queue depth) of congestion (INT: In-band Network Telemetry, an extensible framework that records a packet’s journey as metadata is added to the live traffic packets)
  • BBRv3
    • inflight_hi: maximum amount of inflight data observed during the current bandwidth probing cycle.
      • inflight_hi is used to calculate the upper bound of the cwnd, preventing the sender from overshooting the available bandwidth and causing excessive queuing.
    • inflight_lo: minimum amount of inflight data observed during the current bandwidth probing cycle.
      • inflight_lo is used to calculate the lower bound of the cwnd, ensuring that the sender does not underestimate the available bandwidth.
  • BBR Forum

Chapter 4: Network Layer

Introduction

Overview

  • Data plane
    • Overview of network layer
    • What’s inside a router?
    • The Internet Protocol: IPv4, addressing, NAT, IPv6
    • Generalized forwarding
    • Middleboxes
    • Summary
  • Control plane
  • Network-layer services and protocols
    • transport segment from sending to receiving host
      • sender: encapsulates segments into datagrams, passes to link layer
      • receiver: delivers segments to transport layer protocol
    • network layer protocols in every Internet device: hosts, routers
    • routers:
      • examines header fields in all IP datagrams passing through it
      • moves datagrams from input ports to output ports to transfer datagrams along end-end path
  • 2 key network-layer functions
    • forwarding: (local action) move packets from a router’s input link to appropriate router output link
    • routing: (global action) determine route taken by packets from source to destination
  • Network Layer
    • Data plane
      • local, per-router, per-IP device
      • how is a packet moved from input port to the appropriate output port
    • Control plane
      • global, network-wide
      • how a datagram is routed among routers along end-end path from source host to destination host
      • network management, device management
      • 2 implementation
        • distributed routing algorithm (traditional alg): in routers
          • forwarding table in per-router control plane
          • routing algorithm in one router communicates with that in all the other routers to compute the forwarding table
        • software-defined networking (SDN): in (remote) servers
          • a physically separate remote controller calculates and distributes the forwarding table to each and every individual router
          • this remote controller often is in datacenter
  • Network service model
    • best effort service
    • No guarantees on
      • successful datagram delivery to destination
      • timing or order of delivery
      • bandwidth available to end-end flow
    • Simplicity allows for wide adoption, simple maintenance
    • sufficient provisioning of bandwidth: good enough most of the time, allows performance of real-time applications
    • replicated, application-layer distributed services allows services to be provided from multiple locations
    • congestion control of “elastic” services helps

What’s inside a router?

Overview

  • architecture overview
    • input ports
    • switching
    • output ports
  • packet buffering, scheduling
  • sidebar: network neutrality
  • Input port functions
    • line termination
      • physical layer: bit-level reception
    • link layer protocol
      • where queueing happens
    • lookup, forwarding
      • using header field values, lookup output port using forwarding table in input port memory (“match plus action”)
      • goal: complete input port processing at ‘line speed’
      • input port queueing: if datagrams arrive faster than forwarding rate into switch fabric
      • destination-based forwarding: forward based only on destination IP address (traditional)
      • generalized forwarding: forward based on any set of header field values
  • Longest prefix matching
    • longest prefix match → use this address for output address
    • implementation
      • ternary content addressable memories (TCAMs)
        • content addressable: present address to TCAM: retrieve address in one clock cycle, regardless of table size
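A toy longest-prefix-match lookup may make the semantics concrete (a TCAM does this in hardware in one clock cycle; this linear scan only illustrates the rule). The prefixes and port numbers below are invented:

```python
# Destination-based forwarding with longest prefix matching: among all
# table entries whose prefix covers the destination address, the entry
# with the longest prefix determines the output port.
import ipaddress

TABLE = [
    (ipaddress.ip_network("200.23.16.0/21"), 0),
    (ipaddress.ip_network("200.23.18.0/24"), 1),
    (ipaddress.ip_network("0.0.0.0/0"), 3),       # default route
]

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, port) for net, port in TABLE if addr in net]
    return max(matches)[1]                        # longest prefix wins

port = lookup("200.23.18.5")   # matches both /21 and /24 -> /24 wins
```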
  • switching fabrics
    • transfer packet from input link to appropriate output link
    • switching rate: rate at which packets can be transfer from inputs to outputs
      • often measured as multiple of input/output line rate
      • N inputs: switching rate N times line rate desirable
    • 3 major types
      • switching via memory
      • … bus
      • … interconnection network
        • multistage switch network: nxn switch from multiple stages of smaller switches
        • exploiting parallelism
          • fragment datagram into fixed length cells on entry
          • switch cells through the fabric, reassemble datagram at exit
  • Input port queuing
    • if switch fabric slower than input ports combined → queueing may occur at input queues
      • queueing delay and loss due to input buffer overflow
    • Head-of-the-Line (HOL) blocking:
      • if only 1 packet at a time can be transferred from an input port to an output port, and k packets arriving at the same time want to go to the same output port, then k-1 packets wait on the input side → possibly blocking the packets behind them
      • queued datagram at front of queue prevents others in queue from moving forward
  • Output port queueing
    • Buffering required when datagrams arrive from switch fabric faster than link transmission rate. → datagrams can be lost due to congestion, lack of buffers
    • Drop policy: which datagram to drop if no free buffers?
    • Scheduling discipline: which datagram should be transmitted next?
  • How much buffering?
    • RFC 3439 rule of thumb: average buffering equal to “typical” RTT (Say 250 msec) times link capacity C
      • e.g., C = 10 Gbps, RTT = 250 msec: 2.5 Gbit buffer
    • most recent recommendation: with N flows, buffering equal to $RTT * C / \sqrt{N}$
    • but too much buffering can increase delays (particularly in home routers)
      • long RTTs: poor performance for real-time applications, sluggish TCP response
      • recall delay-based congestion control: “keep bottleneck link just full enough (busy) but no fuller”
  • Buffer management
    • drop: which packet to add, drop when buffers are full
      • tail drop: drop arriving packet
      • priority: drop/remove on priority basis
    • marking
      • which packets to mark to signal congestion
  • Packet scheduling
    • first come, first served
    • priority
      • arriving traffic classified, queued by class
        • any header fields can be used for classification
      • send packet from the highest priority queue that has buffered packets
        • FCFS within priority class
    • round robin
      • arriving traffic classified, queued by class
        • any header fields can be used for classification
      • server cyclically, repeatedly scans class queues, sending one complete packet from each class (if available) in turn
    • weighted fair queueing
      • generalized round robin
      • each class, i, has weight, w_i, and gets weighted amount of service in each cycle: $w_i/\sum_j w_j$
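The WFQ share formula computed directly; the class names and weights below are illustrative:

```python
# Weighted fair queueing: class i receives w_i / sum_j(w_j) of the
# link's service in each cycle.

def wfq_shares(weights):
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}

shares = wfq_shares({"voice": 3, "video": 2, "data": 1})
# voice gets 3/6 = 0.5 of the link bandwidth per cycle
```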
  • Network Neutrality
    • no blocking
    • no throttling
    • no paid prioritization

The Internet Protocol

Overview

  • IPv4
  • addressing
  • NAT
  • IPv6
  • Network Layer:
    • Path-selection algorithms implemented in
      • routing protocols (OSPF, BGP)
      • SDN controller
      • which generates the forwarding table
    • IP protocol
      • datagram format
      • addressing
      • packet handling conventions
    • ICMP protocol
      • error reporting
      • router “signaling”
  • IP Datagram format
    • version
    • header length: how many bytes in the header, to know where the payload begins. If the header has no options, it is 20 bytes.
    • length (16 bit): total datagram length (header + payload). Theoretical max size is 64K bytes, typically: 1500 bytes or less
    • type of service:
      • diffserv: used for classify classes to provide different services (such as priority)
      • ECN: sets to indicate congestion
    • TTL (Time to Live): counter. Time to Live to ensure packets do not loop forever.
    • upper layer protocol: TCP, UDP. Transport layer protocol the payload data is going to be passed. e.g. 6: TCP; 17: UDP
    • flags & fragment offset: used when a single large datagram is fragmented into multiple smaller datagrams (these fields do not appear in IPv6)
    • header checksum: needs to be recomputed at each router the datagram passes through, since some fields change (e.g. TTL) (removed in IPv6)
    • source IP address (32 bit)
    • destination IP address (32 bit)
    • option optional fields (e.g. timestamp, record route taken)
    • payload data: transport layer segment
    • All in all, overhead
      • 20 bytes of TCP
      • 20 bytes of IP
      • = 40 bytes + app layer overhead
  • IP addressing
    • link-layer interface on a host/router
    • IP address: 32-bit identifier associated with each host or router interface
      • Interface: connection between host/router and physical link
        • routers typically have multiple interfaces (multiple input/output links)
        • host typically has 1 or 2 interfaces (e.g. wired Ethernet, wireless 802.11)
      • dotted-decimal IP address notation
        • 223.1.1.1 = 11011111 00000001 00000001 00000001
      • how are interfaces actually connected?
        • one local thing to connect
  • Subnet
    • device interfaces that can physically reach each other without passing through an intervening router
      • how to define subnets
        • detach each interface from its host or router, creating “islands” of isolated networks
        • each isolated network is called a subnet
    • IP addresses have structure:
      • subnet part: devices in same subnet have common high order bits (subnet mask: /24 – high-order 24 bits)
      • host part: remaining low order bits (remaining 8 bits)
    • /24 has a name: CIDR: Classless InterDomain Routing
      • subnet portion of address of arbitrary length
      • address format: a.b.c.d/x, where x is # bits in subnet portion of address
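The subnet/host split of an a.b.c.d/x address can be checked with Python's standard library, using a book-style 223.1.1.0/24 example:

```python
# CIDR: the high-order x bits are the subnet part, the remaining
# 32 - x bits are the host part.
import ipaddress

iface = ipaddress.ip_interface("223.1.1.1/24")
net = iface.network
host_bits = 32 - net.prefixlen

print(net)                  # 223.1.1.0/24  (subnet part: high 24 bits)
print(net.netmask)          # 255.255.255.0
print(2 ** host_bits - 2)   # 254 usable host addresses
```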
  • How does one get the IP address?
    • How does a host get IP address within its network?
      • DHCP: Dynamic Host Configuration Protocol: dynamically get address from a server
        • plug-and-play
        • goal: host dynamically obtains IP address from network server when it “joins” network
          • can renew its lease on address in use
          • allows reuse of addresses (only hold address while connected/on)
          • support for mobile users who join/leave network
        • overview:
          • host broadcasts DHCP discover msg [optional]
          • DHCP server responds with DHCP offer msg [optional]
          • host requests IP address: DHCP request msg
          • DHCP server sends address: DHCP ack msg
        • typically, DHCP server will be co-located in router, serving all subnets to which router is connected
        • DHCP: more than IP addresses
          • DHCP can return more than just allocated IP address on subnet:
            • address of first-hop router for client
            • name and IP address of DNS server
            • network mask (indicating network versus host portion of address)
    • How does a network get IP address for itself?
      • gets allocated portion of its provider ISP’s address space
  • Hierarchical addressing: route aggregation
    • Fly-By-Night-ISP
    • ISPs-R-Us
  • How does an ISP get block of addresses?
    • ICANN: Internet Corporation for Assigned Names and Numbers
      • allocates IP addresses, through 5 regional registries
      • manages DNS root zone, including delegation of individual TLD (.com, .edu, …) management
  • Are there enough 32-bit IP addresses?
    • ICANN allocated last chunk of IPv4 addresses to RRs in 2011
    • NAT (next) helps IPv4 address space exhaustion
    • IPv6 has 128-bit address space
  • NAT: Network Address Translation
    • all devices in local network share just one IPv4 address as far as the outside world is concerned → the NAT router then takes charge of distributing the messages from the outside world to the designated local network addresses
    • advantages:
      • 1 IP address needed from provider ISP for all devices
      • can change address of host in local network without notifying outside world
      • can change ISP without changing addresses of devices in local network
      • security: devices inside local net not directly addressable, visible by outside world
    • implementation:
      • outgoing datagram: replace (source IP address, port #) of every outgoing datagram to (NAT IP address, new port #)
      • remember (in NAT translation table) every (source IP address, port #) to (NAT IP address, new port #) translation pair
      • incoming datagrams: replace (NAT IP address, new port #) in destination fields of every incoming datagram with corresponding (source IP address, port #) stored in NAT table
    • controversial
      • routers “should” only process up to layer 3
      • address shortage should be solved by IPv6
      • violates end-to-end argument (port # manipulation by network-layer device)
      • NAT traversal: what if client wants to connect to server behind NAT?
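The translation-table logic above can be sketched as a toy class; the addresses and the starting port number are invented:

```python
# NAT translation table: outgoing datagrams get (NAT IP, new port) as
# their source, and the mapping is remembered so incoming replies can be
# rewritten back to the original private (IP, port).

class NAT:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 5001
        self.table = {}      # (nat_ip, nat_port) -> (src_ip, src_port)

    def outgoing(self, src_ip, src_port):
        nat = (self.public_ip, self.next_port)
        self.next_port += 1
        self.table[nat] = (src_ip, src_port)
        return nat           # rewritten source of the outgoing datagram

    def incoming(self, dst_ip, dst_port):
        return self.table[(dst_ip, dst_port)]   # original private endpoint

nat = NAT("138.76.29.7")
pub = nat.outgoing("10.0.0.1", 3345)   # source rewritten to (NAT IP, 5001)
back = nat.incoming(*pub)              # reply mapped back to (10.0.0.1, 3345)
```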
  • IPv6
    • motivation:
      • larger address space: 32 → 128 bit
      • speed processing/forwarding: 40-byte fixed length header
      • enable different network-layer treatment of “flow”
    • IPv6 datagram format
      • source/dest address (128 bit)
      • flow label: identify datagrams in same “flow”
      • priority: priority among datagrams in flow
      • What’s missing?
        • no checksum (to speed processing at routers)
        • no fragmentation (can be done at end point)
        • no options (available as upper-layer, next-header protocol at router)
    • Transition from IPv4 to IPv6
      • not all routers can be upgraded simultaneously
        • no “flag days”
        • how will network operate with mixed IPv4 and IPv6 routers? → COMPATIBILITY
      • tunneling
        • IPv6 datagram carried as the payload in IPv4 datagram among IPv4 routers

Generalized Forwarding

Overview

  • Match + action
  • OpenFlow
  • generalized forwarding:
    • many header fields can determine action
    • action space much bigger than just forwarding
  • Flow table abstraction
    • flow: defined by header field values (in link-, network-, transport-layer fields)
    • generalized forwarding: simple packet-handling rules
      • match: pattern values in packet header fields
      • actions: for matched packet: drop, forward, or modify the packet, or send it to controller
        • allows for layer 3 routing, layer 2 switching, firewalling
      • priority: disambiguate overlapping patterns
      • counters: # bytes and # packets
  • OpenFlow abstraction
    • Router
      • match: longest destination IP prefix
      • action: forward out a link
    • Switch
      • match: destination MAC address
      • action: forward or flood
    • Firewall
      • match: IP addresses and TCP/UDP port numbers
      • action: permit or deny
    • NAT
      • match: IP address and port
      • action: rewrite address and port
  • Orchestrated tables can create network-wide behavior by controlling per-router action

Middleboxes

Overview

  • middlebox functions
  • architectural principles and evolution of Internet
  • Middlebox
    • any intermediary box performing functions apart from normal, standard functions of an IP router (i.e. any destination-based forwarding of IP datagram) on the data path between a source host and destination host (i.e. happen at network core instead of edge hosts)
    • e.g. NAT, Firewalls, IDS, Load balancers, Caches (storage), Application-specific
  • Internet architectural principles
    • The goal is connectivity, the tool is the Internet Protocol, and the intelligence is end to end rather than hidden in the network

Control plane (more network-wide view)

  • routing algorithm goal:
    • determine “good” paths from sending hosts to receiving hosts through network of routers
  • Dijkstra’s link-state routing algorithm
    • centralized: network topology, link costs known to all nodes
      • accomplished via “link state broadcast”
      • all nodes have same info
    • computes least cost paths from one node (“source”) to all other nodes
      • gives forwarding table for that node
    • iterative: after k iterations, know least cost path to k destinations
  • Bellman Ford Distance Vector Routing Algorithm
    • Based on Bellman-Ford (BF) equation (dynamic programming)
      • Let $D_x(y)$ be the cost of the least-cost path from x to y. Then $D_x(y) = \min_v\{c_{x,v} + D_v(y)\}$
      • min taken over all neighbors v of x
      • $c_{x,v}$: direct cost of link from x to v
      • $D_v(y)$: v’s estimated least-cost-path cost to y
    • key idea:
      • from time-to-time, each node sends its own distance vector estimate to neighbors
      • when x receives new DV estimate from any neighbor, it updates its own DV using the B-F equation: $D_x(y) \leftarrow \min_v\{c_{x,v} + D_v(y)\}$ for each node $y \in N$
    • for each node, 3 steps
      • wait for (change in local link cost or msg from neighbor)
      • recompute my DV estimates using DV received from neighbor
      • if my DV to any destination has changed, send my new DV to my neighbors, else do nothing
    • nodes can iterate asynchronously
    • distributed, self-stopping
      • each node notifies its neighbors only when its DV changes
      • no notification received, no actions taken!
    • “good news travels fast”
      • if an edge cost decreases, neighbors are updated fast
    • “bad news travels slow”
      • count-to-infinity problem
  • Comparison of Link State and Distance Vector algorithms
    • Message complexity
      • Link State: n nodes, E links, O(nE) messages
      • Distance Vector: exchange between neighbors only
    • Speed of Convergence
      • Link State: O(n log n)
        • may have oscillations
      • Distance Vector: convergence time varies
        • may be routing loops
        • count-to-infinity problem
        • (faster with triggered updates)
    • Space requirement
      • LS maintains entire topology
      • DV maintains only neighbor state
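The distance-vector update above can be sketched as a synchronous simulation. A minimal sketch, with assumptions: all nodes iterate in lockstep (real DV is asynchronous and message-driven), and the graph and its representation are invented for illustration.

```python
# Sketch of distance-vector routing: each node x keeps a DV of
# estimated least costs D_x(y) and recomputes it from neighbors' DVs:
#     D_x(y) = min over neighbors v of ( c(x,v) + D_v(y) )
# Self-stopping: iterate until no node's DV changes.

INF = float("inf")

def dv_routing(nodes, cost):
    # cost[(x, v)] = direct link cost between neighbors x and v
    D = {x: {y: (0 if x == y else INF) for y in nodes} for x in nodes}
    neighbors = {x: [v for v in nodes if (x, v) in cost] for x in nodes}
    changed = True
    while changed:
        changed = False
        for x in nodes:
            for y in nodes:
                best = 0 if x == y else min(
                    (cost[(x, v)] + D[v][y] for v in neighbors[x]),
                    default=INF)
                if best != D[x][y]:
                    D[x][y] = best
                    changed = True  # x would notify its neighbors here
    return D

# Triangle: x—y cost 1, y—z cost 2, x—z cost 7
c = {("x","y"):1, ("y","x"):1, ("y","z"):2,
     ("z","y"):2, ("x","z"):7, ("z","x"):7}
D = dv_routing(["x", "y", "z"], c)
assert D["x"]["z"] == 3  # via y, cheaper than the direct cost-7 link
```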

Open Shortest Path First (OSPF)

Overview

  • Internet routing in practice
    • Scale, autonomy
    • OSPF: Open Shortest Path First Routing Protocol
    • BGP: Border Gateway Protocol
      • implement routing policy, control packets traffic
  • scale: billions of destinations
    • can’t store all destinations in routing table!
    • routing table exchange would swamp links!
    • aggregate routers into regions known as “autonomous systems” (AS) (aka “domains”)
    • intra-AS (aka intra-domain)
      • routing among routers within the same domain
        • all routers in AS must run same intra-domain protocol
        • routers in different AS can run different intra-domain routing protocols
        • gateway router: at “edge” of its own AS, has link(s) to router(s) in other AS’es
      • e.g. RIP: Routing Information Protocol
      • e.g. OSPF, classic link-state routing
        • each router floods OSPF link-state advertisements (directly over IP rather than using TCP/UDP) to all other routers in entire AS
        • multiple link cost metrics possible: bandwidth, delay
        • each router has full topology, uses Dijkstra’s algorithm to compute forwarding table
        • security: all OSPF messages authenticated
        • There is also hierarchical OSPF
          • 2-level hierarchy: local area, backbone
            • link-state advertisements flooded only in area, or backbone
            • each node has detailed area topology; only knows direction to reach destinations in other areas
      • EIGRP: Enhanced Interior Gateway Routing Protocol
        • DV based
    • inter-AS
      • gateways perform inter-domain routing (as well as participating in intra-domain within its own AS)
      • BGP (Border Gateway Protocol)
        • BGP provides each AS a means to
          • obtain destination network reachability info from neighboring ASes (eBGP)
          • determine routes to other networks based on reachability information and policy
          • propagate reachability information to all AS-internal routers (iBGP)
          • advertise (to neighboring networks) destination reachability info
        • BGP protocol messages
          • BGP messages exchanged between peers over TCP connection
          • BGP messages
            • OPEN: opens TCP connection to remote BGP peer and authenticates sending BGP peer
            • UPDATE: advertises new path (or withdraws old)
            • KEEPALIVE: keeps connection alive in absence of UPDATES; also ACKs OPEN request
            • NOTIFICATION: reports errors in previous msg; also used to close connection
        • Path attributes and BGP routes
          • BGP advertised path: prefix + attributes
            • path prefix: destination being advertised
            • 2 important attributes
              • AS-PATH: list of ASes through which prefix advertisement has passed
              • NEXT-HOP: indicates specific internal-AS router to next-hop AS
          • policy-based routing
            • policy preferred over performance
            • router receiving route advertisement to destination X uses policy to accept/reject a path (e.g., never route through AS W, or country Y).
            • router uses policy to decide whether to advertise a path to neighboring AS Z (does router want to route traffic forwarded from Z destined to X?)
        • Hot potato routing
          • choose local gateway that has least intra-domain cost (i.e. get packets out of the AS as quickly as possible)
    • forwarding table configured by intra- and inter-AS routing algorithms
      • intra-AS routing determine entries for destinations within AS
      • inter-AS and intra-AS determine entries for external destinations
  • administrative autonomy
    • each network admin may want to control routing in its own network
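The policy-based routing idea can be sketched as a toy route selector. A minimal sketch, with assumptions: the AS numbers, prefixes, and dict layout are invented, and real BGP uses a much longer decision process (local preference, MED, and so on) than "shortest AS-PATH".

```python
# Sketch of BGP-style policy routing: an AS receives route advertisements
# (prefix + AS-PATH + NEXT-HOP), rejects paths that violate policy
# (e.g. "never route through AS W") or contain its own AS number
# (loop prevention), then prefers the shortest remaining AS-PATH.

MY_AS = 65001
BANNED_ASES = {65666}            # policy: never route through this AS

def accept(route):
    path = route["as_path"]
    if MY_AS in path:            # our own AS already on the path: loop
        return False
    if BANNED_ASES & set(path):  # policy rejection
        return False
    return True

def select(routes):
    ok = [r for r in routes if accept(r)]
    # simplification: prefer the shortest AS-PATH among accepted routes
    return min(ok, key=lambda r: len(r["as_path"]), default=None)

adverts = [
    {"prefix": "203.0.113.0/24", "as_path": [65002, 65666, 65010], "next_hop": "a"},
    {"prefix": "203.0.113.0/24", "as_path": [65003, 65004, 65010], "next_hop": "b"},
    {"prefix": "203.0.113.0/24", "as_path": [65005, 65010], "next_hop": "c"},
]
assert select(adverts)["next_hop"] == "c"  # shortest policy-compliant path
```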

ICMP: Internet Control Message Protocol

  • used by hosts and routers to communicate network-level information
    • error reporting: unreachable host, network, port, protocol
    • echo request/reply
  • carried as payload inside IP datagram
    • protocol number: 1, used for demultiplexing up from IP
  • ICMP message: type, code plus header and first 8 bytes of IP datagram causing error
  • Traceroute and ICMP
    • source sends sets of UDP segments to destination
      • 1st set has TTL = 1
      • 2nd set has TTL = 2
    • datagram in nth set arrives at nth router
      • router discards datagram and sends source an ICMP message (type 11, code 0)
      • the source IP address of the datagram carrying this ICMP message is the IP address of the router where the TTL expired, so the source learns each router on the path
    • when ICMP message arrives at source: record RTTs
    • stopping criteria
      • UDP segment eventually arrives at destination host
      • destination returns ICMP “port unreachable” message (type 3, code 3)
      • source stops
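The traceroute mechanism above can be simulated in pure Python (real traceroute needs raw sockets; here the path is a list of hypothetical hop names and the "network" is a function).

```python
# Simulation of the traceroute mechanism: the source sends probes with
# TTL = 1, 2, 3, ...; the nth router decrements TTL to 0, drops the
# probe, and returns ICMP "time exceeded" (type 11, code 0); the
# destination returns "port unreachable" (type 3, code 3), which is
# the stopping criterion.

def forward(path, ttl):
    """Return (responder, icmp_type, icmp_code) for one probe."""
    for hop in path[:-1]:      # intermediate routers
        ttl -= 1
        if ttl == 0:
            return hop, 11, 0  # TTL expired at this router
    return path[-1], 3, 3      # reached destination: port unreachable

def traceroute(path):
    hops, ttl = [], 1
    while True:
        responder, t, c = forward(path, ttl)
        hops.append(responder)
        if (t, c) == (3, 3):   # stopping criterion
            return hops
        ttl += 1

route = ["r1", "r2", "r3", "dest"]
assert traceroute(route) == ["r1", "r2", "r3", "dest"]
```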

Link Layer

Overview

  • understand principles behind link layer services:
    • error detection, correction
    • sharing a broadcast channel: multiple access
    • link layer addressing
  • practice: instantiation, implementation of various link layer technologies
    • Ethernet
    • VLANs
    • MPLS
    • data center networks
  • terminology
    • hosts, routers: nodes
    • communication channels that directly connect physically adjacent nodes: links
      • wired, wireless
      • LANs
    • layer-2 packet: frame, encapsulates datagram
    • link layer has responsibility of transferring datagram from one node to physically adjacent node over a link
  • context
    • datagram transferred by different link protocols over different links
      • e.g. WiFi on first link, Ethernet on next link
    • each link protocol provides different services
      • e.g. may or may not provide reliable data transfer over link
    • transportation analogy:
      • trip from Princeton to Lausanne
        • limo: Princeton to JFK
        • plane: JFK to Geneva
        • train: Geneva to Lausanne
      • tourist = datagram
      • transport segment = communication link
  • services
    • framing, link access
      • encapsulate datagram into frame, adding header, trailer
      • channel access if shared medium
      • MAC addresses in frame headers identify source, destination
    • reliable delivery between adjacent nodes
      • we already know how to do this
      • seldom used on low bit-error links
      • wireless links: high error rates
        • Q: why both link-level and end-end reliability?
    • flow control
      • pacing between adjacent sending and receiving nodes
    • error detection
      • errors caused by signal attenuation, noise
      • receiver detects errors, signals retransmission, or drops frame
    • error correction
      • receiver identifies and corrects bit error(s) without retransmission
    • half-duplex and full-duplex
      • with half duplex, nodes at both ends of link can transmit, but not at same time
  • Host link-layer implementation
    • in each-and-every host
    • link layer implemented on-chip or in network interface card (NIC)
      • implements link, physical layer
    • attaches into host’s system buses
    • combination of hardware, software, firmware
  • Interfaces communicating
    • sending side:
      • encapsulates datagram in frame
      • adds error checking bits, reliable data transfer, flow control, etc.
    • receiving side:
      • looks for errors, reliable data transfer, flow control, etc
      • extracts datagram, passes to upper layer at receiving side
  • Error detection
    • Parity checking
      • single bit parity
      • 2-D parity:
        • detect 2 bit errors
        • detect and correct single bit errors without retransmission
    • Internet checksum
    • Cyclic Redundancy Check (CRC)
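The 2-D parity scheme above can be sketched directly: data bits go in a grid with an even-parity bit per row and per column, so a single flipped bit produces exactly one bad row parity and one bad column parity, whose intersection locates (and so corrects) the error. A minimal sketch; the grid contents are arbitrary, and it only handles the single-bit-error case described in the notes.

```python
# 2-D even parity: detect 2-bit errors, detect AND correct single-bit
# errors without retransmission.

def parity_bits(grid):
    """Even parity bit for each row and each column of a bit grid."""
    rows = [sum(r) % 2 for r in grid]
    cols = [sum(c) % 2 for c in zip(*grid)]
    return rows, cols

def locate_error(grid, rows, cols):
    """Return (row, col) of a single-bit error, or None if parity passes."""
    new_rows, new_cols = parity_bits(grid)
    bad_r = [i for i, (a, b) in enumerate(zip(rows, new_rows)) if a != b]
    bad_c = [j for j, (a, b) in enumerate(zip(cols, new_cols)) if a != b]
    if not bad_r and not bad_c:
        return None
    return bad_r[0], bad_c[0]  # intersection pinpoints the flipped bit

data = [[1, 0, 1],
        [0, 1, 1],
        [1, 1, 0]]
rows, cols = parity_bits(data)  # sender computes and transmits parity

data[1][2] ^= 1                 # channel flips one bit in transit
assert locate_error(data, rows, cols) == (1, 2)
data[1][2] ^= 1                 # receiver corrects it, no retransmission
assert locate_error(data, rows, cols) is None
```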

Overview

  • the multiple access link
  • multiple access protocols
    • channel partitioning
    • random access
    • taking turns
  • case study: cable access networks