Xinye Tao


# Created: 2020-02-07; Modified: 2020-06-24

Paper and blog, technical reads.


Noise, queueing, work.


The paper limits inflight IO requests, a strategy also used by ScyllaDB. It penalizes throughput because the effectiveness of upper-level scheduling is inversely related to the number of requests visible to the hardware driver. In the evaluation results, reads are more sensitive to concurrency, e.g. bandwidth drops by 50% when only 8 inflight requests are allowed.

One thing though: I don’t think a cross-core queue is a must for implementing a strict priority scheduler. It seems the paper didn’t investigate to what extent this global queue contributes to IO latency.
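A rough way to see why capping inflight requests costs throughput is Little's law (throughput = concurrency / latency). The sketch below is not from the paper: the function names, the constant 100 µs service latency, and the 4 KiB request size are all assumptions, and a real device's latency grows with queue depth, so treat the result as an upper bound.

```python
def max_iops(inflight_limit, service_latency_s):
    """Little's law: throughput = concurrency / latency."""
    return inflight_limit / service_latency_s


def max_bandwidth_mib_s(inflight_limit, service_latency_s, request_kib):
    """Upper bound on bandwidth when at most `inflight_limit` requests are outstanding."""
    return max_iops(inflight_limit, service_latency_s) * request_kib / 1024


# Hypothetical device: 100 microseconds per 4 KiB read, independent of queue depth.
LATENCY_S = 100e-6
REQUEST_KIB = 4

for depth in (8, 32, 128):
    bound = max_bandwidth_mib_s(depth, LATENCY_S, REQUEST_KIB)
    print(f"inflight={depth:3d}: at most {bound:.1f} MiB/s")
```

Under this model the bound scales linearly with the inflight limit, so any cap below what the device needs to saturate translates directly into lost bandwidth.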


We need interactive intelligence, hence warehouse-scale computers.

Wimpy cores can’t get by on request-level parallelism alone and need the help of multi-threading, which incurs development and maintenance costs.

Flash has a longer latency tail than spinning disk, but moving to it is inevitable because spinning disks have an upper bound on IOPS per volume. (Why not RAID? Not very convincing.)

PUE (Power Usage Effectiveness) is good already. Now we care more about energy proportionality: it turns out a fully utilized computer is the most energy-efficient and infra-efficient (power plant etc.).
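To make the energy proportionality point concrete, here is a minimal sketch assuming a linear power model (a fixed idle draw plus a part proportional to load). The model, the function names, and the 100 W / 200 W figures are assumptions for illustration, not numbers from the talk.

```python
def power_watts(utilization, idle_w, peak_w):
    """Linear power model: idle draw plus a part proportional to load."""
    return idle_w + (peak_w - idle_w) * utilization


def work_per_watt(utilization, idle_w, peak_w):
    """Useful work (as a fraction of peak throughput) delivered per watt."""
    return utilization / power_watts(utilization, idle_w, peak_w)


# Hypothetical server that draws half of its peak power when idle.
IDLE_W, PEAK_W = 100.0, 200.0

for u in (0.1, 0.3, 0.5, 1.0):
    efficiency = work_per_watt(u, IDLE_W, PEAK_W) * 1000
    print(f"{u:.0%} utilized: {efficiency:.2f} work units per kW")
```

As long as the idle draw is nonzero, efficiency keeps rising with utilization and peaks at full load, which is why utilization becomes the thing to optimize.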

The key to improving utilization is resource disaggregation. At that time, slow disks could already be accessed remotely without sacrificing performance, but that was not the case for flash or memory. The software stack is optimized for bandwidth rather than latency, partly motivated by once-slow disks and networks. It must be reexamined because tail latency becomes critical at large scale.

An impressive showcase: inter-arrival time of root node and tree node.

A text summary of the talk above.


A good example of tail latency amplification in a parallel pipeline (RAID).
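The amplification can be sketched with basic probability: a request that fans out to several components (e.g. a striped read across disks) is as slow as its slowest part. The snippet assumes sub-requests are slow independently with the same probability; the function name and the 1% figure are illustrative assumptions.

```python
def prob_request_slow(fanout, p_slow):
    """P(at least one of `fanout` independent sub-requests is slow)."""
    return 1.0 - (1.0 - p_slow) ** fanout


# Each sub-request has a 1% chance of landing in the latency tail.
P_SLOW = 0.01

for n in (1, 4, 8, 100):
    print(f"fanout={n:3d}: {prob_request_slow(n, P_SLOW):.1%} of requests hit the tail")
```

With a fanout of 100, a 1-in-100 tail event on a single component affects well over half of all requests.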


The title is quite click-baity. For a non-native speaker such as myself, variation seems pretty much equivalent to instability. It is actually about improving the reproducibility of benchmarks on the storage stack.


RInK (Remote In-memory Key-value store) serves as a buffer (for short-lived data) or a cache. The main idea is to resurrect domain-specific caches; quite a boring read.

For CPUs, bandwidth refers to MIPS (million instructions per second), and latency refers to the latency of instructions.

Improving latency helps bandwidth, while improving bandwidth hurts latency (queueing and larger chip size), which is the essential reason why this rule of thumb will remain applicable in the foreseeable future.

Ways to improve latency: caching, replication (this is brought up in “The tail at scale” too), prefetching.


“The main bottleneck can be found in RocksDB itself, synchronizing compactor threads among each other.” I don’t see how compaction threads can be the bottleneck for a RocksDB overwrite workload above 10 cores.


Statistical analysis of disk failures. Unlike the previous model (where a disk goes through three failure periods: early-failure, useful-life, wearout), wearout seems to be the most determining factor in any year of operation.


Graphical overview of distributed event ordering: vector clocks, etc. Skimmed through.

An old but still interesting read. Its main emphasis is on the C/C++ specification of the abstract machine and its instruction order guarantees:

(C++ Standard) The observable behavior of the [C++] abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions.

(smells a lot like the pure function concept in functional programming languages)

This means that, to ensure the singleton’s allocation completes before the singleton pointer is modified, one must insert a volatile temporary variable to avoid any possible reordering. And here volatile Object* volatile ptr isn’t overkill.

But hey, even this isn’t nearly enough. C++ doesn’t guarantee instruction order beyond the single-threaded abstract machine. And a volatile variable only gains those benefits once it is properly initialized (which can be worked around by casting data members).

To this day, the standard’s memory order should be a portable solution to this issue. But last I checked, the standard didn’t mention any instruction order guarantee provided by memory barriers. Needs investigating.

In all fields, the problem is to find the question.

Love it. It shows how differently things are perceived when a high-performance server has only 10MB of RAM. Makes a good interview question about estimation. I wonder if modern VM sizing still caters to its target workloads in such economical ways.

One thing that isn’t as intuitive, though, is that they model the disk access price by dividing disk price by IOPS. Say we have a certain amount of data; that naturally sets the lower bound for disk capacity. Indeed, adding disks on top of that (through RAID) increases both cost and IOPS, but the baseline cost of storing the data should be excluded from the calculation. It’s basically what cloud vendors do now: pricing capacity separately from performance, including IOPS and bandwidth.
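The objection above can be sketched as a small cost split: buy enough disks to hold the data first, and attribute only the disks added beyond that to performance. Everything here (the function name, disk size, IOPS, and price) is a hypothetical illustration, not the paper's model or numbers.

```python
import math


def split_disk_cost(data_gb, target_iops, disk_gb, disk_iops, disk_price):
    """Split total disk spend into a capacity part and a performance part."""
    capacity_disks = math.ceil(data_gb / disk_gb)    # needed just to store the data
    iops_disks = math.ceil(target_iops / disk_iops)  # needed to serve the accesses
    total_disks = max(capacity_disks, iops_disks)
    capacity_cost = capacity_disks * disk_price
    performance_cost = (total_disks - capacity_disks) * disk_price
    return capacity_cost, performance_cost


# Hypothetical disk: 500 GB, 100 IOPS, $200. Workload: 1000 GB of data, 1000 IOPS.
naive_cost = 1000 * (200 / 100)  # price per IO/s times target IOPS
capacity_cost, performance_cost = split_disk_cost(1000, 1000, 500, 100, 200)

print(f"naive access cost:       ${naive_cost:.0f}")
print(f"capacity baseline:       ${capacity_cost}")
print(f"attributed to access:    ${performance_cost}")
```

When the target IOPS is already covered by the disks bought for capacity, the performance part drops to zero, which the price-divided-by-IOPS model cannot express.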

The 80-20 rule implies that about 80% of the accesses go to 20% of the data, and 80% of the 80% goes to 20% of that 20%.
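Applying the rule recursively is just repeated multiplication, as this small sketch shows. The function name and the default shares are assumptions matching the 80/20 split quoted above.

```python
def pareto_levels(depth, access_share=0.8, data_share=0.2):
    """Apply the 80-20 rule recursively `depth` times.

    Returns a list of (fraction of accesses, fraction of data) pairs.
    """
    return [(access_share ** k, data_share ** k) for k in range(1, depth + 1)]


for accesses, data in pareto_levels(3):
    print(f"{accesses:.1%} of accesses go to {data:.1%} of the data")
```

Two levels deep, 64% of the accesses land on 4% of the data, which is what makes a small cache worthwhile.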


Bottom-up reliability is not economical in distributed systems; identify the “end” first.