This is a draft post. February 2021
This post contains somewhat generic observations I made while working on a project. I have left out the specifics because the project is proprietary (it is for my company), but that does not get in the way of the main gist of the problems encountered.
In summary, the project moves large amounts of data from a source to a destination. This post discusses the layers of buffers the bits encounter while traveling, and how I optimized the project for some of the more prominent ones in the communication stack between the two nodes.
Characteristics of Data
The data consists of chunks of varying sizes. Some chunks are large, over 10GB; some are small, under 10KB. Most chunks are around 5GB, give or take roughly 3GB.
Description of the System
- Add description: avoid batch, parallelism, bottlenecks, saturation, etc
- Network saturation: by having too many streams of I/O on a single machine, I am potentially going beyond the optimal in-flight bandwidth for the network stack (node xxx has 1Gbps/1Gbps and a nice network card, but note that it is a VM on that machine, so there is a level of indirection depending on the setup: VT-d/IOMMU/etc).
- By moving the data in parallel, I ran into a bottleneck caused by capping the total in-flight transfers at 30 chunks (the number is somewhat arbitrary).
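To put rough numbers on the saturation point, here is a back-of-the-envelope sketch. It assumes decimal units (1GB = 8000Mb), perfectly fair sharing of the link, and no protocol or virtualization overhead, none of which the real setup guarantees. The function names are made up for illustration.

```python
def per_stream_mbps(link_gbps: float, streams: int) -> float:
    """Ideal fair share of the link per stream, in Mbps (no overhead assumed)."""
    return link_gbps * 1000 / streams


def transfer_seconds(chunk_gb: float, rate_mbps: float) -> float:
    """Time to move one chunk at a steady rate, using decimal GB (1 GB = 8000 Mb)."""
    return chunk_gb * 8000 / rate_mbps


if __name__ == "__main__":
    share = per_stream_mbps(1.0, 30)
    print(f"share per stream: {share:.1f} Mbps")
    print(f"5GB chunk with the whole link: {transfer_seconds(5, 1000):.0f} s")
    print(f"5GB chunk at a 1/30 share: {transfer_seconds(5, share):.0f} s")
```

In this idealized model the total throughput does not change with the number of streams; each stream just gets slower. Adding streams past the saturation point therefore cannot help, and in practice it may hurt.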
The constraints observed here boil down to why we have buffers.
- Add why buffers
The application itself is effectively a buffer. Yes, it is lean: it pushes the bits out as soon as they arrive. Lean as it is, though, the streams vary in size, so because of (2) above (the cap on in-flight transfers) it waits for the largest chunk to finish before proceeding to the next batch. See how the batch comes back here, sigh.
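The cost of waiting on the largest chunk can be illustrated with a toy model. This is only a sketch: it assumes every stream moves data at the same fixed rate, so transfer time is proportional to chunk size, and the function name and sample sizes are invented for the example.

```python
def batch_utilization(sizes_gb, width):
    """Fraction of stream-time spent moving data when batches run in lockstep.

    Assumes every stream moves data at the same fixed rate, so a chunk's
    transfer time is proportional to its size.
    """
    busy = sum(sizes_gb)
    occupied = 0.0
    for i in range(0, len(sizes_gb), width):
        batch = sizes_gb[i:i + width]
        # Every slot in the batch is held until the largest chunk finishes.
        occupied += max(batch) * len(batch)
    return busy / occupied


if __name__ == "__main__":
    # One 10GB chunk next to smaller ones: the small streams sit idle.
    print(batch_utilization([10, 5, 5, 2], width=4))
```

With these made-up sizes the streams are busy for only a bit more than half of the time they hold a slot. A batch of equally sized chunks would score 1.0.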
This could be alleviated with slightly more complex logic that always keeps xx streams in flight. That would reduce the problem to just (1) of the observations above (network saturation).
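One way to sketch that logic is a semaphore that admits the next transfer the moment a slot frees up. This is not the project's actual code: `transfer` is a hypothetical stand-in that only sleeps, and the durations and limit are placeholders.

```python
import asyncio


async def transfer(chunk_id, seconds, stats):
    """Hypothetical stand-in for one real transfer; it only sleeps."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(seconds)
    stats["in_flight"] -= 1
    return chunk_id


async def run_window(durations, limit):
    """Keep up to `limit` transfers running; start the next as soon as one ends."""
    stats = {"in_flight": 0, "peak": 0}
    gate = asyncio.Semaphore(limit)

    async def guarded(chunk_id, seconds):
        async with gate:  # waits only while `limit` transfers are active
            return await transfer(chunk_id, seconds, stats)

    # gather returns results in submission order, not completion order.
    done = await asyncio.gather(
        *(guarded(i, s) for i, s in enumerate(durations))
    )
    return done, stats["peak"]


if __name__ == "__main__":
    finished, peak = asyncio.run(
        run_window([0.03, 0.01, 0.01, 0.01, 0.02], limit=2)
    )
    print(finished, peak)
```

Unlike a batch, a small chunk finishing here immediately makes room for the next one, so a single large chunk ties up one slot, not the whole group. For a very long list of chunks, a fixed pool of workers pulling from a queue would avoid creating every task up front.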