MAX_JIFFY_OFFSET: msleep() beyond infinity
Labels: linux, linux-kernel, msleep, Programming
Ever wondered what would happen if a driver in the Linux kernel were to invoke Buzz Lightyear's oft-quoted catchphrase?
If not, then today is your lucky day: you can ponder over this million-dollar (price of Lightyear's left wing-nut :P) question, safe in the knowledge that the definitive answer is only a couple of clicks of your mouse-wheel away.
NOTE: This article discusses the intricacies of the implementation of msleep(), specifically the conversion between milliseconds and jiffies within the Linux kernel. For details on implementing blocking calls and other constructs that wait indefinitely, refer to Section 6.2, "Blocking I/O", of LDD3.
First, a few numerical constants relevant to this discussion:
Largest unsigned 32-bit number = ULONG_MAX = 0xFFFFFFFF
Largest signed 32-bit number = LONG_MAX = 0x7FFFFFFF
Largest jiffies value defined = MAX_JIFFY_OFFSET = 0x3FFFFFFE
A "jiffy" is defined as 1/HZ seconds. On a typical system with HZ=100, 1 jiffy works out to be 10ms in real-world units of time.
MAX_JIFFY_OFFSET = 0x3FFFFFFE jiffies
                 = 1073741822 jiffies
                 = ~10737418 seconds
                 = ~2982 hours
                 = ~124 days
                 = ~17 weeks
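(The same arithmetic in a few lines of plain userspace C, assuming HZ=100; this is a back-of-the-envelope sketch, not kernel code.)

#include <stdio.h>

/* Back-of-the-envelope conversion of MAX_JIFFY_OFFSET into real-world
 * units, assuming HZ=100 (i.e. 1 jiffy = 10ms). Plain userspace C. */
int main(void)
{
    const unsigned long MAX_JIFFY_OFFSET = 0x3FFFFFFE;
    const unsigned int HZ = 100;              /* assumption: HZ=100 */

    double seconds = (double)MAX_JIFFY_OFFSET / HZ;
    printf("%lu jiffies = %.0f s = %.0f hrs = %.0f days = %.0f weeks\n",
           MAX_JIFFY_OFFSET, seconds, seconds / 3600,
           seconds / 86400, seconds / (86400 * 7));
    return 0;
}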
msleep() uses msecs_to_jiffies(), which relies on MAX_JIFFY_OFFSET as the definition of an infinite sleep timeout.
Now consider the input/output map of msecs_to_jiffies() over the entire range of a 32-bit unsigned int.
As is apparent, the msecs-to-jiffies mapping is mostly linear. Starting from 0, for increasing input values, msecs_to_jiffies() returns correspondingly larger values. However, once the input msecs exceeds LONG_MAX, the output jiffies is clamped to MAX_JIFFY_OFFSET.
It's obvious that, for certain input values, the result of msecs_to_jiffies() exceeds its own definition of "infinity", i.e. exceeds MAX_JIFFY_OFFSET. In fact, this happens for a quarter of the range of a 32-bit unsigned int! Pretty brain-dead, you say, eh?
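To see the overshoot concretely, here is a small userspace model of the conversion. It mimics the kernel logic described above but is NOT the kernel source; HZ=1000 (1 msec per jiffy) is assumed, and the 32-bit LONG_MAX is spelled out explicitly so the sketch behaves the same on a 64-bit build.

#include <stdio.h>

#define HZ                1000UL        /* assumption: HZ=1000, 1ms per jiffy */
#define MAX_JIFFY_OFFSET  0x3FFFFFFEUL
#define LONG_MAX_32       0x7FFFFFFFUL  /* 32-bit LONG_MAX, spelled out */

/* Userspace model of the 32-bit msecs_to_jiffies() behaviour described
 * above: "negative" (i.e. > LONG_MAX) inputs are clamped, everything
 * else is converted linearly. */
static unsigned long model_msecs_to_jiffies(unsigned int m)
{
    if ((int)m < 0)                     /* m > LONG_MAX: treated as "infinite" */
        return MAX_JIFFY_OFFSET;
    return (m + (1000 / HZ) - 1) / (1000 / HZ);
}

int main(void)
{
    unsigned int samples[] = { 1000, MAX_JIFFY_OFFSET, MAX_JIFFY_OFFSET + 1,
                               LONG_MAX_32, LONG_MAX_32 + 1 };
    for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        unsigned long j = model_msecs_to_jiffies(samples[i]);
        printf("msecs=%10u -> jiffies=%10lu %s\n", samples[i], j,
               j > MAX_JIFFY_OFFSET ? "(beyond \"infinity\"!)" : "");
    }
    return 0;
}

For inputs between MAX_JIFFY_OFFSET and LONG_MAX the model returns values beyond the supposed "infinity", which is exactly the quarter of the unsigned 32-bit range mentioned above.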
To be fair, the patch that introduced this definition of MAX_JIFFY_OFFSET is perfectly fine as it solves a genuine practical problem.
commit 9f907c0144496e464bd5ed5a99a51227d63a9c0b
Author: Ingo Molnar
[PATCH] Fix timeout overflow with jiffies
Prevent timeout overflow if timer ticks are behind jiffies
(due to high softirq load or due to dyntick),
by limiting the valid timeout range to MAX_LONG/2.
So what can be done for msecs_to_jiffies()?
Instead of clamping negative values to MAX_JIFFY_OFFSET
/*
 * Negative value, means infinite timeout:
 */
if ((int)m < 0)
	return MAX_JIFFY_OFFSET;
Should values larger than (MAX_LONG/2) be clamped to MAX_JIFFY_OFFSET?
/*
 * Prevent timeout overflow if timer ticks are behind jiffies
 * by limiting the valid timeout range to MAX_LONG/2.
 */
if (m > MAX_JIFFY_OFFSET)
	return MAX_JIFFY_OFFSET;
This update follows the spirit of the patch that introduced the current definition of MAX_JIFFY_OFFSET. Heck, even the comment is borrowed from it!
To answer the original question, any invocation of msleep(N)
where MAX_JIFFY_OFFSET < N < LONG_MAX
technically sleeps for longer than "infinity" as defined in the context of msleep().
For the technically inclined, a parting question -
Q. With the current implementation in the Linux kernel,
under what circumstances will msleep(10) return sooner than 10ms?
HDD, FS, O_SYNC : Throughput vs. Integrity
Labels: CodeProject, data-barrier, HDD, linux, linux-kernel, page-cache, SATA
Today we will spend some time on filesystems, block-devices, throughput and data-integrity. But first, a few "MYTHBUSTER" statements.
#1: Even the fastest HDD today can do ONLY 650KBps natively.
#2: O_SYNC on a filesystem does NOT guarantee a write to the HDD.
#3: Raw-I/O over BLOCK devices DOES guarantee data-integrity.
Hard to believe, right? Let's analyse these statements one by one...
The fastest hard-disk drives today run at 10,000 RPM (compared to regular ones at 5400/7200 RPM). Also, the faster HDDs have transitioned to a 4096-byte internal block-size (compared to 512 bytes on regular ones).
[Image: Details of HDD components]
To read/write one particular sector from/to the HDD, the head first needs to be aligned radially in the proper position. Next, one waits as the rotating disk platter brings the desired sector under the disk-head. Only then can one read/write the sector.
Unit of data on HDD = 1 sector = 4096 bytes
Max RPM = 10,000 = 10,000/60 = 166.667 rotations/second
Worst-case seek-time = 1/166.667 = 0.006 s
Throughput = 4096/0.006 = 682666.667 B/s ≈ 650 KBps
The above calculation assumes the worst possible values for both the I/O size (1 sector) and the seek-time (1 entire rotation). This condition, though, is quite easily seen in real-life scenarios like database applications which use the HDD as a raw block-device.
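(The same worst-case arithmetic as a rough C sketch; the RPM and sector-size values are the assumptions stated above, and real drives add bus and controller overheads.)

#include <stdio.h>

/* Rough worst-case throughput model: one sector transferred per full
 * platter rotation. Parameters are assumptions for a fast 10,000 RPM
 * drive with 4096-byte sectors. */
int main(void)
{
    const double rpm         = 10000.0;   /* assumed spindle speed    */
    const double sector_size = 4096.0;    /* assumed bytes per sector */

    double rotations_per_sec = rpm / 60.0;              /* ~166.7 rot/s */
    double worst_seek_time   = 1.0 / rotations_per_sec; /* ~0.006 s     */
    double throughput        = sector_size / worst_seek_time;

    printf("worst-case throughput = %.0f bytes/s\n", throughput);
    return 0;
}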
Better speeds in the range of 20-50MBps are commonly obtained by a combination of several strategies like:
- Multi-block I/O.
- Native Command Queueing.
- RAID striping.
Now let's consider a regular I/O request at the HDD level:
HDD Read:
- The kernel raises a disk read request to the HDD. The HDD has a small amount of disk-cache (RAM) which it checks to see if the requested data already exists there.
- If NOT found in the disk-cache then the disk-head is moved to the data location on the platter.
- The data is read into disk-cache and a copy is returned to the kernel.
HDD Write:
- The kernel raises a disk write request to the HDD.
- The HDD has a small amount of cache (RAM) where the data to be written is placed.
- An on-board disk controller is in charge of saving the contents of the cache to the platter. It operates on the following set of rules:
- [Rule1] Write cache to platter as soon as possible.
- [Rule2] Re-order write operations in sequence of platter location.
- [Rule3] Bunch several random writes together into one sequence.
[Image: "Hasta la vista, baby!" HUD of the disk-IO firmware running on a T-800 Terminator ;-)]
[Rule1] minimises data loss. The cache is volatile, i.e. any power-outage means that data in the HDD-cache which is NOT yet written to the HDD-platter is lost.
[Rule2] optimises throughput, as serialising access reduces the time the read-write head spends seeking.
[Rule3] reduces power consumption and disk-wear by allowing the disk to stop spinning constantly. Only when the cache fills to a certain limit is the disk motor powered on and the cache flushed to the platter, following which the motor is powered down again until the next cache flush.
It's obvious that [Rule1], [Rule2] and [Rule3] are mutually counter-productive, and the right balance needs to be struck between the three to obtain data-integrity, high throughput, low power consumption & longer disk-life. Several complex algorithms have been devised to handle this in modern-day HDD controllers.
The problem, though, is that by default performance is sacrificed in favour of the other two. This is just a "default" setting, and the beauty of the Linux kernel being open-source is that one is free to stray from it.
If you are doing mostly sequential raw block-I/O on a SATA HDD, you would be a prime candidate for this patch, which effectively moves the operating point of the HDD closer to the high-performance region of the map.
Moving on, we now focus on how the use of O_SYNC affects I/O on filesystems as well as the raw block device.
Regular write on a regular filesystem
fd = open("/media/mount1/file");
Consider the first case, where one does a regular write (NO O_SYNC flag) on a regular filesystem on a HDD. The data is copied from the APP to the FS, i.e. from the userspace app (RAM) into the kernel filesystem page-cache (RAM), and control returns. During this, one does NOT cross any data-barriers and hence the data is NOT guaranteed to be written to the HDD. This makes the entire process of a regular write on a filesystem extremely fast.
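For illustration, a minimal buffered write might look like the sketch below (the path is the hypothetical mount-point used above; error handling kept to a bare minimum). The key point: write() returns as soon as the data lands in the page-cache.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of a regular (buffered) write: no O_SYNC, no O_DIRECT.
 * write() returns as soon as the data is copied into the kernel
 * page-cache; nothing has necessarily reached the platter. */
int main(void)
{
    const char msg[] = "hello, page-cache\n";

    int fd = open("/media/mount1/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");
    close(fd);   /* close() does NOT imply the data is on disk either */
    return 0;
}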
Synchronous write on a regular filesystem
fd = open("/media/mount1/file", O_SYNC);
The second case illustrated above depicts a synchronous write (with O_SYNC flag) on a regular filesystem on a HDD. The man page for open() call contains the following notes:
O_SYNC The file is opened for synchronous I/O. Any writes on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware.
Although using O_SYNC looks like a sure-shot guarantee that data is indeed written to disk, there lies a catch in the implementation. Most HDDs contain an on-board cache on the HDD itself. A write command to the disk transfers the data from the kernel filesystem page-cache (RAM) to the HDD-cache, NOT to the actual mechanical platter of the disk. This transfer is limited by the bus (SATA, IDE etc.), which is faster than the actual mechanical platter. When a HDD receives a write command, it copies the data into its internal HDD-cache and returns immediately. The HDD's internal firmware later transfers the data to the disk-platter according to its "3 rules" as discussed previously. Thus a write to the HDD does NOT necessarily imply a write to the disk platter. Hence even synchronous writes on a filesystem do NOT imply 100% data integrity.
Also, a data-barrier exists along this path in the filesystem layer, where metadata (inode, superblock) info is stored. This helps in identifying any data-integrity issues on future access. Note that maintaining/updating inodes and superblocks does NOT guarantee that the data is written to disk. Rather, it makes the last sequence of writes atomic (i.e. all the writes get committed to disk, or none). The inode/superblock info serves as a kind of checksum, as it is updated accordingly following the atomic write operation. All this processing means that throughput incurs a slight penalty in this case.
Synchronous write on a block device
fd = open("/dev/sda", O_SYNC);
The third case illustrated above is a synchronous write to the HDD directly via its block-device (e.g. /dev/sda). In this case there is NO data-barrier in the filesystem. The O_SYNC is implemented by using the data-barrier present in the HDD, i.e. flushing the disk-cache explicitly to ensure that all the data is indeed transferred to the disk-platter before returning. This incurs the maximum penalty and hence the throughput is the slowest of the 3 scenarios above.
Salient observations:
- A data barrier is a module/function across which data integrity is guaranteed, i.e. if the function is called and it returns successfully, then the data is completely written to non-volatile memory (HDD, in this case). A data barrier introduces an order-of-magnitude change in:
- Access-time (++)
- Throughput (--)
- A data barrier in the lower layers incurs a larger penalty than one in the upper layers. (Penalty : App < FS < Disk.)
- O_SYNC on HDD via filesystems does NOT guarantee a successful write to non-volatile disk-platter.
- O_SYNC on HDD via block-device directly guarantees data-integrity but offers very low throughput.
- The HDD data-barrier (FLUSH_CACHE) is NOT utilised when using regular filesystems to access the HDD.
- Disabling HDD data-barrier and raw DIRECT-I/O via the block device provides maximum throughput to a HDD.
fd = open("/dev/sda", O_SYNC|O_DIRECT);
Further reading : 5 ways to improve HDD performance on Linux
SATA hotplug : Add/Remove sata HDD in a jiffy
Labels: CodeProject, HDD, linux, SATA
It's not common knowledge that SATA HDDs are capable of being hot-plugged into almost any modern PC. However, using them in a plug-n-play manner like portable external USB HDDs is still not common practice.
Here is how to use your SATA HDD like a portable HDD.
Tested on:
- Ubuntu 11.10
- DELL Optiplex 380
- Seagate Barracuda 1TB (SATA).
1. Connect the SATA HDD to host PC. (sata-bus + power)

2. Scan for new devices on SCSI host
sudo echo "- - -" > /sys/class/scsi_host/hostN/scanwhere N is the host port number on your host PC to which you have plugged-in the SATA HDD. Usually N=1, assuming the primary HDD on host PC is connected on SATA0.
"- - -" stands for wildcards in place of the
channel number, SCSI target ID, and LUN.
3. Mount the newly detected device locally
sudo mount /dev/sdX /media/temphdd
where X is a/b/c/d etc. Usually X=b, assuming the primary HDD on the host PC is enumerated as sda and there are no other block devices.
4. Copy all your data to/from the HDD present at /media/temphdd.
5. Once finished, unmount the device
sudo umount /media/temphdd
6. Powering down the SATA HDD
echo 1 | sudo tee /sys/block/sdX/device/delete
Ensure that you refer to the proper device (sdb, sdc etc.) as in step 3 above.
7. Disconnect the SATA HDD from host PC.
That's it! That's how one can use the hotplug feature of SATA HDDs to effectively use them as portable external HDDs.
5 ways to improve HDD speed on Linux
Labels: C, cache, CodeProject, HDD, linux, linux-kernel, page-cache, Programming
LINUX LINUX LINUX LINUX
LINUX LINUX LINUX LINUX
(If you still think this post is about making windows load faster, then press ALT+F4 to continue)
[Image: Our dear Mr. Client]
It's a fine sunny Sunday morning. Due tomorrow is your presentation to a client on improving disk-I/O. You pull yourself up by your boots and manage to climb out of bed and onto your favorite chair...
[Image: I think you forgot to wear your glasses...]

Aahh, that's better... You jump onto the couch, turn on your laptop and launch (your favorite presentation app here). As you sip your morning coffee and wait for the app to load, you look out of the window and wonder what it could be doing.

If you would be so kind as to wipe your rose-coloured glasses clean, you would see that this is what is ACTUALLY happening:

0. The app (running in RAM) decides that it wants to play spin-the-wheel with your hard-disk.
1. It initiates a disk-I/O request to read some data from the HDD to RAM (userspace).
2. The kernel does a quick check in its page-cache (again in RAM) to see if it has this data from any earlier request. Since you just switched on your computer,...
3. ...the kernel did NOT find the requested data in the page-cache. "Sigh!" it says and starts its bike and begins it journey all the way to HDD-land. On its way, the kernel decides to call-up its old friend "HDD-cache" and tells him that he will be arriving shortly to collect a package of data from HDD-land. HDD-cache, the good-friend as always, tells the kernel not to worry and that everything will be ready by the time he arrives in HDD-land.
4. HDD-cache starts spinning the HDD-disk...
5. ...and locates and collects the data.
6. The kernel reaches HDD-land and picks-up the data package and starts back.
7. Once back home, it saves a copy of the package in its cache in case the app asks for it again. (Poor kernel has NO way of knowing that the app has no such plans).
8. The kernel gives the package of data to the app...
9. ...which promptly stores it in RAM (userspace).
Do keep in mind that this is how it works in case of extremely disciplined, well-behaved apps. At this point misbehaving apps tend to go -
"Yo kernel, ACTUALLY, i didn't allocate any RAM, i wanted to just see if the file existed. Now that i know it does, can you please send me this other file from some other corner of HDD-land."...and the story continues...
Time to go refill your coffee-cup. Go go go...
[Image: Hmmm... you are back with some donuts too. Nice caching!]
So as you sit there having coffee and donuts, you wonder how one really improves disk-I/O performance. Improving performance can mean different things to different people:
- Apps should NOT slow down waiting for data from disk.
- One disk-I/O-heavy app should NOT slow down another app's disk-I/O.
- Heavy disk-I/O should NOT cause increased cpu-usage.
- (Enter your client's requirement here)
So when the disk-I/O throughput is PATHETIC, what does one do?...
[Image: 5 WAYS to optimise your HDD throughput!]
1. Bypass page-cache for "read-once" data.
What exactly does the page-cache do? It caches recently accessed pages from the HDD, thus reducing seek-times for subsequent accesses to the same data. The key word here being subsequent. The page-cache does NOT improve performance the first time a page is accessed from the HDD. So if an app is going to read a file once and just once, then bypassing the page-cache is the better way to go. This is possible by using the O_DIRECT flag, which means that the kernel does NOT consider this particular data for the page-cache. Reducing cache-contention means that other pages (which would be accessed repeatedly) have a better chance of being retained in the page-cache. This improves the cache-hit ratio, i.e. better performance.
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void ioReadOnceFile()
{
	/* Using direct_fd and direct_f bypasses the kernel page-cache.
	 * - direct_fd is a low-level file descriptor
	 * - direct_f is a filestream similar to one returned by fopen()
	 * NOTE: Use getpagesize() for determining optimal-sized (and
	 * suitably aligned) buffers, as O_DIRECT requires aligned I/O.
	 */
	int direct_fd = open("filename", O_DIRECT | O_RDWR);
	FILE *direct_f = fdopen(direct_fd, "w+");

	/* direct disk-I/O done HERE */

	fclose(direct_f);   /* also closes the underlying direct_fd */
}
2. Bypass page-cache for large files.
Consider the case of reading in a large file (e.g. a database) made up of a huge number of pages. Every subsequent page accessed gets into the page-cache, only to be dropped out later as more and more pages are read. This severely reduces the cache-hit ratio. In this case the page-cache does NOT provide any performance gains. Hence one would be better off bypassing the page-cache when accessing large files.
#define _GNU_SOURCE   /* for O_DIRECT, O_LARGEFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void ioLargeFile()
{
	/* Using direct_fd and direct_f bypasses the kernel page-cache.
	 * - direct_fd is a low-level file descriptor
	 * - direct_f is a filestream similar to one returned by fopen()
	 * NOTE: Use getpagesize() for determining optimal-sized (and
	 * suitably aligned) buffers, as O_DIRECT requires aligned I/O.
	 */
	int direct_fd = open("largefile.bin", O_DIRECT | O_RDWR | O_LARGEFILE);
	FILE *direct_f = fdopen(direct_fd, "w+");

	/* direct disk-I/O done HERE */

	fclose(direct_f);   /* also closes the underlying direct_fd */
}
3. If (cpu-bound) then scheduler == no-op;
The I/O-scheduler optimises the order of I/O operations queued onto the HDD. As seek-time is the heaviest penalty on a HDD, most I/O-schedulers attempt to minimise it. This is implemented as a variant of the elevator algorithm, i.e. re-ordering the randomly ordered requests from numerous processes into the order in which the data is present on the HDD. Such re-ordering can require a significant amount of CPU-time.
Certain tasks that involve complex operations tend to be limited by how fast the cpu can process vast amounts of data. A complex I/O-scheduler running in the background can be consuming precious CPU cycles, thereby reducing the system performance. In this case, switching to a simpler algorithm like no-op reduces the CPU load and can improve system performance.
echo noop > /sys/block/<block-dev>/queue/scheduler
4. Block-size: Bigger is Better
Q. How will you move Mount Fuji to Bangalore?
Ans. Bit by bit.
While this will eventually get the job done, it's definitely NOT the most optimal way. From the kernel's perspective, the most optimal size for I/O requests is the filesystem block-size (i.e. the page-size). As all I/O in the filesystem (and the kernel page-cache) is done in terms of pages, it makes sense for the app to do transfers in multiples of the page-size too. Also, with multi-segmented caches making their way into HDDs now, one would hugely benefit by doing I/O in multiples of the block-size.
[Image: Barracuda 1TB HDD : Optimal I/O block size 2M (=4 blocks)]
stat --printf="bs=%s optimal-bs=%S\n" --file-system /dev/<block-dev>
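The preferred I/O size can also be queried per-file via fstat() and used to size the read buffer; a hedged sketch of that idea (the file name is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: size the read buffer as a multiple of the filesystem's
 * preferred I/O block size (st_blksize), rather than reading a few
 * bytes at a time. */
int main(void)
{
    int fd = open("largefile.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        close(fd);
        return 1;
    }

    size_t bufsize = (size_t)st.st_blksize * 4;   /* a few blocks per read */
    char *buf = malloc(bufsize);
    if (!buf) {
        close(fd);
        return 1;
    }

    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0)
        ;                                          /* process n bytes here */

    free(buf);
    close(fd);
    return 0;
}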
5. SYNC vs. ASYNC (& read vs. write)
[Image: ASYNC I/O, i.e. non-blocking mode, is effectively faster with cache]
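The gist: queue the I/O and keep doing useful work instead of blocking in read(). A minimal sketch using POSIX AIO (the file name is hypothetical; older glibc needs -lrt at link time):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: asynchronous read with POSIX AIO. aio_read() queues the
 * request and returns immediately; the app can do other work and poll
 * (or get a signal) for completion instead of blocking in read(). */
int main(void)
{
    static char buf[4096];
    struct aiocb cb;

    int fd = open("largefile.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) {
        perror("aio_read");
        close(fd);
        return 1;
    }

    while (aio_error(&cb) == EINPROGRESS)
        ;   /* ...do other useful work here instead of busy-waiting... */

    ssize_t n = aio_return(&cb);
    printf("async read completed: %zd bytes\n", n);
    close(fd);
    return 0;
}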