
Dynamically allocated pseudo-filesystems


By Jake Edge
May 16, 2022
LSFMM

It is perhaps unusual to have a kernel tracing developer leading a filesystem session, Steven Rostedt said, at the beginning of such a session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). But he was doing so to try to find a good way to dynamically allocate kernel data structures for some of the pseudo-filesystems, such as sysfs, debugfs, and tracefs, in the kernel. Avoiding static allocations would save memory, especially on systems that are not actually using any of the files in those filesystems.

Problem


He presented some statistics on the number of files and directories on one of his systems in /sys, /proc, /sys/kernel/tracing (the usual mount point for tracefs), and /sys/kernel/debug (debugfs). In all, he found 29,384 directories and 290,807 files. That's a lot of files, but, he asked, why should he care about that? To answer that, he noted that at one point, he had suggested that Alexei Starovoitov use tracing instances, which add another set of ring buffers for trace events and add a bunch of control files in tracefs. But Starovoitov tried that and complained that new instances used too much memory. The ring buffers are fairly modest in size, a bit over a megabyte per CPU, so Rostedt dug in a bit deeper. It turns out that whenever another instance gets added to tracefs, it adds around 18,000 files. Adding up the in-memory size of the inodes and directory entries (dentries) shows that 14MB is consumed for each tracing instance that gets added.
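
That 14MB figure is consistent with simple arithmetic. As a rough cross-check (the structure sizes below are ballpark assumptions, not measurements from Rostedt's system; sizeof(struct inode) and sizeof(struct dentry) vary with the kernel configuration), each cached file costs roughly one inode plus one dentry:

    /* Ballpark only: actual sizes depend on the kernel configuration. */
    #define TRACEFS_FILES_PER_INSTANCE  18000UL
    #define APPROX_INODE_BYTES          600UL
    #define APPROX_DENTRY_BYTES         200UL

    /* 18000 * (600 + 200) = 14,400,000 bytes, i.e. about the 14MB quoted above */
    static const unsigned long approx_bytes_per_instance =
            TRACEFS_FILES_PER_INSTANCE * (APPROX_INODE_BYTES + APPROX_DENTRY_BYTES);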

Looking beyond that, /sys consumes 42MB and /proc uses a whopping 202MB for these in-memory inodes and dentries, he said. But David Howells pointed out that /proc does not keep dentries and inodes around. Rostedt said that if he can use the same technique as procfs, "my talk is over". Ted Ts'o cautioned that it was a procfs-specific hack that had never been generalized, though Howells thought that perhaps it could be.
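
The technique in question is, roughly, that procfs marks its dentries as not worth caching, so they are freed as soon as the last reference is dropped instead of accumulating. A filesystem can ask for that behavior through a d_delete hook; a minimal sketch of the idea (not the actual procfs code) looks like this:

    #include <linux/dcache.h>

    /*
     * Returning 1 from ->d_delete tells the dcache to free the dentry as
     * soon as its reference count drops to zero, rather than caching it.
     */
    static int pfs_delete_dentry(const struct dentry *dentry)
    {
            return 1;
    }

    static const struct dentry_operations pfs_dentry_ops = {
            .d_delete = pfs_delete_dentry,
    };

    /* ... and in superblock setup:  sb->s_d_op = &pfs_dentry_ops;  */

The kernel's libfs already provides a helper of exactly this shape, always_delete_dentry(), reachable via simple_dentry_operations, so a filesystem opting in does not even need its own copy.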

On the other hand, Chris Mason looked at a Meta production server to see what its /proc looked like; a find from the root took multiple minutes, and pegged the CPU at 100%, to find that there were 31 million files in it. He suggested that the procfs-specific hack "might not be the right hack" to use.

Christian Brauner said that since tracefs is its own filesystem, the procfs technique could simply be used there. But Rostedt was adamant that he did not want a hack just to fix the problem for tracefs; he wanted to find a proper solution that could be generalized for others to use. There should be a generic way for any pseudo-filesystem to opt into a just-in-time mode, where the inodes and dentries are allocated when the files and directories are accessed.
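
In rough terms, such a just-in-time mode means a filesystem keeps only a small private descriptor per file and defers creating the VFS objects until lookup time. Below is a minimal sketch of that shape, with a hypothetical pfs_entry descriptor and pfs_find_entry() helper standing in for whatever a real filesystem would keep; it is illustrative only, not tracefs or eventfs code:

    #include <linux/fs.h>
    #include <linux/err.h>

    /* Hypothetical descriptor kept in place of a dentry/inode pair. */
    struct pfs_entry {
            const char *name;
            umode_t mode;
            const struct file_operations *fops;
            void *data;
    };

    /*
     * ->lookup() runs when a name is first accessed; only then is a VFS
     * inode allocated and attached to the dentry.  Until that point the
     * directory costs only the small pfs_entry above.
     */
    static struct dentry *pfs_lookup(struct inode *dir, struct dentry *dentry,
                                     unsigned int flags)
    {
            struct pfs_entry *e = pfs_find_entry(dir, dentry->d_name.name);
            struct inode *inode;

            if (!e)
                    return ERR_PTR(-ENOENT);

            inode = new_inode(dir->i_sb);
            if (!inode)
                    return ERR_PTR(-ENOMEM);

            inode->i_ino = get_next_ino();
            inode->i_mode = e->mode;
            inode->i_fop = e->fops;
            inode->i_private = e->data;

            /*
             * The dcache may drop this dentry again under memory pressure;
             * it will simply be re-created here on the next access.
             */
            return d_splice_alias(inode, dentry);
    }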

eventfs

Rostedt noted that Ajay Kaher gave a presentation at the 2021 Linux Plumbers Conference (LPC) on eventfs, which dynamically allocates the dentries and inodes for all of the tracing events that appear in tracefs. It is a kind of sub-filesystem for tracefs to handle the event files dynamically so that new instances do not consume so much memory. It only does the dynamic allocation for the events, and not for the other control files that appear in tracefs, Rostedt said. He did some testing with and without eventfs and found that it made a huge difference. Creating a new instance without eventfs used around 11MB extra, while doing that with eventfs only used about 1MB. At LPC, some attendees said that the feature is something that should be added as an option for all pseudo-filesystems, which is what brought Rostedt to LSFMM. He wanted to get a sense for the best way to accomplish this goal and to figure out what the internal API would look like.

In particular, since the event dentries and inodes are only present while they are being used, at least in eventfs, he is concerned that the API needs to have a way to keep them in memory while a trace involving them is running. The worry is that memory pressure could cause eventfs to be unable to create the file to disable the event. David Howells suggested that an emergency pool could be used to handle that particular problem.
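
Howells's suggestion was not fleshed out in the session, but the kernel's mempool API is one existing way to express such a reserve. A sketch of what that might look like, reusing the hypothetical pfs_entry descriptor from the sketch above (this only guarantees the private descriptor; reserving the dentry and inode allocations themselves would take more plumbing):

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    /*
     * Hypothetical: keep a few preallocated descriptors in reserve so that
     * the control file needed to disable an event can still be created
     * while normal allocations are failing.
     */
    static mempool_t *eventfs_emergency_pool;

    static int __init eventfs_pool_init(void)
    {
            eventfs_emergency_pool =
                    mempool_create_kmalloc_pool(8, sizeof(struct pfs_entry));
            return eventfs_emergency_pool ? 0 : -ENOMEM;
    }

    static struct pfs_entry *eventfs_alloc_entry(void)
    {
            /* Falls back to the reserved elements if kmalloc() fails. */
            return mempool_alloc(eventfs_emergency_pool, GFP_KERNEL);
    }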

Brauner asked which API was used for tracefs; did it use the sysfs API, for example? Rostedt said that tracefs has its own API and is completely separate from any of the other pseudo-filesystems. Tracefs came about because people wanted tracing information available on production systems but did not want to build debugfs into them. So, at Greg Kroah-Hartman's suggestion, Rostedt started with the debugfs code and turned it into tracefs.

Since tracefs has its own API, and does not rely on sysfs or kernfs, for example, that gives it more leeway to define an API for the just-in-time feature without having to convert the others, Brauner said. He thinks it will be difficult to come up with something that could be shared between tracefs and procfs, however, because procfs is so special.

Rostedt said that perhaps tracefs "could be the guinea pig" for the feature, then other filesystems could convert over in time if that was seen as useful. He too wonders if procfs is too special to fit in, however. Mason's concern about procfs being slow because it creates its entries on the fly may also mean that other filesystems will not want the feature. Howells said with a chuckle that if Rostedt wanted to thoroughly test the feature, "putting it in procfs would be one good way to do that".

Approach

Currently eventfs covers just a portion of the control files in tracefs; Rostedt would like to handle all of the tracefs files that way. But the feedback he has gotten from virtual filesystem (VFS) layer developers is that this should not be done solely for tracefs, so he was wondering what the right approach would be.

Amir Goldstein asked if Rostedt had talked with Kroah-Hartman to see if he would be interested in this feature for debugfs. It would seem that debugfs might also benefit from it. Rostedt said he had not asked Kroah-Hartman about that. But Brauner said that debugfs and sysfs have an ingrained idea that it is the responsibility of the creator of the directories and files to clean them up, which is different from the centralization in eventfs (or something along those lines); it might be difficult to rework those other filesystems to use a different model.

Rostedt is also concerned about race conditions and lock-ordering problems, based on his review of the eventfs code. Howells said those kinds of problems "have all been pretty well sorted in procfs". Processes come and go, as do their entries in procfs, even if they are being used. Procfs has its own structure that describes just the pieces it needs, he said, and it creates dentries and inodes on demand. It already deals with the problem of the process directory going away when the process does, though files in that subtree may still be open.

Rostedt wondered whether he should continue working on eventfs with Kaher or if they should drop that and try to make it work for all of tracefs. Eventfs might make a good test case for where the problem areas are. Brauner asked if there were other users who wanted this functionality, which might help guide which way to go. Howells reiterated the idea that procfs might provide the best model to look at since it already handles many of the same kinds of problems.

Overall, Rostedt said that he was not hearing anyone argue that he should not continue working on the idea. In addition, he said that he now has some good ideas of what code to look at as well as names of people to ask questions of. Patches are presumably forthcoming once he and Kaher determine the path they want to pursue.


Index entries for this article
Kernel: Filesystems/Pseudo
Conference: Storage Filesystem & Memory Management/2022



Dynamically allocated pseudo-filesystems

Posted May 17, 2022 20:57 UTC (Tue) by neilbrown (subscriber, #359) [Link]

> a find from the root took multiple minutes, and pegged the CPU at 100%, to find that there were 31 million files in it.

Is this even slightly surprising? If procfs doesn't keep everything always in the dcache/icache, then the find has to bring everything into the dcache/icache. This requires allocating all those dentries and inodes - at the very least. If the "multiple" is (say) 5, then I calculate 9 microseconds per file - not too bad. And of course the CPU will be at 100% - there is no device IO to wait for.

If you want "find" to be fast, keep everything in the cache and put up with the memory cost. If you want to save memory, then expect "find" to be slow - the first "find" at least. The second one should be faster because everything is in the cache.

> But Rostedt was adamant that he did not want a hack just to fix the problem for tracefs; he wanted to find a proper solution that could be generalized for others to use.

Beware of premature optimisation (the rt of al evl), and premature generalisation. If you start by trying to create a completely general solution, you are likely to create a monstrosity. It would be best to look at what procfs has done, and then create something for tracefs which copies the useful lessons but tunes them specifically for tracefs - because tracefs is all you really know. If there is some abstraction that would clearly be useful for both, then maybe that would be worth putting in fs/libfs.c. Then when someone else wants to do the same thing for some other filesystem, they will have two working examples to learn from and will be able to create even more common code. Incremental development for the win.

Dynamically allocated pseudo-filesystems

Posted May 17, 2022 23:41 UTC (Tue) by dgc (subscriber, #6611) [Link]

> > a find from the root took multiple minutes, and pegged the CPU at 100%, to find that there were 31 million files in it.
>
> Is this even slightly surprising?

Nope.

> If you want "find" to be fast, keep everything in the cache and put up with the memory cost.

But that's just plain wrong. Caches only speed up the *second* access and find is generally a single access cold cache workload.

Indeed, what I find surprising is that nobody seems to recognise that the limit here is find being "100% CPU bound". That is, find isn't automatically multithreading and making use of all the CPUs in the system. Yet find is a trivially parallelisable workload - iterating individual (sub-) directories per thread scales almost perfectly out to either IO or CPU hardware limits.
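
To make the per-subdirectory idea concrete, here is a toy user-space sketch (not any of the tools discussed, and deliberately simplistic: a fixed thread cap, no work stealing, and it trusts d_type). Each thread walks one top-level subtree with nftw(), so independent subtrees - such as the /proc/<pid> directories - are scanned concurrently.

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <ftw.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static atomic_ulong nfiles;

    /* nftw() callback: the stat() data comes for free with each entry. */
    static int count_entry(const char *path, const struct stat *sb,
                           int type, struct FTW *ftwbuf)
    {
            (void)path; (void)sb; (void)type; (void)ftwbuf;
            nfiles++;
            return 0;       /* keep walking */
    }

    /* One thread per top-level subdirectory. */
    static void *walk_subtree(void *arg)
    {
            char *dir = arg;

            nftw(dir, count_entry, 32, FTW_PHYS);
            free(dir);
            return NULL;
    }

    int main(int argc, char **argv)
    {
            const char *root = argc > 1 ? argv[1] : ".";
            pthread_t tids[1024];   /* toy cap on the number of walkers */
            struct dirent *de;
            DIR *d = opendir(root);
            int n = 0;

            if (!d)
                    return 1;
            while ((de = readdir(d)) != NULL && n < 1024) {
                    char *path;

                    if (de->d_type != DT_DIR || !strcmp(de->d_name, ".") ||
                        !strcmp(de->d_name, ".."))
                            continue;
                    if (asprintf(&path, "%s/%s", root, de->d_name) < 0)
                            continue;
                    if (pthread_create(&tids[n], NULL, walk_subtree, path) == 0)
                            n++;
                    else
                            free(path);
            }
            closedir(d);
            while (n > 0)
                    pthread_join(tids[--n], NULL);
            printf("%lu entries\n", (unsigned long)nfiles);
            return 0;
    }

Build it with something like "gcc -O2 -pthread"; when the top-level subtrees are of comparable size, the stat() load spreads across CPUs instead of pegging one.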

e.g. I can run a concurrent find+stat iteration that visits every inode in a directory structure of over 50 million inodes on XFS in about 1m30s on my test machine before 16+ CPUs are fully CPU bound on inode cache lock contention. With lock contention sorted, it scales out to 32 CPUs and comes down to about 30s - roughly 1.5 million inodes a second can be streamed through the dentry and inode cache before being CPU bound again.

The inode cache alone on this machine can stream about 6 million cold inodes/s (XFS bulkstat on same 50 million inodes using DONT_CACHE) before we run out of CPU and memory reclaim starts to fall over handling the >10GB/s of memory allocation and reclaim this requires (on a 16GB RAM machine). And even with this sort of crazy high inode scanning rate, the disk is only barely over 50% utilised at ~150k IOPS and 3.5GB/s of read bandwidth.

Modern SSDs are *crazy fast*, we can build machines containing dozens of them, and we have the memory bandwidth to feed them all. In-memory and pseudo filesystems that use CPUs to do all the processing/IO (and I include PMEM+DAX in that group) are *slow* compared to the amount of cached data we can stream and access via asynchronous DMA directly to/from the hardware.

So what this anecdote says to me is that this 'find is slow' problem is caused by the fact that our basic filesystem tools still treat systems and storage as if they were machines from the 1980s - one CPU and a really slow spinning disk - and so fail to use much of the capability the hardware actually has....

> Beware of premature optimisation (the rt of al evl)

Yup, optimising OS structures because a single-threaded find is CPU bound is optimising the wrong thing. We should be providing tools that can, out of the box, scale out to the capability of the underlying hardware they are provided with. There are orders of magnitude to be gained by scaling out the tool; optimising for a single CPU-bound workload will, at best, gain a few percent.

-Dave.

Dynamically allocated pseudo-filesystems

Posted May 18, 2022 6:04 UTC (Wed) by zdzichu (subscriber, #17118) [Link]

The article didn't state _which find_ was used. We guess it was GNU/find.
I'm personally using https://github.com/sharkdp/fd daily. It parallelizes on all CPU cores by default.

Dynamically allocated pseudo-filesystems

Posted May 18, 2022 8:43 UTC (Wed) by dgc (subscriber, #6611) [Link]

True, but it doesn't really matter _which find_ was used if it only used 100% of a single CPU. A parallel find that was constrained to a single CPU would behave the same.

FWIW, I do know there are find (and other tool) variants out there that are multi-threaded. I use tools like lbzip2 because compression is another common operation that is trivially parallelisable. The problem is that we have to go out of our way to discover and then install multi-threaded tools. It is long past the point where the distros should be defaulting to parallelised versions of common tools rather than having them be the exception...

-Dave.

Dynamically allocated pseudo-filesystems

Posted May 26, 2022 14:31 UTC (Thu) by mrugiero (guest, #153040) [Link]

> I use tools like lbzip2 because compression is another common operation that is trivially parallelisable.

There are caveats for compression. Block schemes like bzip2 are trivially parallelisable with increased memory usage (which is quite low anyway) as the only drawback, but Lempel-Ziv and streaming compressors in general may take a hit to compression ratio, at least if done without care.

Dynamically allocated pseudo-filesystems

Posted May 23, 2022 4:33 UTC (Mon) by alison (subscriber, #63752) [Link]

A colleague once filed a bug ticket with the complaint that "find" on /proc took so long. "Tell Linus," I wrote in the comments and marked as "Won't Fix."

Dynamically allocated pseudo-filesystems

Posted May 18, 2022 10:36 UTC (Wed) by adobriyan (subscriber, #30858) [Link]

/proc probably needs its own specialised finder because of how everything is lumped together.

net sysctl stuff is very overrepresented: the golden record is in /proc/sys/net, which is effectively copied to
/proc/$pid/net, /proc/$pid/task/$pid/net (the pid is the same!), and /proc/$pid/task/$tid/net.
Add /proc/*/fd and /proc/*/map_files and it is unbounded.

"find /proc -type f -inum +4026531839" to search for something not in /proc/$pid/ doesn't help with the memory problem
because find doesn't know not to recurse into the top-level process directories.

Filtering out names full of integers will skip directories in /proc/bus/pci, so it is not reliable either.

