This is an archive of the discontinued LLVM Phabricator instance.

option to write files to memory instead of disk
Needs Review · Public

Authored by marcrasi on Apr 13 2020, 2:44 PM.

Details

Summary

This allows clients to invoke clang without writing anything to disk.

I'm not sure if this is the right design, so I'm sending this patch to
start a discussion about the right way to do this. I'll be happy to make
changes and add tests once we've settled on an approach!

Motivation:

At Google, we run some Swift source tooling on diskless servers. The
Swift source tooling calls Clang through the ClangImporter
(https://github.com/apple/swift/blob/master/lib/ClangImporter/ClangImporter.cpp).
Clang compiles modules into an on-disk module cache, which does not work on
the diskless servers. This patch lets clients ask Clang to write the
module cache to memory.

Current design:

  • Adds class InMemoryOutputFileSystem : public llvm::vfs::FileSystem that supports write operations backed by memory. After a file is written, its contents are accessible through the llvm::vfs::FileSystem interface.
  • Adds an InMemoryOutputFileSystem field to CompilerInstance. When this field is set, the CompilerInstance writes to the InMemoryOutputFileSystem instead of the real file system.
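
For concreteness, here is a rough usage sketch; the setter name, the namespace of InMemoryOutputFileSystem, and the consumeModule helper are placeholders invented for illustration, not the exact API in this patch:

```
// Hypothetical usage sketch; exact names are assumptions, not the patch's API.
#include "clang/Frontend/CompilerInstance.h"
#include "llvm/ADT/IntrusiveRefCntPtr.h"
#include "llvm/Support/VirtualFileSystem.h"

void compileWithoutDisk(clang::CompilerInstance &CI) {
  // The writable, memory-backed FS proposed by this patch (namespace assumed).
  llvm::IntrusiveRefCntPtr<llvm::vfs::InMemoryOutputFileSystem> OutputFS(
      new llvm::vfs::InMemoryOutputFileSystem());

  // Route outputs (including the implicit module cache) into memory
  // instead of the real file system (setter name assumed).
  CI.setInMemoryOutputFileSystem(OutputFS);

  // ... run the invocation ...

  // Written files are then readable back through the ordinary
  // llvm::vfs::FileSystem interface (path shown is illustrative).
  if (auto Buffer = OutputFS->getBufferForFile("/module-cache/Foo.pcm"))
    consumeModule((*Buffer)->getBuffer()); // consumeModule: hypothetical client code
}
```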

Diff Detail

Unit Tests: Failed

Event Timeline

marcrasi created this revision. · Apr 13 2020, 2:44 PM
Herald added a project: Restricted Project. · Apr 13 2020, 2:44 PM
marcrasi edited the summary of this revision. · Apr 13 2020, 2:44 PM
mwyman added a subscriber: mwyman. · Apr 13 2020, 3:00 PM

Ping on this. Could anyone take a look?

It's great that you're looking at this! Clang tooling in general and clangd in particular make a lot of use of VFS (particularly when deployed inside Google), so I've got a fair bit of interest/experience in these interfaces. We've also looked at virtualizing module caches in particular for clangd (but haven't tackled it yet).

A couple of high-level design questions:

  • if the goal is really to make module caching work without a physical disk, is virtualizing the FS underlying the cache really the right abstraction? Compared to an alternative like plugging in a ModuleCache interface, it seems like quite a complicated contract to have to maintain, and less flexible too (we've dreamed about ideas like a network-distributed cache, expiry, selectively sharing cached modules based on security context etc... easier if the objects are modules+metadata rather than filenames+bytes). Of course it's pretty generic and maybe can be used for other stuff too, but...
  • if we *are* going to introduce a writable VFS concept to LLVM, I feel we must allow this to be reused for virtualizing file writes in general. (clang has several ad-hoc mechanisms for virtual file inputs that AIUI predate llvm::vfs, and they're a pain). I think at minimum we should ensure that the writable FS is an interface rather than just one implementation, and that we can implement it well for the real FS also. Does this seem reasonable?
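
To make the second point concrete, one possible shape for such an interface (all names invented here, not existing llvm::vfs API) might be:

```
// Sketch only: a writable-FS *interface*, so the real FS, an in-memory FS,
// or something else entirely can all sit behind the same abstraction.
#include "llvm/ADT/Twine.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/VirtualFileSystem.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

namespace llvm {
namespace vfs {

class WritableFileSystem : public FileSystem {
public:
  /// Open \p Path for writing; the file becomes visible through the normal
  /// read-side FileSystem interface once the stream is closed.
  virtual Expected<std::unique_ptr<raw_pwrite_ostream>>
  openFileForWrite(const Twine &Path) = 0;
};

} // namespace vfs
} // namespace llvm
```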

I think this is important; thanks for working on it!

I'm sorry I missed it before... coincidentally, I'm working in a similar problem space right now (likely I'll post a patch with an RFC next week) but my design is a bit different.

The main abstraction I came up with is in two parts:

  • An OutputManager, which can support opening / closing outputs with configurable destinations (kind of like a chain of consumers, but my initial design is monolithic to simplify optional optimizations like using a WriteThroughMemoryBuffer for mirroring a just-written on-disk file in memory). The expectation is that various CompilerInstances (and other things) could share one of these. This is not a file system; it just has very simple "open" / "close" / "erase" APIs for output paths (initially it has three possible destinations: on-disk, in-memory (store in an InMemoryFileSystem), and a request-provided raw_pwrite_ostream; it can do any or all of these).
  • An OutputContext, which can be instantiated to track multiple outputs in a specific context. Each CompilerInstance would have one of these, effectively replacing the std::list<OutputFile> it currently has (clearOutputFiles would call OutputContext::closeAllOutputs, which then calls OutputManager::closeOutput for each output it's tracking). An OutputContext can carry a partial configuration that overrides the manager's default configuration.

(Initially, I was thinking of a virtual base class OutputManager with subclasses OnDiskOutputManager and InMemoryOutputManager, and I didn't have OutputContext as a separate abstraction, but I don't think that gives us the flexibility we need for various use cases; see below.)

I also have:

  • OutputConfig, configuration for a specific file; the default configuration lives on OutputManager (and is configurable).
  • PartialOutputConfig, a partial configuration that overrides only some of a configuration; can be put on an OutputContext ("mirror all of my outputs in memory" or "don't write to disk, only to memory, for these outputs"), or put on an individual "open file" request ("the client needs seeking support for this output").
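
A rough sketch of how these pieces might fit together, as described above; signatures and member names are guesses for illustration, not the actual patch:

```
// Sketch of the described abstractions; details are assumptions.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>
#include <optional>
#include <utility>

namespace clang {

/// Where a given output goes; the default lives on the OutputManager.
struct OutputConfig {
  bool WriteToDisk = true;
  bool StoreInMemory = false;                       // into an InMemoryFileSystem
  llvm::raw_pwrite_ostream *ClientStream = nullptr; // request-provided stream
};

/// Overrides only the fields a context or a single request cares about,
/// e.g. "mirror all of my outputs in memory".
struct PartialOutputConfig {
  std::optional<bool> WriteToDisk;
  std::optional<bool> StoreInMemory;
};

/// Shared across CompilerInstances; opens/closes/erases outputs at the
/// configured destinations.
class OutputManager {
public:
  using OutputHandle = unsigned; // opaque token for an in-flight output
  llvm::Expected<std::pair<OutputHandle, llvm::raw_pwrite_ostream *>>
  openOutput(llvm::StringRef Path, PartialOutputConfig Overrides = {});
  llvm::Error closeOutput(OutputHandle H);
  llvm::Error eraseOutput(OutputHandle H); // e.g. after a failed compile
};

/// Tracks one CompilerInstance's outputs, replacing std::list<OutputFile>;
/// clearOutputFiles() would call closeAllOutputs().
class OutputContext {
public:
  OutputContext(OutputManager &Manager, PartialOutputConfig Defaults = {});
  llvm::Error closeAllOutputs();
};

} // namespace clang
```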

These are the use cases I think we should support:

  • A "normal" clang invocation without implicit modules should just write to disk.
  • If invoking clang from the command line with implicit modules turned on, the modules should be mirrored in memory (but also written to disk). My idea here is that the VFS is constructed with an OverlayFileSystem at the top level and an initially-empty InMemoryFileSystem underneath, in front of the "usual" one (see the sketch after this list). The OutputContexts for the module builds would be configured to mirror all outputs into the InMemoryFileSystem. (This is a step toward dropping the rather awkward InMemoryModuleCache; I think that will be easy once this lands.)
  • Index services often don't want outputs written to disk. If they're happy passing in an InMemoryFileSystem to the OutputManager, they can use the same approach.
  • When clang-scan-deps is used to do a modules pre-scan, its source-minimized modules should never hit the disk. The OutputManager can be configured to only fill up the InMemoryFileSystem.
  • For unit tests or a -cc1 compile-in-memory API where we care about a single output file, it might be nice to be able to ignore most files (write to raw_null_ostream), but get back specific file content and/or configure certain files in some special way.
  • There are lots of other outputs in clang (e.g., index-while-building files from -index-store-path, which is still being upstreamed), which all do their own temp-file dance (... or worse, they don't); they should be able to delete code and use this as well. I think even references to errs() should go through this, so they can be captured in unit tests (or an in-memory compile API, or...).
  • It should be easy to cleanly add more output destinations (e.g., CAS) as we find the need, and OutputContext users would not need to know about specific destinations unless they wanted some special configuration.
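
The VFS wiring for the implicit-modules case above might look roughly like this, using existing llvm::vfs pieces; how the module builds' OutputContexts are then pointed at ModuleFS would go through the OutputManager configuration and is not shown:

```
// Sketch: an OverlayFileSystem with an initially-empty InMemoryFileSystem
// checked in front of the real file system; module builds would mirror
// their outputs into ModuleFS.
#include "llvm/ADT/IntrusiveRefCntPtr.h"
#include "llvm/Support/VirtualFileSystem.h"

llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>
makeModuleMirroringVFS(
    llvm::IntrusiveRefCntPtr<llvm::vfs::InMemoryFileSystem> &ModuleFS) {
  ModuleFS = new llvm::vfs::InMemoryFileSystem();

  // Overlays are consulted most-recently-pushed first, so the in-memory
  // module cache shadows the "usual" on-disk file system.
  llvm::IntrusiveRefCntPtr<llvm::vfs::OverlayFileSystem> Overlay(
      new llvm::vfs::OverlayFileSystem(llvm::vfs::getRealFileSystem()));
  Overlay->pushOverlay(ModuleFS);
  return Overlay;
}
```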

I'll reply here once I've posted the RFC and patch (as I said, I'm hoping next week) so you can take a look.

I'll reply here once I've posted the RFC and patch (as I said, I'm hoping next week) so you can take a look.

I don't quite have my patch and RFC ready (and I may not until I'm back from holidays in January), but if you're curious the current state is at https://github.com/dexonsmith/llvm-project/tree/output-manager / https://github.com/dexonsmith/llvm-project/commit/924cb118272acab5f031c04f55465ae4799be4bf.

@dexonsmith thanks for sharing!
Some initial thoughts since abstracting outputs is something we're starting to care about too...

This doesn't appear to be an extension point - you can write to an InMemoryFS or to real disk, but not to anything else. If we're going to add abstractions around output, I think they should support flexible use in libraries (which doesn't need to cost complexity here: FS vs MemoryFS vs custom storage can implement the same interface).
With the right interface, I think this obviates the need for RequestedOutput and generalizes it - a caller can install an intercepting output backend that hooks certain paths and delegates the rest to the old backend.
(Of course such a backend could be provided, but it gets the complexity out of the way of the core abstractions)

I think the lifecycle and allowed operations on outputs would be clearer if an Output class were returned, with methods, rather than a stream + handle and various "free" functions (well, methods on OutputManager) that manipulate the handle.
This would again make it easier to reason about what operations are part of a storage backend: there'd be some OutputBackend interface to implement, and it would hand out Output objects which again must be implemented. (This would be reminiscent of FileManager + vfs::FileSystem + vfs::File).
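
Roughly, the split being suggested might look like this (all names invented here for illustration):

```
// Sketch of a backend/output split, by analogy with FileManager +
// vfs::FileSystem + vfs::File; names are illustrative only.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

/// One in-flight output; its lifecycle lives on the object rather than on
/// handle-taking methods of a manager.
class Output {
public:
  virtual ~Output() = default;
  virtual llvm::raw_pwrite_ostream &getStream() = 0;
  virtual llvm::Error keep() = 0;    // commit to the backend's storage
  virtual llvm::Error discard() = 0; // drop, e.g. after a failed compile
};

/// Storage backends (real disk, in-memory, or custom library-provided
/// storage) implement this and hand out Output objects.
class OutputBackend {
public:
  virtual ~OutputBackend() = default;
  virtual llvm::Expected<std::unique_ptr<Output>>
  createOutput(llvm::StringRef Path) = 0;
};

/// The generalization of RequestedOutput: an intercepting backend hooks
/// certain paths and delegates everything else to the old backend.
class InterceptingBackend : public OutputBackend {
  std::unique_ptr<OutputBackend> Next;
  // createOutput() would return its own Output for hooked paths and
  // Next->createOutput(Path) otherwise (implementation elided).
};
```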

dexonsmith added a comment. (Edited) · Jan 5 2021, 1:09 PM

@dexonsmith thanks for sharing!
Some initial thoughts since abstracting outputs is something we're starting to care about too...

Thanks for looking.

This doesn't appear to be an extension point - you can write to an InMemoryFS or to real disk, but not to anything else. If we're going to add abstractions around output, I think they should support flexible use in libraries (which doesn't need to cost complexity here: FS vs MemoryFS vs custom storage can implement the same interface).
With the right interface, I think this obviates the need for RequestedOutput and generalizes it - a caller can install an intercepting output backend that hooks certain paths and delegates the rest to the old backend.
(Of course such a backend could be provided, but it gets the complexity out of the way of the core abstractions)

I agree with your concerns with the monolithic design. My motivation for it is to get the memory optimization of using a write-through memory buffer when the output is needed both on-disk and in-memory. I imagine similar concerns if/when we add a CAS-based storage option, where it's likely more efficient to ask the CAS to materialize the on-disk file than to write it ourselves.

I'll think about whether there's a clean way to model these kinds of interactions between storage implementations without a monolithic design; let me know if you have any concrete ideas.

Another option I see is to have an abstract extension point that allows libraries to plug in custom storage, but keep the monolithic design for on-disk and in-memory (and eventually CAS).
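
For illustration, a composed backend along these lines (building on the OutputBackend/Output interfaces sketched in the previous comment; every name and detail here is an assumption, not a concrete proposal) might look like:

```
// Illustrative only: a backend that mirrors every output into two other
// backends. A specialized disk+memory backend could instead use
// llvm::WriteThroughMemoryBuffer internally to avoid the extra copy.
// OutputBackend / Output are the interfaces sketched in the previous comment.
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include <memory>
#include <utility>

class MirroringBackend : public OutputBackend {
  std::unique_ptr<OutputBackend> Primary, Secondary;

public:
  MirroringBackend(std::unique_ptr<OutputBackend> Primary,
                   std::unique_ptr<OutputBackend> Secondary)
      : Primary(std::move(Primary)), Secondary(std::move(Secondary)) {}

  llvm::Expected<std::unique_ptr<Output>>
  createOutput(llvm::StringRef Path) override {
    auto First = Primary->createOutput(Path);
    if (!First)
      return First.takeError();
    auto Second = Secondary->createOutput(Path);
    if (!Second)
      return Second.takeError();
    // A small Output wrapper (not shown) would tee getStream() writes into
    // both underlying outputs and keep()/discard() them together.
    return makeTeeOutput(std::move(*First), std::move(*Second)); // hypothetical helper
  }
};
```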

I think the lifecycle and allowed operations on outputs would be clearer if an Output class were returned, with methods, rather than a stream + handle and various "free" functions (well, methods on OutputManager) that manipulate the handle.
This would again make it easier to reason about what operations are part of a storage backend: there'd be some OutputBackend interface to implement, and it would hand out Output objects which again must be implemented. (This would be reminiscent of FileManager + vfs::FileSystem + vfs::File).

I'll think some more about the options here. The current design is motivated by how outputs are currently used / referenced in CompilerInstance (and how the streams are passed around separately to APIs that don't know anything about outputs); i.e., it seems simple to land as an incremental change. But it's probably not hard to make it a bit cleaner.

@dexonsmith thanks for sharing!
Some initial thoughts since abstracting outputs is something we're starting to care about too...

Thanks for looking.

I think I've found a fairly clean way to get the write-through memory buffer optimization with pluggable backends; still working through the other feedback, but hoping to have the patch / RFC out soonish.

@dexonsmith thanks for sharing!
Some initial thoughts since abstracting outputs is something we're starting to care about too...

Thanks for looking.

I think I've found a fairly clean way to get the write-through memory buffer optimization with pluggable backends; still working through the other feedback, but hoping to have the patch / RFC out soonish.

FYI, I posted an RFC here: https://lists.llvm.org/pipermail/cfe-dev/2021-January/067576.html

bruno resigned from this revision. · May 19 2021, 6:14 PM
lamb-j added a subscriber: lamb-j. · Aug 12 2022, 2:13 PM
Herald added a project: Restricted Project. · Aug 12 2022, 2:13 PM