
Open Bug 1814509 Opened 2 years ago Updated 2 years ago

[research] revisit LRU disk cache

Categories

(Eliot :: General, task, P2)

Tracking

(Not tracked)

People

(Reporter: willkg, Unassigned)

Details

Eliot downloads symbol files from symbols.mozilla.org, parses them, and then stores a fast-loading binary representation in an on-disk LRU cache. The cache is managed by a disk cache manager that uses inotify to keep track of "least recently used".
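
Roughly, the manager's job is to note each file access and evict the least recently used files once the cache is over budget. Here is a minimal sketch of that idea, assuming the inotify_simple package, a cache path, and a size budget that are not from the actual Eliot code:

    import os
    from collections import OrderedDict

    from inotify_simple import INotify, flags

    CACHE_DIR = "/tmp/symcache"   # assumed cache location
    MAX_BYTES = 40 * 1024 ** 3    # assumed size budget

    inotify = INotify()
    inotify.add_watch(CACHE_DIR, flags.OPEN | flags.CLOSE_WRITE)

    # path -> size, ordered from least to most recently used
    lru = OrderedDict()

    def note_access(path):
        """Move path to the most-recently-used end of the index."""
        if os.path.exists(path):
            lru[path] = os.path.getsize(path)
            lru.move_to_end(path)

    def evict_if_needed():
        """Delete least-recently-used files until we're under budget."""
        while lru and sum(lru.values()) > MAX_BYTES:
            victim, _ = lru.popitem(last=False)
            try:
                os.remove(victim)
            except FileNotFoundError:
                pass

    while True:
        for event in inotify.read(timeout=1000):
            if event.name:
                note_access(os.path.join(CACHE_DIR, event.name))
        evict_if_needed()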

The problem with this is that when a new Eliot instance starts up, the cold cache severely affects symbolication times. A hot cache can yield symbolication times around 0.5s; a cold cache can push them to 30s.

The symcache format is not guaranteed to work from one Symbolic version to the next, so when we update Symbolic versions, we shouldn't use any symcache files from previous Symbolic versions.

It would be better to have a system where new Eliot instances don't start with a cold cache.

From my notes in the Eliot: GCP Migration document.

Also, we won't do anything about this until after the GCP migration is over.

Notes

Currently, Eliot downloads sym files, which can be very large, from AWS S3 and parses them, which takes a long time, before it can do symbol lookups against the sym file data.

To reduce the amount of time this takes, Eliot maintains an on-disk LRU cache of symcache files: a binary representation of the sym file that can be loaded into memory quickly for symbol lookups. This lets us skip the downloading and parsing steps.
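
For illustration, the hot/cold path looks roughly like this. The URL layout, filename scheme, and cache location are assumptions, and the final write stands in for the Symbolic-based conversion; this is not Eliot's actual code:

    import os

    import requests

    CACHE_DIR = "/tmp/symcache"  # assumed cache location

    def get_symcache_path(debug_filename, debug_id):
        cached = os.path.join(CACHE_DIR, f"{debug_filename}_{debug_id}.symcache")
        if os.path.exists(cached):
            # Hot path: the fast-load file is already on disk.
            return cached

        # Cold path: download the (possibly very large) sym file ...
        # (URL and filename derivation simplified; the real layout differs)
        url = f"https://symbols.mozilla.org/{debug_filename}/{debug_id}/{debug_filename}.sym"
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()

        # ... then convert it into the fast-load symcache representation.
        # In the real service this conversion goes through the Symbolic
        # library; writing the raw bytes here is just a placeholder.
        with open(cached, "wb") as fp:
            fp.write(resp.content)
        return cached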

Eliot has a disk cache manager process that manages the LRU cache of files. This allows us to keep "hot files" around to reduce overall symbolication times.

Eliot uses the Symbolic library to parse sym files and generate the symcache file. The symcache file format changes between Symbolic versions. Whenever we update Symbolic, we can't use symcache files generated by previous versions.

When an Eliot container is destroyed (scaling down, new release), the LRU cache of files is destroyed with it.

Thus we have these requirements:

  • we need to be able to store a representation of sym file data that allows us to load it into memory quickly to do lookups
  • we're currently using the Symbolic symcache format for our fast-load representation and it can change between Symbolic versions
  • we don't need this fast-load representation for all sym files--only the ones that are currently being used for symbolication

Option 1: Maintain current architecture

Each Eliot container has its own disk and disk cache manager process dedicated to that container.

Pros:

  • no code changes required

Cons:

  • every time a new Eliot container comes online, it has an empty cache and we definitely notice this in the metrics

Option 2: Switch to ephemeral shared disk for all containers

Switch the architecture such that all Eliot containers for a given release share a disk tied to that release.

This requires some minor code changes:

  • remove disk cache manager from Eliot containers
  • create a dedicated disk cache manager container
  • tie the disk to an Eliot release; when we deploy a new release, we trash the disk
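
A minimal sketch of what "tied to a release" could mean in practice, assuming a shared mount point and an ELIOT_RELEASE environment variable that are not from the actual deployment:

    import os

    SHARED_MOUNT = "/mnt/eliot-cache"                 # assumed shared disk mount
    RELEASE = os.environ.get("ELIOT_RELEASE", "dev")  # assumed release identifier

    # Each release gets its own cache root; a deploy of a new release
    # starts with an empty directory and the old one can be trashed.
    CACHE_ROOT = os.path.join(SHARED_MOUNT, RELEASE)
    os.makedirs(CACHE_ROOT, exist_ok=True)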

Pros:

  • every time a new Eliot container comes online, it can access the disk cache
  • we don't need a large disk for every Eliot container--we would instead have a single large disk for the cluster

Cons:

  • minor code changes required

Option 3: Switch to permanent shared disk for all containers for all releases

Switch the architecture such that all Eliot containers, across releases, share a disk that is permanent to the Eliot service.

This requires some minor code changes:

  • remove disk cache manager from Eliot containers
  • create a dedicated disk cache manager container
  • change Eliot code to include the Symbolic version in symcache filenames so that a new Eliot release that updates the Symbolic library doesn't load old symcache files
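
A sketch of what including the Symbolic version in filenames could look like; reading the installed version via importlib.metadata is one option, and the "symbolic" distribution name and naming scheme are assumptions:

    from importlib.metadata import version

    # Assumed distribution name for the Symbolic Python bindings.
    SYMBOLIC_VERSION = version("symbolic")

    def symcache_filename(debug_filename, debug_id):
        # Old symcache files carry a different version in the name and are
        # simply never matched after a Symbolic upgrade.
        return f"{debug_filename}_{debug_id}_symbolic-{SYMBOLIC_VERSION}.symcache"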

Pros:

  • every time a new Eliot container comes online, it can access the disk cache
  • we don't need a large disk for every Eliot container--we would instead have a single large disk for the cluster

Cons:

  • minor code changes required

Option 3b: Switch to permanent shared disk for all containers for all releases with a retention policy of about one week

Similar to Option 3, but this variant uses GCS or some other data storage service that supports a retention policy. Instead of maintaining an LRU cache, we just keep all symcache files for a week. If a file has expired and is requested again, it gets regenerated.

This allows us to ditch the disk cache manager altogether.
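
For example, with GCS the retention could be a lifecycle rule on the bucket; a sketch using the google-cloud-storage client, with the bucket name as an assumption:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("eliot-symcache")  # assumed bucket name

    # Delete objects older than 7 days; re-requested files get regenerated.
    bucket.add_lifecycle_delete_rule(age=7)
    bucket.patch()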
