Metadata | |
---|---|
cEP | 26 |
Version | 1.0 |
Title | Optimize caching for the NextGen-Core |
Authors | Palash Nigam mailto:[email protected] |
Status | Implementation Due |
Type | Feature |
This document describes how a caching mechanism (an enhanced version of the one used by the old core) will be implemented and integrated with the NextGen-Core as part of this GSoC project.
Since the NextGen-Core aims to speed up coala runs on large codebases, it needs caching to achieve this kind of performance.
Structurally, the NextGen cache is a dictionary with bear types and cache
tables as key-value pairs. The cache tables themselves are
dictionary-like objects that map the `persistent_hash` values of the task
objects to the bear results. This mechanism will be further enhanced by
adding a `FileFactory` class to load files and a `Directory` class to
interface with directories, by ignoring unmodified directories during
successive runs, and by adding cache control flags that give the user various
options to perform caching according to a project's needs.
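The two-level layout described above can be sketched as follows. This is only an illustration under stated assumptions: `stand_in_hash` is a hypothetical helper (pickle plus SHA-1) used in place of coala's `persistent_hash`, plain dicts play the role of the cache tables, and `LineCountBear`, `run_cached` and `count_lines` are made-up names.

```python
import hashlib
import pickle


def stand_in_hash(obj):
    # Hypothetical stand-in for coala's persistent_hash: hash the pickled
    # object so the digest does not depend on the interpreter session.
    return hashlib.sha1(pickle.dumps(obj, protocol=4)).digest()


class LineCountBear:
    # Illustrative bear type, not part of coala.
    pass


# Outer dict: bear type -> cache table.
# Cache table: hash of the task arguments -> bear results.
cache = {}


def run_cached(bear_type, task_args, compute):
    table = cache.setdefault(bear_type, {})
    key = stand_in_hash(task_args)
    if key not in table:
        # Cache miss: run the analysis and remember its results.
        table[key] = compute(*task_args)
    return table[key]


calls = []


def count_lines(filename, content):
    calls.append(filename)
    return [len(content.splitlines())]


task = ('a.py', 'x = 1\ny = 2\n')
first = run_cached(LineCountBear, task, count_lines)
second = run_cached(LineCountBear, task, count_lines)  # served from cache
```

The second call finds the task hash in the table of `LineCountBear`, so `count_lines` runs only once.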
The implementation will be divided into four parts:
- Prototype for `FileFactory` class implementation.
- Prototype for `Directory` class implementation.
- Ignore directories functionality.
- Cache control flags.
The `FileFactory` class will follow the Factory design pattern and will act as a proxy for dealing with files. It will have various methods to provide the file contents in different forms such as string, binary format, etc. With the help of this class we can replace the actual file contents in our file-dict with these objects, which will itself be a performance improvement.
The following is the prototype for the `FileFactory` class.

    import linecache
    import os

    from coala_utils.decorators import generate_eq
    from coalib.core.PersistentHash import persistent_hash


    @generate_eq('_filename', '_stamp')
    class FileFactory:

        def __init__(self, filename):
            self._filename = filename
            self._parent = os.path.abspath(os.path.join(self._filename, os.pardir))
            self._stamp = os.path.getmtime(self._filename)

        def get_line(self, line):
            return linecache.getline(self._filename, line)

        def file_as_list(self):
            # return the lines so callers can use them directly
            return linecache.getlines(self._filename)

        def file_as_bytes(self):
            with open(self._filename, 'rb') as fp:
                return fp.read()

        def file_as_string(self):
            return self.file_as_bytes().decode()

        @property
        def name(self):
            return self._filename

        @property
        def parent(self):
            return self._parent

        @property
        def timestamp(self):
            return self._stamp

        def hash(self):
            # hash the absolute file path to get a unique hash
            return persistent_hash(self._filename)

The objects of this class will also reside in the file-set (formerly known as
the file-dict). They can also be used by DirectoryBears, as discussed in
thread #4999, by performing a simple type check on the entries in the file-set
to see whether the objects are instances of `FileFactory` or `Directory`.
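A minimal sketch of such a type check, assuming the file-set is a plain list. The two classes here are stripped-down stand-ins for the prototypes, and `split_file_set` is a hypothetical helper, not an existing coala function.

```python
class FileFactory:
    # Stand-in for the FileFactory prototype; only the name is kept.
    def __init__(self, filename):
        self.name = filename


class Directory:
    # Stand-in for the Directory prototype; only the path is kept.
    def __init__(self, path):
        self.path = path


def split_file_set(file_set):
    """Separate file entries from directory entries by their type."""
    files = [entry for entry in file_set if isinstance(entry, FileFactory)]
    dirs = [entry for entry in file_set if isinstance(entry, Directory)]
    return files, dirs


file_set = [FileFactory('src/a.py'), Directory('src'), FileFactory('b.py')]
files, dirs = split_file_set(file_set)
```

A DirectoryBear would then only receive the entries in `dirs`, while file-based bears keep working on `files`.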
The following is the prototype for the `Directory` class.

    import os

    from coalib.parsing.Globbing import glob


    class Directory:

        def __init__(self, path):
            self.path = path
            self.parent = os.path.abspath(os.path.join(path, os.pardir))
            self.files = 0    # number of files contained in the directory
            self.subdirs = 0  # number of sub-directories contained in the directory

        def count_children(self):
            children = os.listdir(self.path)
            # reset the counter so repeated calls do not accumulate
            self.files = 0
            for child in children:
                # os.listdir returns bare names, so join them with the path
                if os.path.isfile(os.path.join(self.path, child)):
                    self.files += 1
            self.subdirs = len(children) - self.files

        def get_dir(self, pattern):
            # get sub-directories matching the given glob pattern
            return glob(pattern)

Ignoring unmodified directories will probably be the biggest performance boost for coala. Instead of ignoring only unmodified files on subsequent runs, coala shall also ignore directories that contain no changed files or subdirectories. It can be implemented as follows:
- `FileFactory` will have an additional attribute `parent` (the parent directory path) set in its `__init__` method.
- The `Directory` class will also have the additional attributes `files` (number of files), `subdirs` (number of subdirs) and `parent` (the parent directory's path; for the project's root it will be `None`).
- There will be a separate dict (called directory-tracker) containing directory paths as keys and the number of files and subdirs as a list of values, e.g. `{ Directory.path: [Directory.files, Directory.subdirs] }`.
- First, all the changed files will be deleted from the cache. If a file has been modified it can be untracked: we will look up its parent directory using `FileFactory.parent` and decrease the count of the number of files (`Directory.files` in the directory-tracker dict) corresponding to that path by one.
- Once all the modified files have been deleted from the cache, coala will move on to the directories in the dict. Starting from the bottommost level of the directory tree, coala will look for directories with `Directory.subdirs` equal to 0; if the `Directory.files` associated with such a path is also 0 (i.e. all the files contained inside that directory have been uncached), then the `Directory` object corresponding to this path will be deleted from the cache.
- As soon as a `Directory` object is uncached, the value of `Directory.subdirs` of its parent directory will also be decreased by 1.
- This process will continue for all the directories, and the paths having both values (`Directory.files` and `Directory.subdirs`) as zero will have their corresponding `Directory` objects deleted from the cache.
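The bookkeeping in the steps above can be sketched as follows. This is a hedged illustration, not the planned implementation: `uncache` is a hypothetical function, a flat `parents` dict stands in for the `parent` attributes, paths are assumed to use `/` as separator, and a directory is dropped once both of its counters have reached zero.

```python
def uncache(modified_files, tracker, parents):
    """Return the directories dropped from the directory-tracker.

    modified_files: the parent directory path of each modified file.
    tracker: {dir_path: [n_files, n_subdirs]}, mutated in place.
    parents: {dir_path: parent_path, or None for the project root}.
    """
    # Step 1: every modified file is uncached, so its parent loses a file.
    for parent in modified_files:
        tracker[parent][0] -= 1

    # Step 2: visit directories deepest-first, so a child is always
    # handled before its parent is examined.
    removed = []
    for path in sorted(tracker, key=lambda p: p.count('/'), reverse=True):
        files, subdirs = tracker[path]
        if files == 0 and subdirs == 0:
            del tracker[path]
            removed.append(path)
            parent = parents[path]
            if parent is not None:
                # The parent now tracks one sub-directory less.
                tracker[parent][1] -= 1
    return removed


tracker = {
    'proj': [1, 2],
    'proj/src': [1, 1],
    'proj/src/pkg': [2, 0],
    'proj/docs': [1, 0],
}
parents = {
    'proj': None,
    'proj/src': 'proj',
    'proj/src/pkg': 'proj/src',
    'proj/docs': 'proj',
}
# Both files in proj/src/pkg and the single file in proj/src were modified.
removed = uncache(['proj/src/pkg', 'proj/src/pkg', 'proj/src'],
                  tracker, parents)
```

In this example `proj/src/pkg` is dropped first, which brings the sub-directory count of `proj/src` to zero so that it is dropped as well; `proj` and `proj/docs` stay tracked because each still holds a file.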
This feature enhancement focuses on both performance and usability. The aim is to offer different caching modes through cache control flags, giving users the flexibility to tweak the caching mechanism according to their project's needs. The following cache flags can be implemented to provide different caching modes:
- `--cache-protocol`: This will accept the following arguments:
  - `none`: if the user wants to run coala without caching enabled.
  - `primitive`: to store all the cache entries without removing the unchanged ones on subsequent runs (memory-wise quite inefficient for large projects, but useful in projects with many incremental builds that need fast linting).
  - `untrack`: the opposite of the `primitive` mode, which will stop tracking unchanged files and directories on subsequent coala runs.
  - `lru` (least recently used): cached items only persist until the next run (a count parameter can also be used to know when to discard an unused cached item). This should be the default argument.
- `--clear-cache`: Similar to `--flush-cache`.
- `--cache-compression`: This will accept the following arguments:
  - `none`: for compression to be disabled.
  - `lzma` or `gzip`: to compress the cache using these modules.
- `--cache-optimize`: This flag will make use of `pickletools.optimize` for faster cache loading.
- `--export-cache` or `--import-cache`: An API (yet to be designed) to transfer the cache across different machines (like the user's local machine and the CI servers) and speed up builds.
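The flags above could be wired up roughly as follows. This is a sketch under assumptions: the flag names and choices come from the list, but `build_parser`, `dump_cache`, `load_cache`, the `COMPRESSORS` table and every default other than `lru` are hypothetical, and the cache is assumed to be stored with `pickle`.

```python
import argparse
import gzip
import lzma
import pickle
import pickletools


def build_parser():
    # Only the flags that take part in this sketch are declared.
    parser = argparse.ArgumentParser()
    parser.add_argument('--cache-protocol',
                        choices=('none', 'primitive', 'untrack', 'lru'),
                        default='lru')
    parser.add_argument('--cache-compression',
                        choices=('none', 'lzma', 'gzip'), default='none')
    parser.add_argument('--cache-optimize', action='store_true')
    return parser


# Compression mode -> (compress, decompress) pair.
COMPRESSORS = {
    'none': (lambda data: data, lambda data: data),
    'lzma': (lzma.compress, lzma.decompress),
    'gzip': (gzip.compress, gzip.decompress),
}


def dump_cache(cache, compression='none', optimize=False):
    data = pickle.dumps(cache)
    if optimize:
        # pickletools.optimize drops unused PUT opcodes and returns an
        # equivalent, usually shorter, pickle string.
        data = pickletools.optimize(data)
    compress, _ = COMPRESSORS[compression]
    return compress(data)


def load_cache(blob, compression='none'):
    _, decompress = COMPRESSORS[compression]
    return pickle.loads(decompress(blob))


args = build_parser().parse_args(
    ['--cache-compression', 'lzma', '--cache-optimize'])
cache = {'SomeBear': {b'task-hash': ['result']}}
blob = dump_cache(cache, args.cache_compression, args.cache_optimize)
restored = load_cache(blob, args.cache_compression)
```

Keeping the compressor pairs in one table means a further compression mode only needs one extra entry, and the same mode string is used for writing and reading the cache.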