Optimize caching for the NextGen-Core

Metadata
cEP	26
Version	1.0
Title	Optimize caching for the NextGen-Core
Authors	Palash Nigam mailto:npalash25@gmail.com
Status	Implementation Due
Type	Feature

Abstract

This document describes how a caching mechanism (an enhanced version of the one being used by the old core) will be implemented and integrated with the NextGen-Core as a part of this GSoC project.

Introduction

Since the NextGen-Core aims to provide faster speed and better performance for coala-runs on large codebases it needs caching to achieve this kind of performance.

Structure wise the NextGen cache is a dictionary with bear types and cache tables as key-value pairs. The cache tables themselves are dictionary-like-objects that map the persistent_hashs of the task objects to the bear results. This current mechanism will be further enhanced by adding features like FileFactory class to load files, Directory class to interface with the directories, ignoring unmodified directories during successive runs and adding cache control flags for better usability and providing the user various options to perform caching according to a project's needs.

Implementation

The implementation will be divided into four parts

Prototype for FileFactory class implementation.
Prototype for Directory class implementation.
Ignore directories functionality.
Cache control flags.

Prototype of FileFactory class implementation

This class will follow the Factory design pattern and will act as a proxy to dealing with files. This class will have various methods to provide the file contents in different forms like string, binary format, etc. With the help of this class we can replace the actual file contents in our file-dict with these objects which will itself be a performance improvement.

The following is the prototype for FileFactory class.

import linecache
import os

from coala_utils.decorators import generate_eq
from coalib.core.PersistentHash import persistent_hash


@generate_eq('_filename', '_stamp')
class FileFactory:

    def __init__(self, filename):
        self._filename = filename
        self._parent = os.path.abspath(os.path.join(self._filename, os.pardir))
        self._stamp = os.path.getmtime(self._filename)

    def get_line(self, line):
        return linecache.getline(self._filename, line)

    def file_as_list(self):
        self.lines = linecache.getlines(self._filename)

    def file_as_bytes(self):
        with open(self._filename, 'rb') as fp:
            return fp.read()

    def file_as_string(self):
        return self.file_as_bytes().decode()

    @property
    def name(self):
        return self._filename

    @property
    def parent(self):
        return self._parent

    @property
    def timestamp(self):
        return self._stamp

    def hash(self):
        # hash the absolute file path to get a unique hash
        return persistent_hash(self._filename)

Prototype of Directory class implementation

The objects of this class will also reside in the file-set (formerly known as the file-dict). The objects of this class can also be used by DirectoryBears as discussed on this thread #4999 by performing a simple type check on the entries in the file-set, whether the objects are instances of FileFactory or Directory.

Following is the prototype for the Directory class.

import os

from coalib.parsing.Globbing import glob


class Directory:
    def __init__(self, path):
        self.path = path
        self.parent = os.path.abspath(os.path.join(path, os.pardir))
        self.files = 0 # number of files contained in a directory
        self.subdirs = 0 # number of sub-directories contained in a directory

    def count_children(self):
        children = os.listdir(self.path)

        for child in children:
            if os.path.isfile(child):
                self.files += 1

        self.subdirs = len(children) - self.files

    def get_dir(pattern):
        return glob(pattern) # get sub directory matching the given glob pattern

Ignore directories functionality

This will probably be the biggest performance boost for coala. Instead of ignoring only unmodified files on subsequent runs, coala shall also ignore directories with no unchanged files and subdirectories. It can be implemented accordingly:

FileFactory will have an additional attribute parent (the parent directory path) in its __init__ method.
Directory class will also have additional attributes files (number of files) subdirs (number of subdirs) and parent (the parent directory's path; in case of project's root it will be None)
There will be a separate dict (called directory-tracker) containing directory paths as keys and number of files and subdirs as a list of values for e.g.
```
    { Directory.path: [Directory.files, Directory.subdirs] }
```
First all the changed files will be deleted from the cache. If a file is not unmodified then it can be untracked, we will look up its parent directory using FileFactory.parent and decrease the count of the number of files (Directory.files in the directory-tracker dict) corresponding to that path by one.
Once all the modified files have been deleted from the cache coala will move on to the directories in our dict. coala will look for directories with Directory.subdirs not equal to 0 (starting from the bottommost level of the directory tree) if the Directory.files associated with such paths is 0 (i.e. all the files contained inside that directory have been uncached) then the Directory object corresponding to this path will be deleted from the cache.
As soon as a Directory object is uncached the value of the Directory.subdirs of their parent directory will also be decreased by 1.
This process will continue for all the directories and the paths having both the values (Directory.files and Directory.subdirs) as zero will have their corresponding Directory objects deleted from the cache.

Cache control flags

This feature enhancement focuses on both performance and usability. The aim is to provide different caching modes through cache control flags to provide the user the flexibility to tweak the caching mechanism according to their project's needs. The following cache flags can be implemented to provide different caching modes:

--cache-protocol: This will accept three different arguments:
- none: if the user wants to run coala without caching enabled
- primitive to store all the cache entries without removing the unchanged ones on subsequent runs (memory wise quite inefficient for large projects but this will prove to be useful in projects with many incremental builds that need fast linting).
- untrack the opposite of the primitive mode which will stop tracking unchanged files and directories on subsequent coala runs
- lru (last recently used) in which cached items only persist until the next run (a count parameter can also be used to know when to discard an unused cached item). This should be the default argument.
--clear-cache: Similar to --flush-cache.
--cache-compression: This will accept the following arguments
- none: for compression to be disabled.
- lzma or gzip: to compress the cache using these modules.
--cache-optimize: For faster cache loading we can make use of this flag which will utilize pickletools.optimize for faster cache loading.
--export-cache or --import-cache: An API (yet to be designed) to transfer cache across different machines (like the user's local machine and the CI servers) and speed up builds.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cEP-0026.md

cEP-0026.md

Optimize caching for the NextGen-Core

Abstract

Introduction

Implementation

Prototype of FileFactory class implementation

Prototype of Directory class implementation

Ignore directories functionality

Cache control flags

Files

cEP-0026.md

Latest commit

History

cEP-0026.md

File metadata and controls

Optimize caching for the NextGen-Core

Abstract

Introduction

Implementation

Prototype of FileFactory class implementation

Prototype of Directory class implementation

Ignore directories functionality

Cache control flags