3. API Documentation

3.1. Main Module

class pyvcsshark.main.Application(config)[source]

Main application class. Contains the most important process logic. The main application consists of the following steps:

1. The datastore chosen by the user (a subclass of pyvcsshark.datastores.basestore.BaseStore) is located and instantiated.

2. The correct parser (a subclass of pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated.

3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the selected parser).

4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the different configuration parameters and values from the parser (more precisely: the implementation provided by the selected datastore).

5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start the parsing process of the repository (more precisely: the implementation provided by the selected parser).

6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process, e.g. closing files (more precisely: the implementation provided by the selected parser).

7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process, e.g. closing connections (more precisely: the implementation provided by the selected datastore).

Parameters: config – An instance of Config, which contains the configuration parameters
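The call order in steps 3 to 7 can be sketched with small stand-in classes. This is only an illustration under stated assumptions: RecordingParser, RecordingStore, run_application and the dictionary-based config are hypothetical names invented for this sketch, not part of pyvcsshark, and steps 1 and 2 (locating the datastore and parser) are assumed to have happened already.

```python
class RecordingParser:
    """Hypothetical stand-in for a BaseParser subclass."""

    repository_type = "git"

    def __init__(self, log):
        self.log = log

    def initialize(self):
        self.log.append("parser.initialize")

    def get_project_url(self):
        return "https://example.org/repo.git"

    def parse(self, repository_path, datastore, cores_per_job):
        self.log.append("parser.parse")

    def finalize(self):
        self.log.append("parser.finalize")


class RecordingStore:
    """Hypothetical stand-in for a BaseStore subclass."""

    def __init__(self, log):
        self.log = log

    def initialize(self, config, repository_url, repository_type):
        self.log.append("store.initialize")

    def finalize(self):
        self.log.append("store.finalize")


def run_application(config, parser, datastore):
    """Mirror steps 3 to 7 of the list above (config is a plain dict here)."""
    parser.initialize()
    # Values from the parser are handed to the datastore (step 4).
    datastore.initialize(config, parser.get_project_url(), parser.repository_type)
    parser.parse(config["path"], datastore, config["cores_per_job"])
    parser.finalize()
    datastore.finalize()


log = []
run_application(
    {"path": "/tmp/repo", "cores_per_job": 1},
    RecordingParser(log),
    RecordingStore(log),
)
```

After the run, log holds the five calls in the order given by the step list.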

3.2. Application

class pyvcsshark.Application(config)[source]

Main application class. Contains the most important process logic. The main application consists of the following steps:

1. The datastore chosen by the user (a subclass of pyvcsshark.datastores.basestore.BaseStore) is located and instantiated.

2. The correct parser (a subclass of pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated.

3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the selected parser).

4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the different configuration parameters and values from the parser (more precisely: the implementation provided by the selected datastore).

5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start the parsing process of the repository (more precisely: the implementation provided by the selected parser).

6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process, e.g. closing files (more precisely: the implementation provided by the selected parser).

7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process, e.g. closing connections (more precisely: the implementation provided by the selected datastore).

Parameters: config – An instance of Config, which contains the configuration parameters

3.3. Configuration and Misc

3.3.1. Configuration

class pyvcsshark.Config(args)[source]

Holds configuration information

Parameters: args – argumentparser of the class argparse.ArgumentParser

3.3.2. Utils

pyvcsshark.utils.find_plugins(plugin_dir)[source]

Finds all Python files in the specified path and imports them. This is needed to detect automatically which datastores and parsers are available.

Parameters: plugin_dir – path to the plugin directory
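One way to import every Python file in a directory is sketched below with importlib. It is an assumption-laden illustration, not the library's implementation: import_python_files is a hypothetical name, the sketch does not recurse into subdirectories, and it skips __init__.py, none of which is confirmed by the description above.

```python
import importlib.util
import os
import tempfile


def import_python_files(plugin_dir):
    """Import each *.py file directly inside plugin_dir; return name -> module."""
    modules = {}
    for file_name in sorted(os.listdir(plugin_dir)):
        if not file_name.endswith(".py") or file_name == "__init__.py":
            continue
        module_name = file_name[:-3]
        spec = importlib.util.spec_from_file_location(
            module_name, os.path.join(plugin_dir, file_name)
        )
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
        modules[module_name] = module
    return modules


# Demo with a throwaway plugin directory and a hypothetical plugin file.
with tempfile.TemporaryDirectory() as plugin_dir:
    with open(os.path.join(plugin_dir, "examplestore.py"), "w") as handle:
        handle.write("STORE_IDENTIFIER = 'example'\n")
    loaded = import_python_files(plugin_dir)
```

Importing the files is what makes their classes visible to later lookups, for example through the subclasses of an abstract base class.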

pyvcsshark.utils.get_immediate_subdirectories(a_dir)[source]

Helper method that gets the immediate subdirectories of a path. This is helpful if one wants to create a parser that checks whether certain folders exist.

Parameters: a_dir – directory from which immediate subdirectories should be listed

pyvcsshark.utils.readable_dir(prospective_dir)[source]

Function that checks whether a path exists, is a directory, and is accessible. It returns true only if all three conditions hold.

Parameters: prospective_dir – path to the directory
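The two directory helpers above can be approximated with the standard library. The sketch below uses hypothetical names (list_immediate_subdirectories, is_readable_dir) and assumes that "accessible" means readable by the current user; the real functions may differ in return types and error handling.

```python
import os
import tempfile


def list_immediate_subdirectories(a_dir):
    """Return the names of the directories directly inside a_dir, sorted."""
    return sorted(entry.name for entry in os.scandir(a_dir) if entry.is_dir())


def is_readable_dir(prospective_dir):
    """True only if the path exists, is a directory, and is readable."""
    # os.path.isdir() is already False for paths that do not exist.
    return os.path.isdir(prospective_dir) and os.access(prospective_dir, os.R_OK)


# Demo on a throwaway directory that looks like a small git checkout.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, ".git"))
    os.mkdir(os.path.join(root, "src"))
    open(os.path.join(root, "README"), "w").close()
    subdirs = list_immediate_subdirectories(root)
    root_is_readable = is_readable_dir(root)
    file_is_readable_dir = is_readable_dir(os.path.join(root, "README"))
    missing_is_readable_dir = is_readable_dir(os.path.join(root, "missing"))
```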

3.4. Datastores

3.4.1. BaseDatastore

class pyvcsshark.datastores.basestore.BaseStore[source]

Abstract class for the datastores. One must inherit from this class and implement the methods to create a new datastore.

Based on Python's abc module.

Property projectName: name of the project, which should be stored
Property projectURL: url of the repository of the project, which should be stored
Property repositoryType: type of the repository of the project, which should be stored
Parameters: metaclass – name of the abstract metaclass

Note

If you want to use a logger for your implementation of a datastore you can write:

logger = logging.getLogger("store") 

to get the logger.

Note

It is possible to include datastores that are not databases such as MongoDB or MySQL. However, you should make sure that your datastore implementation correctly handles the values that are passed to it.

abstract add_commit(commit_model)[source]

Add the commit to the datastore. How this is handled depends on the implementation.

Parameters: commit_model – instance of CommitModel, which includes all important information about the commit

Warning

The commits received here are not sorted. Furthermore, they must be processed right away or stored in a SimpleQueue. Storing them in a normal list or dictionary is not possible, as some parsers (e.g. GitParser) use multiprocessing to add the commits.

abstract finalize()[source]: Is called in the end to finalize the datastore (e.g. closing files or connections)

static find_correct_datastore(datastore_identifier)[source]

Finds the correct datastore by looking at the store_identifier property of each datastore

Parameters: datastore_identifier – string that represents the correct datastore (e.g. mongo)

abstract initialize(config, repository_url, repository_type)[source]

Initializes the datastore

Parameters

config – all configuration
repository_url – url of the repository, which is to be analyzed
repository_type – type of the repository, which is to be analyzed (e.g. “git”)

abstract property store_identifier: Must return a string identifier for the datastore (e.g. mongo)
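A custom datastore built on this interface might look like the sketch below. Because pyvcsshark is not imported here, BaseStoreSketch is a stand-in that only mirrors the member names documented above; LineStore, the "lines" identifier, the "sink" config key and the dict used in place of a CommitModel are all hypothetical. The lookup through __subclasses__() is an assumption about how find_correct_datastore could work, not a statement about the real implementation.

```python
import abc
import io


class BaseStoreSketch(abc.ABC):
    """Stand-in that mirrors the abstract members documented above."""

    @property
    @abc.abstractmethod
    def store_identifier(self):
        """Short string that names the datastore."""

    @abc.abstractmethod
    def initialize(self, config, repository_url, repository_type):
        """Prepare the datastore."""

    @abc.abstractmethod
    def add_commit(self, commit_model):
        """Handle one commit."""

    @abc.abstractmethod
    def finalize(self):
        """Release resources."""

    @staticmethod
    def find_correct_datastore(datastore_identifier):
        # Assumption: candidates are the direct subclasses of the base class.
        for store_class in BaseStoreSketch.__subclasses__():
            candidate = store_class()
            if candidate.store_identifier == datastore_identifier:
                return candidate
        return None


class LineStore(BaseStoreSketch):
    """Hypothetical datastore that writes one line per commit right away."""

    @property
    def store_identifier(self):
        return "lines"

    def initialize(self, config, repository_url, repository_type):
        self.sink = config["sink"]
        self.sink.write("# {} ({})\n".format(repository_url, repository_type))

    def add_commit(self, commit_model):
        # Processed immediately; nothing is buffered in a list or dict.
        self.sink.write(commit_model["id"] + "\n")

    def finalize(self):
        self.sink.flush()


sink = io.StringIO()
store = BaseStoreSketch.find_correct_datastore("lines")
store.initialize({"sink": sink}, "https://example.org/repo.git", "git")
store.add_commit({"id": "abc123"})
store.finalize()
```

The in-memory sink keeps the sketch self-contained. A datastore that is fed from several processes would need a sink that is safe to share between them.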

3.4.2. MongoStore

3.4.2.1. API

class pyvcsshark.datastores.mongostore.MongoStore[source]

Datastore implementation for saving data to the mongodb. Inherits from pyvcsshark.datastores.basestore.BaseStore.

Property commit_queue: instance of a multiprocessing.JoinableQueue, which holds objects of pyvcsshark.dbmodels.models.CommitModel, that should be put into the mongodb
Property logger: holds the logging instance, obtained by calling logging.getLogger("store")

add_branch(branch_model)[source]: Add branch to extra queue

add_commit(commit_model)[source]: Adds commits of class pyvcsshark.dbmodels.models.CommitModel to the commitqueue

finalize()[source]: As we depend on commits beeing finished with branches (for the references) we must wait first for them to finish before we can start our branch processing.

initialize(config, repository_url, repository_type)[source]

Initializes the mongostore by connecting to the mongodb, creating the project in the project collection, and setting up processes (see: pyvcsshark.datastores.mongostore.CommitStorageProcess), which read commits out of the commitqueue, process them, and store them into the mongodb.

Parameters

config – all configuration
repository_url – url of the repository, which is to be analyzed
repository_type – type of the repository, which is to be analyzed (e.g. “git”)

property store_identifier: Returns the identifier mongo for this datastore

class pyvcsshark.datastores.mongostore.CommitStorageProcess(queue, vcs_system_id, last_commit_date, config, name)[source]

Class that inherits from multiprocessing.Process for processing instances of class pyvcsshark.dbmodels.models.CommitModel and writing it into the mongodb

Parameters

queue – queue, where the pyvcsshark.dbmodels.models.CommitModel are stored in
vcs_system_id – object id of class bson.objectid.ObjectId from the vcs system
last_commit_date – object of class datetime.datetime, which holds the date of the last commit that was parsed
config – object of class pyvcsshark.config.Config, which holds configuration information

create_branch_list(branches)[source]

Creates a list of the names of the branches to which a commit belongs. We go through the branches property of the class pyvcsshark.dbmodels.models.CommitModel, which is a list of branch objects of class pyvcsshark.dbmodels.models.BranchModel

Parameters: branches – list of objects of class pyvcsshark.dbmodels.models.BranchModel

create_file_actions(files, mongo_commit_id)[source]

Creates a list of object ids of type bson.objectid.ObjectId for the different file actions of the commit by transforming the files into file actions of type FileAction, File, and Hunk (pycoshark library)

Parameters

files – list of changed files of type pyvcsshark.dbmodels.models.FileModel
mongo_commit_id – mongoid of the commit which is processed

Note

Hunks and the file action itself are inserted via bulk insert.

create_people(name, email)[source]

Creates a people object of type People (which can be found in the pycoshark library) and returns an object id of type bson.objectid.ObjectId for the stored object

Parameters

name – name of the contributor
email – email of the contributor

Note

The call to mongoengine.queryset.QuerySet.upsert_one() is thread/process safe

run()[source]

Endless loop for the processes, which consists of several steps:

1. Get an object of class pyvcsshark.dbmodels.models.CommitModel from the queue
2. Check if this commit was stored before and, if so, update branches and tags (if they have changed)
3. Store author and committer in mongodb
4. Store tags in mongodb
5. Create a list of the branches to which the commit belongs
6. Save the different file actions that were done in this commit in the mongodb
7. Save the commit itself

Note

The committer date is used to check whether a commit was already stored before. That is, we get the last commit out of the database and check whether the committer date of each commit we process is later than the committer date of that last commit.

Warning

We only look for changed tags and branches here for already processed commits!
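The committer-date check from the note above can be sketched as a plain comparison. The function name was_stored_before is hypothetical, and treating a missing last commit date as "nothing stored yet" is an assumption made for this sketch, not documented behaviour.

```python
from datetime import datetime


def was_stored_before(committer_date, last_commit_date):
    """Return True if the commit is not later than the last stored commit."""
    if last_commit_date is None:
        # Assumption: no commit has been stored yet, so everything is new.
        return False
    return committer_date <= last_commit_date


last_stored = datetime(2020, 5, 1, 12, 0)
older_commit = was_stored_before(datetime(2020, 4, 30, 9, 0), last_stored)
newer_commit = was_stored_before(datetime(2020, 5, 2, 9, 0), last_stored)
first_run = was_stored_before(datetime(2020, 5, 2, 9, 0), None)
```

Under this rule, only commits for which the function returns True would go through the branch and tag update path mentioned in the warning.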

3.5. Models

class pyvcsshark.parser.models.CommitModel(id, branches=[], tags=[], parents=[], author=None, committer=None, message=None, changedFiles=[], authorDate=None, authorOffset=None, committerDate=None, committerOffset=None)[source]

Model that represents a commit to a repository

Parameters

id – id of the commit (e.g. a revision hash)
branches – set of branches to which the commit belongs
tags – list of tags of type pyvcsshark.dbmodels.models.TagModel
parents – list of strings, which contains the parent ids of the commit
author – author of the commit. Must be of type pyvcsshark.dbmodels.models.PeopleModel
committer – committer of the commit. Must be of type pyvcsshark.dbmodels.models.PeopleModel
message – string of the commit message
changedFiles – list of files of type pyvcsshark.dbmodels.models.FileModel
authorDate – date of the creation of the change of the commit (must be a UNIX timestamp)
authorOffset – offset for the authordate (timezone)
committerDate – date of the commit (must be a UNIX timestamp)
committerOffset – offset for the committerdate (timezone)

Note

If your parser does not provide all of this information, use the default values for the missing parameters.
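Since authorDate and committerDate must be UNIX timestamps with separate timezone offsets, a parser that starts from timezone-aware datetime objects needs a small conversion. The sketch below is one way to do it; the helper name to_timestamp_and_offset is hypothetical, and expressing the offset in minutes is an assumption, because the unit is not stated above.

```python
from datetime import datetime, timedelta, timezone


def to_timestamp_and_offset(moment):
    """Split an aware datetime into (UNIX timestamp, UTC offset in minutes)."""
    if moment.tzinfo is None:
        raise ValueError("an aware datetime is required")
    # Assumption: the offset is expressed in minutes east of UTC.
    offset_minutes = int(moment.utcoffset().total_seconds() // 60)
    return int(moment.timestamp()), offset_minutes


# 12:00 at UTC+02:00 is 10:00 UTC on the same day.
author_moment = datetime(2020, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
author_date, author_offset = to_timestamp_and_offset(author_moment)
```

The two values could then be passed as authorDate and authorOffset (or committerDate and committerOffset) when the commit model is created.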

class pyvcsshark.parser.models.FileModel(path, size=None, linesAdded=None, linesDeleted=None, isBinary=None, mode=None, hunks=[], oldPath=None, parent_revision_hash=None)[source]

Model that holds the changes of the files.

Parameters

path – path to the file that was changed
size – size of the file that was changed
linesAdded – count of how many lines were added to the file
linesDeleted – count of how many lines were deleted
isBinary – boolean, which is true if the file is a binary file
mode – mode of the file action (e.g. “A” for file was added)
hunks – list of hunks for the file
oldPath – old path to the file, which only exists if the file was copied or moved
parent_revision_hash – hash of the parent commit

class pyvcsshark.parser.models.Hunk(new_start, new_lines, old_start, old_lines, content)[source]

class pyvcsshark.parser.models.TagModel(name, message=None, tagger=None, taggerDate=None, taggerOffset=None)[source]

Model that holds the information for the different tags.

Parameters

name – name of the tag
message – message of the tag
tagger – creator of the tag. Must be of type pyvcsshark.dbmodels.models.PeopleModel.
taggerDate – date of the creation of the tag. Must be a UNIX timestamp.
taggerOffset – offset for taggerdate (timezone)

class pyvcsshark.parser.models.BranchModel(name)[source]

Model which holds the branch information.

Parameters: name – name of the branch

class pyvcsshark.parser.models.PeopleModel(name=None, email=None)[source]

Model that holds the people information.

Parameters

name – name of the person
email – email of the person

3.6. Parser

3.6.1. BaseParser

class pyvcsshark.parser.baseparser.BaseParser[source]

Abstract class for the parsers. One must inherit from this class and implement the methods to create a new repository parser.

Based on Python's abc module.

Parameters: metaclass – name of the abstract metaclass

Note

If you want to use a logger for your implementation of a parser you can write:

logger = logging.getLogger("parser") 

to get the logger.

abstract detect(repository_path)[source]

Return true if the parser is applicable to the repository

Parameters: repository_path – path to the repository

abstract finalize()[source]: Finalization process for parser

static find_correct_parser(repository_path)[source]

Finds the correct parser by executing the parser.detect() method on the given repository path

Parameters: repository_path – path to the repository

abstract get_project_url()[source]: Retrieves the project url from the repository. This needs to be defined here, as only the parser is specific to the repository type

abstract initialize()[source]: Initialization process for parser

abstract parse(repository_path, datastore, cores_per_job)[source]

Parses the repository

Parameters

repository_path – path to the repository
datastore – subclass of pyvcsshark.datastores.basestore.BaseStore.
cores_per_job – number of cores used for parsing

Note

We must call the pyvcsshark.datastores.basestore.BaseStore.add_commit() function in the parsing process if we want to add commits to the datastore

abstract property repository_type: Must return the type for the given repository. E.g. git
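Parser selection through detect() can be sketched as follows. BaseParserSketch is a stand-in that mirrors only part of the interface documented above, and DotGitParser is a hypothetical parser that detects a repository by looking for a .git folder. Iterating over __subclasses__() is an assumption about how find_correct_parser might locate candidates.

```python
import abc
import os
import tempfile


class BaseParserSketch(abc.ABC):
    """Stand-in that mirrors part of the abstract parser interface."""

    @property
    @abc.abstractmethod
    def repository_type(self):
        """Type of the repository, e.g. git."""

    @abc.abstractmethod
    def detect(self, repository_path):
        """Return True if this parser applies to the repository."""

    @staticmethod
    def find_correct_parser(repository_path):
        # Assumption: candidates are the direct subclasses of the base class.
        for parser_class in BaseParserSketch.__subclasses__():
            parser = parser_class()
            if parser.detect(repository_path):
                return parser
        return None


class DotGitParser(BaseParserSketch):
    """Hypothetical parser that looks for a .git folder."""

    @property
    def repository_type(self):
        return "git"

    def detect(self, repository_path):
        return os.path.isdir(os.path.join(repository_path, ".git"))


with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as other:
    os.mkdir(os.path.join(repo, ".git"))
    found = BaseParserSketch.find_correct_parser(repo)
    not_found = BaseParserSketch.find_correct_parser(other)
```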

3.6.2. GitParser

class pyvcsshark.parser.gitparser.GitParser[source]

Parser for git repositories. The general parsing process is described in pyvcsshark.parser.gitparser.GitParser.parse().

Property SIMILARITY_THRESHOLD: sets the threshold for deciding if a file is similar to another. Default: 50%

multiprocessing.cpu_count().
Property repository: object of class pygit2.Repository, which represents the repository
Property commits_to_be_processed: dictionary that is set up the following way: commits_to_be_processed = {'<revisionHash>': {'branches': set(), 'tags': []}}, where <revisionHash> must be replaced with the actual hash. Therefore, this dictionary holds information about every revision and which branches this revision belongs to and which tags it has.
Property logger: logger, which is acquired via logging.getLogger("parser")
Property datastore: datastore, where the commits should be saved to
Property commit_queue: object of class multiprocessing.JoinableQueue, where commits are stored in that can be parsed

add_branch(commit_hash, branch)[source]

Does two things: First, it adds the commit hash to the commitqueue, so that the parsing processes can process this commit. Second, it creates an object of type pyvcsshark.parser.models.BranchModel and stores it in the dictionary.

Parameters

commit_hash – revision hash of the commit to be processed
branch – branch that should be added for the commit

add_tag(tagged_commit, tag_name, tag_object)[source]

Creates objects of type pyvcsshark.parser.models.TagModel and stores it in the dictionary mentioned above.

Parameters

tagged_commit – revision hash of the commit to be processed
tag_name – name of the tag that should be added
tag_object – in git it is possible to annotate tags. If a tag is annotated, we get a tag object of class pygit2.Tag

Note

It can happen that people committed to a tag and thereby created a “tag-branch”, which is normally not possible in git. Therefore, we go through all tags and check whether they correspond to a commit that is already in the dictionary. If yes, we tag that commit; if no, we ignore it.
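The bookkeeping in the commits_to_be_processed dictionary, including the rule that tags on unknown commits are ignored, can be sketched as below. These are free functions with a hypothetical signature (the dictionary is passed in explicitly), and plain strings stand in for the BranchModel and TagModel objects that the real parser stores.

```python
def add_branch(commits_to_be_processed, commit_hash, branch):
    """Register the commit if needed and record one of its branches."""
    entry = commits_to_be_processed.setdefault(
        commit_hash, {"branches": set(), "tags": []}
    )
    entry["branches"].add(branch)


def add_tag(commits_to_be_processed, tagged_commit, tag_name):
    """Attach the tag only if the commit is already known; report success."""
    if tagged_commit not in commits_to_be_processed:
        return False  # commit only reachable through the tag: ignored
    commits_to_be_processed[tagged_commit]["tags"].append(tag_name)
    return True


commits = {}
add_branch(commits, "a1b2", "master")
add_branch(commits, "a1b2", "develop")
tagged = add_tag(commits, "a1b2", "v1.0")
ignored = add_tag(commits, "ffff", "orphan-tag")
```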

detect(repository_path)[source]: Try to detect the repository, if its not there an exception is raised and therfore false can be returned

finalize()[source]: Finalization process for parser

get_project_url()[source]: Returns the url of the project, which is processed

initialize()[source]: Initializes the parser. It gets all the branch and tag information and puts it into two different locations: First the commit id is put into the commitqueue for the processing with the parsing processes. Second a dictionary is created, which holds the information of which branches a commit is on and which tags it has

parse(repository_path, datastore, cores_per_job)[source]

Parses the repository, which is located at the repository_path, and saves the parsed commits in the datastore by calling the pyvcsshark.datastores.basestore.BaseStore.add_commit() method of the chosen datastore. It mostly uses pygit2 (see: http://www.pygit2.org/).

The parsing process is divided into several steps:

1. A list of all branches and tags is created.

2. All branches and tags are parsed. That is, we create a dictionary of all commits with their corresponding tags and branches and add all revision hashes to the commitqueue.

3. The poison pills for terminating the parsing processes are added to the commit_queue.

4. Processes of class pyvcsshark.parser.gitparser.CommitParserProcess are created, which parse all commits.

Parameters

repository_path – Path to the repository
datastore – Datastore used to save the data to

property repository_type: Must return the type for the given repository. E.g. git

class pyvcsshark.parser.gitparser.CommitParserProcess(queue, commits_to_be_processed, repository, datastore, lock)[source]

A process, which inherits from multiprocessing.Process, that parses the commits it gets from the queue and calls the pyvcsshark.datastores.basestore.BaseStore.add_commit() function to add the commits

Property logger

logger acquired by calling logging.getLogger("parser")

Parameters

queue – queue, where the different commithashes are stored in
commits_to_be_processed – dictionary, which contains information about the branches and tags of each commit
repository – repository object of type pygit2.Repository
datastore – object, that is a subclass of pyvcsshark.datastores.basestore.BaseStore
lock – lock that is used so that only one process at a time calls the pyvcsshark.datastores.basestore.BaseStore.add_commit() function

create_hunks(hunks, initial_commit=False)[source]

Creates the diff in the unified format (see: https://en.wikipedia.org/wiki/Diff#Unified_format)

If we have the initial commit, we need to reverse the hunk.* attributes.

Parameters

hunks – list of objects of class pygit2.DiffHunk
initial_commit – indicates if we have an initial commit
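For orientation, the header line of a hunk in the unified format can be built from the four position fields of the Hunk model. The sketch below uses a dataclass stand-in (HunkSketch) with the constructor argument order documented for pyvcsshark.parser.models.Hunk; unified_header is a hypothetical helper, and whether the stored content already includes this header is not stated in this documentation.

```python
from dataclasses import dataclass


@dataclass
class HunkSketch:
    """Stand-in with the same fields as the documented Hunk model."""

    new_start: int
    new_lines: int
    old_start: int
    old_lines: int
    content: str


def unified_header(hunk):
    """Build the '@@ -old_start,old_lines +new_start,new_lines @@' header."""
    return "@@ -{},{} +{},{} @@".format(
        hunk.old_start, hunk.old_lines, hunk.new_start, hunk.new_lines
    )


hunk = HunkSketch(
    new_start=1,
    new_lines=4,
    old_start=1,
    old_lines=3,
    content=" a\n b\n+inserted\n c\n",
)
header = unified_header(hunk)
```

In the content, lines starting with a space are context, "+" marks added lines and "-" marks removed lines, which is why the new side has one line more than the old side in this example.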

get_changed_files_for_initial_commit(commit)[source]

Special function for the initial commit, as we need to diff against the empty tree. Creates the changed files list, to which an object of class pyvcsshark.parser.models.FileModel is added for every changed file in the initial commit.

Parameters: commit – commit of type pygit2.Commit

get_changed_files_with_similiarity(parent, commit)[source]

Creates a list of changed files of the class pyvcsshark.parser.models.FileModel. For every changed file in the commit such an object is created. Furthermore, hunks are saved and each file is tested for similarity to detect copy and move operations

Parameters

parent – Object of class pygit2.Commit, that represents the parent commit
commit – Object of class pygit2.Commit, that represents the child commit

parse_commit(commit)[source]

Function for parsing a commit.

1. changedFiles are created (type: list of pyvcsshark.parser.models.FileModel)
2. author and committer are created (type: pyvcsshark.parser.models.PeopleModel)
3. parents are added (list of strings)
4. commit model is created (type: pyvcsshark.parser.models.CommitModel)
5. pyvcsshark.datastores.basestore.BaseStore.add_commit() is called

Parameters: commit – commit object of type pygit2.Commit

Note

The call to pyvcsshark.datastores.basestore.BaseStore.add_commit() is thread/process safe, as a lock is used to regulate the calls

run()[source]: The process gets a commit out of the queue and processes it. We use the poisonous pill technique here. Means, our queue has #Processes times “None” in it in the end. If a process encounters that None, he will stop and terminate.