3. API Documentation

3.1. Main Module

class pyvcsshark.main.Application(config)[source]

Main application class. It contains the most important process logic. The main application runs through the following steps:

1. The correct datastore (a class that inherits from pyvcsshark.datastores.basestore.BaseStore) is found by looking at which one the user chose, and the class is instantiated

2. The correct parser (a class that inherits from pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated

3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the chosen parser)

4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the different configuration parameters and values from the parser (more precisely: the implementation provided by the chosen datastore)

5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start the parsing process of the repository (more precisely: the implementation provided by the chosen parser)

6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process (e.g. closing files) (more precisely: the implementation provided by the chosen parser)

7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process (e.g. closing connections) (more precisely: the implementation provided by the chosen datastore)

Parameters

config – An instance of Config, which contains the configuration parameters
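The seven steps above can be sketched roughly as follows. This is an illustrative outline, not the actual pyvcsshark source: DemoParser, DemoStore, the run function, and the config attributes path and cores_per_job are hypothetical names chosen for this example.

```python
from types import SimpleNamespace


class DemoParser:
    """Hypothetical stand-in for a BaseParser subclass."""

    repository_type = "git"

    def __init__(self, log):
        self.log = log

    def initialize(self):
        self.log.append("parser.initialize")

    def get_project_url(self):
        return "https://example.org/repo.git"

    def parse(self, repository_path, datastore, cores_per_job):
        self.log.append("parser.parse")

    def finalize(self):
        self.log.append("parser.finalize")


class DemoStore:
    """Hypothetical stand-in for a BaseStore subclass."""

    def __init__(self, log):
        self.log = log

    def initialize(self, config, repository_url, repository_type):
        self.log.append("store.initialize")

    def finalize(self):
        self.log.append("store.finalize")


def run(config, log):
    datastore = DemoStore(log)   # step 1: datastore chosen by the user
    parser = DemoParser(log)     # step 2: parser for the repository
    parser.initialize()          # step 3
    datastore.initialize(        # step 4: values come from config and parser
        config, parser.get_project_url(), parser.repository_type
    )
    parser.parse(config.path, datastore, config.cores_per_job)  # step 5
    parser.finalize()            # step 6
    datastore.finalize()         # step 7


calls = []
run(SimpleNamespace(path="/tmp/repo", cores_per_job=1), calls)
```

The calls list records the order in which the parser and datastore hooks are invoked, which makes the initialize/parse/finalize ordering easy to inspect.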

3.2. Application

class pyvcsshark.Application(config)[source]

Main application class. It contains the most important process logic. The main application runs through the following steps:

1. The correct datastore (a class that inherits from pyvcsshark.datastores.basestore.BaseStore) is found by looking at which one the user chose, and the class is instantiated

2. The correct parser (a class that inherits from pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated

3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the chosen parser)

4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the different configuration parameters and values from the parser (more precisely: the implementation provided by the chosen datastore)

5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start the parsing process of the repository (more precisely: the implementation provided by the chosen parser)

6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process (e.g. closing files) (more precisely: the implementation provided by the chosen parser)

7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process (e.g. closing connections) (more precisely: the implementation provided by the chosen datastore)

Parameters

config – An instance of Config, which contains the configuration parameters

3.3. Configuration and Misc

3.3.1. Configuration

class pyvcsshark.Config(args)[source]

Holds configuration information

Parameters

args – argumentparser of the class argparse.ArgumentParser

3.3.2. Utils

pyvcsshark.utils.find_plugins(plugin_dir)[source]

Finds all Python files in the specified path and imports them. This is needed to automatically detect which datastores and parsers are available.

Parameters

plugin_dir – path to the plugin directory
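A minimal sketch of how such plugin discovery could work, assuming only the files directly inside plugin_dir are considered. The function name find_plugins_sketch and the throwaway demo_plugin module are hypothetical; the real implementation may differ.

```python
import importlib.util
import os
import tempfile


def find_plugins_sketch(plugin_dir):
    """Import every .py file directly inside plugin_dir and return the modules."""
    modules = []
    for file_name in sorted(os.listdir(plugin_dir)):
        if not file_name.endswith(".py") or file_name == "__init__.py":
            continue
        module_name = file_name[:-3]
        spec = importlib.util.spec_from_file_location(
            module_name, os.path.join(plugin_dir, file_name)
        )
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        modules.append(module)
    return modules


# Usage: create a throwaway plugin directory containing one module.
with tempfile.TemporaryDirectory() as plugin_dir:
    with open(os.path.join(plugin_dir, "demo_plugin.py"), "w") as handle:
        handle.write("IDENTIFIER = 'demo'\n")
    loaded = find_plugins_sketch(plugin_dir)
```

Once the modules are imported, any datastore or parser classes they define become visible as subclasses of the respective base class.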

pyvcsshark.utils.get_immediate_subdirectories(a_dir)[source]

Helper function that returns the immediate subdirectories of a path. This is helpful when writing a parser that checks whether certain folders are present.

Parameters

a_dir – directory from which immediate subdirectories should be listed
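As an illustration, such a helper could be written as below. The name immediate_subdirectories_sketch is hypothetical, and returning a sorted list of names is an assumption made for this example.

```python
import os
import tempfile


def immediate_subdirectories_sketch(a_dir):
    """Return the names of the directories located directly inside a_dir."""
    return sorted(
        name
        for name in os.listdir(a_dir)
        if os.path.isdir(os.path.join(a_dir, name))
    )


# Usage: nested directories and plain files are not reported.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, ".git"))
    os.makedirs(os.path.join(root, "src", "nested"))
    open(os.path.join(root, "README"), "w").close()
    found = immediate_subdirectories_sketch(root)
```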

pyvcsshark.utils.readable_dir(prospective_dir)[source]

Checks whether a path exists, is a directory, and is accessible. Returns true only if all three conditions hold.

Parameters

prospective_dir – path to the directory
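The three checks can be sketched with the standard library as follows. The name readable_dir_sketch is hypothetical, and interpreting "accessible" as read permission is an assumption.

```python
import os
import tempfile


def readable_dir_sketch(prospective_dir):
    """Return True only for an existing, readable directory."""
    # os.path.isdir() is False for missing paths, so it also covers existence.
    return os.path.isdir(prospective_dir) and os.access(prospective_dir, os.R_OK)


with tempfile.TemporaryDirectory() as existing_dir:
    ok = readable_dir_sketch(existing_dir)
    file_path = os.path.join(existing_dir, "file.txt")
    open(file_path, "w").close()
    is_file = readable_dir_sketch(file_path)

# The temporary directory has been removed at this point.
missing = readable_dir_sketch(os.path.join(existing_dir, "gone"))
```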

3.4. Datastores

3.4.1. BaseDatastore

class pyvcsshark.datastores.basestore.BaseStore[source]

Abstract class for the datastores. One must inherit from this class and implement the methods to create a new datastore.

Based on Python's abc module.

Property projectName

name of the project, which should be stored

Property projectURL

url of the repository of the project, which should be stored

Property repositoryType

type of the repository of the project, which should be stored

Parameters

metaclass – name of the abstract metaclass

Note

If you want to use a logger for your implementation of a datastore you can write:

logger = logging.getLogger("store") 

to get the logger.

Note

It is possible to include datastores that are not databases such as MongoDB or MySQL. However, you should make sure that your datastore implementation correctly handles the values passed to it.

abstract add_commit(commit_model)[source]

Add the commit to the datastore. How this is handled depends on the implementation.

Parameters

commit_model – instance of CommitModel, which includes all important information about the commit

Warning

The commits passed in here are not sorted. Furthermore, they need to be processed right away or stored in a SimpleQueue. They cannot be stored in a normal list or dictionary, as some parsers (e.g. GitParser) use multiprocessing to add the commits.

abstract finalize()[source]

Called at the end to finalize the datastore (e.g. closing files or connections)

static find_correct_datastore(datastore_identifier)[source]

Finds the correct datastore by looking at the store_identifier property of each datastore

Parameters

datastore_identifier – string that represents the correct datastore (e.g. mongo)

abstract initialize(config, repository_url, repository_type)[source]

Initializes the datastore

Parameters
  • config – all configuration

  • repository_url – url of the repository, which is to be analyzed

  • repository_type – type of the repository, which is to be analyzed (e.g. “git”)

abstract property store_identifier

Must return a string identifier for the datastore (e.g. mongo)
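To illustrate the interface described above, here is a minimal sketch of a custom datastore. BaseStoreSketch is a stand-in that mirrors the documented abstract methods so that the example runs without pyvcsshark installed; QueueStore, its "queue" identifier, and the stored attribute are hypothetical. A real datastore would inherit from pyvcsshark.datastores.basestore.BaseStore instead.

```python
import abc
import logging
import multiprocessing


class BaseStoreSketch(abc.ABC):
    """Stand-in that mirrors the abstract interface described above."""

    @property
    @abc.abstractmethod
    def store_identifier(self):
        ...

    @abc.abstractmethod
    def initialize(self, config, repository_url, repository_type):
        ...

    @abc.abstractmethod
    def add_commit(self, commit_model):
        ...

    @abc.abstractmethod
    def finalize(self):
        ...


class QueueStore(BaseStoreSketch):
    """Hypothetical datastore that buffers commits in a process-safe queue."""

    def __init__(self):
        self.logger = logging.getLogger("store")
        self.commit_queue = multiprocessing.SimpleQueue()
        self.stored = []

    @property
    def store_identifier(self):
        return "queue"

    def initialize(self, config, repository_url, repository_type):
        self.logger.info("Storing %s repository %s", repository_type, repository_url)

    def add_commit(self, commit_model):
        # Commits may arrive unsorted and from several processes,
        # so they go into a queue instead of a plain list.
        self.commit_queue.put(commit_model)

    def finalize(self):
        # Drain whatever was queued during parsing.
        while not self.commit_queue.empty():
            self.stored.append(self.commit_queue.get())


store = QueueStore()
store.initialize(config=None, repository_url="https://example.org/repo.git",
                 repository_type="git")
store.add_commit({"id": "abc123"})  # a plain dict stands in for a CommitModel
store.finalize()
```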

3.4.2. MongoStore

3.4.2.1. API

class pyvcsshark.datastores.mongostore.MongoStore[source]

Datastore implementation for saving data to MongoDB. Inherits from pyvcsshark.datastores.basestore.BaseStore.

Property commit_queue

instance of multiprocessing.JoinableQueue, which holds the objects of pyvcsshark.dbmodels.models.CommitModel that should be written to MongoDB

Property logger

holds the logging instance, acquired by calling logging.getLogger("store")

add_branch(branch_model)[source]

Add branch to extra queue

add_commit(commit_model)[source]

Adds commits of class pyvcsshark.dbmodels.models.CommitModel to the commit queue

finalize()[source]

Branch processing depends on the commits being finished (for the references), so we must first wait for the commits to finish before we can start processing the branches.

initialize(config, repository_url, repository_type)[source]

Initializes the MongoStore by connecting to MongoDB, creating the project in the project collection, and setting up processes (see: pyvcsshark.datastores.mongostore.CommitStorageProcess), which read commits out of the commit queue, process them, and store them in MongoDB.

Parameters
  • config – all configuration

  • repository_url – url of the repository, which is to be analyzed

  • repository_type – type of the repository, which is to be analyzed (e.g. “git”)

property store_identifier

Returns the identifier mongo for this datastore

class pyvcsshark.datastores.mongostore.CommitStorageProcess(queue, vcs_system_id, last_commit_date, config, name)[source]

Class that inherits from multiprocessing.Process for processing instances of class pyvcsshark.dbmodels.models.CommitModel and writing it into the mongodb

Parameters
  • queue – queue, where the pyvcsshark.dbmodels.models.CommitModel are stored in

  • vcs_system_id – object id of class bson.objectid.ObjectId from the vcs system

  • last_commit_date – object of class datetime.datetime, which holds the date of the last commit that was parsed

  • config – object of class pyvcsshark.config.Config, which holds configuration information

create_branch_list(branches)[source]

Creates a list of the names of the branches to which a commit belongs. We go through the branches property of the class pyvcsshark.dbmodels.models.CommitModel, which is a list of branch objects of class pyvcsshark.dbmodels.models.BranchModel

Parameters

branches – list of objects of class pyvcsshark.dbmodels.models.BranchModel

create_file_actions(files, mongo_commit_id)[source]

Creates a list of object ids of type bson.objectid.ObjectId for the different file actions of the commit by transforming the files into file actions of type FileAction, File, and Hunk (pycoshark library)

Parameters
  • files – list of changed files of type pyvcsshark.dbmodels.models.FileModel

  • mongo_commit_id – mongoid of the commit which is processed

Note

Hunks and the file action itself are inserted via bulk insert.

create_people(name, email)[source]

Creates a people object of type People (which can be found in the pycoshark library) and returns the object id, of type bson.objectid.ObjectId, of the stored object

Parameters
  • name – name of the contributor

  • email – email of the contributor

Note

The call to mongoengine.queryset.QuerySet.upsert_one() is thread/process safe

run()[source]

Endless loop for the processes, which consists of several steps:

  1. Get an object of class pyvcsshark.dbmodels.models.CommitModel from the queue

  2. Check if this commit was stored before and, if so, update its branches and tags (if they have changed)

  3. Store author and committer in mongodb

  4. Store Tags in mongodb

  5. Create a list of branches, where the commit belongs to

  6. Save the different file actions, which were done in this commit in the mongodb

  7. Save the commit itself

Note

The committer date is used to check if a commit was already stored before. That is, we get the last commit out of the database and check whether the committer date of each commit we process is later than the committer date of that last commit.

Warning

We only look for changed tags and branches here for already processed commits!
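The committer date check described in the note can be sketched as below. The function name needs_full_processing and the handling of an empty database (last_commit_date is None) are assumptions for illustration, not taken from the pyvcsshark source.

```python
from datetime import datetime, timezone


def needs_full_processing(committer_date, last_commit_date):
    """Return True if the commit is newer than the last stored commit."""
    if last_commit_date is None:  # assumption: nothing has been stored yet
        return True
    return committer_date > last_commit_date


last_stored = datetime(2016, 1, 1, tzinfo=timezone.utc)
newer = datetime(2016, 1, 2, tzinfo=timezone.utc)
older = datetime(2015, 12, 31, tzinfo=timezone.utc)

process_newer = needs_full_processing(newer, last_stored)
process_older = needs_full_processing(older, last_stored)
process_first = needs_full_processing(older, None)
```

Commits for which the check fails are treated as already stored, so only their branches and tags are examined for changes.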

3.5. Models

class pyvcsshark.parser.models.CommitModel(id, branches=[], tags=[], parents=[], author=None, committer=None, message=None, changedFiles=[], authorDate=None, authorOffset=None, committerDate=None, committerOffset=None)[source]

Model that represents a commit to a repository

Parameters
  • id – id of the commit (e.g. a revision hash)

  • branches – set of branches to which the commit belongs

  • tags – list of tags of type pyvcsshark.dbmodels.models.TagModel

  • parents – list of strings, which contains the parent ids of the commit

  • author – author of the commit. Must be of type pyvcsshark.dbmodels.models.PeopleModel

  • committer – committer of the commit. Must be of type pyvcsshark.dbmodels.models.PeopleModel

  • message – string of the commit message

  • changedFiles – list of files of type pyvcsshark.dbmodels.models.FileModel

  • authorDate – date of the creation of the change of the commit (must be a UNIX timestamp)

  • authorOffset – offset for the authordate (timezone)

  • committerDate – date of the commit (must be a UNIX timestamp)

  • committerOffset – offset for the committerdate (timezone)

Note

If your parser does not provide all of this information, just use the default values
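Since authorDate and committerDate must be UNIX timestamps, a parser that works with datetime objects needs to convert them first. The sketch below shows one way to do so. The helper name to_timestamp_and_offset is hypothetical, and expressing the offset in minutes is an assumption, as the unit is not specified here.

```python
from datetime import datetime, timedelta, timezone


def to_timestamp_and_offset(moment):
    """Split a timezone-aware datetime into (UNIX timestamp, offset in minutes)."""
    offset_minutes = int(moment.utcoffset().total_seconds() // 60)
    return int(moment.timestamp()), offset_minutes


# 12:00 at UTC+02:00 on 1 January 2016
authored = datetime(2016, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
author_date, author_offset = to_timestamp_and_offset(authored)
```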

class pyvcsshark.parser.models.FileModel(path, size=None, linesAdded=None, linesDeleted=None, isBinary=None, mode=None, hunks=[], oldPath=None, parent_revision_hash=None)[source]

Model that holds the changes of the files.

Parameters
  • path – path to the file that was changed

  • size – size of the file that was changed

  • linesAdded – count of how many lines were added to the file

  • linesDeleted – count of how many lines were deleted

  • isBinary – boolean, which is true if the file is a binary file

  • mode – mode of the file action (e.g. “A” for file was added)

  • hunks – list of hunks for the file

  • oldPath – old path to the file, which only exists if the file was copied or moved

  • parent_revision_hash – hash of the parent commit

class pyvcsshark.parser.models.Hunk(new_start, new_lines, old_start, old_lines, content)[source]

class pyvcsshark.parser.models.TagModel(name, message=None, tagger=None, taggerDate=None, taggerOffset=None)[source]

Model that holds the information for the different tags.

Parameters
  • name – name of the tag

  • message – message of the tag

  • tagger – creator of the tag. Must be of type pyvcsshark.dbmodels.models.PeopleModel.

  • taggerDate – date of the creation of the tag. Must be a UNIX timestamp.

  • taggerOffset – offset for taggerdate (timezone)

class pyvcsshark.parser.models.BranchModel(name)[source]

Model which holds the branch information.

Parameters

name – name of the branch

class pyvcsshark.parser.models.PeopleModel(name=None, email=None)[source]

Model that holds the people information.

Parameters
  • name – name of the person

  • email – email of the person

3.6. Parser

3.6.1. BaseParser

class pyvcsshark.parser.baseparser.BaseParser[source]

Abstract class for the parsers. One must inherit from this class and implement the methods to create a new repository parser.

Based on Python's abc module.

Parameters

metaclass – name of the abstract metaclass

Note

If you want to use a logger for your implementation of a parser you can write:

logger = logging.getLogger("parser") 

to get the logger.

abstract detect(repository_path)[source]

Return true if the parser is applicable to the repository

Parameters

repository_path – path to the repository

abstract finalize()[source]

Finalization process for parser

static find_correct_parser(repository_path)[source]

Finds the correct parser by executing the parser.detect() method on the given repository path

Parameters

repository_path – path to the repository

abstract get_project_url()[source]

Retrieves the project url from the repository. This needs to be part of the parser, as only the parser is specific to the repository type

abstract initialize()[source]

Initialization process for parser

abstract parse(repository_path, datastore, cores_per_job)[source]

Parses the repository

Parameters
  • repository_path – path to the repository

  • datastore – datastore used to save the data to

Note

We must call the pyvcsshark.datastores.basestore.BaseStore.add_commit() function in the parsing process if we want to add commits to the datastore

abstract property repository_type

Must return the type for the given repository. E.g. git
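A minimal sketch of a custom parser built on this interface is shown below. BaseParserSketch is a stand-in that mirrors the documented abstract methods so that the example runs without pyvcsshark installed. FolderParser, the imaginary ".demo" marker folder, and ListStore are hypothetical. A real parser would inherit from pyvcsshark.parser.baseparser.BaseParser and pass CommitModel objects to the datastore.

```python
import abc
import os
import tempfile


class BaseParserSketch(abc.ABC):
    """Stand-in that mirrors the abstract interface described above."""

    @property
    @abc.abstractmethod
    def repository_type(self):
        ...

    @abc.abstractmethod
    def detect(self, repository_path):
        ...

    @abc.abstractmethod
    def initialize(self):
        ...

    @abc.abstractmethod
    def get_project_url(self):
        ...

    @abc.abstractmethod
    def parse(self, repository_path, datastore, cores_per_job):
        ...

    @abc.abstractmethod
    def finalize(self):
        ...


class FolderParser(BaseParserSketch):
    """Hypothetical parser for an imaginary VCS that keeps a .demo folder."""

    @property
    def repository_type(self):
        return "demo"

    def detect(self, repository_path):
        return os.path.isdir(os.path.join(repository_path, ".demo"))

    def initialize(self):
        pass

    def get_project_url(self):
        return "https://example.org/demo"

    def parse(self, repository_path, datastore, cores_per_job):
        # A real parser would build CommitModel objects here.
        datastore.add_commit({"id": "r1"})

    def finalize(self):
        pass


class ListStore:
    """Hypothetical single-process datastore used only for this example."""

    def __init__(self):
        self.commits = []

    def add_commit(self, commit_model):
        self.commits.append(commit_model)


with tempfile.TemporaryDirectory() as repo:
    os.mkdir(os.path.join(repo, ".demo"))
    parser = FolderParser()
    detected = parser.detect(repo)
    store = ListStore()
    parser.initialize()
    parser.parse(repo, store, cores_per_job=1)
    parser.finalize()
```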

3.6.2. GitParser

class pyvcsshark.parser.gitparser.GitParser[source]

Parser for git repositories. The general parsing process is described in pyvcsshark.parser.gitparser.GitParser.parse().

Property SIMILARITY_THRESHOLD

sets the threshold for deciding if a file is similar to another. Default: 50%

multiprocessing.cpu_count().

Property repository

object of class pygit2.Repository, which represents the repository

Property commits_to_be_processed

dictionary that is set up the following way: commits_to_be_processed = {'<revisionHash>': {'branches': set(), 'tags': []}}, where <revisionHash> must be replaced with the actual hash. Therefore, this dictionary holds information about every revision: which branches it belongs to and which tags it has.

Property logger

logger, which is acquired via logging.getLogger("parser")

Property datastore

datastore, where the commits should be saved to

Property commit_queue

object of class multiprocessing.JoinableQueue, which holds the commits that can be parsed

add_branch(commit_hash, branch)[source]

Does two things: First, it adds the commit hash to the commit queue, so that the parsing processes can process this commit. Second, it creates an object of type pyvcsshark.parser.models.BranchModel and stores it in the commits_to_be_processed dictionary.

Parameters
  • commit_hash – revision hash of the commit to be processed

  • branch – branch that should be added for the commit

add_tag(tagged_commit, tag_name, tag_object)[source]

Creates an object of type pyvcsshark.parser.models.TagModel and stores it in the commits_to_be_processed dictionary.

Parameters
  • tagged_commit – revision hash of the commit to be processed

  • tag_name – name of the tag that should be added

  • tag_object – in git it is possible to annotate tags. If a tag is annotated, we get a tag object of class pygit2.Tag

Note

It can happen that people committed to a tag and thereby created a “tag-branch”, which is normally not possible in git. Therefore, we go through all tags and check if they correspond to a commit that is already in the dictionary. If yes, we tag that commit. If no, we ignore the tag.
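The bookkeeping done by add_branch and add_tag can be sketched with a plain dictionary. This is a simplified illustration: it stores plain strings instead of BranchModel and TagModel objects, leaves out the commit queue, and the boolean return value of add_tag is an addition made for this example.

```python
commits_to_be_processed = {}


def add_branch(commit_hash, branch):
    """Register the commit (if new) and record that it lies on the branch."""
    entry = commits_to_be_processed.setdefault(
        commit_hash, {"branches": set(), "tags": []}
    )
    entry["branches"].add(branch)


def add_tag(tagged_commit, tag_name):
    """Attach the tag only if the commit is already known via a branch."""
    if tagged_commit not in commits_to_be_processed:
        return False  # tag points to a commit on no branch: ignore it
    commits_to_be_processed[tagged_commit]["tags"].append(tag_name)
    return True


add_branch("abc", "master")
add_branch("abc", "develop")
tagged = add_tag("abc", "v1.0")
ignored = add_tag("zzz", "orphan")
```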

detect(repository_path)[source]

Tries to detect the repository. If it is not there, an exception is raised and therefore false can be returned

finalize()[source]

Finalization process for parser

get_project_url()[source]

Returns the url of the project, which is processed

initialize()[source]

Initializes the parser. It gets all the branch and tag information and puts it into two different locations: First, the commit id is put into the commit queue for processing by the parsing processes. Second, a dictionary is created, which holds the information about which branches a commit is on and which tags it has

parse(repository_path, datastore, cores_per_job)[source]

Parses the repository located at repository_path and saves the parsed commits in the datastore by calling the pyvcsshark.datastores.basestore.BaseStore.add_commit() method of the chosen datastore. It mostly uses pygit2 (see: http://www.pygit2.org/).

The parsing process is divided into several steps:

  1. A list of all branches and tags is created

  2. All branches and tags are parsed. This creates a dictionary of all commits with their corresponding tags and branches and adds all revision hashes to the commit queue

  3. The poison pills for terminating the parsing processes are added to the commit_queue

  4. Processes of class pyvcsshark.parser.gitparser.CommitParserProcess are created, which parse all commits.

Parameters
  • repository_path – Path to the repository

  • datastore – Datastore used to save the data to

property repository_type

Must return the type for the given repository. E.g. git

class pyvcsshark.parser.gitparser.CommitParserProcess(queue, commits_to_be_processed, repository, datastore, lock)[source]

A process, which inherits from multiprocessing.Process, that parses the commits it gets from the queue and calls the pyvcsshark.datastores.basestore.BaseStore.add_commit() function to add them

Property logger

logger acquired by calling logging.getLogger(“parser”)

Parameters
  • queue – queue, where the different commit hashes are stored

  • commits_to_be_processed – dictionary, which contains information about the branches and tags of each commit

  • repository – repository object of type pygit2.Repository

  • datastore – object, that is a subclass of pyvcsshark.datastores.basestore.BaseStore

  • lock – lock that is used, so that only one process at a time is calling the pyvcsshark.datastores.basestore.BaseStore.add_commit() function

create_hunks(hunks, initial_commit=False)[source]

Creates the diff in the unified format (see: https://en.wikipedia.org/wiki/Diff#Unified_format)

If we have the initial commit, we need to turn around the hunk.* attributes.

Parameters
  • hunks – list of objects of class pygit2.DiffHunk

  • initial_commit – indicates if we have an initial commit
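In the unified format, each hunk starts with a header line that gives the start line and line count in the old and new file. The sketch below renders such a hunk from the fields that pyvcsshark.parser.models.Hunk takes. The namedtuple stand-in and the format_hunk helper are hypothetical and only illustrate the layout; this is not the create_hunks implementation.

```python
from collections import namedtuple

# Stand-in with the same field names as pyvcsshark.parser.models.Hunk.
Hunk = namedtuple("Hunk", "new_start new_lines old_start old_lines content")


def format_hunk(hunk):
    """Render a hunk as a unified-format header followed by its content."""
    header = "@@ -{},{} +{},{} @@".format(
        hunk.old_start, hunk.old_lines, hunk.new_start, hunk.new_lines
    )
    return header + "\n" + hunk.content


# One old line is replaced by two new lines.
example = Hunk(new_start=1, new_lines=2, old_start=1, old_lines=1,
               content="-a\n+b\n+c\n")
rendered = format_hunk(example)
```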

get_changed_files_for_initial_commit(commit)[source]

Special function for the initial commit, as we need to diff against the empty tree. Creates the changed files list, to which an object of class pyvcsshark.parser.models.FileModel is added for every changed file in the initial commit.

Parameters

commit – commit of type pygit2.Commit

get_changed_files_with_similiarity(parent, commit)[source]

Creates a list of changed files of the class pyvcsshark.parser.models.FileModel. For every changed file in the commit such an object is created. Furthermore, hunks are saved and each file is tested for similarity to detect copy and move operations

Parameters
  • parent – Object of class pygit2.Commit, that represents the parent commit

  • commit – Object of class pygit2.Commit, that represents the child commit

parse_commit(commit)[source]

Function for parsing a commit.

  1. changedFiles are created (type: list of pyvcsshark.parser.models.FileModel)

  2. author and committer are created (type: pyvcsshark.parser.models.PeopleModel)

  3. parents are added (list of strings)

  4. commit model is created (type: pyvcsshark.parser.models.CommitModel)

  5. pyvcsshark.datastores.basestore.BaseStore.add_commit() is called

Parameters

commit – commit object of type pygit2.Commit

Note

The call to pyvcsshark.datastores.basestore.BaseStore.add_commit() is thread/process safe, as a lock is used to regulate the calls

run()[source]

The process gets a commit out of the queue and processes it. We use the poison pill technique here: at the end, the queue contains one “None” entry per process. If a process encounters such a None, it stops and terminates.
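The poison pill technique can be sketched as follows. To keep the example short and portable, it uses threads and queue.Queue rather than multiprocessing.Process and multiprocessing.JoinableQueue; the worker function, NUMBER_OF_WORKERS, and the upper-casing stand-in for parsing are hypothetical.

```python
import queue
import threading

NUMBER_OF_WORKERS = 3


def worker(commit_queue, results):
    """Process commit hashes until a poison pill (None) is received."""
    while True:
        commit_hash = commit_queue.get()
        if commit_hash is None:  # poison pill: stop this worker
            commit_queue.task_done()
            break
        results.append(commit_hash.upper())  # stand-in for parsing the commit
        commit_queue.task_done()


commit_queue = queue.Queue()
for commit_hash in ["a1", "b2", "c3", "d4"]:
    commit_queue.put(commit_hash)
# One pill per worker, placed after all real work items.
for _ in range(NUMBER_OF_WORKERS):
    commit_queue.put(None)

results = []
workers = [
    threading.Thread(target=worker, args=(commit_queue, results))
    for _ in range(NUMBER_OF_WORKERS)
]
for thread in workers:
    thread.start()
commit_queue.join()  # returns once every item, pills included, is done
for thread in workers:
    thread.join()
```

Because each worker consumes exactly one pill and then exits, placing as many pills as there are workers guarantees that all of them terminate after the real work is finished.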