3. API Documentation
3.1. Main Module
- class pyvcsshark.main.Application(config)[source]
Main application class. Contains the most important process logic. The main application consists of the following steps:
1. The correct datastore (a subclass of pyvcsshark.datastores.basestore.BaseStore) is found by looking at which one the user chose, and the class is instantiated.
2. The correct parser (a subclass of pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated.
3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the chosen parser).
4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the configuration parameters and values from the parser (more precisely: the implementation provided by the chosen datastore).
5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start parsing the repository (more precisely: the implementation provided by the chosen parser).
6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process, e.g. closing files (more precisely: the implementation provided by the chosen parser).
7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process, e.g. closing connections (more precisely: the implementation provided by the chosen datastore).
- Parameters
config – An instance of Config, which contains the configuration parameters
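The seven steps listed for Application can be sketched as follows. This is only an illustration of the call order: DemoParser, DemoStore, run_application, and the URL are hypothetical stand-ins, not part of pyvcsshark, and the real class selects the datastore and parser dynamically rather than hard-coding them.

```python
class DemoParser:
    """Hypothetical stand-in for a BaseParser subclass."""

    repository_type = "git"

    def __init__(self, log):
        self.log = log

    def initialize(self):
        self.log.append("parser.initialize")

    def get_project_url(self):
        return "https://example.org/repo.git"  # made-up URL

    def parse(self, repository_path, datastore, cores_per_job):
        self.log.append("parser.parse")
        datastore.add_commit({"id": "abc123"})  # made-up commit

    def finalize(self):
        self.log.append("parser.finalize")


class DemoStore:
    """Hypothetical stand-in for a BaseStore subclass."""

    def __init__(self, log):
        self.log = log
        self.commits = []

    def initialize(self, config, repository_url, repository_type):
        self.log.append("store.initialize")

    def add_commit(self, commit_model):
        self.commits.append(commit_model)

    def finalize(self):
        self.log.append("store.finalize")


def run_application(config, repository_path, cores_per_job=1):
    """Run the documented steps in order and return the call log."""
    log = []
    datastore = DemoStore(log)   # step 1: instantiate the chosen datastore
    parser = DemoParser(log)     # step 2: instantiate the matching parser
    parser.initialize()          # step 3
    datastore.initialize(        # step 4: config plus values from the parser
        config, parser.get_project_url(), parser.repository_type
    )
    parser.parse(repository_path, datastore, cores_per_job)  # step 5
    parser.finalize()            # step 6
    datastore.finalize()         # step 7
    return log, datastore.commits


call_log, stored_commits = run_application({}, "/path/to/repo")
```

Note that the datastore is initialized only after the parser, because it needs the project URL and repository type that the parser provides.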
3.2. Application
- class pyvcsshark.Application(config)[source]
Main application class. Contains the most important process logic. The main application consists of the following steps:
1. The correct datastore (a subclass of pyvcsshark.datastores.basestore.BaseStore) is found by looking at which one the user chose, and the class is instantiated.
2. The correct parser (a subclass of pyvcsshark.parser.baseparser.BaseParser) for the specified repository is instantiated.
3. pyvcsshark.parser.baseparser.BaseParser.initialize() is called (more precisely: the implementation provided by the chosen parser).
4. pyvcsshark.datastores.basestore.BaseStore.initialize() is called with the configuration parameters and values from the parser (more precisely: the implementation provided by the chosen datastore).
5. pyvcsshark.parser.baseparser.BaseParser.parse() is called to start parsing the repository (more precisely: the implementation provided by the chosen parser).
6. pyvcsshark.parser.baseparser.BaseParser.finalize() is called to finalize the parsing process, e.g. closing files (more precisely: the implementation provided by the chosen parser).
7. pyvcsshark.datastores.basestore.BaseStore.finalize() is called to finalize the storing process, e.g. closing connections (more precisely: the implementation provided by the chosen datastore).
- Parameters
config – An instance of Config, which contains the configuration parameters
3.3. Configuration and Misc
3.3.1. Configuration
- class pyvcsshark.Config(args)[source]
Holds configuration information
- Parameters
args – argument parser of class argparse.ArgumentParser
3.3.2. Utils
- pyvcsshark.utils.find_plugins(plugin_dir)[source]
Finds all Python files in the specified path and imports them. This is needed to detect automatically which datastores and parsers are available.
- Parameters
plugin_dir – path to the plugin directory
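A minimal sketch of this kind of plugin discovery, assuming plugins are plain .py files located directly in plugin_dir. The name find_plugins_sketch and the demo file are hypothetical; the actual implementation in pyvcsshark.utils may differ.

```python
import importlib.util
import pathlib
import tempfile


def find_plugins_sketch(plugin_dir):
    """Import every .py file in plugin_dir and return the loaded modules."""
    modules = []
    for path in sorted(pathlib.Path(plugin_dir).glob("*.py")):
        if path.name == "__init__.py":
            continue  # package marker, not a plugin
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
        modules.append(module)
    return modules


# Demo: write one made-up plugin file and load it.
with tempfile.TemporaryDirectory() as demo_dir:
    plugin_file = pathlib.Path(demo_dir, "demo_store_plugin.py")
    plugin_file.write_text("IDENTIFIER = 'demo'\n")
    loaded = find_plugins_sketch(demo_dir)
    identifiers = [module.IDENTIFIER for module in loaded]
```

Importing the files is what makes their classes visible as subclasses of the base classes, so that the correct datastore or parser can be looked up afterwards.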
3.4. Datastores
3.4.1. BaseDatastore
- class pyvcsshark.datastores.basestore.BaseStore[source]
Abstract class for the datastores. One must inherit from this class and implement the methods to create a new datastore.
Based on Python's abc module.
- Property projectName
name of the project, which should be stored
- Property projectURL
url of the repository of the project, which should be stored
- Property repositoryType
type of the repository of the project, which should be stored
- Parameters
metaclass – name of the abstract metaclass
Note
If you want to use a logger for your implementation of a datastore you can write:
logger = logging.getLogger("store")
to get the logger.
Note
It is possible to include datastores that are not databases like MongoDB or MySQL. However, you should make sure that your datastore implementation correctly handles the values that are passed to it.
- abstract add_commit(commit_model)[source]
Add the commit to the datastore. How this is handled depends on the implementation.
- Parameters
commit_model – instance of
CommitModel
, which includes all important information about the commit
Warning
The commits passed in here are not sorted. Furthermore, they must be processed right away or stored in a SimpleQueue. Storing them in a normal list or dictionary does not work, because some parsers (e.g. GitParser) use multiprocessing to add the commits.
- abstract finalize()[source]
Is called in the end to finalize the datastore (e.g. closing files or connections)
- static find_correct_datastore(datastore_identifier)[source]
Finds the correct datastore by looking at the datastore's store_identifier property
- Parameters
datastore_identifier – string that represents the correct datastore (e.g. mongo)
- abstract initialize(config, repository_url, repository_type)[source]
Initializes the datastore
- Parameters
config – all configuration
repository_url – url of the repository, which is to be analyzed
repository_type – type of the repository, which is to be analyzed (e.g. “git”)
- abstract property store_identifier
Must return a string identifier for the datastore (e.g. mongo)
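A new datastore could look roughly like the sketch below. To stay self-contained it does not import pyvcsshark: BaseStoreSketch is a stand-in that mirrors the abstract interface documented above, and QueueStore, its "queue" identifier, and the URL are hypothetical. It assumes that the SimpleQueue mentioned in the warning is multiprocessing.SimpleQueue and that the commit objects can be pickled.

```python
import abc
import multiprocessing


class BaseStoreSketch(abc.ABC):
    """Stand-in mirroring the documented BaseStore interface."""

    @property
    @abc.abstractmethod
    def store_identifier(self):
        """String identifier for the datastore."""

    @abc.abstractmethod
    def initialize(self, config, repository_url, repository_type):
        """Prepare the datastore."""

    @abc.abstractmethod
    def add_commit(self, commit_model):
        """Accept one (unsorted) commit."""

    @abc.abstractmethod
    def finalize(self):
        """Clean up, e.g. close files or connections."""


class QueueStore(BaseStoreSketch):
    """Hypothetical datastore that buffers commits in a process-safe queue."""

    @property
    def store_identifier(self):
        return "queue"

    def initialize(self, config, repository_url, repository_type):
        self.repository_url = repository_url
        self.repository_type = repository_type
        # A plain list would not be shared between parser processes.
        self.commit_queue = multiprocessing.SimpleQueue()

    def add_commit(self, commit_model):
        self.commit_queue.put(commit_model)

    def finalize(self):
        drained = []
        while not self.commit_queue.empty():
            drained.append(self.commit_queue.get())
        return drained


store = QueueStore()
store.initialize(None, "https://example.org/repo.git", "git")  # made-up URL
store.add_commit({"id": "a1"})
store.add_commit({"id": "b2"})
stored = store.finalize()
```

Returning the drained commits from finalize() is a choice made for this demo only; the documented interface does not specify a return value.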
3.4.2. MongoStore
3.4.2.1. API
- class pyvcsshark.datastores.mongostore.MongoStore[source]
Datastore implementation for saving data to MongoDB. Inherits from pyvcsshark.datastores.basestore.BaseStore.
- Property commit_queue
instance of multiprocessing.JoinableQueue, which holds objects of pyvcsshark.dbmodels.models.CommitModel that should be put into MongoDB
- Property logger
holds the logging instance, obtained by calling logging.getLogger("store")
- add_commit(commit_model)[source]
Adds commits of class pyvcsshark.dbmodels.models.CommitModel to the commit queue
- finalize()[source]
Because branch processing depends on the commits being finished (for the references), we must first wait for them to finish before we can start our branch processing.
- initialize(config, repository_url, repository_type)[source]
Initializes the mongostore by connecting to MongoDB, creating the project in the project collection, and setting up processes (see: pyvcsshark.datastores.mongostore.CommitStorageProcess), which read commits out of the commit queue, process them, and store them in MongoDB.
- Parameters
config – all configuration
repository_url – url of the repository, which is to be analyzed
repository_type – type of the repository, which is to be analyzed (e.g. “git”)
- property store_identifier
Returns the identifier mongo for this datastore
- class pyvcsshark.datastores.mongostore.CommitStorageProcess(queue, vcs_system_id, last_commit_date, config, name)[source]
Class that inherits from multiprocessing.Process for processing instances of class pyvcsshark.dbmodels.models.CommitModel and writing them into MongoDB.
- Parameters
queue – queue in which the pyvcsshark.dbmodels.models.CommitModel objects are stored
vcs_system_id – object id of class bson.objectid.ObjectId from the vcs system
last_commit_date – object of class datetime.datetime, which holds the date of the last commit that was parsed
config – object of class pyvcsshark.config.Config, which holds configuration information
- create_branch_list(branches)[source]
Creates a list of the names of the branches to which a commit belongs. We go through the branches property of the class pyvcsshark.dbmodels.models.CommitModel, which is a list of branch objects of class pyvcsshark.dbmodels.models.BranchModel.
- Parameters
branches – list of objects of class pyvcsshark.dbmodels.models.BranchModel
- create_file_actions(files, mongo_commit_id)[source]
Creates a list of object ids of type bson.objectid.ObjectId for the different file actions of the commit by transforming the files into file actions of type FileAction, File, and Hunk (pycoshark library).
- Parameters
files – list of changed files of type pyvcsshark.dbmodels.models.FileModel
mongo_commit_id – mongoid of the commit which is processed
Note
Hunks and the file action itself are inserted via bulk insert.
- create_people(name, email)[source]
Creates a people object of type People (which can be found in the pycoshark library) and returns an object id of type bson.objectid.ObjectId for the stored object.
- Parameters
name – name of the contributor
email – email of the contributor
Note
The call to
mongoengine.queryset.QuerySet.upsert_one()
is thread/process safe
- run()[source]
Endless loop for the processes, which consists of several steps:
1. Get an object of class pyvcsshark.dbmodels.models.CommitModel from the queue
2. Check if this commit was stored before and, if so, update branches and tags (if they have changed)
3. Store author and committer in MongoDB
4. Store tags in MongoDB
5. Create a list of the branches to which the commit belongs
6. Save the file actions that were performed in this commit in MongoDB
7. Save the commit itself
Note
The committer date is used to check whether a commit was already stored before: we get the last commit out of the database and check whether the committer date of each commit we process is later than the committer date of that last commit.
Warning
We only look for changed tags and branches here for already processed commits!
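The committer date check described in the note can be illustrated with a small sketch. The function name is hypothetical, and treating a missing last commit date (None) as "nothing stored yet" is an assumption, not documented behaviour.

```python
import datetime


def is_new_commit(committer_date, last_commit_date):
    """Return True if the commit is later than the last stored commit."""
    if last_commit_date is None:  # assumption: nothing was stored before
        return True
    return committer_date > last_commit_date


last_stored = datetime.datetime(2016, 3, 1, 12, 0)
older = is_new_commit(datetime.datetime(2016, 2, 1, 9, 30), last_stored)
newer = is_new_commit(datetime.datetime(2016, 3, 2, 8, 0), last_stored)
first_run = is_new_commit(datetime.datetime(2016, 2, 1, 9, 30), None)
```

Commits for which the check returns False are the already processed ones, for which only changed tags and branches are considered.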
3.5. Models
- class pyvcsshark.parser.models.CommitModel(id, branches=[], tags=[], parents=[], author=None, committer=None, message=None, changedFiles=[], authorDate=None, authorOffset=None, committerDate=None, committerOffset=None)[source]
Model that represents a commit to a repository
- Parameters
id – id of the commit (e.g. a revision hash)
branches – set of branches to which the commit belongs
tags – list of tags of type
pyvcsshark.dbmodels.models.TagModel
parents – list of strings, which contains the parent ids of the commit
author – author of the commit. Must be of type
pyvcsshark.dbmodels.models.PeopleModel
committer – committer of the commit. Must be of type
pyvcsshark.dbmodels.models.PeopleModel
message – string of the commit message
changedFiles – list of files of type
pyvcsshark.dbmodels.models.FileModel
authorDate – date of the creation of the change of the commit (must be a UNIX timestamp)
authorOffset – offset for the authordate (timezone)
committerDate – date of the commit (must be a UNIX timestamp)
committerOffset – offset for the committerdate (timezone)
Note
If your parser does not provide all of this information, just use the default values.
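Since authorDate and committerDate must be UNIX timestamps with a separate timezone offset, a parser that starts from datetime objects has to split them first. The sketch below shows one way to do so. The helper name is hypothetical, and expressing the offset in minutes is an assumption; the documentation above does not state the unit.

```python
import datetime


def to_timestamp_and_offset(moment):
    """Split an aware datetime into (UNIX timestamp, UTC offset in minutes)."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")
    timestamp = int(moment.timestamp())
    offset_minutes = int(offset.total_seconds() // 60)  # assumed unit
    return timestamp, offset_minutes


# Made-up author date: noon on 1 January 2016 in a UTC+02:00 timezone.
plus_two = datetime.timezone(datetime.timedelta(hours=2))
author_moment = datetime.datetime(2016, 1, 1, 12, 0, tzinfo=plus_two)
author_date, author_offset = to_timestamp_and_offset(author_moment)
```

The two resulting values could then be passed as authorDate and authorOffset when constructing a CommitModel.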
- class pyvcsshark.parser.models.FileModel(path, size=None, linesAdded=None, linesDeleted=None, isBinary=None, mode=None, hunks=[], oldPath=None, parent_revision_hash=None)[source]
Model that holds the changes of the files.
- Parameters
path – path to the file that was changed
size – size of the file that was changed
linesAdded – count of how many lines were added to the file
linesDeleted – count of how many lines were deleted
isBinary – boolean, which is true if the file is a binary file
mode – mode of the file action (e.g. “A” for file was added)
hunks – list of hunks for the file
oldPath – old path to the file, which only exists if the file was copied or moved
parent_revision_hash – hash of the parent commit
- class pyvcsshark.parser.models.TagModel(name, message=None, tagger=None, taggerDate=None, taggerOffset=None)[source]
Model that holds the information for the different tags.
- Parameters
name – name of the tag
message – message of the tag
tagger – creator of the tag. Must be of type pyvcsshark.dbmodels.models.PeopleModel.
taggerDate – date of the creation of the tag. Must be a UNIX timestamp.
taggerOffset – offset for taggerdate (timezone)
3.6. Parser
3.6.1. BaseParser
- class pyvcsshark.parser.baseparser.BaseParser[source]
Abstract class for the parsers. One must inherit from this class and implement the methods to create a new repository parser.
Based on Python's abc module.
- Parameters
metaclass – name of the abstract metaclass
Note
If you want to use a logger for your implementation of a parser you can write:
logger = logging.getLogger("parser")
to get the logger.
- abstract detect(repository_path)[source]
Return true if the parser is applicable to the repository
- Parameters
repository_path – path to the repository
- static find_correct_parser(repository_path)[source]
Finds the correct parser by executing the parser.detect() method on the given repository path
- Parameters
repository_path – path to the repository
- abstract get_project_url()[source]
Retrieves the project url from the repository. This needs to be put here, as only the parser is specific to the repository type
- abstract parse(repository_path, datastore, cores_per_job)[source]
Parses the repository
- Parameters
repository_path – path to the repository
datastore – subclass of pyvcsshark.datastores.basestore.BaseStore
cores_per_job – number of cores used for parsing
Note
We must call the pyvcsshark.datastores.basestore.BaseStore.add_commit() function in the parsing process if we want to add commits to the datastore
- abstract property repository_type
Must return the type for the given repository. E.g. git
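The interplay of detect() and find_correct_parser() might look like the sketch below. BaseParserSketch and DotGitParser are hypothetical stand-ins; looking parsers up via __subclasses__() and detecting git by the presence of a .git directory are assumptions made for illustration (the GitParser documented in this section detects repositories differently).

```python
import abc
import os
import tempfile


class BaseParserSketch(abc.ABC):
    """Stand-in mirroring part of the documented BaseParser interface."""

    @property
    @abc.abstractmethod
    def repository_type(self):
        """Type of the repository, e.g. git."""

    @abc.abstractmethod
    def detect(self, repository_path):
        """Return True if this parser can handle the repository."""

    @staticmethod
    def find_correct_parser(repository_path):
        # Try every known parser and return the first applicable one.
        for parser_class in BaseParserSketch.__subclasses__():
            parser = parser_class()
            if parser.detect(repository_path):
                return parser
        return None


class DotGitParser(BaseParserSketch):
    """Hypothetical parser that recognises a repository by its .git folder."""

    @property
    def repository_type(self):
        return "git"

    def detect(self, repository_path):
        return os.path.isdir(os.path.join(repository_path, ".git"))


# Demo: one directory that looks like a git repository, one that does not.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, ".git"))
    found = BaseParserSketch.find_correct_parser(demo_dir)
with tempfile.TemporaryDirectory() as empty_dir:
    missing = BaseParserSketch.find_correct_parser(empty_dir)
```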
3.6.2. GitParser
- class pyvcsshark.parser.gitparser.GitParser[source]
Parser for git repositories. The general parsing process is described in pyvcsshark.parser.gitparser.GitParser.parse().
- Property SIMILARITY_THRESHOLD
sets the threshold for deciding if a file is similar to another. Default: 50%
multiprocessing.cpu_count()
- Property repository
object of class pygit2.Repository, which represents the repository
- Property commits_to_be_processed
dictionary that is set up the following way: commits_to_be_processed = {'<revisionHash>': {'branches': set(), 'tags': []}}, where <revisionHash> must be replaced with the actual hash. This dictionary therefore holds, for every revision, the branches it belongs to and the tags it has.
- Property logger
logger, which is acquired via logging.getLogger("parser")
- Property datastore
datastore, where the commits should be saved to
- Property commit_queue
object of class multiprocessing.JoinableQueue, in which the commits that can be parsed are stored
- add_branch(commit_hash, branch)[source]
Does two things: first, it adds the commit hash to the commit queue, so that the parsing processes can process this commit. Second, it creates an object of type pyvcsshark.parser.models.BranchModel and stores it in the dictionary.
- Parameters
commit_hash – revision hash of the commit to be processed
branch – branch that should be added for the commit
- add_tag(tagged_commit, tag_name, tag_object)[source]
Creates an object of type pyvcsshark.parser.models.TagModel and stores it in the dictionary mentioned above.
- Parameters
tagged_commit – revision hash of the commit to be processed
tag_name – name of the tag that should be added
tag_object – in git it is possible to annotate tags. If a tag is annotated, we get a tag object of class
pygit2.Tag
Note
It can happen that people committed to a tag and thereby created a “tag-branch”, which is normally not possible in git. Therefore, we go through all tags and check whether each one corresponds to a commit that is already in the dictionary. If yes, we tag that commit; if no, we ignore it.
- detect(repository_path)[source]
Tries to detect the repository. If it is not there, an exception is raised and therefore false can be returned.
- initialize()[source]
Initializes the parser. It gets all the branch and tag information and puts it into two different locations: First the commit id is put into the commitqueue for the processing with the parsing processes. Second a dictionary is created, which holds the information of which branches a commit is on and which tags it has
- parse(repository_path, datastore, cores_per_job)[source]
Parses the repository located at repository_path and saves the parsed commits in the datastore by calling the pyvcsshark.datastores.basestore.BaseStore.add_commit() method of the chosen datastore. It mostly uses pygit2 (see: http://www.pygit2.org/).
The parsing process is divided into several steps:
1. A list of all branches and tags is created
2. All branches and tags are parsed: a dictionary of all commits with their corresponding tags and branches is created, and all revision hashes are added to the commit queue
3. The poison pills for terminating the parsing processes are added to the commit_queue
4. Processes of class pyvcsshark.parser.gitparser.CommitParserProcess are created, which parse all commits
- Parameters
repository_path – Path to the repository
datastore – Datastore used to save the data to
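The queue-plus-poison-pill pattern from the steps above can be sketched as follows. To keep the example short and portable it uses threads and queue.Queue instead of multiprocessing.Process and multiprocessing.JoinableQueue; both queue types offer put(), get(), task_done(), and join(), so the structure carries over. All names are hypothetical, and upper-casing the hash merely stands in for real commit parsing.

```python
import queue
import threading

POISON_PILL = None  # sentinel that tells a worker to stop


def worker(commit_queue, results, lock):
    """Take commit hashes from the queue until a poison pill arrives."""
    while True:
        commit_hash = commit_queue.get()
        if commit_hash is POISON_PILL:
            commit_queue.task_done()
            break
        parsed = commit_hash.upper()  # placeholder for parsing the commit
        with lock:  # only one worker at a time hands over its result
            results.append(parsed)
        commit_queue.task_done()


def parse_all(commit_hashes, number_of_workers=2):
    commit_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    for commit_hash in commit_hashes:
        commit_queue.put(commit_hash)
    for _ in range(number_of_workers):  # one pill per worker
        commit_queue.put(POISON_PILL)
    workers = [
        threading.Thread(target=worker, args=(commit_queue, results, lock))
        for _ in range(number_of_workers)
    ]
    for current in workers:
        current.start()
    commit_queue.join()  # wait until every queued item was processed
    for current in workers:
        current.join()
    return sorted(results)  # arrival order is not deterministic


parsed_commits = parse_all(["a1", "b2", "c3"])
```

Because the pills are queued after all commit hashes, each worker only stops once the real work has been taken off the queue.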
- property repository_type
Must return the type for the given repository. E.g. git
- class pyvcsshark.parser.gitparser.CommitParserProcess(queue, commits_to_be_processed, repository, datastore, lock)[source]
A process, which inherits from multiprocessing.Process, that will parse the branches it gets from the queue and call the pyvcsshark.datastores.basestore.BaseStore.add_commit() function to add the commits.
- Property logger
logger acquired by calling logging.getLogger(“parser”)
- Parameters
queue – queue, where the different commithashes are stored in
commits_to_be_processed – dictionary, which contains information about the branches and tags of each commit
repository – repository object of type
pygit2.Repository
datastore – object, that is a subclass of
pyvcsshark.datastores.basestore.BaseStore
lock – lock that is used so that only one process at a time calls the pyvcsshark.datastores.basestore.BaseStore.add_commit() function
- create_hunks(hunks, initial_commit=False)[source]
Creates the diff in the unified format (see: https://en.wikipedia.org/wiki/Diff#Unified_format)
If we have the initial commit, we need to turn around the hunk.* attributes.
- Parameters
hunks – list of objects of class
pygit2.DiffHunk
initial_commit – indicates if we have an initial commit
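For illustration, a unified-format hunk header can be built from a hunk's line ranges as sketched below. HunkStub is a hypothetical stand-in carrying the old_start, old_lines, new_start, and new_lines attributes that pygit2.DiffHunk exposes. Interpreting "turn around" as swapping the old and new values for the initial commit is an assumption.

```python
from collections import namedtuple

# Stand-in with the line-range attributes of a diff hunk.
HunkStub = namedtuple("HunkStub", "old_start old_lines new_start new_lines")


def hunk_header(hunk, initial_commit=False):
    """Return the '@@ -a,b +c,d @@' header for one hunk."""
    if initial_commit:
        # Assumed meaning of "turn around": swap the old and new ranges.
        old_start, old_lines = hunk.new_start, hunk.new_lines
        new_start, new_lines = hunk.old_start, hunk.old_lines
    else:
        old_start, old_lines = hunk.old_start, hunk.old_lines
        new_start, new_lines = hunk.new_start, hunk.new_lines
    return "@@ -%d,%d +%d,%d @@" % (old_start, old_lines, new_start, new_lines)


regular_header = hunk_header(HunkStub(10, 2, 10, 5))
initial_header = hunk_header(HunkStub(1, 3, 0, 0), initial_commit=True)
```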
- get_changed_files_for_initial_commit(commit)[source]
Special function for the initial commit, as we need to diff against the empty tree. Creates the changed files list, to which an object of class pyvcsshark.parser.models.FileModel is added for every changed file in the initial commit.
- Parameters
commit – commit of type
pygit2.Commit
- get_changed_files_with_similiarity(parent, commit)[source]
Creates a list of changed files of the class pyvcsshark.parser.models.FileModel. For every changed file in the commit, such an object is created. Furthermore, hunks are saved and each file is tested for similarity to detect copy and move operations.
- Parameters
parent – object of class pygit2.Commit that represents the parent commit
commit – object of class pygit2.Commit that represents the child commit
- parse_commit(commit)[source]
Function for parsing a commit.
1. changedFiles are created (type: list of pyvcsshark.parser.models.FileModel)
2. author and committer are created (type: pyvcsshark.parser.models.PeopleModel)
3. parents are added (list of strings)
4. commit model is created (type: pyvcsshark.parser.models.CommitModel)
5. pyvcsshark.datastores.basestore.BaseStore.add_commit() is called
- Parameters
commit – commit object of type
pygit2.Commit
Note
The call to pyvcsshark.datastores.basestore.BaseStore.add_commit() is thread/process safe, as a lock is used to regulate the calls