Process documentation.

  Bio::EnsEMBL::Hive::Process

Privates (from "my" definitions)

$g_hive_process_workdir;

  Abstract superclass.  Processes are the individual building blocks 
  of the system.  They are placed in a hive workflow graph of Analysis 
  entries that are linked together with dataflow and AnalysisCtrl rules.
  
  Instances of these Processes are created by the system as work is done.
  The newly created Process will have preset $self->queen, $self->dbc, 
  $self->input_id, $self->analysis and several other variables. 
  From this input and configuration data, each Process can then proceed to 
  do something.  The flow of execution within a Process is:
    fetch_input();
    run();
    write_output();
    DESTROY
  The developer can implement their own versions of fetch_input, run, 
  write_output, and DESTROY to do what they need.  
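
  As a minimal sketch of that lifecycle, the example below overrides the 
  three methods in a subclass.  So that it runs without the hive installed, 
  it uses a hypothetical stand-in base class (Sketch::Process) in place of 
  Bio::EnsEMBL::Hive::Process.  Sketch::CountBases, the hash keys it uses and 
  the driver lines at the end are assumptions made for illustration; in a 
  real pipeline the worker, not user code, calls these methods.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::EnsEMBL::Hive::Process: like the real
# base class, its default fetch_input/run/write_output simply return 1.
package Sketch::Process;
sub new          { my ($class, %args) = @_; return bless {%args}, $class; }
sub input_id     { return $_[0]->{input_id}; }
sub fetch_input  { return 1; }
sub run          { return 1; }
sub write_output { return 1; }

# Hypothetical subclass that overrides the three lifecycle methods.
package Sketch::CountBases;
our @ISA = ('Sketch::Process');

sub fetch_input {
  my $self = shift;
  # Gather everything run() needs from the input_id.
  $self->{sequence} = uc $self->input_id;
  return 1;
}

sub run {
  my $self = shift;
  # Do the actual work: count the G and C characters.
  $self->{gc_count} = ($self->{sequence} =~ tr/GC//);
  return 1;
}

sub write_output {
  my $self = shift;
  # Store the result; a real Process would write to a database or flow it on.
  $self->{result} = "gc=" . $self->{gc_count};
  return 1;
}

package main;

# The worker normally drives these calls, in this order.
my $process = Sketch::CountBases->new(input_id => 'acgtgg');
$process->fetch_input;
$process->run;
$process->write_output;
print $process->{result}, "\n";   # prints gc=4
```

  Keeping the three stages separate means a failure can be traced to 
  reading input, computing, or storing results.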
  
  The entire system is based around the concept of a workflow graph which
  can split and loop back on itself.  This is accomplished by dataflow
  rules (or pipes) that connect one Process (or analysis) to others.
  Where a unix commandline program can send output on the STDOUT and STDERR 
  pipes, a hive Process has access to an unlimited number of pipes referenced 
  by numerical branch_codes. This is accomplished within the Process via 
  $self->dataflow_output_id(...);  
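
  The branch_code idea can be sketched without a hive database.  In the 
  example below the rule table %dataflow_rules, the analysis names and the 
  flow_output helper are all hypothetical; in the real system the rules are 
  stored in the hive database and the routing is done on the Process's 
  behalf.  The sketch only shows that one output_id can fan out to several 
  analyses on one branch and go nowhere on a branch with no rule.

```perl
use strict;
use warnings;

# Hypothetical rule table: branch_code => list of downstream analysis names.
my %dataflow_rules = (
  1 => ['align'],
  2 => ['report', 'archive'],
);

# Hypothetical helper: returns the jobs created for one output_id.
sub flow_output {
  my ($output_id, $branch_code) = @_;
  $branch_code = 1 unless defined $branch_code;   # same default as dataflow_output_id
  my $targets = $dataflow_rules{$branch_code} || [];
  return map { +{ analysis => $_, input_id => $output_id } } @$targets;
}

my @main_jobs  = flow_output("{'chr'=>'1'}");      # branch 1 by default
my @extra_jobs = flow_output("{'chr'=>'1'}", 2);   # two rules on branch 2
my @dropped    = flow_output("{'chr'=>'1'}", 3);   # no rule on branch 3

printf "%d %d %d\n", scalar @main_jobs, scalar @extra_jobs, scalar @dropped;
# prints 1 2 0
```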
  
  The design philosophy is that each Process does its work and creates output, 
  but it doesn't worry about where the input came from, or where its output 
  goes. If the system has dataflow pipes connected, then the output jobs 
  have a purpose; if not, the output work is thrown away.  The workflow graph 
  'controls' the behaviour of the system, not the processes.  The processes just 
  need to do their job.  The design of the workflow graph is based on the knowledge 
  of what each Process does so that the graph can be correctly constructed.
  The workflow graph can be constructed a priori or can be constructed and 
  modified by intelligent Processes as the system runs.
  
  
  The Hive is based on AI concepts and modeled on the social structure and 
  behaviour of a honey bee hive. So where a worker honey bee's purpose is
  (go find pollen, bring back to hive, drop off pollen, repeat), an ensembl-hive 
  worker's purpose is (find a job, create a Process for that job, run it,
  drop off output job(s), repeat).  While most workflow systems are based 
  on 'smart' central controllers and external control of 'dumb' processes, 
  the Hive is based on 'dumb' workflow graphs and a job kiosk, and 'smart' workers 
  (autonomous agents) who are self-configuring and figure out for themselves what 
  needs to be done, and then do it.  The workers are based around a set of 
  emergent behaviour rules which allow a predictable system behaviour to emerge 
  from what otherwise might appear at first glance to be a chaotic system. There 
  is an inherent asynchronous disconnect between one worker and the next.  
  Work items (or jobs) are simply 'posted' on a blackboard or kiosk within the hive 
  database where other workers can find them.  
  The emergent behaviour rules of a worker are:
     1) If a job is posted, someone needs to do it.
     2) Don't grab something that someone else is working on.
     3) Don't grab more than you can handle.
     4) If you grab a job, it needs to be finished correctly
     5) Keep busy doing work
     6) If you fail, do the best you can to report back
  For further reading on the AI principles employed in this design see:
     http://en.wikipedia.org/wiki/Autonomous_Agent
     http://en.wikipedia.org/wiki/Emergence

sub DESTROY {

  my $self = shift;
  $self->SUPER::DESTROY if $self->can("SUPER::DESTROY");
}


######################################################
#
# methods that subclasses can use to get access
# to hive infrastructure
#
######################################################


sub analysis {

  my ($self, $analysis) = @_;

  if($analysis) {
    throw("Not a Bio::EnsEMBL::Analysis object")
      unless ($analysis->isa("Bio::EnsEMBL::Analysis"));
    $self->{'_analysis'} = $analysis;
  }
  return $self->{'_analysis'};

}

sub autoflow_inputjob {

  my $self = shift;
  $self->{'_autoflow_inputjob'} = shift if(@_);
  $self->{'_autoflow_inputjob'}=1 unless(defined($self->{'_autoflow_inputjob'}));  
  return $self->{'_autoflow_inputjob'};

}

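sub check_if_exit_cleanly {

  my $self = shift;

  # Marker files in the honeycomb directory signal that jobs should stop:
  #   relegate.<job dbID>  stops this job only and marks it FAILED
  #   relegate.all         stops any job and resets it to READY
  my $id = $self->input_job->dbID;
  my $honeycomb_dir = $self->{'honeycomb_dir'};
  $honeycomb_dir =~ s/\/$//;    # strip a trailing slash
  my $not_allowed = $honeycomb_dir . "/" . "relegate." . $id;
  my $exit_cleanly = $honeycomb_dir . "/" . "relegate.all";
  if (-e $not_allowed) {
    # The job-specific marker is checked first, so it takes precedence
    $self->update_status('FAILED');
    throw("This job has been relegated to be killed - $id\n");
  } elsif (-e $exit_cleanly) {
    $self->update_status('READY');
    throw("This job has been relegated to be exited - $id\n");
  }
  return undef;
}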
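
The marker-file convention used by check_if_exit_cleanly can be tried out in
isolation.  The sketch below is not part of the module: relegation_status and
touch are hypothetical helpers, the job id 42 is arbitrary, and a temporary
directory stands in for the honeycomb directory.  It only mirrors the two
file tests and their order of precedence.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper mirroring the file checks in check_if_exit_cleanly:
# returns the status the job would be given, or undef if it may continue.
sub relegation_status {
  my ($honeycomb_dir, $job_id) = @_;
  $honeycomb_dir =~ s/\/$//;
  return 'FAILED' if -e "$honeycomb_dir/relegate.$job_id";
  return 'READY'  if -e "$honeycomb_dir/relegate.all";
  return undef;
}

# Hypothetical helper: create an empty marker file.
sub touch {
  my $path = shift;
  open(my $fh, '>', $path) or die "Cannot create $path: $!";
  close($fh);
}

my $dir = tempdir(CLEANUP => 1);
my $before = relegation_status($dir, 42);   # no marker files yet

touch("$dir/relegate.all");
my $all = relegation_status($dir, 42);      # every job is asked to exit

touch("$dir/relegate.42");
my $one = relegation_status($dir, 42);      # the job-specific marker wins

my $label = defined($before) ? $before : 'continue';
print "$label $all $one\n";                 # prints continue READY FAILED
```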


sub dataflow_output_id {

  my ($self, $output_id, $branch_code, $blocked) = @_;

  return unless($output_id);
  return unless($self->analysis);

  $branch_code=1 unless(defined($branch_code));

  # Dataflow works by doing a transform from this process to the next.
  # The job starts out 'attached' to this process hence the analysis_id, branch_code, and dbID
  # are all relative to the starting point.  The dataflow process transforms the job to a 
  # different analysis_id, and moves the dbID to the previous_analysis_job_id
  
  my $job = Bio::EnsEMBL::Hive::AnalysisJob->new;
  $job->input_id($output_id);
  $job->analysis_id($self->analysis->dbID);
  $job->branch_code($branch_code);
  $job->dbID($self->input_job->dbID);
  $job->status('READY');
  $job->status('BLOCKED') if(defined($blocked) and ($blocked eq 'BLOCKED'));
  
  #if process uses branch_code 1 explicitly, turn off automatic dataflow
  $self->autoflow_inputjob(0) if($branch_code==1);

  return $self->queen->flow_output_job($job);

}

sub db {

  my $self = shift;
  return undef unless($self->queen);
  return $self->queen->db;

}

sub dbc {

  my $self = shift;
  return undef unless($self->queen);
  return $self->queen->dbc;

}

sub debug {

  my $self = shift;
  $self->{'_debug'} = shift if(@_);
  $self->{'_debug'}=0 unless(defined($self->{'_debug'}));  
  return $self->{'_debug'};

}

sub encode_hash {

  my $self = shift;
  my $hash_ref = shift;

  return "" unless($hash_ref);

  my $hash_string = "{";
  my @keys = sort(keys %{$hash_ref});
  foreach my $key (@keys) {
    if(defined($hash_ref->{$key})) {
      $hash_string .= "'$key'=>'" . $hash_ref->{$key} . "',";
    }
  }
  $hash_string .= "}";

  return $hash_string;

}
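
To illustrate the string format that encode_hash produces, the sketch below
uses a standalone copy of the same logic (encode_hash_sketch, a hypothetical
name, with the $self argument dropped) so that it runs outside the hive.  The
hash contents are made up for the example.

```perl
use strict;
use warnings;

# Standalone copy of the encode_hash logic (without $self) so the
# example runs outside the hive.
sub encode_hash_sketch {
  my $hash_ref = shift;
  return "" unless $hash_ref;

  my $hash_string = "{";
  foreach my $key (sort keys %{$hash_ref}) {
    next unless defined $hash_ref->{$key};   # undefined values are skipped
    $hash_string .= "'$key'=>'" . $hash_ref->{$key} . "',";
  }
  return $hash_string . "}";
}

my $encoded = encode_hash_sketch({ genome_db_id => 3, chr_name => 'X', note => undef });
print "$encoded\n";   # prints {'chr_name'=>'X','genome_db_id'=>'3',}
```

Keys are emitted in sorted order, so equal hashes always encode to the same
string.  Values are wrapped in single quotes without escaping, so a value
that itself contains a single quote would not encode cleanly.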

sub fetch_input {

  my $self = shift;
  return 1;

}

sub input_id {

  my $self = shift;
  return '' unless($self->input_job);
  return $self->input_job->input_id;

}

sub input_job {

  my( $self, $job ) = @_;
  if($job) {
    throw("Not a Bio::EnsEMBL::Hive::AnalysisJob object")
        unless ($job->isa("Bio::EnsEMBL::Hive::AnalysisJob"));
    $self->{'_input_job'} = $job;
  }
  return $self->{'_input_job'};

}

sub new {

  my ($class,@args) = @_;
  my $self = bless {}, $class;
  
  my ($analysis) = rearrange([qw( ANALYSIS )], @args);
  $self->analysis($analysis) if($analysis);
  
  return $self;
}


##########################################
#
# methods subclasses should override 
# in order to give this process function
#
##########################################


sub output {

  my ($self) = @_;

  unless (defined $self->{'output'}) {
    $self->{'output'} = [];
    foreach my $r (@{$self->runnable}){
      push(@{$self->{'output'}}, @{$r->output});
    }
  }

  return @{$self->{'output'}};

}

sub parameters {

  my $self = shift;
  return '' unless($self->analysis);
  return $self->analysis->parameters;

}

sub queen {

  my $self = shift;
  $self->{'_queen'} = shift if(@_);
  return $self->{'_queen'};

}

sub run {

  my $self = shift;
  return 1;

}

sub runnable {

  my ($self,$arg) = @_;

  if (!defined($self->{'runnable'})) {
      $self->{'runnable'} = [];
  }
  
  if (defined($arg)) {
    if ($arg->isa("Bio::EnsEMBL::Analysis::Runnable")) {
      push(@{$self->{'runnable'}},$arg);
    } else {
      throw("[$arg] is not a Bio::EnsEMBL::Analysis::Runnable");
    }
  }
  return $self->{'runnable'};

}

sub worker {

  my $self = shift;
  $self->{'_worker'} = shift if(@_);
  return $self->{'_worker'};

}

sub worker_temp_directory {

  my $self = shift;
  return undef unless($self->worker);
  return $self->worker->worker_process_temp_directory;
}

#################################################
#
# methods to make porting from RunnableDB easier
#
#################################################


sub write_output {

  my $self = shift;
  return 1;

}

General documentation

  Contact Jessica Severin on EnsEMBL::Hive implementation/design details: jessica@ebi.ac.uk
  Contact Ewan Birney on EnsEMBL in general: birney@sanger.ac.uk

  The rest of the documentation details each of the object methods. 
  Internal methods are usually prefixed with an underscore (_).
