This document provides tips and references for troubleshooting your Unicenter
AutoSys 4.5.x implementation.
Additional Resources
http://supportconnectw.ca.com/public/autosys/infodocs/autosys-menu.asp
http://supportconnectw.ca.com/premium/unicenter/implementationcd/AutoSys/Autosys_Frame.htm
Note that, although this link contains older information, the basic AutoSys
tips still apply.
Troubleshooting Steps
Clearly state what happened that should not have happened – or what did
not happen that should have happened. Be sure to note the scope of the
problem – including the date\time of occurrence, affected
jobs\machines\users\network as well as any preceding jobs or recent
activity on the machines involved in the transaction.
Do this for all affected components and note any security\firewall settings
that may be in effect.
3. Confirm communication
Verify that the affected machines can “talk” to one another to determine if
the fault lies with a possible network\firewall\permissions error.
Verify that the job syntax is correct. This helps determine if the fault lies
with the scheduling system or with the job itself.
Check to see what happened on the affected machine(s) and ensure that
the system date\clock are correct – especially when using date\time
related job parameters such as start_times and run_window.
Use the autosyslog command to view either the event processor log file or
the Remote Agent log file for a specified job. Both the Remote Agent and
Event Processor write diagnostic messages to their respective logs, as part
of their normal operations and in response to detected error conditions.
The event processor logs all events it processes and provides a detailed
trace of its activities. The Remote Agent’s log displays the log for the
specific job’s most recent run. Although the Remote Agent’s log file is
automatically deleted by default after a successful job run, the log file will
not be deleted at job completion if the job ended with a FAILURE status.
The event processor log also contains a timestamped history of each event
that occurs. Viewing this log is an alternative to monitoring “all jobs” and
“all events.”
For more information on autosyslog, consult the Unicenter AutoSys
Reference Guide for Windows and Unix.
Once a solution has been applied, take steps to prevent a recurrence. Typically,
this involves education, development\documentation of standardized
processes and conventions (e.g., naming conventions), or application of
necessary patches (and establishing a process to ensure that patching is not
allowed to lapse).
Prevention
Ensure that you (and anyone else who will be scheduling jobs through
AutoSys) understand the architecture (e.g., components and their
relationships, firewall requirements, job submission authorizations, etc.) and
follow agreed upon standards for defining and submitting jobs – including file
naming conventions and calendars.
Note: Naming conventions for jobs, calendars, variables and views should be
clearly established as part of the initial architecture and consistently enforced
throughout the implementation.
In some cases, a job’s failure to execute has more to do with the job itself
than the scheduling system. Therefore, one of your first troubleshooting steps
should be to verify the validity of the job – including its syntax and access to
required resources.
If the job failed because the command being executed by the job returned an
error, run the AutoSys command autorep -J jobname -d and investigate why the
Job abended:
In the example above, the command executed by the Job returned an exit
code of “1” upon completion (see “Pri/Xit” column). Notice that AutoSys
attempted to run the Job twice (as seen in the Ntry column, which notes the
number of restart attempts). At first, the job failed because the Remote Agent
was not running (“Connect to socket FAILED”). However, that was
corrected and AutoSys resubmitted the Job, which failed again for the same
reason.
Make sure that the correct syntax is provided to enable the command,
executable, UNIX shell script, application, or batch file (and its parameters) to
run on the Remote Agent Client (when all necessary conditions are met). Keep
the following in mind when using the command attribute in Job definitions:
You cannot redirect standard input, output, and error files in the command
attribute. Use other job attributes, such as std_in_file for standard
input, to provide the necessary functionality.
When specifying drive letters in job definitions, you must enclose the colon
character with quotation marks or backslashes. For example, C\:\tmp or
"C:\tmp" is valid; C:\tmp is not.
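The drive-letter rule above can be applied mechanically before a path is placed in a job definition. The following is a minimal Python sketch, not part of the product; the helper name escape_drive_colons is hypothetical, and it assumes that a drive letter is a single letter followed by a colon and a slash or backslash.

```python
import re

# Hypothetical helper: matches a single drive letter that is not part of a
# longer word, followed by ':' and then a backslash or forward slash.
_DRIVE = re.compile(r"(?<![A-Za-z0-9])([A-Za-z]):(?=[\\/])")

def escape_drive_colons(text):
    """Insert a backslash before the colon of every drive letter in text."""
    # A colon that is already escaped follows a backslash, not the drive
    # letter, so it does not match and is left unchanged.
    return _DRIVE.sub(lambda m: m.group(1) + "\\:", text)

print(escape_drive_colons(r"C:\tmp"))  # C\:\tmp
```

Running the helper on an already escaped value leaves it unchanged, so it can be applied safely more than once.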
Job Runs on Command Prompt but not through AutoSys
Jobs can also fail because the job owner’s ID and/or password has not been
defined to AutoSys security, or because the owner does not have permission to
start a Job on a Client.
When an Agent runs a job on a computer, it logs on as the user who owns the
job. To enable the Agent to do this, the Scheduler passes both the job
information and an encrypted version of the job owner’s password from the
database to the Agent. You must ensure that the password you provide is
valid!
In the following example the job could not run because user “Autosys” or its
password had not been defined to the AutoSys security.
To remedy this, first log on as the EDIT superuser and run autosys_secure:
autosys_secure will prompt for credentials. Enter the correct user name,
host or domain, and password:
autosys_secure can also be executed fully at the command prompt without
requiring interaction.
If a job’s starting conditions have not been met, run the AutoSys job_depends
-J jobname -d command to see why it could not start at its start time:
For example:
In the example above, the job’s starting conditions had not been met because
it can only run if its predecessor returns a 0 (exitcode=0). However, since
the predecessor job was still running (and, therefore, had not yet returned a
“0”) when the job’s date condition was met, it could not start. To avoid this
type of problem, make sure that the job’s start_times attribute is set
appropriately.
Maximum and Minimum Run Time Errors
If the job failed because it exceeded its maximum run time (specified through
term_run_time), it is taking longer than the specified time to finish, which
might indicate that the job is stuck in a loop or is waiting for additional data.
Therefore, you should:
Make sure that the job is not stuck in a loop or waiting for data that has
never arrived.
Also, make sure that the maximum run time threshold is adequate.
Note: Keep in mind that if you used the max_run_alarm attribute, exceeding
the limit will send an alarm – it will not cause the job to terminate.
Conversely, a job might also fail to meet its minimum run time, finishing
sooner than expected, which could also indicate that it is not running properly.
In this case you should:
Make sure that the job is getting all the data it needs to run properly.
The run_window attribute controls only when the job starts — not when it
stops. If a job definition contains the run_window attribute, once the job
becomes eligible to run (based on its starting conditions), Unicenter AutoSys
JM verifies whether the specified run window includes the current time. If it
does, the job starts. If it does not, the product determines whether to run the
job based on the end of the previous run window and the beginning of the
next run window.
autorep -J jobname -d
For example:
The run_window attribute is not, in itself, a starting condition — it is an
additional control over when a job may start after its starting conditions are
satisfied. This attribute is especially useful, for example, when you do not
know when a monitored file may arrive and there are specific times when a job
dependent on the monitored file should not run.
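The first part of the run window check described above (does the window include the current time?) can be illustrated with a short Python sketch. It is not the product's implementation: the function names are hypothetical, the window is assumed to be written as "HH:MM-HH:MM", and both endpoints are assumed to be inclusive. The sketch also handles a window that spans midnight.

```python
from datetime import time

def parse_run_window(value):
    """Parse a window written as 'HH:MM-HH:MM' into (start, end) times."""
    start_text, end_text = value.split("-")
    start_h, start_m = (int(part) for part in start_text.strip().split(":"))
    end_h, end_m = (int(part) for part in end_text.strip().split(":"))
    return time(start_h, start_m), time(end_h, end_m)

def in_run_window(now, window):
    """Return True if the time 'now' falls inside the run window."""
    start, end = parse_run_window(window)
    if start <= end:
        # Window within a single day, e.g. 08:00-10:00.
        return start <= now <= end
    # Window spanning midnight, e.g. 23:00-02:00.
    return now >= start or now <= end

print(in_run_window(time(23, 30), "23:00-02:00"))  # True
print(in_run_window(time(12, 0), "23:00-02:00"))   # False
```

The sketch does not model what the product does when the current time falls outside the window; it only covers the membership test.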
You should also consider the availability of resources required by the job. For
example, notice that the Job below is queued and that it has a short run
window.
This job may not start before the end of the run window because its load
(job_load attribute) added to the load of the running job may exceed the
max_load attribute of the machine they run on. In fact, that is exactly what
happened in the example above:
Here you can see that the job did not run because there were not enough
resources available before the end of its run window.
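The load arithmetic behind this behavior can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and it assumes that a job may start when the combined load is less than or equal to the machine's max_load.

```python
def can_start(job_load, running_loads, max_load):
    """Return True if the job's load fits on the machine right now.

    job_load      -- job_load attribute of the job waiting to start
    running_loads -- job_load values of the jobs currently running
    max_load      -- max_load attribute of the machine
    """
    # Assumption: a combined load exactly equal to max_load is allowed.
    return sum(running_loads) + job_load <= max_load

# A queued job with load 60 cannot start while a job with load 50 is
# running on a machine whose max_load is 100.
print(can_start(60, [50], 100))  # False
print(can_start(60, [], 100))    # True
```

If the running job does not finish before the queued job's run window closes, the queued job never gets a chance to start.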
Retries Limit
If Wait Time > Max Restart Wait, then WaitTime = Max Restart Wait.
If necessary, define the number of times to attempt to restart the job after it
exits with a failure status. The n_retrys value can be set to any integer
ranging from 0 to 20 (default: 0 – the job will not restart). For example:
n_retrys: 3
specifies that the job will automatically restart up to three times after an
application failure. The job starts as scheduled and, if it fails, restarts
up to three times, for a total of four attempts.
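The retry count and the cap on the restart wait time can be combined in a small Python sketch. It is illustrative only: the function names are hypothetical, and the assumption that the wait grows linearly with each retry (wait_per_try multiplied by the retry number) is not stated in this document. Only the cap itself (Wait Time never exceeds Max Restart Wait) and the meaning of n_retrys come from the text above.

```python
def total_attempts(n_retrys):
    """One scheduled start plus up to n_retrys restarts."""
    return 1 + n_retrys

def restart_waits(n_retrys, wait_per_try, max_restart_wait):
    """Return the wait (in seconds) before each restart, capped at the maximum.

    wait_per_try is a hypothetical per-retry factor; linear growth is an
    assumption made for this sketch.
    """
    waits = []
    for retry in range(1, n_retrys + 1):
        wait = wait_per_try * retry
        # If Wait Time > Max Restart Wait, then Wait Time = Max Restart Wait.
        waits.append(min(wait, max_restart_wait))
    return waits

print(total_attempts(3))         # 4
print(restart_waits(3, 10, 25))  # [10, 20, 25]
```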
Make sure the job is scheduled according to its date/time condition. These are
specified by the days_of_week, start_times, start_mins, and run_calendar
attributes. Attempting to start the Job via sendevent -E STARTJOB -J
jobname -T "MM/DD/YYYY HH:MM" will result in a date/time condition failure.
The Job report will show:
In the example above, you will see that the job is scheduled to run on 08/07/2007
at 21:46 (Job definition). However, it was manually scheduled to run on the
present date at 22:55. The Event State (ES) is Processed (PD), but the Job
Status (ST) is Inactive (IN).
When a job runs longer than the time specified by the term_run_time
attribute, it is terminated by Unicenter AutoSys JM.
Note: Under Windows, processes launched by user applications or batch
(*.bat) files are not terminated. Unicenter AutoSys JM only terminates the
CMD.EXE process that it used to launch the job. Otherwise, Unicenter AutoSys
JM kills the process specified in the command definition. In UNIX, all child
processes associated with the command process are killed.
Define the maximum number of minutes the job should ever require to finish
normally, if necessary. term_run_time can be set to any integer (default: 0 –
the job is allowed to run forever). For example:
term_run_time: 15
specifies that the job will terminate if it runs for more than 15 minutes.
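The difference between term_run_time (the job is terminated) and max_run_alarm (an alarm is sent, but the job keeps running) can be summarized in a short Python sketch. The function name and the returned labels are hypothetical, and treating a max_run_alarm of 0 as "no alarm configured" is an assumption made for this illustration.

```python
def runtime_actions(elapsed_minutes, term_run_time=0, max_run_alarm=0):
    """Return the actions triggered by a job that has run for elapsed_minutes."""
    actions = []
    # Assumption: 0 means that no alarm threshold is configured.
    if max_run_alarm and elapsed_minutes > max_run_alarm:
        actions.append("alarm")
    # term_run_time of 0 (the default) lets the job run forever.
    if term_run_time and elapsed_minutes > term_run_time:
        actions.append("terminate")
    return actions

print(runtime_actions(16, term_run_time=15))                    # ['terminate']
print(runtime_actions(16, max_run_alarm=15))                    # ['alarm']
print(runtime_actions(500))                                     # []
```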
In some cases a job does not execute because network problems, such
as name resolution errors, or firewall configurations prevent the
AutoSys Scheduler from reaching the Job Management Agent in the
first place. Use diagnostic tools, such as tracert and pathping, to help
determine problems such as broken links.
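As a complement to tracert and pathping, name resolution and basic TCP reachability can be checked with a few lines of Python. This is a generic sketch, not an AutoSys utility: the function names are hypothetical, and the host name and port that you pass in (for example, the port your Agent listens on) depend on your installation and are not specified in this document.

```python
import socket

def resolves(host):
    """Return True if the host name can be resolved to an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

A host that resolves but refuses or times out on the connection suggests a firewall or a stopped service rather than a name resolution problem.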
Verify that the required CCI components are running by executing the
following command:
ccicntrl status
You should also make sure CCI is sending and receiving by using CCIR and
CCIS utilities. For example:
If the required CCI components are running and there are no network related
issues, verify that the Event Management components are running by
executing the following commands:
unifstat -c evtd
unifstat -c evtr
For example:
In the example above, the Event Management components, which are essential
to remediation, have in fact stopped. Restart Event Management by running the
following command:
Diagnostic tools such as tracert and pathping can help determine problems
such as broken links.