
Troubleshooting Tips for Unicenter AutoSys 4.5.x
This document provides tips and references for troubleshooting your Unicenter
AutoSys 4.5.x implementation.

Additional Resources

The Unicenter AutoSys User Guide provides helpful troubleshooting
information, particularly in Chapter 14 “Troubleshooting”, Appendix B
“Troubleshooting CCI” and Appendix C “General Debugging.” You should also
consult the following links on the CA Support Online website:

• Techdoc Index for AutoSys

http://supportconnectw.ca.com/public/autosys/infodocs/autosys-menu.asp

Notable troubleshooting topics include the following:

– TEC424427: “I am getting the following message at my command
prompt: Unrecognized job type”

– TEC411871: “AutoSys 4.5.1 will not execute commands with a recent
12.5.1 Sybase Client installed on Remote Agents. When running
command (autorep -G all) on a Remote Agent you get following
message; This application failed to start because libsybdb.dll not
found.”

– TEC411389: “Error message received when trying to save a new job:
"Unexpected Response from Listener with ATSYS 4.5/ UWCC”

– TEC406557: “Trying to start the eventor processor by running
$AUTOSYS/bin/eventor. The event_demon fails to start and no real
explanation or error is displayed as to why.”

– TEC405710: “eTrust Access Control Locks out the Telnet in Unicenter
AutoSys 4.5”

– TEC428525: “Running "autoping" from the event processor machine to
a newly installed Windows remote agent fails stating "read failed win
socket error =10054". Attempts to telnet from the EP machine to the
Remote Agent machine also fail.”

• Implementation CD AutoSys section

http://supportconnectw.ca.com/premium/unicenter/implementationcd/AutoSys/Autosys_Frame.htm

Note that, although this link contains older information, the basic AutoSys
tips still apply.

Troubleshooting Steps

In troubleshooting your Unicenter AutoSys implementation (or any product
implementation, for that matter), you should take care to ensure your approach
is consistent, repeatable, inquisitive and well documented. In general you will
need to:

1. Define the problem

Clearly state what happened that should not have happened, or what did
not happen that should have happened. Be sure to note the scope of the
problem, including the date and time of occurrence, the affected jobs,
machines, users and network, and any preceding jobs or recent activity on
the machines involved in the transaction.

2. Identify versions and patch levels

Do this for all affected components and note any security or firewall
settings that may be in effect.

3. Confirm communication

Verify that the affected machines can “talk” to one another to determine
whether the fault lies with a network, firewall or permissions error.

4. Execute the job manually

Verify that the job syntax is correct. This helps determine if the fault lies
with the scheduling system or with the job itself.

5. Check logs and system date/clock

Check to see what happened on the affected machine(s) and ensure that
the system date and clock are correct, especially when using date/time
related job parameters such as start_times and run_window.

Use the autosyslog command to view either the event processor log file or
the Remote Agent log file for a specified job. Both the Remote Agent and
Event Processor write diagnostic messages to their respective logs, as part
of their normal operations and in response to detected error conditions.

The syntax for this command is:

autosyslog [-e | -J job_name] [-p]

The event processor logs all events it processes and provides a detailed
trace of its activities. The Remote Agent’s log displays the log for the
specific job’s most recent run. Although the Remote Agent’s log file is
automatically deleted by default after a successful job run, the log file will
not be deleted at job completion if the job ended with a FAILURE status.

The event processor log also contains a timestamped history of each event
that occurs. Viewing this log is an alternative to monitoring “all jobs” and
“all events.”

For more information on autosyslog, consult the Unicenter AutoSys
Reference Guide for Windows and Unix.

6. Document the solution

Once a solution has been applied, take steps to prevent a recurrence.
Typically, this involves education, development and documentation of
standardized processes and conventions (e.g., naming conventions), or
application of necessary patches (and establishing a process that keeps
patching from lapsing).
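
The autosyslog syntax shown in step 5 can be wrapped in a small helper when the same check is repeated across many jobs. The following Python sketch only assembles the argument list; the function name and the job name used in the test are hypothetical, and the -p option is passed through without interpretation because its meaning is not described in this document.

```python
from typing import List, Optional


def build_autosyslog_args(job_name: Optional[str] = None,
                          p_flag: bool = False) -> List[str]:
    """Assemble an argument list matching: autosyslog [-e | -J job_name] [-p].

    With no job name the event processor log is requested (-e);
    with a job name the Remote Agent log for that job is requested (-J).
    """
    args = ["autosyslog"]
    if job_name is None:
        args.append("-e")               # event processor log
    else:
        args.extend(["-J", job_name])   # Remote Agent log for one job
    if p_flag:
        args.append("-p")               # passed through as-is (see lead-in)
    return args
```

The resulting list could be handed to subprocess.run on a machine where the AutoSys environment is already set up.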

Prevention

Unicenter AutoSys provides a highly flexible, very customizable scheduling tool
for managing your workload environment; however, you should be fully aware
of the implications and restrictions inherent in any of the scheduling options
before employing them. For example, if a predecessor job regularly exceeds
its term_run_time, it can cause a ripple effect felt through the full chain of its
dependent jobs.

Ensure that you (and anyone else who will be scheduling jobs through
AutoSys) understand the architecture (e.g., components and their
relationships, firewall requirements, job submission authorizations, etc.) and
follow agreed upon standards for defining and submitting jobs – including file
naming conventions and calendars.

Note: Naming conventions for jobs, calendars, variables and views should be
clearly established as part of the initial architecture and consistently enforced
throughout the implementation.

Job Related Errors

In some cases, a job’s failure to execute has more to do with the job itself
than with the scheduling system. Therefore, one of your first troubleshooting
steps should be to verify the validity of the job, including its syntax and
access to required resources.

If the job failed because the command being executed by the job returned an
error, run the autorep -J jobname -d command and investigate why the job
abended:

In the example above, the command executed by the Job returned an exit
code of “1” upon completion (see “Pri/Xit” column). Notice that AutoSys
attempted to run the Job twice (as seen in the Ntry column, which notes the
number of restart attempts). At first, the job failed because the Remote Agent
was not running (“Connect to socket FAILED”). However, that was
corrected and AutoSys resubmitted the Job, which failed again for the same
reason.

Make sure that the correct syntax is provided to enable the command,
executable, UNIX shell script, application, or batch file (and its parameters) to
run on the Remote Agent Client (when all necessary conditions are met). Keep
the following in mind when using the command attribute in Job definitions:

• You cannot redirect standard input, output, and error files in the command
attribute. Use other job attributes, such as std_in_file for standard
input, to provide the necessary functionality.

• Environment variables for the command are defined by a default profile or
the profile specified in the job definition.

• Although system environment variables are automatically set in the
command’s environment, user environment variables are not. You must
define all other required environment variables in the job’s profile.

• If a command works properly when issued at the command prompt, but
fails to run properly when specified as a command attribute, the necessary
user-defined environment variables and the variables defined in the job
profile are probably different. If this is the case, compare the variables to
verify that all required user environment variables are defined in the job
profile. Information on how to do this can be found in the User Guides.

• When specifying drive letters in job definitions, you must enclose the colon
character with quotation marks or backslashes. For example, C\:\tmp or
"C:\tmp" is valid; C:\tmp is not.

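The drive-letter rule above is easy to overlook when job definitions are generated by scripts. As a minimal Python sketch (the function name is hypothetical and not part of AutoSys), the colon after a leading drive letter can be backslash-escaped before the path is written into a job definition:

```python
import re


def escape_drive_colon(path: str) -> str:
    """Backslash-escape the colon that follows a leading drive letter.

    r"C:\tmp" becomes r"C\:\tmp". Paths that are already escaped, or that
    do not start with a drive letter, are returned unchanged.
    """
    # ^([A-Za-z]): matches only a drive letter immediately followed by ':'
    return re.sub(r"^([A-Za-z]):", r"\1\\:", path)
```

The sketch covers only the backslash form; wrapping the whole path in quotation marks is the alternative the text mentions.
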
Job Runs on Command Prompt but not through AutoSys

If a command runs manually at a Windows command prompt but fails with
“job returned =-1” when run through AutoSys, check the system's PATH variable
to see whether the path to the command contains spaces. Place the command's
bin location at the beginning of the PATH variable under the "System"
Environment Variables in the Administrator GUI, either surrounded by
double quotes or using "~1" in place of the portion of the PATH definition that
contains spaces. This allows the command to be found.

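To find the PATH entries that need this treatment, a short script can list every entry that contains a space and is not already quoted. This is a Python sketch under stated assumptions: the function name is hypothetical, and it inspects whatever PATH string it is given (or the current process's PATH), which may differ from the "System" PATH the Remote Agent sees.

```python
import os
from typing import List, Optional


def path_entries_with_spaces(path_value: Optional[str] = None,
                             sep: str = ";") -> List[str]:
    """Return PATH entries that contain a space and are not double-quoted."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [entry for entry in path_value.split(sep) if entry]
    return [
        entry for entry in entries
        if " " in entry
        and not (entry.startswith('"') and entry.endswith('"'))
    ]
```

Entries written in the "~1" short form contain no spaces, so they are not reported.
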
Password and Permission Errors (Windows Only)

Jobs can also fail because the job owner’s ID and/or password has not been
defined to AutoSys security, or because the owner does not have permission
to start a job on a Client.

When an Agent runs a job on a computer, it logs on as the user who owns the
job. To enable the Agent to do this, the Scheduler passes both the job
information and an encrypted version of the job owner’s password from the
database to the Agent. You must ensure that the password you provide is
valid!

The EDIT Superuser can use autosys_secure interactively or from the
command line to enter these IDs and/or passwords. After the EDIT Superuser
enters the IDs and passwords, any user who knows an existing user ID and
password can change the password or delete that user ID and password.

In the following example the job could not run because user “Autosys” or its
password had not been defined to the AutoSys security.

To remedy this, first log on as the EDIT Superuser and run autosys_secure:

Select option 5: Manage AutoSys User@Host users. Then, select 1:
Create AutoSys User@Host or Domain Password.

autosys_secure will prompt for credentials. Enter the correct user name,
host or domain, and password:

autosys_secure can also be executed fully at the command prompt without
requiring interaction.

Scheduling Problems in the Job Definition

If you include scheduling options, such as max_run_alarm, term_run_time or
run_window, it is critical that you understand how these parameters work and
how long the job typically takes to run, particularly when there are many
dependencies spanning multiple platforms and machines.

If a job’s starting conditions have not been met, run the job_depends
-J jobname -d command to see why it could not start at its start time:

For example:

In the example above, the job’s starting conditions had not been met because
it can only run if its predecessor returns a 0 (exitcode=0). However, since
the predecessor job was still running (and, therefore, had not yet returned a
“0”) when the job’s date condition was met, it could not start. To avoid this
type of problem, make sure that the job’s start_times attribute is set
appropriately.

In the following example, the output of the job_depends -J jobname -d
command shows that the job’s starting conditions have not been met because
it can only run if its predecessor runs successfully. Since its predecessor
failed, it cannot be started.

Maximum and Minimum Run Time Errors

If the job failed because it exceeded its maximum run time (specified through
term_run_time), it is taking longer than the specified time to finish, which
might indicate that the job is stuck in a loop or is waiting for additional data.
Therefore, you should:

• Make sure that the job is not stuck in a loop or waiting for data that has
never arrived.

• Make sure that the maximum run time threshold is adequate.

Note: Keep in mind that if you used the max_run_alarm attribute, exceeding
the limit will send an alarm – it will not cause the job to terminate.

Conversely, a job might also fail to meet its minimum run time, finishing
sooner than expected, which could also indicate that it is not running properly.
In this case you should:

• Make sure that the job is getting all the data it needs to run properly.

• Verify that the minimum run time threshold is adequate.
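
The difference between an alarm threshold and a termination threshold can be illustrated with a small Python sketch. It is not how AutoSys evaluates these limits internally; the function name, the min_run_alarm parameter name, and the convention that 0 disables a threshold for the alarm attributes are assumptions made for illustration (this document states the 0 default only for term_run_time).

```python
from typing import List


def run_time_actions(elapsed_min: int,
                     min_run_alarm: int = 0,
                     max_run_alarm: int = 0,
                     term_run_time: int = 0,
                     finished: bool = False) -> List[str]:
    """List which run-time thresholds a job has crossed.

    All values are in minutes; 0 is treated as "threshold not set".
    """
    actions = []
    # A finished job that ended too quickly only raises an alarm.
    if finished and min_run_alarm and elapsed_min < min_run_alarm:
        actions.append("min_run_alarm")
    # Exceeding max_run_alarm raises an alarm but does not stop the job.
    if max_run_alarm and elapsed_min > max_run_alarm:
        actions.append("max_run_alarm")
    # Exceeding term_run_time is the only case that terminates the job.
    if term_run_time and elapsed_min > term_run_time:
        actions.append("terminate")
    return actions
```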

Missed Run Window

The run_window attribute controls only when the job starts — not when it
stops. If a job definition contains the run_window attribute, once the job
becomes eligible to run (based on its starting conditions), Unicenter AutoSys
JM verifies whether the specified run window includes the current time. If it
does, the job starts. If it does not, the product determines whether to run the
job based on the end of the previous run window and the beginning of the
next run window.
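
The first part of that check, whether the current time falls inside the run window, can be sketched in Python as follows. The function name is hypothetical, the inclusive boundaries are an assumption, and the sketch does not model how the product chooses between the previous and next windows when the current time is outside the window.

```python
from datetime import time


def in_run_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies inside the window from `start` to `end`.

    A window whose end is earlier than its start is treated as spanning
    midnight (for example 23:00 to 02:00).
    """
    if start <= end:
        return start <= now <= end
    # Window wraps past midnight: late-evening or early-morning times match.
    return now >= start or now <= end
```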

To see what happened, execute the following command:

autorep -J jobname -d

For example:

The run_window attribute is not, in itself, a starting condition — it is an
additional control over when a job may start after its starting conditions are
satisfied. This attribute is especially useful, for example, when you do not
know when a monitored file may arrive and there are specific times when a job
dependent on the monitored file should not run.

Therefore, make sure that the job’s condition (starting conditions),
date_condition (date/time conditions), and run_window attributes are all set
appropriately (for example, a run window cannot span more than 24 hours).
Then, if the job is on hold, make sure to run sendevent -E JOB_OFF_HOLD -J
jobname before the end of the run window.

You should also consider the availability of resources required by the job. For
example, notice that the Job below is queued and that it has a short run
window.

This job may not start before the end of the run window because its load
(job_load attribute) added to the load of the running job may exceed the
max_load attribute of the machine they run on. In fact, that is exactly what
happened in the example above:

Here you can see that the job did not run because there were not enough
resources available before the end of its run window.
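
The load arithmetic behind this situation is simple to check by hand or in a script. The following Python sketch uses a hypothetical function name and illustrative numbers; it models only the comparison described above, that a queued job can start when its job_load plus the loads of the jobs already running does not exceed the machine's max_load.

```python
from typing import Iterable


def fits_machine_load(job_load: int,
                      running_loads: Iterable[int],
                      max_load: int) -> bool:
    """Return True if the queued job fits within the machine's max_load."""
    return sum(running_loads) + job_load <= max_load
```

For example, with max_load set to 100 and one running job with a job_load of 50, a queued job with a job_load of 60 stays queued until the running job finishes.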

Retries Limit

When a job exceeds the maximum number of retries specified by n_retrys in
the job definition or Max Restart Trys in the instance configuration, it exits
with a failure status. The n_retrys attribute applies to application
failures, for example, where AutoSys is unable to find a file or command, or
where permissions were not properly set. It does not apply to system or
network failures such as when a computer is unavailable or a socket
connection has timed out. Job restarts after system or network failures are
controlled by the Max Restart Trys setting on the Unicenter AutoSys JM
Administrator Scheduler window.

The delay between restarts is determined by the Restart Constant and
Restart Factor settings on the Unicenter AutoSys JM Administrator
Scheduler window, which are limited by the maximum specified by the Max
Restart Wait setting.

The following formula is used to calculate wait time:

Wait Time = Restart Constant + (Max Restart Trys * Restart Factor)

If Wait Time > Max Restart Wait, then Wait Time = Max Restart Wait.
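
As a worked illustration, the formula and its cap can be written as a short Python function. The function and parameter names are hypothetical, the multiplier follows the formula exactly as written above, and the sample values are illustrative only (units are whatever the Scheduler settings use).

```python
def restart_wait_time(restart_constant: int,
                      restart_factor: int,
                      max_restart_trys: int,
                      max_restart_wait: int) -> int:
    """Apply: Wait Time = Restart Constant + (Max Restart Trys * Restart Factor),
    capped at Max Restart Wait."""
    wait_time = restart_constant + (max_restart_trys * restart_factor)
    return min(wait_time, max_restart_wait)
```

With a Restart Constant of 10, a Restart Factor of 5, Max Restart Trys of 10 and Max Restart Wait of 300, the result is 10 + (10 * 5) = 60. Raising the Restart Factor to 50 gives 510, which is capped at 300.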

If necessary, define the number of times to attempt to restart the job after it
exits with a failure status. The n_retrys value can be set to any integer
ranging from 0 to 20 (default: 0 – the job will not restart). For example:

n_retrys: 3

specifies that the job will automatically restart up to three times after an
application failure. This means that the job would start as scheduled and, if it
fails, would restart up to three times, for a total of four attempts.

Job Date\Time Conditions Not Met

Make sure the job is scheduled according to its date/time conditions. These are
specified by the days_of_week, start_times, start_mins, and run_calendar
attributes. Attempting to start the job via sendevent -E STARTJOB -J
jobname -T "MM/DD/YYYY HH:MM" will result in a date/time condition failure.
The job report will show:

In the example above, the job is scheduled to run on 08/07/2007
at 21:46 (Job definition). However, it was manually scheduled to run on the
present date at 22:55. The Event State (ES) is Processed (PD), but the Job
Status (ST) is Inactive (IN).

Term Run Time Limit Exceeded

A job may terminate either because it exceeded its term_run_time attribute,
which designates the maximum run time (in minutes) that the job should
require to finish normally, or because it was killed with a command such as
sendevent -E KILLJOB -J jobname.

When a job runs longer than the time specified by the term_run_time
attribute, it is terminated by Unicenter AutoSys JM.

Note: Under Windows, processes launched by user applications or batch
(*.bat) files are not terminated. Unicenter AutoSys JM only terminates the
CMD.EXE process that it used to launch the job. Otherwise, Unicenter AutoSys
JM kills the process specified in the command definition. In UNIX, all child
processes associated with the command process are killed.

Define the maximum number of minutes the job should ever require to finish
normally, if necessary. term_run_time can be set to any integer (default: 0 –
the job is allowed to run forever). For example:

term_run_time: 15

specifies that the job will terminate if it runs for more than 15 minutes.

Unable to Contact Machine

In some cases a job does not execute because network problems, such
as name resolution errors, or firewall configurations prevent the
AutoSys Scheduler from reaching the Job Management Agent in the
first place. Use diagnostic tools, such as tracert and pathping, to help
determine problems such as broken links.
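
Before reaching for route-tracing tools, it can help to confirm the two basics: that the host name resolves, and that a TCP connection to the agent's port can be opened. The following Python sketch uses only the standard library; the function names are hypothetical, and the port to test is an assumption that depends on how the Remote Agent is configured at your site.

```python
import socket


def resolves(host: str) -> bool:
    """Return True if the host name (or IP address) can be resolved."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

A name that fails to resolve points to DNS or hosts-file problems, while a name that resolves but cannot be connected to suggests a firewall rule or a stopped agent.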

Tools and Verification Checks

CCI is used to facilitate communications between components, and this is
particularly critical when cross-platform scheduling is in effect. To verify that
the necessary CCI components are running, execute the following command:

ccicntrl status

Here you can see an example of the results:

Depending upon the exact configuration of the machines in your environment,
NR-Client may be running instead of NR-Server. Usually NR-Server is installed
and remote is used for persistent connections. Therefore, at least two CCI
components must be running: Transport and NR-Server.

You should also make sure CCI is sending and receiving by using CCIR and
CCIS utilities. For example:

If the required CCI components are running and there are no network related
issues, verify that the Event Management components are running by
executing the following commands:

unifstat -c evtd
unifstat -c evtr

For example:

In the example above, the Event Management components, which are essential
to remediation, have in fact stopped. Restart Event Management by running the
following command:

unicntrl start opr

Diagnostic tools such as tracert and pathping can help determine problems
such as broken links.
