$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
%$%                                                                     %$%
$%$                 Electronic Switching System Faults                  $%$
%$%                                                                     %$%
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


"Notes from No. 2 ESS Administration and Maintenance Plan,"
"BSTJ Vol. 48, 1969"

"Data Maintenance"


Memory mutilation results from hardware faults and program bugs.
During nonsynchronous operation, mismatch detection is not available, so
there may be a long period of time during which mutilation occurs.
Mismatch detection is useless in finding data mutilation caused by program bugs.

Data maintenance is aided by:
    ease of communication among programs,
    absence of linked lists, and
    per-call memory allocation (call processing program addressing is
    relative to the allocated memory, reducing the scope of data accesses).

Defensive programming techniques:

    Range check table indexes,
    Zero check derived transfer-to addresses, and
    Distinct program and data areas prevent programs being read as data.
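The first two checks above can be sketched as follows. This is only an illustration in Python, not No. 2 ESS code: the dispatch table, the handler strings, and the DefensiveCheckError name are assumptions made for the example.

```python
class DefensiveCheckError(Exception):
    """Raised when a defensive check rejects an index or a transfer address."""


def checked_lookup(table, index):
    """Range check a table index before using it."""
    if not isinstance(index, int) or not 0 <= index < len(table):
        raise DefensiveCheckError(
            f"index {index!r} outside table of {len(table)} entries"
        )
    return table[index]


def checked_transfer(handlers, address):
    """Zero check a derived transfer-to address before transferring control.

    Address 0 is treated as "never filled in", so it is refused instead of
    being followed.
    """
    if address == 0:
        raise DefensiveCheckError("derived transfer-to address is zero")
    return checked_lookup(handlers, address)()


# Hypothetical dispatch table; slot 0 is deliberately left unused.
HANDLERS = [None, lambda: "digit collected", lambda: "call released"]

print(checked_transfer(HANDLERS, 1))  # digit collected
```

A failed check raises instead of continuing, so the caller can hand the problem to an audit rather than act on mutilated data.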

Audit programs detect bad data.
Audits run periodically or as requested from TTY.
Separate audits for different memory blocks.
Audits correct by idling memory blocks containing bad data.
System recovery is initiated by a control unit switch during simplex
operation; a control unit switch can be caused by bad data or by bugs that
cause a sanity timeout.

System recovery functions:
    Make call store consistent with state of periphery,
    Clear memory associated with program in control at time of recovery,
    Run audits,
    Repeat the above with widening scope of memory initialization until
    sanity is obtained.
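The recovery loop above might look roughly like this. It is a sketch under stated assumptions: the region names, the toy memory dictionary, and the callables passed to recover() are all hypothetical, and the periphery synchronization is left as a placeholder.

```python
def recover(scopes, make_consistent, run_audits, is_sane):
    """Initialize progressively wider memory scopes until the system is sane.

    scopes -- list of (name, initialize) pairs, narrowest scope first.
    Returns the name of the scope that restored sanity, or None.
    """
    for name, initialize in scopes:
        make_consistent()   # make call store agree with the periphery
        initialize()        # clear this scope of memory
        run_audits()
        if is_sane():
            return name
    return None


# Toy memory image in which two regions are mutilated.
memory = {"current_program": "bad", "call_records": "bad", "other_transient": "ok"}


def clear(*regions):
    """Build an initializer that resets the named regions."""
    def initialize():
        for region in regions:
            memory[region] = "ok"
    return initialize


scopes = [
    ("program in control", clear("current_program")),
    ("all call records", clear("current_program", "call_records")),
    ("all transient memory", clear(*memory)),
]
audit_runs = []
restored_by = recover(
    scopes,
    make_consistent=lambda: None,  # placeholder for the periphery sync
    run_audits=lambda: audit_runs.append("audit"),
    is_sane=lambda: all(state == "ok" for state in memory.values()),
)
print(restored_by)  # all call records
```

Starting with the narrowest scope keeps the number of disturbed calls small; the wider scopes are reached only when the narrower ones fail to restore sanity.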



"Notes from Design of Recovery Strategies for a Fault Tolerant No. 4 ESS"
"by R. J. Willet - BSTJ Vol. 61, No. 10, 4-13-82"

"Objectives"

616,000 call attempts/hour
100,000 active terminations
Downtime less than 2 hours in 40 years
Not cost-effective (or possible) to remove all software errors - minimize
number of service-affecting errors and analyze data for cause.


"Software Recovery"
Reconstruct data from associated information - slow, disturbs few calls.
Reinitialize memory structure - fast, disturbs many calls.


"Audit Programs"
Provide for integrity of system memory
Structured into mutilation detection and correction modules
Detection modules run continuously in background
Detection modules augmented by defensive checks in operational programs
Call correction modules to correct errors found by background audits or
defensive checks.


"System Integrity Programs"
Provide for integrity of programs
Monitor job scheduling and sequencing for frequency and execution times
Use sanity timers
Call audits or reinitialize system to correct errors.
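A sanity timer of the kind mentioned above can be modeled in a few lines. This is a simplified software model, assuming a tick-driven clock and a caller-supplied recovery action; the real timers are hardware, and the SanityTimer name and its interface are invented for illustration.

```python
class SanityTimer:
    """Watchdog: healthy software must call reset() before `limit` ticks pass."""

    def __init__(self, limit, on_timeout):
        self.limit = limit            # ticks allowed between resets (assumed unit)
        self.on_timeout = on_timeout  # recovery action, e.g. request an audit
        self.elapsed = 0

    def reset(self):
        """Called by the monitored program each time it completes a cycle."""
        self.elapsed = 0

    def tick(self):
        """Advance one clock tick and fire the recovery action on expiry."""
        self.elapsed += 1
        if self.elapsed >= self.limit:
            self.elapsed = 0
            self.on_timeout()


events = []
timer = SanityTimer(limit=3, on_timeout=lambda: events.append("reinitialize"))

# A healthy cycle resets the timer before it expires.
timer.tick()
timer.tick()
timer.reset()

# A program stuck in a loop never resets, so the timer fires.
for _ in range(3):
    timer.tick()
print(events)  # ['reinitialize']
```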


"Recovery from software problems"
Software problems caused by program errors or bad data
Out-of-range accesses trigger hardware interrupt; recovery
requires correction of data, or killing of call and return of control
to a safe point.
Inhibit (pest) interrupts while audits are correcting problem -
risky, but assumes single software fault.
In cases where the out-of-range error can be isolated to a single unit, can
use frame level pesting; otherwise use system level pesting.
Software recovery does not consider the possibility of a hardware fault.
Recovery cannot fix a program bug. Running pested may allow the system to
operate in a degraded fashion while maintenance personnel analyze data and
correct program.
The buffer overflow problem - may be caused by program error.
Buffers protected by hardware overflow interrupts.
Recovery runs the buffer unloader program to unload the buffer and audits
the task dispenser program to ensure the unloader is scheduled properly.
The overflow interrupt is pested.
If problem continues, hardware is suspect.



"No. 4 ESS: Maintenance Software"
"by M. N. Meyers, W. A. Routt and K. W. Yoder,"
"BSTJ Vol. 56, No. 7, September 1977"


"Software Error Recovery"
Since system operation is dependent on data in memories, and memories can be
written, there is a possibility the memory will be in a state that precludes
operation.
System must be as error-free as possible.
Since system cannot be completely error-free, it must be error tolerant.


"Classification of software errors"
Errors in interfaces between software modules.
Non-conformity to systems rules.
Logic errors.
Coding errors.
Complex man-machine interfaces that lead to procedural errors.


"Effects of software errors"
Loss of viability
Loss of call processing
Loss of a facility
Loss of a function
Loss of capacity
Loss of single call
No effect


"Error Prevention"
Standardization,
Simplification,
Improved Documentation,
(I wonder why improved testing isn't listed.)


"Error Tolerance"
Error tolerance achieved through defensive programming and defensive memory.
Programs attempt to remain operational in presence of errors
Restrict access to data to noncritical regions where errors have minor impact
(using special instructions to write to critical memory).
Checking state codes
Range Checks
Positive Decisions
Symbolic Addressing
Interpreting all possible stimuli
Link by index rather than absolute address
Special purpose memory allocators


"Error Handling"
"Software Integrity Control"
Receives reports of all software errors, decides and activates corrective action.
Corrective action: request an audit, or if history indicates repetition,
request a phase of system initialization.
Receives reports of overload - must decide if these are true, or are the
results of software errors, and take appropriate action.
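The audit-or-initialize decision described above can be sketched like this. The repeat threshold, the error names, and the returned action strings are assumptions; the notes say only that repetition in the history escalates from an audit to a phase of system initialization.

```python
from collections import Counter


class SoftwareIntegrityControl:
    """Chooses a corrective action from the history of reported errors."""

    def __init__(self, repeat_limit=3):
        self.repeat_limit = repeat_limit  # assumed threshold, not from the source
        self.history = Counter()

    def report(self, error_kind):
        """Record one error report and return the corrective action to take."""
        self.history[error_kind] += 1
        if self.history[error_kind] >= self.repeat_limit:
            # Repetition suggests audits are not curing the problem.
            self.history[error_kind] = 0
            return "initialize"
        return "audit"


sic = SoftwareIntegrityControl(repeat_limit=3)
actions = [sic.report("trunk_register") for _ in range(3)]
print(actions)  # ['audit', 'audit', 'initialize']
```

Keeping a separate count per error kind means that one noisy error source escalates without dragging unrelated errors along with it.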


"Audits"
Detect and correct data errors.
Control structure and audit routines specific to particular data structures.
Control structure schedules routine audits.
Demand audits run when errors found or suspected.
Error detection through: comparison of duplicate structures, association
(correct structures linked together), and semantics (data is reasonably
correct).
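The three detection methods can be illustrated on a single record. The trunk and call record layouts below are hypothetical, chosen only so that each of the three checks has something to test.

```python
def audit_trunk(record, duplicate, calls):
    """Return the list of problems found in one (hypothetical) trunk record."""
    problems = []

    # Comparison: the record must agree with its duplicate copy.
    if record != duplicate:
        problems.append("duplicate mismatch")

    # Association: a busy trunk must point at a call that points back at it.
    call = calls.get(record["call_id"])
    if record["state"] == "busy" and (
        call is None or call["trunk_id"] != record["trunk_id"]
    ):
        problems.append("broken linkage")

    # Semantics: the state must be one of the values that make sense.
    if record["state"] not in ("idle", "busy"):
        problems.append("invalid state")

    return problems


trunk = {"trunk_id": 7, "state": "busy", "call_id": 42}
calls = {42: {"trunk_id": 7}}

print(audit_trunk(trunk, dict(trunk), calls))  # []
print(audit_trunk(trunk, dict(trunk), {}))     # ['broken linkage']
```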


"Integrity monitor system"
Detects scheduling and cycling irregularities and loss of major system functions.
Time monitors detect basic sanity
Base level monitor verifies validity of base cycle
Test call program observes progress of test calls


"Recovery"
Remedial actions - audits - correct specific errors.
If remedial actions fail, undertake system recovery.
System recovery reconfigures hardware and initializes memory.
System recovery initiated upon detection of:
    Mutilation of program memory,
    Loss of vital system function,
    Loss of major facility,
    Escalation of remedial actions,
    Software integrity monitor problems,
    Sanity timeout,
    Mutilation of software clock,
    Duplex failure.

System recovery phases:
    Phase 1 - initialize specific areas of transient memory - does not kill calls.
    Phase 2 - reconfigure peripheral hardware and initialize all transient memory -
              kills only non-stable calls.
    Phase 3 - reconfigure processors, initialize all memory - kills only non-stable calls.
    Phase 4 - totally initialize system - kills all calls - manual activation only.

More detailed and specific manual recovery procedures can be initiated.



"Notes from DS5C 3.00.00.00 - Issue 2 No. 5 ESS Maintenance Plan, 40288-500"


Up to 127 interface module (IM) processors - each is duplicated.
No simplex failure causes service loss; duplex failures cause service loss
for only a few terminations.
Central Processor (CP) is duplicated.
The time multiplexed switch contains a duplicated processor.
The Control Distribution Units are not duplicated, since a single one is
sufficient for inter-processor communications.
The Input/Output Processor is duplicated.
CP maintenance is under control of CP programs.
CP programs have minimal involvement in peripheral unit maintenance.
IM responsible for own maintenance, with few exceptions (manual requests, etc.).
IM uses self-checking hardware and operational checks.
Inter-processor messages use a reliable communications protocol.
Test for errors during use; scheduled tests for equipment not frequently used
or in standby mode.
Test error detecting capabilities by inducing errors.
Operational software can detect some hardware errors through in-line checks,
and keep history to isolate error.
Audits and defensive checks to reduce software errors.
Software reliability is obtained at the expense of real time performance.

Defensive checks:
    Range checks on pointers and indexes.
    Duplicate copies of data at two different locations to detect mutilation.
    Two-way linked lists.
    Maintain and check redundant status information.
    Positive decisions
    Limited use of GOTOs
    One entry point per module
    Check incoming buffered data for mutilation while in buffer.
    Ack timeouts.

Audits check for resources in an invalid state and restore the resource to a valid state.
Audits try to detect erroneous states before system performance is affected.
Loss of all of a given class of resource may halt call processing.
Long loops in programs cause them to consume all of the real time and put the
system in overload.
Auditing of data resident in more than one processor leads to a race condition.
If audits find one error in a data structure then all associated data is assumed to be bad.
Processes audited by timing; timeout in transient states indicates errors.
Scheduled audits have low priority.
Errors detected by audits are output on TTY.
Audit history is maintained in CP and can be queried.
Audits are stitched together during initialization.
Craft can request audits.
CP has audits for its data, IM has audits for its data, CP controls audits for
data shared by IM and CP.
Test calls are generated and monitored to detect timing irregularities.
Operating system monitors scheduling to detect overload or fault.
Hardware sanity timers to detect basic processor sanity.
Overload is caused by a shortage of resources.
In overload give preference to terminating traffic.
During growth intervals, audits must be gracefully updated to test new resources.



"Notes from DS5D 5.00.02.00 - Issue 1 Feature Control Development Specification Case 40288-100"



Feature control programs must explicitly record state during real time break,
even if the state can be derived implicitly by the program.
Restricted control of the process stack improves reliability.
Program design represented by finite state machine.
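One way to picture these two points together is a table-driven state machine whose current state is always stored explicitly. The states, events, and class name below are invented for illustration and are not taken from the specification.

```python
# Hypothetical call states and events; the table is the whole program design.
TRANSITIONS = {
    ("idle", "off_hook"): "collecting_digits",
    ("collecting_digits", "digits_complete"): "ringing",
    ("ringing", "answer"): "talking",
    ("talking", "on_hook"): "idle",
}


class FeatureProcess:
    def __init__(self):
        # The state is recorded explicitly, so it survives every real-time
        # break and can be inspected by an audit at any time.
        self.state = "idle"

    def handle(self, event):
        """Process one event, then return control (the real-time break)."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not valid in state {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


process = FeatureProcess()
process.handle("off_hook")
process.handle("digits_complete")
print(process.state)  # ringing
```

Because nothing about the call is implied by where the program happens to be executing, an audit can judge the recorded state by itself.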



"Notes from 5ESS AUDITS Description and Procedures Manual, Issue 2, 1983"



Audits are 10% of 5ESS software.
Audits are powerful enough to disconnect stable calls and initialize the system.

5ESS software is divided into 5 areas:
    operating system,
    call processing,
    administration,
    database, and
    maintenance (the largest).

Audits are responsible for data in each area.
The maintenance area uses five system integrity features:
    asserts (defensive checks),
    integrity monitors (sanity timers and other checks),
    initializations, and audits.

Application audits run under OSDS which provides:
    Process management,
    Resource allocation,
    Process synchronization and communication,
    Inter-processor communication,
    IM interrupt and fault handling,
    Timing services.

DMERT has audits that run in the System Integrity Monitor environment at the
CP, not described here.
Audit control in charge of audits in all environments; different audits
are run in different environments.
Routine audits are segmented - 20 millisecond segments for error detection,
40 milliseconds for error correction.
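Segmenting can be sketched with a Python generator, where each yield stands for a real-time break. To keep the example deterministic it bounds a segment by a record count instead of by 20 milliseconds of clock time; that substitution, along with the record values and the validity rule, is an assumption.

```python
def segmented_audit(records, is_valid, per_segment):
    """Audit `records` one segment at a time.

    Each yield is a real-time break: control goes back to the caller, which
    receives the indexes of the bad records found in that segment.
    """
    for start in range(0, len(records), per_segment):
        segment = records[start:start + per_segment]
        yield [start + i for i, rec in enumerate(segment) if not is_valid(rec)]


records = [5, -1, 7, 3, -4, 9, 2]  # hypothetical data; negative means mutilated
found = []
for bad_indexes in segmented_audit(records, lambda rec: rec >= 0, per_segment=3):
    found.append(bad_indexes)  # higher priority work would run here
print(found)  # [[1], [4], []]
```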

Audits may run in 5 modes:
    routine segmented
        (run at lowest priority, take real-time breaks, output results to file,
        escalate if critical error found),
    elevated segmented
        (run at second lowest priority, take real-time breaks; caused by: input
        messages, asserts, single process purges, other audits; if cannot
        correct errors in 40 msec, either escalate or return to safe point),
    nonsegmented
        (no real-time breaks),
    initialization
        (run sets of audits back-to-back with no real-time break),
    postinitialization
        (used to restore non-critical data after initialization).

A single event number is assigned to all audit actions resulting from the
same error.
Asserts placed in operational code (including audit code) to detect:
    out-of-range indexes and pointers,
    redundancy on duplicate data,
    point-to/point-back linkage,
    consistency between related data,
    invalid function return codes.

Failing asserts usually trigger audits.
Asserts are a convention for defensive programming that standardize history
keeping and output messages.
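A rough sketch of such a convention follows, using the point-to/point-back check as the example. The ess_assert function, its arguments, and the record layout are hypothetical; only the general idea (check, keep history, trigger an audit) comes from the notes.

```python
assert_history = []  # standardized record of every failed assert


def ess_assert(condition, name, detail, on_failure=None):
    """Record a failed check and optionally trigger a corrective audit.

    Unlike Python's built-in assert statement, this never raises: the caller
    keeps running and decides what to do with a False result.
    """
    if condition:
        return True
    assert_history.append((name, detail))
    if on_failure is not None:
        on_failure(name)
    return False


def linkage_ok(a, b):
    """Point-to/point-back check: a points to b and b points back to a."""
    return a["peer"] is b and b["peer"] is a


line = {"peer": None}
trunk = {"peer": line}
line["peer"] = trunk

requested_audits = []
line["peer"] = None  # simulate mutilation of one pointer
ok = ess_assert(linkage_ok(line, trunk), "LINKAGE", "line 12",
                on_failure=requested_audits.append)
print(ok, assert_history, requested_audits)
```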

Audits classify errors as: process errors, non-process errors, or audit failures.
Process errors cause single process purge.
Non-process errors are accumulated until end of segment and then an elevated
audit is scheduled to correct them.
Audit failures (the error is such that the audit can't continue) result in
scheduling of audits to fix the error.
Interprocessor audits check duplicate data in different processors.
Interprocessor audits run in routine and elevated segmented modes.
Interprocessor audits triggered manually, by asserts, and routinely.
Interprocessor audits correct data by making CP agree with IM.
Interprocessor audits divided into partitions that run on a single processor.
Interprocessor audit execution controlled by CP, which receives completion
reports from each partition.
About 70 application audits.

Current software development methodologies incapable of producing fault-free
programs.
Operator mistakes introduce incorrect data into system.
Hardware faults cause data mutilation.
Early detection and correction of fault effect retains system availability,
at a cost (usually mishandling of single job).


"DEFINITIONS"

Fault - improper action
Error - the result of some faults - an erroneous state
Failure - incorrect system output (wrong or delayed)
Outage - failure affecting significant number of jobs
Fault tolerant systems prevent errors from becoming failures and failures
from becoming outages.
Faults that do not produce an erroneous state are not detectable.


"CLASSIFICATION"
"ESS Software Reliability"
"Objectives"

Same as achieved by electro-mechanical systems:
No more than 2 hours total outage time in 40 years (720 seconds allocated to
outages caused by software faults).
No more than 2 calls out of 10,000 mishandled.
About 100-200 call attempts/sec
Tolerable response times range from milliseconds to seconds.
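The outage objective is easier to grasp as a per-year budget. The arithmetic below uses only the figures quoted above; the 365.25-day year is an assumption made for the conversion.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length

outage_budget_s = 2 * 3600   # 2 hours of total outage allowed
service_life_years = 40
software_budget_s = 720      # portion allocated to software faults

per_year_s = outage_budget_s / service_life_years
software_share = software_budget_s / outage_budget_s
availability = 1 - outage_budget_s / (service_life_years * SECONDS_PER_YEAR)
mishandled_rate = 2 / 10_000

print(f"outage budget per year: {per_year_s:.0f} s")      # 180 s (3 minutes)
print(f"software share of budget: {software_share:.0%}")  # 10%
print(f"required availability: {availability:.5%}")       # about 99.99943%
print(f"mishandled calls allowed: {mishandled_rate:.2%}") # 0.02%
```

Software faults therefore get about 18 seconds of outage per year, one tenth of the 3 minute yearly total.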

"System Characteristics - 1ESS, 1AESS, 4ESS"

Base level cycle partitioned into A, B, C, D, E and interject priorities.
Interrupts for timing and hardware exceptions.
I/O at interrupt level
Time shared CPU with segmented (interleaved) processing
System memory partitioned into:
    Text
    Parameters (describe office size)
    Translations (describe connections)
    Shared Memory


Note: This file was in EMACS, I hope I edited it good enough to read, so that it provides use. -DT


----------------------------------------------------------------------------