Tuesday, April 26, 2011

Real Application Clusters Part 1


What is RAC
 Single Instance DB: DB 1to1 Instance
 RAC DB: DB 1toM Instances
 One instance on one Server
 N instances = n Servers
 DB = combined processing power and memory of n Servers : throughput
and scalability
 High availability: prevents a single point of failure among the given n
servers
 Instances can join and leave dynamically
 Instances share common data files and control files
 Each instance has its own rollback segments, undo tablespace and
redo log files
 Instances can simultaneously run transactions against database
 Instances communicate with each other through cache fusion technology

Background Processes
 LMSn (Lock Manager Services) - Global Cache Service Processes
 The number of LMSn processes depends on the amount of messaging among nodes for buffer
block movement
 Copies blocks (read-consistent/current) from the holding instance's buffer cache to the
requesting instance's buffer cache
 This process consumes a lot of CPU time
 Minimum one LMS process per instance; in 10.2 and above up to 36
LMSn processes are possible (LMS0-LMS9, LMSa-LMSz)
 The number of LMS processes can be set with the gcs_server_processes parameter in 10.1 and
above; the default value is 1 (on a single-CPU system)
 Implements local part of GRD
 Which blocks are handled by which LMSn background process is
decided by a hash function, and a given block is always handled by one and the same
LMSn process
 The number of blocks served by LMSn is available in V$CR_BLOCK_SERVER
 Internal view is x$kjmsdp
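The hash-based block-to-LMS mapping described above can be sketched in a few lines (a toy model, not Oracle code; the function name and hashing scheme are illustrative):

```python
# Toy model of block-to-LMSn assignment: hashing the block's identity and
# taking it modulo the number of LMS processes guarantees that a given
# block is always served by one and the same LMSn process.
def lms_for_block(file_no, block_no, n_lms=2):
    """Return the index of the LMSn process responsible for this block."""
    return hash((file_no, block_no)) % n_lms

# Same block -> same LMSn, every time:
owner = lms_for_block(4, 1000)
assert owner == lms_for_block(4, 1000)
assert 0 <= owner < 2
```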

Background Processes
 LCKn – lock process or enqueue process, used for managing locks; acts as a lock-requesting agent on behalf of an instance
 Based on the parameter gc_lck_procs in 9.0.1 and below
 Assists LMS processes
 Manages instance resource requests
 Manages cross instance call operations
 By default only one lock process is started
 Maximum 10 lock processes is possible
 All instances should have same number of lock processes
 These processes communicate with the remote instance's LMD process for locks requested by a user process on another instance; the LCK processes belong to the requesting instance
 Based on the priority of the request, lock requests are either buffered at the source instance and sent together, or sent directly to the other instance

Background Processes
 LMDn – Lock Manager Daemon – Global Enqueue Service (GES)
Daemon
 One per instance
 Manages the requests for global enqueues by updating the status
when they are granted or revoked from an instance
 It performs global deadlock detection
 Monitors lock conversion timeouts
 Manages global enqueues and global resource access
 Handles remote enqueue requests (resource requests from another
instance)
 Internal view is x$kjmddp

Background Processes
 LMON – Lock Monitor – Global Enqueue Service Monitor
 One LMON process per instance
 Monitors the cluster to maintain global enqueues and resources
 Handles both instance and process abnormal expiration
 Reconfiguration of locks and resources when an instance joins or
leaves the cluster is also handled
 Manages the recovery process for cluster enqueues
 Every 10 minutes this process executes dynamic lock remastering
 Each LMON monitors its instance for deadlock detection, library
cache locks and dictionary cache locks

Background Processes
 DIAG: Diagnosability process
 Creates diagnostic data in the event of instance and process failures,
in the background_dump_dest location
 Monitors the health of the instance
 Also checks for instance hangs and deadlocks

Understanding RAC: VIP
 Introduced in Oracle 10.1
 All applications connect using the VIP, which is used by the listener
 Each node will be defined with a VIP
 VIP failover between nodes: if a node fails, its VIP is automatically relocated to another
node (check with ifconfig on the new node for the VIP address); thus multiple VIPs
may exist on one system during failover
 Only applies to newly connecting sessions
 VIP is configured through the VIPCA utility

Understanding RAC: Logs
 Logs are created on $ORA_CRS_HOME/log/nodename directory
 Logs : alert<node>.log : Oracle clusterware alert log
 Client log files of the OCR utilities such as OCRCHECK, OCRCONFIG, OCRDUMP, and of CSS,
CLSCFG, OIFCFG
 crsd/crsd.log, cssd/cssd.log, evmd/evmd.log, racg/ (log files of VIP and ONS)

Understanding RAC: heartbeats
 CSS maintains Heartbeats
 Heartbeats are of two types
 Network
 Runs across interconnect
 Disk
 Runs at voting disk
 The disk heartbeat has an internal I/O timeout, in seconds
 Can be specified using the disktimeout parameter in 10.2.0.2 and above
 It is the maximum time allowed for a voting file I/O to complete
 If exceeded, the voting file will be marked as offline
 The default is 200 seconds
 crsctl is used to get and set this timeout
 crsctl get css disktimeout
 crsctl set css disktimeout <time in seconds>

Understanding RAC: Dynamic Performance Views
 Each V$ view has a corresponding GV$ view
 Each GV$ view has an additional column, inst_id
 PZnn processes (0-99) are used to return data from remote instances
 catclust.sql is used to create the views

Understanding RAC: Segmented Arrays
 Arrays of objects (segments): used for objects too large to fit in a granule; arrays may grow dynamically
 Objects using segmented arrays are enqueues (locks), enqueue resources, transactions and transaction branches
 External view is x$ksmdd
 RAC segmented arrays: GCS resources, GES enqueues, GCS shadows

Components
 Components: VD and OCR
 VD: Voting Disk: member node information is stored in it
 OCR: Oracle Cluster Registry: cluster configuration information is stored in it
 Both are stored on shared storage
 The interconnect is a private network between the nodes that only the nodes access
 The interconnect can be redundant and can be on the same network
 The interconnect is recommended to use UDP on gigabit Ethernet; crossover cables are not supported
 Oracle Clusterware manages node membership and prevents split-brain syndrome, where each node tries to manage the database on its own due to a communication failure in the interconnect

Voting Disk
 The voting disk needs to be on a shared disk
 At least three voting disks are recommended by Oracle for high availability
 If a single voting disk is configured, then external mirroring is required
 A maximum of 32 voting disks is allowed
 The cluster uses the voting disk to determine which instances are members of the cluster

OCR
 This is used to store and manage the components that Oracle Clusterware controls, such as databases, instances, services, listeners, VIPs and any other applications. The OCR repository stores this information in the form of a key-value directory tree structure
 It must reside on a shared disk accessible by all nodes
 It is recommended to multiplex the OCR
 Updatable through DBCA, SRVCTL and OEM

Services: Node membership Service
 CSS: Cluster Synchronization Service: Manages Cluster node membership and notifies node joins and exits to members.
 Maintains Heartbeat and VD stores current status of each node
 CSS provides node membership, group access, basic cluster locking
 UNIX/Linux process of CSS: ocssd, ocssd.bin, init.cssd
 Windows process of CSS: ocssd.exe and Service is OracleCSService
 Runs as the administrator user
 Also works with non-RAC systems
 Can integrate with third-party cluster products

Resource Mgt Service
 CRS: Cluster Ready Service: crs process manages cluster resources like database, instance, service, listener, VIP, application process.
 CRS manages (starts, stops, monitors and fails over) resources based on the resource configuration information stored in the OCR
 CRS generates events when a resource status changes
 CRS restarts components when a failure occurs: the CRS process makes
three attempts to start the Oracle Notification Service (ONS), one attempt to
start an Oracle database, and five attempts to restart other database
components. The CRS process does not attempt to restart the VIP.
 Linux/unix process of crs: crsd.bin
 Windows process of crs: crsd.exe and Service name is OracleCRSService
 Runs as administrator user
 Maintains state information in the OCR (Oracle Cluster Registry)
 It is automatically restarted on failure

Event Mgt Service
 EVM: Event management: publishes events created by CRS
 Linux/unix process of evm: evmd, evmd.bin, evmlogger
 Windows process of evm: evmlogger.exe, evmd.exe and service name is OracleEVMService
 Runs as administrator user
 Generates messages when events occur which can be accessed for programmatic purposes
 Starts racgevt thread

FAN Event Services
 RACG: clusterware support for complex resources, runs server
callout scripts when FAN Event occurs
 Linux/unix process for RACG: racgmain, racgimon
 Windows process for RACG: racgmain.exe, racgimon.exe
 Runs as processes or threads, not as a service

Node application Services
 A standard set of application services automatically launched by
RAC
 VIP
 Oracle Net listener
 GSD
 ONS
 They run on each node and can be relocated to other nodes

Global Service Daemon
 This daemon executes srvctl commands: it receives tasks such as startup and shutdown from srvctl, executes them locally on each node, and sends the results back to srvctl
 This daemon is installed on the nodes by default and should not be killed
 GSD launches internal services
 GCS: Global Cache Service: manages Cache Fusion and in-memory data buffers
 GES: Global Enqueue Service: manages inter-instance locking and RAC recovery
 GSD is externally controlled, whereas GCS and GES are OS threads, and these threads manage the GRD (Global Resource Directory: a set of virtual tables in memory)
Listener
 Listens for incoming client requests; it is the server-side component of Oracle Net
 It manages traffic by receiving client requests for network sessions with the server and brokering those requests

Cache Fusion
A diskless cache coherency mechanism in Real Application Clusters that provides copies of blocks directly from a holding instance's memory cache to a requesting instance's memory cache.
Cache Fusion addresses several types of concurrency as described under the following headings:

Concurrent Reads on Multiple Nodes

Concurrent reads on multiple nodes occur when two instances need to read the same data block. Real Application Clusters resolves this situation without synchronization because multiple instances can share data blocks for read access without cache coherency conflicts.

Concurrent Reads and Writes on Different Nodes

A read request from an instance for a block that was modified by another instance and not yet written to disk can be a request for either the current version of the block or for a read-consistent version. In either case, the Global Cache Service Processes (LMSn) transfer the block from the holding instance's cache to the requesting instance's cache over the interconnect.

Concurrent Writes on Different Nodes

Concurrent writes on different nodes occur when the same data block is modified frequently by different instances. In such cases, the holding instance completes its work on the data block after receiving a request for the block. The GCS then converts the resources on the block to be globally managed and the LMSn processes transfer a copy of the block to the cache of the requesting instance. The main features of this processing are:
  • The Global Cache Service (GCS) tracks each version of a data block, and each version is referred to as a past image (PI). In the event of a failure, Oracle can reconstruct the current version of a block by using the information in a PI.
  • The cache-to-cache data transfer is done through the high speed IPC interconnect, thus eliminating disk I/O.
  • Cache Fusion limits the number of context switches because of the reduced sequence of round trip messages. Reducing the number of context switches enables greater cache coherency protocol efficiency. The database writer (DBWn) processes are not involved in Cache Fusion block transfers.
Notification Services
 ONS: Oracle Notification Service: Transmits FAN (Fast Application
Notification) Events. Introduced in 10.1 and is key component of FAN
 JDBC depends on ONS to propagate DB events
 Windows process for ONS: ons.exe
 Linux/Unix process for ONS: ons
 Onsctl command is used to start, stop, reconfigure and monitor ONS daemon
 racgons is used to configure the server-side ONS configuration
 ONS configuration is controlled by the ons.config file in the CRS_HOME/opmn/conf
folder
 For every FAN event status change, all executables in ORA_CRS_HOME/racg/usrco are launched
 It extends functionality of event manager daemon to push event messages to multiple targets (other RAC nodes)
 Utilized to extend RAC high availability and load balancing to mid-tier applications
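A minimal ons.config along these lines shows the shape of the file (port numbers and settings here are illustrative values, not defaults to rely on):

```
# ons.config sketch - illustrative values only
localport=6100        # port ONS listens on for local clients
remoteport=6200       # port ONS listens on for remote ONS daemons
loglevel=3
useocr=on             # read the node list from the OCR (RAC)
```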

Global Cache Service
 It is the controlling process that implements Cache Fusion
 Manages the status of data blocks and the transfer of data blocks across the buffer caches of all instances
 It is integrated with the buffer cache manager and does fast lookups of resource information in the GRD
 Maintains block modes for blocks in the global role
 It employs various background processes such as
 LMSn: Global Cache Service processes
 LMD: Global Enqueue Service Daemon

LMON: Global Enqueue Service Monitor
 It is a background process that monitors the entire cluster to manage global resources
 Manages instance and process expiration
 It does recovery for GCS
 It also handles the part of recovery associated with global resources
 The services provided by LMON are called cluster group services
 The actual name is Lock Monitor; it manages both local and global enqueues, hence it is also called the Global Enqueue Service Monitor
 Enqueues are shared memory structures that serialize access to database resources. Enqueues are local to the instance on a single-instance database and global to the database on RAC.

GRD: Global Resource Directory
 These are data structures that reside in memory, are associated with global resources, and are distributed across all instances/nodes in the cluster
 It is GCS and GES that maintain the GRD. These services record information about resources and enqueues held globally
 Each node manages its part of the global resources in the GRD

Process Monitor Daemon
 OPROCD: Oracle Process Monitor Daemon: this process is locked in memory to monitor the cluster and provide I/O fencing. It performs its checks, stops, and wakes up within an expected time. If it does not wake up within the expected time, OPROCD resets the processor and reboots the node. If OPROCD fails to reboot the node, the clusterware will do it.
 On Linux below 10.2.0.3 the hangcheck-timer module is used instead
 OPROCD takes a –t timeout (in milliseconds, default 1000) and a
–m margin (acceptable margin before rebooting, default 500 milliseconds) parameter
 Parameters for changing the OPROCD values are in init.cssd in the /etc/init.d folder: OPROCD_DEFAULT_TIMEOUT=<N MILLISECONDS>, OPROCD_DEFAULT_MARGIN=<N MILLISECONDS>
 Linux (on 10.2.0.4)/unix process of OPROCD: oprocd
 Windows Service for OPROCD: OraFenceService


Thursday, April 14, 2011

Freelists, PCTUSED, PCTFREE, Row chaining, Row migration


Segments (a table, for example)
    Extents (initial extent and next extents)
             Blocks (sets of OS blocks): header (block address, type of segment), table directory, row directory, free space (PCTUSED, PCTFREE)
                       Rows: row header, number of columns, cluster key, rowid, column length, column data
Freelists: lists of free blocks into which data is inserted; membership is decided by PCTFREE.
PCTFREE is the percentage of space in a block reserved for updates. If PCTFREE is set to 20, rows are inserted until the block is 80% full, and until then the block stays on the freelist. Once the block crosses the PCTFREE limit (all 80% consumed), it is removed from the freelist and the reserved 20% is used only for updates to existing rows.
PCTUSED: once a block is off the freelist, updates and deletes may shrink its used space. When usage falls below the PCTUSED limit, the block goes back onto the freelist. Hence PCTUSED is the threshold at which a block that has been freed up by updates and deletes can again be used for inserts.
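The freelist membership rules can be modeled in a few lines (a toy sketch with hypothetical names; real freelist management is internal to Oracle):

```python
# Toy model of freelist membership driven by PCTFREE/PCTUSED.
# A block on the freelist accepts inserts until used space crosses
# 100 - PCTFREE percent; after leaving the freelist it only returns
# once deletes/updates drop usage below PCTUSED percent.
def on_freelist(used_pct, pctfree=20, pctused=40, was_on_freelist=True):
    if was_on_freelist:
        return used_pct < 100 - pctfree   # still room for inserts
    return used_pct < pctused             # free enough to rejoin

assert on_freelist(75)                              # below the 80% line
assert not on_freelist(85)                          # hit PCTFREE, comes off
assert not on_freelist(50, was_on_freelist=False)   # not yet below PCTUSED
assert on_freelist(35, was_on_freelist=False)       # rejoins the freelist
```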
Row migration: after an update, a row sometimes no longer fits in its block and must be allocated to a completely new block, leaving the new block's address in the old block for index traversal. This relocation of an updated row that no longer fits in its current block is called row migration.
Row chaining: when a row cannot fit into a single block at all because it is larger than the block, Oracle splits the row across blocks; this splitting of rows across blocks is called row chaining. It can happen when lengthy columns are used in the table definition, such as LONG, LONG RAW, or a large CHAR.
Extent: group of blocks
select segment_name, bytes, blocks, block_id from dba_extents where segment_name = 'TEST';  -- block_id is the starting block of the extent

Wednesday, April 13, 2011

RBA,SCN,UBA,DBA,XID

Redo Byte Address - RBA
size 10 bytes
RBA = log sequence number . block number within redo log . byte number within block

System Change/Commit Number - SCN
size 6 bytes
SCN = wrap (2 bytes) . base (4 bytes)
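The SCN is commonly described as a 2-byte wrap counting rollovers of a 4-byte base; under that reading the full value can be computed like this (the function name is made up for illustration):

```python
# Compose a 6-byte SCN from its 2-byte wrap and 4-byte base parts:
# the wrap counts how many times the 32-bit base has rolled over.
def full_scn(wrap, base):
    assert 0 <= wrap < 2**16 and 0 <= base < 2**32
    return (wrap << 32) | base

assert full_scn(0, 5000) == 5000
assert full_scn(1, 0) == 2**32   # one wrap = 4294967296
```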

Database Block Address - DBA
size 4 bytes
DBA = upper 10 bits (file number) + lower 22 bits (block number)
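The 10/22-bit split of the DBA can be illustrated with a little bit arithmetic (hypothetical helper names; dbms_utility provides the real lookups in the database):

```python
# Pack and unpack a 4-byte DBA: the upper 10 bits hold the file number,
# the lower 22 bits hold the block number within that file.
def dba_make(file_no, block_no):
    assert 0 <= file_no < 2**10 and 0 <= block_no < 2**22
    return (file_no << 22) | block_no

def dba_split(dba):
    return dba >> 22, dba & (2**22 - 1)  # (file number, block number)

assert dba_split(dba_make(4, 1000)) == (4, 1000)
```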

Undo block Address - UBA
size 7 bytes
UBA=dba of undo block.Sequence number.Record number in block

Transaction ID - XID
size 8 bytes
XID = USN (undo segment number) of the transaction . undo segment header transaction table slot . sequence number (wrap)
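One reading consistent with the stated 8-byte size is 2 bytes of USN, 2 bytes of slot, and 4 bytes of sequence; that assumed layout can be sketched with struct (illustrative only):

```python
import struct

# Sketch: pack an XID as usn (2 bytes) . slot (2 bytes) . sequence (4 bytes),
# a layout assumed here to match the 8-byte total size.
def pack_xid(usn, slot, sqn):
    return struct.pack(">HHI", usn, slot, sqn)

xid = pack_xid(9, 21, 5231)
assert len(xid) == 8
assert struct.unpack(">HHI", xid) == (9, 21, 5231)
```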

Redo log
 header
 redo records
   redo record header (contains Thread,RBA,LENgth of record in bytes,SCN,date timestamp)
   change vectors
     header (contains CHANGE number, change TYP, CLSass, Absolute file number, Relative DBA, SCN, Sequence Number, Operation Code)
     array of change record lengths
     array of change records