CA1284389C

CA1284389C - Read in process memory apparatus

Info

Publication number: CA1284389C
Application number: CA000540643A
Authority: CA
Inventors: James W. Keeley; George J. Barlow
Original assignee: Bull HN Information Systems Inc
Current assignee: Bull HN Information Systems Inc
Priority date: 1986-06-27
Filing date: 1987-06-26
Publication date: 1991-05-21
Anticipated expiration: 2008-05-21
Also published as: EP0258559B1; KR880000858A; KR920010916B1; US4768148A; DE3750107T2; AU599671B2; EP0258559A2; KR880000861A; AU7478687A; EP0258559A3; DE3750107D1; KR920008430B1

Abstract

ABSTRACT OF THE DISCLOSURE

A cache memory subsystem couples to main memory through interface circuits via a system bus in common with a plurality of central processing subsystems which have similar interface circuits. The cache memory subsystem includes multilevel directory memory and buffer memory pipeline stages shareable by at least a pair of processing units. A read in process (RIP) memory associated with the buffer memory stage is set to a predetermined state in response to each read request which produces a miss condition to identify the buffer memory location of a specific level in the buffer memory which has been preallocated. The contents of the buffer memory stage are maintained coherent with main memory by updating its contents in response to write requests applied to the system bus by other subsystems. Upon detecting the receipt of data prior to the receipt of the requested data which would make the buffer memory contents incoherent, the cache switches the state of control means associated with the RIP memory. Upon receipt of the requested data, the directory memory is accessed, the RIP memory is reset and the latest data is forwarded to the requesting processing unit as a function of the state of the control means.

Description

RELATED PATENT APPLICATION
_____________ ____________ l. The Canadian patent applicatlon of James W. Keeley and Thomas F. Joyce entitled, "Multiprocessor Shared Pipeline Cache Memory", filed on September 26, 1985, bearing serial number 491,637, which is assigned to the same assignee as this paten~t applLcatlon.

. ` ~ !

, ' ' '~

~i~5g The present inYention relate~ ~o cache memory systems and more particularly .to cache memory sys'cems 5 utilized in multiprscessor systems.

P~iQ~
It is well known that cach~ memories haYe been highly effec~ive in increasing the throughput of small and large monoprocessor and multiprocessor s~stems. In 10 such system~, cache~ are frequently con~igured in a so-called private ca~he arrangem~nt in which the cache m~nory is dedicated to a single processor.
To increase system throughput, systems have increased the number of proce~sing units, each 15 connectins a cache memory which, in turn, connec~ed to main memory and the other units of the overall system via an asynchronous system bus. In such ~ystems, the independently operating processing units produce unsynchronized overlapping requests to main memory.
20 This can substantively affect cache coherency which is the ability of the cache to accurately and correctly track the contents of main mesnory. The result is that processing units can be forced to opera'ce with stale data which could eventually bring the system to a halt~
2S In general, there have been two basic approaches employed in maintaining cache coherency~ The f~rst termed a write through approach employs a lis~ener device which detects the occurrence of any write operations made to main memory and updates the contents 30 of the cache. The second approach employs circuits ' 8~8 which invalidate the data contents of the cache location~ when any other syst~m unit writes into a main m~ory location which ha~ been mapped into a proce~sing unit' s cache~, Employing the write through approach, one prior art caehe system detec~ when any da . a is read or written in~o main memory by ano'cher unit prior to the receipt of the reque~ted data which falls wi~hin a given addres~ range. If it does, ~he cache is bypa~sed 10 and the r~ue~ted data is tran~f erred to the requestirlg processing unit. While this en~ure~ cache coherency, the process of bypassing cache can re~ult in decreased system efficiencyO This occur~ as the system becomes busier due to the addition of more and faster 15 proce~sing units resulting in a substantial decrease in cache hit ratio.
Accordingly, it is a primary obj ect o~ 'che present invention to provide a cache memory which maintains cache coherency without decreasing system efficiency.
It is a f urther object of the present invention to main~ain a cache coherency notwiths~canding unsynchronized overlapping memory requests by a number of independently operated processing units which connect to a system bus through independent interraces.

2 5 ~ IQ~

The above obj ects and advarltages are achieved in a pre~erred embodiment of the present invention. In the pre:Eerred embodiment, a cache memory subsystem couples to main memo~y through interface circuits via an 30 asynchronous system bu~ in common with a number of , ... . . .

.

central processing sub3ystem which h~ve similar interface circuits~ Th~ cache memory subsystem includes multilevel directory memory and buffer memory pipeline stages shareable by a number of processin~
S unit~.
~ ccording to the present invention, r~ad in process ~RIP) memory means is operatively associated with one of the pipeline stagesO The RIP memory means is set to a predetermined state in respon~e to read reque t from a processing unit which produces a mi~s condi~ion indi~ating that the reque-Qted data is not resident in the buffer memory stageO Setting occurs when listener apparatus coupled to the system bus de~ect that the cache memory subsystem has forwarded a request ~or the missing data to main memory and presents that memory request to the cache memory subsystem resulting in the preallocation of directory memory . When so set, the RIP memory mean~ identif ies the address of the directory location within a specific ~ level which was preallocated. The listener apparatus maintains the contents of the buffer memory stage coherent by updating its contents in response to any r equest applied to the system bus by other subsystems.
In the preferred embodiment, prior to the receipt of the requested data from main memory, the cache memory subsystem, in response to detecting that the listener apparatus has been receiving data which would make the huffer memory incoherent, operates to switch control means associated with the RIP memory from a 30 preset state to a different state. When the r~quested data is reeeived, the directory memory is accessed, the RIP memory means is reset and the latest version of the ,, . ,: . .

.

requested data is forwarded to the requesting processing unit as a ~unction o~ the state of the control means. That i5~ when the control means has ! '~
been switched to a diferent state, this denotes ~hat ¦
the reguested da~a i~ not the most recent data (e.g. l has been modified by another proce~sing unit).
Therefore~ the latest version of the data stored in the ~,i~
buffer memory is read out and transferred to the requesting processing unit while the data received from 10 main memo~y is discarded. However~ when the control mean~ has remained in an initially preset state, this denote3 that the data received from main m~mory is the most recent data. Thu~, the requested data is written in~o the buffer memory and transferr~d ~o the requesting processing unit.
The arrangement of the present invention is able to maintain coherency without decreasing the cache efficiency, since requested main memory data will be written into the buff er store unless it proves to be stale (not the most recent). Of course, this presumes that the data meets those requirements necessa~y for insuring data integrity (e.g. error free).
In addition to maintaining coherency, the arrangement of the present invention ensures that the 25 operation of the cache memory subsystem is properly synchronized with the asynchronous operations of the system bus. mat is, the same memory request sent to main memory on the system bus is also presented to the cache memory subsystem during the same cycle resulting ~ in the preallocation of the directory memory. This ensures reliable operation within the pipelined stages.

- 6 - 7243~-51 In addition to the above, the arrangement of the present invention can be used to maintain cache coherency between a pair of the processing units sharing the cache memory subsystem.
In such circumstances, a second memory is associated with the other pipeline and is conditioned to store the same information as the RIP memory. When the other processing unit makes a memory read request for the same data, the state of the second memory blocks the loading of the other processing unit's data registers causing a miss condition. This allows the other processing unit to stall its operations until such time as the requested main memory data can be provided.
A single memory associated with the directory stage may be used to maintain coherency between processing units. How-ever, for performance reasons, it is preferable to use a memory in each pipeline stage. This eliminates the need to have to perform read modify write operations which would incxease the time of a pipeline stage to perform the required operations. Also, the arrangement allows sufficient time for signalling the process-ing unit to stall its operations.
In accordance with the present invention, there is provided a multiprocessing system comprising a plurality of processing units and a main memory coupled in common to an asyn-chronous system, each processing unit including a cache unit for providing high speed access to coherent main memory data in response to requests and data transmitted on said system bus by said processing units, each request containing first and second - 6a - 72434-51 address portions of a cache memory address yenerated by one of said processing units, said cache unit comprising: a first stage including directory store means organized into a plurality of levels containing groups of storage locations, each location for storing said first address portion of a memory read request generated by said processing unit associated therewith and each different group of locations within said directory store levels being defined by a dif~erent one of said second address portions;
~ a second stage including data store means organized into the same number of levels of said groups of locations as in said directory store means and each different group of locations within said data store levels being accessed by a different one of said second address portions; read in process (RIP) memory means included in one of said first and second cache stages, said RIP memory means including a plurality of locations, each location being accessed by a different one of said second address portions; decode and control means coupled to said directory store means, to said RIP
memory means and to said data store means, said decode and control means being operative during a cache allocation cycle in response to each request received from said processing unit for data not stored in said data store means to generate signals for placing the location specified by said second address portion in said RIP
memory means in a predetermined state for identifying the data store location which has been preallocated; and control means coupled to said RIP memory means and to said decode and control means, said control means being conditioned by said decode and ,,,, . ~

- 6b - 72~34-51 control means during a cache update cycle to switch to said predetermined state when said RIP memory means signals that a portion of the contents of said data store location which has been preallocated is being updated to maintain coherency prior to the receipt of the requested data and said control means being opera-tive to condition said data store means for transferring the most recent version of said requested data to said processing unit~
In accordance with another aspect of the invention, there is provided a multiprocessing system comprising a plurality of data processing subsystems and at least one main memory sub-system coupled in common to an asynchronous system bus, each data processing subsystem including a plurality of processing units, each processing unit being operative to generate memory requests for data, each request including an address; and a pipelined cache memory subsystem coupled to each of said processing units for receiving said data requests, said cache subsystem comprising:
input selection means for selecting a request address from one of said processing units subsystems during an allocated time slot interval; a first pipeline cache stage coupled to said input selection means, said pipeline stage including a directory store organized into a plurality of levels containing groups of storage locations, each location for storing said first address portion of a memory read request generated by one of said processing units during said allocated time slot interval and each different group of locations within said directory store levels being accessed by a different one of said second address portions; a second cache , 3~
- 6c - 72434-51 pipeline stage including a data store organized into the same number of levels of said groups of locations as in said directory store and each different group of locations within said data store levels being accessible by a different one of said second address portions during a succeeding time slot interval for transfer of the data contents to the requesting one of said processing units;
read in process (RIP) memory means included in one of said first and second cache stages, said RIP memory means including a plural-ity of locations, each location being accessed by a different one of said second address portions; decode and control means coupled to said directory store, to said RIP memory means, and to said data store, said decode and control means being operative during a cache allocation cycle in response to each request received from one of said processing units for data not stored in said data store to generate signals for placing the location specified by said second address portion in said RIP memory means in a predeter-mined state for identifying the data store location which has been preallocated; and control means coupled to said RIP memory means and to said decode and control means, said control means being conditioned by said decode and control means during a cache update cycle corresponding to an unused allocated time-slot inter-val to switch to said predetermined state when said RIP memory means signals that a portion of the contents of said data store location which has been preallocated is being updated to maintain coherency prior to the receipt of the requested data to be stored in the same data location and said control means being operative , .:
-, ~æ~
- 6d - 72434-51 to selectively condition said data store for transferring the most recent version of said requested data to said requesting proces-sing unit.
In accordance with another aspect of the invention, there is provided a cache unit for providing to a processor request-ing data only the most current version of the requested data;
wherein said cache unit operates in a data processing system which includes a plurality of processors coupled for communication with a main memory, wherein at least one of said processors is coupled to request and receive data from said cache unit by supplying a main memory request address having first and second parts, and wherein said cache unit includes an addressable data store for holding a data unit in each addressable location thereof and a directory store for holding a first address part in each location thereof, said data store and directory store being accessed by said second address part; said cache unit being characterized by:
an additional store, said store holding a bit or each accessible location of said data and directory stores; said additional store being accessed by said second address part; a first control unit coupled to said data store, said directory store and said addi-tional store and responsive to each data request received by said one processor for data not found in said data store for control-ling (i) entering of the first address part of said request address in the directory store in the location of said directory store accessed by the second address part of said request address and (ii) setting to a predetermined state the bit in said . .
. .

- 6e - 72434-51 additional store in the location of said additional store accessed by the second address part of said request address; a flip-flop;
a control circuit coupled to said additional memory and enab:Led by the entry of a data unit in a location of said data store for which the corresponding bit in said additional store is in said predetermined state to cause said flip-flop to operate in one of the states thereof; and a second control unit coupled to said flip-flop and said data store and responsive to the state of opera-tion if said flip-flop for controlling said data store to deliver l~ only the most current version of data units requested by said one processor.
The novel features which are believed to be character-istic of the invention both as -to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying drawings. It is to be expressly understood, however, that each of the drawings is given for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.

' ~

B~ E_1;~5~ Ql~ g55 Figure 1 is a block diagram of a system which includes the apparatus of the present invention~
Figure 2 is a block diagram of one of the central 5 subsyst~ns of Figure 1 construs:ted according to the pr esent i nv enti on .
Figure 3 shows in greater detail th~ ci rcui ts of the pipeline stages of Figure 2~
Figures 4a and 4b are ~low diagrams used ~o explain the operation ~ the apparatus of the present inven~ion used in differen~ly configured central subsystems.

~l5~ $au~5Y5$~LgE~ 9~

Figure 1 shows a multiprocessor data processing 15 sys~em 10 which includes a plurality of subsys~ems 14 through 30 which couple in common to a system bus 12.
The illustrative subsystems include a plurality of central subsystems 14 through 16, a plurality of memory subsystems 20 through 28 and a periphe~al subsystem 20 30. Each memory subsystem is organized to include even and odd memory modules. An example of such an arrangement is disclosed in U.S. Patent No. 4,432,055.
Each subsystem includes an interface area which enables the unit or units associated therewith to 25 transmit or receive requests in the form of commands, interrupts, data or responses/status to another unit on system bus 12 in an asynchronous mannerO mat is, each interface area can be assumed to include bus interface logic circuits such as those disclosed in U. SO Patent ~,,q~

No. 3,995~258, entitled "Data Pro~e~sing 5ystem Having a Data Integrity Technique~, invented by George J.
B arl ow .
The organization of each of the central subsystems 5 14 ~hrou~h 16 i the same. Figure 2 sh~ws in block diagram form ~e organization o~ central subsystem 14.
Subsystem 14 includ~s a pair of central processing unit (CPtJ) subsy~tems 14-2 and 14-4 coupled to share a cache subs3rstem 14-6. The cache subsystem 14-6 couples to 10 system bus 12 through a f irst in f irs~c out (~IEO) subsystem 14-10 which can be consider~d as being included within interface area 14-1..
A~ seen ~ra~ Figure 2, both CPU subsystems 14-2 and 14-4 are identical in construction. mat is~ each 15 CPU subsystem includes a 32-bit central processing unit (CPU) ( iO e. , CPU ' s 14-20 and 14 40), and a virtual memory management unit (VMMU) ( i. e., VMMU 14-26 and 14-46) for translating CPU virtual addresses into physical addresses for presentation to cache subsyst~
20 14-6 as part of the memory requests. Also, each CPU
subsyste~ includ~s a read only store (ROS) and a 16-bit ROS data output register (RD~) (i.e., E~OS 14-25, 14-44 and RDR 14-25, 14-45).
At the beginning of each cycle, each ~OS is 25 conditioned to read out a 16-bit microinstru tion word into its data output (RDR) register which defines - the type of operation to be perf ormed during the cycle ~firmware step/box). me clock circuits within each CPU subsystem ~ i. e., circuits 14-22 and 14-4~) 30 establish the basic timing for its subsystem under the control or cache subsystem 14-6 as explained herein.
The elements of each CPU subsystem can be constructed from standard integrated circuit chips.

--9.--As seen from Figure 2 t cache subsytstem 14-6 is organized into a source address generation section and two separate pipeline ~t~ges~ each with its own decode and control circuits. me source address generation section includ~ blocks 14-62 and 14--64 which perform the f unctions o souxce addres~ selecting and incrementing. me first pipeline stage iq an address stage and includes ~he directory and associated memory circuits of blocks 14-66 through 14-769 arranged as shown. This stage per~orms the functions of latching the ge~erated source address, directory searching and hit comparing. me firs~ pipeline stag~ provides as an output informa~ion in the form of a l~vel number and a column address. The opera~ions of the first pipeline stage are clocked by timing ~ignals generated by the timing and control circuits of block 14-600 The information from the first stage is immediately passed onto the second pipeline stage leaving the first stage available for the next source 20 request. The second pipeline s~age is a data stage and includes the data buffer and associated memory circuits of blocks 14 80 through 14-96, arranged as shown. This stage performs the functions of accessing the requested data from the buff er memories 14-88 and 14-90, or replacing/storing data with data received from FIF0 subsystem 14-10. m us, the second pipeline stage provides a 36-bit data word for transfer to one of the CPU subsystems. Again, the operations of the second pip~line ~tage are clocked by timing signals generated 30 by the timing and control circuits of block 14-60~
The different blocks of the Pirst and second pipeline stages are constructed from standard integrated circuits, such as those described in the --10~
~The TqL ~ata Book, Column 3ni Copyrighted 1984, by Texa~ Instr~nen~ Inc. and in 'che "Ad~anced Micro Devices Programm~ble Array Logic Handbookn, Copyright 1983, by A~vanced ~licro Devices, Inc. Por ~xample, the 5 addre~ selector circuit of block 14-62 i~ constructed from 'cwo sets of ~ix 7~AS857 mul~iplexer chips cascaded to select one of four addres~es. The swap multiplexer of block 14-92 i5 construc ed ~rom the same ~ype chip~., me latches of blocks 14-68 and 14-72 are 10 con~3truc:ted from 74AS843 D~type latch chips. The swap muli:iplexer and data register circuits of block 14-70 are constructed fram a single clocked programmabl~
array logic element, such as part number A~PAl~R6A, manufactul:ed by Advanced Micro Devices9 Inc.
me directory and associated memories 14-74 and 14~76, shown in grea~er detail in Figure 3, are constructed from 8-bit slice cache address comparator circuits having part number TMS2150JL, manufac~ured by Texas Instruments Incorporated and a 4X x 4-bit memory 20 chip having part number IMSli21, manufactured by INMOS
Corporation. The address and data registers 14-80 through 14-84 and 14-94 and 14-96 are constructed from 9-bit interface flip-flops having part number SN74A5823, manufactured by Texas Ins~ruments, Inc.
The buffer memories 14-88 and 14-90, shown in greater detail in Figure 3, are also constructed from 4K x 4-bit memory chips having part number IMS1421, manufactured by INMOS Corporation and a 4096 x 1 static RAM chip having part number AM2147, manuf actured by 30 Advanced Micro Devicesr Inc. The address increment - circuits of block 14-64 are constructed f rom standard ALU chips designated by part number 74AS181A and a progra~nmable array logic element having part number P~P~L16L8~, manufactured by AdYanced Micro Devices, Inc..
The first and s~cond levels of command regist~r 5 and decode circuits of block 14-66 and 14-86, respectively, utilize clocked programmable array logic elements having part numbers hnPAL16R4A and hn~AI,16R6A~
manufactured by Advanced Micro De~Jices, Inc. These circuits generate the required selection, read and 10 write control signals as indicated in Figure 2 ti.e., signal s SWAPLT~00, SWAP~T~00, POLDD~OL, PlLDDT~OL, P0LDDEOR, FlLD~T-OR). For ~urther detailsD reerence may be made to the ~quation~ of the App~ndixO
AS seen f ro~n Figure 2, cache subsystem 14-6 is 15 organized into even and odd sections which pennit two data words to be accessed simultaneously in reæponse to ei ther an odd or even m~nory address. For f ur ther information about this type of cache addressing arrangement, ref erence may be made to U. 5. Patent No.
20 ~ ,378,591 which is assigned to the same assignee as named her ei n.
Figure 2 also shows in block form, FIEO subsystem 14-10 which incluaes the FIEO control and clocking circuits of block 14-11 which couples to a replacement 25 address register 14-12 and to system bus 12. FIFO
subsystem 14 10 receives all of the inf ormation tran~ferred between any two subsystems on system bus 12. When the information is for updating data in main memory, the information is coded to indicate such 30 updating or replacement operation. FIFO subsystem 14-10 also receives any new data resulting from a memory request being ~orwarded to system bus 12 by cache subsystem 14-6, Both update and new data are ~. :

.' ' .

? ~

stored as requests within a bufer memory included within subsystem 14-10. FIF0 control circuits decode each request and initiate the appropriate cycles of operation which result in address, data and commands being applied to dif~erent parts o~ cach~ subsystem 14-6 as seen ~rom F~gure 20 For the purpose of ~he present invention, FI~:) subsystem can be considered conventional in design and take the f 4n~ of the type of FIF0 circuits discloæ d in U~S9 Paten~ No, 4,195,340 10 which is assigned ~o the 5am~ assigne2 a~ named herein.
me basic timing for each of the subsystems of Figure 2 i5 established b~ the timing and control circuits of block 14-600 Such control permits the conflict-free sharing of cache subsystem 14-6 by CPU
lS subsystems 14-2 and 14-4 and FIFO 9ub8ystem 14-10. The circui~s of block 14-60 are describ~d in greater detail in the first related patent applicationO Briefly, these circuits include address select logic circuits which generate control signalR for conditioning addr~ss 20 selector 14-62 to select one of the subsystems 1~-2, 1~-4 and 14-10 as a reques~ address source.
Also, block 14-60 includes pipeline clock circuits of block 14-620 which define the different types of cache memory cycles which can initiate the start of the 25 pipeline resulting in the generation of a predetermined sequence of signals in response to each request. mat is~ first and second signals, respectively, indica~e a cache request f or service by CPU0 subsystem 14-2 and C~l) 1 subsy st em 14- 4 whil e oth er 9i ynal s i ndi ca te ca ch e 30 requests for service by FIF0 subsystem 14-10.

,' . " . '' ~13--me~e requests can be summarized as follow~:

1. 5PYQRl~
A CPU0 read occurs i~ re~pon~e to a cache reques~ initlated by ROS 1~-24 during a fir~t S time slot/interva:L when CPU port 0 within interface 14-1 is not busy. Th~ addre~s ~upplied ~ CPU0 subsystem 14-2 is furnished to the fir~t pipelin~ ~tage and the directory is read. When a hit is detected, indicati~g that the reque3ted da~ca is stored in the data buffer, ~he buffer is read and ~che data is docked into the CPU0 data register. When a miss is detected, the c~un port is made bu~y, the reguest is forwarded to memory ~o fetch the requested data.

2. 5P~ Y~
A CPUl read occurs in response to a cache res~uest initiated by PsOS 14-44 during a third time slot/intervai when Cl?U port 1 within interface 14-1 is not busy.

3O ~5g~ E_B~_5yC~
A second half bus cycle occu~s in response to a f irs type of cache request ini~iated by FIFO subsystem 14-10 ~or data requested ~rom either main memory or an I/O device being returned on system bus 12 during a first or third time slot/interval when FIEO subsystem 14-10 has a request stored. When FIFO
subsystem 1~-10 furnishes data f rom an I/O
device to the irst pipeline state, it passes :

~, . . .

~: , .
,. . . . .
'- , .

therethrough without changirlg the sta~es of any memories and is clocked into the appropriate CPU data register. ~ata f rom main m~oory is written into the cache data buffer.q and is clocked into the ~ppropriate CPU data regis~ers.

A m~nory write update c~cle occurs in r~spollse to a second type of cache r~uest ini-tiated by FIFO subsystem 14-10 f or replacement or update data received f rom system bus 12, upon acknowledgemen'c of such data during a f irst or third time slot~interval when FIEO subsystem 14-10 has a request stored. ~IEO subsystem 14-10 furnishes data ~o the first pipeline stage resul ting in the reading of the di rectory memory. When a hit is detected, the r epl acement da ta i s w ri tten i nt o th e buf f er melr.ory.

5 - ~IE5! ~QS~91Y_Ç~C.~
A FIFO allocation occurs in response to a CPUO or CPUl READ CYCLE which resul ts in a miss being detected~, The CPU port is made busy and the request is forwarded to memory to fetch the requested data. Upon the memory read request being acknowledged, the CPU read request is loaded in~co the FIEO subsystem r egi st er 5 a nd cont r ol ci r cui ts i ncl ude d i n , 30 the subsystem initiate a request for a FIEO
cy cl e of operati on ( i . e., f or ce si gnal , :

' .

-15~
CYFIFO=l). Signals specifying the type of request and level number in~onmation are applied as input~ to the command register and decode circuit~ of block 14-66. m ese signals include FIMREF (memory ref~rence), FrWRIT (memory read) and FIDT16-18/19-21 (level number). The signals FIMREF and FrWRIT initiate a FIFO allocation cycle (i~e., FI~LOCYC-l).

Figure 3 ~how3 the organization of the ~ven and odd directory memo~y and buffer memory pipeline stages, according to the present invention. As seen from Figure 3 9 the directory and associated menory circuits of blocks 14-74 and 14-76, each include a multilevel 4K
x 20-bit directory memory 14-740 and a second half bus cycle in process 512 x 4-bit memory (SIP) 14-742.
The directory memory 14-740, in response to a cache address~ generates eight hit output signals (HIT0-7) which are applied to the hi~ decode circui~s of block 14-860 me directory memory 14-740 is written in response to write enable signals LVWR0 through ~VWR7 ~rom circuits 14-66. m e SIP memory 14-742, in response to the cache column address, generates four output signals (SIP0-3) which are also applied to the hit decode circuits of block 14-86~ The SIP memory 14-742 is written with level number and SIP bit signals as a function of input data signals SIPDT0-3 and a write enable signal SIPWRT generated by the circuits of block 14-66. For f ur'cher details as to how t:hese 30 signals~are generated, reference may be made to the equations of the Appendix.

~q~

The hit decode circuits of block 14-86 include ~he hit decode circuits of block 14-860 and the SIP hit decode circuits of block 14-862. In the preerred embodiment, separate clocked PLA elemen s are used to construct the hit decode circuits of blocks 14-860 and 14-86~.
In re3ponse to the input signals shown, hit decode circuits of block 14-860 general:e hit level n~ber output signals HIT0~2 and HIT. These signal~ indicate 10 the level at which the hit occurred and the hit condi~ion occurrence during a second half bus c~ycle or update cycle (i.e., nonmal operation). Simultaneously, tha pri~rity encoded SIP hit decode circuit3 of bloc~s 14-862 generate a SIP output signal 5IPHIT upon 15 detecting the presence of a potential coheren~y problem due to both processing units having requested the same d~ta. For further detail~ as to how these signal~ are generated, . reference may be made to the equations in the Appendix.
The outputs f rom the circuits of blocks 14-860 and 14-682 are applied as inputs to the circuits of the data buffer pipeline stage as shown. According to the preferred embodiment of the present invention, the data buffer and associated memory circuits of blocks 14-88 and 14 90, each include 4K x 16-bit data buffer 14-880, a 4K x l-bit read in process ~RIP) memory 14-882 and associated D-type clocked control fli~flop 14-884 and AND ga.te 14-886 all of which connect as shown. The RIP
memory 14-882 is written as a f unction of an input data signal RIPDTI and write a enable signal RIPWRT
generated by the circuits of block 14-86. For further details as to how these signals are yenerated, reference may be made to the equations in the Appendi~.

, The data output signal RIPDAT f rom the RIP memory 14 8~2 is loaded into control flip-flop 14-884 via AUD
gate 14-886 in response to a EIIT signal from block 14-860 during an update c~ycle ( i. e., when signal 5 FIU~DATE is a ONE~o m~ flip~flop 14-884 i~ preset to a binary ZE~O state during an allocate ~ycle (i. e.
when signal FIALOCYCLE is a binAry ZEROlo The outpu~
signal RIPELP from the flop-flop 14-884 is applied as an input 'co a dlata selector circuit 14-866 which is 10 also incl~ded within ~lock 14-86.
As a function of the state o~ signal RIPFLP, data selector circuit 14-86G generat~s output enable signals CADREX;;EN, CADBWEN and CADBEN which selectively condition data register 14-82 and da~a buff er memory 15 14-880 to provide the data signals to be loaded into one of the CPU data registers 14-94 and 14-96. For f urther details as to how these signals are generated, reference may be made to the equa~ions in ~he Appendix.

RIP~ _9P~lg~
With ref erence to the flow diagrams of Figures 4a and 4b, the operation of the apparatus of the present invention shown in Figure 3 will now be describedO In the system of Figure 1, each processing unit can initiate processing requests to memory. The memory 25 subsystem nonmally operates to queue the memory requests and process each request in sequence. In certain instances, one processing unit may be reading a word or a part of a word from the same location another processing unit is updating or writing. Since the 30 processing unit can apply the write request to system bus 12 before the me~ory subsystem delivers the in~ormation reques~ed by the first proces-~in~ unit, the infonDation actually delivered to the first processing unit will be incoherent. That is, the information will contain a mixture of old and new ir~onnation.
While it may be po~ible to avoid incoherency by requiring the imposition o~ locking instruc~ions for processiny units sharing common area~ of memory, ~his may not alway~ be practical. Also, it may b~ difficult to d~tect and prevent violations.
The apparatu~ o~ ~he present i~vention not only enable~ the cache subsystem to provide coherent data to its as~ociated proces~ing unit or unit~, ~ut it provid~s the latest version of the data being requested., In the preferred e~bodiment, the read in 15 proces~ RIP memory apparatus is included in both the even and odd cache section~ so as to provide coherency for each of the two 16-bit words. Of course, this apparatu~ can be expanded o provide the same degree of coherency for each byte of each word. mis can be done 20 b~ duplicating RIP memory apparatus.
The sequence of operations shown in Figure 4a illustrate the operation of a system. configured as a single processing unit which couples to system bus 12 through a single cache~ In this case, Figure 3 need 25 only include a single memory (either SIP memory 14-742 or RIP memory 14-882)~, For performance reasons, it is deslrable to include the memory in the f irst pipeline stage. It will be appreciated that both memories 14-742 and 14-882 are equivalent. To increase the 30 speed of operation of RIP memory 14-882, the memory is organized into a 4R by l-bit memory. Each of the 512 addres~able locations has 8 levels associated therewith. Thus; the levels are encoded and their value d~termined through acces~ing 'che memory.

--lg--In the case of the single processing unit configuration, the output SIP~IT from the SIP hit decode circui~s of block 14-862 are connected as an input to ~ND gate 14-886 in place of the ~IPDAT
signal. The RIP memory 14-882 is therefore elimina~ed.
~ ith re~ere~ce to the flcw chart of Figure 4a, i~
will be assumed that C~U 0 couples to cache subsystem 14-6. ~h~n CPU 14-2 r~quests the data word 0 stored at location A0 of the even memory module of me~ory 10 subsystem 20-2 o~ Figurer the cache sub~ystem 14-6 hit decode circuits of block 14-660 signal a m~ 8~ condition ~i.e. r signal HIT-0). This signal cause~ ~he interface area 14-1 to issue a read request to the main memory subsystem 20-2 requesting the data word stored at lS location A0. m at is, memory read request is applied to the system bus 12 and, upon being accepted by the memory subsystem 20-2, is acknowledged by its interface area 20-1 circui~s.
Upon being acknowledged, the same read request is 20 loaded into the FIFO subsystem 14-10 from system bus 12. In this manner, the bus request is synchronized with the cache pipeline timing. The cache memory address is applied via address sele~tor circuit 14-62 to the address latches 14-68/74 while certain command 25 and data bits are applied to the comm2nd register and decode circuits of block 14-66~ This occurs whenever there is a free pipeline stage and FIFO subsystem 14-10 initiated a request for a FIFO cycle ~iOe., signal CYFIFO~l).
The command bits specifying the type of FIFO
request are decoded by the circuits of block 14-66.
Since the request is a memory read request, signals FIMREF and FrWRIT are binary ONES. These signals force signal FIUPDATE to a bin~ry ~IE initiating a FI~
allocatiotl cycle of operation ( i. e., signal FIALOCYCa~l ) .
As seen f rom Figure 4a, the cache subsystem S preallocat2s the directory b~ writing the row address portion of the read request cache add~ess into the location specified b~ the cache column address at ~he designated level. me level i determined by decoding the level n~nber bits FIDT16-18 or MDT19-21 from ~IEO
10 sub~y~tem 14-10 which forces a corresponding one of the dir~cto~y write ~nable sigrlal s LVWR0 through LVWR7 to a bi nary Z EROo At the same time~ RIP memory 14-742 is set to a predete~mined state for denoting that there is a read 15 in proces~O That is, the level number and read in process bits are written into the location specified by the cache column address in response to sigalals SIPDT0-3 and SIP~RT. At that time, signal FIALOCYC
pre~ets RIP fli~flop 14-884 to a binary ZERO s ate~
As seen fr~n Figure 4a, the FIEO subsy~tem 14-11 list~ns for any write request made by another processing unit. me receipt of any such request causes the FIFO subsystem 14 10 to initlate a memory write update cycle (i.e., forces signal FIUPDATE to a 25 binary ONE). In the same manner as described above, the memory write cache address is applied via latches 14 68/72 to the directory memory 14-740 and memory 14 742 of Figure 3.
During each write update c~cl~ the directory and 30 associated memory circuits are accessed by cache column address signals COL~D0-8. me directory memory 14-740 operates to generate hit signals ~IIT0-7 indicating whether or not the data being updated is stored in dal:a 7~

-21~
buffer 14 B80 and i~ stored, the level in th~ data buffer where it is stored. At the same time, m~mory 14~742 operates to generate level number and RIP bit signals SIP0-3 indicating if a read in proces~ state 5 exists for that column addre s.
The hit decode circuit3 14-680 operate to co~vert signal~ ~IT0-7 into a three-bit hit level code which corr~sponds to signals ~IT~0-2, in addition to gen~rating hit output signal ~IT indicating the 10 occurrence of a hit conditionO m e SIP hit decode circuit~ 14~862 operate to compare the l~vel number signals SIP0-2 with the converted signals ~IT~0-2 for determining whether or not th~ data being updated will produce a coheren~y problem. When there is a match and 15 the RIP bit signal is true~ the hit decode cirruit~
14-862 force signal SIP~IT to a binary ONE indicating the potential coherency problem.
As seen fram Figure 4a, the occurrence of a directory hit causes the write update data loaded into data register 14-8~ to be written into data buffer 14-880, in response to signals CADBWEN and CADBEN
~i.e., both binary ZEROS). The data word is written into the location specified by cache column address si gnal s COLAD0 - 8 an d hi t 1 ev el number signals HIT~0-2 25 which were loaded into buffer address register 14-80/84, in response to clocking signal PIPEOB~OA. At this time, the occurrence of a RIP hit causes RIP
flip-flop 14-884 to be switched to a binary ONE, in response to ~ignals SIPHIT and FIUPDATE and timing 3 si gnal PI PEO~ OB ~
When set to a binary ONE, the RIP flip~flop ensures that the cache subsystem f urnishes coherent data to the processing unit notwithstanding the fact ~æ~

that the requested data word was updated by another processing unit. More speciically, when the requested data word is received f~om main memory, during a second half bus cycle, another FIFO cycle is initiated. This results in the requested data word being transferred via the S~APMUX data register 14-70 ~o da~a register 14-~2 .
~ lso, the cache address signals ar~ applied to the directory memory 14-740. Since the directory has been 10 preallocated, reading the directory results in the generation of a hit condition by hit decode circuits 14-680, as indicated in Figure 4a. Next, the state of RIP fllp-flop 14-884 is exa~ined. The ~tate of signal ~IPFLP conditions data selector 14-866 to select the latest version of the data word for trans~er to the processing unit. That is, when ~ignal RIPFLP is a binary ONE as in the present example, this indicates that an update cycle to the same location as the requested data had occurred before the SEBC cycle which returned the requested data word. Therefore, data selector circui s 14-866 are conditioned to discard the returned data word stored in data register 14-82 and enable data buffer 14-880 for reading out the updated data word to the CPU data register 14-94/96~ This is accomplished by data selector circuit 14-866 forcing signal DREGEN to a binary ONE disabling data register 14-82 and signal CADBEN to a binary ZERO enabling data buffer 14-880.
As seen from Figure 4a, when RIP flip-flop 14-886 has not been set to a binary ONE, the data word is written into data buffer 14-880 and transferred to the CPU 0. That is, data selector circuit 14 860 enable~
data register 14-82 to apply the data word to CPU data register 14-94/96 (i.e., forces signal DREGEN to a L~;3 q~/

binary Z~RO) while conditioning data bu~fer 14-880 for writing the data word into locatio~ A0 at the qpecified level (i.e., forces signal CADBWEN to a binary ZERO and sig~al C~DBEN to a bina~y ONE).
Also, during the S~BC cy d e, the RXP bit is reset to a binary ZERO in respo~se to signal SIPWRT. At that time, data input signal SIPDT3 is a binary ZERO.
From the above, it is seen that the apparatus of the present invention is able to both maintain 10 cohere~ y and provide a processing unit with the latest version of a coherent pair of data word~. That is, the e~en and odd sections of 'che cache subsystem include the apparatus of Figure 3 and operate in the manner illustrated in Figure 4a to inhibit the transf er of incoherent data words to CPU 14-2.
Figure 4b illustrates the operation of the present invention when used in a system conf igured as cache subsystem 14-6 of Figure 2. To ensure coherency between the two processing units CPU0 and CPUl which 20 ~hare the cache subsystem, both of the memories 14-742 and 14-882 are utilzed ln the different pipeline stages as shown in Figure 3. The arrangement provides performance benefits in that the memories in each pipeline stage are either written or read ~i.e., no 25 read modify write operations are required) thereby minimizing pipeline delays.
As seen f rom ~igure 4b~ it is assumed tha~ CPU0 generates a memory read request ~o re~d word 0 from location A0 of main memory subsystem 20 which produces 30 a miss condition. In the same manner as described above, the interf ace 14-l issues a memory read request to main memory subsystem 20 f or data word 0 in location A0. Also, CPU0 signals the memory subsystem 20 that . , the read is a double word read ~io eO ~ forces a double word bus line to a bina~y 0NE state) . This cau~e~
subsystem 20 to furnish data word~ from the even and odd memory modules ( i. e., locations A0 and A0~1~ to S CPUOO
As shown in Figure 4b, the memory read request is lsades~ in~o FIFO subsystem 14-10 and re5ults in a F~E~
cache cycle reques~. As previously discu.~sed, the directory memory 14-7 4û of the even and odd cache 10 section~ is preallocated h~ writing the read regues'~
cache rc~w address into the location AO speci~ied by the cache column address at the designated level tXi~. At the same time, the SIP memory 14-742 is accessed and ~he level nl~mber Xi along with the SIP bit are written 15 into 'che location A0 specified by the c che col~n addr ess.
Next, the RIP memory 14-882 of the ~ven and odd cache sections is accessed. The RIP bit location specif ied by the cache column address hit number signals ~IT#0-2 is set to a binary ONE in response to signals RIPDTI and RIPWRT generated by the circuits 14~86. Also, the RIP flip-flop 14-884 of each cache section is preset to a binary ZERO in the manner previously described.
As seen f rom Figur e 4b, the cache subsystem 14-6 in response to each request determines whether the request is for a CPU read, write update or second half bus cycle. Assuming it is a CPU read c~ycle, the memory read request generated by CPU1 ( i. e., CPU0 awaiting 30 results of memo~y read request) is examined when it is presented to the cache subsystem 20 by CPUl. That is, ~he cache address loaded into latches 14-68/72 accesses directory memory 14-740 and SIP memory 14-742 of ~he ~r~ 3 even and odd cach~ sections. When CPU1 is requesting on~ of the data words previously requ~sted by CPU0, the directory memory 14-740 generates output hit signals ~IT 0 7 indicating th~ occurrence o~ a hit condition at level Xi~
The SIP hit decode circuits 14-862 operate to force signal SIP~IT to a binary ONE when the SIP bit is a binary ONE and the hit output signals ~IT0-7 match the level number signal~ SIP0-2. The SIP~I~ signal is u~ed to indicate the presence of a coherency problem between CPU0 and CPUl.
~ indi~a~ed in Figure 4b, SIP~IT signal is used to block the loading of CPU1 data reyister~ with data ~rom data buffer 14-880 which produce~ a mi38 condition at interface area 14-1. This efec~ively cancels the request as far as cache Rubsystem 20 is concerned. To simplify the de~ign, the interface area 14-1 still forwards CPUl'~ memory request to main memory subsystem 20 in response to the miss condition.
In response to each write update cycle, the memory write reques~ is examined when it is presented to the cache subsystem 20 by ~IFO subsystem 14-10 follawing the receipt of an acknowledgement from system bus 12, As described above, the cache address is applied via latches 14-68/72 to directory memory 14-740 and when the data word being updated is the same as that requested by CPU0, hit decode circuits 14-860 operate to generate a hit.
This results in the updated data word being 30 written into data buff er 14-880 at the location specified by the cache column address and hit level sig~als HIT~0-2. Also, at that time, RIP memory 14 882 is accessedO When the RIP bit has b~en set to a binary ONE ~ignalling that a read is in process, thi5 forces signal RIPDAT to a binary ONE, l~li8 causes RIP
flip-flop 14-884 to be switched to a binary CNE in respon~e to signals ~IT, FIUPDAT~ and PIPEO~CB as 5 shown in Figure 4b~, When main memory subsystem 20 complete~ its p~ocessing of the CPV0 m~ory read r equest, it opera~es to return da~a word~ 0 and 1 on system bu~ 12 durin~ a second half bus cycle. As ~een fr~m Figure 4b, a S~C
10 cycl~ is ini~iated during which the returned data words togeth~r wi~h the cachè addres3 i~ prssented to cache subsy~em 20 by FI EO subsystem 14 10 .
As de-~cribed above, the cache addre~ is applied ~o direc~ory memory 14-740 which results in hit d~code 15 circuits 14-860 generatiny a hit. Also! SIP memory 14-742 is acces~ed and the SIP bit is reset to a binary 2ERO in response to signal SIPWWRT.
As seen f rom Figure 4b, the state of RIP flip-flop 14-884 detennines whether data buff er 14-880 or data ~ register 14-82 contains the latest version of the reque~ted data words. When set to a binary ONE, signal RIPELP conditions data selector circuit 14-866 to inhibit the transf er of the SHBC data from register 14-82. It also enables the rereading of cache data 25 buffer 14-880 by forcing signal CADBEN to a binary ZERO
and the delivery of the updated data words to the CPU0 data registerO
When RIP fli~flop 14~884 is a binary ZERO, this indicates that the requested data words have not been 30 updated. Hence, the S~BC data word5 stored in regis~er 14-82 are written into dlata buff er 14-880 and delivered to the CPU0 data register. A~ seen f rom Figure 4b, during the SBC cycle, the P~IP bit accessed from RIP

. . .

memory 14-882 is reset to a binary ZERO in response to signals E~IPDTI and RIPWRT. Since thi3 is not an allocation cycle, signal RIPDTI i-q a ~inary Z~RO.
From the above" it is seen how the apparatus of 5 the presen~ in~7ention ensures that each processing unit within a system or in the case of a shared cache, each pair of processing unit~ within a ~;ystem receive coherent data and the latest version o~ such coherent data. This i5 accomplished with the addi tlon of few 10 circuits within one or more of the pipeline stages of the cache subsystem of the pref erred embodiment. For per~ormance reasons, m~mory circuits are included in each pipel ine stage of the shared cache subsystem.

The equation~ for gene~ating the signal~ of F~gures 2 and 3 are given ~y the ~oll~ g Boolean e xpr essi ons:

S 1. *POLDD~OL~CPUCYL CPUNUM DBWDRD EVN~IT
( ODDH IT ODDSI PH I T
, ,, ~
CPU READ C:~tCLE

CPU CYL CPU NUM DBWDRD CMAD2 2 CMAD 23 o EVNH IT EVNS I ~ T T
CPU READ CYCL E

10~CPUC~ CPUNUM D~WDRD-CMAD22 CMAD23-ODDHIT-ODDSIP~IT
CPU READ S:YC1E

+CPUCYL- FIADl7- FISHBA-RPMREF
I/O SH~ C

.
+~CPUCYL- FIADl7- FISHBA-RPMREF.
l 5MEM SHB C

2 . *POLDDT--OR=CPUCYL- CPUNUM- DBWDRD EVNHIT-OD~I ~ DSIP IT
CPU READ

+CPU CYL CPU NUM D~WDRD CMAD2 2 EVNE~ I T ~VMS I PH I T
- J

C~U AD

~CPUCYL- FIAD17- FISHBA-RPMREF
I/O SHBC

+CPUCYL- FIAD17- FISBA-RP~lREFo I/O SHB C

3. *pll:,DDr OL=same as 1 except -CPU-NUMi CPUNUMo 4. *PlLDDT-ORssame as 2 excep'c CPUNUM~CPUNUM.

5 . *SWAPL~r,Pucy~
CPU READ

+~CPUCYL~ FISHBA RPMREF-RPAD22. J

6. *SWAPRT~ CPUCYLDBWDRDCIIAD22 ~J
CPU READ

15 ~c~ <~ _ CPU READ

*These signals are clocked with signal PIPEO~+OA.

, , ~ '.

-30-' ~CPUCYLoFISHBA-RPMREF-(FIDBWD~RPAD22+FI.DBW~-RPAD22)..
ME~ SHBC

7. HIT~HITO~ITl+~IT2+~IT3+MIT4+HIT5~HIT6~IT7.

8. HIT~O-HITO~HITl+HIT2+~IT3.

, . . ~

9. ~IT$1-HITO~HITl+~IT4-~IT2-HIT3~HIT5-~IT20HIT3.

10. ~IT~2=~ITO~IT2-HITl~IT4-HIT3-HITl+~IT6-HIT5-HIT3~ITl.

11. SIPHIT=SIP3- [HIT0-SIP0-SIPl-SIP2+HIT0O
~ITl SIP0 SIPl SIP2 +HITII-HITl-HIT2-SIP0~SIPl-+EI IT0 oE~ITl E~IT2 HIT3 SIP0 SIPl SIP2 ~}IIT0-EIITl-~IT2~EIIT3-E~IIT4-SI P0 SIPl SIP2 +H I T0 H ITl ~ IT2 ~ El IT3 H IT4 HIT5-SIPO-SIPl-SIP2 +HIT0~HITl-HIT2-HIT3 HIT4-HIT5 HIT6 SIP0 SIPl SIP2 +HIT0-EIITl-EIIT2-E~IT3-HIT4-H IT5 H IT6 H IT7 ~SI P0 SIPl -SIP2 .

12,. SIPWRT=~ ~FIALOCYC~CYFIEO~FISHBC) WRTPLS] .

. ~` ' ~ ` ": ' ~ ' `

13 . SIPDT0-2=FI~T16~18/FIDT19-21.

14. SIPl~T3=FIAl-OCYC.

15 . RIPWRT~ 1 (FIALOCYC+CYFIEOo FIS~3BC) WRTPLS3 .

1 6 . RIPDT I-FIAL OCYC .

5 17D CADBWE~[FIUPDATE^~IT+CYFIFO-FISHBC:-RIPF$D] -WRTPL S O

18 " CZ~DR~GEN=G~DBEN- FISElBC-RIPE~.P+CADBEN~
DB EN - FIUPDATE ) .

19. CADBEN=HIT- (CPUCY~FIUPD~TE+MMSHBC~ wherein MMSHBC=FIS~BC-RPMREF.

20 . FIALOCYC=FIMREF- FrWRIT.

21. FIUPDATE=FIMREF FIWRIT.

.. . . ..
22 . LVWR0=WRTPLS ~ FIDT16 FIDT17 FIDT18+FIDTl9 FI DT2 0 FI DT2 1 ) .

15 23. LVWR7aWRTPLS- (FIDT16 FIDT17 FIDT18+FIDTl9-FIDT20~FIDT21) .

, 1. DBWDRD = Double word read command defined by ROS
data bit 4 ~ 1 and ROS data bit 5 = O g~nerated by th~ decode circuit~ of block 14-66 which i5 clocked S with signal PIPEO~+OA.

2. CPUNUM ~ CPU number (CPUO or CPUl) signal generated by ~he circuits of block 14-66 which is clocked with signal PIP~OA+OA.

3. CPUCYL 3 CPU cycle signal generated b~ the circuits of block 14 66 and which is clocked with signal PIPEOA+OAo 40 ~VN~IT - ~IT signal generated by the hit decode circuits 14-680 associated with the ~ven directory memory 14-76 .

50 CM~D22 = Cache memory address bit 22 generated at the output of selector 14-620 6. CMAD22 = Cache memory address bit 23, generated at ~he outpu'c of selector 14-62, specifies which half ~lef t or right) of data register 14-94 or 14-96 is to be loaded with a data word.

7. FIAD17 = FIEO address bit 17 from FIFO subsystem 14-10 defines which CPU is to receiYe the repl acement da ta .

: , .

8~ FIDBWD - FIFO double-wide word command bit f rom FIE~ subsystem 14-11 specifies when the data being returned has two words.

9. FXSHBA 3 FIEO second half bus cycle acknowl~dge signal from 14-11 specifies that the FIEO subsystem r eq ui r es a ca ch e ~y cl e to pr oce s~ da ta r eceiv ed fr~m an I/O device or m~nory during a second half bus cycl e SHBC.

10. ODDHIT - EIIT signal generated by the hit decode circuits 14-680 as-~ocia~ed with ~he odd directory memory 14-740 11. RPMREF = Memory ref erence signal provided by R~R
14-12 which permits any exception conditions to be taken into account.

15 12. RPAD22 = Replacement address bit 22 from RAR 14-12.

13 . FIDT16-18/19-21 = The even/odd data bits def ining the cache level provided by the ~IFO subsyst~n 14-10 .

14. CYFIEO - A cycle signal generated by the FIFO cycle select logic circui~s of block 14-60 during a free pi pel ine stage .

15. FISHBC - The second half buq cycle signal from FIEO
subsystem 14-10.

`:

... .. ,, ., , ... . ,~ .. . . ........ . .
~. ... :

16 . WRTPLS ~ write pul5e signal generated by the circuits of block 14-60 which cccurs midway be~leen either clocking ignals PIPEOA+OA AND PIPEOAIOB or clocking signals PIPEOB+OA and PIPEOB~OB.

5 17 . ~IMREF =~ The bus me~nory ref erence signal BSMREF
from FIFO subsystem 14-10.

18. FIWRIT - m e bus memory write signal BSWRIT fr~m FIFO subsystem 14 10.

It will be appreciated by those skilled in the art that many changes may be made to the preferred embodiment of the present invention. As mentioned, there need only be a single memory associated with the directory memory stage in the case of a single processing unit. Where there is more than one lS processing unit sharing the cache subsystem, one memory is used f or detecting incoherency between processing units while the second memory is used to indicate a read in process state included f or increased performance. ~owever, it still may be possible to combine both within a single memory, as well as the function of the RIP control flip-flop with attendant decrease in pipeline stage speed.
While in accordance with the provisions and s~atu~es there has been illustrated and described the best form of the invention~ certain changes may be made without departing fram the spirit of the invention as set forth in the appended claims and that in some cases, certain features of the invention may be used to advantage without a corresponding use of other 30 features.
What i s cl aimed i ~:

:

Claims

1. A multiprocessing system comprising a plurality of processing units and a main memory coupled in common to an asynchronous system, each processing unit including a cache unit for providing high speed access to coherent main memory data in response to requests and data transmitted on said system bus by said processing units, each request containing first and second address portions of a cache memory address generated by one of said processing units, said cache unit comprising:
a first stage including directory store means organized into a plurality of levels containing groups of storage locations, each location for storing said first address portion of a memory read request generated by said processing unit associated there-with and each different group of locations within said directory store levels being defined by a different one of said second address portions;
a second stage including data store means organized in-to the same number of levels of said groups of locations as in said directory store means and each different group of locations within said data store levels being accessed by a different one of said second address portions;
read in process (RIP) memory means included in one of said first and second cache stages, said RIP memory means including a plurality of locations, each location being accessed by a different one of said second address portions;
decode and control means coupled to said directory store means, to said RIP memory means and to said data store means, said decode and control means being operative during a cache allocation cycle in response to each request received from said processing unit for data not stored in said data store means to generate signals for placing the location specified by said second address portion in said RIP memory means in a predetermined state for identifying the data store location which has been preallocat-ed; and control means coupled to said RIP memory means and to said decode and control means, said control means being conditioned by said decode and control means during a cache update cycle to switch to said predetermined state when said RIP memory means signals that a portion of the contents of said data store location which has been preallocated is being updated to maintain coherency prior to the receipt of the requested data and said control means being operative to condition said data store means for transferring the most recent version of said requested data to said processing unit.

2. The cache unit of claim 1 wherein said directory store includes means for generating a plurality of hit signals indicat-ing whether or not the requested data is stored in said cache data store means.

3. The cache unit of claim 2 wherein said RIP memory means is included in said first stage, said signals from said decode and control means conditioning said location specified by said second address portion of said cache address during said alloca-tion cycle to store level number signals coded to specify the level in which said location has been preallocated together with a signal indicating that a memory read request is in process.

4. The cache unit of claim 3 wherein said decode and control means includes programmable logic array (PLA) circuit means coupled to said RIP memory means and to said means for generating, said PLA circuit means being operative upon detecting a match between said plurality of hit signals, said level number signals and said signal indicating that said memory read request is in process to generate an output hit control signal for switch-ing said control means to said predetermined state.

5. The cache unit of claim 2 wherein said second cache stage further includes input data register means coupled to said system bus, to said data store means and to said processing unit, for receiving data from said bus to be written into said data store means and to be transferred to said processing unit in response to said requests and said control means being coupled to said input data register means, said control means being opera-tive to selectively condition said input data register means and said data store means for transferring the most recent coherent version of said requested data to said processing unit.

6. The cache unit of claim 2 wherein said RIP memory means is included in said second stage, said RIP memory being organized into the same number of levels of said groups of locations as in said directory store means, and each location containing at least one bit position, said signals conditioning said bit location specified by certain ones of said signals and said second address portion of said cache address during said allocation cycle to store a signal indicating that a memory read request is in process.

7. The cache unit of claim 1 wherein said cache unit fur-ther includes FIFO listener means coupled to said system bus and to said first stage and to said decode and control means, said FIFO listener means being operative in response to each memory write request applied to said system bus by any other processing unit and acknowledged by said main memory to store and subsequently present said first and second portions of said each memory write request to said directory store means and to said RIP memory means during said update cycle thereby synchronizing the receipt of asynchronously generated requests from said bus with the cycling of said cache unit.

8. The cache unit of claim 2 wherein said directory store means includes:
an even directory memory having a plurality of locations for storing a plurality of even addresses; and an odd directory memory having a plurality of locations for storing a plurality of odd addresses; and wherein said buffer memory means includes:

an even buffer memory having a plurality of storage locations associated with a different one of a plurality of even addresses; and an odd buffer memory having a plurality of storage locations associated with a different one of a plurality of odd addresses; and said RIP memory means includes:
an even RIP memory; and an odd RIP memory associated with said even directory memory and said odd directory memory respectively when included in said first cache stage and said even buffer memory and said odd buffer memory respectively when included in said second cache stage.

9. A multiprocessing system comprising a plurality of data processing subsystems and at least one main memory subsystem coupled in common to an asynchronous system bus, each data pro-cessing subsystem including a plurality of processing units, each processing unit being operative to generate memory requests for data, each request including an address; and a pipelined cache memory subsystem coupled to each of said processing units for receiving said data requests, said cache subsystem comprising:
input selection means for selecting a request address from one of said processing units subsystems during an allocated time slot interval;
a first pipeline cache stage coupled to said input selection means, said pipeline stage including a directory store organized into a plurality of levels containing groups of storage locations, each location for storing said first address portion of a memory read request generated by one of said processing units during said allocated time slot interval and each different group of locations within said directory store levels being acces-sed by a different one of said second address portions;
a second cache pipeline stage including a data store organized into the same number of levels of said groups of locations as in said directory store and each different group of locations within said data store levels being accessible by a different one of said second address portions during a succeeding time slot interval for transfer of the data contents to the requesting one of said processing units;
read in process (RIP) memory means included in one of said first and second cache stages, said RIP memory means including a plurality of locations, each location being accessed by a dif-ferent one of said second address portions;
decode and control means coupled to said directory store, to said RIP memory means, and to said data store, said decode and control means being operative during a cache allocation cycle in response to each request received from one of said processing units for data not stored in said data store to generate signals for placing the location specified by said second address portion in said RIP memory means in a predetermined state for identifying the data store location which has been preallocated;
and control means coupled to said RIP memory means and to said decode and control means, said control means being conditioned by said decode and control means during a cache update cycle corresponding to an unused allocated time slot interval to switch to said predetermined state when said RIP memory means signals that a portion of the contents of said data store location which has been preallocated is being updated to maintain coherency prior to the receipt of the requested data to be stored in the same data location and said control means being operative to selectively con-dition said data store for transferring the most recent version of said requested data to said requesting processing unit.

10. A cache unit for providing to a processor requesting data only the most current version of the requested data; wherein said cache unit operates in a data processing system which includes a plurality of processors coupled for communication with a main memory, wherein at least one of said processors is coupled to request and receive data from said cache unit by supplying a main memory request address having first and second parts, and wherein said cache unit includes an addressable data store for holding a data unit in each addressable location thereof and a directory store for holding a first address part in each location thereof, said data store and directory store being accessed by said second address part; said cache unit being characterized by:
an additional store, said store holding a bit for each accessible location of said data and directory stores; said additional store being accessed by said second address part;

a first control unit coupled to said data store, said directory store and said additional store and responsive to each data request received by said one processor for data not found in said data store for controlling (i) entering of the first address part of said request address in the directory store in the location of said directory store accessed by the second address part of said request address and (ii) setting to a predetermined state the bit in said additional store in the location of said additional store accessed by the second address part of said request address;
a flip-flop;
a control circuit coupled to said additional memory and enabled by the entry of a data unit in a location of said data store for which the corresponding bit in said additional store is in said predetermined state to cause said flip-flop to operate in one of the states thereof; and a second control unit coupled to said flip-flop and said data store and responsive to the state of operation if said flip-flop for controlling said data store to deliver only the most current version of data units requested by said one processor.