
WO2003088036A1 - System and method for instruction level multithreading - Google Patents

System and method for instruction level multithreading Download PDF

Info

Publication number
WO2003088036A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
processor
computer system
processing pipeline
sets
Prior art date
Application number
PCT/IB2003/001234
Other languages
French (fr)
Inventor
Selim Ben-Yedder
Narcisse Duarte De Freitas
Menno M. Lindwer
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to AU2003215845A priority Critical patent/AU2003215845A1/en
Publication of WO2003088036A1 publication Critical patent/WO2003088036A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A computer system according to the invention comprises a processor (10) which is arranged for multi-thread processing. The processor (10) includes a processing pipeline (11, 12, 13) and at least a first (20) and a second set (30) of state saving elements. The processor (10) further includes selection means (41, 42, 43) for selectably coupling one of the sets (20, 30) to the processing pipeline (11, 12, 13) and a controller (50) for controlling the selection means (41, 42, 43). The computer system according to the invention further comprises a state transfer unit (60) for transferring a state between a set (30, 20) not coupled to the processing pipeline (11, 12, 13) and a memory (70).

Description

SYSTEM AND METHOD FOR INSTRUCTION LEVEL MULTITHREADING
The invention relates to a computer system as defined in the precharacterizing portion of claim 1.
The invention further relates to a method for operating a computer system as defined in the precharacterizing portion of claim 6.

Modern processors employed in computer systems use various techniques to improve their performance. One of them is multithreading. A multithreaded computer system may contain hardware support for multiple threads of execution. The threads can be independent programs, or related execution streams of a single parallel program. Multithreading enables better performance of a computer system if the system is configured to allow the processor to continue with another thread when a delay occurs in processing the current thread, for example because a cache miss occurs and the memory, which generally has a relatively long latency, has to be accessed. Alternatively, thread switch means may be present which periodically select an active thread from a pool of available threads.

A computer system and method referred to in the opening paragraph are known from WO 00/6878. The computer system allows fast context switching between different states by selectively coupling a different set of state saving elements (e.g. registers) to the processing pipeline. After a context switch another set is selected. A disadvantage of the known computer system is that it does not provide a solution for the case that more threads than sets of state saving elements are present. Hence, to guarantee that the system is also suitable for switching between a high number of threads, a relatively high number of sets of state saving elements is necessary. This however has the disadvantage that a significant number of state saving elements is unused in the case that few threads are available.
It is a purpose of the invention to provide a computer system which is capable of rapidly switching between threads, while requiring only a limited number of sets of state saving elements. In order to achieve this purpose, the computer system according to the invention is characterized by the characterizing portion of claim 1. In the inventive computer system, the state transfer unit may transfer a state between a set of state saving elements and the memory while the processing pipeline accesses the other set. If a thread switch has occurred from a first thread using a first one of the sets of state saving elements to a second thread using a second one of the sets of state saving elements, the state transferring unit may transfer the state from the first one of the sets of state saving elements to the memory, while the processing pipeline continues with the second thread. In this way an unlimited number of threads can be handled with only two sets of state saving elements.
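Purely by way of illustration (not part of the claimed subject-matter), the following C sketch models this overlap: the pipeline executes the thread held in one set of state saving elements while a simplified "state transfer unit" saves the idle set to memory and restores the next thread's state into it. All identifiers and sizes are invented for the example, and the two activities, which overlap in time in hardware, are serialized here for simplicity.

```c
#include <stdio.h>
#include <string.h>

#define SET_SIZE    4     /* state saving elements per set (illustrative)        */
#define NUM_THREADS 5     /* more threads than sets of state saving elements     */

typedef struct { int elem[SET_SIZE]; } StateSet;

static StateSet sets[2];                          /* only two sets in hardware   */
static int set_owner[2] = { 0, 1 };               /* which thread each set holds */
static int memory[NUM_THREADS][SET_SIZE];         /* per-thread save space       */

/* The pipeline "executes" a thread: here it merely touches its state. */
static void run_pipeline(int set, int thread)
{
    sets[set].elem[0] += 1;                       /* stand-in for real work      */
    printf("pipeline: thread %d runs on set %d\n", thread, set);
}

/* State transfer unit: save the idle set's state, then restore the next
 * thread's state into the same set.  In hardware this overlaps in time
 * with run_pipeline() on the other set.                                   */
static void transfer_state(int set, int next_thread)
{
    memcpy(memory[set_owner[set]], sets[set].elem, sizeof sets[set].elem);  /* save    */
    memcpy(sets[set].elem, memory[next_thread], sizeof sets[set].elem);     /* restore */
    set_owner[set] = next_thread;
    printf("transfer unit: set %d now holds thread %d\n", set, next_thread);
}

int main(void)
{
    int active_set = 0;
    for (int t = 0; t < NUM_THREADS; ++t) {
        int idle_set = 1 - active_set;
        run_pipeline(active_set, set_owner[active_set]);
        if (t + 1 < NUM_THREADS && set_owner[idle_set] != t + 1)
            transfer_state(idle_set, t + 1);      /* prepare the next thread     */
        active_set = idle_set;                    /* thread switch: flip the set */
    }
    return 0;
}
```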
However, one or more additional sets of state saving elements could be included to save states of frequently occurring threads. An embodiment comprising at least three sets is described in claim 2. This embodiment is advantageous if a thread is interrupted very early after its start of execution. In this circumstance there was not enough time available to complete a state transfer between a set of state saving elements and the memory during execution of said thread. By completion of a state transfer is understood the saving of the 'old' state from the set to the memory and the restoring of a 'new' state from the memory to the set. By coupling a third of the sets, in which the state of a further thread is saved, the processing pipeline can immediately continue with the further thread.
In the embodiment of claim 3 the transfer of a state takes place with a low bus priority as compared to communication between the processing pipeline and the memory. In this way, even when the processing pipeline and the state transfer unit share the same memory, the active thread can be executed without being significantly delayed by intervening accesses to the bus by the state transfer unit. The embodiment of claim 4 enables a further improvement of the efficiency of the processing pipeline.
In the embodiment of the computer system according to the invention as claimed in claim 5 the processor serves as a translator for translating (converting) instruction code of a first type to instruction code of a second type suitable to be processed by a native processor. In this type of computer system it is particularly important to perform thread execution and state saving in parallel, because a complete state swapping is essential for each thread switch in such a system. The reason is that a thread switch may occur in the middle of the translation of a code of the first type. In an example the code of the first type to be translated is JAVA byte code. An overview thereof is discussed in "Implementing the JAVA Virtual Machine", by Brian Case, Microdesign Resources, March 25, 1996, pp.12-17. A typical translator requires some dozens of state saving elements for storing intermediate values, e.g. an index to the current byte code, and indexes to several tables and buffers used in the translator. In addition the translator may comprise state saving elements, i.e. in the form of registers or stack locations for saving parameters which are operated on by the instruction codes.
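For illustration only, a minimal C sketch of such translator state follows. The particular fields, the bytecode handled (iadd, 0x60 in the JVM encoding) and the output "encodings" are invented for the example; they merely show the kind of intermediate values that must be preserved across a thread switch, and that one input bytecode typically expands into several output instructions.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative translator state: the kind of intermediate values that the
 * text says must be kept in state saving elements (field names invented). */
typedef struct {
    uint32_t bytecode_index;    /* index to the current byte code             */
    uint32_t const_pool_index;  /* index into a constant-pool table           */
    uint32_t local_var_base;    /* index into a local-variable buffer         */
    uint32_t operand_sp;        /* stack pointer for operand parameters       */
} TranslatorState;

/* Translate one bytecode of the first type into instructions of the second
 * type.  The output encodings below are made up; the point is only that one
 * input bytecode expands into several output instructions.                  */
static int translate_one(TranslatorState *st, uint8_t bytecode,
                         uint32_t *out, int max_out)
{
    if (max_out < 4)
        return -1;
    int n = 0;
    switch (bytecode) {
    case 0x60:                                    /* iadd in JVM encoding     */
        out[n++] = 0x1000 | st->operand_sp;       /* load operand 1           */
        out[n++] = 0x1000 | (st->operand_sp - 1); /* load operand 2           */
        out[n++] = 0x2000;                        /* add                      */
        out[n++] = 0x3000 | (st->operand_sp - 1); /* store the result         */
        st->operand_sp -= 1;
        break;
    default:                                      /* complex bytecodes: see below */
        return -1;
    }
    st->bytecode_index += 1;
    return n;
}

int main(void)
{
    TranslatorState st = { 0, 0, 0, 2 };
    uint32_t native[8];
    int n = translate_one(&st, 0x60, native, 8);
    printf("1 bytecode expanded into %d native instructions\n", n);
    return 0;
}
```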
Such a translator essentially differs from a computing processor in that it converts instructions of a first type without carrying them out itself. As the number of generated instructions of the second type usually is greater than the number of instructions read from the memory, this leaves the state transfer unit ample time to access the bus to the memory, even when it is operating at a low priority.
A processor acting as a translator for a further processor is described in WO 99/18484. In particular this document describes a processor which is capable of refeeding a sequence of instructions after the further processor has been interrupted. The subject-matter of this PCT application is considered to be included by reference herein. Instructions of the first type may be too complex to translate them in dedicated hardware. Examples thereof are the JAVA bytecodes invokevirtual (search object classes for method and call), invokestatic (also a method call), getfield (search object classes for field data and load an object's field value onto the stack), and new (search object class and create new object accordingly). Such bytecodes may be passed to the further processor for processing by a dedicated subroutine.
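The split between hardware translation and software handling can be sketched as follows. This is an illustration only, under the assumption that the named bytecodes are trapped to a subroutine on the further processor; the opcode values are the standard JVM encodings, everything else is invented.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Standard JVM opcode values for the bytecodes named in the text. */
enum {
    OP_NEW           = 0xBB,
    OP_INVOKEVIRTUAL = 0xB6,
    OP_INVOKESTATIC  = 0xB8,
    OP_GETFIELD      = 0xB4,
};

/* Decide whether a bytecode is translated in hardware or handed to the
 * further processor as a call to a dedicated software subroutine.       */
static bool too_complex_for_hardware(uint8_t op)
{
    switch (op) {
    case OP_NEW: case OP_INVOKEVIRTUAL:
    case OP_INVOKESTATIC: case OP_GETFIELD:
        return true;                   /* needs class/field lookup            */
    default:
        return false;                  /* simple arithmetic, loads, stores... */
    }
}

int main(void)
{
    uint8_t program[] = { 0x60 /* iadd */, OP_GETFIELD, OP_NEW };
    for (unsigned i = 0; i < sizeof program; ++i) {
        if (too_complex_for_hardware(program[i]))
            printf("0x%02X -> emit call to software handler\n", program[i]);
        else
            printf("0x%02X -> translate in dedicated hardware\n", program[i]);
    }
    return 0;
}
```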
These and other aspects of the invention are described in more detail in the drawing. Therein:

Figure 1 shows a first embodiment of a computer system according to the invention,

Figure 1A shows an organization of state saving elements in state saving units and in sets of state saving elements,

Figure 2A shows in more detail a portion of the computer system of Figure 1,

Figure 2B shows in more detail a portion of Figure 2A,

Figure 2C shows in more detail another portion of Figure 2A,

Figure 3 shows a flow chart of a state transfer,

Figure 4 shows the synchronization between the processing pipeline and the state transfer unit,

Figures 5A-5D show four different embodiments of a computer system according to the invention in which the processor serves as a preprocessor for converting instruction codes,

Figures 6A-6D show four examples of implementation of the embodiment shown in Figure 5B.
Figure 1 shows a first embodiment of a computer system according to the invention. The computer system shown comprises a processor 10 which is arranged for multi-thread processing. The processor comprises a processing pipeline 11, 12, 13 and at least a first and a second set 20, 30 of state saving elements. By way of example the processing pipeline comprises a first stage 11 for fetching instructions, a second stage 12 for translating the instructions, and a third stage 13 for providing the results of the second stage 12 to an output, such as a bus. In the embodiment shown the processor comprises an instruction cache 14 via which it is coupled to the memory 70 via a communication means 90 such as a bus or a point-to-point connection. Point-to-point connections enable a fast data transfer. On the other hand, using a bus as communication means simplifies the wiring scheme. In the embodiment shown the processing pipeline comprises several stages, between which sets 20, 30, 30', 30" of state saving elements are arranged. The stages of the pipeline 11, 12, 13 pass information to each other via a selected set, e.g. 20, of the state saving elements. Additional state saving elements could be part of the sets. Although three stages 11, 12, 13 are shown, any number of stages is possible. In Figure 1 a first set 20 of state saving elements is indicated with a first F-shaped area bounded by a solid line, and a second set 30 as well as a third and a fourth set 30', 30" of state saving elements are indicated with F-shaped areas with a dashed boundary. Each of the sets of state saving elements 20, 30, 30', 30" may have a plurality of state saving elements between each pair of mutually coupled stages. By way of example the set of state saving elements 20 comprises the state saving elements 21₁, 21ₙ and 21ₘ between the pipeline stages 11 and 12. As schematically shown in Figure 1A, corresponding state saving elements in the different sets of state saving elements form state saving units. E.g. state saving unit I comprises state saving elements 21₁, 31₁, 31'₁ and 31"₁ from sets 20, 30, 30' and 30" respectively.
If the processing pipeline 11, 12, 13 has to interrupt processing a first thread, for example because of a cache miss, it can rapidly start processing another thread whose state is saved in another set, e.g. set 30, of state saving elements. To that end the computer system shown comprises selection means for selectably coupling one of the sets 20, 30, 30', 30" to the processing pipeline 11, 12, 13. This is shown in more detail in Figure 2A. For clarity, only a part of the processing pipeline including the relevant part of the sets of state saving elements and the selection means is shown therein. In Figure 2A stages 11 and 12 of the processing pipeline are shown as well as state saving units I, II and III. Each of the state saving units is capable of containing state information. Each of the state saving units comprises a plurality of state saving elements, as is shown in more detail in Figures 2B and 2C. One of these state saving elements is used during the processing of a current thread. The one or more other state saving elements are used to store information about currently inactive threads. The other state saving units II and III are preferably equivalent to state saving unit I in order to facilitate designing and manufacturing the device. The state saving units I, II and III have a first input I1 for receiving state information from a first stage 11 of the pipeline and a first output I3 to enable a second pipeline stage 12 to read out the information. The state saving units further have a second input I2 to enable restoring of information from the data memory 70 into the state saving unit, and a second output I4 to enable saving of state information from the state saving unit to the data memory 70. Furthermore the state saving units have inputs for receiving the m-valued signals SelB1 and SelB2. The state saving units further comprise an input for receiving a unit selection signal R1, R2, R3 and an input for receiving a clock signal. The unit selection signals R1, ..., Rn identify the state saving unit which should receive the data which is loaded from memory 70 during state restoring. The computer system according to the invention further comprises a state transferring unit 60 for transferring a state between a set not coupled to the processing pipeline and a memory. The state transferring unit 60 is controlled by the controller 50.
Figure 2B shows a state saving unit I in more detail. The state saving unit I comprises a first input I1 for receiving state information from a processing stage 11. A demultiplexer 41 redirects this state information to one of a set of state saving elements 21₁, 31₁, ... in response to a bank select signal SelB1. A multiplexer 43 selects one of the output signals of the state saving elements 21₁, 31₁, ... as the output signal at output I3 in response to the same bank select signal SelB1. This output signal can be read by the next processing stage 12 in the pipeline. A second input I2 is coupled to the bus 90 for receiving information which is to be restored from the memory 70. A demultiplexer 42 redirects this state information to another one of the set of state saving elements 21₁, 31₁, ... in response to a second bank select signal SelB2. A multiplexer 44 selects another one of the output signals of the state saving elements 21₁, 31₁, ... as the output signal at output I4 in response to the same bank select signal SelB2. This output I4 is coupled to the memory 70 to enable saving of state information.
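A software model of this bank selection, purely as an illustration of the data paths described above, could look as follows. The port names I1 to I4 and the signals SelB1/SelB2 follow the labels used in the text; the functions themselves are invented.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 4          /* one element per set of state saving elements    */

/* One state saving unit: the same storage cell exists once per set ("bank").
 * SelB1 picks the bank seen by the pipeline side (ports I1/I3),
 * SelB2 picks the bank seen by the memory side (ports I2/I4).                  */
typedef struct { uint32_t bank[NUM_BANKS]; } StateSavingUnit;

/* Pipeline side: demultiplexer 41 / multiplexer 43 analogue. */
static void     unit_write_from_stage(StateSavingUnit *u, int selB1, uint32_t v) { u->bank[selB1] = v; }
static uint32_t unit_read_to_stage  (const StateSavingUnit *u, int selB1)        { return u->bank[selB1]; }

/* Memory side: demultiplexer 42 / multiplexer 44 analogue. */
static void     unit_restore_from_mem(StateSavingUnit *u, int selB2, uint32_t v) { u->bank[selB2] = v; }
static uint32_t unit_save_to_mem     (const StateSavingUnit *u, int selB2)       { return u->bank[selB2]; }

int main(void)
{
    StateSavingUnit unit = { {0} };
    int selB1 = 0, selB2 = 1;     /* pipeline uses bank 0, memory side uses bank 1 */

    unit_write_from_stage(&unit, selB1, 0xCAFE);      /* stage 11 writes             */
    unit_restore_from_mem(&unit, selB2, 0xBEEF);      /* a restore runs in parallel  */

    printf("stage 12 reads 0x%X, memory save path sees 0x%X\n",
           (unsigned)unit_read_to_stage(&unit, selB1),
           (unsigned)unit_save_to_mem(&unit, selB2));
    return 0;
}
```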
A state saving element, e.g. 21₁, can read information either via the first demultiplexer 41 or via the second demultiplexer 42 when it is activated by an activation signal Clij; e.g. state saving element 21ᵢ is activated with signal Cli1.
Although Figures 2A and 2B show selection means in the form of multiplexers, other ways well known to the skilled person are possible to perform the function of the multiplexers. E.g. the same function could be implemented by a 5-1 look-up table, having the 4 outputs of the state saving elements 21₁, 31₁, ... coupled to four of its inputs, while the fifth input receives the signal SelB1. Likewise the demultiplexers 41, 42, or any other logic function could be implemented in a look-up table.
Figure 2C shows by way of example a circuit for generating the activation signals Cli1, Cli2 etc. The circuit comprises a first decoder 25 for decoding the first bank select signal SelB1 and a second decoder 26 for decoding the second bank select signal SelB2. The circuit comprises first combination means (here the AND-gates 24, 34, ...) for combining the proper unit selection signal Ri with the output signals of the second decoder 26. The circuit further comprises second combination means (the OR-gates 23, 33, ...) for combining the output signals of the first combination means 24, 34 with the outputs of the first decoder 25. The circuit further comprises third combination means (the AND-gates 22, 32, ...) for combining the outputs of the second combination means 23, 33, ... with a clock signal Cl and generating the activation signals Cli1, Cli2 etc. This gating is condensed into a single boolean expression in the sketch below. The activation signals could be generated by different means, e.g. by using a look-up table. Preferably the activation signals for each of the state saving elements are generated in the same way, so as to facilitate the design. In the embodiment shown in Figure 1, the processor 10 has four sets of state saving elements, i.e. 20, 30, 30', 30". The control unit 60 is capable of coupling a third of the sets, e.g. 30', to the processing pipeline 11, 12, 13 upon detecting that a thread using a first of the sets, e.g. 20, finishes execution before the state transfer between a second of the sets, e.g. 30, and the memory is complete. In the embodiment of Figure 2B the state saving unit comprises 4 state saving elements, each forming part of a set of state saving elements. However, any number is in principle possible. A higher number of sets may be favorable in applications in which thread switches may occur with a relatively high frequency due to internal processor events, such as cache misses. In the embodiment shown in Figure 1 the computer system comprises a further processor 80. The processor 10 serves as a preprocessor for converting instruction code of a first type to instruction code of a second type suitable to be processed by the further processor 80. While the processing pipeline, coupled to a first set of state saving elements, e.g. set 20, processes a thread, the state transfer unit 60 performs a state transfer between the memory and e.g. the second set 30 of state saving elements.
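Condensing the decoders and gates of Figure 2C, the activation signal for bank j of state saving unit i reduces to one boolean expression. The following sketch is illustrative only; the signal names follow the text, everything else is invented.

```c
#include <stdbool.h>
#include <stdio.h>

/* Activation signal Clij for bank j of state saving unit i, built from the
 * gates described for Figure 2C:
 *   dec25 = 1 when SelB1 selects bank j   (first decoder 25)
 *   dec26 = 1 when SelB2 selects bank j   (second decoder 26)
 *   AND-gates 24, 34: Ri AND dec26        (restore path, unit-selective)
 *   OR-gates  23, 33: dec25 OR (Ri AND dec26)
 *   AND-gates 22, 32: clock Cl AND the above                              */
static bool activation(int selB1, int selB2, bool Ri, bool Cl, int j)
{
    bool dec25 = (selB1 == j);
    bool dec26 = (selB2 == j);
    return Cl && (dec25 || (Ri && dec26));
}

int main(void)
{
    /* The pipeline writes bank 0 of every unit; the memory restores bank 1
     * of unit i only (Ri asserted), on the clock pulse.                    */
    for (int j = 0; j < 2; ++j)
        printf("Cli%d = %d\n", j + 1,
               activation(/*selB1=*/0, /*selB2=*/1, /*Ri=*/true, /*Cl=*/true, j));
    return 0;
}
```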
The state transfer process is shown schematically in Figure 3.
In a first step S1 an internal counter 61 of the state transfer unit 60 is initialized. The count CNT registered in this counter 61 is indicative of a particular state saving element in a set of state saving elements. The counter 61 may for example be initialized at 1, corresponding to the first state saving element 21₁ of a set 20 of state saving elements. In step S2 it is verified whether the bus 90 to the memory 70 is available. As long as this is not the case step S2 is repeated. As soon as the bus 90 becomes available, the address in the memory 70 corresponding to the selected state saving element of the set which is to be saved is provided to the bus in step S3. The address is calculated in this embodiment by adding the value CNT to an offset value which depends on the thread for which the state is saved. In an embodiment the offset value is saved in a register by the processor, the address indicating the beginning of the thread save space. The signal of the counter CNT is provided to a multiplexer 62 in step S4 so that an input of the multiplexer 62 is enabled which is coupled to said first state saving element, e.g. 21₁ of set 20. Then the data stored in the state saving element is written to the bus 90. In step S5 the transfer unit waits until a bus acknowledge indicates that the data is stored in memory 70. In step S6 it is checked whether the content of each of the state saving elements has been saved to the memory 70. If this is not the case, the count CNT is incremented in step S6a and the loop is repeated from step S2.
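Steps S1 to S6a can be summarized in software as follows. This is an illustrative sketch in which the bus handshake is reduced to simple function calls; all identifiers are invented.

```c
#include <stdbool.h>
#include <stdio.h>

#define SET_SIZE 4                       /* state saving elements per set       */

static unsigned memory[64];              /* shared data memory (stand-in)       */
static unsigned set20[SET_SIZE] = { 11, 22, 33, 44 };   /* the set to be saved  */

static bool bus_available(void)                     { return true; }   /* S2    */
static void bus_write(unsigned addr, unsigned data) { memory[addr] = data; } /* S3-S5 */

/* Steps S1..S6a: save every state saving element of one set to memory.
 * 'offset' marks the beginning of the thread's save space.               */
static void save_state(const unsigned *set, unsigned offset)
{
    for (unsigned cnt = 1; cnt <= SET_SIZE; ++cnt) {    /* S1, S6, S6a          */
        while (!bus_available())                        /* S2                   */
            ;
        unsigned addr = offset + cnt;                   /* S3: offset + CNT     */
        bus_write(addr, set[cnt - 1]);                  /* S4 (mux 62), S5      */
    }
}

int main(void)
{
    save_state(set20, /*offset for this thread*/ 16);
    printf("memory[17] = %u\n", memory[17]);   /* element 1 of the saved set    */
    return 0;
}
```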
If the total content of the set of state saving elements is saved, the state transfer unit 60 starts to load the state corresponding to another thread into the set. To that end the counter is initialized again in step S7 and the signal WR is set. In step S8 the output signal CNT addresses a decoder 63 which is now enabled by the signal WR. One of the output signals, e.g. Ri, of the decoder 63 corresponds to a state saving unit having the same index i, which is selected. If for example the value of SelB2 is "1" and the signal Ri is activated, the signal Cli2 for the state saving element 31ᵢ will be enabled at the next clock pulse Cl. In step S9 it is checked whether the bus is available. If this is the case then an address is provided in step S10 in the same way as in step S4. In step S10a the transfer unit waits until a bus acknowledge indicates that the memory 70 has the data available at the bus 90. In step S11 the data available at the bus 90 is transferred via the demultiplexer 42 into the state saving element, e.g. 31ᵢ, which is enabled by the signal Cli2 from the circuit shown in Figure 2C. In step S12 it is verified whether the state restore operation is complete. If this is not the case, the count CNT is incremented in step S13 and the loop is repeated from step S8. If the state restore operation is complete then the WR signal is reset in step S14.
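The restore phase, steps S7 to S14, mirrors the save phase. The sketch below is again illustrative only; the decoder 63, the Ri signals and the Cli2 activation are reduced to a plain array index.

```c
#include <stdbool.h>
#include <stdio.h>

#define SET_SIZE 4

/* Thread save space starting at offset 16 (illustrative contents). */
static unsigned memory[64] = { [17] = 101, [18] = 102, [19] = 103, [20] = 104 };
static unsigned set30[SET_SIZE];                 /* destination bank (SelB2)   */

static bool     bus_available(void)      { return true; }          /* S9       */
static unsigned bus_read(unsigned addr)  { return memory[addr]; }  /* S10, S10a */

/* Steps S7..S14: load another thread's state from memory into the set. */
static void restore_state(unsigned *set, unsigned offset)
{
    for (unsigned cnt = 1; cnt <= SET_SIZE; ++cnt) {   /* S7, S8, S12, S13     */
        while (!bus_available())                       /* S9                   */
            ;
        unsigned addr = offset + cnt;                  /* S10                  */
        set[cnt - 1] = bus_read(addr);                 /* S10a, S11            */
    }
    /* S14: reset of the WR signal (nothing to do in this sketch). */
}

int main(void)
{
    restore_state(set30, 16);
    printf("set30[0..3] = %u %u %u %u\n", set30[0], set30[1], set30[2], set30[3]);
    return 0;
}
```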
In the embodiment described here the state transfer unit 60 first saves the state stored in a set of state saving elements to the memory and subsequently loads a state corresponding to another thread into the set of state saving elements. Alternatively, the state transfer unit 60 could load a new value into a state saving element before it starts to save the value of a next state saving element. Any order is allowed as long as a value in a state saving element is not overwritten before it is saved. The memory 70 may be used exclusively by the state transfer unit 60.
Preferably, however, the memory 70 is shared with other modules. In the embodiment shown, the state transfer unit shares the bus 90 to the memory 70 with the cache 14 of the processor 10. This allows for a smaller hardware implementation. It further allows for a flexible use of the memory 70. If for example an application has relatively few threads, then a relatively larger amount of the memory can be used by the other module, e.g. the processing pipeline. In such an embodiment the transferring of a state preferably takes place with a low bus priority as compared to communication between the processing pipeline and the memory.
The priority for accessing the memory 70 may be determined dynamically, on a per-access basis, i.e. for each request to communicate with the memory. On the other hand the priority may have a fixed value for each bus agent.
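A fixed-priority arbitration of the kind suggested here can be sketched as follows. This is an illustration only; the two-agent model and all names are invented for the example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Fixed-priority bus arbitration: the processing pipeline always wins over
 * the state transfer unit, so state transfers use only otherwise idle cycles. */
enum agent { NONE, PIPELINE, STATE_TRANSFER_UNIT };

static enum agent arbitrate(bool pipeline_req, bool stu_req)
{
    if (pipeline_req) return PIPELINE;            /* high priority              */
    if (stu_req)      return STATE_TRANSFER_UNIT; /* low priority               */
    return NONE;
}

int main(void)
{
    /* Per-cycle requests: the state transfer unit only gets the bus in the
     * cycles where the pipeline does not request it (cycles 2 and 4 here).   */
    bool pipe_req[5] = { true, true, false, true, false };
    bool stu_req [5] = { true, true, true,  true, true  };

    for (int cycle = 0; cycle < 5; ++cycle)
        printf("cycle %d: bus granted to agent %d\n",
               cycle, arbitrate(pipe_req[cycle], stu_req[cycle]));
    return 0;
}
```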
In a practical embodiment the thread switch frequency is in the order of 1 kHz. The state is determined by about 25 state saving elements of 32 bits. Saving a state and restoring another state requires about 250 cycles. In a machine operating at 80 MHz this corresponds to 6 μs. However, the minimum time required for state saving and restoring is usually a factor 2 to 3 higher because system buses and memory usually run at a reduced clock speed. In a computer system with only one set of state saving elements this required time would noticeably delay the conversion process. However, in the computer system according to the invention the state transfer can take place while the processing pipeline is processing a thread, so it does not hamper the latter process. It is even possible to allow the state transfer to take place at a pace which is about 50 times slower than the maximum transfer speed. This makes it possible to execute the state transfer while using a common memory with a low priority as compared to the processing pipeline. The state transfer therefore can take place to/from said common memory in cycles not used by the processing pipeline for processing the active thread. In this way the operation of the processing pipeline is not hampered by the state transfer unit.
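As a rough check of these figures (the exact cycle accounting is not given in the text; one reading that reproduces the stated 6 μs is that the save and the restore each take on the order of 250 bus cycles):

$$
t_{\text{swap}} \approx \frac{(250 + 250)\ \text{cycles}}{80\ \text{MHz}}
= \frac{500}{8 \times 10^{7}\ \text{s}^{-1}}
= 6.25\ \mu\text{s} \approx 6\ \mu\text{s},
\qquad
50 \times 6.25\ \mu\text{s} = 312.5\ \mu\text{s} < \frac{1}{1\ \text{kHz}} = 1\ \text{ms}.
$$

Under this assumption, even a state transfer slowed down by a factor of about 50 still fits comfortably within the 1 ms period implied by a 1 kHz thread switch frequency.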
The controller 50 shown in Figure 1 may autonomously select a thread to be processed by the processor 10. Such an autonomous selection may be realized by a timer which periodically selects a new thread each time a certain time interval has elapsed. Alternatively, the controller 50 may initiate a thread switch in response to signals I1, I2 from outside, e.g. from the processing pipeline 11, 12, 13, for example a signal that the processing pipeline 11, 12, 13 is delayed by a cache miss.
By way of example, Figure 4 shows a case wherein four threads T1, T2, T3, T4 are processed periodically. The activities of the state transfer unit 60 are schematically indicated by the lower bar referred to with STU in the figure. The activities of the processing pipeline 11, 12, 13 are symbolized with the upper bar referred to with PP. At point t1 in time the processing pipeline starts processing thread T1 using set S1 (e.g. 20) of state saving elements. Meanwhile the state transfer unit 60 loads the state corresponding to thread T2 from memory section M2 to set S2 of state saving elements. At t2 the processing pipeline 11, 12, 13 continues with processing thread T2 using set S2 (e.g. 30) of state saving elements. In a time interval from t2 to t2' the state transfer unit 60 first saves the state stored in set S1 to memory section M1 assigned to thread T1. In the time interval from t2' to t3 it loads a state stored in memory section M3 assigned to thread T3 into the set S1. At t3 the processing pipeline 11, 12, 13 starts processing thread T3 using set S1 of state saving elements, in which a previous state of thread T3 was restored in the time interval t2'-t3. In this way every thread is processed. At t5 the processing pipeline continues to process thread T1 again.
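The steady-state schedule of Figure 4 can be reproduced by the following small simulation. It is illustrative only; at the very first time point the figure shows only a load into S2, whereas the sketch already applies the steady-state save-then-restore pattern.

```c
#include <stdio.h>

#define NUM_THREADS 4          /* T1..T4, processed periodically (Figure 4)     */

int main(void)
{
    /* Which thread's state each set currently holds (0-based: 0 means T1).
     * Start of steady state: S1 holds T1, S2 still holds the previous T4.      */
    int set_holds[2] = { 0, NUM_THREADS - 1 };
    int active_set = 0;

    for (int t = 1; t <= 8; ++t) {             /* time points t1, t2, ...       */
        int thread = set_holds[active_set];    /* thread the pipeline runs      */
        int idle   = 1 - active_set;
        int next   = (thread + 1) % NUM_THREADS;  /* thread the STU prepares    */

        printf("t%d: PP runs T%d on S%d | STU: save T%d to M%d, restore T%d from M%d into S%d\n",
               t, thread + 1, active_set + 1,
               set_holds[idle] + 1, set_holds[idle] + 1,
               next + 1, next + 1, idle + 1);

        set_holds[idle] = next;                /* transfer completes before the next time point */
        active_set = idle;                     /* thread switch at the next time point          */
    }
    return 0;
}
```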
It is noted that the skilled person may consider many variations to the computer system described herein. A plurality of the computer systems described herein may be combined to form a parallel computer system, in which threads are computed not only sequentially but also in parallel.
Figures 5A-5D show some examples of embodiments of the computer system according to the invention wherein the processor serves as a preprocessor for converting instruction code of a first type to instruction code of a second type suitable to be processed by the further processor. Figure 5A schematically again shows the architecture of Figure 1. In this case the processor 10 directly communicates its translated instructions to the further processor 80. This is advantageous in that the bus 90 is not loaded with the translated instructions, and remains available for transfer of instructions of the first type and for state transfers. Figure 5B shows an alternative embodiment, wherein the processor 10 and the further processor 80 are integrated. This embodiment is described in some more detail in Figures 6A-6D.
In the embodiment of Figure 5C the processor 10 is attached as a peripheral to the bus 90 of the further processor 80. This has the advantage of a simple architecture. The bus 90, however, is loaded with the translated instruction stream. When using such an architecture, the bus 90 should preferably be a fast on-chip bus.
Figure 5D shows an embodiment in which the memory 70 is coupled to the bus 90 via the processor 10. This embodiment is less suitable in that for every memory read issued by the further processor 80, the processor 10 has to decide whether it has to generate the data itself, or retrieve the data from memory 70. Figures 6A-6D show in more detail four examples of an embodiment in which the processor 10 is integrated with a further processor 80. The further processor shown here is a RISC CPU with a pipeline 83, instruction cache 82, and bus interface 81 indicated by solid lines. Besides these modules, such CPUs also contain a data cache, a register file, a write back buffer, etc. Those components are not shown in Figures 6A-6D, however, since they would unnecessarily complicate the figures.
Preferably the further processor 80 is coupled to the bus 90 via a bus wrapper 95, such as defined by the Virtual Socket Interface Alliance (VSIA) in its Virtual Component Interface (VCI) proposals. VCI introduces bus wrappers in order to allow bus agents (such as CPUs) to abstract from the actual on-chip bus. The wrappers translate (proprietary) bus protocols to a standardised point-to-point (P2P) protocol. The P2P protocol reduces protocol overhead if no bus is placed in between (when connecting the processor 10 to the bus wrapper and the bus wrapper to the further processor 80).
The examples shown in Figures 6A-6D are arranged in increasing level of integration between the processor 10 and the further processor 80. The performance level is expected to increase as the integration level increases. Embodiments having a low integration level, however, have the advantage of a low design complexity.
The example of Figure 6A, in which the processor 10 is directly coupled to the bus wrapper 95, enables a performance increase of 15 to 20% as compared to the embodiment shown in Figure 5A, as a result of the reduced bus overhead when transferring translated instructions to the further processor 80.
Figure 6B shows an example wherein the processor 10 is coupled to the bus interface 81 of the further processor 80. In practice, however, the bus interface and bus wrapper will be integrated, so that this embodiment does not substantially differ from the embodiment of Figure 6A.
The embodiment of Figure 6C, wherein the processor 10 is coupled to the bus 90 via the bus interface 81 of the further processor, is advantageous in that the processor 10 does not need its own bus interface. A particularly advantageous embodiment is shown in Figure 6D. Therein the processor 10 retrieves the instruction code of the first type from the instruction cache 82 of the further processor 80 and writes the translated instructions directly to the pipeline 83 of the further processor 80. This embodiment is advantageous in that it allows the processor 10 to utilize the instruction cache 82 of the further processor 80. It is remarked that the scope of protection of the invention is not restricted to the embodiments described herein. Neither is the scope of protection of the invention restricted by the reference numerals in the claims. The word 'comprising' does not exclude other parts than those mentioned in a claim. The word 'a(n)' preceding an element does not exclude a plurality of those elements. Means forming part of the invention may be implemented both in the form of dedicated hardware and in the form of a programmed general purpose processor. The invention resides in each new feature or combination of features.

Claims

CLAIMS:
1. Computer system comprising a processor (10) which is arranged for multi-thread processing, the processor comprising
- a processing pipeline (11, 12, 13) and at least a first (20) and a second set (30) of state saving elements,
- selection means (41, 42, 43) for selectably coupling one of the sets (20, 30) to the processing pipeline (11, 12, 13),
- a controller (50) for controlling the selection means (41, 42, 43),
characterized in that the computer system further comprises a state transfer unit (60) for transferring a state between a set (30, 20) not coupled to the processing pipeline (11, 12, 13) and a memory (70).
2. Computer system according to claim 1, characterized by at least three sets of state saving elements, wherein the controller (50) is capable of coupling a third of the sets (30') to the processing pipeline (11, 12, 13) upon detecting that a thread using a first of the sets (20) finishes execution before the state transfer between a second (30) of the sets is complete.
3. Computer system according to claim 1, characterized in that the transferring of a state takes place with a low bus priority as compared to communication between the processing pipeline (11, 12, 13) and the memory (70).
4. Computer system according to claim 3, wherein the processing pipeline (11, 12, 13) is capable of interrupting the state transfer unit (60).
5. Computer system according to one of the previous claims including a further processor (80), and wherein the processor (10) serves as a preprocessor for converting instruction code of a first type to instruction code of a second type suitable to be processed by the further processor (80).
6. Method for operating a computer system comprising a processor (10) which is arranged for multi-thread processing, the processor (10) comprising a processing pipeline (11, 12, 13) and at least a first (20) and a second set (30) of state saving elements, according to which method one of the sets (20, 30) is selectably coupled to the processing pipeline (11, 12, 13), characterized in that a state is transferred between a set (30, 20) not coupled to the processing pipeline (11, 12, 13) and a memory (70).
7. Method for operating a computer system according to the previous claims, the computer system comprising at least three sets of state saving elements (20, 30, 30'), wherein the controller (50), upon detecting that a thread using a first of the sets (20) finishes execution before the state transfer between a second (30) of the sets is complete, is capable of coupling a third (30') of the sets to the processing pipeline (11, 12, 13).
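Purely by way of illustration of the behaviour recited in claims 2 and 7, the following sketch models a control unit that, when the running thread finishes while the state of a second set is still being transferred to or from the memory, couples a third set whose state is already in place. All identifiers and scheduling details are hypothetical and form no part of the claims.

    #include <stdio.h>

    #define NUM_SETS 3

    /* Per-set status: READY means the set's state is in place and the set may
       be coupled; TRANSFERRING means the state transfer unit is still moving
       its state to or from the memory; COUPLED means it feeds the pipeline. */
    typedef enum { SET_READY, SET_COUPLED, SET_TRANSFERRING } set_status;

    static set_status sets[NUM_SETS] = { SET_COUPLED, SET_TRANSFERRING, SET_READY };

    /* Called when the thread using the currently coupled set finishes.
       Returns the index of the set coupled to the pipeline next, or -1 if
       no set is ready and the pipeline has to wait. */
    static int on_thread_finished(int coupled)
    {
        sets[coupled] = SET_TRANSFERRING;   /* its state must now be saved */
        for (int i = 0; i < NUM_SETS; ++i)
            if (sets[i] == SET_READY) {
                sets[i] = SET_COUPLED;
                return i;
            }
        return -1;
    }

    int main(void)
    {
        /* The thread on set 0 finishes while set 1 is still mid-transfer:
           the third set (index 2) is coupled instead of waiting for set 1. */
        printf("next coupled set: %d\n", on_thread_finished(0));
        return 0;
    }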

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003215845A AU2003215845A1 (en) 2002-04-12 2003-03-27 System and method for instruction level multithreading

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP02290935.2 2002-04-12
EP02290935 2002-04-12

Publications (1)

Publication Number Publication Date
WO2003088036A1 (en) 2003-10-23

Family

ID=29225734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2003/001234 WO2003088036A1 (en) 2002-04-12 2003-03-27 System and method for instruction level multithreading

Country Status (2)

Country Link
AU (1) AU2003215845A1 (en)
WO (1) WO2003088036A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778243A (en) * 1996-07-03 1998-07-07 International Business Machines Corporation Multi-threaded cell for a memory
US6006293A (en) * 1998-04-21 1999-12-21 Comsat Corporation Method and apparatus for zero overhead sharing for registered digital hardware
WO2000045258A1 (en) * 1999-01-27 2000-08-03 Xstream Logic, Inc. Register transfer unit for electronic processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HASKINS J W ET AL: "INEXPENSIVE THROUGHPUT ENHANCEMENT IN SMALL-SCALE EMBEDDED MICROPROCESSORS WITH BLOCK MULTITHREADING: EXTENSIONS, CHARACTERIZATION AND TRADEOFFS", CONFERENCE PROCEEDINGS OF THE 2001 IEEE INTERNATIONAL PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE. (IPCCC). PHOENIX, AZ, APRIL 4 - 6, 2001, IEEE INTERNATIONAL PERFORMANCE, COMPUTING AND COMMUNICATIONS CONFERENCE, NEW YORK, NY: IEEE, US, vol. CONF. 20, 4 April 2001 (2001-04-04), pages 319 - 328, XP001049966, ISBN: 0-7803-7001-5 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747989B1 (en) 2002-08-12 2010-06-29 Mips Technologies, Inc. Virtual machine coprocessor facilitating dynamic compilation
US9207958B1 (en) 2002-08-12 2015-12-08 Arm Finance Overseas Limited Virtual machine coprocessor for accelerating software execution
US10055237B2 (en) 2002-08-12 2018-08-21 Arm Finance Overseas Limited Virtual machine coprocessor for accelerating software execution
US11422837B2 (en) 2002-08-12 2022-08-23 Arm Finance Overseas Limited Virtual machine coprocessor for accelerating software execution
FR2864660A1 (en) * 2003-12-30 2005-07-01 St Microelectronics Sa Logic or arithmetic data processing device for computer, has controller controlling transfer of processing threads between RAM memory and one of two stations when multithread processor processes threads on another station
US7424638B2 (en) 2003-12-30 2008-09-09 Stmicroelectronics S.A. Multipath processor with dedicated buses
EP1703375A2 (en) * 2005-03-18 2006-09-20 Marvell World Trade Ltd Real-time control apparatus having a multi-thread processor
EP1703377A2 (en) * 2005-03-18 2006-09-20 Marvell World Trade Ltd Multi-threaded processor
EP1703377A3 (en) * 2005-03-18 2007-11-28 Marvell World Trade Ltd Multi-threaded processor
EP1703375A3 (en) * 2005-03-18 2011-05-04 Marvell World Trade Ltd. Real-time control apparatus having a multi-thread processor
US8195922B2 (en) 2005-03-18 2012-06-05 Marvell World Trade, Ltd. System for dynamically allocating processing time to multiple threads
US8468324B2 (en) 2005-03-18 2013-06-18 Marvell World Trade Ltd. Dual thread processor

Also Published As

Publication number Publication date
AU2003215845A1 (en) 2003-10-27

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP