
GB2351627A - Image processing apparatus - Google Patents


Info

Publication number
GB2351627A
GB2351627A (application GB9907103A)
Authority
GB
United Kingdom
Prior art keywords
data
person
looking
people
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9907103A
Other versions
GB9907103D0 (en)
GB2351627B (en)
Inventor
Simon Michael Rowe
Michael James Taylor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB9907103A priority Critical patent/GB2351627B/en
Publication of GB9907103D0 publication Critical patent/GB9907103D0/en
Priority to US09/532,533 priority patent/US7117157B1/en
Priority to JP2000086807A priority patent/JP4474013B2/en
Publication of GB2351627A publication Critical patent/GB2351627A/en
Application granted granted Critical
Publication of GB2351627B publication Critical patent/GB2351627B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Image data showing the movements of a number of people, for example in a meeting, and sound data defining the words spoken by the people are processed by a computer processing apparatus 24 to archive the data in a meeting archive database 60. The image data is processed to determine the three-dimensional position and orientation of each person's head and to determine at whom each person is looking. Processing is carried out to determine who is speaking by determining at which person most people are looking. Having determined which person is speaking, the personal speech recognition parameters for that person are selected and used to convert the sound data to text data. The image data, sound data, text data and data defining at whom each person is looking are stored in the meeting archive database 60.

Description

IMAGE PROCESSING APPARATUS

The present invention relates to the processing of image data and sound data to generate data to assist in archiving the image and sound data.
Many databases exist for the storage of data. However, the existing databases suffer from the problem that the ways in which the database can be interrogated to retrieve information therefrom are limited.
The present invention has been made with this problem in mind.
According to the present invention, there is provided an apparatus or method in which image data is processed to determine which person in the images is speaking by determining which person has the attention of the other people in the image, and sound data is processed to generate text data corresponding to the words spoken by the person using processing parameters selected in dependence upon the speaking participant identified by processing the image data.
The present invention also provides an apparatus or method in which image data is processed to determine at whom each person in the images is looking and to determine which of the people is speaking based thereon, and sound data is processed to perform speech recognition for the speaking participant.
In this way, the speaking participant can be readily identified to enable the sound data to be processed.
The present invention further provides an apparatus or method for processing image data in such a system.
The present invention further provides instructions, including in signal and recorded form, for configuring a programmable processing apparatus to become arranged as an apparatus, or to become operable to perform a method, in such a system.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 illustrates the recording of sound and video data from a meeting between a plurality of participants;

Figure 2 is a block diagram showing an example of notional functional components within a processing apparatus in an embodiment;

Figure 3 shows the processing operations performed by processing apparatus 24 in Figure 2 prior to the meeting shown in Figure 1 between the participants starting;

Figure 4 schematically illustrates the data stored in meeting archive database 60 at step S4 in Figure 3;

Figure 5 shows the processing operations performed at step S34 in Figure 3;

Figure 6 shows the processing operations performed by processing apparatus 24 in Figure 2 while the meeting between the participants is taking place;

Figure 7 shows the processing operations performed at step S72 in Figure 6;

Figure 8 shows the processing operations performed at step S80 in Figure 7;

Figure 9 illustrates the viewing ray for a participant used in the processing performed at step S114 in Figure 8;

Figure 10 illustrates the angles calculated in the processing performed at step S114 in Figure 8;

Figure 11 shows the processing operations performed at step S84 in Figure 7;

Figure 12 schematically illustrates the storage of information in the meeting archive database 60;

Figures 13A and 13B show examples of viewing histograms defined by data stored in the meeting archive database 60;

Figure 14 shows the processing operations performed by processing apparatus 24 to retrieve information from the meeting archive database 60;

Figure 15A shows the information displayed to a user at step S200 in Figure 14;

Figure 15B shows an example of information displayed to a user at step S204 in Figure 14; and

Figure 16 schematically illustrates an embodiment in which a single database stores information from a plurality of meetings and is interrogated from one or more remote apparatus.
Referring to Figure 1, a video camera 2 and one or more microphones 4 are used to record image data and sound data respectively from a meeting taking place between a group of people 6, 8, 10, 12.
The image data from the video camera 2 and the sound data from the microphones 4 are input via cables (not shown) to a computer 20, which processes the received data and stores data in a database to create an archive record of the meeting from which information can subsequently be retrieved.
Computer 20 comprises a conventional personal computer having a processing apparatus 24 containing, in a conventional manner, one or more processors, memory, sound card etc., together with a display device 26 and user input devices, which, in this embodiment, comprise a keyboard 28 and a mouse 30.
The components of computer 20 and the input and output of data therefrom are schematically shown in Figure 2.
Referring to Figure 2, the processing apparatus 24 is programmed to operate in accordance with programming instructions input, for example, as data stored on a data storage medium, such as disk 32, and/or as a signal 34 input to the processing apparatus 24, for example from a remote database, by transmission over a communication network (not shown) such as the Internet or by transmission through the atmosphere, and/or entered by a user via a user input device such as keyboard 28 or other input device.
When programmed by the programming instructions, processing apparatus 24 effectively becomes configured into a number of functional units for performing processing operations. Examples of such functional units and their interconnections are shown in Figure 2. The illustrated units and interconnections in Figure 2 are, however, notional and are shown for illustration purposes only, to assist understanding; they do not necessarily represent the exact units and connections into which the processor, memory etc of the processing apparatus 24 become configured.
Referring to the functional units shown in Figure 2, a central controller 36 processes inputs from the user input devices 28, 30 and receives data input to the processing apparatus 24 by a user as data stored on a storage device, such as disk 38, or as a signal 40 transmitted to the processing apparatus 24. The central controller 36 also provides control and processing for a number of the other functional units. Memory 42 is provided for use by central controller 36 and other functional units.
Head tracker 50 processes the image data received from video camera 2 to track the position and orientation in three dimensions of the head of each of the participants 6, 8, 10, 12 in the meeting. In this embodiment, to perform this tracking, head tracker 50 uses data defining a three-dimensional computer model of the head of each of the participants and data defining features thereof, which is stored in head model store 52, as will be described below.
Voice recognition processor 54 processes sound data received from microphones 4. Voice recognition processor 54 operates in accordance with a conventional voice recognition program, such as "Dragon Dictate" or IBM "ViaVoice", to generate text data corresponding to the words spoken by the participants 6, 8, 10, 12. To perform the voice recognition processing, voice recognition processor 54 uses data defining the speech recognition parameters for each participant 6, 8, 10, 12, which is stored in speech recognition parameter store 56.
More particularly, the data stored in speech recognition parameter store 56 comprises data defining the voice profile of each participant, which is generated by training the voice recognition processor in a conventional manner. For example, the data comprises the data stored in the "user files" of Dragon Dictate after training.
Archive processor 58 generates data for storage in meeting archive database 60 using data received from head tracker 50 and voice recognition processor 54. More particularly, as will be described below, the video data from camera 2 and sound data from microphones 4 is stored in meeting archive database 60 together with text data from voice recognition processor 54 and data defining at whom each participant in the meeting was looking at a given time.
Text searcher 62, in conjunction with central controller 36, is used to search the meeting archive database 60 to find and replay the sound and video data for one or more parts of the meeting which meet search criteria specified by a user, as will be described in further detail below.
Display processor 64, under control of central controller 36, displays information to a user via display device 26 and also replays sound and video data stored in meeting archive database 60.
Output processor 66 outputs part or all of the data from archive database 60, for example on a storage device such as disk 68 or as a signal 70.
Before beginning the meeting, it is necessary to initialise computer 20 by entering data which is necessary to enable processing apparatus 24 to perform the required processing operations.
Figure 3 shows the processing operations performed by processing apparatus 24 during this initialisation.
Referring to Figure 3, at step S2, central controller 36 causes display processor 64 to display a message on display device 26 requesting the user to input the names of each person who will participate in the meeting.
At step S4, upon receipt of data defining the names, for example input by the user using keyboard 28, central controller 36 allocates a unique participant number to each participant, and stores data, for example table 80 shown in Figure 4, defining the relationship between the participant numbers and the participants' names in the meeting archive database 60.
At step S6, central controller 36 searches the head model store 52 to determine whether data defining a head model is already stored for each participant in the meeting.
If it is determined at step S6 that a head model is not already stored for one or more of the participants, then, at step S8, central controller 36 causes display processor 64 to display a message on display device 26 requesting the user to input data defining a head model of each participant for whom a model is not already stored.
In response, the user enters data, for example on a storage medium such as disk 38 or by downloading the data as a signal 40 from a connected processing apparatus, defining the required head models. Such head models may be generated in a conventional manner, for example as described in "An Analysis/Synthesis Cooperation for Head Tracking and Video Face Cloning" by Valente et al in Proceedings ECCV'98 Workshop on Perception of Human Action, University of Freiburg, Germany, June 6 1998.
At step S10, central controller 36 stores the data input by the user in head model store 52.
At step S12, central controller 36 and display processor 64 render each three-dimensional computer head model input by the user to display the model to the user on display device 26, together with a message requesting the user to identify at least seven features in each model.
In response, the user designates using mouse 30 points in each model which correspond to prominent features on the front, sides and, if possible, the back, of the participant's head, such as the corners of eyes, nostrils, mouth, ears or features on glasses worn by the participant, etc.
At step S14, data defining the features identified by the user is stored by central controller 36 in head model store 52.
On the other hand. if it is determined at step S6 that a head model is already stored in head model store 52 for each participant, then steps S8 to S14 are omitted.
At step S16, central controller 36 searches speech recognition parameter store 56 to determine whether speech recognition parameters are already stored for each participant.
If it is determined at step S16 that speech recognition parameters are not available for all of the participants, then, at step S18, central controller 36 causes display processor 64 to display a message on display device 26 requesting the user to input the speech recognition parameters for each participant for whom the parameters are not already stored.
In response, the user enters data, for example on a storage medium such as disk 38 or as a signal 40 from a remote processing apparatus, defining the necessary speech recognition parameters. As noted above, these parameters define a profile of the user's speech and are generated by training a voice recognition processor in a conventional manner. Thus for example, in the case of a voice recognition processor comprising Dragon Dictate, the speech recognition parameters input by the user correspond to the parameters stored in the "user files" of Dragon Dictate.
At step S20, the data input by the user is stored by central controller 36 in the speech recognition parameter store 56.
On the other hand, if it is determined at step S16 that the speech recognition parameters are already available for each of the participants, then steps S18 and S20 are omitted.
At step S22, central controller 36 causes display processor 64 to display a message on display device 26 requesting the user to perform steps to enable the camera 2 to be calibrated.
In response, the user carries out the necessary steps and, at step S24, central controller 36 performs processing to calibrate the camera 2. More particularly, in this embodiment, the steps performed by the user and the processing performed by central controller 36 are carried out in a manner such as that described in Appendix A herewith. This generates calibration data defining the position and orientation of the camera 2 with respect to the meeting room and also the intrinsic camera parameters (aspect ratio, focal length, principal point, and first order radial distortion coefficient).
The calibration data is stored in memory 42.
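The intrinsic parameters listed above fully determine how a 3D point in the meeting room maps to an image pixel. The patent defers the actual calibration procedure to Appendix A, so the following is only an illustrative sketch of the resulting camera model; the function name, argument order and sample values are invented for the example:

```python
import numpy as np

def project_point(X_world, R, t, f, aspect, cx, cy, k1):
    """Project a 3D world point into the image using a calibrated camera
    model with extrinsics (R, t) and the intrinsics named in the text:
    focal length f, aspect ratio, principal point (cx, cy) and a
    first-order radial distortion coefficient k1."""
    # World -> camera coordinates.
    Xc = R @ X_world + t
    # Perspective division onto the normalised image plane.
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]
    # First-order radial distortion.
    r2 = x * x + y * y
    x, y = x * (1 + k1 * r2), y * (1 + k1 * r2)
    # Apply focal length, aspect ratio and principal point.
    return np.array([f * x + cx, f * aspect * y + cy])
```

A point on the optical axis projects to the principal point; off-axis points are scaled by the focal length and shifted by the distortion term.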
At step S26, central controller 36 causes display processor 64 to display a message on display device 26 requesting the next participant in the meeting (this being the first participant the first time step S26 is performed) to sit down.
At step S28, processing apparatus 24 waits for a predetermined period of time to give the requested participant time to sit down, and then, at step S30, central controller 36 processes image data from camera 2 to determine an estimate of the position of the seated participant's head. More particularly, in this embodiment, central controller 36 carries out processing in a conventional manner to identify each portion in a frame of image data from camera 2 which has a colour corresponding to the colour of the skin of the participant (this colour being determined from the data defining the head model of the participant stored in head model store 52), and then selects the portion which corresponds to the highest position in the meeting room (since it is assumed that the head will be the highest skin-coloured part of the body). Using the position of the identified portion in the image and the camera calibration parameters determined at step S24, central controller 36 then determines an estimate of the three-dimensional position of the head in a conventional manner.
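The patent leaves the skin-colour classifier "conventional", so the following sketch makes an assumption: pixels are classed as skin if their RGB distance to a reference skin colour (taken from the stored head model) is below a tolerance, and the topmost skin pixel in the image stands in for "the highest position in the meeting room":

```python
import numpy as np

def estimate_head_pixel(frame, skin_rgb, tol=40.0):
    """Sketch of step S30: mark pixels close to the participant's skin
    colour, then return (row, column) of the topmost skin-coloured pixel
    as the head estimate. `skin_rgb` and `tol` are illustrative values,
    not taken from the patent."""
    dist = np.linalg.norm(frame.astype(float) - np.asarray(skin_rgb, float), axis=-1)
    rows, cols = np.nonzero(dist < tol)
    if rows.size == 0:
        return None  # no skin-coloured region found in this frame
    top = int(rows.min())  # topmost image row ~ highest point in the room
    return top, int(cols[rows == top].mean())
```

In the real system the 2D estimate would then be back-projected using the calibration data of step S24 to obtain a 3D position.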
At step S32, central controller 36 determines an estimate of the orientation of the participant's head in three dimensions. More particularly, in this embodiment, central controller 36 renders the three-dimensional computer model of the participant's head stored in head model store 52 for a plurality of different orientations of the model to produce a respective two-dimensional image of the model for each orientation, compares each two-dimensional image of the model with the part of the video frame from camera 2 which shows the participant's head, and selects the orientation for which the image of the model best matches the video image data. In this embodiment, the computer model of the participant's head is rendered in 108 different orientations to produce image data for comparing with the video data from camera 2. These orientations correspond to 36 rotations of the head model in 10° steps for each of three head inclinations corresponding to 0° (looking straight ahead), +45° (looking up) and -45° (looking down). When comparing the image data produced by rendering the head model with the video data from camera 2, a conventional technique is used, for example as described in "Head Tracking Using a Textured Polygonal Model" by Schödl, Haro & Essa in Proceedings 1998 Workshop on Perceptual User Interfaces.
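The 108-orientation search can be sketched as a brute-force maximisation over the candidate grid. Here `match_score(yaw, pitch)` is a stand-in for the render-and-compare step, which the patent delegates to the cited texture-matching technique:

```python
def estimate_orientation(match_score):
    """Sketch of step S32: enumerate the 108 candidate orientations
    (36 yaw steps of 10 degrees at pitches of 0, +45 and -45 degrees)
    and return the one whose rendering best matches the video frame,
    as measured by the supplied match_score callable."""
    candidates = [(yaw, pitch)
                  for pitch in (0.0, 45.0, -45.0)
                  for yaw in range(0, 360, 10)]
    assert len(candidates) == 108  # 36 rotations x 3 inclinations
    return max(candidates, key=lambda c: match_score(*c))
```

The coarse grid only initialises the tracker; the per-frame refinement is handled by head tracker 50 at step S34.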
At step S34, the estimate of the position of the participant's head generated at step S30 and the estimate of the orientation of the participant's head generated at step S32 are input to head tracker 50, and frames of image data received from camera 2 are processed to track the head of the participant. More particularly, in this embodiment, head tracker 50 performs processing to track the head in a conventional manner, for example as described in "An Analysis/Synthesis Cooperation for Head Tracking and Video Face Cloning" by Valente et al in Proceedings ECCV'98 Workshop on Perception of Human Action, University of Freiburg, Germany, June 6 1998.
Figure 5 summarises the processing operations performed by head tracker 50 at step S34.
Referring to Figure 5, at step S50, head tracker 50 reads the current estimates of the 3D position and orientation of the participant's head, these being the estimates produced at steps S30 and S32 in Figure 3 the first time step S50 is performed.
At step S52, head tracker 50 uses the camera calibration data generated at step S24 to render the three dimensional computer model of the participant's head stored in head model store 52 in accordance with the estimates of position and orientation read at step S50.
At step S54, head tracker 50 processes the image data for the current frame of video data received from camera 2 to extract the image data from each area which surrounds the expected position of one of the head features identified by the user and stored at step S14, the expected positions being determined from the estimates read at step S50 and the camera calibration data generated at step S24.
At step S56, head tracker 50 matches the rendered image data generated at step S52 and the camera image data extracted at step S54 to find the camera image data which best matches the rendered head model.
At step S58, head tracker 50 uses the camera image data identified at step S56 which best matches the rendered head model to determine the 3D position and orientation of the participant's head for the current frame of video data.
At the same time that step S58 is performed, at step S60, the positions of the head features in the camera image data determined at step S56 are input into a conventional Kalman filter to generate an estimate of the 3D position and orientation of the participant's head for the next frame of video data. Steps S50 to S60 are performed repeatedly for the participant as frames of video data are received from video camera 2.
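The patent only calls the predictor at step S60 "a conventional Kalman filter", so the model below is an assumption: a constant-velocity filter per tracked quantity (one coordinate of position or orientation), with invented noise levels, used to predict the value for the next frame:

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal sketch of the step S60 predictor. State is
    [value, velocity]; measurements are the per-frame feature-derived
    values from step S58. Noise levels q and r are illustrative."""
    def __init__(self, q=1e-3, r=1e-2):
        self.x = np.zeros(2)                          # [position, velocity]
        self.P = np.eye(2)                            # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])   # one frame per step
        self.Q = q * np.eye(2)                        # process noise
        self.R = r                                    # scalar measurement noise
        self.H = np.array([1.0, 0.0])                 # we observe position only

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]          # predicted value for the next frame

    def update(self, z):
        S = self.H @ self.P @ self.H + self.R         # innovation covariance
        K = (self.P @ self.H) / S                     # Kalman gain
        self.x = self.x + K * (z - self.H @ self.x)
        self.P = (np.eye(2) - np.outer(K, self.H)) @ self.P
```

Each pass through steps S50-S60 corresponds to one predict/update cycle; after a few frames the filter's prediction closely anticipates a steadily moving head.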
Referring again to Figure 3, at step S36, central controller 36 determines whether there is another participant in the meeting, and steps S26 to S36 are repeated until processing has been performed for each participant in the manner described above. However, while these steps are performed for each participant, at step S34, head tracker 50 continues to track the head of each participant who has already sat down.
When it is determined at step S36 that there are no further participants in the meeting and that accordingly the head of each participant is being tracked by head tracker 50, then, at step S38, central controller 36 causes an audible signal to be output from processing apparatus 24 to indicate that the meeting between the participants can begin.
Figure 6 shows the processing operations performed by processing apparatus 24 as the meeting between the participants takes place.
Referring to Figure 6, at step S70, head tracker 50 continues to track the head of each participant in the meeting. The processing performed by head tracker 50 at step S70 is the same as that described above with respect to step S34, and accordingly will not be described again here.
At the same time that head tracker 50 is tracking the head of each participant at step S70, at step S72 processing is performed to generate and store data in meeting archive database 60.
Figure 7 shows the processing operations performed at step S72.
Referring to Figure 7, at step S80, archive processor 58 generates a so-called "viewing parameter" for each participant defining at whom the participant is looking.
Figure 8 shows the processing operations performed at step S80.
Referring to Figure 8, at step S110, archive processor 58 reads the current three-dimensional position of each participant's head from head tracker 50, this being the position generated in the processing performed by head tracker 50 at step S58 (Figure 5).
At step S112, archive processor 58 reads the current orientation of the head of the next participant (this being the first participant the first time step S112 is performed) from head tracker 50. The orientation read at step S112 is the orientation generated in the processing performed by head tracker 50 at step S58 (Figure 5).
At step S114, archive processor 58 determines the angle between a ray defining where the participant is looking (a so-called "viewing ray") and each notional line which connects the head of the participant with the centre of the head of another participant.
More particularly, referring to Figures 9 and 10, an example of the processing performed at step S114 is illustrated for one of the participants, namely participant 10 in Figure 1. Referring to Figure 9, the orientation of the participant's head read at step S112 defines a viewing ray 90 from a point between the centre of the participant's eyes which is perpendicular to the participant's head. Similarly, referring to Figure 10, the positions of all of the participants' heads read at step S110 define notional lines 92, 94, 96 from the point between the centre of the eyes of participant 10 to the centre of the heads of each of the other participants 6, 8, 12. At step S114, archive processor 58 determines the angles 98, 100, 102 between the viewing ray 90 and each of the notional lines 92, 94, 96.
Referring again to Figure 8, at step S116, archive processor 58 selects the angle 98, 100 or 102 which has the smallest value. Thus, referring to the example shown in Figure 10, the angle 100 would be selected.
At step S118, archive processor 58 determines whether the selected angle has a value less than 10°.
If it is determined at step S118 that the angle is less than 10°, then, at step S120, archive processor 58 sets the viewing parameter for the participant to the number (allocated at step S4 in Figure 3) of the participant connected by the notional line which makes the smallest angle with the viewing ray. Thus, referring to the example shown in Figure 10, if angle 100 is less than 10°, then the viewing parameter would be set to the participant number of participant 6, since angle 100 is the angle between viewing ray 90 and notional line 94 which connects participant 10 to participant 6.

On the other hand, if it is determined at step S118 that the smallest angle is not less than 10°, then, at step S122, archive processor 58 sets the value of the viewing parameter for the participant to "0". This indicates that the participant is determined to be looking at none of the other participants, since the viewing ray 90 is not close enough to any of the notional lines 92, 94, 96. Such a situation could arise, for example, if the participant was looking at notes or some other object in the meeting room.
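Steps S114 to S122 reduce to a nearest-direction test with a 10° threshold. The sketch below assumes head positions are given as 3D coordinates keyed by participant number and that the viewing ray is available as a direction vector; both representations are choices made for the example, not specified by the patent:

```python
import numpy as np

def viewing_parameter(eye_pos, gaze_dir, head_positions, threshold_deg=10.0):
    """Sketch of steps S114-S122: compute the angle between the viewing
    ray and the notional line to each other participant's head centre,
    pick the smallest, and return that participant's number if the angle
    is under the 10-degree threshold, else 0 (looking at nobody)."""
    eye = np.asarray(eye_pos, dtype=float)
    gaze = np.asarray(gaze_dir, dtype=float)
    gaze = gaze / np.linalg.norm(gaze)
    best_num, best_angle = 0, threshold_deg
    for num, pos in head_positions.items():
        line = np.asarray(pos, dtype=float) - eye   # notional line to head
        line = line / np.linalg.norm(line)
        angle = np.degrees(np.arccos(np.clip(gaze @ line, -1.0, 1.0)))
        if angle < best_angle:
            best_num, best_angle = num, angle
    return best_num
```

Because `best_angle` starts at the threshold, only a participant within 10° of the viewing ray can replace the default value 0.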
At step S124, archive processor 58 determines whether there is another participant in the meeting, and steps S112 to S124 are repeated until the processing described above has been carried out for each of the participants.

Referring again to Figure 7, at step S82, central controller 36 and voice recognition processor 54 determine whether any speech data has been received from the microphones 4 for the current frame of video data.
If it is determined at step S82 that speech data has been received, then, at step S84, archive processor 58 processes the viewing parameters generated at step S80 to determine which of the participants in the meeting is speaking.
Figure 11 shows the processing operations performed at step S84 by archive processor 58.
Referring to Figure 11, at step S140, the number of occurrences of each viewing parameter value generated at step S80 is determined, and at step S142, the viewing parameter value with the highest number of occurrences is selected. More particularly, the processing performed at step S80 in Figure 7 will generate one viewing parameter value for the current frame of video data for each participant in the meeting (thus, in the example shown in Figure 1, four values would be generated). Each viewing parameter will have a value which corresponds to the participant number of one of the other participants or "0". Accordingly, at steps S140 and S142, archive processor 58 determines which of the viewing parameter values generated at step S80 occurs the highest number of times for the current frame of video data.
At step S144, it is determined whether the viewing parameter with the highest number of occurrences has a value of "0" and, if it has, at step S146, the viewing parameter value with the next highest number of occurrences is selected. On the other hand, if it is determined at step S144 that the selected value is not "0", then step S146 is omitted.
At step S148, the participant defined by the selected viewing parameter value (that is, the value selected at step S142 or, if this value is "0", the value selected at step S146) is identified as the participant who is speaking, since the majority of participants in the meeting will be looking at the speaking participant.
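Steps S140 to S148 amount to a majority vote over the per-participant viewing parameters, skipping the "looking at nobody" bucket. A minimal sketch (tie-breaking behaviour is not specified by the patent and here simply follows counting order):

```python
from collections import Counter

def identify_speaker(viewing_params):
    """Sketch of steps S140-S148: the speaking participant is the one at
    whom most participants are looking. The value 0 ("looking at nobody")
    is selected only if no real participant number is available."""
    counts = Counter(viewing_params.values())
    ranked = [num for num, _ in counts.most_common()]
    if ranked[0] == 0 and len(ranked) > 1:
        return ranked[1]  # step S146: skip the "0" bucket
    return ranked[0]
```

For the four-person meeting of Figure 1, if three participants look at participant 3, participant 3 is identified as the speaker.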
Referring again to Figure 7, at step S86, archive processor 58 stores the viewing parameter value for the speaking participant, that is the viewing parameter value generated at step S80 defining at whom the speaking participant is looking, for subsequent analysis, for example in memory 42.
At step S88, archive processor 58 informs voice recognition processor 54 of the identity of the speaking participant determined at step S84. In response, voice recognition processor 54 selects the speech recognition parameters for the speaking participant from speech recognition parameter store 56 and uses the selected parameters to perform speech recognition processing on the received speech data to generate text data corresponding to the words spoken by the speaking participant.
On the other hand, if it is determined at step S82 that the received sound data does not contain any speech, then steps S84 to S88 are omitted.
At step S90, archive processor 58 encodes the current frame of video data received from camera 2 and the sound data received from microphones 4 as MPEG 2 data in a conventional manner, and stores the encoded data in meeting archive database 60.
Figure 12 schematically illustrates the storage of data in meeting archive database 60. The storage structure shown in Figure 12 is notional and is provided for illustration purposes only, to assist understanding; it does not necessarily represent the exact way in which data is stored in meeting archive database 60.
Referring to Figure 12, meeting archive database 60 stores time information represented by the horizontal axis 200, on which each unit represents a predetermined amount of time, for example one frame of video data received from camera 2. The MPEG 2 data generated at step S90 is stored as data 202 in meeting archive database 60, together with timing information (this timing information being schematically represented in Figure 12 by the position of the MPEG 2 data 202 along the horizontal axis 200).
Referring again to Figure 7, at step S92, archive processor 58 stores any text data generated by voice recognition processor 54 at step S88 for the current frame in meeting archive database 60 (indicated at 204 in Figure 12). More particularly, the text data is stored with a link to the corresponding MPEG 2 data, this link being represented in Figure 12 by the text data being stored in the same vertical column as the MPEG 2 data. As will be appreciated, there will not be any text data for storage from participants who are not speaking. In the example shown in Figure 12, text is stored for the first ten time slots for participant 1 (indicated at 206), for the twelfth to twentieth time slots for participant 3 (indicated at 208), and for the twenty-first time slot for participant 4 (indicated at 210). No text is stored for participant 2 since, in this example, participant 2 did not speak during the time slots shown in Figure 12.
At step S94, archive processor 58 stores the viewing parameter value generated for each participant at step S80 in the meeting archive database 60 (indicated at 212 in Figure 12). Referring to Figure 12, a viewing parameter value is stored for each participant together with a link to the associated MPEG 2 data 202 and the associated text data 204 (this link indicated in Figure 12 by the viewing parameter values being stored in the same column as the associated MPEG 2 data 202 and associated text data 204). Thus, referring to the first time slot by way of example, the viewing parameter value for participant 1 is "3", indicating that participant 1 is looking at participant 3, the viewing parameter value for participant 2 is "1", indicating that participant 2 is looking at participant 1, the viewing parameter value for participant 3 is also "1", indicating that participant 3 is also looking at participant 1, and the viewing parameter value for participant 4 is "0", indicating that participant 4 is not looking at any of the other participants (in the example shown in Figure 1, the participant indicated at 12 is looking at her notes rather than any of the other participants).
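The notional column structure of Figure 12 can be illustrated with a simple time-indexed record. This is an illustrative sketch only; the class name, field names and example values are assumptions, not part of the described embodiment:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TimeSlot:
    """One notional column of Figure 12: archive data for one frame period."""
    mpeg_data: bytes                                         # encoded video/sound (202)
    text: Dict[int, str] = field(default_factory=dict)       # participant number -> text (204)
    viewing: Dict[int, int] = field(default_factory=dict)    # participant number -> viewing parameter (212)

# Example: the first time slot described above (text content is a placeholder).
slot = TimeSlot(
    mpeg_data=b"...",
    text={1: "hello"},
    viewing={1: 3, 2: 1, 3: 1, 4: 0},  # 0 = not looking at any other participant
)
```

With a layout like this, the link between the MPEG 2 data, the text data and the viewing parameter values is simply their co-location in the same record.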
At step S96, central controller 36 and archive processor 58 determine whether one of the participants in the meeting has stopped speaking. In this embodiment, this check is performed by examining the text data 204 to determine whether text data for a given participant was present for the previous time slot, but is not present for the current time slot. If this condition is satisfied for a participant (that is, a participant has stopped speaking), then, at step S98, archive processor 58 processes the viewing parameter values for the participant who has stopped speaking that were previously stored when step S86 was performed (these viewing parameter values defining at whom the participant was looking during the period of speech which has now stopped) to generate data defining a viewing histogram. More particularly, the viewing parameter values for the period in which the participant was speaking are processed to generate data defining the percentage of time during that period that the speaking participant was looking at each of the other participants.
Figures 13A and 13B show the viewing histograms corresponding to the periods of text 206 and 208 respectively in Figure 12.
Referring to Figure 12 and Figure 13A, during the period 206 when participant 1 was speaking, he was looking at participant 3 for six of the ten time slots (that is, 60% of the total length of the period for which he was talking), which is indicated at 300 in Figure 13A, and at participant 4 for four of the ten time slots (that is, 40% of the time), which is indicated at 310 in Figure 13A.
Similarly, referring to Figure 12 and Figure 13B, during the period 208, participant 3 was looking at participant 1 for approximately 45% of the time, which is indicated at 320 in Figure 13B, at participant 4 for approximately 33% of the time, indicated at 330 in Figure 13B, and at participant 2 for approximately 22% of the time, which is indicated at 340 in Figure 13B.
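The viewing histograms of Figures 13A and 13B follow directly from counting viewing parameter values over the period of speech. A minimal sketch (the function name is an assumption):

```python
from collections import Counter

def viewing_histogram(viewing_values):
    """Percentage of the speech period spent looking at each target.

    viewing_values: one viewing parameter value per time slot,
    recorded while the participant was speaking.
    """
    counts = Counter(viewing_values)
    total = len(viewing_values)
    return {target: 100.0 * n / total for target, n in counts.items()}

# Period 206: participant 1 looked at participant 3 for six of the
# ten slots and at participant 4 for the remaining four.
hist = viewing_histogram([3] * 6 + [4] * 4)
# hist == {3: 60.0, 4: 40.0}
```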
Referring again to Figure 7, at step S100, the viewing histogram generated at step S98 is stored in the meeting archive database 60 linked to the associated period of text for which it was generated. Referring to Figure 12, the stored viewing histograms are indicated at 214, with the data defining the histogram for the text period 206 indicated at 216, and the data defining the histogram for the text period 208 indicated at 218. In Figure 12, the link between the viewing histogram and the associated text is represented by the viewing histogram being stored in the same columns as the text data.
On the other hand, if it is determined at step S96 that, for the current time period, one of the participants has not stopped speaking, then steps S98 and S100 are omitted.
At step S102, central controller 36 determines whether another frame of video data has been received from camera 2. Steps S80 to S102 are repeatedly performed while image data is received from camera 2.
Once data has been stored in meeting archive database 60, the database may be interrogated to retrieve data relating to the meeting.
Figure 14 shows the processing operations performed to search the meeting archive database 60 to retrieve data relating to each part of the meeting which satisfies search criteria specified by a user.
Referring to Figure 14, at step S200, central controller 36 causes display processor 64 to display a message on display device 26 requesting the user to enter information defining the search of meeting archive database 60 which is required. More particularly, in this embodiment, central controller 36 causes the display shown in Figure 15A to appear on display device 26.
Referring to Figure 15A, the user is requested to enter information defining the part or parts of the meeting which he wishes to find in the meeting archive database 60. More particularly, in this embodiment, the user is requested to enter information 400 defining a participant who was talking, information 410 comprising one or more key words which were said by the participant identified in information 400, and information 420 defining the participant to whom the participant identified in information 400 was talking. In addition, the user is able to enter time information defining a portion or portions of the meeting for which the search is to be carried out. More particularly, the user can enter information 430 defining a time in the meeting beyond which the search should be discontinued (that is, the period of the meeting before the specified time should be searched), information 440 defining a time in the meeting after which the search should be carried out, and information 450 and 460 defining a start time and end time respectively between which the search is to be carried out. In this embodiment, information 430, 440, 450 and 460 may be entered either by specifying a time in absolute terms, for example in minutes, or in relative terms by entering a decimal value which indicates a proportion of the total meeting time. For example, entering the value 0.25 as information 430 would restrict the search to the first quarter of the meeting.
In this embodiment, the user is not required to enter all of the information 400, 410 and 420 for one search, and instead may omit one or two pieces of this information. If the user enters all of the information 400, 410 and 420, then the search will be carried out to identify each part of the meeting in which the participant identified in information 400 was talking to the participant identified in information 420 and spoke the key words defined in information 410. On the other hand, if information 410 is omitted, then a search will be carried out to identify each part of the meeting in which the participant defined in information 400 was talking to the participant defined in information 420 irrespective of what was said. If information 410 and 420 is omitted, then a search is carried out to identify each part of the meeting in which the participant defined in information 400 was talking, irrespective of what was said and to whom. If information 400 is omitted, then a search is carried out to identify each part of the meeting in which any of the participants spoke the key words defined in information 410 to the participant defined in information 420. If information 400 and 410 is omitted, then a search is carried out to identify each part of the meeting in which any of the participants spoke to the participant defined in information 420. If information 420 is omitted, then a search is carried out to identify each part of the meeting in which the participant defined in information 400 spoke the key words defined in information 410, irrespective of to whom the key word was spoken. Similarly, if information 400 and 420 is omitted, then a search is carried out to identify each part of the meeting in which the key words identified in information 410 were spoken, irrespective of who said the key words and to whom.
In addition, the user may enter all of the time information 430, 440, 450 and 460 or may omit one or more pieces of this information.
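The four optional time fields can be resolved into a single search interval. The sketch below assumes, consistently with the 0.25 example above, that an entered value below 1.0 is interpreted as a proportion of the total meeting time; the function and parameter names are illustrative:

```python
def search_window(total, before=None, after=None, start=None, end=None):
    """Resolve the optional time fields (430, 440, 450, 460) into one
    (start, end) interval in minutes.

    total: total meeting length in minutes.  An assumption: any value
    below 1.0 is treated as a proportion of the meeting length.
    """
    def absolute(t):
        return t * total if t < 1.0 else t

    lo, hi = 0.0, float(total)
    if after is not None:            # information 440: search after this time
        lo = max(lo, absolute(after))
    if start is not None:            # information 450: search start time
        lo = max(lo, absolute(start))
    if before is not None:           # information 430: search before this time
        hi = min(hi, absolute(before))
    if end is not None:              # information 460: search end time
        hi = min(hi, absolute(end))
    return lo, hi

# Restrict the search to the first quarter of a 60-minute meeting:
# search_window(60, before=0.25) == (0.0, 15.0)
```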
Once the user has entered all of the required information to define the search, he begins the search by clicking on area 470 using a user input device such as the mouse 30.
Referring again to Figure 14, at step S202, the search information entered by the user is read by central controller 36 and the instructed search is carried out.
More particularly, in this embodiment, central controller 36 converts any participant names entered in information 400 or 420 to participant numbers using the table 80 (Figure 4), and considers the text information 204 for the participant defined in information 400 (or all participants if information 400 is not entered). If information 420 has been entered by the user, then, for each period of text, central controller 36 checks the data defining the corresponding viewing histogram to determine whether the percentage of viewing time in the histogram for the participant defined in information 420 is equal to or above a threshold which, in this embodiment, is 25%. In this way, periods of speech (text) are considered to satisfy the criteria that a participant defined in information 400 was talking to the participant defined in information 420 even if the speaking participant looked at other participants while speaking, provided that the speaking participant looked at the participant defined in information 420 for at least 25% of the time of the speech. Thus, a period of speech in which the value of the viewing histogram is equal to or above 25% for two or more participants would be identified if any of these participants were specified in information 420. If the information 410 has been input by the user, then central controller 36 and text searcher 62 search each portion of text previously identified on the basis of information 400 and 420 (or all portions of text if information 400 and 420 was not entered) to identify each portion containing the key word(s) identified in information 410. If any time information has been entered by the user, then the searches described above are restricted to the meeting times defined by those limits.
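The combined test applied at step S202 (speaker, 25% gaze threshold and key words, with omitted criteria ignored) can be sketched as follows; the data layout, names and example text are assumptions:

```python
GAZE_THRESHOLD = 25.0  # percent, as in this embodiment

def matches(speech, speaker=None, keywords=None, addressee=None):
    """Does one archived speech satisfy the search criteria?

    speech: dict with 'speaker' (participant number), 'text' (str) and
    'histogram' (viewing histogram, target -> percent of gaze time).
    Any criterion left as None is not applied, mirroring the optional
    information 400, 410 and 420.
    """
    if speaker is not None and speech["speaker"] != speaker:
        return False
    # Talking "to" a participant means at least 25% gaze time.
    if addressee is not None and speech["histogram"].get(addressee, 0.0) < GAZE_THRESHOLD:
        return False
    if keywords is not None and not all(k.lower() in speech["text"].lower() for k in keywords):
        return False
    return True

speech = {"speaker": 1, "text": "the budget forecast", "histogram": {3: 60.0, 4: 40.0}}
# matches(speech, speaker=1, addressee=3) is True: participant 1 looked
# at participant 3 for at least 25% of the speech.
# matches(speech, addressee=2) is False.
```

Note that, as the text observes, a speech whose histogram is at or above the threshold for two or more participants matches whichever of them is specified.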
At step S204, central controller 36 causes display processor 64 to display a list of relevant speeches identified during the search to the user on display device 26. More particularly, central controller 36 causes information such as that shown in Figure 15B to be displayed to the user. Referring to Figure 15B, a list is produced of each speech which satisfies the search parameters, and information is displayed defining the start time for the speech both in absolute terms and as a proportion of the full meeting time. The user is then able to select one of the speeches for playback by clicking on the required speech in the list using the mouse 30.
At step S206, central controller 36 reads the selection made by the user at step S204, and plays back the stored MPEG 2 data 202 for the relevant part of the meeting from meeting archive database 60. More particularly, central controller 36 and display processor 64 decode the MPEG 2 data 202 and output the image data and sound via display device 26.
At step S208, central controller 36 determines whether the user wishes to cease interrogating the meeting archive database 60 and, if not, steps S200 to S208 are repeated.
Various modifications and changes can be made to the embodiment of the invention described above.
For example, in the embodiment above the microphones 4 are provided on the meeting room table. However, instead, a microphone on video camera 2 may be used to record sound data.
In the embodiment above, image data is processed from a single video camera 2. However, to improve the accuracy with which the head of each participant is tracked, video data from a plurality of video cameras may be processed.
For example, image data from a plurality of cameras may be processed as in steps S50 to S56 of Figure 5 and the resulting data from all of the cameras input to a Kalman filter at step S60 in a conventional manner to generate a more accurate estimate of the position and orientation of each participant's head in the next frame of video data from each camera. If multiple cameras are used, then the MPEG 2 data 202 stored in meeting archive database 60 may comprise the video data from all of the cameras and, at steps S204 and S206 in Figure 14, image data from a camera selected by the user may be replayed.
In the embodiment above, the viewing parameter for a given participant defines at which other participant the participant is looking. However, the viewing parameter may also be used to define at which object the participant is looking, for example a display board, projector screen etc. Thus, when interrogating the meeting archive database 60, information 420 in Figure 15A could be used to specify at whom or at what the participant was looking when he was talking.
In the embodiment above, at step S202 (Figure 14), the viewing histogram for a particular portion of text is considered and it is determined that the participant was talking to a further participant if the percentage of gaze time for the further participant in the viewing histogram is equal to or above a predetermined threshold.
Instead, however, rather than using a threshold, the participant at whom the speaking participant was looking during the period of text may be defined to be the participant having the highest percentage gaze value in the viewing histogram (for example participant 3 in Figure 13A, and participant 1 in Figure 13B).
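Under this modification, the threshold test is replaced by a simple maximum over the viewing histogram; a sketch (the function name is an assumption):

```python
def addressee(histogram):
    """Participant with the highest gaze percentage in the viewing histogram."""
    return max(histogram, key=histogram.get)

# Figure 13A: participant 1 was mostly looking at participant 3.
# addressee({3: 60.0, 4: 40.0}) == 3
# Figure 13B: participant 3 was mostly looking at participant 1.
# addressee({1: 45.0, 4: 33.0, 2: 22.0}) == 1
```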
In the embodiment above, the MPEG 2 data 202, the text data 204, the viewing parameters 212 and the viewing histograms 214 are stored in meeting archive database 60 in real time as data is received from camera 2 and microphones 4. However, instead, the video and sound data may be stored and data 202, 204, 212 and 214 generated and stored in meeting archive database 60 in non-real-time.
In the embodiment above, the MPEG 2 data 202, the text data 204, the viewing parameters 212 and the viewing histograms 214 are generated and stored in the meeting archive database 60 before the database is interrogated to retrieve data for a defined part of the meeting.
However, some, or all, of the data 204, 212 and 214 may be generated in response to a search of the meeting archive database 60 being requested by the user by processing the stored MPEG 2 data 202, rather than being generated and stored prior to such a request. For example, although in the embodiment above the viewing histograms 214 are calculated and stored in real-time at steps S98 and S100 (Figure 7), these histograms could be calculated in response to a search request being input by the user.
In the embodiment above, text data 204 is stored in meeting archive database 60. Instead, audio data may be stored in the meeting archive database 60 instead of the text data 204. The stored audio data would then either itself be searched for key words using voice recognition processing, or converted to text using voice recognition processing and the text searched using a conventional text searcher.
In the embodiment above, processing apparatus 24 includes functional components for receiving and generating data to be archived (for example, central controller 36, head tracker 50, head model store 52, voice recognition processor 54, speech recognition parameter store 56 and archive processor 58), functional components for storing the archive data (for example meeting archive database 60), and also functional components for searching the database and retrieving information therefrom (for example central controller 36 and text searcher 62).
However, these functional components may be provided in separate apparatus. For example, one or more apparatus for generating data to be archived, and one or more apparatus for database searching may be connected to one or more databases via a network, such as the Internet.
Also, referring to Figure 16, video and sound data from one or more meetings 500, 510, 520 may be input to a data processing and database storage apparatus 530 (which comprises functional components to generate and store the archive data), and one or more database interrogation apparatus 540, 550 may be connected to the data processing and database storage apparatus 530 for interrogating the database to retrieve information therefrom.
In the embodiment above, processing is performed by a computer using processing routines defined by programming instructions. However, some, or all, of the processing could be performed using hardware.
Although the embodiment above is described with respect to a meeting taking place between a number of participants, the invention is not limited to this application, and, instead, can be used for other applications, such as to process image and sound data on a film set etc.
Different combinations of the above modifications are, of course, possible and other changes and modifications can be made without departing from the spirit and scope of the invention.
The contents of the applicant's co-pending UK applications 9905191.4, 9905197.1, 9905202.9, 9905158.3, 9905201.1, 9905186.4, 9905160.9, 9905199.7 and 9905187.2 are hereby incorporated by reference.

Claims (42)

  1. Apparatus for processing image data and sound data, comprising:
    image processing means for processing image data recorded by at least one camera showing the movements of a plurality of people to determine where each person is looking and to determine which of the people is speaking based on where the people are looking; and sound processing means for processing sound data defining words spoken by the people to generate text data therefrom in dependence upon the result of the processing performed by the image processing means.
  2. Apparatus according to claim 1, wherein the sound processing means includes storage means for storing respective voice recognition parameters for each of the people, and means for selecting the voice recognition parameters to be used to process the sound data in dependence upon the person determined to be speaking by the image processing means.
  3. Apparatus according to claim 1 or claim 2, wherein the image processing means is arranged to determine where each person is looking by processing the image data using camera calibration data defining the position and orientation of each camera from which image data is processed.
  4. Apparatus according to any preceding claim, wherein the image processing means is arranged to determine where each person is looking by processing the image data to track the position and orientation of each person's head in three dimensions.
  5. Apparatus according to any preceding claim, wherein the image processing means is arranged to determine which person is speaking based on the number of people looking at each person.
  6. Apparatus according to claim 5, wherein the image processing means is arranged to generate a value for each person defining at whom the person is looking and to process the values to determine the person who is speaking.
  7. Apparatus according to any preceding claim, wherein the image processing means is arranged to determine that the person who is speaking is the person at whom the most other people are looking.
  8. Apparatus according to any preceding claim, further comprising a database for storing the image data, the sound data, the text data produced by the sound processing means and viewing data defining where each person is looking, the database being arranged to store the data such that corresponding text data and viewing data are associated with each other and with the corresponding image data and sound data.
  9. Apparatus according to claim 8, further comprising means for compressing the image data and the sound data for storage in the database.
  10. Apparatus according to claim 9, wherein the means for compressing the image data and the sound data comprises means for encoding the image data and the sound data as MPEG data.
  11. Apparatus according to any of claims 8 to 10, further comprising means for generating data defining, for a predetermined period, the proportion of time spent by a given person looking at each of the other people during the predetermined period, and wherein the database is arranged to store the data so that it is associated with the corresponding image data, sound data, text data and viewing data.
  12. Apparatus according to claim 11, wherein the predetermined period comprises a period during which the given person was talking.
  13. Apparatus for processing image data, comprising image processing means for processing image data recorded by at least one camera showing the movements of a plurality of people to determine where each person is looking and to determine which of the people is speaking based on where the people are looking.
  14. Apparatus according to claim 13, wherein the image processing means is arranged to determine where each person is looking by processing the image data using camera calibration data defining the position and orientation of each camera from which image data is processed.
  15. Apparatus according to claim 13 or claim 14, wherein the image processing means is arranged to determine where each person is looking by processing the image data to track the position and orientation of each person's head in three dimensions.
  16. Apparatus according to any of claims 13 to 15, wherein the image processing means is arranged to determine which person is speaking based on the number of people looking at each person.
  17. Apparatus according to claim 16, wherein the image processing means is arranged to generate a value for each person defining at whom the person is looking and to process the values to determine the person who is speaking.
  18. Apparatus according to any of claims 13 to 17, wherein the image processing means is arranged to determine that the person who is speaking is the person at whom the most other people are looking.
  19. A method of processing image data and sound data, comprising:
    an image processing step of processing image data recorded by at least one camera showing the movements of a plurality of people to determine where each person is looking and to determine which of the people is speaking based on where the people are looking; and a sound processing step of processing sound data defining words spoken by the people to generate text data therefrom in dependence upon the result of the processing performed in the image processing step.
  20. A method according to claim 19, wherein the sound processing step includes selecting, from stored respective voice recognition parameters for each of the people, the voice recognition parameters to be used to process the sound data in dependence upon the person determined to be speaking in the image processing step.
  21. A method according to claim 19 or claim 20, wherein, in the image processing step, it is determined where each person is looking by processing the image data using camera calibration data defining the position and orientation of each camera from which image data is processed.
  22. A method according to any of claims 19 to 21, wherein, in the image processing step, it is determined where each person is looking by processing the image data to track the position and orientation of each person's head in three dimensions.
  23. A method according to any of claims 19 to 22, wherein, in the image processing step, it is determined which person is speaking based on the number of people looking at each person.
  24. A method according to claim 23, wherein, in the image processing step, a value is generated for each person defining at whom the person is looking and the values are processed to determine the person who is speaking.
  25. A method according to any of claims 19 to 24, wherein, in the image processing step, it is determined that the person who is speaking is the person at whom the most other people are looking.
  26. A method according to any preceding claim, further comprising the step of storing the image data, the sound data, the text data produced in the sound processing step and viewing data defining where each person is looking in a database, the database being arranged to store the data such that corresponding text data and viewing data are associated with each other and with the corresponding image data and sound data.
  27. A method according to claim 26, wherein the image data and the sound data are stored in the database in compressed form.
  28. A method according to claim 27, wherein the image data and the sound data are stored as MPEG data.
  29. A method according to any of claims 26 to 28, further comprising the steps of generating data defining, for a predetermined period, the proportion of time spent by a given person looking at each of the other people during the predetermined period, and storing the data in the database so that it is associated with the corresponding image data, sound data, text data and viewing data.
  30. A method according to claim 29, wherein the predetermined period comprises a period during which the given person was talking.
  31. A method according to any of claims 26 to 30, further comprising the step of generating a signal conveying the database with data therein.
  32. A method according to claim 31, further comprising the step of recording the signal either directly or indirectly to generate a recording thereof.
  33. A method of processing image data, comprising processing image data recorded by at least one camera showing the movements of a plurality of people to determine where each person is looking and to determine which of the people is speaking based on where the people are looking.
  34. A method according to claim 33, wherein it is determined where each person is looking by processing the image data using camera calibration data defining the position and orientation of each camera from which image data is processed.
  35. A method according to claim 33 or claim 34, wherein it is determined where each person is looking by processing the image data to track the position and orientation of each person's head in three dimensions.
  36. A method according to any of claims 33 to 35, wherein it is determined which person is speaking based on the number of people looking at each person.
  37. A method according to claim 36, wherein a value is generated for each person defining at whom the person is looking and the values are processed to determine the person who is speaking.
  38. A method according to any of claims 33 to 37, wherein it is determined that the person who is speaking is the person at whom the most other people are looking.
  39. A storage device storing instructions for causing a programmable processing apparatus to become configured as an apparatus as set out in any of claims 1 to 18.
  40. A storage device storing instructions for causing a programmable processing apparatus to become operable to perform a method as set out in any of claims 19 to 38.
  41. A signal conveying instructions for causing a programmable processing apparatus to become configured as an apparatus as set out in any of claims 1 to 18.
  42. A signal conveying instructions for causing a programmable processing apparatus to become operable to perform a method as set out in any of claims 19 to 38.
GB9907103A 1999-03-26 1999-03-26 Image processing apparatus Expired - Fee Related GB2351627B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB9907103A GB2351627B (en) 1999-03-26 1999-03-26 Image processing apparatus
US09/532,533 US7117157B1 (en) 1999-03-26 2000-03-22 Processing apparatus for determining which person in a group is speaking
JP2000086807A JP4474013B2 (en) 1999-03-26 2000-03-27 Information processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB9907103A GB2351627B (en) 1999-03-26 1999-03-26 Image processing apparatus

Publications (3)

Publication Number Publication Date
GB9907103D0 GB9907103D0 (en) 1999-05-19
GB2351627A true GB2351627A (en) 2001-01-03
GB2351627B GB2351627B (en) 2003-01-15

Family

ID=10850498

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9907103A Expired - Fee Related GB2351627B (en) 1999-03-26 1999-03-26 Image processing apparatus

Country Status (1)

Country Link
GB (1) GB2351627B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5231674A (en) * 1989-06-09 1993-07-27 Lc Technologies, Inc. Eye tracking method and apparatus
GB2342802A (en) * 1998-10-14 2000-04-19 Picturetel Corp Indexing conference content onto a timeline


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7630503B2 (en) 2003-10-21 2009-12-08 Mitel Networks Corporation Detecting acoustic echoes using microphone arrays
DE102005014447B4 (en) * 2004-03-31 2011-04-28 Honda Motor Co., Ltd. simulation device
US7809577B2 (en) 2004-04-01 2010-10-05 Honda Motor Co., Ltd. Apparatus for simulating the operation of a vehicle
WO2008033769A2 (en) * 2006-09-15 2008-03-20 Hewlett-Packard Development Company, L.P. Consistent quality for multipoint videoconferencing systems
WO2008033769A3 (en) * 2006-09-15 2009-07-16 Hewlett Packard Development Co Consistent quality for multipoint videoconferencing systems
US7924305B2 (en) 2006-09-15 2011-04-12 Hewlett-Packard Development Company, L.P. Consistent quality for multipoint videoconferencing systems
CN109313015A (en) * 2016-05-12 2019-02-05 康耐视股份有限公司 Calibration for vision system
US11074719B2 (en) 2016-05-12 2021-07-27 Cognex Corporation Calibration for vision system

Also Published As

Publication number Publication date
GB9907103D0 (en) 1999-05-19
GB2351627B (en) 2003-01-15

Similar Documents

Publication Publication Date Title
US7117157B1 (en) Processing apparatus for determining which person in a group is speaking
US7113201B1 (en) Image processing apparatus
US7139767B1 (en) Image processing apparatus and database
CN109874029B (en) Video description generation method, device, equipment and storage medium
EP1433310B1 (en) Automatic photography
US11955125B2 (en) Smart speaker and operation method thereof
CA2024925C (en) Computerized court reporting system
Lee et al. Portable meeting recorder
US20030154084A1 (en) Method and system for person identification using video-speech matching
US20020133339A1 (en) Method and apparatus for automatic collection and summarization of meeting information
US11562731B2 (en) Word replacement in transcriptions
CN104469491A (en) audio delivery method and audio delivery system
JP2010181461A (en) Digital photograph frame, information processing system, program, and information storage medium
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
US11689380B2 (en) Method and device for viewing conference
GB2351628A (en) Image and sound processing apparatus
JP2006279111A (en) Information processor, information processing method and program
JP6908906B1 (en) Automatic switching equipment, automatic switching methods and programs
GB2351627A (en) Image processing apparatus
US20220059094A1 (en) Transcription of audio
CN111144287A (en) Audio-visual auxiliary communication method, device and readable storage medium
Batcho Revisiting the Howard Dean scream: sound exclusivity in broadcast news
GB2349764A (en) 2-D Moving image database
US20210271708A1 (en) Method and System for Generating Elements of Recorded information in Response to a Secondary User's Natural Language Input
DeCamp Headlock: Wide-range head pose estimation for low resolution video

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20160326