Overview
This WP aims to build audiovisual synthetic objects using camera
and “microphone” views of natural objects for use in a fully
immersive virtual environment. The critical challenges will be the naturalness of the composed
immersive audiovisual scenes, including the accuracy of audiovisual synchronisation.
Natural visual 3D object modelling: Automatic
modelling using 2D and 3D, mono, stereo and multiple camera views will support
manual modelling within graphics packages. Static and dynamic 3D data from
multiple cameras will be used to get reference models in the given class of
objects (e.g. faces) while 2D views of the individual object will be processed
to produce the individual 3D model ready for animation. The concepts of feature
detection, 3D correspondence field, principal subspace and discriminating
models for 3D shape and texture will serve as the starting point of our
developments in parallel to image-based rendering (IBR) techniques. Next,
neural networks will be explored for creating the 3D models by means of
neuro-informative zones that can define polygon vertices, e.g. the lips, mouth
extremities and main vertices of the nose. Finally, a method based on camera
defocusing will be investigated for 3D object modelling. The feasibility of the
approaches will be verified for human faces using head views with general pose,
illumination, and face decorations (moustache, glasses, beard).
Manufactured objects (e.g. vehicles, roads, amusement parks,
buildings) will be modelled in computer graphics environments (optionally
augmented by a 3D scanner, laser range finder, and video camera) using 3D natural
textures and compared with automatic modelling from 2D views. Moreover, the
integration of complementary video and laser sensors in the development of
techniques for 3D reconstruction will be further explored in order to obtain both
3D model accuracy and high 3D spatial resolution. By this means we will
create high-quality, photo-realistic, dimensionally accurate 3D models for a
wide range of distances. Finally, we will investigate the feasibility of radial
basis functions in the modelling of architectural objects using images of real
monuments.
Natural audio object modelling: Automatic modelling based on
sample audio recordings will cover an individual human voice model,
musical instrument sound models, and bird voice models. The accuracy of the
audio models will be verified by psychoacoustic experiments and, for human
voice modelling, additionally in a text-to-speech application. In particular, musical
sound synthesis based on physical modelling will be used to create virtual
instruments. Interfaces to the control engine will be designed and implemented.
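As one illustration of sound synthesis by physical modelling, the sketch below implements the Karplus-Strong plucked-string algorithm, a simple and well-known string model. It is not necessarily the synthesis method that the WP will adopt; the function name `karplus_strong` and the parameters `damping` and `seed` are assumptions made for this example.

```python
import random

def karplus_strong(frequency_hz, duration_s, sample_rate=44100,
                   damping=0.996, seed=0):
    """Synthesise a plucked-string tone with the Karplus-Strong algorithm."""
    rng = random.Random(seed)
    # The delay-line length sets the pitch of the virtual string.
    period = max(2, int(round(sample_rate / frequency_hz)))
    # "Pluck" the string by filling the delay line with noise.
    delay_line = [rng.uniform(-1.0, 1.0) for _ in range(period)]
    output = []
    for n in range(int(duration_s * sample_rate)):
        idx = n % period
        nxt = (idx + 1) % period
        output.append(delay_line[idx])
        # Averaging two neighbouring samples acts as a low-pass filter,
        # and the damping factor makes the tone decay over time.
        delay_line[idx] = damping * 0.5 * (delay_line[idx] + delay_line[nxt])
    return output

tone = karplus_strong(440.0, 0.5)  # half a second of a plucked A4
```

A control interface of the kind mentioned above could expose `frequency_hz` and `damping` as the pitch and sustain controls of the virtual instrument.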
Audiovisual synchronisation: Dynamic morphing models will be
compared and the most efficient one for immersive environments will be chosen in order
to synchronise the voice or sound of an audio object with the object's visual appearance. Tools
for audiovisual synchronisation will be developed and verified in
generic applications, such as a reading head application integrated with text-to-speech
functionality.
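A minimal sketch of how a morphing model could be driven by audio timing: mouth-shape keyframes (for example, one per phoneme boundary reported by a text-to-speech engine) are linearly interpolated at the timestamp of each video frame. The keyframe layout, the function name `morph_at` and the choice of linear interpolation are assumptions for illustration; the morphing models actually compared in the WP may differ.

```python
from bisect import bisect_right

def morph_at(time_s, keyframes):
    """Return the interpolated shape at time_s.

    keyframes: list of (time_s, shape) pairs sorted by strictly increasing
    time, where each shape is a list of vertex coordinates of equal length.
    """
    times = [t for t, _ in keyframes]
    i = bisect_right(times, time_s)
    if i == 0:                       # before the first keyframe: hold it
        return list(keyframes[0][1])
    if i == len(keyframes):          # at or after the last keyframe: hold it
        return list(keyframes[-1][1])
    (t0, s0), (t1, s1) = keyframes[i - 1], keyframes[i]
    alpha = (time_s - t0) / (t1 - t0)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(s0, s1)]

# Hypothetical mouth shapes: closed at 0.0 s, open at 0.2 s, closed at 0.4 s.
keyframes = [(0.0, [0.0, 0.0]), (0.2, [1.0, 2.0]), (0.4, [0.0, 0.0])]
frame_rate = 25.0
shapes = [morph_at(n / frame_rate, keyframes) for n in range(11)]
```

Because every video frame samples the keyframes at its own timestamp, the visual track stays locked to the audio timeline regardless of the rendering frame rate.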
Audiovisual scene rendering and interaction
This WP aims to implement hybrid audiovisual scene rendering
using hybrid (natural/synthetic) camera and “microphone” 3D models.
Moreover, an animation and interaction control engine for the fully virtual
interactive immersive environment will be added. The critical challenges will
be merging natural audiovisual objects, represented by audiovisual data
streams, with synthetic audiovisual objects so as to avoid spatial and temporal
collisions while creating realistic behaviour of objects in the 3D audiovisual
scene.
Rendering of hybrid audiovisual scene: The
hybrid (natural/synthetic) camera and microphone models will be designed and
implemented. We will also investigate the problem of image-based rendering for
complex scenes in which individual objects were previously viewed from mutually
incompatible viewpoints. A 3D sound rendering module will be developed. In
particular, the efficient spatialisation of sound sources in a virtual
environment using information on the geometry of the environment itself will be
a concern in the design of virtual microphones. Finally, the single- and multi-view
audiovisual rendering engine will be built. The Web3D format will be used for model
representation in storage and transmission. For rendering, Web3D
viewers will be used initially and later replaced by our own OpenGL-based renderer (to
provide the final interface with the animation and interaction control engine of
the immersive audiovisual environment).
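To make the idea of a virtual microphone concrete, the sketch below derives stereo gains and a propagation delay for a point source from the listener and source positions alone. It is a free-field simplification that ignores reflections and occlusion by the scene geometry; the function name `spatialise`, the 2D coordinate convention and the constant-power panning law are assumptions for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees C

def spatialise(listener_xy, source_xy, ref_distance_m=1.0):
    """Return (left_gain, right_gain, delay_s) for a point source.

    The listener is assumed to face the +y direction. Gain follows the
    inverse-distance law, clamped inside ref_distance_m. Panning uses the
    constant-power (sine/cosine) law over the frontal half-plane; sources
    behind the listener are simply panned fully to one side.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    gain = ref_distance_m / max(distance, ref_distance_m)
    azimuth = math.atan2(dx, dy)          # 0 = straight ahead, positive = right
    azimuth = max(-math.pi / 2, min(math.pi / 2, azimuth))
    theta = (azimuth + math.pi / 2) / 2   # maps [-90, +90] degrees to [0, 90]
    delay_s = distance / SPEED_OF_SOUND_M_S
    return gain * math.cos(theta), gain * math.sin(theta), delay_s

left, right, delay = spatialise((0.0, 0.0), (0.0, 2.0))  # source 2 m straight ahead
```

A geometry-aware renderer would extend this by adding delayed, attenuated copies of the source for each reflecting surface.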
Animation and interactivity in hybrid audiovisual scene: The
animation and interaction control engine to be used in the immersive environment
will be designed and implemented. Synthetic object control (by text, voice
and/or sensorial devices) will be incorporated together with an interface to the
natural object tracker (WP4) to avoid collisions and enhance the naturalness of the
immersive environment.
AV content coding
In WP2.1, the coding of AV material will include the development
of object-based scalable and error resilient methods aiming at optimizing the perceptual
quality through the efficient utilization of available resources. Joint source-channel
coding will also be used with fine granular scalability (FGS). Adaptive error control will
be investigated with a view to improving the performance of the H.264/MPEG-4 AVC standard under
error-prone conditions. WP2.1 will also focus on the scalable and error-resilient
compression of 3D AV content. Research will also explore the use of metadata information
in the encoding of still pictures and video sequences in order to improve the coding
efficiency of standard compression algorithms.
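The principle behind fine granular scalability can be illustrated with bit-plane coding, the mechanism used by the MPEG-4 FGS enhancement layer: coefficient magnitudes are sent most significant bit-plane first, so the stream can be cut after any plane and still be decoded at reduced precision. The sketch below works on plain non-negative integers and omits the transform, sign handling and entropy coding; the function names are assumptions for illustration.

```python
def to_bit_planes(magnitudes, n_planes):
    """Split non-negative integers into bit-planes, most significant first."""
    return [[(m >> bit) & 1 for m in magnitudes]
            for bit in range(n_planes - 1, -1, -1)]

def from_bit_planes(planes, n_planes):
    """Rebuild magnitudes from the received planes; missing planes count as zero."""
    values = [0] * len(planes[0]) if planes else []
    for index, plane in enumerate(planes):
        bit = n_planes - 1 - index
        values = [v | (b << bit) for v, b in zip(values, plane)]
    return values

residual = [5, 12, 3]                               # hypothetical residual magnitudes
planes = to_bit_planes(residual, n_planes=4)
coarse = from_bit_planes(planes[:2], n_planes=4)    # truncated enhancement layer
exact = from_bit_planes(planes, n_planes=4)         # full enhancement layer
```

Truncating after two planes yields `[4, 12, 0]`, while all four planes restore `[5, 12, 3]`, which is the graceful degradation that makes the stream adaptable to the available bandwidth.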
AV content transcoding
WP2.2 addresses content adaptation for inter-network
communications. In this WP, compressed-domain downscaling techniques for
object-based video data will be developed aiming at scalability of shape,
motion and texture data. New transcoding algorithms will be developed with a view
to converting object-based AV data from DVB quality (e.g. MPEG-2) to 3G quality
(e.g. MPEG-4, H.264). The research activities will also focus on developing new
strategies for 2D/3D transcoding. 3D to 2D content conversion will be used to
enable users with limited display and mobile capabilities to access a 2D
version of the high-detail 3D video scene. Also, techniques for converting 2D
to 3D using single and multiple views will be developed. WP2.2 will also
investigate new transcoding methods supported by content and environment
descriptions. These metadata-based transcoding strategies will be instrumental
in achieving content adaptation based on the network and user environment
characteristics.
Transmission over heterogeneous networks
This workpackage addresses the problem of QoS optimisation for inter-network
audiovisual communications. The work will be carried out along two main lines,
one based on the development of channel simulators and media adaptation
gateways and the other one based on the deployment of a testbed, with real
systems and applications and comprising heterogeneous networks. The development and
evaluation of QoS optimisation tools are common to both approaches and therefore
constitute a basis for convergence and integration.
QoS optimisation: The end-to-end media transfer is
accomplished using adaptation gateways at the edges of the networks. The
gateways carry out the adaptation process using the required QoS levels
obtained from the network, and hence provide a number of AV streams at
different error resilience and bit/frame rate levels. The network produces
periodic reports for the media gateways using available transport protocols,
such as RTCP. The media gateways take necessary actions to:
-
Adapt the media according to network conditions
-
Perform parameter mapping from one network to another according to reported
network conditions and user profiles and requirements.
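The adaptation step of a media gateway can be sketched as a selection among pre-encoded stream variants, driven by the latest network report. Everything named here (`StreamVariant`, `NetworkReport`, `select_variant`, the `headroom` factor, and the rule that protection overhead must at least match the reported loss fraction) is a hypothetical simplification, not the gateway design of the workpackage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamVariant:
    name: str
    bitrate_kbps: int
    frame_rate: float
    fec_overhead: float    # extra bits spent on error resilience, as a fraction

@dataclass(frozen=True)
class NetworkReport:
    available_kbps: float
    loss_fraction: float   # e.g. derived from an RTCP receiver report

def total_rate(variant):
    """Bit rate including the error-resilience overhead."""
    return variant.bitrate_kbps * (1.0 + variant.fec_overhead)

def select_variant(variants, report, headroom=0.9):
    """Pick the highest-rate variant that fits the channel and its loss level."""
    budget = report.available_kbps * headroom
    feasible = [v for v in variants
                if total_rate(v) <= budget
                and v.fec_overhead >= report.loss_fraction]
    if not feasible:
        # Nothing fits: fall back to the cheapest variant.
        return min(variants, key=total_rate)
    return max(feasible, key=lambda v: v.bitrate_kbps)

variants = [
    StreamVariant("low", 64, 10.0, 0.20),
    StreamVariant("mid", 256, 15.0, 0.10),
    StreamVariant("high", 1024, 25.0, 0.05),
]
choice = select_variant(variants, NetworkReport(2000.0, 0.08))
```

With 8% reported loss, the "high" variant is rejected because its protection is too weak, and the gateway switches to "mid" even though bandwidth alone would allow more.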
Networking protocols: In the future, transport of AV
information will likely occur in heterogeneous environments, which include
different network technologies and protocols. It is therefore necessary to test
and implement interoperable solutions with provision of QoS. This will be
carried out over a real test-bed, according to a stepped approach:
-
Specification of the architecture of a heterogeneous network and associated
protocols capable of supporting the transmission of A/V content with QoS;
-
Deployment of the heterogeneous test-bed;
-
Evaluation of different interoperability solutions (measuring their effectiveness
and conducting experiments with A/V services over the test-bed).
Efficient storage and search schemes
This workpackage studies and develops efficient storage and search schemes and
harmonises different standardised formats to promote interoperability.
Efficient storage and search schemes: The main
focus of the work will be on:
-
The development of a query-by-humming system for audio data.
-
The optimisation of indexing data structures for visual descriptors. Index
representation and its parameters will be matched to particular visual
descriptions such as dominant colour, texture, face recognition index, etc.
-
The introduction of robust shot cut detection algorithms using both static and
dynamic features. The problem will be treated as a classification problem and
audio information will be used in addition to video data, when appropriate.
Research will focus on techniques that automatically recognize camera effects
like zoom ins/outs, fades and dissolves or camera motion.
-
The generation of hierarchical audio-visual summaries, using representative
keyframes, salient stills and associated audio/textual keywords in order to
present content to the end-user in an intelligent way.
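As a baseline for the shot cut detection item above, the sketch below flags a cut whenever the grey-level histograms of two consecutive frames differ by more than a fixed threshold. This is a deliberately simple static-feature detector, not the classification-based approach with dynamic and audio features that the work targets; the frame representation (flat lists of 8-bit grey values), the function names and the threshold are assumptions.

```python
def grey_histogram(frame, bins=16):
    """Normalised histogram of a frame given as a flat list of 0..255 values."""
    hist = [0] * bins
    for pixel in frame:
        hist[min(bins - 1, pixel * bins // 256)] += 1
    return [count / len(frame) for count in hist]

def detect_cuts(frames, threshold=0.5, bins=16):
    """Return the indices of frames that start a new shot."""
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = grey_histogram(frame, bins)
        if previous is not None:
            # Half the L1 distance lies in [0, 1]: 0 = identical, 1 = disjoint.
            distance = 0.5 * sum(abs(a - b) for a, b in zip(current, previous))
            if distance > threshold:
                cuts.append(index)
        previous = current
    return cuts

# Hypothetical sequence: three dark frames followed by three bright frames.
frames = [[10] * 64] * 3 + [[200] * 64] * 3
cuts = detect_cuts(frames)
```

Gradual transitions such as fades and dissolves change the histogram slowly and would slip under a fixed threshold, which is one motivation for treating detection as a classification problem over several features.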
Interoperability among different formats and data models:
The work towards this objective will focus on:
-
The use of middleware layers and distributed technologies to allow the access
and retrieval of multimedia content and associated descriptions regardless of
their location and to promote the adaptation among different formats;
-
Harmonisation of formats for the storage and exchange of multimedia content and
associated descriptions, in particular between MPEG-21, MPEG-7 and MXF.
-
Interoperability between different metadata formats, covering descriptions for
AV content protection and rights management as well as interoperability with
metadata formats other than MPEG-7 and MPEG-21.
Audio/speech analysis
Development of audio analysis tools. The work proposed will consolidate the
integration of VISNET by an effective exchange of knowledge between the
cooperating institutions, aiming at the efficient analysis of audio and speech,
and creating a potential basis for further integration and joint dissemination
activities. To enhance the cooperation, exchanges of researchers (e.g. PhD
students) are foreseen. Part of the work will be used as input for the
multimodal analysis workpackage.
Audio analysis: Development of automatic audio
analysis tools. The main work will focus on:
-
Audio segmentation: Audio recordings are classified and
segmented into voice, music, various kinds of environmental sounds, and
silence. Morphological and statistical analyses of the temporal curves of the low
short-time energy ratio, high zero-crossing ratio, high centroid, and high
harmonicity of audio signals are among the techniques to be exploited.
-
Sound Recognition: Classification with MPEG-7 Descriptors can
be used for sound recognition. The media can be automatically indexed using
trained sound classes in a pattern recognition framework. To this end, a
generalized sound recognition system using reduced-dimension features based on
independent component analysis, together with a hidden Markov model classifier, is
considered.
-
Scene classification using audio information: Audio analysis
will be used for scene classification. For instance, the audio in a sports
scene is very different from that in a news report, or from an action scene,
and even various sports programs may have very different background sounds,
which may allow identifying them.
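Two of the segmentation features named above, the low short-time energy ratio (LSTER) and the high zero-crossing ratio (HZCRR), can be computed over a window of frames as sketched below. The sketch follows one common formulation: the fraction of frames whose energy is below half the window average, and the fraction whose zero-crossing rate exceeds 1.5 times the window average. The factors 0.5 and 1.5, the frame length and the function names are assumptions rather than values fixed by this workpackage.

```python
def frame_features(samples, frame_len=256):
    """Return per-frame (short-time energy, zero-crossing rate) pairs."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        features.append((energy, crossings / (frame_len - 1)))
    return features

def lster_hzcrr(features, energy_factor=0.5, zcr_factor=1.5):
    """Return (LSTER, HZCRR) over one window of frame features."""
    n = len(features)
    mean_energy = sum(e for e, _ in features) / n
    mean_zcr = sum(z for _, z in features) / n
    low_energy = sum(1 for e, _ in features if e < energy_factor * mean_energy)
    high_zcr = sum(1 for _, z in features if z > zcr_factor * mean_zcr)
    return low_energy / n, high_zcr / n

# Hypothetical window alternating between a loud frame and a silent frame.
loud = [1.0, -1.0] * 128
quiet = [0.0] * 256
features = frame_features(loud + quiet + loud + quiet)
lster, hzcrr = lster_hzcrr(features)
```

Speech, with its alternation of voiced segments and pauses, tends to give higher values of both ratios than music, which is why their temporal curves are useful for segmentation.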
Speech analysis: Development of automatic speech
analysis tools. The main work will focus on:
-
Speech modelling: Use of nonlinear techniques, e.g. AM-FM
modulation to model speech resonances, and/or speech dynamics (e.g. nonlinear
predictors).
-
Speaker segmentation, identification, recognition and verification:
Use of robust clustering techniques for speaker segmentation; MPEG-7 low-level
audio feature descriptors using spectral basis representations are used to
model and identify different speakers; exploitation of higher-level
information, such as statistical language modelling of a speaker's
word usage, prosodic features and long-term signal measures, while coping with
spontaneous speech; verification can use the
discriminating power of commonly used features and exploit relevance feedback
techniques to boost the system's performance.
-
Spoken content extraction and retrieval: Automatic speech
recognition systems are used to extract MPEG-7 Spoken Content Descriptors from
speech inputs. These provide compact representations of speech content,
consisting of recognition hypotheses lattices (possibly mixing word and phone
hypotheses). The extracted MPEG-7 Descriptors can be used to index audio-visual
databases. Depending on the desired application, these Descriptors are
extracted either from some spoken annotations or directly from the audio
stream. Spoken queries can then be supported.
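One tool commonly associated with AM-FM modelling of speech resonances is the discrete Teager-Kaiser energy operator, Ψ[x](n) = x(n)² − x(n−1)·x(n+1). For a pure tone A·cos(Ωn + φ) it returns exactly A²·sin²(Ω), so it tracks amplitude and frequency jointly. The text does not name a specific operator, so its use here is an assumption, as are the function name and the test tone.

```python
import math

def teager_energy(samples):
    """Discrete Teager-Kaiser energy for the interior samples of a signal."""
    return [samples[n] ** 2 - samples[n - 1] * samples[n + 1]
            for n in range(1, len(samples) - 1)]

amplitude, omega = 0.8, 0.3   # hypothetical tone: amplitude and rad/sample
tone = [amplitude * math.cos(omega * n) for n in range(200)]
energy = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2   # A^2 * sin^2(omega)
```

On this stationary tone every output sample equals `expected` up to rounding; on a speech resonance the output would vary with the instantaneous amplitude and frequency.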
Video analysis and processing of human faces
Develop a set of video analysis tools, with special emphasis on the processing
of human faces. The work proposed will consolidate the integration of VISNET by
an effective exchange of knowledge between the cooperating institutions, aiming
at the efficient analysis of human faces in video content, and creating a potential basis for
further integration and joint dissemination activities. To enhance the
cooperation, exchanges of researchers (e.g. PhD students) are foreseen. The work
developed in this workpackage will receive input from the segmentation and
tracking workpackage and will be used as input for the multimodal analysis
workpackage.
Face analysis: Development of video analysis
tools for processing information related to human faces. The main work will
focus on:
-
Face detection and tracking: Probabilistic frameworks for face
detection and tracking will be investigated. Probabilistic models of facial
features will be built using a training stage and will be used in the
subsequent detection and tracking operations. Invariance to illumination
changes and robustness to occlusions will be pursued.
-
Face recognition: Algorithmic structure refinement and
performance enhancement of the advanced face recognition descriptor of MPEG-7
will be considered, namely: removing the PCA pre-processing step for linear
discriminant analysis, reducing the number of channels in the Fourier pyramid
for the face, refining the iterative query concept, novel pose estimation,
and novel mapping of an arbitrary pose to the frontal pose using an approximate
3D model of the face extracted online. In addition, the concept of 3D shape for face
recognition will be explored and combined with the 2D approach.
-
Facial expression analysis: The dynamics of the facial
expressions will be used to facilitate the expression recognition task.
Automatic video-based facial expression recognition will be pursued using
face deformations (FACS), relating the facial action units to node deformations
of generic face models and learning them through statistical techniques. Model
registration exploiting video information will be studied, and the derived
facial animation parameters will be made available in MPEG-4 syntax (FAPs,
FDPs) for the animation of 3D face models.
-
Facial feature tracking: A novel algorithm for eye tracking
will be developed, assuming that a face has been detected in the sequence. This
is also useful for image normalisation before face recognition, where
multiresolution Gabor filtering in a novel colour space will be used.
-
Detection of video shots including people: This may be a
preliminary processing step for some applications, such as news sequences
processing, where a subset of shots containing people is selected, to reduce
the computational burden of the subsequent face processing steps.
Semantic video segmentation and tracking
Work will focus on developing intelligent semantic segmentation and tracking
techniques that are robust and efficient for natural video. The segmentation
and tracking results will be used as input for the facial analysis workpackage.
The following tasks will be investigated:
Segmentation system: Development of a generic and
robust video analysis system for supervised and unsupervised segmentation and
object tracking. Different advanced techniques will be integrated to obtain a
robust, efficient and modular segmentation and tracking system for natural
video. Novel robust segmentation/tracking algorithms will be proposed. The
video analysis system will entail several independent modules. Topics that will
be investigated include soft detection techniques for object segmentation,
fusion of intermediate results, statistical machine learning and classification
techniques for object segmentation, etc.
2-D articulated person tracking: Dynamic 2-D
articulated models of the human body will be used to achieve efficient,
accurate and low computational complexity person segmentation and tracking.
3-D articulated person tracking: Techniques for 3-D
articulated person tracking from video sequences will be investigated. The
topics that will be treated include the selection of suitable 3D models, the use of
multiple image cues, the incorporation of appropriate constraints (physiological,
anatomical-kinematic), etc.
Multimodal analysis
Develop a set of multimodal analysis tools, with special emphasis on the
detection and recognition of people in audiovisual sequences. The work proposed
will consolidate the integration of VISNET by an effective exchange of
knowledge between the cooperating institutions, aiming at the efficient
integration of audio and visual analysis techniques, and creating a potential
basis for further integration and joint dissemination activities. To enhance
the cooperation, exchanges of researchers (e.g. PhD students) are foreseen. The
work developed in this workpackage will receive input from the audio and video
analysis workpackages.
Shot selection: A module for the detection of
shots in which the persons in the scene are speaking will be developed. This module
greatly reduces the computational load of the analysis system, since
the number of shots to be processed will be much smaller. The recognition
results will also benefit from this module, since the
correspondence between the detected speech and the faces in the scene can be
verified.
Audio and video extraction for each selected shot: The audio
and video information will be processed in parallel and confidence values for
the speaker and face recognition techniques will be extracted for the selected
shots.
People location and recognition: Based on the audio and video
confidence values, it is possible to determine whether a particular person is
present in the scene. The multimodal analysis for the detection of people in
environments will also be conducted using generalized probability theory.
This work exploits several independent information sources to
achieve better classification results than when each source is considered
independently.
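A minimal sketch of how audio and video confidence values could be combined into a single decision. It uses plain naive-Bayes fusion of posterior probabilities under a conditional-independence assumption, which is a stand-in for, not an implementation of, the generalized probability framework mentioned above; the function name `fuse_confidences` and the shared `prior` are assumptions.

```python
import math

def fuse_confidences(confidences, prior=0.5):
    """Fuse per-modality posteriors P(person present | modality).

    Assumes the modalities are conditionally independent given the true
    label and that each posterior was computed with the same prior.
    """
    def logit(p):
        p = min(max(p, 1e-9), 1.0 - 1e-9)   # keep log() finite
        return math.log(p / (1.0 - p))

    prior_logit = logit(prior)
    # Each modality contributes its evidence relative to the prior.
    total = prior_logit + sum(logit(c) - prior_logit for c in confidences)
    return 1.0 / (1.0 + math.exp(-total))

speaker_confidence = 0.8   # hypothetical speaker recognition output
face_confidence = 0.7      # hypothetical face recognition output
fused = fuse_confidences([speaker_confidence, face_confidence])
```

When both modalities agree, the fused confidence exceeds either input; when they disagree, it falls between them, which is the behaviour that lets independent sources improve on each source taken alone.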
Secure transmission and encryption of AV data
This workpackage deals with the definition of protocols and creation of
software platforms for the secure distribution of multimedia material using
encryption, robust watermarks and fingerprinting. Moreover, the adaptation of
security techniques to real-time AV transmissions and the definition of
metadata schemes suitable for the representation of access control information
in AV content are addressed.
-
Design and analysis of protocols for secure distribution, in order to identify
the levels of protection afforded. Definition of encryption requirements of the
protocols. Determination of the role of the various technological components
(encryption, secure transmission, watermarking) in the distribution chain.
-
Adaptation of security functions currently supported in general communications
to AV transmission. Adaptation will address the issues of protection of
multicast traffic, encryption schemes adapted to streaming with varying QoS and
information loss, and efficient algorithms for securing real-time
communications.
-
Specification of an appropriate key distribution infrastructure for the needs
of AV secure transmissions. This infrastructure, together with the
corresponding security policies and certification authorities, will provide the
basis for the secure transmission of audio-visual contents.
-
Use of metadata to provide information related to access control. The metadata
that will be defined will be used for conveying properties that will enable
rights management, protection against unauthorized copy or retransmission,
conditional access enforcement, authorization methods, and other security
functions. These metadata will be duly protected with the appropriate security
mechanisms.
-
Participation in standardization bodies (mainly CEN/ETSI), contributing the
audiovisual-oriented security mechanisms, distribution protocols and AV content
protection metadata developed in VISNET.
Watermarking and Fingerprinting Techniques for DRM
Devise new methods and build software for robust watermarking and
fingerprinting techniques for copy control, traitor/transaction tracing, and
anonymous purchase. The effectiveness of the new methods will be tested using suitable test
procedures and benchmarking platforms.
-
Design new methods and protocols for robust watermarking of AV data. Robustness
to geometric attacks and collusion attacks will be investigated. The methods
will be capable of embedding sufficient information (payload) to trace
distributors & purchasers of pirated material. Joint watermarking of audio
and video data and fusion of detection results will be pursued.
Information-theoretic techniques will be used in order to increase watermark
payload. The methods will also be capable of withstanding multiple watermarking
so as to enable watermarking at each step in the distribution process.
-
Design and implement new methods for fingerprinting of AV data. This covers the selection of
robust feature vectors, the implementation of techniques for the efficient matching
of fingerprints, and the organization of the fingerprint database.
-
Design protocols for anonymous purchase based on asymmetric watermarks.
-
Devise and test asymmetric watermarking schemes.
-
Specification of benchmarking procedures for testing the robustness, security,
imperceptibility and capacity of the new audio & video watermarking
techniques.
-
Specification of benchmarking procedures for testing the robustness &
discriminative power of fingerprinting techniques.
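For orientation, the sketch below shows the classical additive spread-spectrum scheme that robust watermarking methods often start from: a key-dependent pseudo-random ±1 pattern is added to the host signal at low strength, and detection correlates the received signal with the same pattern. It carries a single bit and has no resistance to geometric or collusion attacks; the function names, the `strength` value and the half-strength detection threshold are assumptions for illustration.

```python
import random

def pattern(key, length):
    """Key-dependent pseudo-random sequence of +1.0 / -1.0 values."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(signal, key, strength=0.05):
    """Add the keyed pattern to the host signal at the given strength."""
    mark = pattern(key, len(signal))
    return [s + strength * m for s, m in zip(signal, mark)]

def detect(signal, key, strength=0.05):
    """Blind detection: correlate with the keyed pattern and threshold."""
    mark = pattern(key, len(signal))
    correlation = sum(s * m for s, m in zip(signal, mark)) / len(signal)
    return correlation > strength / 2

# Hypothetical host signal standing in for audio samples or pixel values.
host_rng = random.Random(1)
host = [host_rng.uniform(-0.5, 0.5) for _ in range(20000)]
marked = embed(host, key=42)
```

Here `detect(marked, key=42)` succeeds while detection on the unmarked host or with a different key fails, because the host and a wrong pattern correlate only weakly with the embedded one. Benchmarking procedures of the kind listed above would measure how this margin shrinks under attacks.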