%\documentclass[twoside,openright,a4paper,12pt,english,draft]{article}
\documentclass[twoside,openright,a4paper,12pt,english]{article}
%\documentclass[twoside,openright,a4paper,11pt,french]{article}
%\documentclass[twoside,openright,a4paper,11pt,french]{book}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
% Using tables
\usepackage{tabularx}
\usepackage{listings} % Include the listings-package
\lstset{language=Java} % Set your language (you can change the language for each code-block optionally)
\usepackage{subcaption}
\usepackage{wrapfig}
% Using URLs
\usepackage{url}
\urlstyle{sf}
% Using images, stored in the ./pics/ directory
\usepackage{graphicx}
\graphicspath{{pics/}}
% Margin definitions
\usepackage{geometry}
\geometry{%
left=26mm,%
right=27mm,%
top=32mm,%
bottom=45mm,%
foot=15mm%
}
\begin{document}
%%% IRBABOON
\def\j{\emph{Jitsi}}
\def\jvb{\emph{Jitsi Videobridge}}
\def\bj{\emph{BlueJimp}}
\def\lj{\emph{libjitsi}}
\def\wrtc{\emph{webrtc.org}}
\def\jm{\emph{Jitsi Meet}}
\hyphenation{WebRTC}
\hyphenation{transport}
%%% NOOBABRI
% The title page
\include{page-garde}
% Blank page on the back of the title page
\newpage
\pagestyle{empty}
\newpage
\cleardoublepage
\include{ack}
% The table of contents
\parskip=0pt
\tableofcontents
% Blank page between the table of contents and the text
\newpage
\pagestyle{empty}
\newpage
\pagenumbering{arabic}
\thispagestyle{plain}
\pagestyle{plain}
\begin{abstract}
This document describes the implementation of various media recording-related
features in a modern video conferencing application based on WebRTC. We start
by introducing the work environment and the various technological components
involved in the project. We then proceed with a detailed description of the
different stages of the process including the way audio and video data is
transported over the network, the way it is persistently stored,
the way multiple audio and video files are organized and combined in a
single flat audio/video file, and how synchronization is ensured.
The work on the project is still actively
pursued and we therefore also present a number of planned future steps and
improvements that will likely be implemented in the near
future.
%This document describes work done
%during a 6-months long internship. It describes a modern video
%conferencing application based on WebRTC (\jm) discussed, and a system for
%media recording, which I helped to design and implement, is examined in detail.
%The first section introduces the work environment for the
%internship and the most important relevant standards and technologies. The rest
%of the sections examine specific parts of the media recording system.
\end{abstract}
\section{Introduction}
\label{chap:intro}
The ever-decreasing cost of bandwidth and processing resources has in recent years
made multi-party video conferencing over the internet viable for personal use. The advent of WebRTC,
a technology that adds audio/video communication capabilities to web browsers,
has made the development of conferencing applications (or the addition of
conferencing features to existing applications) simpler than ever before.
The work described in this document is about the development of video recording features
within \jm: an existing WebRTC video conferencing application. Throughout
the rest of this section we introduce the working environment and the most
important standards and software products used in \jm. In section
\ref{intro-recording} we define
what we mean by recording a conference and introduce the possible general
approaches to the problem.
In sections \ref{recording-video} through \ref{dsd} we examine specific
parts of the recording process in detail.
In section \ref{conclusion} we review accomplished work, and
in section \ref{future-work} we
review new features and optimizations to existing features which we plan to
develop in the near future.
\subsection{\bj}
\bj\footnote{\url{https://bluejimp.com}} is a small company which
offers support and development services mainly focused around the
\j\footnote{\url{https://jitsi.org}}
family of projects. The FLOSS (Free/Libre Open Source Software)
nature of these projects makes for a slightly unusual business model. The
company works with various kinds of customers who all have different use cases
for \j\ and need it adapted to their needs. While \bj\ has no exclusivity
on such adaptations, it is tightly involved in the development of the project
and some of the related technologies and standards. This has helped the company
acquire significant credibility and offer advantageous price/quality ratios.
In addition to orders from customers, \bj\ also often works on internal projects
that aim to enrich \j\ and make it more attractive to both users and
enterprises.
\bj\ is registered in Strasbourg, but the development team is international,
with people working from different geographic locations. Most communication happens over
the Internet using e-mail, instant messaging and audio/video calls.
My position in \bj\ is that of a software developer. Apart from development, my tasks also
involve a fair amount of research, experimentation and optimizations. I have
worked on \j\ previously and when my internship began, I was able to quickly
get accustomed to the environment.
\subsection{The \j\ family}
\label{intro-jitsi}
\j\ is a feature-rich internet communications client.
It is free software, licensed under the LGPL\cite{lgpl}.
The project was started by Emil Ivov in
the University of Strasbourg in 2003, and it was then known as SIP
Communicator. In the beginning
SIP Communicator was only a SIP client, but through the years it has evolved
into a multi-protocol
client (XMPP, AIM, Yahoo! and ICQ are now also supported) with a very wide variety
of features: instant messaging (IM), video calls, desktop streaming,
conferencing (multi-party, both audio and video), cross-protocol calls (SIP to
XMPP), media encryption (SRTP),
session establishment using ICE, file transfers and more.
Most of the development is financed by \bj.
\j\ is written mostly in Java and runs on a variety of
platforms (Windows, Linux, Mac OS X and Android). The various projects comprise a massive codebase:
over 700,000 lines of Java code alone.
A big part of the code originally in \j\ has been split out into a
separate library, \emph{libjitsi}. This allows it to be easily reused in other
projects, such as \jvb. The code in \lj\ deals mainly with multimedia-- capture
and rendering, transcoding, encryption/decryption and transport over the
network of audio and video data. It contains the RTP stack for \j\ (partially
implemented in \lj\ itself, partially in the external FMJ library).
\jvb\ is a server-side application which acts as a media
relay and/or mixer. It allows a smart client to organize a conference
using an existing technology (for example, SIP or XMPP/Jingle), outsourcing the bandwidth-intensive task of media relaying to a server.
The organizing client controls \jvb\ over XMPP\footnote{Although a REST API is also available.}, while the rest of the
participating clients need not be aware that \jvb\ is in use (for example they
can be simple SIP clients). Since the end of 2013 \jvb\ supports ICE and is WebRTC-compatible.
One of the latest additions to the \j\ family is
\jm\cite{jitsi-meet}. This is a WebRTC
application, which runs completely within a browser and creates a video
conference using a \jvb\ instance to relay the media. \jm\ is discussed in more detail in \ref{intro-jm}.
\subsection{WebRTC}
WebRTC (Web Real-Time Communications) is a set of specifications
currently in development, which allow browsers that implement them to open
"peer connections". These are direct connections between two
browsers (a web server is used only to set up the connection), and can be used to
send audio, video or application data. The specifications are open and are meant
to be implemented in the browsers themselves (without the need for additional
plug-ins).
WebRTC is divided in two main parts: the JavaScript standard APIs, being defined within
the \emph{WebRTC} working group\cite{webrtc-wg} at W3C,
and the on-the-wire protocols, being defined within
the \emph{RTCWEB} working group\cite{rtcweb-wg} at the IETF.
These standards provide web developers with a very powerful tool, which
can be used to easily create rich real-time multimedia applications.
There is also the possibility to pass arbitrary data in an
application-defined format. This allows for some very interesting and more
complicated use-cases.
The specifications are still being developed, but are already at an advanced stage. There is
an open-source implementation of the network protocols, provided by Google with
a BSD-like license. I will refer to this implementation as \emph{webrtc.org}
(which is the domain name of the project). Currently all browsers implementing
WebRTC (Chrome/Chromium, Opera and Mozilla Firefox)
use \emph{webrtc.org} as their base. Because of this, \wrtc\ is very important -- it is
used for all practical compatibility testing, making it the de-facto reference implementation.
\subsection{XMPP and Jingle}
Extensible Messaging and Presence Protocol (XMPP)\cite{rfc6120} is a mature,
XML-based protocol which allows endpoints
to exchange messages in near real-time. The core XMPP protocol only covers instant messaging (that is, the exchange of
text messages meant to be read by humans), but there are a variety of extensions that allow the protocol
to cover a wide range of use cases. Many such extensions are published as XMPP Extension Protocols (XEPs), and there are XEPs for
group chats, user avatars, file transfers, account registration, transporting
XMPP over HTTP, discovery of node capabilities, management of wireless sensors (provisioning, control, data collection),
and, most relevant here, internet telephony.
\emph{Jingle} (defined in XEP-0166\cite{jingle166} and XEP-0167\cite{jingle167})
is a signalling protocol, serving a purpose similar to that of SIP: it uses an
offer-answer model to set up an RTP session. Many of the
protocols often used with SIP, such as ICE\cite{ice}, ZRTP\cite{zrtp}, DTLS-SRTP\cite{rfc5763}, and
RFC4575\cite{rfc4575} can also be used with \emph{Jingle}. Mappings
have been defined between the two\cite{stoxmedia}, which allow gateways or rich
clients to organize cross-protocol calls.
\subsection{COLIBRI}
\begin{figure}[h]
\centering
\includegraphics[width=0.45\textwidth]{./pics/colibri-conf.eps}
\caption{A COLIBRI conference.}
\label{colibri-conf}
\end{figure}
COnferencing with LIghtweight BRIdging (COLIBRI, defined in XEP-0340\cite{colibri}) is an XMPP extension
developed mostly in \bj\ for use with \jvb. It provides a way for a client to control a multimedia
relay or mixer, such as \jvb. It works with the concept of a \emph{Conference}, which contains
\emph{Channel}s, separated into different \emph{Content}s (see fig.~\ref{colibri-conf}).
In the most common use case a client requests the creation of a \emph{Conference} with a specified number of
\emph{Channel}s. The mixer allocates local sockets for each \emph{Channel} and
provides their addresses to the client. The client then uses these transport
addresses as its own to establish, for example, a \emph{Jingle} call with another participant.
Instead of just allocating local sockets, the ICE protocol can be used, in
which case the mixer provides a list of ICE candidates for each \emph{Channel}.
The protocol works with a natural XML representation of a \emph{Conference}.
After a \emph{Conference} has been established, the client can add or remove channels
from it, or change the parameters (such as the direction in which
media is allowed to flow) of an existing \emph{Channel}.
The protocol is being extended for the purposes of \jm\ (see the next section), and
now also has support for establishing \emph{Channel}s which use DTLS\cite{dtls}, and
establishing special \emph{Channel}s for use with WebRTC data channels. It also supports the
starting and stopping of the recording for a specific \emph{Conference} (which was a
small extension implemented as part of the recording effort described in this document).
\subsection{\jm}
\label{intro-jm}
\jm\ uses the above-mentioned technologies to create a multi-party video conference.
The endpoints of the conference are simply WebRTC-enabled
browsers\footnote{Although at the present time only Chrome/Chromium and Opera are
supported.} running the actual \jm\ application. They all connect to an XMPP
server and join a Multi-User Chat (MUC) chatroom. One of the participants (the
first one to enter the chatroom) assumes the role of organizer (or focus).
The focus creates a COLIBRI conference on a \jvb\ instance (\emph{jvb}), and allocates
two COLIBRI channels for each participant (one for audio, and one for video).
Then, it initiates a separate \emph{Jingle} session with
each participant, using the
transport information (i.e. the list of ICE candidates) obtained from \emph{jvb}
instead of its own. When the participants accept the Jingle sessions, they in
effect perform ICE and establish direct RTP sessions with \emph{jvb}.
The resulting connections for signalling and media are depicted in figures
\ref{jitmeet-sig} and \ref{jitmeet-med} respectively.
\begin{figure}[h]
\centering
\begin{subfigure}[t]{0.4\textwidth}
\includegraphics[height=5.5cm]{./pics/jm-sig.eps}
\caption{Signalling connections in a \jm\ conference.
The solid lines are XMPP/Jingle sessions, the dashed line
is XMPP/COLIBRI. The thick line is an XMPP Component Connection.}
\label{jitmeet-sig}
\end{subfigure}
\quad
\quad
\quad
\begin{subfigure}[t]{0.4\textwidth}
\includegraphics[height=5.5cm]{./pics/jm-med.eps}
\caption{Media connections in a \jm\ conference. The
lines represent RTP/RTCP sessions.}
\label{jitmeet-med}
\end{subfigure}
\end{figure}
The \jvb\ instance runs as a relay (as opposed to a mixer) for both video and
audio (meaning that it only passes RTP packets between the participants,
without considering their payload).
\smallskip
On the user-interface end, \jm\ aims to make it as easy as possible for a
person to enter or organize a conference.
Entering a conference is accomplished by simply opening a URL such as \emph{https://meet.jit.si/ConferenceID}
(where \emph{ConferenceID} can be chosen by the user). If \emph{ConferenceID}
doesn't exist, it is automatically created and
the user assumes the role of focus, inviting anyone who enters later on. If
\emph{ConferenceID} exists,
the user joins it (possibly after entering a password for the conference).
When in a conference, the interface has two main elements: one big video
(taking all available space) and, overlaid on top of it,
scaled down versions of the videos of all other participants. Figure
\ref{jm-ss} is a screenshot from the actual application. Current work is underway to
use dominant speaker identification (see section \ref{dsd}) to change the video
shown in full size to the person currently speaking.
\begin{figure}[h]
\includegraphics[width=1\textwidth]{./pics/jm-ss2.eps}
\caption{A screen capture from a \jm\ conference.}
\label{jm-ss}
\end{figure}
\medskip
In contrast to other products for video conferencing over the internet, the
whole infrastructure needed to run \jm\ can be installed in a custom
environment. The only services which are needed are a web server
(only serving static content), an XMPP server, and an instance of \jvb. This
makes \jm\ very suitable for businesses (or even individuals) who want full
control over their conferencing solution.
\section{Recording a \jm\ conference}
\label{intro-recording}
\subsection{What we mean by recording}
By recording a multimedia conference in general we mean the following:
all audio and video flows exchanged during a conference are saved
to disk in some format, in a way which allows the whole conference to
be played back later on.
\medskip
In our specific case, the recording of a conference has four main parts:
\begin{itemize}
\item{\textbf{Recording video}: storing all video flows (0, 1 or more per participant) in separate, single track files.}
\item{\textbf{Recording audio}: storing all conference audio flows (usually 1 or more per participant) in either a single mix or separate single track files.}
\item{\textbf{Recording metadata}: persisting all non-media information that is important for the reconstitution of a conference.}
\item{\textbf{Post-processing}: combining all recorded audio, video and metadata in a single audio/video file.}
\end{itemize}
Recording of the media is the process in which the audio and video RTP streams
in the conference are converted to a convenient format and saved to disk. There
are many different ways in which this can be done. Our final solution (and some
of the ideas that didn't work) are discussed in detail in
sections \ref{recording-video} and \ref{recording-audio}.
By metadata we mean all additional information (apart from the media
itself) which is necessary to play back the conference later. This includes participant names,
filenames, synchronization source (SSRC) identifiers, flow start and end times, etc.
A detailed discussion of the
metadata that we use is provided in section \ref{recording-metadata}.
Post-processing in our case means taking all the recorded data and producing a
single file with one audio track and one video track, which can be easily
viewed, manipulated and uploaded to popular video streaming
platforms\footnote{Such as YouTube and Daily Motion}.
The details can vary, but
generally all audio is mixed together, and the videos are combined in a way to
resemble the \jm\ interface. Appendix \ref{jipopro} discusses the
post-processing application.
\subsection{Which entity performs recording}
As can be seen in figure \ref{jitmeet-med}, in a \jm\ conference both \jvb\ and
all the participants have access to the RTP streams, and so could potentially
perform recording.
\begin{wrapfigure}{R}{0.5\textwidth}
\centering
\includegraphics[width=0.4\textwidth]{./pics/jm-rec.eps}
\caption{A recording application connected to a \jm\ conference as a "fake" participant.}
\label{jitmeet-rec}
\end{wrapfigure}
Since the participating clients are running an application within their
browser, if we want one of them to do recording, we would need modifications to
the browsers. This is inconvenient because users would need to use modified
browsers, and because in most use-cases the recordings are going to be stored
(and post-processed) on a server, so they would have to be transferred there
somehow. To avoid this, a "fake" participant can
be added, which does not actually participate in the conference
(does not send audio or video), and runs on a server (and without the need for a browser at all). Still, it
connects as a normal participant and establishes an RTP session with \jvb\ (see
figure \ref{jitmeet-rec}).
Recording directly on \jvb\ is more straightforward, so our initial
implementation was focused on that. However, most of the code resides in \lj, allowing
it to be easily reused.
Shunyang Li, a student from Peking University, is currently working, under my
guidance and within the context of the Google Summer of Code program, on
\emph{Jirecon}\cite{jirecon}, a standalone XMPP
container for the recording application described here.
\section{Implementing support for the WebRTC transport layer}
The documents from the RTCWEB working group at the IETF specify how multimedia
is to be transported between WebRTC endpoints. For the most part, existing
standards are reused, which made our task of implementing support for the
WebRTC transport significantly more manageable.
It is mandatory to use the Interactive Connectivity Establishment (ICE\cite{ice}) protocol to
establish a session. This ensures that an endpoint will not send any media
before it has received consent (in the form of a STUN message) from the remote
side. This protects against possible traffic augmentation attacks, in which a malicious
web-server causes browsers to send large amounts of data (e.g. a video stream) to a
target.
After a connection is established using ICE, a DTLS-SRTP session is started.
This means that the endpoints use Datagram TLS (DTLS\cite{dtls})
to exchange key material, which is then used to generate session keys for a
Secure Real-Time Protocol (SRTP\cite{srtp})
session. The procedure is defined in RFC5763\cite{rfc5763}.
In a \jm\ conference, each participant's browser sets up two secure RTP sessions with
\jvb\ in this way.
\medskip
RTP provides an unreliable transport. For this reason \wrtc\ uses a couple of
mechanisms on top of RTP to improve the quality of the media.
Implementing support for these mechanisms in \lj\ was one of the most
significant efforts during our recording project. These efforts are
discussed in detail in sections \ref{red} to \ref{rtx}.
Before that, however, section \ref{lj} gives an overview of the existing RTP stack
in \lj.
\subsection{The RTP stack in \lj}
\label{lj}
\lj\ makes heavy use of the Freedom for Media in Java
(FMJ\cite{fmj}) library. This is an
open-source implementation of the Java Media Framework (JMF) API, and it is
used in \lj\ for a variety of tasks: capture and playback of media, conversion
(transcoding) of media, and for handling of basic RTP streams. FMJ is highly extensible, and
many components (such as media codecs, capture devices and renderers) are written in \lj.
\begin{figure}[h]
\centering
\includegraphics[width=9cm]{./pics/lj.eps}
\caption{General scheme of the RTP stack in \lj.}
\label{lj-scheme}
\end{figure}
\begin{wrapfigure}{R}{4cm}
\centering
\includegraphics[width=3cm]{./pics/lj-pt.eps}
\caption{The chain of \emph{PacketTransformer}s in \lj. The shaded elements are new additions.}
\label{lj-pt}
\end{wrapfigure}
The RTP stack used by FMJ lacks some features: notably support for SRTP and
for asymmetric payload type mappings (i.e. sending and receiving a given format
with two different RTP payload type numbers). In order for \lj\ to implement these features, it
intercepts the RTP packets from the actual socket in use, and processes them
before passing them on to FMJ. Specifically, packets go through a chain of
\emph{PacketTransformer}s, which perform various tasks. Figure \ref{lj-scheme}
illustrates this scheme.
\emph{PacketTransformer}s provide a convenient interface to intercept RTP and
RTCP packets at different stages of their processing and perform additional
operations on them. Despite their name, \emph{PacketTransformer}s don't need to change
the packets in any way, they can be used just for monitoring.
Figure \ref{lj-pt} lists the currently used \emph{PacketTransformer}s in \lj. A
packet which arrives from the network goes through the chain downwards (and
when packets are sent, they go through the same chain but in the other
direction). The transformer labelled "RFC6464" is used to extract the
audio level information from packets, which include this information in
an RTP header extension defined in RFC6464\cite{rfc6464}. The audio
levels are used for, among other things, performing dominant speaker
identification (see section \ref{dsd}). This transformer also serves
as a filter, dropping packets with audio marked to contain silence (in order to
avoid unnecessary processing, which is why the transformer is first in the
chain). The SRTP transformer decrypts SRTP packets. The
"Override PT" transformer changes the payload type numbers of packets. It is
used to implement asymmetric payload type mappings. The statistics transformer
monitors RTCP packets and extracts statistics from them, making them available
to other parts of the library.
The rest of the transformers (the ones in grey) were added with the
implementation of the recording system, and will be discussed in the next
sections.
\subsection{RED}
\label{red}
Our first task was to implement support for the RED
payload format for RTP. This format is defined in
RFC2198\cite{red} and allows the encapsulation of
one or more "virtual" RTP packets in a single RTP packet. It is intended to be
used with redundancy data. Its use is negotiated as a regular media format and
it does not have a static payload-type number, so a dynamic number is assigned
during negotiation.
In \wrtc, RED is supported and used for video streams. In the case of a \jm\
conference, it is negotiated between the clients, and \jvb\ has no way of
affecting its use or its payload-type number, because it does not actively
participate in the offer/answer procedure. This means that in order to record
video, the recorder has to understand RED.
\smallskip
We decided that the best way to implement RED in \lj\ is as
a \emph{PacketTransformer}. There was one complication--
\emph{PacketTransformer}s work with single packets (they take a single packet
as input and produce a single packet as output), while a RED packet may contain
multiple "virtual" RTP packets which would need to be output.
We modified \lj, so that all \emph{PacketTransformer}s work with multiple
packets at a time-- they take an array of packets as input and produce an array
as output. This change was not easy, because we had to make sure that we didn't
break existing code, but it proved useful later on when we added support for
ULPFEC and RTCP compound packets.
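After this change, the interface looks roughly as follows (a simplified sketch; the actual names and signatures in \lj\ may differ):
\begin{lstlisting}[frame=single]
// Simplified sketch of the PacketTransformer interface after the
// change to operate on arrays of packets.
interface PacketTransformer {
    // Applied to outgoing packets before they are sent.
    Packet[] transform(Packet[] packets);

    // Applied to incoming packets as they arrive from the network.
    Packet[] reverseTransform(Packet[] packets);
}
\end{lstlisting}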
We implemented a RED packet transformer following RFC2198 and inserted it in
the transformer chain, after the SRTP transformer.
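As an illustration of what such a transformer has to do, the following sketch (a hypothetical helper with simplified error handling, not the actual \lj\ code) walks the RFC2198 block headers and splits a RED payload into its constituent blocks:
\begin{lstlisting}[frame=single]
import java.util.*;

// Split a RED payload (RFC2198) into its blocks. Each 4-byte header
// (F bit set) carries a 10-bit block length; the final 1-byte header
// introduces the primary block, which occupies the rest of the payload.
static List<byte[]> splitRed(byte[] payload) {
    List<Integer> lengths = new ArrayList<>();
    int off = 0;
    while ((payload[off] & 0x80) != 0) {          // F bit set
        int len = ((payload[off + 2] & 0x03) << 8)
                | (payload[off + 3] & 0xFF);
        lengths.add(len);
        off += 4;
    }
    off += 1;                                     // final 1-byte header

    List<byte[]> blocks = new ArrayList<>();
    for (int len : lengths) {
        blocks.add(Arrays.copyOfRange(payload, off, off + len));
        off += len;
    }
    blocks.add(Arrays.copyOfRange(payload, off, payload.length));
    return blocks;
}
\end{lstlisting}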
\subsection{Uneven Level Protection Forward Error Correction}
\label{ulpfec}
Our next task was to implement support for the Forward Error Correction (FEC)
format used in \wrtc.
In general, FEC refers to a mechanism which allows
lost data to be recovered without retransmission. It involves sending redundant
data, in one way or another.
RFC5109\cite{ulpfec} defines a specific RTP payload-type for
redundant data called Uneven Level Protection FEC (ULPFEC). It is generic in
the sense that it can be used with any media payload-type
(audio or video, no matter what the codec is).
In \emph{webrtc.org},
ULPFEC is used for the video streams\footnote{For audio, Opus' own FEC scheme, which works
differently from ULPFEC, is used (and it is already supported in \lj).},
and while support for it is not strictly required for our video recording (as support for RED is),
it is important because by decreasing the number of irretrievably lost packets,
it will improve the quality of the recordings.
ULPFEC (in the rest of the section we refer to it as simply FEC) is applied to
an RTP stream (the "media stream", with "media packets"). When a sender uses FEC
for a particular stream, it adds additional packets to it ("FEC packets"). The basic idea is simple --
take a set $S$ of a few media packets and apply a parity operation (XOR) on it,
resulting in a FEC packet $f$. If any one of the packets in $S$ is lost,
the receiver can use the rest of the packets in $S$ together with $f$ to reconstruct the
lost packet\footnote{This is similar to how RAID5 works.}.
Along with the parity data, a FEC packet contains two fields which are used to
describe the set $S$ from which it was constructed: a "sequence number base"
field, and a bitmap field that describes the sequence numbers of the packets in
$S$ using the base. The packet $f$ is said to "protect" the packets in $S$. This scheme
allows FEC to work without any additional
signalling (apart from the payload-type number negotiated during session
initialization).
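To make the parity idea concrete, the sketch below shows how a single missing payload could be rebuilt from the received packets in $S$ and the FEC payload. It is only a conceptual illustration: real RFC5109 recovery also reconstructs RTP header fields and uses a length-recovery field to handle payloads of different sizes.
\begin{lstlisting}[frame=single]
import java.util.*;

// Conceptual parity recovery: the FEC payload is the XOR of the
// payloads in S, so XOR-ing it with the payloads that were received
// yields the one that was lost.
static byte[] recoverMissingPayload(List<byte[]> received, byte[] fec) {
    byte[] missing = fec.clone();
    for (byte[] p : received)
        for (int i = 0; i < p.length && i < missing.length; i++)
            missing[i] ^= p[i];
    return missing;
}
\end{lstlisting}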
The sender can control the amount of FEC packets it adds to a stream by changing the
number of protected packets, and it can do this dynamically, adapting
to network conditions. This is the most common usage of FEC, and the
one currently employed by \wrtc: the sender allocates a given fraction of the configured bandwidth
to FEC, and this fraction changes depending on the packet loss
statistics received with RTCP. The aim is to mitigate the effects of packet
loss without the need for retransmissions.
Another way to use FEC is for probing the available bandwidth. When the sender
detects stable network conditions, it wants to increase its sending bitrate, in
order to improve quality. However, this risks causing congestion, and therefore
packet loss. The sender initially increases its sending rate by significantly
increasing the amount of FEC. In this case, even if a congestion occurs, the receiver
is more likely to be able to reconstruct the media packets (without the need for
retransmissions). The sender then monitors the following RTCP reports. If they
indicate a high percentage of packet loss, the sender goes back to the
previous, lower rate. Otherwise, the sender keeps the total bitrate, but decreases
the rate of FEC, using the available bitrate for the encoder instead, thus improving
the video quality. This scheme is examined in \cite{NagySOE13}
and it may represent an avenue for future improvement of the \jvb\ and \jm\ platforms.
\subsubsection{Implementation of ULPFEC}
We decided to implement FEC as another \emph{PacketTransformer}. This is how it works:
We keep two buffers of packets: $bMedia$ and $bFEC$. With every FEC packet $f$
we associate the number $numMissing(f)$ of media packets protected by $f$
which we have not received (but we could receive later on).
When we receive a new media packet, we recalculate the values $numMissing(f)$ for
all $f$ in the $bFEC$ buffer. Then, for all $f$ in $bFEC$: if $(numMissing(f)
== 0)$, then we remove $f$ from $bFEC$. If $numMissing(f) > 1$, then we do
nothing. If $numMissing(f) == 1$, we use $f$ and $bMedia$ to reconstruct a media
packet and then we remove $f$ from $bFEC$.
When we receive a FEC packet $f$, we calculate $numMissing(f)$, and apply the same procedure
as above.
We have limited $bFEC$ to a small size, and if $bFEC$ is full when we receive a new FEC packet, we drop the oldest packet from it.
In this way, we prevent "stale" FEC packets (for
which $numMissing$ will always be $>1$, because more than one of their
protected packets have been lost) from accumulating and causing needless computation.
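The bookkeeping described above can be sketched as follows (a simplified outline using our own names for the buffers and helpers, not the actual \lj\ code):
\begin{lstlisting}[frame=single]
// Simplified outline of the buffer maintenance described above.
void onMediaPacket(Packet media) {
    bMedia.put(media.seq, media);
    checkFec();
}

void onFecPacket(Packet f) {
    if (bFec.size() >= MAX_FEC_PACKETS)
        bFec.removeOldest();          // drop "stale" FEC packets
    bFec.add(f);
    checkFec();
}

void checkFec() {
    for (Iterator<Packet> it = bFec.iterator(); it.hasNext();) {
        Packet f = it.next();
        int missing = numMissing(f);  // protected packets not in bMedia
        if (missing == 0) {
            it.remove();              // nothing left to recover
        } else if (missing == 1) {
            Packet rec = reconstruct(f, bMedia);  // XOR recovery
            bMedia.put(rec.seq, rec);
            it.remove();
        }
        // missing > 1: keep f, more media packets may still arrive
    }
}
\end{lstlisting}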
\subsubsection{Re-writing RTP sequence numbers}
RFC5109 does not place any restrictions on the placement of FEC packets within
a stream, and in our architecture FEC packets are handled entirely in the FEC
\emph{PacketTransformer} and not passed on to the rest of the application. This
presents a potential problem for the depacketizer (see section
\ref{depacketizer}), because it cannot differentiate
between a sequence number missing because a packet was lost and a sequence number missing
because it was used by a FEC packet.
For this reason we initially implemented re-writing of the RTP sequence numbers of
the media packets after they pass the \emph{PacketTransformer}: we decreased each
sequence number by the number of FEC packets already received (see figure \ref{fec-seqs} for an
illustration). This still leaves some problems, because we might incorrectly interpret
a lost FEC packet as a lost media packet, and because we might incorrectly renumber
some packets if they arrive out of order.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{./pics/fec-seqs.eps}
\caption{Re-writing sequence numbers after removal of FEC packets. The
marked packets are FEC. The line above shows the sequence numbers
before FEC is removed, the line below-- after.}
\label{fec-seqs}
\end{figure}
Upon further research we found that \wrtc\ restricts the placement of FEC
packets in the stream by only adding them at the end of a VP8 frame, after the
RTP packet with the \emph{M}-bit set (see section \ref{recording-video}). This
restriction allows their depacketizer to distinguish between the case of a lost
part of a frame and a sequence number being used for FEC, and allows their implementation
to work without the unnecessary and awkward complication of rewriting sequence numbers.
We updated our implementation -- all that was required was to remove the
code which does the sequence number change.
\subsection{Retransmissions}
\label{rtx}
Our next task was to understand how \wrtc\ uses RTP retransmissions, and
implement support for them in \lj. We found that \wrtc\ uses RTCP NACK messages
(an RTCP Feedback Message type defined in RFC4585\cite{rfc4585}) from the
receiving side, in order to notify the sender that a specific RTP packet (or a
set of packets) has not been received. When a sender receives a NACK message,
it attempts to retransmit the lost packets (i.e. it retransmits them if they
are still in its buffers).
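For reference, the Feedback Control Information of a NACK consists of one or more entries, each containing a 16-bit packet ID (PID) and a 16-bit bitmap of following lost packets (BLP). The sketch below (a hypothetical helper) expands one such entry into the sequence numbers it reports as lost:
\begin{lstlisting}[frame=single]
import java.util.*;

// Expand one RFC4585 NACK entry (PID + BLP bitmap) into the lost
// RTP sequence numbers it reports.
static List<Integer> lostSequenceNumbers(int pid, int blp) {
    List<Integer> lost = new ArrayList<>();
    lost.add(pid);
    for (int i = 0; i < 16; i++)
        if ((blp & (1 << i)) != 0)
            lost.add((pid + i + 1) & 0xFFFF);  // 16-bit wrap-around
    return lost;
}
\end{lstlisting}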
The \wrtc\ code uses NACKs and retransmissions only for the
video streams. Current versions do retransmissions by sending the exact same
RTP packets (without even re-encrypting, which causes some SRTP implementations
to falsely detect a replay attack), but there is a planned switch to using the
payload format defined in RFC4588\cite{rfc4588} to encapsulate retransmitted packets.
Currently we do not support RFC4588 in \lj. We plan to add support for it, in the form
of a \emph{PacketTransformer}. It will strip the additional RFC4588 headers and pass on
exact copies of the original packets. Then, in the rest of the library, we will handle the packets
in the same way as we currently handle non-RFC4588 retransmissions, which is nothing special:
we just ensure that we have buffers of sufficient size so that we don't drop
retransmitted packets because they arrive too late (see section
\ref{video-jb}).
\medskip
When the recording application runs on the \jvb, requesting retransmissions with NACKs is not
very important, because all RTP packets go through the bridge, and if the
bridge is missing a packet, then so are the rest of the participants. The recorder can
rely on the participants for sending NACKs, and just make use of the retransmissions themselves.
However, when the recorder runs in a separate application (as a "fake"
participant), this approach doesn't work, because packets might be lost between
the bridge and the recorder. Our implementation does not yet support sending NACKs, but we plan
to introduce it.
\subsection{RTCP compound packets}
RFC3550 specifies that two or more RTCP packets can be combined into a compound
RTCP packet. The format is very simple -- the packets are just concatenated
together and the length fields in their headers allow their later
reconstruction.
Because of the lack of support for such packets in FMJ (and because \wrtc\
makes use of them), we implemented support for them in \lj\ (as a \emph{PacketTransformer}).
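Splitting a compound packet is a direct application of the length field in each RTCP header (expressed in 32-bit words, minus one); a minimal sketch:
\begin{lstlisting}[frame=single]
import java.util.*;

// Split a compound RTCP packet into individual packets using the
// length field of each header (in 32-bit words, minus one).
static List<byte[]> splitCompound(byte[] buf) {
    List<byte[]> packets = new ArrayList<>();
    int off = 0;
    while (off + 4 <= buf.length) {
        int words = ((buf[off + 2] & 0xFF) << 8) | (buf[off + 3] & 0xFF);
        int bytes = (words + 1) * 4;
        packets.add(Arrays.copyOfRange(buf, off, off + bytes));
        off += bytes;
    }
    return packets;
}
\end{lstlisting}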
\section{Recording video}
\label{recording-video}
\subsection{The VP8 codec}
The WebRTC standards do not define a video codec which has to be supported by all clients. There
has been a very long discussion at the IETF
about whether to make a codec mandatory to implement (MTI), and if so, which one. The
four main options suggested were
\emph{(i) make the VP8 codec MTI};
\emph{(ii) make the H264 codec MTI};
\emph{(iii) have no MTI codecs};
\emph{(iv) make both VP8 and H264 MTI}. No consensus has been reached.
Nevertheless, currently VP8 is the de-facto standard codec for WebRTC, because
it is the only codec supported by \wrtc\ (and therefore \jm).
VP8 is a video compression format, defined in RFC6386\cite{vp8}. It was
originally developed by \emph{On2 Technologies}, which was acquired by
\emph{Google} in 2010. \emph{Google} published the specification and released a
reference implementation (\emph{libvpx}) under a BSD-like opensource
license. They also provided a statement granting permission for royalty-free
use of any of their patents used in
\emph{libvpx}\cite{webm-ip-rights}.
Both VP8 in general and \emph{libvpx} work exclusively with the I420 raw
(uncompressed) image format\cite{i420}.
A component called a VP8 encoder takes as input an I420 image and produces a "VP8
Compressed Frame". Similarly, a decoder reads VP8 compressed frames and
produces I420 images.
A separate specification\cite{vp8rtp} defines how to transport a VP8 compressed
frame over RTP. In short, the process involves optionally splitting a VP8
compressed frame into parts, prefixing each part with a structure called a "VP8 Payload
Descriptor", and then encapsulating each part in RTP. This process is referred
to
as packetization, and the reverse process (of collecting RTP packets and
constructing VP8 compressed frames)-- depacketization. Figure \ref{vp8-scheme}
provides a high-level overview of the use of VP8 with RTP.
\begin{figure}[h]
\includegraphics[width=0.9\textwidth]{./pics/vp8.eps}
\caption{Using VP8 over RTP (c.f. stands for "compressed frame")}
\label{vp8-scheme}
\end{figure}
\lj\ already has a VP8 implementation, which we use in the \j\ client. It
consists of four parts: an encoder and decoder (wrappers around \emph{libvpx}),
a packetizer and a depacketizer. For the purposes of recording, we only need
a depacketizer. We found that the existing depacketizer is not compliant with
the specification, and also not compatible with \wrtc. We decided to re-write it from scratch.
\subsection{Depacketization}
\label{depacketizer}
The VP8 Payload Descriptor can be thought of as an extension to the RTP header.
It is included in the beginning of the RTP payload for every RTP packet with
the VP8 format. It has variable length (between 1 and 6 bytes) and contains,
among others, the following fields:
\begin{itemize}
\item \emph{S}-bit: start of VP8 partition, set only if the first byte of the payload of the packet is the first byte of a VP8 partition.
\item \emph{PID}: Partition ID, specifies the ID of the VP8 partition to which the first byte of the payload of the packet belongs.
\item \emph{PictureID}: A running index of frames, incremented by 1 for each subsequent VP8 frame.
\end{itemize}
When we do depacketization, we use the above three fields, as well as the following fields
from the RTP header:
\begin{itemize}
\item \emph{Timestamp}: a 32-bit field specifying a generation timestamp of the payload. For VP8, all RTP packets from a given frame have the same timestamp.
\item \emph{Sequence number}: An index of RTP packets.
\item \emph{M}-bit: set for the last RTP packet of a frame (and only for it).
\end{itemize}
We implemented the algorithm suggested in the specification: we buffer RTP
packets locally, until we receive a packet from a new frame. At this point, we
check whether the buffer contains a full VP8 frame, and if it does we output
it. Otherwise, we drop the buffer and start to collect packets for the next
frame. In \lj, the depacketizer is part of the FMJ codec chain (see fig. \ref{lj-scheme}).
In order to decide whether a received packet is from a new frame or not, we use the
RTP timestamp and the \emph{PictureID} fields. If either of them doesn't match,
we assume that the packet is from a new frame.
We use the \emph{PID} and \emph{S} fields from the VP8 Payload Descriptor to
detect the first packet of a frame: the first, and only the first, packet of a frame
has both the \emph{S}-bit set and \emph{PID} set to 0.
These observations allow us to easily check whether we have a full frame
in the buffer or not. We have a full frame if: \emph{(i) we have the beginning
of a frame}; \emph{(ii) we have the end of a frame (a packet with the M-bit
set)}; and \emph{(iii) we have
all RTP sequence numbers in between}.
The following pseudo-code outlines the procedure.
\begin{lstlisting}[frame=single]
// Called for every incoming RTP packet of the stream.
receive(Packet p) {
    if (!belongsToBuffer(p))
        flush();           // drop the incomplete previous frame
    push(p);
    if (haveFullFrame())
        outputFrame();
}

// A packet belongs to the buffered frame if it carries the same
// RTP timestamp and PictureID as the packets already buffered.
belongsToBuffer(Packet p) {
    if (bufferEmpty())
        return true;
    return bufferRtpTimestamp == p.RtpTimestamp
        && bufferPictureID == p.PictureID;
}

// We have a full frame if we have its first packet (S-bit set,
// PID == 0), its last packet (M-bit set) and everything in between.
haveFullFrame() {
    if (!(buffer.first.S && buffer.first.PID == 0))
        return false;
    if (!buffer.last.M)
        return false;
    for (int i = buffer.first.seq; i <= buffer.last.seq; i++)
        if (!buffer.contains(i))
            return false;
    return true;
}
\end{lstlisting}
\subsection{Container format}
After depacketization, we are left with a stream of VP8 Compressed Frames. We needed to decide how to store them on disk. We considered three options:
\begin{itemize}
\item Use the \emph{ivf} container format
\item Define and use our own container format
\item Use the \emph{webm} container format
\end{itemize}
The \emph{ivf} format is a very simple video-only, VP8-only storage format. It
was developed with \emph{libvpx} for the purposes of testing the
implementation. It precedes each frame with a fixed-size header containing just
the length of the frame and a presentation timestamp. The only advantage of
using this format is the relative simplicity of its implementation. The
disadvantages are that few players support it (browsers, for example, do not),
and its lack of extensibility.
Defining our own container format has one advantage, and that's the possibility
to design it in a way that allows partial VP8 frames. The \emph{libvpx} decoder
has a mode which allows it to decode a frame even if parts of it are missing.
In order to use this API, however, the decoder needs to be provided with
information about which parts (which VP8 partitions) are missing, and this
information would be lost if we use \emph{ivf} or \emph{webm}. The
disadvantages of this approach are the complexity that it brings, and the
inflexibility with regards to the players -- none of the other tools which
we use (like \emph{ffmpeg}) would be able to handle it, and we would need to
implement our own decoder.
The \emph{webm}\cite{webm} format is a subset of
\emph{matroska}\cite{matroska}. It has been designed
specifically to contain VP8 (possibly interleaved with audio) and to be played
in browsers. It allows for much more flexibility than \emph{ivf}. The main advantage
is that it can be played by many players, and that it supports many features,
so we can later extend our implementation if needed.
\smallskip
We found a small library (written in C++, but with Java bindings already
available) with a simple API that would allow us to write \emph{webm} files
easily, so we decided to ignore \emph{ivf}, postpone the potential
definition of a new format, and use \emph{webm}.
We adapted the library to our needs, and implemented a \emph{WebmDataSink} class which takes
as input a stream of VP8 compressed frames and saves them in a \emph{webm} file.
\medskip
A VP8 stream transported over RTP does not have a strictly defined frame rate.
Each VP8 frame has its own, independent timestamp, generated by the source (usually at the
time of capture from a camera, before VP8 encoding has taken place), and this timestamp
gets translated into an RTP timestamp and is carried in RTP.
When we save VP8 in \emph{webm} format, we use the RTP timestamp in order to calculate
a presentation timestamp. This is simple-- for the first recorded frame we use
a presentation timestamp of 0 (in milliseconds) and for subsequent frames we
calculate the difference from the first frame.
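Since the RTP clock rate for video is 90 kHz, the conversion amounts to a simple division. The sketch below ignores wrap-around of the 32-bit RTP timestamp, which a real implementation has to handle:
\begin{lstlisting}[frame=single]
// Derive a webm presentation timestamp (in milliseconds) from the RTP
// timestamp, assuming the 90 kHz video RTP clock. Wrap-around of the
// 32-bit RTP timestamp is not handled in this simplified version.
long presentationTimeMs(long rtpTimestamp, long firstRtpTimestamp) {
    return (rtpTimestamp - firstRtpTimestamp) / 90;
}
\end{lstlisting}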
\subsection{Requesting keyframes}
VP8 has two types of frames: I-frames (or keyframes) which can be decoded
without any other context, and P-frames, whose decoding depends on previously
decoded frames. In order to start working, a VP8 decoder needs to first decode a keyframe.
Therefore, we want the recorded \emph{webm} files to start with a keyframe.
Because keyframes are rarely sent, and because we want to be able to start
recording a conference at any time, we needed a way to trigger the generation of a
keyframe. One way to do this, which is supported by \wrtc, is by the use of
RTCP Full-Intra Request (FIR) feedback messages (defined in
RFC5104\cite{rfc5104}, section 3.5.1).
We added support for FIR messages in \lj\ and made use of them to request
keyframes in the beginning of a recording. We faced a difficulty because FIR
messages contain an "SSRC of packer sender" field, and \wrtc\ clients only
accept messages from SSRCs that they know about (that is, that have been
specifically added via signalling). We had to make \jvb\ generate and use its
own SSRC, which it also announces to the focus via COLIBRI.
\subsection{Jitter buffer}
\label{video-jb}
A jitter buffer is normally used in a real-time multimedia application to
introduce a certain amount of delay in the playback of media,
mitigating the effects of the varying delay of packets on the network
(the jitter). A buffer which is too small gets emptied quickly when packets are
delayed and playback has
to be paused. A buffer which is too big adds unnecessary delay to the playback.
For this reason adaptive jitter buffers are used, which change their
size according to network conditions.
In the case of video with WebRTC, apart from just jitter on the network, there
might be packet retransmissions triggered by the receiver (which necessarily
arrive at least one RTT later than the originally transmitted packets), and it is beneficial if the
buffer is large enough to accept them (as opposed to dropping them because they
have arrived too late).
For the purposes of recording, a relatively long delay (on the order of a few
seconds) is acceptable, provided that the buffer can be emptied at the end of the recording,
without packets being discarded. For this reason we decided to use a fixed-size
jitter buffer.
FMJ already includes a jitter buffer in its chain (see figure \ref{lj-scheme}), but
for technical reasons it was hard to implement a way to empty it without
discarding its contents. We decided to implement our own buffer, in the form of
a \emph{PacketTransformer}. For simplicity, we limited its size to a given
number of packets, and not to a given length of time. We used a default size
of 300 packets, since we observed that this usually corresponds to between 3
and 10 seconds.
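A minimal sketch of such a buffer, limited by packet count, is shown below (a hypothetical class; sequence number wrap-around is ignored, and the actual \emph{PacketTransformer} in \lj\ is more involved):
\begin{lstlisting}[frame=single]
import java.util.*;

// Fixed-size reordering buffer limited by packet count. Packets are
// kept sorted by sequence number; when the buffer is full, the oldest
// packet is released to the rest of the chain.
class FixedSizeJitterBuffer {
    private final TreeMap<Integer, Packet> packets = new TreeMap<>();
    private final int maxPackets = 300;

    // Returns the packet to release, or null while still filling up.
    Packet insert(Packet p) {
        packets.put(p.seq, p);
        if (packets.size() > maxPackets)
            return packets.pollFirstEntry().getValue();
        return null;
    }

    // Called when recording stops: release everything, oldest first.
    List<Packet> drain() {
        List<Packet> out = new ArrayList<>(packets.values());
        packets.clear();
        return out;
    }
}
\end{lstlisting}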
\subsection{Overview}
\label{recording-video-overview}
The components discussed above are pieced together in a \lj\ class named
\emph{RecorderRtpImpl}\footnote{Because it implements the \emph{Recorder}
interface, using RTP as opposed to a capture device as input},
which we implemented specifically for recording video\footnote{Although we
later adapted it to record audio as well.}.
\emph{RecorderRtpImpl} takes its input in the form of RTP
packets. It first demultiplexes the packets by their SSRC, and puts them in a jitter
buffer, which is placed as the last element of the \emph{PacketTransformer} chain.
After they exit the jitter buffer, all packets
are passed to an instance of FMJ which is configured to transcode from the
"VP8/RTP" format, to the "VP8" (i.e. to do depacketization). FMJ does it's own
demultiplexing and creates a VP8 depacketizer for each stream. The depacktizer
is part of the FMJ codec chain (again, see figure \ref{lj-scheme}), and the
internal FMJ jitter buffer is practically not used. FMJ provides its output in the
form of multiple \emph{DataSource}s (one for each stream), which represent, in
essence, a stream of VP8 compressed frames. We take each of these
\emph{DataSource}s and pass it to a \emph{WebmDataSink} instance, which
produces the final \emph{webm} file.
As soon as we detect a new VP8 stream, before its packets enter the jitter
buffer, we request a keyframe by sending an RTCP FIR message.
When the recording of a stream is stopped (either because of a user request
via the API, or because of an RTCP BYE message, or because of a
timeout), we empty the jitter buffer for the specific SSRC, processing its contents.
Using this process, we save each stream in a
separate \emph{webm} file.
\section{Recording audio}
\label{recording-audio}
The WebRTC standards define two mandatory-to-implement audio codecs:
\emph{G711}\cite{g711} and \emph{Opus}\cite{opus}. All WebRTC endpoints are
required to implement them.
\emph{G711} is an audio codec originally developed in the 1970s by the ITU for
use in the telephone network. It works on an input PCM signal with a sampling