<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="std"
docName="draft-uberti-rtcweb-rfc8829bis-05" number="8829" consensus="true"
ipr="trust200902" obsoletes="8829" updates="" submissionType="IETF"
xml:lang="en" tocInclude="true" symRefs="true" sortRefs="true"
tocDepth="4" version="3">
<!-- xml2rfc v2v3 conversion 2.34.0 -->
<front>
<title abbrev="JSEP">JavaScript Session Establishment Protocol (JSEP)</title>
<seriesInfo name="RFC" value="8829"/>
<author fullname="Justin Uberti" initials="J." surname="Uberti">
<address>
<email>[email protected]</email>
</address>
</author>
<author fullname="Cullen Jennings" initials="C." surname="Jennings">
<organization>Cisco</organization>
<address>
<postal>
<street>400 3rd Avenue SW</street>
<city>Calgary</city>
<region>AB</region>
<code>T2P 4H2</code>
<country>Canada</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author fullname="Eric Rescorla" initials="E." surname="Rescorla" role="editor">
<organization>Mozilla</organization>
<address>
<email>[email protected]</email>
</address>
</author>
<date day="21" month="Sep" year="2023"/>
<keyword>webrtc</keyword>
<keyword>sdp</keyword>
<keyword>negotiation</keyword>
<keyword>signaling</keyword>
<keyword>peerconnection</keyword>
<keyword>api</keyword>
<keyword>ice</keyword>
<keyword>rtp</keyword>
<keyword>offer</keyword>
<keyword>answer</keyword>
<abstract>
<t>This document describes the mechanisms for allowing a
JavaScript application to control the signaling plane of a
multimedia session via the interface specified in the W3C
RTCPeerConnection API and discusses how this relates to existing
signaling protocols.</t>
<t>This specification obsoletes RFC 8829.</t>
</abstract>
</front>
<middle>
<section anchor="sec.introduction" numbered="true" toc="default">
<name>Introduction</name>
<t>This document describes how the W3C Web Real-Time Communication (WebRTC) RTCPeerConnection
interface
<xref target="W3C.webrtc" format="default"/> is used to control the setup,
management, and teardown of a multimedia session.</t>
<section anchor="sec.general-design-of-jsep" numbered="true" toc="default">
<name>General Design of JSEP</name>
<t>WebRTC call setup has been designed to focus on controlling
the media plane, leaving signaling-plane behavior up to the
application as much as possible. The rationale is that
different applications may prefer to use different protocols,
such as the existing SIP call signaling protocol, or something
custom to the particular application, perhaps for a novel use
case. In this approach, the key information that needs to be
exchanged is the multimedia session description, which
specifies the transport and media configuration
information necessary to establish the media plane.</t>
<t>With these considerations in mind, this document describes
the JavaScript Session Establishment Protocol (JSEP), which
allows for full control of the signaling state machine from
JavaScript. As described above, JSEP assumes a model in which a
JavaScript application executes inside a runtime containing
WebRTC APIs (the "JSEP implementation"). The JSEP
implementation is almost entirely divorced from the core
signaling flow, which is instead handled by the JavaScript
making use of two interfaces: (1) passing in local and remote
session descriptions and (2) interacting with the Interactive
Connectivity Establishment (ICE) state
machine <xref target="RFC8445"/>. The combination of the JSEP implementation and the
JavaScript application is referred to throughout this document
as a "JSEP endpoint".</t>
<t>In this document, the use of JSEP is described as if it
always occurs between two JSEP endpoints. Note, though, that in many
cases it will actually be between a JSEP endpoint and some kind
of server, such as a gateway or Multipoint Control Unit (MCU). This distinction is
invisible to the JSEP endpoint; it just follows the
instructions it is given via the API.</t>
<t>JSEP's handling of session descriptions is simple and
straightforward. Whenever an offer/answer exchange is needed,
the initiating side creates an offer by calling a createOffer
API. The application then uses that offer to set up its local
configuration via the setLocalDescription API. The offer is finally
sent off to the remote side over its preferred signaling
mechanism (e.g., WebSockets); upon receipt of that offer, the
remote party installs it using the setRemoteDescription
API.</t>
<t>To complete the offer/answer exchange, the remote party uses
the createAnswer API to generate an appropriate answer,
applies it using the setLocalDescription API, and sends the
answer back to the initiator over the signaling channel. When
the initiator gets that answer, it installs it using the
setRemoteDescription API, and initial setup is complete. This
process can be repeated for additional offer/answer
exchanges.</t>
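<t>The call ordering described above can be sketched as follows. This is
a minimal illustration rather than a normative algorithm: the
Description, PeerLike, and Send types and the startOffer and
onRemoteDescription helpers are hypothetical names introduced here, and
the signaling channel is assumed to be a simple asynchronous send
function.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Hypothetical structural types for this sketch; they mirror only the
// four API calls named in the text above.
interface Description {
  type: "offer" | "answer";
  sdp: string;
}

interface PeerLike {
  createOffer(): Promise<Description>;
  createAnswer(): Promise<Description>;
  setLocalDescription(desc: Description): Promise<void>;
  setRemoteDescription(desc: Description): Promise<void>;
}

// Stand-in for the application's signaling channel (assumption).
type Send = (desc: Description) => Promise<void>;

// Initiating side: create the offer, apply it locally, then signal it.
async function startOffer(pc: PeerLike, send: Send): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  await send(offer);
}

// Either side: install a description received from the remote party.
// An offer is answered; an answer completes the exchange.
async function onRemoteDescription(
  pc: PeerLike,
  desc: Description,
  send: Send,
): Promise<void> {
  await pc.setRemoteDescription(desc);
  if (desc.type === "offer") {
    const answer = await pc.createAnswer();
    await pc.setLocalDescription(answer);
    await send(answer);
  }
}
]]></sourcecode>
<t>In this sketch, the same onRemoteDescription helper serves both
parties: the answerer receives an offer and replies, while the
initiator receives the answer and simply installs it.</t>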
<t>Regarding ICE
<xref target="RFC8445" format="default"/>, JSEP decouples the ICE state
machine from the overall signaling state machine. The ICE
state machine must remain in the JSEP implementation because
only the implementation has the necessary knowledge of
candidates and other transport information. Performing this
separation provides additional flexibility in protocols that
decouple session descriptions from transport. For instance, in
traditional SIP, each offer or answer is self-contained,
including both the session descriptions and the transport
information. However,
<xref target="RFC8840" format="default"/> allows SIP to
be used with Trickle ICE
<xref target="RFC8838" format="default"/>, in which the session
description can be sent immediately and the transport
information can be sent when available. Sending transport
information separately can allow for faster ICE and DTLS
startup, since ICE checks can start as soon as any transport
information is available rather than waiting for all of it.
JSEP's decoupling of the ICE and signaling state machines
allows it to accommodate either model.</t>
<t>Although it abstracts signaling, the JSEP approach
requires that the application be aware of the signaling process.
While the application does not need to understand the contents
of session descriptions to set up a call, the application must
call the right APIs at the right times, convert the session
descriptions and ICE information into the defined messages of
its chosen signaling protocol, and perform the reverse
conversion on the messages it receives from the other side.</t>
<t>One way to make life easier for the application is to
provide a JavaScript library that hides this complexity from
the developer; said library would implement a given signaling
protocol along with its state machine and serialization code,
presenting a higher-level call-oriented interface to the
application developer. For example, libraries exist to provide
implementations of the SIP <xref target="RFC3261"/> and Extensible Messaging
and Presence Protocol (XMPP) <xref target="RFC6120"/> signaling
protocols atop the JSEP API.
Thus, JSEP
provides greater control for the experienced developer without
forcing any additional complexity on the novice developer.</t>
</section>
<section anchor="sec.other-approaches-consider" numbered="true" toc="default">
<name>Other Approaches Considered</name>
<t>One approach that was considered instead of JSEP was to
include a lightweight signaling protocol. Instead of providing
session descriptions to the API, the API would produce and
consume messages from this protocol. Although this would have
provided a higher-level API, it would have put more control of
signaling within the JSEP implementation, forcing it to understand and
handle concepts like signaling glare (see
<xref target="RFC3264" sectionFormat="comma" section="4"/>).</t>
<t>A second approach that was considered but not chosen was to
decouple the management of the media control objects from
session descriptions, instead offering APIs that would control
each component directly. This was rejected based on the
argument that requiring exposure of this level of complexity to
the application programmer would not be beneficial; it would
(1) result in an API where even a simple example would require a
significant amount of code to orchestrate all the needed
interactions and (2) create a large API surface that
would need to be agreed upon and documented.
In addition, these API
points could be called in any order, resulting in a more
complex set of interactions with the media subsystem than the
JSEP approach, which specifies how session descriptions are to
be evaluated and applied.</t>
<t>One variation on JSEP that was considered was to keep the
basic session-description-oriented API but to move the
mechanism for generating offers and answers out of the JSEP
implementation. Instead of providing createOffer/createAnswer
methods within the implementation, this approach would instead
expose a getCapabilities API, which would provide the
application with the information it needed in order to generate
its own session descriptions. This increases the amount of work
that the application needs to do; it needs to know how to
generate session descriptions from capabilities, and especially
how to generate the correct answer from an arbitrary offer and
the supported capabilities. While this could certainly be
addressed by using a library like the one mentioned above, it
basically forces the use of said library even for a simple
example. Providing createOffer/createAnswer avoids this
problem.</t>
</section>
<section>
<name>Changes from RFC 8829</name>
<t>
When <xref target="RFC8829"/> was published, inconsistencies regarding BUNDLE
<xref target="RFC8843"/> operation were identified with regard to
both the specification text and implementation behavior. The
former concern was addressed via an update to BUNDLE (see <xref target="RFC9143"/>).
For the latter concern, it was observed that some implementations
implemented the "max-bundle" bundle policy defined in <xref target="RFC8829"/>
by assuming that bundling had already been negotiated, rather than marking "m=" sections
as bundle-only as indicated by the BUNDLE specification.
In order to prevent unexpected changes to applications relying
on the pre-standard behavior, the decision
was made to deprecate "max-bundle" and instead
introduce an identically defined "must-bundle" policy that, when selected,
provides the behavior originally specified by <xref target="RFC8829"/>.
</t>
</section>
</section>
<section anchor="sec.terminology" numbered="true" toc="default">
<name>Terminology</name>
<t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
"<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
"<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
"<bcp14>SHOULD NOT</bcp14>",
"<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are
to be interpreted as described in BCP 14 <xref target="RFC2119"/>
<xref target="RFC8174"/> when, and only when, they appear in all capitals,
as shown here.</t>
</section>
<section anchor="sec.semantics-and-syntax" numbered="true" toc="default">
<name>Semantics and Syntax</name>
<section anchor="sec.signaling-model" numbered="true" toc="default">
<name>Signaling Model</name>
<t>JSEP does not specify a particular signaling model or state
machine, other than the generic need to exchange session
descriptions in the fashion described by
<xref target="RFC3264" format="default"/> (offer/answer) in order for both
sides of the session to know how to conduct the session. JSEP
provides mechanisms to create offers and answers, as well as to
apply them to a session. However, the JSEP implementation is
totally decoupled from the actual mechanism by which these
offers and answers are communicated to the remote side,
including addressing, retransmission, forking, and glare
handling. These issues are left entirely up to the application;
the application has complete control over which offers and
answers get handed to the implementation, and when.</t>
<figure anchor="fig-sigModel">
<name>JSEP Signaling Model</name>
<artwork name="" type="ascii-art" align="left" alt=""><![CDATA[
    +-----------+                               +-----------+
    |  Web App  |<--- App-Specific Signaling -->|  Web App  |
    +-----------+                               +-----------+
          ^                                            ^
          |  SDP                                       |  SDP
          V                                            V
    +-----------+                                +-----------+
    |   JSEP    |<----------- Media ------------>|   JSEP    |
    |   Impl.   |                                |   Impl.   |
    +-----------+                                +-----------+
]]></artwork>
</figure>
</section>
<section anchor="sec.session-descriptions-and-state-machine" numbered="true" toc="default">
<name>Session Descriptions and State Machine</name>
<t>In order to establish the media plane, the JSEP
implementation needs specific parameters to indicate what to
transmit to the remote side, as well as how to handle the media
that is received. These parameters are determined by the
exchange of session descriptions in offers and answers, and
there are certain details to this process that must be handled
in the JSEP APIs.</t>
<t>Whether a session description applies to the local side or
the remote side affects the meaning of that description. For
example, the list of codecs sent to a remote party indicates
what the local side is willing to receive, which, when
intersected with the set of codecs the remote side supports,
specifies what the remote side should send. However, not all
parameters follow this rule; some parameters are declarative,
and the remote side must either accept them or reject them
altogether. An example of such a parameter is the TLS
fingerprints <xref target="RFC8122" format="default"/>
as used in the context of DTLS <xref target="RFC6347" format="default"/>;
these fingerprints are calculated based on
the local certificate(s) offered and are not subject to
negotiation.
</t>
<t>In addition, various RFCs put different conditions on the
format of offers versus answers. For example, an offer may
propose an arbitrary number of "m=" sections (i.e., media
descriptions as described in
<xref target="RFC4566" sectionFormat="comma" section="5.14"/>), but an answer must
contain the exact same number as the offer.</t>
<t>Lastly, while the exact media parameters are known only
after an offer and an answer have been exchanged, the offerer
may receive ICE checks, and possibly media (e.g., in the case
of a re-offer after a connection has been established) before
it receives an answer. To properly process incoming media in
this case, the offerer's media handler must be aware of the
details of the offer before the answer arrives.</t>
<t>Therefore, in order to handle session descriptions properly,
the JSEP implementation needs:
</t>
<ol spacing="normal" type="1">
<li>To know if a session description pertains to the local or
remote side.</li>
<li>To know if a session description is an offer or an
answer.</li>
<li>To allow the offer to be specified independently of the
answer.</li>
</ol>
<t>JSEP addresses this by adding both setLocalDescription
and setRemoteDescription methods and having session description
objects contain a type field indicating the type of session
description being supplied. This satisfies the requirements
listed above for both the offerer, who first calls
setLocalDescription(sdp [offer]) and then later
setRemoteDescription(sdp [answer]), and the
answerer, who first calls setRemoteDescription(sdp [offer]) and
then later setLocalDescription(sdp [answer]).</t>
<t>During the offer/answer exchange, the outstanding offer is
considered to be "pending" at the offerer and the answerer, as
it may be either accepted or rejected. If this is a re-offer,
each side will also have "current" local and remote
descriptions, which reflect the result of the last offer/answer
exchange. Sections
<xref target="sec.pendinglocaldescription" format="counter"/>,
<xref target="sec.pendingremotedescription" format="counter"/>,
<xref target="sec.currentlocaldescription" format="counter"/>, and
<xref target="sec.currentremotedescription" format="counter"/> provide more
detail on pending and current descriptions.</t>
<t>JSEP also allows for an answer to be treated as provisional
by the application. Provisional answers provide a way for an
answerer to communicate initial session parameters back to the
offerer, in order to allow the session to begin, while allowing
a final answer to be specified later. This concept of a final
answer is important to the offer/answer model; when such an
answer is received, any extra resources allocated by the caller
can be released, now that the exact session configuration is
known. These "resources" can include things like extra ICE
components, Traversal Using Relays around NAT (TURN) candidates, or video decoders. Provisional
answers, on the other hand, do no such deallocation; as a
result, multiple dissimilar provisional answers, with their own
codec choices, transport parameters, etc., can be received and
applied during call setup. Note that the final answer itself
may differ from any provisional answers received earlier.</t>
<t>In
<xref target="RFC3264" format="default"/>, the constraint at the signaling
level is that only one offer can be outstanding for a given
session, but at the JSEP level, a new offer can be
generated at any point. For example, when using SIP for
signaling, if one offer is sent and is then canceled using a SIP
CANCEL, another offer can be generated even though no answer
was received for the first offer. To support this, the JSEP
media layer can provide an offer via the createOffer method
whenever the JavaScript application needs one for the
signaling. The answerer can send back zero or more provisional
answers and then finally end the offer/answer exchange by sending a
final answer. The state machine for this is as follows:</t>
<figure anchor="fig-state-machine">
<name>JSEP State Machine</name>
<artwork name="" type="ascii-art" align="left" alt=""><![CDATA[
                       setRemote(OFFER)               setLocal(PRANSWER)
                           /-----\                               /-----\
                           |     |                               |     |
                           v     |                               v     |
            +---------------+    |                +---------------+    |
            |               |----/                |               |----/
            |  have-        | setLocal(PRANSWER)  | have-         |
            |  remote-offer |------------------- >| local-pranswer|
            |               |                     |               |
            |               |                     |               |
            +---------------+                     +---------------+
                 ^   |                                   |
                 |   | setLocal(ANSWER)                  |
   setRemote(OFFER)  |                                   |
                 |   V                  setLocal(ANSWER) |
            +---------------+                            |
            |               |                            |
            |               |<---------------------------+
            |    stable     |
            |               |<---------------------------+
            |               |                            |
            +---------------+          setRemote(ANSWER) |
                 ^   |                                   |
                 |   | setLocal(OFFER)                   |
  setRemote(ANSWER)  |                                   |
                 |   V                                   |
            +---------------+                     +---------------+
            |               |                     |               |
            |  have-        | setRemote(PRANSWER) |have-          |
            |  local-offer  |------------------- >|remote-pranswer|
            |               |                     |               |
            |               |----\                |               |----\
            +---------------+    |                +---------------+    |
                           ^     |                               ^     |
                           |     |                               |     |
                           \-----/                               \-----/
                       setLocal(OFFER)               setRemote(PRANSWER)
]]></artwork>
</figure>
<t>Aside from these state transitions, there is no other
difference between the handling of provisional ("pranswer") and
final ("answer") answers.</t>
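<t>As an illustration, the transitions drawn in the state machine figure
can be written as a lookup table. This is a non-normative sketch: the
type names, the TRANSITIONS table, and the nextState helper are
hypothetical, and only the transitions shown in the figure are
modeled.</t>
<sourcecode name="" type="typescript"><![CDATA[
type SignalingState =
  | "stable"
  | "have-local-offer"
  | "have-remote-offer"
  | "have-local-pranswer"
  | "have-remote-pranswer";
type Source = "local" | "remote";      // setLocal(...) vs. setRemote(...)
type DescType = "offer" | "pranswer" | "answer";

// Keys are "<source>:<type>"; each entry is one arrow in the figure.
const TRANSITIONS: Record<
  SignalingState,
  Partial<Record<string, SignalingState>>
> = {
  "stable": {
    "local:offer": "have-local-offer",
    "remote:offer": "have-remote-offer",
  },
  "have-local-offer": {
    "local:offer": "have-local-offer",
    "remote:pranswer": "have-remote-pranswer",
    "remote:answer": "stable",
  },
  "have-remote-offer": {
    "remote:offer": "have-remote-offer",
    "local:pranswer": "have-local-pranswer",
    "local:answer": "stable",
  },
  "have-local-pranswer": {
    "local:pranswer": "have-local-pranswer",
    "local:answer": "stable",
  },
  "have-remote-pranswer": {
    "remote:pranswer": "have-remote-pranswer",
    "remote:answer": "stable",
  },
};

// Returns the state after applying a description, or undefined when
// the figure has no arrow for that combination.
function nextState(
  state: SignalingState,
  source: Source,
  type: DescType,
): SignalingState | undefined {
  return TRANSITIONS[state][source + ":" + type];
}
]]></sourcecode>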
</section>
<section anchor="sec.session-description-forma" numbered="true" toc="default">
<name>Session Description Format</name>
<t>JSEP's session descriptions use Session Description Protocol (SDP) syntax for their
internal representation. While this format is not optimal for
manipulation from JavaScript, it is widely accepted and is
frequently updated with new features; any alternate encoding of
session descriptions would have to keep pace with the changes
to SDP, at least until the time that this new encoding eclipsed
SDP in popularity.</t>
<t>However, to provide for future flexibility, the SDP syntax
is encapsulated within a SessionDescription object, which can
be constructed from SDP and be serialized out to SDP. If
future specifications agree on a JSON format for session
descriptions, this object could be enhanced to generate
and consume that JSON.</t>
<t>As detailed below, most applications should be able to treat
the SessionDescriptions produced and consumed by these various
API calls as opaque blobs; that is, the application will not
need to parse or understand them.</t>
</section>
<section anchor="sec.session-description-ctrl" numbered="true" toc="default">
<name>Session Description Control</name>
<t>In order to give the application control over various common
session parameters, JSEP provides control surfaces that tell
the JSEP implementation how to generate session descriptions.
In most cases, this removes the need for applications to modify session
descriptions after they are created.</t>
<t>Changes to these objects result in changes to the session
descriptions generated by subsequent createOffer/createAnswer
calls.</t>
<section anchor="sec.rtptransceivers" numbered="true" toc="default">
<name>RtpTransceivers</name>
<t>RtpTransceivers allow the application to control the RTP
media associated with one "m=" section. Each RtpTransceiver has
an RtpSender and an RtpReceiver, which an application can use
to control the sending and receiving of RTP media. The
application may also modify the RtpTransceiver directly, for
instance, by stopping it.</t>
<t>RtpTransceivers generally have a 1:1 mapping with "m="
sections, although there may be more RtpTransceivers than "m="
sections when RtpTransceivers are created but not yet
associated with an "m=" section, or if RtpTransceivers have been
stopped and disassociated from "m=" sections. An RtpTransceiver
is said to be associated with an "m=" section if its
media identification (mid) property is non-null; otherwise, it is said to be
disassociated. The associated "m=" section is determined using
a mapping between transceivers and "m=" section indices, formed
when creating an offer or applying a remote offer.</t>
<t>An RtpTransceiver is never associated with more than one
"m=" section, and once a session description is applied, an "m="
section is always associated with exactly one RtpTransceiver.
However, in certain cases where an "m=" section has been
rejected, as discussed in
<xref target="sec.subsequent-offers" format="default"/> below, that "m=" section
will be "recycled" and associated with a new RtpTransceiver
with a new MID value.</t>
<t>RtpTransceivers can be created explicitly by the
application or implicitly by calling setRemoteDescription
with an offer that adds new "m=" sections.</t>
</section>
<section anchor="sec.rtpsenders" numbered="true" toc="default">
<name>RtpSenders</name>
<t>RtpSenders allow the application to control how RTP media
is sent. An RtpSender is conceptually responsible for the
outgoing RTP stream(s) described by an "m=" section. This
includes encoding the attached MediaStreamTrack, sending RTP
media packets, and generating/processing the RTP Control Protocol (RTCP) for the
outgoing RTP stream(s).</t>
</section>
<section anchor="sec.rtpreceivers" numbered="true" toc="default">
<name>RtpReceivers</name>
<t>RtpReceivers allow the application to inspect how RTP
media is received. An RtpReceiver is conceptually responsible
for the incoming RTP stream(s) described by an "m=" section.
This includes processing received RTP media packets, decoding
the incoming stream(s) to produce a remote MediaStreamTrack,
and generating/processing RTCP for the incoming RTP
stream(s).</t>
</section>
</section>
<section anchor="sec.ice" numbered="true" toc="default">
<name>ICE</name>
<section anchor="sec.ice-gather-overview" numbered="true" toc="default">
<name>ICE Gathering Overview</name>
<t>JSEP gathers ICE candidates as needed by the application.
Collection of ICE candidates is referred to as a gathering
phase, and this is triggered either by the addition of a new
or recycled "m=" section to the local session description or by
new ICE credentials in the description, indicating an ICE
restart. Use of new ICE credentials can be triggered
explicitly by the application or implicitly by the JSEP
implementation in response to changes in the ICE
configuration.</t>
<t>When the ICE configuration changes in a way that requires
a new gathering phase, a 'needs-ice-restart' bit is set. When
this bit is set, calls to the createOffer API will generate
new ICE credentials. This bit is cleared by a call to the
setLocalDescription API with new ICE credentials from either
an offer or an answer, i.e., from either a locally or
remotely initiated ICE restart.</t>
<t>When a new gathering phase starts, the ICE agent will
notify the application that gathering is occurring through a state
change event. Then, when each new ICE candidate becomes available,
the ICE agent will supply it to the application via an
onicecandidate event; these candidates will also automatically be
added to the current and/or pending local session
description. Finally, when all candidates have been gathered,
a final onicecandidate event will be dispatched to signal that the
gathering process is complete.</t>
<t>Note that gathering phases only gather the candidates
needed by new/recycled/restarting "m=" sections; other "m="
sections continue to use their existing candidates. Also, if
an "m=" section is bundled (either by a successful bundle
negotiation or by being marked as bundle-only), then
candidates will be gathered and exchanged for that "m=" section
if and only if its MID item is a BUNDLE-tag, as described in
<xref target="RFC9143" format="default"/>.</t>
</section>
<section anchor="sec.ice-candidate-trickling" numbered="true" toc="default">
<name>ICE Candidate Trickling</name>
<t>Candidate trickling is a technique through which a caller
may incrementally provide candidates to the callee after the
initial offer has been dispatched; the semantics of "Trickle
ICE" are defined in
<xref target="RFC8838" format="default"/>. This process
allows the callee to begin acting upon the call and setting
up the ICE (and perhaps DTLS) connections immediately,
without having to wait for the caller to gather all possible
candidates. This results in faster media setup in cases where
gathering is not performed prior to initiating the call.</t>
<t>JSEP supports optional candidate trickling by providing
APIs, as described above, that provide control and feedback
on the ICE candidate gathering process. Applications that
support candidate trickling can send the initial offer
immediately and send individual candidates when they get
notified of a new candidate; applications that do not support
this feature can simply wait for the indication that
gathering is complete, and then create and send their offer,
with all the candidates, at that time.</t>
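<t>The two application strategies described above might be sketched as a
pair of candidate-event handlers. The handler names and callback
parameters are hypothetical, and the use of a null candidate to mark the
final event that signals the end of gathering is an assumption of this
sketch.</t>
<sourcecode name="" type="typescript"><![CDATA[
// null stands for the final event that signals the end of gathering
// (assumption of this sketch).
type CandidateCallback = (candidate: string | null) => void;

// Trickling application: forward each candidate as soon as it appears.
function trickleHandler(
  sendCandidate: (candidate: string) => void,
): CandidateCallback {
  return (candidate) => {
    if (candidate !== null) sendCandidate(candidate);
  };
}

// Non-trickling application: ignore individual candidates and act only
// once gathering is complete, when the offer can be created and sent
// with all candidates included.
function gatherThenSendHandler(
  sendFullOffer: () => void,
): CandidateCallback {
  return (candidate) => {
    if (candidate === null) sendFullOffer();
  };
}
]]></sourcecode>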
<t>Upon receipt of trickled candidates, the receiving
application will supply them to its ICE agent. This triggers
the ICE agent to start using the new remote candidates for
connectivity checks.</t>
<section anchor="sec.ice-candidate-format" numbered="true" toc="default">
<name>ICE Candidate Format</name>
<t>In JSEP, ICE candidates are abstracted by an
IceCandidate object, and as with session descriptions, SDP
syntax is used for the internal representation.</t>
<t>The candidate details are specified in an IceCandidate
field, using the same SDP syntax as the
"candidate-attribute" field defined in
<xref target="RFC8839" sectionFormat="comma" section="5.1"/>. Note that this
field does not contain an "a=" prefix, as indicated in the
following example:</t>
<sourcecode name="" type="sdp"><![CDATA[
candidate:1 1 UDP 1694498815 192.0.2.33 10000 typ host ]]></sourcecode>
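<t>As a rough illustration, the leading fields of such a string could be
split out as shown below. The ParsedCandidate type and parseCandidate
function are hypothetical names; the sketch handles only the mandatory
fields, in the order used by the example above, and ignores any
extension attributes that follow the candidate type.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Hypothetical result type for this sketch.
interface ParsedCandidate {
  foundation: string;
  componentId: number;
  transport: string;
  priority: number;
  address: string;
  port: number;
  type: string;
}

// Parses the mandatory fields of a candidate string that has no "a="
// prefix. Returns null if the string does not have the expected shape.
function parseCandidate(line: string): ParsedCandidate | null {
  const prefix = "candidate:";
  if (line.indexOf(prefix) !== 0) return null;
  const parts = line.slice(prefix.length).trim().split(/\s+/);
  if (parts.length < 8 || parts[6] !== "typ") return null;
  const digits = /^\d+$/;
  if (!digits.test(parts[1])) return null;   // component ID
  if (!digits.test(parts[3])) return null;   // priority
  if (!digits.test(parts[5])) return null;   // port
  return {
    foundation: parts[0],
    componentId: Number(parts[1]),
    transport: parts[2],
    priority: Number(parts[3]),
    address: parts[4],
    port: Number(parts[5]),
    type: parts[7],
  };
}
]]></sourcecode>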
<t>The IceCandidate object contains a field to indicate
which ICE username fragment (ufrag) it is associated with, as defined in
<xref target="RFC8839" sectionFormat="comma" section="5.4"/>. This value is used
to determine which session description (and thereby which
gathering phase) this IceCandidate belongs to, which helps
resolve ambiguities during ICE restarts. If this field is
absent in a received IceCandidate (perhaps when
communicating with a non-JSEP endpoint), the most recently
received session description is assumed.</t>
<t>The IceCandidate object also contains fields to indicate
which "m=" section it is associated with, which can be
identified in one of two ways: either by an "m=" section
index or by a MID. The "m=" section index is a zero-based
index, with index N referring to the N+1th "m=" section in
the session description referenced by this IceCandidate.
The MID is a "media stream identification" value, as
defined in
<xref target="RFC5888" sectionFormat="comma" section="4"/>, which provides a
more robust way to identify the "m=" section in the session
description, using the MID of the associated RtpTransceiver
object (which may have been locally generated by the
answerer when interacting with a non-JSEP endpoint that
does not support the MID attribute, as discussed in
<xref target="sec.applying-a-remote-desc" format="default"/> below). If the
MID field is present in a received IceCandidate, it <bcp14>MUST</bcp14> be
used for identification; otherwise, the "m=" section index is
used instead.</t>
<t>Other than the "m=" section index, all IceCandidate fields are optional, and
implementations <bcp14>MUST NOT</bcp14> reject a candidate simply because
an optional field is missing.</t>
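<t>As a non-normative illustration, the identification rule above can be
sketched as follows. The interface and field names used here (CandidateSketch,
sdpMid, sdpMLineIndex, usernameFragment, MSectionSketch) are assumptions made
for this example only and are not defined by this document. The sketch also
assumes that a MID that is present but matches no "m=" section yields no
match, rather than falling back to the index.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Hypothetical shapes for illustration only; not a normative API.
interface CandidateSketch {
  candidate: string;         // candidate-attribute value, without "a="
  sdpMid?: string;           // MID of the associated "m=" section
  sdpMLineIndex?: number;    // zero-based "m=" section index
  usernameFragment?: string; // ICE ufrag of the gathering phase
}

interface MSectionSketch {
  index: number;
  mid?: string;
}

// Returns the "m=" section that a received candidate refers to, or
// undefined if nothing matches. A present MID always takes precedence;
// the index is consulted only when the MID is absent.
function resolveMSection(
  cand: CandidateSketch,
  sections: MSectionSketch[]
): MSectionSketch | undefined {
  for (const s of sections) {
    if (cand.sdpMid !== undefined) {
      if (s.mid === cand.sdpMid) return s;
    } else if (cand.sdpMLineIndex !== undefined) {
      if (s.index === cand.sdpMLineIndex) return s;
    }
  }
  return undefined;
}
]]></sourcecode>
<t>In this sketch, a candidate that carries both a MID and an index is
resolved by MID alone, so a stale index cannot redirect it to a different
"m=" section.</t>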
</section>
</section>
<section anchor="sec.ice-candidate-policy" numbered="true" toc="default">
<name>ICE Candidate Policy</name>
<t>Typically, when gathering ICE candidates, the JSEP
implementation will gather all possible forms of initial
candidates -- host, server-reflexive, and relay.
However, in
certain cases, applications may want to have more specific
control over the gathering process, due to privacy or related
concerns. For example, one may want to only use relay
candidates, to leak as little location information as
possible (keeping in mind that this choice comes with
corresponding operational costs). To accomplish this, JSEP
allows the application to restrict which ICE candidates are
used in a session. Note that this filtering is applied on top
of any restrictions the implementation chooses to enforce
regarding which IP addresses are permitted for the
application, as discussed in
<xref target="RFC8828" format="default"/>.</t>
<t>There may also be cases where the application wants to
change which types of candidates are used while the session
is active. A prime example is where a callee may initially
want to use only relay candidates, to avoid leaking location
information to an arbitrary caller, but then change to use
all candidates (for lower operational cost) once the user has
indicated that they want to take the call. For this scenario, the
JSEP implementation <bcp14>MUST</bcp14> allow the candidate policy to be
changed in mid-session, subject to the aforementioned
interactions with local policy.</t>
<t>To administer the ICE candidate policy, the JSEP
implementation will determine the current setting at the
start of each gathering phase. Then, during the gathering
phase, the implementation <bcp14>MUST NOT</bcp14> expose candidates
disallowed by the current policy to the application, use them
as the source of connectivity checks, or indirectly expose
them via other fields, such as the raddr/rport attributes for
other ICE candidates. Later, if a different policy is
specified by the application, the application can apply it by
kicking off a new gathering phase via an ICE restart.</t>
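<t>The filtering described above might be sketched as follows; this is a
non-normative illustration, not a description of any particular
implementation. The function names are hypothetical, the policy value "relay"
stands for a relay-only policy, and masking the related address with
"0.0.0.0" (or "::") and port 9 is a choice made by this sketch to avoid
exposing a disallowed candidate through raddr/rport.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Hypothetical policy names for this sketch.
type PolicySketch = "all" | "relay";

// Extracts the token that follows "typ" in a candidate-attribute string.
function candidateType(candidate: string): string | undefined {
  const tokens = candidate.split(" ");
  const i = tokens.indexOf("typ");
  return i >= 0 && i + 1 < tokens.length ? tokens[i + 1] : undefined;
}

// Replaces the raddr/rport values with placeholders so that a permitted
// candidate does not indirectly expose a disallowed one.
function maskRelatedAddress(candidate: string): string {
  const tokens = candidate.split(" ");
  const addr = tokens.length > 4 ? tokens[4] : "";
  const isIPv6 = addr.indexOf(":") >= 0;
  for (let i = 0; i + 1 < tokens.length; i++) {
    if (tokens[i] === "raddr") {
      tokens[i + 1] = isIPv6 ? "::" : "0.0.0.0";
    }
    if (tokens[i] === "rport") {
      tokens[i + 1] = "9";
    }
  }
  return tokens.join(" ");
}

// Returns the candidates that may be surfaced to the application and
// used for connectivity checks under the given policy.
function applyCandidatePolicy(
  policy: PolicySketch,
  gathered: string[]
): string[] {
  if (policy === "all") return gathered.slice();
  const out: string[] = [];
  for (const c of gathered) {
    if (candidateType(c) === "relay") out.push(maskRelatedAddress(c));
  }
  return out;
}
]]></sourcecode>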
</section>
<section anchor="sec.ice-candidate-pool" numbered="true" toc="default">
<name>ICE Candidate Pool</name>
<t>JSEP applications typically inform the JSEP implementation
to begin ICE gathering via the information supplied to
setLocalDescription, as the local description indicates the
number of ICE components that will be needed and for which
candidates must be gathered. However, to accelerate cases
where the application knows the number of ICE components to
use ahead of time, it may ask the implementation to gather a
pool of potential ICE candidates to help ensure rapid media
setup.</t>
<t>When setLocalDescription is eventually called and the
JSEP implementation prepares to gather the needed ICE candidates,
it <bcp14>SHOULD</bcp14> start by checking if any candidates are available
in the pool. If there are candidates in the pool, they <bcp14>SHOULD</bcp14>
be handed to the application immediately via the ICE
candidate event. If the pool becomes depleted, either because
a larger-than-expected number of ICE components are used or
because the pool has not had enough time to gather
candidates, the remaining candidates are gathered as usual.
The pool is consulted only for the first offer/answer exchange, after
which it is emptied and no longer used.</t>
<t>One example of where this concept is useful is an
application that expects an incoming call at some point in
the future, and wants to minimize the time it takes to
establish connectivity, to avoid clipping of initial media.
By pre-gathering candidates into the pool, it can exchange
and start sending connectivity checks from these candidates
almost immediately upon receipt of a call. Note, though, that
by holding on to these pre-gathered candidates, which will be
kept alive as long as they may be needed, the application
will consume resources on the STUN/TURN servers it is
using. ("STUN" stands for "Session Traversal Utilities for NAT".)</t>
</section>
<section numbered="true" toc="default">
<name>ICE Versions</name>
<t>While this specification formally relies on <xref target="RFC8445" format="default"/>, at the time of its publication, the
majority of WebRTC implementations support the version
of ICE described in <xref target="RFC5245" format="default"/>. The "ice2" attribute defined in <xref target="RFC8445" format="default"/>
can be used to detect the version in use by a remote endpoint
and to provide a smooth transition from the older specification
to the newer one. Implementations <bcp14>MUST</bcp14> be able to accept remote
descriptions that do not have the "ice2" attribute.</t>
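<t>As a non-normative sketch, an application or implementation could detect
the remote ICE version by looking for the "ice2" tag. The sketch assumes that
the tag is conveyed in an "a=ice-options" line as a space-separated token;
the function name is hypothetical.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Returns true if any "a=ice-options" line in the given SDP lists the
// "ice2" tag. A description without the tag is still acceptable; the
// caller simply treats the remote endpoint as using the older version.
function remoteIndicatesIce2(sdp: string): boolean {
  const prefix = "a=ice-options:";
  for (const line of sdp.split(/\r?\n/)) {
    if (line.indexOf(prefix) !== 0) continue;
    const tags = line.substring(prefix.length).split(" ");
    if (tags.indexOf("ice2") >= 0) return true;
  }
  return false;
}
]]></sourcecode>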
</section>
</section>
<section anchor="sec.imageattr" numbered="true" toc="default">
<name>Video Size Negotiation</name>
<t>Video size negotiation is the process through which a
receiver can use the "a=imageattr" SDP attribute
<xref target="RFC6236" format="default"/> to indicate what video frame sizes it
is capable of receiving. A receiver may have hard limits on
what its video decoder can process, or it may have some maximum
set by policy. By specifying these limits in an "a=imageattr"
attribute, JSEP endpoints can attempt to ensure that the remote
sender transmits video at an acceptable resolution. However,
when communicating with a non-JSEP endpoint that does not
understand this attribute, any signaled limits may be exceeded,
and the JSEP implementation <bcp14>MUST</bcp14> handle this gracefully, e.g.,
by discarding the video.</t>
<t>Note that certain codecs support transmission of samples
with aspect ratios other than 1.0 (i.e., non-square pixels).
JSEP implementations will not transmit non-square pixels but
<bcp14>SHOULD</bcp14> receive and render such video with the correct aspect
ratio. However, sample aspect ratio has no impact on the size
negotiation described below; all dimensions are measured in
pixels, whether square or not.</t>
<section anchor="sec.creating-imageattr" numbered="true" toc="default">
<name>Creating an imageattr Attribute</name>
<t>The receiver will first combine any known local limits
(e.g., hardware decoder capabilities or local policy) to
determine the absolute minimum and maximum sizes it can
receive. If there are no known local limits, the
"a=imageattr" attribute <bcp14>SHOULD</bcp14> be omitted. If these local
limits preclude receiving any video, i.e., the degenerate
case of no permitted resolutions, the "a=imageattr" attribute
<bcp14>MUST</bcp14> be omitted, and the "m=" section <bcp14>MUST</bcp14> be marked as
sendonly/inactive, as appropriate.</t>
<t>Otherwise, an "a=imageattr" attribute is created with a
"recv" direction, and the resolution space formed by
intersecting the aforementioned limits is used to specify its
minimum and maximum "x=" and "y=" values.</t>
<t>The rules here express a single set of preferences, and
therefore, the "a=imageattr" "q=" value is not important. It
<bcp14>SHOULD</bcp14> be set to "1.0".</t>
<t>The "a=imageattr" field is payload type specific. When all
video codecs supported have the same capabilities, use of a
single attribute, with the wildcard payload type (*), is
<bcp14>RECOMMENDED</bcp14>. However, when the supported video codecs have
different limitations, specific "a=imageattr" attributes <bcp14>MUST</bcp14>
be inserted for each payload type.</t>
<t>As an example, consider a system with a multiformat video
decoder, which is capable of decoding any resolution from
48x48 to 720p. In this case, the implementation would
generate this attribute:</t>
<sourcecode name="" type="sdp"><![CDATA[
a=imageattr:* recv [x=[48:1280],y=[48:720],q=1.0] ]]></sourcecode>
<t>This declaration indicates that the receiver is capable of
decoding any image resolution from 48x48 up to 1280x720
pixels.</t>
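<t>The creation rules in this section can be sketched, non-normatively, as
follows. The type and function names (SizeLimitsSketch, ImageAttrResultSketch,
describeReceiveLimits) are hypothetical, and the sketch assumes that the
local limits have already been combined into a single range per
dimension.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Combined local limits, in pixels (hypothetical shape).
interface SizeLimitsSketch {
  minWidth: number;
  maxWidth: number;
  minHeight: number;
  maxHeight: number;
}

interface ImageAttrResultSketch {
  canReceive: boolean; // false: mark the "m=" section sendonly/inactive
  attribute?: string;  // absent: omit "a=imageattr" entirely
}

function describeReceiveLimits(
  limits?: SizeLimitsSketch,
  payloadType: string = "*"
): ImageAttrResultSketch {
  // No known local limits: the attribute is omitted.
  if (limits === undefined) return { canReceive: true };
  // Degenerate case: no resolution is permitted at all.
  if (limits.minWidth > limits.maxWidth ||
      limits.minHeight > limits.maxHeight) {
    return { canReceive: false };
  }
  // A single set of preferences is expressed, so "q=" is set to 1.0.
  const attribute =
    "a=imageattr:" + payloadType + " recv [x=[" +
    limits.minWidth + ":" + limits.maxWidth + "],y=[" +
    limits.minHeight + ":" + limits.maxHeight + "],q=1.0]";
  return { canReceive: true, attribute: attribute };
}
]]></sourcecode>
<t>Called with limits of 48 to 1280 pixels wide and 48 to 720 pixels high,
this sketch produces the attribute shown in the example above.</t>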
</section>
<section anchor="sec.interpreting-imageattr" numbered="true" toc="default">
<name>Interpreting imageattr Attributes</name>
<t>
<xref target="RFC6236" format="default"/> defines "a=imageattr" to be an
advisory field. This means that it does not absolutely
constrain the video formats that the sender can use but
gives an indication of the preferred values.</t>
<t>This specification prescribes behavior that is more specific. When
a MediaStreamTrack, which is producing video of a certain
resolution (the "track resolution"), is attached to an
RtpSender, which is encoding the track video at the same or
lower resolution(s) (the "encoder resolutions"), and a remote
description is applied that references the sender and
contains valid "a=imageattr recv" attributes, the RtpSender <bcp14>MUST</bcp14> follow
the rules below to ensure that the sender does not transmit a
resolution that would exceed the size criteria specified in
the attributes. These rules <bcp14>MUST</bcp14> be followed as long as the
attributes remain present in the remote description,
including cases in which the track changes its resolution or
is replaced with a different track.</t>
<t>Depending on how the RtpSender is configured, it may be
producing a single encoding at a certain resolution or, if
simulcast
(<xref target="sec.simulcast" format="default"/>) has been negotiated, multiple
encodings, each at their own specific resolution. In
addition, depending on the configuration, each encoding may
have the flexibility to reduce resolution when needed or may
be locked to a specific output resolution.</t>
<t>For each encoding being produced by the RtpSender, the set
of "a=imageattr recv" attributes in the corresponding "m="
section of the remote description is processed to determine
what should be transmitted. Only attributes that reference
the media format selected for the encoding are considered;
each such attribute is evaluated individually, starting with
the attribute with the highest "q=" value. If multiple
attributes have the same "q=" value, they are evaluated in
the order they appear in their containing "m=" section. Note
that while JSEP endpoints will include at most one
"a=imageattr recv" attribute per media format, JSEP endpoints
may receive session descriptions from non-JSEP endpoints with
"m=" sections that contain multiple such attributes.</t>
<t>For each "a=imageattr recv" attribute, the following rules
are applied. If this processing is successful, the encoding
is transmitted accordingly, and no further attributes are
considered for that encoding. Otherwise, the next attribute
is evaluated, in the aforementioned order. If none of the
supplied attributes can be processed successfully, the
encoding <bcp14>MUST NOT</bcp14> be transmitted, and an error <bcp14>SHOULD</bcp14> be
raised to the application.
</t>
<ul spacing="normal">
<li>The limits from the attribute are compared to the
encoder resolution. Only the specific limits mentioned
below are considered; any other values, such as picture
aspect ratio, <bcp14>MUST</bcp14> be ignored. When considering a
MediaStreamTrack that is producing rotated video, the
unrotated resolution <bcp14>MUST</bcp14> be used for the checks. This is
required regardless of whether the receiver supports
performing receive-side rotation (e.g., through Coordination of
Video Orientation (CVO)
<xref target="TS26.114" format="default"/>), as it significantly simplifies
the matching logic.</li>
<li>If the attribute includes a "sar=" (sample aspect ratio)
value set to something other than "1.0", indicating that the
receiver wants to receive non-square pixels, this cannot be
satisfied and the attribute <bcp14>MUST NOT</bcp14> be used.</li>
<li>If the encoder resolution exceeds the maximum size
permitted by the attribute and the encoder is allowed to
adjust its resolution, the encoder <bcp14>SHOULD</bcp14> apply downscaling
in order to satisfy the limits. Downscaling <bcp14>MUST NOT</bcp14> change
the picture aspect ratio of the encoding, ignoring any
trivial differences due to rounding. For example, if the
encoder resolution is 1280x720 and the attribute specified
a maximum of 640x480, the expected output resolution would
be 640x360. If downscaling cannot be applied, the attribute
<bcp14>MUST NOT</bcp14> be used.</li>
<li>If the encoder resolution is less than the minimum size
permitted by the attribute, the attribute <bcp14>MUST NOT</bcp14> be used;
the encoder <bcp14>MUST NOT</bcp14> apply upscaling. JSEP implementations
<bcp14>SHOULD</bcp14> avoid this situation by allowing receipt of
arbitrarily small resolutions, perhaps via fallback to a
software decoder.</li>
<li>If the encoder resolution is within the maximum and
minimum sizes, no action is needed.</li>
</ul>
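<t>The evaluation order and the per-attribute rules above can be sketched,
non-normatively, as follows. The names RecvAttrSketch, SizeSketch,
tryAttribute, and selectOutputSize are hypothetical. The sketch assumes the
attributes have already been filtered to those that reference the selected
media format, that the encoder resolution passed in is the unrotated one, and
that a downscaled size falling below the attribute's minimum is treated as
unsatisfiable (the last point is an assumption of this sketch).</t>
<sourcecode name="" type="typescript"><![CDATA[
interface RecvAttrSketch {
  q: number;     // preference value of the attribute
  sar?: number;  // sample aspect ratio, if the attribute carries one
  minW: number; maxW: number;
  minH: number; maxH: number;
}

interface SizeSketch { width: number; height: number; }

// Applies the rules to one attribute. Returns the size to transmit, or
// undefined if the attribute cannot be used.
function tryAttribute(
  attr: RecvAttrSketch, enc: SizeSketch, canDownscale: boolean
): SizeSketch | undefined {
  // Non-square pixels cannot be satisfied.
  if (attr.sar !== undefined && attr.sar !== 1.0) return undefined;
  // Below the minimum: upscaling is not permitted.
  if (enc.width < attr.minW || enc.height < attr.minH) return undefined;
  if (enc.width > attr.maxW || enc.height > attr.maxH) {
    if (!canDownscale) return undefined;
    // One scale factor for both axes preserves the picture aspect ratio.
    const scale = Math.min(attr.maxW / enc.width, attr.maxH / enc.height);
    const out = {
      width: Math.round(enc.width * scale),
      height: Math.round(enc.height * scale),
    };
    // Assumption of this sketch: reject results under the minimum.
    if (out.width < attr.minW || out.height < attr.minH) return undefined;
    return out;
  }
  return enc; // Within limits: no action is needed.
}

// Evaluates attributes by descending "q=", then by order of appearance.
// Returns undefined when the encoding is not to be transmitted.
function selectOutputSize(
  attrs: RecvAttrSketch[], enc: SizeSketch, canDownscale: boolean
): SizeSketch | undefined {
  if (attrs.length === 0) return enc; // No limits were signaled.
  const ordered = attrs.map((a, i) => ({ a: a, i: i }));
  ordered.sort((x, y) => (y.a.q - x.a.q) || (x.i - y.i));
  for (const entry of ordered) {
    const out = tryAttribute(entry.a, enc, canDownscale);
    if (out !== undefined) return out;
  }
  return undefined;
}
]]></sourcecode>
<t>With an encoder resolution of 1280x720 and a single attribute whose
maximum is 640x480, the sketch yields 640x360, matching the example in the
list above.</t>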
</section>
</section>
<section anchor="sec.simulcast" numbered="true" toc="default">
<name>Simulcast</name>
<t>JSEP supports simulcast transmission of a MediaStreamTrack,
where multiple encodings of the source media can be transmitted
within the context of a single "m=" section. The current JSEP API
is designed to allow applications to send simulcasted media but
only to receive a single encoding. This allows for multi-user
scenarios where each sending client sends multiple encodings to
a server, which then, for each receiving client, chooses the
appropriate encoding to forward.</t>
<t>Applications request support for simulcast by configuring
multiple encodings on an RtpSender. Upon generation of an offer
or answer, these encodings are indicated via SDP markings on
the corresponding "m=" section, as described below. Receivers
that understand simulcast and are willing to receive it will
also include SDP markings to indicate their support, and JSEP
endpoints will use these markings to determine whether
simulcast is permitted for a given RtpSender. If simulcast
support is not negotiated, the RtpSender will only use the
first configured encoding.</t>
<t>Note that the exact simulcast parameters are up to the
sending application. While the aforementioned SDP markings are
provided to ensure that the remote side can receive and demux
multiple simulcast encodings, the specific resolutions and
bitrates to be used for each encoding are purely a send-side
decision in JSEP.</t>
<t>JSEP currently does not provide a mechanism to configure
receipt of simulcast. This means that if simulcast is offered
by the remote endpoint, the answer generated by a JSEP endpoint
will not indicate support for receipt of simulcast, and as such
the remote endpoint will only send a single encoding per "m="
section.</t>
<t>In addition, JSEP does not provide a mechanism to handle an
incoming offer requesting simulcast from the JSEP endpoint.
This means that setting up simulcast in the case where the JSEP
endpoint receives the initial offer requires out-of-band
signaling or SDP inspection. However, in the case where the
JSEP endpoint sets up simulcast in its initial offer, any
established simulcast streams will continue to work upon
receipt of an incoming re-offer. Future versions of this
specification may add additional APIs to handle the incoming
initial offer scenario.</t>
<t>When using JSEP to transmit multiple encodings from an
RtpSender, the techniques from
<xref target="RFC8853" format="default"/> and
<xref target="RFC8851" format="default"/> are used. Specifically,
when multiple encodings have been configured for an RtpSender,
the "m=" section for the RtpSender will include an "a=simulcast"
attribute, as defined in
<xref target="RFC8853" sectionFormat="comma" section="5.1"/>,
with a "send" simulcast stream description that lists each
desired encoding, and no "recv" simulcast stream description.
The "m=" section will also include an "a=rid" attribute for each
encoding, as specified in
<xref target="RFC8851" sectionFormat="comma" section="4"/>; the use of
Restriction Identifiers (RIDs, also called rid-ids or RtpStreamIds)
allows the individual encodings to be
disambiguated even though they are all part of the same "m="
section.</t>
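<t>As a non-normative sketch, the send-side SDP markings described above
could be produced from a list of RIDs as follows. The function name is
hypothetical, and emitting no markings when fewer than two encodings are
configured is an assumption of this sketch.</t>
<sourcecode name="" type="typescript"><![CDATA[
// Builds one "a=rid" line per encoding plus a single "a=simulcast" line
// with a "send" stream description and no "recv" description.
function simulcastSendLines(rids: string[]): string[] {
  // Assumption of this sketch: one encoding means no simulcast markings.
  if (rids.length < 2) return [];
  const lines: string[] = [];
  for (const rid of rids) {
    lines.push("a=rid:" + rid + " send");
  }
  lines.push("a=simulcast:send " + rids.join(";"));
  return lines;
}
]]></sourcecode>
<t>For three encodings with RIDs "1", "2", and "3", the sketch returns three
"a=rid" lines followed by "a=simulcast:send 1;2;3".</t>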
</section>
<section anchor="sec.interactions-with-forking" numbered="true" toc="default">
<name>Interactions with Forking</name>
<t>Some call signaling systems allow various types of forking
where an SDP Offer may be provided to more than one device. For
example, SIP
<xref target="RFC3261" format="default"/> defines both a "parallel search"
and "sequential search". Although these are primarily
signaling-level issues that are outside the scope of JSEP, they do have
some impact on the configuration of the media plane that is
relevant. When forking happens at the signaling layer, the
JavaScript application responsible for the signaling needs to
make the decisions about what media should be sent or received
at any point in time, as well as which remote endpoint it
should communicate with; JSEP is used to make sure the media
engine handles RTP and media as required by the
application. The basic operations that the application can
have the media engine do are as follows:
</t>
<ul spacing="normal">
<li>Start exchanging media with a given remote peer, but keep
all the resources reserved in the offer.</li>
<li>Start exchanging media with a given remote peer, and free
any resources in the offer that are not being used.</li>
</ul>
<section anchor="sec.sequential-forking" numbered="true" toc="default">
<name>Sequential Forking</name>
<t>Sequential forking involves a call being dispatched to
multiple remote callees, where each callee can accept the
call, but only one active session ever exists at a time; no
mixing of received media is performed.</t>
<t>JSEP handles sequential forking well, allowing the
application to easily control the policy for selecting the
desired remote endpoint. When an answer arrives from one of
the callees, the application can choose to apply it as either
(1) a provisional answer, leaving open the possibility of using a
different answer in the future or (2) a final
answer, ending the setup flow.</t>
<t>In a "first-one-wins" situation, the first answer will be
applied as a final answer, and the application will reject
any subsequent answers. In SIP parlance, this would be ACK +
BYE.</t>
<t>In a "last-one-wins" situation, all answers would be
applied as provisional answers, and any previous call leg
will be terminated. At some point, the application will end
the setup process, perhaps with a timer; at this point, the
application could reapply the pending remote description as a
final answer.</t>
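<t>The two selection policies above can be summarized in a small,
non-normative decision sketch. The type and function names are hypothetical;
the sketch only decides how an incoming answer would be treated and leaves
the actual application of the description to the caller.</t>
<sourcecode name="" type="typescript"><![CDATA[
type ForkPolicySketch = "first-one-wins" | "last-one-wins";
type AnswerActionSketch =
  "apply-as-answer" | "apply-as-pranswer" | "reject";

// Decides how an answer arriving from one of the callees is handled.
// finalAnswerApplied is true once a final answer has ended the setup.
function actionForIncomingAnswer(
  policy: ForkPolicySketch,
  finalAnswerApplied: boolean
): AnswerActionSketch {
  if (finalAnswerApplied) return "reject";
  // First-one-wins: the first answer is final and ends the setup flow.
  // Last-one-wins: every answer is provisional until the application
  // ends setup and reapplies the pending description as final.
  return policy === "first-one-wins"
    ? "apply-as-answer"
    : "apply-as-pranswer";
}
]]></sourcecode>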
</section>
<section anchor="sec.parallel-forking" numbered="true" toc="default">
<name>Parallel Forking</name>
<t>Parallel forking involves a call being dispatched to
multiple remote callees, where each callee can accept the
call and multiple simultaneous active signaling sessions can
be established as a result. If multiple callees send media at
the same time, the possibilities for handling this are
described in
<xref target="RFC3960" sectionFormat="comma" section="3.1"/>. Most SIP devices
today only support exchanging media with a single device at a
time and do not try to mix multiple early media audio
sources, as that could result in a confusing situation. For
example, consider having a European ringback tone mixed
together with the North American ringback tone -- the
resulting sound would not be like either tone and would
confuse the user. If the signaling application wishes to only
exchange media with one of the remote endpoints at a time,
then from a media engine point of view, this is exactly like
the sequential forking case.</t>
<t>In the parallel forking case where the JavaScript
application wishes to simultaneously exchange media with
multiple peers, the flow is slightly more complex, but the
JavaScript application can follow the strategy that
<xref target="RFC3960" format="default"/> describes, using UPDATE. The
UPDATE approach allows the signaling to set up a separate
media flow for each peer that it wishes to exchange media
with. In JSEP, the offer used in the UPDATE would be formed
by simply creating a new PeerConnection (see
<xref target="sec.peerconnection" format="default"/>) and making sure that
the same local media streams have been added into this new
PeerConnection. Then the new PeerConnection object would
produce an SDP offer that could be used by the signaling to
perform the UPDATE strategy discussed in
<xref target="RFC3960" format="default"/>.</t>
<t>As a result of sharing the media streams, the application
will end up with N parallel PeerConnection sessions, each
with a local and remote description and their own local and
remote addresses. The media flow from these sessions can be
managed using setDirection (see
<xref target="sec.transceiver-set-direction" format="default"/>), or the
application can choose to play out the media from all
sessions mixed together. Of course, if the application wants
to only keep a single session, it can simply terminate the
sessions that it no longer needs.</t>
</section>
</section>
</section>
<section anchor="sec.interface" numbered="true" toc="default">
<name>Interface</name>
<t>This section details the basic operations that must be present
to implement JSEP functionality. The API actually defined by the
W3C may have somewhat different syntax but should map easily
to these concepts.
</t>
<section anchor="sec.peerconnection" numbered="true" toc="default">
<name>PeerConnection</name>
<section anchor="sec.pc-constructor" numbered="true" toc="default">
<name>Constructor</name>
<t>The PeerConnection constructor allows the application to
specify global parameters for the media session, such as the
STUN/TURN servers and credentials to use when gathering
candidates, as well as the initial ICE candidate policy and
pool size, and also the bundle policy to use.</t>
<t>If an ICE candidate policy is specified, it functions as
described in
<xref target="sec.ice-candidate-policy" format="default"/>, causing the JSEP
implementation to only surface the permitted candidates
(including any implementation-internal filtering) to the
application and only use those candidates for connectivity
checks. The set of available policies is as follows:
</t>
<dl newline="false" spacing="normal">
<dt>all:</dt>
<dd>All candidates permitted by
implementation policy will be gathered and used.</dd>