forked from jalemkul/gmx_tutorials_livecoms
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgmx_tutorials.tex
2451 lines (1778 loc) · 237 KB
/
gmx_tutorials.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% LIVECOMS ARTICLE TEMPLATE FOR BEST PRACTICES GUIDE
%%% ADAPTED FROM ELIFE ARTICLE TEMPLATE (8/10/2017)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% PREAMBLE
\documentclass[9pt,tutorial,pubversion]{livecoms}
% Use the 'onehalfspacing' option for 1.5 line spacing
% Use the 'doublespacing' option for 2.0 line spacing
% Use the 'lineno' option for adding line numbers.
% Use the "ASAPversion' option following article acceptance to add the DOI and relevant dates to the document footer.
% Use the 'pubversion' option for adding the citation and publication information to the document footer, when the LiveCoMS issue is finalized.
% The 'bestpractices' option for indicates that this is a best practices guide.
% Omit the bestpractices option to remove the marking as a LiveCoMS paper.
% Please note that these options may affect formatting.
\usepackage{lipsum} % Required to insert dummy text
\usepackage[version=4]{mhchem}
\usepackage{siunitx}
\usepackage{textcomp}
%\usepackage{soul} %% for markup in revision, to use \hl
\usepackage{listings}
\lstset{
basicstyle=\ttfamily,
commentstyle={},
breakatwhitespace=true,
breaklines=true,
language=bash
}
\DeclareSIUnit\Molar{M}
\usepackage[italic]{mathastext}
\graphicspath{{figures/}}
\DeclareGraphicsExtensions{.png,.eps}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% IMPORTANT USER CONFIGURATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\versionnumber}{1.0} % you should update the minor version number in preprints and major version number of submissions.
\newcommand{\githubrepository}{\url{https://github.com/jalemkul/gmx_tutorials_livecoms}} %this should be the main github repository for this article
\newcommand{\urlstring}{http://www.mdtutorials.com/gmx}
\newcommand{\tutorialhomeurl}{\url{\urlstring}}
\newcommand{\tutoriallyso}{\url{\urlstring/lysozyme}}
\newcommand{\tutorialkalp}{\url{\urlstring/membrane_protein}}
\newcommand{\tutorialpmf}{\url{\urlstring/umbrella}}
\newcommand{\tutorialbiphasic}{\url{\urlstring/biphasic}}
\newcommand{\tutorialcomplex}{\url{\urlstring/complex}}
\newcommand{\tutorialfes}{\url{\urlstring/free_energy}}
\newcommand{\tutorialvsite}{\url{\urlstring/vsites}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% ARTICLE SETUP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{From Proteins to Perturbed Hamiltonians: A Suite of Tutorials for the GROMACS-2018 Molecular Simulation Package [Article v\versionnumber]}
\author[1*]{Justin A. Lemkul}
% \author[1,2\authfn{1}\authfn{3}]{Firstname Middlename Familyname}
% \author[2\authfn{1}\authfn{4}]{Firstname Initials Surname}
% \author[2*]{Firstname Surname}
\affil[1]{Department of Biochemistry, Virginia Tech, Blacksburg, VA, United States of America}
% \affil[2]{Institution 2}
% Correspondence emails. FMS and FS are the appropriate authors initials.
\corr{[email protected]}{JAL}
% \corr{[email protected]}{FS}
\orcid{Justin A. Lemkul}{0000-0001-6661-8653}
% \contrib[\authfn{1}]{These authors contributed equally to this work}
% \contrib[\authfn{2}]{These authors also contributed equally to this work}
% \presentadd[\authfn{3}]{Department, Institute, Country}
% \presentadd[\authfn{4}]{Department, Institute, Country}
\blurb{This LiveCoMS document is maintained online on GitHub at \githubrepository; to provide feedback, suggestions, or help improve it, please visit the GitHub repository and participate via the issue tracker.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% PUBLICATION INFORMATION
%%% Fill out these parameters when available
%%% These are used when the "pubversion" option is invoked
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pubDOI{10.33011/livecoms.1.1.5068}
\pubvolume{1}
\pubissue{1}
\pubyear{2019}
\articlenum{5068}
\datereceived{16 July 2018}
\dateaccepted{31 August 2018}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% ARTICLE START
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{frontmatter}
\maketitle
\begin{abstract}
% In your work, in this particular slot, please provide an abstract of no more than 250 words.
% Your abstract should explain the main contributions of your article, and should not contain any material that is not included in the main text.
% Please note that your abstract, plus the authorship material following it, must not extend beyond the title page or modifications to the LaTeX class will likely be needed.
Molecular dynamics (MD) simulations are a popular technique for studying the atomistic behavior of any molecular system. Performing MD simulations requires a user to become familiar with the commands, options, and file formats of the chosen simulation software, none of which are consistent across different programs. Beyond these requirements, users are expected to be familiar with various aspects of physics, mathematics, computer programming, and interaction with a command-line environment, presenting critical barriers to entry in the MD simulation field. This article presents seven tutorials for instructing users in the proper methods for preparing and carrying out different types of MD simulations in the popular GROMACS simulation package. GROMACS is an open-source, free, and flexible MD package that is consistently among the fastest in the world. The tutorials presented here range from a "simple" system of a protein in aqueous solution to more advanced concepts such as force field organization and modification for a membrane-protein system, two methods of calculating free energy differences (umbrella sampling and "alchemical" methods), biphasic systems, protein-ligand complexes, and the use of virtual sites in MD simulations. In this article, users are provided the rationale and a theoretical explanation for the command-line syntax in each step in the online tutorials (available at \tutorialhomeurl) and the underlying settings and algorithms necessary to perform robust MD simulations in each scenario.
\end{abstract}
\end{frontmatter}
\section{Introduction}
% Here you would explain what problem you are tackling and briefly motivate your work.
% In this particular template, we have removed most of the usage examples which occur in \texttt{sample-document.tex} to provide a minimal template you can modify; however, we retain a couple of examples illustrating more unusual features of our templates/article class, such as the checklists, and information on algorithms and pseudocode.
Molecular dynamics (MD) simulations are widely applied to studies of structure-function relationships, disease pathways, and drug design. MD simulations rely on the relationship between a given configuration of atoms and its energy to propagate dynamics. This process is driven by empirical force fields, energy functions and associated parameter sets that are used to compute energies and forces acting in a given configuration. After calculating the forces, the configuration is updated by applying those forces over a discrete time step, $\delta$t. An assembly of many configurations, linked through time, establishes a trajectory for subsequent visualization and analysis.
Performing an MD simulation requires that many choices be made, including the starting structure of the molecule(s) of interest, the force field model to be applied, concentration and nature of salt in solution, the integration method, algorithms for temperature and/or pressure regulation, and length of the simulation itself. This complex series of decisions will dictate the quality, reliability, and reproducibility of the simulation, and consequently should be undertaken only after careful consideration. Beyond the conceptual issues lies the interface with the MD simulation package itself. The user must translate the knowledge of the experimental setup to actual practice by implementing these details in the framework of a complex software suite. The combination of these factors can be burdensome for new users, who may be prone to making errors or failing to consider best practices and subtle nuances of a large body of literature.
The GROMACS simulation package~\cite{Hess2008,Abraham2015} is an open-source software suite that is free for all users and available worldwide. GROMACS has for many years been among the fastest MD codes available~\cite{Hess2008}, making it a popular choice for carrying out MD simulations as hardware and algorithmic development allow for ever-increasing trajectory lengths.
\subsection{Scope}
% Tutorials should endeavor to cover the specific task at hand, and also highlight how the steps might need to be modified (or additional care might need to be taken at particular points) to handle more general cases.
% The scope of the tutorial, as well as the expected proficiencies / outcomes for researchers who complete the tutorial, should be clearly defined.
% This will often happen in a specific section or subsection in the article itself.
% Define objectives for each tutorial, refer back in each section
The seven tutorials presented here range in difficulty from basic to advanced, roughly in the order of presentation. Each of the tutorials comes with its own learning objectives and expected outcomes.
After completing Tutorial 1, "Lysozyme in Water," the user should be able to build a solvated protein system and carry out a short MD simulation. Specific learning objectives include:
\begin{enumerate}
\item Understand the content of a GROMACS system topology
\item Carry out basic steps of setting up a periodic system, including solvation and adding ions
\item Identify keywords and typical settings in \texttt{.mdp} input files
\item Control position restraints and topology directives via \texttt{.mdp} settings
\item Determine the appropriate algorithms for energy-minimizing and equilibrating a biomolecular system and performing a production MD simulation
\end{enumerate}
After completing Tutorial 2, "KALP\textsubscript{15} in DPPC," the user should be able to build a simple membrane-protein system containing one lipid type. Specific learning objectives include:
\begin{enumerate}
\item Understand the organization and contents of GROMACS force field files and how parameters from different, but compatible, sources can be added to them
\item Apply an iterative packing routine to place phospholipids around a transmembrane protein
\item Perform lipid-specific analysis using custom index groups
\end{enumerate}
After completing Tutorial 3, "Umbrella Sampling," the user should be able to perform simulations under the influence of a biasing potential. Specific learning objectives include:
\begin{enumerate}
\item Apply \texttt{.mdp} keywords that invoke the pull code to add a simple, one-dimensional biasing potential to selected atoms in a system
\item Define what is meant by a "reaction coordinate" and how to construct a suitable one
\item Perform restrained simulations in multiple sampling windows along a reaction coordinate
\item Compute a potential of mean force profile
\end{enumerate}
After completing Tutorial 4, "Biphasic Systems," the user should be able to construct heterogeneous systems of two different liquids. Specific learning objectives include:
\begin{enumerate}
\item Build a simulation system containing a liquid that is not water
\item Manipulate the relative positioning of a box within a larger volume
\end{enumerate}
After completing Tutorial 5, "Protein-Ligand Complex," the user should be able to construct a solvated protein system including a ligand that needs to be parametrized. Specific learning objectives include:
\begin{enumerate}
\item Produce a ligand topology outside of GROMACS and incorporate it into a system topology
\item Determine how to verify if a ligand topology is suitable for further simulation
\item Perform analysis of interactions between a protein and a ligand
\end{enumerate}
After completing Tutorial 6, "Free Energy of Solvation," the user should be able to carry out a series of simulations for the alchemical transformation of a small molecule in water. Specific learning objectives include:
\begin{enumerate}
\item Understand the purpose of \texttt{.mdp} keywords related to free energy calculations
\item Use the Bennett Acceptance Ratio method to compute the free energy difference of the transformation of van der Waals terms
\end{enumerate}
After completing Tutorial 7, "Virtual Sites," the user should be able to understand what virtual sites are and how they are used in specific cases. Specific learning objectives include:
\begin{enumerate}
\item Define a virtual site and its position relative to other atoms
\item Construct a system of a linear molecule with virtual sites
\end{enumerate}
\section{Prerequisites}
% Here you would identify prerequisites/background knowledge that are assumed by your work, as well as any software/license requirements.
The tutorials described in this article assume the user has installed GROMACS, version 2018 or any minor or patch release in the 2018.x series. Instructions for how to install GROMACS can be found in the online manual at \url{http://manual.gromacs.org/documentation/2018-current/index.html}. This tutorial article will not describe how to install GROMACS, as details vary depending on the available hardware, compilers, etc.
\subsection{Background knowledge}
% Tutorials should clearly define what concepts or abilities researchers will need to complete the tutorial (e.g., some proficiency in Python; experience with Jupyter notebooks; knowledge of classical MD; etc).
These tutorials assume the user is familiar with basic Linux or Unix command-line interaction and navigation as well as the use of a plain-text editor such as VIM or Emacs. Background knowledge in MD simulations would be beneficial, though this article seeks to introduce new users to the fundamental concepts that are required to properly execute a simulation. These tutorials cannot be comprehensive in explaining all elements and algorithms in MD simulations, and as such, users are expected to seek outside resources ({\em e.g.}~\cite{Frenkel2001,Leach2001,Kukol2008,Braun2018}) to assist in their understanding. Tutorial 1 (see Section~\ref{lyso}) assumes the user has no experience with GROMACS, and therefore contains the most detail about how to carry out various tasks. In the subsequent six tutorials, it will be assumed that the user is familiar with basic operations and not all steps will be described in depth, except for instances of new or different usage.
Throughout this article, commands that are to be issued by the user are written in \texttt{monospace font} and the command prompt is indicated with \texttt{\$}, which itself is not part of the command the user should enter. Some commands are long and span multiple lines; in these cases the commands should not be entered on separate lines in the terminal, rather as one continuous command. Program and file names are similarly written in \texttt{monospace font} in the text to distinguish them from the narrative.
\subsection{Software/system requirements}
% Tutorials should clearly define what system and/or software requirements the researcher will need to complete the tutorial (e.g., VMD version 1.9 or newer, AMBER, etc.). Tutorials requiring specific software packages must provide instructions and files for the referenced version of the software.
These tutorials assume the user is working with a GROMACS version in the 2018.x series. Other versions have different command-line syntax and input file keywords and attempting to use them will result in problems.
The output of GROMACS analysis programs is, by default, written in a format compatible with XmGrace (\url{http://plasma-gate.weizmann.ac.il/Grace/}). As such, the user should be familiar with plotting data using this program.
For visualization of coordinate files and trajectories, many options are available, including Visual Molecular Dynamics (VMD, \url{http://www.ks.uiuc.edu/Research/vmd/})~\cite{Humphrey1996}, Chimera (\url{https://www.cgl.ucsf.edu/chimera/})~\cite{Petersen2004}, and PyMOL (\url{https://pymol.org/2/}), among others. Users should be familiar with at least one of these programs for visualizing the system. The latest version of each of these programs (VMD 1.9.3, Chimera 1.12, and PyMOL 2.0 at the time this article was written) is preferred for maximum compatibility with different file formats and features.
The protein-ligand complex tutorial will make use of the Avogadro molecular editor (\url{https://avogadro.cc/})~\cite{Hanwell2012} and a Python script that requires a Python 2.7.x version, the NetworkX package in the 1.11.x version series, and any compatible NumPy version. The topology conversion script is not compatible with NetworkX versions in the 2.x series.
\section{Content and Links}
% A tutorial will normally draw on additional files and materials; clearly indicate where and how these are available, with links, and how they are being archived for the long-term and maintained so they stay current.
% You will likely want to reference your GitHub repository as a central point to access all of this information, and then the GitHub repository may link out to other content as needed.
The tutorials described in this article can be accessed at \tutorialhomeurl. All necessary files for completing each tutorial are provided at that location.
\subsection{Overview}
GROMACS comprises a set of programs for preparing, running, and analyzing MD simulations. Each program is called from within a single binary, called \texttt{gmx}, and command-line syntax is similar among all the programs. All options are preceded by a hyphen (-) and the subsequent keyword or filename must be the next argument on the command line. For instance, if a coordinate file is specified as an input file for a given command, \texttt{-f my\_file.pdb} is correct syntax, but separating \texttt{-f} from \texttt{my\_file.pdb} in any way with intervening text or other arguments will cause an error. Some arguments are boolean, meaning that they indicate binary, yes/no choices. For instance, the option \texttt{-princ} (from the \texttt{editconf} command, discussed below), indicates that the user wants the solute molecule to be aligned with its principal axis along the Cartesian {\em x}-axis. The opposite, to not require alignment, would be \texttt{-noprinc}. This syntax is the same for any GROMACS boolean argument.
At any time, the user can refer to help information for a given program by invoking \texttt{gmx help} and the name of the program or use the \texttt{-h} flag. For instance, to read the help information for the \texttt{pdb2gmx} program, the user would type \texttt{gmx help pdb2gmx} or equivalently, \texttt{gmx pdb2gmx -h}.
Every GROMACS program has a set of default options and file names that it assumes it can use if the user does not otherwise specify. That is, if any argument requires a file name (typically input and output files) and one is not provided by the user, the program will look in the working directory for a file with the default name. If that file is not found, then an error is returned. This outcome is often a point of confusion for new users. Refer to the help information for a given program to see what its default file names are, so in the event of an error, it is clear which file is missing. It is generally inadvisable to rely on GROMACS default file names, as they are very generic. The tutorials below all use descriptive file names to encourage best practices in organization and naming conventions.
Each tutorial is available at \tutorialhomeurl, and the link to each specific tutorial is given within each of the following sections. Command-line syntax and input files are available at each tutorial's URL. This article will describe the practical considerations and theoretical basis for the approaches taken here.
%%%%%%%%% TUTORIAL 1 %%%%%%%%%
\subsection{Tutorial 1: Lysozyme in Water} \label{lyso}
\begin{figure}[H]
\centering
\includegraphics{lysozyme_1aki}
\caption{Crystal structure of hen egg-white lysozyme, taken from PDB 1AKI~\cite{Artymiuk1982}.}
\label{lyso_xtal_fig}
\end{figure}
Many biomolecular simulations involve a solute that is solvated in an aqueous medium or saline solution. Thus, this tutorial guides the user through preparing a system of a protein in water, with counterions. The full tutorial is available at \tutoriallyso. The general protocol for any simulation is to build the coordinates and topology (a description of the properties and connectivity of all atoms in the system), perform energy minimization to relax the coordinates to a low-energy state, equilibrate, and perform a production MD simulation, during which data are collected for subsequent analysis. In this tutorial, the coordinates for the system of interest are for hen egg-white lysozyme (Figure~\ref{lyso_xtal_fig}), which can be downloaded from the Protein Databank, accession code 1AKI~\cite{Artymiuk1982}. For simplicity, the user is instructed to remove crystal waters from the input PDB file:
\begin{lstlisting}
$ grep -v HOH 1aki.pdb > 1AKI_clean.pdb
\end{lstlisting}
%
Removal of water here is to make the resulting topology less complex. Multiple blocks of water in the topology require manual correction, a task that is not well-suited to new users of the software who are likely unfamiliar with the contents and format of various files. It should be noted, however, that ad hoc removal of crystal waters is not generally a good practice, particularly in instances where a tightly bound water may have some functional significance (structural or catalytic) and must be modeled appropriately.
\subsubsection{Topology Preparation} \label{lyso_top}
The typical approach that GROMACS takes for constructing the topology of a molecular system is to begin with generating a topology for the solute of interest, which is then amended or updated to include other components that are added to the system (water, ions, ligands, etc.). GROMACS uses two types of topology files, one with a \texttt{.top} extension and the other with an \texttt{.itp} extension. There is only ever one \texttt{.top} file for a system, hence it is called the "system topology." A topology with an \texttt{.itp} extension is called an "included topology," indicating that it can be embedded (included) within a system topology to provide the source of other parameters or other molecule definitions.
The GROMACS program most frequently used to write topologies is called \texttt{pdb2gmx}, which reads a coordinate file, determines its contents, and writes a topology for the supplied molecule(s). The name of this program is somewhat misleading, as there is no strict requirement to provide coordinates in PDB format. Refer to the help information for \texttt{pdb2gmx} for allowed formats.
The \texttt{pdb2gmx} program is best suited to linear biopolymers, with limited exceptions for branched (non-linear) bonds. Thus, it is very easy to use \texttt{pdb2gmx} for proteins and nucleic acids, but it is somewhat ill-suited for materials, branched polymers, or complex carbohydrates. However, \texttt{pdb2gmx} can handle common cases such as disulfides and heme linkages. An important point about \texttt{pdb2gmx} is that it can only process species for which parameters have been provided; it does not perform automated parametrization. Therefore, a user must carefully consider what is provided to \texttt{pdb2gmx} before attempting to process a coordinate file. Each force field provided with GROMACS has one or more \texttt{.rtp} (residue topology) files, which contain the parameters and bonded structure for each species supported by the force field. If a given species is not found in a force field \texttt{.rtp} file, \texttt{pdb2gmx} will return a fatal error. Therefore, users must always check the availability of non-standard residues in a given force field to determine its suitability for use.
Many force fields are included in GROMACS, including several variants of the AMBER force field~\cite{Cornell1994,Kollman1996,Wang2000,Garcia2002,Duan2003,Hornak2006,LindorffLarsen2010}, CHARMM22/CMAP for proteins~\cite{MacKerell1998,MacKerell2004} and CHARMM27 for nucleic acids~\cite{Foloppe2000,MacKerell2000}, several GROMOS96 parameter sets~\cite{Daura1998,Schuler2001,Oostenbrink2004,Schmid2011}, and OPLS-AA~\cite{Kaminski2001}. Also note that the CHARMM36 force field~\cite{Best2012,Hart2012,Denning2011,Klauda2010,Huang2016} is available in GROMACS format from \url{http://mackerell.umaryland.edu/charmm_ff.shtml#gromacs}.
Beyond the consideration of whether or not a force field has parameters for a given chemical species, it is important to realize that the choice of force field is perhaps the single most critical aspect to beginning a new simulation project. No existing force field perfectly treats all species, and each available force field has pros and cons for different species. It is incumbent upon the user to make an informed, justifiable choice based on literature reading and evaluation of the force field parametrization protocol to understand to which purposes a given force field is best suited. There is no catch-all answer or universally "best" force field.
Once a starting structure has been chosen and the choice of force field decided upon, run \texttt{pdb2gmx} to create the topology:
\begin{lstlisting}
$ gmx pdb2gmx -f 1AKI_clean.pdb -o 1AKI_processed.gro -water spce
\end{lstlisting}
%
The first prompt is for the user to select the force field that will be applied to the system. For this tutorial, choose option 15 for OPLS-AA, a widely used all-atom force field. An all-atom force field is used here for simplicity rather than introducing concepts of implicit hydrogen atoms as in united-atom force fields or coarse-grain representations. In principle, many of the force fields provided with GROMACS would be adequate for the purposes of this tutorial. The next choice the user must make is the water model that is to be used. In the tutorial, the \texttt{-water} argument is invoked from \texttt{pdb2gmx} to select the SPC/E water mode~\cite{Berendsen1987}; if this option is not given on the command line, the user is prompted for an interactive choice. The water model is another important consideration, though one over which the user has less freedom. Each force field was parametrized for use with a specific water model. Therefore, the user is not entirely free to choose whichever model happens to be available without due consideration. GROMACS provides a suggested water model for each force field; unless there is good evidence to choose another model, the user should follow this recommendation for greatest accuracy. However, some studies have shown that different biomolecular force fields may be accurately combined with other water models, as is the case in this tutorial. The OPLS-AA force field was originally parametrized for use with the TIP3P water model~\cite{Jorgensen1983}, but it was subsequently shown that the combination of OPLS-AA with SPC/E yielded more accurate hydration free energies for protein side chains~\cite{Hess2006}, suggesting this combination is a sufficiently accurate and self-consistent model for simulating proteins. In the absence of any sufficiently strong justification or precedent, the user should always choose the recommended water model listed by \texttt{pdb2gmx}.
Often, new users are confused at having to choose a water model at this stage, particularly in the case that their provided coordinate file only contains a biomolecule such as a protein. The water model selection is planning for the subsequent steps, in which this solute is hydrated and a force field choice should be made ahead of time. By selecting the water model at this first step, \texttt{pdb2gmx} can write a complete topology that will describe all the species that will be included in the system being built.
By default, \texttt{pdb2gmx} will produce three output files: a system topology (named \texttt{topol.top} by default), a topology that specifies parameters for position restraints (named \texttt{posre.itp} by default, discussed in section~\ref{lyso_equil}), and a force-field compliant coordinate file (named \texttt{conf.gro} by default but in this tutorial is called \texttt{1AKI\_processed.gro}). The output coordinate file is actually somewhat of a side effect of the normal function of \texttt{pdb2gmx}, which is simply to produce a topology. However, since many experimental structures determined by X-ray crystallography lack the resolution to assign hydrogen atom positions, these atoms must be built in to the model. \texttt{pdb2gmx} does this, though it cannot build in other missing atoms, requiring either a complete experimental structure or the use of modeling software such as MODELLER~\cite{Sali1993} to construct missing atoms. As a result, the processed coordinates are subsequently written out to a file to match the contents of \texttt{topol.top}. The output coordinate file can be one of several formats, the \texttt{.gro} (GROMOS87 format) is not required but it is the default in GROMACS.
After generating \texttt{topol.top}, inspect its contents. For a full description of the topology file format, refer to the online tutorial. Several keywords are worth noting here. GROMACS processes files using C preprocessor syntax and macros. Topology files will often contain statements such as \texttt{\#define}, \texttt{\#ifdef/\#endif}, and \texttt{\#include}. The \texttt{\#define} macro sets the value of a variable, which can then be tested using \texttt{\#ifdef} statements to call different elements of a topology conditionally. This approach is most often taken in defining rigid vs. flexible water. Many topologies have \texttt{\#include} statements; these literally mean "copy and paste the contents of the named file here." CHARMM users may be familiar with this concept, which is called \texttt{stream} in that program and serves a similar function. The \texttt{\#include} statements are useful for making the topology compact, rather than writing out all parameters explicitly. A parent force field, defining all necessary parameters, is the first \texttt{\#include} statement, and others may reference water and ion topologies or the position restraint topology file that \texttt{pdb2gmx} created.
When \texttt{pdb2gmx} is done, it will print out the net charge of the protein to the terminal. Record this value for later use.
\subsubsection{Solvate the System} \label{lyso_solv}
Simulating proteins {\em in vacuo} is not typically of biological interest, rather a condensed-phase simulation is more relevant. Thus, the next step in building the solvated protein system is to define a volume around the protein that will be filled with water. There are several important considerations in doing so, and it is also important to explain the reason that these methods are employed.
The first fundamental concept is that, in a condensed-phase system, there are no "edges" or "boundaries" to the system that is being built. If the protein was solvated in some volume of water that was simply surrounded by vacuum, ultimately the system will develop into a droplet and water molecules that are on the surface of the droplet will tend to evaporate over time. In such cases, effects such as surface tension and poor energy conservation become important. To solve this issue, most modern simulations (and certainly all that are carried out in the condensed phase) employ what are called "periodic boundary conditions" (PBC), in which identical copies of the system of interest are constructed around the central "image," which is the system that a user actually constructs. Therefore, atoms at the perceived "boundaries" of the central image will interact with periodic copies of atoms in the neighboring images. Further, any atom that diffuses "outside" of the central image will reappear on the opposite face of the central image. In truth, there is no such thing as being "outside" of a box when employing PBC, as the representation of the system is infinite.
In constructing a simulation system (the central image) for use with PBC, an important consideration is the size of that central image. Under most circumstances, it is desirable to simulate the solute of interest in dilute solution (discussion of crowded or crystalline environments is beyond the scope of this tutorial). That is, the solute should not "see" any of its periodic copies in real space. The underlying principle is called the "minimum image convention," which states that to properly calculate the force on each atom, no atom should see multiple copies of the same atom within the short-range neighbor list~\cite{Frenkel2001}. So how does one decide the right box size that will prevent atoms from experiencing duplicate forces? This decision requires an understanding of the requirements of the force field chosen in section~\ref{lyso_top}. Each force field has a set of required nonbonded cutoffs (see discussion on values of \texttt{rvdw} and \texttt{rcoulomb} below in section~\ref{lyso_em}). These cutoffs define the radius around each atom for which short-range forces are calculated. If the box is too small, such that this radius encompasses more than half of any box dimension (along the x, y, or z-axes), then forces will be double-counted, leading to artifacts. A common approach is to define a buffer around the solute of interest that is equal to the longest cutoff that will be employed in the simulation. The user should note that this type of planning is initiated even before arriving at this step in the tutorial, as it should be determined before or upon choosing a force field to represent the system.
The GROMACS program that is used for defining the box around the solute, and the subsequent positioning of the solute therein, is called \texttt{editconf}, for "edit configuration." This program allows the user to manipulate the coordinates of the solute via rotations and translations, so that molecules can be specifically positioned and oriented within a box. The most conventional usage for a system like this one is to center the solute with a box that has a defined buffer according to the minimum image convention (see below). For the lysozyme system in this tutorial, issue the following \texttt{editconf} command:
\begin{lstlisting}
$ gmx editconf -f 1AKI_processed.gro -o 1AKI_newbox.gro -c -d 1.0 -bt cubic
\end{lstlisting}
%
Doing so will center the protein (option \texttt{-c}, which is boolean and therefore takes no additional argument) in a cubic box (\texttt{-bt cubic}), with a minimum solute-box distance of 1.0 nm (\texttt{-d 1.0}). Note that GROMACS uses SI notation for all base units. New users often face issues in specifying distances and cutoffs if they are more familiar with software that uses \AA. There are many different box shapes that can be used, specifically those that have a small volume but still satisfy the same periodic distance, but for simplicity this tutorial will only address a cubic system. Interested readers are referred to the GROMACS manual for discussion on alternate box shapes. After running \texttt{editconf}, the protein is now centered within a cubic box (Figure~\ref{lyso_box_fig}).
\begin{figure}[H]
\centering
\includegraphics{1AKI_1box}
\caption{The structure of hen egg-white lysozyme in an empty, cubic box following the use of \texttt{editconf}.}
\label{lyso_box_fig}
\end{figure}
Now that the box size and shape have been defined, it is possible to visualize the periodic images of the system that were alluded to above. Such a representation is shown in Figure~\ref{lyso_boxes_fig}.
\begin{figure}[h]
\centering
\includegraphics{1AKI_boxes}
\caption{Periodic construction of the central image (with lysozyme colored as a rainbow gradient from blue (N-terminus) to red (C-terminus). Periodic images of the protein are shown in light gray.}
\label{lyso_boxes_fig}
\end{figure}
Having defined a suitable volume around the protein, the system is subsequently filled with water. The program that does this is called \texttt{solvate}. This program relies on the use of a small, pre-equilibrated cubic box of water, which is then placed onto a grid that is evenly distributed through the volume of the \texttt{1AKI\_newbox.gro} coordinate file. Any water molecules with atoms that overlap the protein (or any species already within the box) are deleted.
\begin{lstlisting}
$ gmx solvate -cp 1AKI_newbox.gro -cs spc216.gro -o 1AKI_solv.gro -p topol.top
\end{lstlisting}
%
The \texttt{solvate} program requires both an input coordinate file for the solute (specified with the \texttt{-cp} option, conventionally the "coordinates of the protein") and the solvent (option \texttt{-cs} for "coordinates of the solvent"). In this example, the empty volume is filled using the \texttt{spc216.gro} coordinate file, which contains 216 water molecules that have been pre-equilibrated using the SPC model~\cite{Berendsen1981}. A common question that new users have is about the availability of coordinates for other 3-point water models such as TIP3P~\cite{Jorgensen1983} and SPC/E~\cite{Berendsen1987} for the purpose of solvation. Simply, distinct files are not necessary, as any coordinate file of 3-point water can serve as a reasonable starting point for such systems. After a short equilibration period (see below, section~\ref{lyso_equil}), the solvent will relax according to the properties of these other models.
Where does the \texttt{spc216.gro} file come from? GROMACS has a database of pre-built coordinate files for 3-, 4-, and 5-point models of water, located in \texttt{\$GMXLIB} (an environment variable that refers to the \texttt{share/gromacs/top} subdirectory of wherever GROMACS is installed on the computer). GROMACS will search in this directory for files specified on the command line before looking in the working directory.
It is important to note the use of the \texttt{-p} option to specify the name of the system topology. Doing so will cause \texttt{solvate} to automatically update the system topology with the number of water molecules it added. If the user does not supply the topology on the command line, updates must be made manually. It is strongly encouraged that users instruct GROMACS programs to do bookkeeping for them, rather than trying to make manual adjustments that can be error-prone.
\subsubsection{Add Ions} \label{lyso_ions}
With the system solvated, the last step in constructing coordinates is to add ions. Biological and {\em in vitro} systems often contain some amount of salt, and it is in the interest of those performing simulations to model relevant conditions as closely as possible. Beyond this point, MD simulations are typically carried out under electroneutral conditions; that is, the system does not carry net charge. Individual molecules (proteins, ions, etc.) may carry a formal charge. This convention reflects typical experimental conditions but there is another point that needs to be mentioned. Electrostatic interactions are most often evaluated using the particle mesh Ewald (PME) method~\cite{Darden1993,Essmann1995}, which requires an electroneutral system. The PME method itself can supply what is known as a "uniform neutralizing plasma," in which an offsetting charge is spread uniformly across the system to neutralize it. For homogeneous systems, this approach is adequate and the addition of explicit ions is not required. However, systems of biological interest are rarely homogeneous, and relying on the neutralizing plasma can result in serious artifacts~\cite{Hub2014}. For this reason, monoatomic ions are typically added within the aqueous solution to counterbalance any net charge from the solute(s) present.
The GROMACS program for adding ions is called \texttt{genion}. It adds ions by replacing molecules in whatever group the user specifies, typically water. To do so, \texttt{genion} needs both coordinate information (to know what coordinates to assign to the added ions) as well as topological information (to know which atoms are connected, therefore defining molecules that are deleted in their entirety). Such information is present in a single file type with the extension \texttt{.tpr}. A \texttt{.tpr} file also contains simulation instructions, which are written in plain-text files with the extension \texttt{.mdp}. In this context, no instructions for performing a simulation are required, but the file format contains everything that is needed. The contents of an \texttt{.mdp} file will be discussed in detail in subsequent sections of this tutorial. To produce a \texttt{.tpr} file, invoke the GROMACS preprocessor, \texttt{grompp}:
\begin{lstlisting}
$ gmx grompp -f ions.mdp -c 1AKI_solv.gro -p topol.top -o ions.tpr
\end{lstlisting}
%
The \texttt{grompp} program reads in coordinates (\texttt{-c}), topology (\texttt{-p}), and a simulation input file (\texttt{-f}) and produces the \texttt{.tpr} file. For the purpose of adding ions, the \texttt{.mdp} file passed to the \texttt{-f} option can contain any syntactically correct set of options. In this tutorial, a very simple file (\texttt{ions.mdp}) is supplied, to minimize the number of options that must be correctly specified. The \texttt{ions.mdp} file calls for energy minimization but this process is not actually carried out. It is also important to note that a plain cutoff method is specified for electrostatics; doing so is inadvisable for performing a simulation, but \texttt{grompp} will generate a warning if the user attempts to execute a process with PME for a non-neutral system. As counterions have not yet been added to the system, PME should not be specified. Moreover, since this \texttt{.mdp} file is not actually being used for any simulation, it need not specify rigorous simulation methods.
Recall the net charge printed by \texttt{pdb2gmx}, which should be +8. This number means that the protein has a net positive charge at neutral pH, requiring the addition of 8 anions to neutralize it. To do so, execute \texttt{genion} as follows:
\begin{lstlisting}
$ gmx genion -s ions.tpr -o 1AKI_solv_ions.gro -p topol.top -pname NA -nname CL -neutral
\end{lstlisting}
%
Choose group 13 (SOL) to replace water molecules with ions. Do not choose System or Protein, or any similar group, otherwise the output system will be completed fragmented and unphysical.
\texttt{genion} reads the \texttt{.tpr} file passed to the \texttt{-s} option and outputs (\texttt{-o}) a coordinate file called \texttt{1AKI\_solv\_ions.gro}. It makes changes to the topology (deleting water and adding the requested ions), therefore the use of \texttt{-p} to specify the topology is strongly encouraged. The \texttt{-pname} and \texttt{-nname} specify the residue names of positive and negative ions, respectively. This example does not require the addition of sodium ions (Na\textsuperscript{+}) but is shown for illustrative purposes. The user can specify the number of negative ions to be added with \texttt{-nn}, and if positive ions were needed, they would be specified with \texttt{-np}. Note that the charge on different systems will be different, and in some instances, it is desirable to add a higher salt concentration. To automatically neutralize a system, use the \texttt{-neutral} option, and to set a target concentration for which \texttt{genion} will determine the number of ions to be added (rather than specifying them manually), use \texttt{-conc} with a suitable value (in M).
The final, solvated system will look something like what is shown in Figure~\ref{lyso_solv_ions_fig}. Note that due to the random nature of adding ions with \texttt{genion}, the location of Cl\textsuperscript{$-$} ions will differ in every system.
\begin{figure}[h]
\centering
\includegraphics{1AKI_solv_ions}
\caption{Rendering of the final system after solvation and adding ions. Green spheres are Cl\textsuperscript{$-$} ions, whose positions will vary randomly as a result of \texttt{genion}.}
\label{lyso_solv_ions_fig}
\end{figure}
\subsubsection{Energy Minimization} \label{lyso_em}
Before beginning dynamics, it is necessary to relax the solvated system to a low-energy state. Energy minimization is the process that moves the positions of atoms according to the forces acting upon them. In preparing the system, hydrogen atoms were constructed by \texttt{pdb2gmx} in assumed geometries, water molecules were added in a grid around the protein by \texttt{solvate} in a manner that may create suboptimal hydrogen bonding, and ions were added randomly by \texttt{genion} with no regard to nearby atoms with which the charges might clash. As a result, the system must be relaxed to a low-energy state, otherwise any simulation will likely be unstable. It is impossible to know whether or not the system is at its global energy minimum, but this is not necessarily the goal of energy minimization, rather it seeks to find a reasonable starting point for the simulation.
In this section, the contents of the \texttt{.mdp} file will be discussed in depth. Refer to the \texttt{minim.mdp} file, which will be used as the instructions for the energy minimization process. The first section of this input file specifies the parameters for carrying out energy minimization:
\begin{lstlisting}
integrator = steep
emtol = 1000.0
emstep = 0.01
nsteps = 50000
\end{lstlisting}
%
These instructions tell the GROMACS program \texttt{mdrun} (discussed below) to use the steepest descents method for energy minimization, and to terminate the process if the magnitude of the potential energy gradient is 1000.0 kJ mol\textsuperscript{-1} nm\textsuperscript{-1} or smaller. The maximum step size along the gradient is 0.01 nm, and a maximum of 50000 steps are allowed. The default value of \texttt{emstep} in GROMACS of 0.01 nm is used, but it is often necessary to use a smaller maximum step ({\em e.g.} 0.002 nm) for systems that have difficulty converging; the smaller step size allows for a more thorough walk over the potential energy surface, whereas a larger step may miss a path to the true minimum. It is also possible to allow the process to go on indefinitely, stopping only when convergence is reached (due to numerical precision or by actually achieving \texttt{emtol}) by setting \texttt{nsteps = -1}. Note that \texttt{emstep} has units of distance, not time. Energy minimization is not a dynamical process; time does not elapse between each step and there are no velocities. Energy minimization steps are expressed in terms of maximum displacement along the vector indicated by the force.
The remainder of the \texttt{.mdp} file contains instructions for how to compute nonbonded interactions:
\begin{lstlisting}
nstlist = 1
cutoff-scheme = Verlet
ns_type = grid
coulombtype = PME
rcoulomb = 1.0
rvdw = 1.0
pbc = xyz
\end{lstlisting}
%
The value of \texttt{nstlist} controls how many steps (minimization or MD integration) elapse between updating the neighbor list for determining which atoms contribute to the short-range forces. In MD simulations, this value is larger because atoms exchange from the neighbor list in diffusion-limited time, but for energy minimization, it needs to be set to 1 because the configuration may change considerably between each step. The \texttt{cutoff-scheme = Verlet} setting uses a buffered neighbor list, that is, atoms outside the longest cutoff are still tracked to improve energy conservation. Neighbor searching is done by checking atoms in neighboring grids (\texttt{ns\_type = grid}) rather than checking every possible atom (\texttt{ns\_type = simple}). All short-range nonbonded interactions (electrostatics and van der Waals) are truncated at 1.0 nm (\texttt{rcoulomb = 1.0} and \texttt{rvdw= 1.0}), and periodic boundary conditions are applied in all three dimensions (\texttt{pbc = xyz}). The PME method is used to calculate long-range electrostatic forces.
Note that in \texttt{.mdp} files, there is no difference between a hyphen (-) and underscore (\_); hence in the above example, \texttt{ns-type} and \texttt{ns\_type} would be equivalent, as would \texttt{cutoff-scheme} and \texttt{cutoff\_scheme}.
As above, invoke \texttt{grompp} to assemble the instructions for energy minimization (\texttt{-f minim.mdp}), coordinates \\(\texttt{-c 1AKI\_solv\_ions.gro}), and topology (\texttt{-p topol.top}) to create the run input file for energy minimization (\texttt{-o em.tpr}):
\begin{lstlisting}
$ gmx grompp -f minim.mdp -c 1AKI_solv_ions.gro -p topol.top -o em.tpr
\end{lstlisting}
The GROMACS \texttt{mdrun} program is responsible for performing all minimization and dynamics processes. To run energy minimization, use the following syntax:
\begin{lstlisting}
$ gmx mdrun -v -deffnm em
\end{lstlisting}
%
The \texttt{-v} option invokes "verbose" mode, in which an estimate of time remaining is printed to the screen, along with the current energy minimization step, the step size, potential energy, and maximum force. It is not a required option, and if printed to a file, may result in a large file that is saved to disk. It is, however, instructive to watch the progress of energy minimization to understand what is going on. The \texttt{-deffnm} option defines the base file name for all input and output files and eliminates the need for using individual input and output flags to specify names.
\texttt{mdrun} requires the \texttt{.tpr} file as its only input, normally passed to the \texttt{-s} flag. \texttt{mdrun} produces several file types, including an ASCII text file with a \texttt{.log} extension, a binary file with all energy values with a \texttt{.edr} extension, and trajectory files with either \texttt{.trr} (full-precision coordinates, velocities, and/or forces) or \texttt{.xtc} (reduced precision coordinates only) extensions. When using \texttt{-deffnm}, these files will be called \texttt{em}, with a corresponding file extension. It is useful to name files in this manner to avoid relying on default file names (\texttt{md.log}, \texttt{ener.edr}, \texttt{traj.trr}, \texttt{traj\_comp.xtc}), which will be the same for any process carried out by \texttt{mdrun}.
When \texttt{mdrun} is done, the user will see something similar to the following:
\begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
Steepest Descents converged to Fmax < 1000 in 726 steps
Potential Energy = -5.8448769e+05
Maximum force = 9.6957593e+02 on atom 736
Norm of force = 2.3290928e+01
\end{lstlisting}
For a system such as this, it is expected that the final potential energy will be a negative value, on the order of 10\textsuperscript{5} - 10\textsuperscript{6} kJ mol\textsuperscript{-1}. Potential energy is an extrinsic quantity, so larger systems will have larger magnitudes. For a solvated protein, the value will be negative, as favorable electrostatic interactions between water molecules dominate the potential energy terms. Smaller systems may even have positive potential energy, arising from the fact that in such cases, the intramolecular bonded terms have a larger magnitude than nonbonded terms. Bonded components of the potential energy function have positive values, by definition. So for small systems, these may prevail and the final, potential energy is positive. Such an outcome is not necessarily an indication that anything is wrong. It would, however, be unusual for a fully solvated protein system such as this one to have a positive potential energy.
The various energy terms saved during energy minimization are stored in the \texttt{em.edr} file, and can be extracted with the \texttt{energy} program. For example, to extract the potential energy of a system as a function of energy minimization step:
\begin{lstlisting}
$ gmx energy -f em.edr -o potential.xvg
\end{lstlisting}
By default, GROMACS analysis programs write output formatted for use with the XmGrace plotting program, though this formatting can be changed with the \texttt{-xvgr} option. The output \texttt{potential.xvg} file can be plotted to produce Figure~\ref{lyso_em_fig}.
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_em_potential}
\caption{Potential energy of the solvated lysozyme system as a function of energy minimization step.}
\label{lyso_em_fig}
\end{figure}
The solvated system is now at a reasonable energy minimum. Note that the steepest descents method is very efficient but also may have difficulty finding a true energy minimum. The settings used in this tutorial are fairly permissive but work reasonably well for most similar systems. There are applications, however, that require more exhaustive energy minimization (for instance, free energy calculations such as those in section~\ref{fes}), either through multiple rounds of minimization with different algorithms that may more carefully sample energy minima, or with stricter convergence criteria.
It is also important to note that the large changes in potential energy are often confusing to new users. During dynamics, it is expected that energy will be conserved, but the purpose of energy minimization is to change the configuration of the system in a manner that leads to a decrease in energy. As such, obtaining an average energy during energy minimization is not a useful metric, nor is any examination of energy conservation.
\subsubsection{Equilibration} \label{lyso_equil}
Following energy minimization, the system is now ready for the start of dynamics. Equilibration is carried out to obtain a stable thermodynamic ensemble for whatever conditions are desired. Most biomolecular simulations are carried out under a canonical (NVT) ensemble, in which the number of particles (N), volume (V), and temperature (T) of the system are conserved, or an isothermal-isobaric (NPT) ensemble, in which the number of particles (N), pressure (P), and temperature (T) are conserved. There are other statistical mechanical ensembles that can be employed, but this tutorial will focus on those that are most relevant to biomolecular systems. Before carrying out a simulation under a given ensemble, the system must be equilibrated under the same conditions. For this tutorial, the production MD simulation will be carried out under an NPT ensemble. It is possible to simply initiate an equilibration simulation under the same conditions, but it is often more robust to first equilibrate the temperature of the system under an NVT ensemble before applying a barostat to control the temperature. The simultaneous combination of velocity generation and coordinate scaling under the influence of the barostat can introduce instabilities in a system that may be far from equilibrium.
It is also important to note that rather than initiating the simulation at the desired temperature, it is also possible to slowly warm the system up. This technique will not be applied in this tutorial, as there is some debate as to whether or not it is necessary in most cases. Interested readers are directed to the GROMACS manual section on "simulated annealing" for information on how this process is carried out in GROMACS.
To initialize dynamics, velocities are generated by \texttt{grompp} and written to the \texttt{.tpr} file. The following settings in the \texttt{nvt.mdp} file control the generation of velocities:
\begin{lstlisting}
gen_vel = yes
gen_temp = 300
gen_seed = -1
\end{lstlisting}
A new concept is the application of position restraints. A restraint is a biasing potential that is applied to selected atoms to disfavor motion. It should be noted that position restraints do not prevent motion, rather they penalize it. Position restraints are typically applied to solute non-hydrogen atoms during equilibration to prevent deformations that could be induced by the random nature of the starting velocities and subsequent collisions between the solute and solvent. Position restraints are specified with the following keyword in the \texttt{nvt.mdp} file:
\begin{lstlisting}
define = -DPOSRES
\end{lstlisting}
%
Recall the generation of the \texttt{posre.itp} file by \texttt{pdb2gmx} in section~\ref{lyso_top}, which is only invoked as part of an \texttt{\#ifdef} block in \texttt{topol.top}. This construct means restraints are only applied when the user wants them, with invocation being dependent upon the use of the \texttt{define} keyword in the \texttt{.mdp} file. The presence of \texttt{define = -DPOSRES} means that the \texttt{\#ifdef POSRES} condition is evaluated as true by \texttt{grompp} and the position restraint topology (\texttt{posre.itp}) is applied to the system. Note that the general syntax of \texttt{\#ifdef KEYWORD} and \texttt{define = -DKEYWORD}; failing to prefix the keyword with \texttt{-D} will not lead to any error, but the \texttt{\#ifdef} condition will not be matched.
Some of the settings in the \texttt{.mdp} file are the same as in energy minimization and will not be repeated here. New settings that are introduced during NVT are:
\begin{lstlisting}
integrator = md
nsteps = 50000
dt = 0.002
\end{lstlisting}
%
The above lines specify that this process is an MD simulation, and the \texttt{md} setting specifies the leap-frog integrator in GROMACS. A total of 50,000 integration steps will be performed, and each step corresponds to an integration time step, $\delta$t, of 0.002 ps, or 2 fs. The simulation will therefore last for 100 ps (50,000 steps $\times$ 0.002 ps per step).
Output is specified in the next section of \texttt{nvt.mdp}:
\begin{lstlisting}
nstxout = 500
nstvout = 500
nstenergy = 500
nstlog = 500
\end{lstlisting}
%
The \texttt{nstxout} and \texttt{nstvout} settings control the interval of time steps between writing coordinates (x) and velocities (v) to the trajectory (\texttt{.trr}) file. With these settings, coordinates and velocities are saved every 1 ps. Similarly, energy terms are written to the \texttt{.edr} file (\texttt{nstenergy}) and \texttt{.log} file (\texttt{nstlog}) at the same interval.
Bond constraints are applied to allow for the 2-fs integration time step specified above. The value of $\delta$t is limited by the fastest vibrational modes in the system, which are typically bonds to hydrogen atoms. As these bonds will rarely exist in excited states, it is reasonable to model them as rigid, replacing the harmonic oscillator term with a holonomic constraint. Doing so allows the value of $\delta$t to be increased from roughly 0.5 fs up to 2 fs for typical dynamics runs. In the \texttt{.mdp} file, the relevant settings are:
\begin{lstlisting}
continuation = no
constraint_algorithm = lincs
constraints = h-bonds
lincs_iter = 1
lincs_order = 4
\end{lstlisting}
%
The \texttt{continuation} keyword tells \texttt{mdrun} that the coordinates are not the output of a previously constrained simulation, so the constraints need to be solved at the first time step. The method for applying constraints is LINCS, the Linear Constraint Solver~\cite{Hess1997,Hess2008b}. Constraints are applied to bonds to hydrogen atoms \texttt{constraints = h-bonds}, and the remaining settings are the default LINCS parameters for accuracy of the constraint equations. It is important to note that "constraints" and "restraints" are different. As described above, a restraint is a biasing potential that disfavors motion, whereas a constraint is used to fix a distance or angle in a simulation. Older literature often uses these terms interchangeably, but in modern simulation practice, these two terms have distinct meaning.
The next relevant section is related to nonbonded interactions, which were described above in section~\ref{lyso_em}. After the nonbonded settings are the keywords related to temperature control:
\begin{lstlisting}
tcoupl = V-rescale
tc-grps = Protein Non-Protein
tau_t = 0.1 0.1
ref_t = 300 300
\end{lstlisting}
%
The \texttt{tcoupl} keyword specifies the algorithm that is used for temperature coupling, also known as the thermostat. In this example, the velocity rescaling method of Bussi et al. is applied~\cite{Bussi2007}. Another thermostat commonly applied during equilibration is the weak coupling method of Berendsen et al.~\cite{Berendsen1984} (specified with \texttt{tcoupl = berendsen} in the \texttt{.mdp} file). The Berendsen weak coupling method has fallen out of favor, as it has been demonstrated that it does not sample a correct canonical velocity distribution~\cite{Bussi2007}. The Bussi velocity rescaling method, however, does sample a correct velocity distribution, and has the same advantages as the Berendsen method in that it quickly relaxes the system to the target temperature, making it practical and efficient for use in both equilibration and production MD simulations. The \texttt{tc-grps} keyword specifies which groups of atoms are coupled to the specified thermostat. All atoms should be coupled, but they can be separated into different groups. Strictly speaking, any separation of atoms into different groups violates the Equipartition Theorem, in that thermal energy can flow into and out of the thermal reservoir (thermostat) independently between the different groups, rather than being exchanged between the atoms in the system. The practice of coupling solute and solvent atoms separately arises from the "hot solvent/cold solute" problem in MD simulations~\cite{Lingenheil2008}, in which the solvent heats while the solute cools, due to a host of factors related to approximations made in MD integration. Here, the protein and all other non-protein atoms (water and ions) are coupled to separate thermostats at 300 K, with a coupling time ($\tau_{T}$) of 0.1 ps, which reflects rather tight regulation, which is appropriate for equilibration.
Create the NVT equilibration \texttt{.tpr} file with \texttt{grompp}:
\begin{lstlisting}
$ gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
\end{lstlisting}
%
The command-line options to \texttt{grompp} are largely the same as those used above for energy minimization, but there is one addition, the \texttt{-r} flag. This option specifies the name of a coordinate file that is used as the origin for the position restraint biasing potential. That is, this coordinate file contains the reference positions (hence \texttt{-r}) at which the biasing potential on the atoms would be zero. In previous versions of GROMACS, this option was not mandatory, but in the 2018 version, it is required to prevent users from incorrectly specifying the origin of their biasing potentials. In most cases, however, the coordinates passed to \texttt{-c} and \texttt{-r} will be the same. There are advanced use cases in which different coordinate files will be used, but they are beyond the scope of this tutorial.
Perform NVT equilibration by invoking \texttt{mdrun}:
\begin{lstlisting}
$ gmx mdrun -deffnm nvt
\end{lstlisting}
After the run, analyze the temperature time series to verify that the target temperature (300 K) was achieved and maintained, using the \texttt{energy} program. At the prompt, type \texttt{16 0}, then Enter. Group 16 is the temperature time series, and 0 terminates input to \texttt{energy}.
\begin{lstlisting}
$ gmx energy -f nvt.edr -o temperature.xvg
\end{lstlisting}
The time series of temperature during NVT is shown in Figure~\ref{lyso_nvt_temp_fig}.
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_nvt_temperature}
\caption{Temperature of the solvated lysozyme system over time during NVT equilibration.}
\label{lyso_nvt_temp_fig}
\end{figure}
As it is ultimately desirable to conduct the production MD simulation under NPT conditions (reflecting typical laboratory conditions), equilibration is continued under an NPT ensemble. Position restraints are maintained on the protein non-hydrogen atoms. New settings in the \texttt{.mdp} file related to pressure coupling are as follows:
\begin{lstlisting}
pcoupl = Parrinello-Rahman
pcoupltype = isotropic
tau_p = 2.0
ref_p = 1.0
compressibility = 4.5e-5
refcoord_scaling = com
\end{lstlisting}
The equilibration protocol utilizes a barostat method proposed by Parrinello and Rahman~\cite{Parrinello1981}, and later modified by Nos\'e and Klein~\cite{Nose1983}. As with thermostats explained in the NVT equilibration step, barostats require a reference pressure (\texttt{ref\_p}) and a period of time over which the pressure of the system is allowed to relax towards the target (\texttt{tau\_p}). The \texttt{pcoupltype} keyword controls the deformation of the box in all three dimensions. For a homogenous system such as the one constructed here, it is appropriate to scale all box vectors uniformly. Such scaling is achieved with \texttt{isotropic} pressure coupling. GROMACS has options for semi-isotropic, with $x$ and $y$ coupled together, $z$ coupled independently, as will be used in the membrane protein tutorial (see Section~\ref{kalp}). Anisotropic coupling allows all dimensions (including diagonal elements of the box) to be scaled separately (useful for crystalline materials and solid phases). The \texttt{compressibility} setting controls the speed of the response of the barostat and is, by default, set to the isothermal compressibility of water (4.5 $\times$ 10\textsuperscript{-5} bar\textsuperscript{-1}).
The \texttt{refcoord\_scaling} setting is often confusing for new users. Pressure is scaled by actually scaling atomic coordinates as a function of increases or decreases in box size. The reference coordinates (see NVT equilibration above) set the origin of the position restrain biasing potential. If the reference coordinates are not similarly scaled, the forces applied due to the restraints will accumulate and become artificial as the updated coordinates drift slightly and are scaled, while the reference coordinates remain fixed. Reference coordinates can be scaled either based on the center-of-mass (COM) of the restrained molecule(s) or by individual atoms. In principle, either approach works sufficiently well for equilibration, but scaling relative to the COM of the restrained molecule(s) is often somewhat more numerically stable, especially for systems that are not necessarily at thermodynamic equilibrium.
An additional difference in the \texttt{npt.mdp} file is the treatment of bonds. The \texttt{continuation} keyword is set to a value of \texttt{yes}, indicating that this simulation run is the continuation of a previous run in which constraints were applied. To preserve continuity, the constraints should not be re-solved, and the input coordinates should be assumed to satisfy the constraint algorithm being applied.
Prepare the run input file (\texttt{.tpr}) again using \texttt{grompp}. Note here that the checkpoint file from NVT equilibration (\texttt{nvt.cpt}) is passed to the \texttt{grompp -t} option. The checkpoint file contains a complete description of the thermodynamic state of the system, in high precision. To exactly continue a simulation, the checkpoint file must be passed to \texttt{grompp}. If it is omitted, information regarding velocities, thermostat variables, and high-precision coordinates will be lost. The checkpoint file is an essential component of proper continuation of MD simulations.
\begin{lstlisting}
$ gmx grompp -f npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
\end{lstlisting}
As before, execute \texttt{mdrun} to perform NPT equilibration:
\begin{lstlisting}
$ gmx mdrun -deffnm npt
\end{lstlisting}
After the run has concluded, again use the \texttt{energy} program, in this case to extract the pressure time series. Enter 18 to select Pressure, and 0 to terminate input to \texttt{energy}:
\begin{lstlisting}
$ gmx energy -f npt.edr -o pressure.xvg
\end{lstlisting}
The pressure time series is plotted in Figure~\ref{lyso_npt_pres_fig}. The screen output also indicates that the average pressure was 7.5 $\pm$ 160.5 bar. The target value was 1 bar. This discrepancy is often a source of concern for new users. Consider, however, the large error bar associated with the time series, and the wide range of pressure values sampled during the simulation (Figure~\ref{lyso_npt_pres_fig}). Pressure is a quantity that fluctuates wildly in a microscopic system. As a result, instantaneous values are going to vary considerably, and average values may not correspond exactly to the desired target value. However, given the huge standard deviation of pressure during NPT equilibration (160.5 bar), the average obtained during the simulation (7.5 bar) is statistically indistinguishable from the target of 1 bar. It is possible to extend the NPT equilibration phase for additional time in an attempt to obtain better agreement, but it is not necessary to do so. The existing agreement is sufficient.
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_npt_pressure}
\caption{Pressure of the solvated lysozyme system over time during NPT equilibration.}
\label{lyso_npt_pres_fig}
\end{figure}
An additional indicator of the convergence of NPT equilibration is the density of the system. Density is less prone to the fluctuations observed in the instantaneous pressure values. Again, use the \texttt{energy} program to extract the density time series from the \texttt{npt.edr} file, choosing 24 for density and 0 to terminate input:
\begin{lstlisting}
$ gmx energy -f npt.edr -o density.xvg
\end{lstlisting}
The density time series is shown in Figure~\ref{lyso_npt_dens_fig}. The average value from NPT equilibration is 1019 $\pm$ 3 kg m\textsuperscript{-3}. Note that this value is about two percent higher than the expected value for bulk SPC/E water (998 kg m\textsuperscript{-3})~\cite{Berendsen1987}. Such an outcome is not unusual, as the system contains a protein and ions, therefore its density will not be exactly equivalent to that of pure water.
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_npt_density}
\caption{Density of the solvated lysozyme system over time during NPT equilibration.}
\label{lyso_npt_dens_fig}
\end{figure}
Since the density is essentially constant over the duration of NPT equilibration, it is reasonable to conclude that the system is adequately equilibrated and unrestrained simulations can be initiated.
\subsubsection{Production MD Simulation} \label{lyso_md}
At this point, the solvent has relaxed around the protein and the system is equilibrated under the desired statistical mechanical ensemble, NPT. Following equilibration, position restraints are removed from the protein and the simulation proceeds in an unbiased manner. As such, this phase of the MD simulation is often referred to as "unrestrained MD" or "production MD," the latter owing to the fact that trajectory data that are being collected are now considered useful for answering the scientific question(s) at hand.
Production MD proceeds like any other process carried out thus far, with the generation of a \texttt{.tpr} input file with \texttt{grompp}, providing the coordinates and checkpoint file from NPT equilibration to ensure an exact continuation:
\begin{lstlisting}
$ gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md_0_1.tpr
\end{lstlisting}
%
Note that in this \texttt{grompp} command, there is no use of \texttt{-r}, as no restraints are being applied. The \texttt{.mdp} settings are the same as during NPT equilibration, except no restraints are specified and the run is being carried out for 500,000 steps, corresponding to 1 ns (500,000 steps $\times$ 0.002 ps per step = 1000 ps = 1 ns).
As before, it is useful to name files in accordance with the time interval and process being simulated. Here, the user will perform a 1-ns, unrestrained MD simulation, starting from t = 0 ns (equilibration time is not counted) and proceeding to t = 1 ns, hence the base file name of \texttt{md\_0\_1}. To run the simulation, invoke \texttt{mdrun}:
\begin{lstlisting}
$ gmx mdrun -deffnm md_0_1
\end{lstlisting}
\subsubsection{Data Analysis} \label{lyso_ana}
Data analysis is the most difficult aspect of carrying out an MD simulation. It is generally advisable to plan appropriate analyses before initiating the simulation to ensure that the simulation is designed in such a way as to produce useful data that answer the scientific question(s) at hand. One should not simply execute a simulation and hope that something interesting will happen that will reveal itself in time. This tutorial cannot possibly cover all analysis methods, nor should the simple tasks here be considered required for analyzing any given MD simulation. In this tutorial, the user will conduct several basic analyses that reflect typical syntax of running various GROMACS analysis routines. Beyond being a fast MD engine, GROMACS comes pre-packaged with numerous analysis programs, facilitating the collection and processing of data.
The first step in analysis is to remove the influence of PBC that were applied during the simulation, a process often referred to as "re-imaging" or "re-wrapping." As a consequence of PBC, molecules often appear "broken" in trajectory frames, or may appear to be "outside" the unit cell. Neither is actually true. Molecules are always intact according to the topology of the system, but may be represented in a manner such that part of a molecule is at one "side" or "corner" of the box, with the remainder elsewhere. This outcome is a result of the fact that visualization convenience is irrelevant to \texttt{mdrun} in its physical calculation. Rather than wasting time re-wrapping molecules to make them appear "whole" during the run, \texttt{mdrun} writes "broken" molecules that must later be fixed, if desired. As molecules diffuse through the system, it may appear that some molecules ({\em e.g.} the protein) "leave" the box or appear "outside" of it. This is not the case; there is no such thing as "outside" an infinite system.
The GROMACS program that applies coordinate manipulations to account for periodicity effects is called \texttt{trjconv} ("trajectory converter") and it serves a number of functions. Beyond accounting for PBC effects, it can be used to convert file formats or save a subset of atoms from the system. In this case, only a simple invocation of \texttt{trjconv} is necessary, to simultaneously make "broken" molecules appear whole and to center the protein in the unit cell. These processes can be done in a single command:
\begin{lstlisting}
$ gmx trjconv -s md_0_1.tpr -f md_0_1.xtc -o md_0_1_noPBC.xtc -pbc mol -center
\end{lstlisting}
%
The user is prompted for two selections. The first requires a choice of which group to center in the unit cell. Choose group 1 (Protein). The second prompt asks the user to choose which group should be output in the modified trajectory. For simplicity, choose group 0 (System).
Additional manipulations can be applied, such as rotational and translational fitting to produce a smoother visualization of the trajectory. Such fitting relies on a least-squares fit to a reference structure in the \texttt{.tpr} file. For the purposes of this tutorial, no such fitting will be applied, but interested readers are referred to the \texttt{-fit} option of \texttt{trjconv} and associated options. Note that one should not simultaneously use \texttt{-fit} and \texttt{-pbc}, as they are mathematically incompatible. Correct periodicity effects first, then deal with rotational and/or translational fitting.
At this point, the user should visualize the trajectory using programs like VMD~\cite{Humphrey1996}. Watching the progression of the trajectory is the most important part of analysis. Though GROMACS has many built-in analysis routines, the user will not know {\em a priori} everything that will happen during the MD simulation. Animating the trajectory can also be useful for determining if the \texttt{trjconv} treatment was appropriate. Many GROMACS programs are not PBC-aware, meaning if the user does not properly re-image the trajectory to account for PBC effects, analysis will be incorrect, particularly in the case of structure-related properties like those discussed here.
If one is interested in quantifying how much the protein structure changed over the course of the simulation, the root-mean-square deviation (RMSD) can be computed. RMSD is calculated following a least-squares fit to the reference structure and is plotted as a function of time. The GROMACS program that computes RMSD is called \texttt{rms}. It reads the reference coordinates, masses, and topology from the \texttt{.tpr} file passed to \texttt{-s} and computes the RMSD for each frame found in the trajectory passed to \texttt{-f}. Typically, the RMSD of a protein is calculated for its backbone (N, C\textsubscript{$\alpha$}, C) or "main chain" (N, C\textsubscript{$\alpha$}, C, O) atoms rather than the entire structure. Flexible side chains may have large RMSD that do not actually reflect meaningful structural changes. To calculate the RMSD of the protein backbone with respect to the equilibrated coordinates, invoke \texttt{rms} as follows:
\begin{lstlisting}
$ gmx rms -s md_0_1.tpr -f md_0_1_noPBC.xtc -o rmsd.xvg -tu ns
\end{lstlisting}
%
When prompted, choose group 4 (Backbone) for both the group for fitting and the group for output. The RMSD thus reflects the net change in backbone configuration after accounting for global rotation and translation.
It may also be of interest to calculate the RMSD with respect to some other structure, such as a specific frame of interest or the crystal structure rather than the equilibrated (t = 0 ns) coordinates. To compute the RMSD with respect to the crystal structure, provide \texttt{rms} with the \texttt{em.tpr} file, which was the input to energy minimization. The coordinates of the protein contained therein are the crystal structure with the hydrogen atoms added by \texttt{pdb2gmx}. Invoke \texttt{rms} again:
\begin{lstlisting}
$ gmx rms -s em.tpr -f md_0_1_noPBC.xtc -o rmsd_xtal.xvg -tu ns
\end{lstlisting}
%
As before, choose group 4 (Backbone) for both the group for fitting and the group for output. The two resulting RMSD time series can be plotted together, as shown in Figure~\ref{lyso_md_rmsd_fig}.
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_md_rmsd}
\caption{RMSD of the lysozyme backbone atoms during the production MD simulation following least-squares fitting to the indicated reference structures.}
\label{lyso_md_rmsd_fig}
\end{figure}
The RMSD values with respect to the crystal structure and equilibrated system are not identical, with the RMSD to the crystal reference being systematically higher than the RMSD to the equilibrated structure. This outcome is not unexpected. Though restraints were applied during equilibration, it is not guaranteed that the protein will not move. Instead, it moves slightly but overall, its conformation is maintained. After about 0.5 ns, the RMSD levels off around 0.1 nm (1~\AA), which is quite small and suggests the protein is very stable. It is appropriate to use this space to dispel several common misconceptions about RMSD. First, it is a degenerate metric, meaning a single value can have multiple interpretations (mathematical solutions), and as such on its own, RMSD is not very useful. Second, one cannot say that a simulation has converged simply based on RMSD, due to the first point $-$ a structure can still be changing considerably from its starting structure even if the RMSD has largely plateaued. Therefore, one should never terminate a simulation, believing it to be "stable" or "converged" simply by looking at RMSD.
As a final example of analysis, calculate the radius of gyration (R\textsubscript{g}) of lysozyme over time with the GROMACS \texttt{gyrate} program. This quantity is not particularly useful for systems of well-folded proteins, as it is a measure of compactness, a quantity expected to be largely constant for a protein with stable tertiary structure. Calculating R\textsubscript{g} demonstrates an advanced capability of GROMACS. Invoke \texttt{gyrate} as follows:
\begin{lstlisting}
$ gmx gyrate -s md_0_1.tpr -f md_0_1_noPBC.xtc -o gyrate.xvg
\end{lstlisting}
\begin{figure}[h]
\centering
\includegraphics{plot_lyso_md_rg}
\caption{Radius of gyration of lysozyme during the production MD simulation.}
\label{lyso_md_rg_fig}
\end{figure}
The R\textsubscript{g} time series is shown in Figure~\ref{lyso_md_rg_fig}. This quantity is similarly stable over this short simulation, which is expected given the stable fold of lysozyme and the very short nature of the simulation. Large structural changes are not expected within 1 ns for any protein.
\subsubsection{Summary and Review of Objectives} \label{lyso_summary}
Through this tutorial, the user has been guided through the construction of a solvated protein system, including neutralizing counterions. To review, the objectives for this tutorial were as follows:
\begin{enumerate}
\item Understand the content of a GROMACS system topology
\item Carry out basic steps of setting up a periodic system, including solvation and adding ions
\item Identify keywords and typical settings in \texttt{.mdp} input files
\item Control position restraints and topology directives via \texttt{.mdp} settings
\item Determine the appropriate algorithms for energy-\\minimizing and equilibrating a biomolecular system and performing a production MD simulation
\end{enumerate}
The first step in the process was the generation of a topology for the protein, to which solvent and ions were added to describe the entire system. The topology contains a listing of the atomic properties, connectivity, and all relevant bonded interactions in the system, as well as topology directives such as \texttt{\#include} to provide additional parameters or define species like water and ions, and \texttt{\#ifdef...\#endif} blocks that allow for selective application of parameters or restraints. Following processing of the protein by \texttt{pdb2gmx}, water was added to a cubic volume with \texttt{solvate} and ions added by \texttt{genion}. The system was energy-minimized with an efficient algorithm, steepest descents, and equilibrated under NVT and NPT ensembles. Recall the practical considerations for thermostat and barostat settings, including the use of a thermostat that properly samples an NVT distribution of kinetic energies. The tutorial concluded with rudimentary analysis that illustrates the basic syntax of GROMACS analysis tools.
%%%%%%%%% TUTORIAL 2 %%%%%%%%%
\subsection{Tutorial 2: KALP\textsubscript{15} in DPPC} \label{kalp}
\begin{figure}[H]
\centering
\includegraphics{kalp15_dppc}
\caption{The KALP\textsubscript{15} model peptide, embedded in a hydrated DPPC bilayer.}
\label{kalp_dppc_fig}
\end{figure}
Membrane proteins are of substantial biological interest as they modulate signaling pathways, engage in cell-cell interactions, and frequently serve as drug targets. Given their hydrophobicity, membrane proteins are difficult to crystallize or otherwise study experimentally. Thus, MD simulations serve as a useful technique in describing membrane protein conformational ensembles and protein-lipid interactions. Preparing a membrane protein system (Figure!\ref{kalp_dppc_fig}) is somewhat more difficult than a simple protein in water, as there are additional steps required to embed the protein in a membrane and carefully equilibrate these heterogeneous systems. This tutorial is available at \tutorialkalp.
\subsubsection{Prepare the Protein Topology} \label{kalp_protein_topology}
Model peptides are useful for studying protein-lipid interactions. The peptide that is the subject of this tutorial is called "KALP\textsubscript{15}" and has a sequence of Ac-GKK(LA)\textsubscript{4}LKKA-NH\textsubscript{2}. The "Ac" and "NH\textsubscript{2}" correspond to acetyl and amide groups, respectively. This peptide was the subject of an investigation on hydrophobic mismatch performed by Kandasamy and Larson~\cite{Kandasamy2006}. That study motivated the present tutorial.
The input coordinate file for the KALP\textsubscript{15} peptide can be constructed with programs that can build arbitrary molecules. In this tutorial, the peptide was initially built in the xLeap program of AmberTools (\url{http://ambermd.org/}) but could also be built with CHARMM~\cite{Brooks2009} if the user prefers. The initial geometry was set to that of an ideal $\alpha$-helix ($\phi$ = -60\textdegree, $\psi$ = -45\textdegree). This coordinate file was aligned along the {\em z}-axis (which, as will be shown later, corresponds to the membrane normal) by using the GROMACS \texttt{editconf} program, which has an option called \texttt{-princ} that aligns the longest axis of the provided coordinates along the {\em x}-axis. A subsequent rotation of 90\textdegree about the {\em y}-axis (also with \texttt{editconf}) aligns the peptide with the {\em z}-axis. A properly aligned KALP\textsubscript{15} peptide coordinate file is provided in the online tutorial.
The topology for this system will be generated with the GROMOS96 53A6 force field~\cite{Oostenbrink2004}. When executing \texttt{pdb2gmx}, the user should choose "None" for both the N- and C-termini. As described above, the N- and C-termini are capped with acetyl and amide groups, respectively. Without these groups, the peptide would have a large dipole moment as a result of the free amine (-NH\textsubscript{3}\textsuperscript{+}) and carboxylate (-COO\textsuperscript{$-$}) groups that would give rise to artificial dynamics. Model peptides are often capped to avoid these spurious "end effects."
Invoke \texttt{pdb2gmx} to produce the topology:
\begin{lstlisting}
$ gmx pdb2gmx -f KALP-15_princ.pdb -o KALP-15_processed.gro -ignh -ter -water spc
\end{lstlisting}
%
Note the use of the \texttt{-ignh} option above. The initial coordinates are an all-atom model. GROMOS96 is a united-atom force field, meaning that nonpolar hydrogen atoms are merged into their parent carbon atoms. This convention means there are fewer atoms in the system and thus the simulation will proceed more quickly. Using \texttt{-ignh} allows \texttt{pdb2gmx} to ignore all hydrogen atoms and rebuild only those required by the force field.
It is important to note here that united-atom force fields are not very commonly used. In the past, when computing power was much lower, it was desirable to limit the number of atoms in the system. All modern force fields had their origins in united-atom parameter sets~\cite{Gelin1979,Weiner1984,Jorgensen1984}. In the present tutorial, a united-atom force field will be used as it is convenient for demonstrating the principles necessary to construct the membrane protein system and manipulate the underlying force field files. With phospholipids, the reduction in the number of atoms in the system is also quite dramatic, leading to considerable increase in simulation speed.
\subsubsection{Customize the Force Field} \label{kalp_ff}
The topology generated in Section~\ref{kalp_protein_topology} describes only the protein, and the GROMOS96 53A6 parameter set (as distributed with GROMACS) does not contain topologies or parameters for phospholipids. Recently, such parameters have been published~\cite{Kukol2009}. In this tutorial, we will apply an older, but widely used, united-atom lipid force field derived by Berger, Edholm, and J{\"a}hnig~\cite{Berger1997}.
\begin{figure}[h!]
\centering
\includegraphics{DPPC_new_label}
\caption{The chemical structure of the DPPC lipid, with atom names according to the Berger lipid force field. The {\em sn-1} chain begins at carbon C34 and the {\em sn-2} chain begins at carbon C15.}
\label{kalp_dppc_structure_fig}
\end{figure}
The phospholipid used in this tutorial is dipalmitoylphosphatidylcholine (DPPC, Figure~\ref{kalp_dppc_structure_fig}). It is a fully saturated, zwitterionic membrane lipid that is often used in simple models of membranes. The topology and force field parameters for DPPC using the Berger parameters are distributed by Prof. D. Peter Tieleman at the University of Calgary. Download the following files from his website (\url{http://wcm.ucalgary.ca/tieleman/downloads}):
\begin{enumerate}
\item{\texttt{dppc128.pdb}: Coordinates of a 128-lipid DPPC bilayer}
\item{\texttt{dppc.itp}: The topology of DPPC, containing its\\ \texttt{[moleculetype]} definition}
\item{\texttt{lipid.itp}: The Berger lipid force field parameters}
\end{enumerate}
The \texttt{lipid.itp} file is a self-contained, full force field for the Berger phospholipids. It is distributed in a standalone form, such that it can be used directly for simulating pure lipid membranes. To utilize it in the context of a membrane protein system, it must be incorporated into the existing force field that describes the protein. Rather than using it directly, the user must add the parameters from \texttt{lipid.itp} into the appropriate GROMOS96 53A6 force field files. To do so, make a local copy of the force field directory into the working directory (assuming GROMACS is installed in the standard location of \texttt{/usr/local/gromacs}):
\begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
$ cp -r /usr/local/gromacs/share/gromacs/top/gromos53a6.ff/ ./gromos53a6_lipid.ff
\end{lstlisting}
%
A full explanation of the force field files in this new directory is given in the online version of the tutorial. The files that will be modified as part of this tutorial are:
\begin{enumerate}
\item{\texttt{ffbonded.itp}: The bonded force field parameters (bonds, angles, dihedrals, and improper dihedrals)}
\item{\texttt{ffnonbonded.itp}: The nonbonded force field parameters (atom types, Lennard-Jones parameters, pair interactions)}
\end{enumerate}
The modifications to the force field files must be made carefully to have a functional force field. First, copy the contents of the \texttt{[atomtypes]}, \texttt{[nonbond\_params]}, and \texttt{[pairtypes]} directives in \texttt{lipid.itp} into the corresponding sections of \texttt{ffnonbonded.itp}. The \texttt{[atomtypes]} section of \texttt{lipid.itp} lack atomic numbers and must be added in. In the \texttt{[nonbond\_params]} section (which defines pair-specific Lennard-Jones interactions, {\em e.g.} those that do not obey normal combination rules), delete the line that says \texttt{;; parameters for lipid-GROMOS interactions}. Delete this line and all the lines that follow in the \texttt{[nonbond\_params]} section. These entries correspond to GROMOS87 atom types and interactions that are not necessary or appropriate when using the GROMOS96 53A6 force field. Similarly, delete or comment out (with ;) any line in \texttt{[nonbond\_params]} that includes "HW," which is also not a valid atom type in GROMOS96. It indicates water hydrogen atoms, but these are simply called "H" in GROMOS96. Thus, one can also simply replace "HW" with "H" to avoid future errors. Failure to do these steps properly will lead to fatal errors when using \texttt{grompp}.
After modifying \texttt{ffnonbonded.itp}, add the contents of the \texttt{lipid.itp} \texttt{[dihedraltypes]} section to the equivalent section in \texttt{ffbonded.itp}.
These modifications of the force field files are fairly significant, and it is important to carefully check to make sure all steps have been completed correctly.
\begin{Checklists}
\begin{checklist}{Force Field Modifications}
\begin{itemize}
\item{Copy the contents of \texttt{[atomtypes]} from \texttt{lipid.itp} to \texttt{ffnonbonded.itp} and add atomic numbers to each}
\item{Copy the contents of \texttt{[nonbond\_params]} from \texttt{lipid.itp} to \texttt{ffnonbonded.itp}}
\item{Remove the \texttt{;; parameters for lipid-GROMOS \\interactions} line and all subsequent lines in \\\texttt{[nonbond\_params]} from \texttt{ffnonbonded.itp}}
\item{Remove or comment out any line in \texttt{[nonbond\_params]} in \texttt{lipid.itp} containing "HW," or rename to "H"}
\item{Copy the contents of \texttt{[pairtypes]} from \texttt{lipid.itp} to \texttt{ffnonbonded.itp}}
\item{Copy the contents of \texttt{[dihedraltypes]} from \texttt{lipid.itp} to \texttt{ffbonded.itp}}
\end{itemize}
\end{checklist}
\end{Checklists}
Next, to make use of the modified force field, an adjustment to the system topology (\texttt{topol.top}) must be made. Change the call to the force field from:
\begin{lstlisting}[basicstyle=\small\ttfamily]
#include "gromos53a6.ff/forcefield.itp"
\end{lstlisting}
%
to:
\begin{lstlisting}[basicstyle=\small\ttfamily]
#include "gromos53a6_lipid.ff/forcefield.itp"
\end{lstlisting}
Finally, add an \texttt{\#include} statement to add the DPPC topology to the system topology, in the exact location shown (to avoid disrupting any other \texttt{[moleculetype]} in the topology:
\begin{lstlisting}[basicstyle=\small\ttfamily]
; Include Position restraint file
#ifdef POSRES
#include "posre.itp"
#endif
; Include DPPC chain topology
#include "dppc.itp"
; Include water topology
#include "gromos53a6_lipid.ff/spc.itp"
\end{lstlisting}
At this point, some additional discussion on force field choice is warranted. The use of GROMOS96 53A6 to treat the protein and the Berger united-atom lipid parameters for DPPC is motivated by several factors. First, the constructed system (KALP\textsubscript{15} in DPPC) is a reproduction of a published and well-studied system~\cite{Kandasamy2006}. Second, the Berger lipid parameters are compatible with the GROMOS protein parameter set. Finally, the modification of the force field files is a useful instructive tool in teaching new users how this information is organized in GROMACS, allowing for future modifications in advanced systems. However, any MD simulation must be carefully planned, which includes a critical assessment of force field parameters for all species in the system. One should not automatically choose this force field combination simply because this tutorial does. In fact, the GROMOS96 53A6 force field has been shown to under-stabilize $\alpha$-helices~\cite{Schmid2011}, and the Berger parameters are not necessarily the best representation of lipid properties, particularly for DPPC. A useful reference is a systematic study by Piggot et al.~\cite{Piggot2012}, though the lipid force field literature is rich with other comparisons. Users must carefully consider available studies that critically assess force field performance for any system, but phospholipid membranes are especially challenging, and while no force field perfectly represents all relevant parameters, some force fields are demonstrably better than others.
\subsubsection{Construct the System} \label{kalp_construct}
\begin{figure*}[h!]
\centering
\includegraphics{kalp_dppc_lipid_pack}
\caption{The process of packing DPPC lipids around the KALP\textsubscript{15} peptide. The initial lipid coordinates, with protein-lipid atomic overlap, are scaled outward by some inflation factor (4 in this tutorial) and any lipids that remain overlapping with the protein are deleted. The scaling vector is reduced to a value < 1 to shrink the system (inward movement denoted by arrows pointing in from corners). The shrinking process is carried out iteratively until a reasonable APL value is achieved, yielding the final, packed system that is prepared for solvation with water.}
\label{kalp_lipid_pack}
\end{figure*}
The force field and system topology files are now fully prepared to describe the contents of the membrane-protein system being constructed. As in Tutorial 1 (Section~\ref{lyso}), the next step after preparing the initial topology is solvation. In nearly all MD simulations of biomolecules, the solute of interest is embedded within some type of liquid solvent, which in this case will be comprised of both a phospholipid bilayer and water. In contrast to the simple addition of a homogeneous solvent using \texttt{solvate}, the addition of the lipids around the protein requires auxiliary methods. In this tutorial, the DPPC lipids will be packed around the KALP\textsubscript{15} peptide using the InflateGRO method~\cite{Kandt2007}. With this approach, the lipid coordinates are scaled in the {\em x-y} plane, any lipids that remain overlapping with the protein are deleted, and then the lipids are progressively packed around the protein with intervening rounds of energy minimization.
The DPPC coordinate file distributed by Prof. Tieleman has "broken" lipids; all atoms are "inside" the central image and have the appearance that the lipid molecules are not bonded across periodic boundaries. While this representation is perfectly valid for carrying out MD simulations, InflateGRO will not work unless the molecules are "intact," that is, the distance between bonded atoms is at a minimum with respect to the periodic unit cell. To produce a correct coordinate file for InflateGRO, build a \texttt{.tpr} file that corresponds to the DPPC-water system. A topology for this system (\texttt{topol\_dppc.top}) is provided as part of the online tutorial. Any syntactically valid \texttt{.mdp} file can be used for this purpose. Note that \texttt{topol\_dppc.top} is only used in this step and never again. Any updates or reference to the system topology described in the tutorial are in \texttt{topol.top}.
Create the \texttt{.tpr} file:
\begin{lstlisting}
$ gmx grompp -f minim.mdp -c dppc128.pdb -p topol_dppc.top -o dppc.tpr
\end{lstlisting}
Invoke \texttt{trjconv}:
\begin{lstlisting}
$ gmx trjconv -s dppc.tpr -f dppc128.pdb -o dppc128_whole.gro -pbc mol -ur compact
\end{lstlisting}
Next, the KALP\textsubscript{15} peptide must be centered within the same box as the DPPC bilayer, such that the membrane center and peptide center are coincident. Open \texttt{dppc128\_whole.gro} in a text editor and navigate to the last line of the file (or, on the command line, use \texttt{tail -n 1 dppc128\_whole.gro}). The last line of a \texttt{.gro} file contains the box vectors of the system. Center the KALP\textsubscript{15} peptide in a box with the same dimensions as that of the DPPC bilayer with \texttt{editconf}:
\begin{lstlisting}
$ gmx editconf -f KALP-15_processed.gro -o KALP_newbox.gro -c -box 6.41840 6.44350 6.59650
\end{lstlisting}
The next step in assembling the system is to prepare the input coordinates for use in InflateGRO. To do so, concatenate the protein and membrane coordinates:
\begin{lstlisting}
$ cat KALP_newbox.gro dppc128_whole.gro > system.gro
\end{lstlisting}
The \texttt{system.gro} file is not formatted correctly. A valid \texttt{.gro} file has the following format:
\begin{lstlisting}
Title
Number of atoms
(all lines containing atomic coordinates)
Box vectors
\end{lstlisting}
By concatenating two \texttt{.gro} files together, there are now unnecessary lines that need to be removed. Open \texttt{system.gro} in a text editor and search down for "DPPC." Above the first occurrence of lipid coordinates are three lines that need to be removed: (1) the box vectors from \texttt{KALP\_newbox.gro}, (2) the title line from \texttt{dppc128\_whole.gro}, and (3) the number of atoms from \texttt{dppc128\_whole.gro}. Prior to removing the line containing the number of atoms, make note of it. After removing it, add this number to the second line in the file, to specify the total number of atoms in the system. Save the file and exit the text editor. Note that one can easily validate the number of atoms in the file with the Linux \texttt{wc} command:
\begin{lstlisting}
$ wc -l system.gro
\end{lstlisting}
%
Subtract three from the number reported by \texttt{wc} (to reflect the title line, number of atoms line, and box vectors). This number should match what appears on the second line of the \texttt{.gro} file. If it does not, inspect the file to determine the source of discrepancy, and if it does not match, start over with the \texttt{cat} command above. A quick test for validity of \texttt{system.gro} is to open it in some visualization software like VMD. If the file does not open or the program reports an error, it has been constructed incorrectly.
The InflateGRO method requires strong position restraints to be placed on the protein being embedded in the lipid membrane, beyond the default value of 1000 kJ mol\textsuperscript{-1} nm\textsuperscript{-2} written in \texttt{posre.itp} from \texttt{pdb2gmx}~\cite{Kandt2007}. To create a new file, with stronger restraints (100000 kJ mol\textsuperscript{-1} nm\textsuperscript{-2}), invoke the \texttt{genrestr} program, choosing "Protein" when prompted:
\begin{lstlisting}
$ gmx genrestr -f KALP_newbox.gro -o strong_posre.itp -fc 100000 100000 100000
\end{lstlisting}
Add an \texttt{\#ifdef} statement to call the new \texttt{strong\_posre.itp} file in \texttt{topol.top}:
\begin{lstlisting}
; Include Position restraint file
#ifdef POSRES
#include "posre.itp"
#endif
; Strong position restraints for InflateGRO
#ifdef STRONG_POSRES
#include "strong_posre.itp"
#endif
; Include DPPC chain topology
#include "dppc.itp"
; Include water topology
#include "gromos53a6_lipid.ff/spc.itp"
\end{lstlisting}
The coordinates and topology are now ready to be processed with the InflateGRO Perl script:
\begin{lstlisting}
$ perl inflategro.pl system.gro 4 DPPC 14 system_inflated.gro 5 area.dat
\end{lstlisting}
%
The order of command-line arguments passed to InflateGRO is specific and must adhere to this exact order. The syntax is as follows:
\begin{enumerate}
\item{\texttt{system.gro}: the input coordinate file to which scaling is applied}
\item{\texttt{4}: the scaling factor applied. A value > 1 indicates inflation (expansion) and a value < 1 indicates shrinking (compression)}
\item{\texttt{DPPC}: the residue name of the lipids to which scaling is applied}
\item{\texttt{14}: the cutoff radius (in \AA) for searching for lipids to delete}
\item{\texttt{system\_inflated.gro}: the output coordinate file name}
\item{\texttt{5}: the grid spacing for calculation of area per lipid}
\item{\texttt{area.dat}: the name of the text file to which area per lipid values are printed}
\end{enumerate}
InflateGRO will print to the screen how many lipids are deleted. Note this value and update the \texttt{[molecules]} section of \texttt{topol.top} accordingly. It is also important to note that InflateGRO removes all water molecules. The user must not add back any water molecules until the lipids are completely packed around the protein. Therefore, \texttt{[molecules]} should refer only to the protein and the DPPC lipids. Do not invoke \texttt{solvate} until instructed to later in this tutorial.
The system now requires energy minimization. The\\ \texttt{minim\_inflategro.mdp} file (provided online) includes a line that reads:
\begin{lstlisting}
define = -DSTRONG_POSRES
\end{lstlisting}
%
With this command, the strong position restraints will be applied to the protein heavy atoms during all energy minimizations carried out while packing the lipids around the protein with InflateGRO.
Perform energy minimization, being sure to use explicit file names. As there will be many rounds of packing and energy minimization, relying on default GROMACS file names is not recommended. Minimize the inflated system using descriptive file names:
\begin{lstlisting}
$ gmx grompp -f minim_inflategro.mdp -c system_inflated.gro -p topol.top -r system_inflated.gro -o system_inflated_em.tpr
$ gmx mdrun -deffnm system_inflated_em
\end{lstlisting}
\begin{figure*}[h!]
\centering
\includegraphics{kalp_dppc_water_deletor}
\caption{The action of the \texttt{water\_deletor.pl} script. Water molecules with {\em z}-coordinates that fall between the O33 atoms (which define the translucent blue planes) are deleted, leaving a hydrated bilayer with no water molecules within the hydrophobic core.}
\label{kalp_lipid_water_delete}
\end{figure*}
As before, the user must reconstruct the "broken" lipids before attempting to use the coordinate file further:
\begin{lstlisting}
$ gmx trjconv -s system_inflated_em.tpr -f system_inflated_em.gro -o tmp.gro -pbc mol
$ mv tmp.gro system_inflated_em.gro
\end{lstlisting}
%
The use of \texttt{tmp.gro} is necessary as \texttt{trjconv} returns an error when trying to read from, and write to, the same file name when correcting for PBC effects.
After the first energy minimization, the lipids need to be packed around the protein. To do so, use InflateGRO to apply a scaling factor that is less than 1. Doing so moves the lipids towards the protein. It is wise to do this packing very slowly to prevent intermolecular clashes that will lead to an unstable system. A typical scaling factor during packing is 0.95. To perform the first shrinking/packing step, invoke InflateGRO as follows:
\begin{lstlisting}
$ perl inflategro.pl system_inflated_em.gro 0.95 DPPC 0 system_shrink1.gro 5 area_shrink1.dat
\end{lstlisting}
%
Note that the cutoff radius has also been reduced to 0 \AA. Such a cutoff value disables the deletion of lipids. It is only appropriate to delete lipids during inflation; if the cutoff value is retained, eventually all of the lipids (or a great many of them) will be deleted and the system will be useless. After 26 rounds of packing (shrinking the lipids inward, energy minimization, and reconstruction of "broken" lipids), an area per lipid of roughly 71 \AA\textsuperscript{2} should be attained. Since InflateGRO overestimates area per lipid slightly, this value is an indication that additional shrinking is not necessary. That is, it is not necessary to achieve the exact experimental value of area per lipid (62 \AA\textsuperscript{2}). The online tutorial provides a Bash script that automates this process, but users wishing to learn via repetition can challenge themselves to carry out this process manually and become very comfortable with proper file naming. In summary, the process of packing lipids around the KALP\textsubscript{15} peptide is illustrated in Figure~\ref{kalp_lipid_pack}.
It is now time to add water to the system using \texttt{solvate}. One issue that will likely arise is that water molecules will be added within small void volumes in the hydrophobic core of the membrane. There are several strategies one can apply to deal with this situation, but the easiest is to allow \texttt{solvate} to add however many water molecules it can and subsequently delete those that fall within the membrane core. Solvate as usual with:
\begin{lstlisting}
$ gmx solvate -cp system_shrink26_em.gro -cs spc216.gro -p topol.top -o system_solv.gro
\end{lstlisting}
The online tutorial provides \texttt{water\_deletor.pl}, which is a simple Perl script that finds water molecules within a user-defined range of {\em z}-coordinate values and deletes them. The user defines a "reference" atom, which should be an atom in the ester region of the phospholipid and a "middle" atom defining the center of the membrane along its normal (typically the {\em z}-axis). See Figure~\ref{kalp_dppc_structure_fig} for atom naming in DPPC. From these atoms, the bilayer is divided into upper and lower leaflets, such that the reference atoms define the boundaries along the {\em z}-axis within which water molecules will be deleted (Figure~\ref{kalp_lipid_water_delete}). Finally, the script needs to know how many atoms constitute a water molecule. For SPC, there are 3 atoms (OW, HW1, and HW2), so this value is passed to the \texttt{-nwater} flag. Invoke the script as follows:
\begin{lstlisting}
$ perl water_deletor.pl -in system_solv.gro -out system_solv_fix.gro -ref O33 -middle C50 -nwater 3
\end{lstlisting}
\subsubsection{Add Ions} \label{kalp_ions}
Adding ions to the solvated KALP\textsubscript{15}-DPPC system is the same as any other system. Invoke \texttt{grompp} and \texttt{genion}:
\begin{lstlisting}
$ gmx grompp -f ions.mdp -c system_solv_fix.gro -p topol.top -o ions.tpr
$ gmx genion -s ions.tpr -o system_solv_ions.gro -p topol.top -pname NA -nname CL -neutral
\end{lstlisting}
%
Choose group 15 (SOL) for replacing water molecules with Cl\textsuperscript{$-$} ions.
\subsubsection{Energy Minimization} \label{kalp_em}
Energy minimization is performed in the same manner as any other system:
\begin{lstlisting}
$ gmx grompp -f minim.mdp -c system_solv_ions.gro -p topol.top -o em.tpr
$ mdrun -deffnm em
\end{lstlisting}
Energy minimization should converge rather quickly for this system, as the lipids were progressively relaxed around the protein prior to the addition of water:
\begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
Steepest Descents converged to Fmax < 1000 in 292 steps
Potential Energy = -3.2078762e+05
Maximum force = 9.4299438e+02 on atom 130
Norm of force = 5.1295775e+01
\end{lstlisting}
\subsubsection{Equilibration} \label{kalp_equil}
Equilibration of the KALP\textsubscript{15}-DPPC system will be carried out in two phases, as described in Section~\ref{lyso_equil}, an initial NVT equilibration followed by NPT equilibration. There are several important considerations specific to membrane simulations that are relevant at this stage. First, for the purposes of thermostatting and COM motion removal, the user will need to merge the atoms of the protein and DPPC groups using \texttt{make\_ndx}, a program that generates custom groups of atoms:
\begin{lstlisting}
$ gmx make_ndx -f em.gro -o index.ndx
\end{lstlisting}
%
Merge the protein (group 1) and DPPC (group 13) index groups with: \texttt{1 | 13}, type Enter, then type \texttt{q} and Enter to quit \texttt{make\_ndx}.
In equilibrating the lysozyme system (Section~\ref{lyso_equil}), temperature coupling groups were set to:
\begin{lstlisting}
tc-grps = Protein Non-Protein
\end{lstlisting}
If the same syntax were used in the case of a membrane-protein system, the "Non-Protein" group would also contain the phospholipids. Doing so is undesirable due to the heterogeneity of the system. Lipids and water diffuse on different time scales, thus if they are treated in the same group for the purposes of thermostatting, there will be spurious contributions to the velocity scaling. As a consequence, it is common to treat the constituents of the membrane (all lipids and the embedded protein, whose diffusion depends strongly on the lipids) as a single group with respect to the thermostat. Merging the protein and DPPC lipids (above) allows the user to apply this convention. As such, the thermostat settings in \texttt{nvt.mdp} are:
\begin{lstlisting}
tcoupl = V-rescale
tc-grps = Protein_DPPC Water_and_ions
tau_t = 0.1 0.1
ref_t = 323 323
\end{lstlisting}
The "Water\_and\_ions" group is a default group available in GROMACS, encompassing all SOL molecules and (in this case) Cl\textsuperscript{$-$} ions. Note, too, that the reference temperature is set to 323 K, above the experimental phase transition temperature of DPPC (315 K). For biological applications, it is generally desirable to perform membrane simulations above the phase transition temperature of the lipid membrane, such that it is in a liquid crystalline state. The reference temperature value should be determined not only in terms of the experimental value, but also in terms of the chosen force field. There may be some degree of error in how well a given force field models the phase transition of a given lipid, and the user must take this into account. A list of reference phase transition temperatures and associated literature references is provided in the online tutorial.
Similarly, the net COM motion removal in membrane systems (and any interfacial system, in general) must be given special consideration. Due to the differences in diffusion rates, one should set separate COM motion removal groups in the \texttt{.mdp} file. As such, in the \texttt{nvt.mdp} file, set:
\begin{lstlisting}
nstcomm = 1
comm-mode = Linear
comm-grps = Protein_DPPC Water_and_ions
\end{lstlisting}
Perform NVT equilibration by calling \texttt{grompp} and \texttt{mdrun}:
\begin{lstlisting}
$ gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -n index.ndx -o nvt.tpr