<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 451/651 431/631 Data-Intensive Distributed Computing (Winter 2018) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Data-Intensive Distributed Computing</title>
<!-- Bootstrap core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<script src="js/ie-emulation-modes-warning.js"></script>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li><a href="assignments.html">Assignments</a></li>
<li class="active"><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img width="250" src="images/waterloo_logo.png" alt="University of Waterloo logo"/></div>
<h1>Software <br/><small>Data-Intensive Distributed Computing (Winter 2018)</small></h1>
</div>
<div>
<h3>Bespin</h3>
<p><a href="http://bespin.io">Bespin</a> is a software library that
contains reference implementations of "big data" algorithms in
MapReduce and Spark. It provides sample code for many of the
algorithms we'll be discussing in class and also provides starting
points for the assignments. You'll want to familiarize yourself
with the library.</p>
<h3>Linux Student CS Environment</h3>
<p>Software needed for the course can be found in
the <code>linux.student.cs.uwaterloo.ca</code> environment. We will
ensure that everything works correctly in this environment.</p>
<p><b>TL;DR.</b> Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):</p>
<pre>
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
</pre>
<p>You'll want to add the above lines to your shell config file (e.g.,
<code>.bash_profile</code>).</p>
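<p>As a quick sanity check after sourcing the config, a loop along these
lines (a minimal sketch, not part of the official setup) reports where each
tool resolves on your <code>PATH</code>. The tool names are the standard
launcher commands for each package; if any line says "not found", the
<code>PATH</code> export above probably did not take effect in your current
shell.</p>
<pre>
# Report where each course tool resolves, or flag it as missing.
for tool in java scala hadoop spark-submit mvn; do
  location=$(command -v "$tool")
  if [ -n "$location" ]; then
    echo "$tool: $location"
  else
    echo "$tool: not found on PATH"
  fi
done
</pre>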
<p><b>Gory Details.</b> For the course we need Java, Scala, Hadoop,
Spark, and Maven. Java is already available in the default user
environment (but we need to point to the right version). The rest of
the packages are installed in <code>/u3/cs451/packages/</code>. The
directories <code>scala</code>, <code>hadoop</code>, <code>spark</code>,
and <code>maven</code> are actually symlinks to specific
versions. This is so that we can transparently change the links to
point to different versions if necessary without affecting downstream
users. Currently, the versions are:</p>
<ul>
<li>Java: 1.8.0_151</li>
<li>Scala: 2.11.8</li>
<li>Hadoop: 2.7.5</li>
<li>Spark: 2.1.1</li>
<li>Maven: 3.3.9</li>
</ul>
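<p>To see which version each symlink currently points to, something like the
following should work in the student environment (a sketch that assumes GNU
<code>readlink</code>, which is standard on Linux):</p>
<pre>
# Resolve each package symlink to the versioned directory behind it.
for pkg in scala hadoop spark maven; do
  echo "$pkg: $(readlink -f "/u3/cs451/packages/$pkg")"
done
</pre>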
</div>
<div>
<h3>Installing Software Locally</h3>
<p>You may wish to install all necessary software packages locally on
your own machine. We provide basic installation instructions here,
but the course staff cannot provide technical support due to the size of
the class and the idiosyncrasies of individual systems. We will be
responsible for making sure everything works properly in the Linux
Student CS Environment (above), but if you want to install everything on your
own machine for convenience, you're on your own.</p>
<p>Both Hadoop and Spark work fine on Mac OS X and Linux, but may be
difficult to get working on Windows. Note that to run Hadoop and Spark
on your local machine comfortably, you'll need at least 4 GB of memory
and plenty of disk space (at least 10 GB).</p>
<p>You'll also need Java (JDK 1.8), Scala (use Scala 2.11.x), and
Maven (any reasonably recent version).</p>
<p>The versions of the packages installed
on <code>linux.student.cs.uwaterloo.ca</code> are as follows:</p>
<ul>
<li><a href="http://mirror.its.dal.ca/apache/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz">Hadoop 2.7.5</a></li>
<li><a href="https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz">Spark 2.1.1</a></li>
</ul>
<p>Download the above packages, unpack the tarballs, add their
respective <code>bin/</code> directories to your path (and your shell
config), and you should be good to go.</p>
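<p>As a rough sketch, the steps for Spark might look like the following.
The install location <code>$HOME/packages</code> is an arbitrary choice for
illustration, and the unpacked directory name is assumed to match the tarball
name; Hadoop follows the same pattern with its own tarball.</p>
<pre>
# Download and unpack Spark 2.1.1 into an illustrative directory.
mkdir -p "$HOME/packages"
cd "$HOME/packages"
curl -L -O https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
tar -xzf spark-2.1.1-bin-hadoop2.7.tgz

# Put Spark's bin/ directory on the path (add this line to your shell config too).
export PATH="$HOME/packages/spark-2.1.1-bin-hadoop2.7/bin:$PATH"
</pre>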
<p>Alternatively, you can install the various packages using a
package manager, e.g., <code>apt-get</code>, MacPorts, etc. However,
make sure you get versions that match the ones listed above.</p>
</div>
<div>
<h3>Altiscale Cluster</h3>
<div style="float:right; padding-left:25px"><img src="images/altiscale-logo.png" alt="Altiscale Logo"/></div>
<p>In addition to running "toy" Hadoop on a single machine (which
obviously defeats the point of a distributed framework), we're going
to be playing with a modest cluster thanks to the generous support of
Altiscale, which is a "Hadoop-as-a-service" provider. You'll be
getting an email directly from Altiscale with account information.</p>
<p>Follow the instructions from the email:</p>
<ol>
<li>Set up your web profile at <a href="http://portal.altiscale.com/">Altiscale Portal</a>.</li>
<li>Follow these instructions to upload your ssh keys: <a href="https://documentation.altiscale.com/uploading-public-key">Uploading and Managing Your Public Key</a></li>
<li>Follow these instructions to ssh into the "workspace": <a href="https://documentation.altiscale.com/connecting-with-ssh">Connecting to the Workbench Using SSH</a>. The workspace is the node from which you submit MapReduce/Spark jobs; it's also where you'll check out code, inspect HDFS data, etc. In class I sometimes refer to this as the "submit node".</li>
<li>Follow these instructions to access the cluster webapps: <a href="https://documentation.altiscale.com/accessing-web-uis-socks">Accessing Web UIs Through a SOCKS Proxy</a>. In particular, you'll need to access the Resource Manager webapp to examine the status of your running jobs at <a href="http://rm-ia.s3s.altiscale.com:8088/cluster/"><code>http://rm-ia.s3s.altiscale.com:8088/cluster/</code></a>.</li>
</ol>
<p><b>The TL;DR version.</b> Configure your <code>~/.ssh/config</code> file as follows:</p>
<pre>
Host altiscale
User YOUR_USERNAME
Hostname ia.z42.altiscale.com
Port 1763
IdentityFile ~/.ssh/id_rsa
Compression yes
ServerAliveInterval 15
DynamicForward localhost:1080
TCPKeepAlive yes
Protocol 2,1
</pre>
<p>And you should be able to ssh into the workspace:</p>
<pre>
ssh altiscale
</pre>
<p>That should do it!</p>
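<p>The <code>DynamicForward</code> line in this config opens a SOCKS proxy on
<code>localhost:1080</code> for as long as the ssh session is connected. As a
hedged example, assuming <code>curl</code> is installed on your machine, you
could check that the Resource Manager webapp is reachable through the proxy
from a second terminal:</p>
<pre>
# Fetch the Resource Manager page through the SOCKS proxy opened by "ssh altiscale".
# --socks5-hostname makes the proxy resolve the hostname on the cluster side.
curl --socks5-hostname localhost:1080 http://rm-ia.s3s.altiscale.com:8088/cluster/
</pre>
<p>A browser configured to use the same SOCKS proxy should reach the same page.</p>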
<p><b>Running Spark on Altiscale (the TL;DR version).</b> Add
the following lines to your <code>~/.bash_profile</code> to point at
the correct version of Spark:</p>
<pre>
export SPARK_HOME=/opt/spark-beta
export SPARK_CONF_DIR=/etc/spark-beta
export PATH=$PATH:/opt/spark-beta/bin
</pre>
<p>For additional details, consult the
<a href="https://documentation.altiscale.com/spark-2-0-with-altiscale">Altiscale
Spark documentation</a>.</p>
</div>
<div style="padding-bottom: 100px"></div>
</div><!-- /.container -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>