{{Interwiki redirect|wikitech:Nova Resource:Tools/Help}}
== Access ==

=== Create an account ===

The first step is to [[wikitech:special:createaccount|get an account]] on Wikitech, which is the general interface for everything Labs. On the account creation form, the field "Instance shell account name" will be your Unix username on all Labs projects. Once you have created an account, you will be added to the list of users awaiting approval for shell access, which you can see [[wikitech:Category:Shell_Access_Requests|here]].

=== Set up your SSH key ===

In order to access Labs servers using SSH, you also need to provide a public SSH key on the [[wikitech:Special:Preferences|'OpenStack' tab of your Wikitech preferences]] once you have an account.
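
If you are on Linux or Mac OS X, a key pair can be generated from a terminal with OpenSSH's <code>ssh-keygen</code> (a minimal sketch; by default the key is written to <code>~/.ssh/id_rsa</code> and the matching public key to <code>~/.ssh/id_rsa.pub</code>):

 $ '''ssh-keygen -t rsa'''

Paste the contents of the <code>.pub</code> file (the ''public'' key) into your Wikitech preferences; the private key never leaves your machine.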

==== Generating a key on Windows with PuTTY ====

# Open [http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html PuTTYgen]
# Select an SSH-2 RSA key
# Click the Generate button
# Move your mouse around until the progress bar is full
# Type in a passphrase (you will need to remember this) and confirm it
# Save the private key and public key onto your local machine
# Right-click in the text field labelled 'Public key for pasting into OpenSSH authorized_keys file' and copy its contents
# Paste this into the [[wikitech:Special:Preferences|'OpenStack' tab of your Wikitech preferences]]

=== Getting access to the Tool Labs ===

Once you have a Labs account, you can request access to the tools project (you may look at [[wikitech:Nova_Resource:Tools|the project page]] to see a list of current admins).

The Tool Labs has a public IP address; once you have a Tool Labs account, you can connect using SSH and SFTP to <code>tools-login.wmflabs.org</code>. You will need to use the ''shell'' account name you provided when creating your Wikitech account, and the private key matching the public key you supplied, for authentication.
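
For example, connecting from a terminal might look like this (replace ''shellname'' with your shell account name; the key path is only an illustration and depends on where you saved your private key):

 $ '''ssh -i ~/.ssh/id_rsa ''shellname''@tools-login.wmflabs.org'''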

== Tool account ==

The central concept of the Tool Labs' organization is the ''tool account''; at its core, this is a unix uid-gid pair named <code>local-''toolname''</code> which is intended to run the actual tool. Maintainers may have more than one tool account, and tool accounts may have more than one maintainer.

Right now, you have to request a tool account from one of the project administrators; the plan is to make an interface available on Wikitech where tool maintainers will be able to create them as needed.

The unix group has as its members the tool account itself, as well as the user accounts of the maintainers of the tool. Every member of that group has the authorization to '''sudo''' to the tool account.

Along with the unix uid, the following resources are provided by default for each tool:
* A home directory on shared storage: <code>/data/project/''toolname''</code>
* A web URI mapped to its <code>~/public_html</code>: <code>http://tools.wmflabs.org/''toolname''</code>
* A MySQL database for local use (the credentials for which are stored in <code>~/.my.cnf</code>)
* Access to the continuous and task queues of the compute grid (explained below)

;Hint: As a convenience, tool maintainers can switch to the tool account with:
maintainer@tools-login:~$ '''become ''toolname'''''
local-''toolname''@tools-login:~$

== Grid engine ==

Every non-trivial task performed by a tool should be dispatched through the grid engine, so that a suitable place to run it, with sufficient resources, can be found. Grid Engine is a highly flexible system for assigning resources to jobs, including parallel processing.

You can find documentation [http://gridscheduler.sourceforge.net/ on the website]; you may wish to pay particular attention to the <code>qsub</code>, <code>qdel</code> and <code>qacct</code> commands, which are the most important to users.

=== How ===

The basic principle of running jobs is fairly straightforward:
*You '''''submit''''' a '''''job''''' to a work queue from a submission server (-login<!--, -dev-->) or the web servers;
*The gridengine master finds a suitable execution host to run the job on, and starts it there once resources are available; then
*As it runs, your job sends its output and errors to files until it completes or is aborted.

You can submit jobs with <code>qsub</code>, modify some of their settings while they are waiting or running with <code>qalter</code>, get information on your queued and running jobs with <code>qstat</code>, and abort or cancel them with <code>qdel</code>. Those commands are very flexible, but a little complex at first &ndash; you might prefer to use the simplified alternatives [[#Simple utilities|below]].

There are a few caveats to keep in mind when submitting a job:
*You do not normally control which execution host will eventually run it, and should therefore only access directories that are shared between all hosts (specifically, <code>/data/project</code> and <code>/home</code>)
*Your job's memory usage has a ''hard'' limit it cannot grow beyond. By default, that is 256MB, but you can request more (or less) with <code>qsub</code>'s <code>-l h_vmem=''memory''</code> or <code>jsub</code>'s <code>-mem ''memory''</code> option. Bear in mind that jobs that request more memory may be penalized in priority and may have to wait longer until sufficient resources are available before being run (see the sketch after this list).
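
As a sketch (the script path, job name and memory value are only examples), a one-off submission requesting 512MB of memory could look like this:

 local-''toolname''@tools-login:~$ '''qsub -l h_vmem=512M -N mytask /data/project/''toolname''/bin/mytask.sh'''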

=== Simple utilities ===

For most tasks, helper scripts are provided to abstract away some of the complexities of using the grid engine. Almost all use scenarios are covered with reasonable defaults by the <code>jsub</code> script:

<code>'''jsub ''[options...] program [args...]'''''</code>

:Options include many (but not all) qsub options, along with:
:;<code>-stderr</code>: Send errors that occur during submission to stderr rather than to the error output file (errors produced while ''running'' the script always go to the error file).
:;<code>-mem ''value''</code>: Request ''value'' amount of memory for the job (where ''value'' is a number suffixed by 'k', 'm' or 'g').
:;<code>-once</code>: Only start one job with this job's name (see below); fail if another one is already running or queued.
:;<code>-continuous</code>: Start a self-restarting job on the continuous queue (default if invoked as ''jstart''). Please see the section on continuous jobs below.
:Some of the more useful <code>qsub</code> options supported are:
:;<code>-i</code>, <code>-o</code>, and <code>-e</code>: Select the files used for standard input, output and error of the job, respectively. By default, <code>jsub</code> will append stdout and stderr to the files <code>''jobname''.out</code> and <code>''jobname''.err</code> in the tool account's home directory, and will not provide standard input.
:;<code>-j</code>: Send standard output and error together to the output file.
:;<code>-sync y</code>: Normally, <code>jsub</code> queues up the job and returns immediately; this option makes it wait until the job has completed instead.
:;<code>-cwd</code>: Start the script in the same directory you invoked <code>jsub</code> from.
:;<code>-N ''jobname''</code>: Pick a different job name (see below).

By default, jobs are allowed 256MB of memory; you can request more (or less) with the <code>-mem</code> option, but keep in mind that a job that requests more resources may be penalized in its priority and may have to wait longer before being run.
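
For instance, a submission that names the job, requests 350MB of memory, and merges errors into the output file might look like this (the job name, log file and script path are illustrative):

 local-''toolname''@tools-login:~$ '''jsub -N mytask -mem 350m -j y -o ~/mytask.log php ~/task/mytask.php'''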

==== Job names ====

By default, jobs have the same name as the program, minus extensions. (For instance, if you had a program named <code>foobot.pl</code> which you started with <code>jsub</code>, the job's name would be ''foobot''.) You can pick a different name for the job when starting it (with the <code>-N</code> option of <code>qsub</code> and <code>jsub</code>); this name identifies the job in status listings, and can also be used to control it.

It's important to note that you can have more than one job, running or queued, bearing the same name. Some of the utilities that accept a job name may not behave as expected in those cases.

==== Simple, one-off job ====

The simplest scenario is when you want to run a job on demand that has a finite duration (at intervals from cron, for instance, or from a web tool or the command line).

$ '''jsub''' ''program-or-script''

This schedules the job to be run as soon as possible and sends any output from the job to files in your home directory; this is done asynchronously, in the background. If you need to wait until the job has completed (for instance, to do further processing on its output), you can add the <code>'''-sync y'''</code> option to the <code>jsub</code> command.

If you need to make certain that the job isn't running multiple times (such as when you invoke it from a crontab), you can add the <code>'''-once'''</code> option. If the job was already running or queued, it will simply mark the failed attempt in the error file and return immediately.

$ '''jsub''' -once -N jobname php /data/project/tool/task/execute.php
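
For instance, a crontab entry that runs this task at the top of every hour could look like the following (note that cron's restricted <code>PATH</code> may require the full path to <code>jsub</code>; adjust as needed):

 0 * * * * jsub -once -N jobname php /data/project/tool/task/execute.php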

==== Continuous tasks (such as bots) ====

Continuous tasks have a dedicated queue, ''continuous'', which has a slightly different setup:
*Jobs started on that queue are automatically restarted if they, or the node they run on, crash
*In case of outage or lack of resources, they will be stopped and restarted automatically on a working node
*Only tool accounts can start continuous jobs

The queue will not restart jobs that exited on their own (i.e., were not killed) unless they are wrapped in a script that does so; starting a job with the <code>-continuous</code> option of <code>jsub</code> adds such a wrapper automatically, restarting the job until it exits normally with an exit value of zero, indicating completion.

One would normally start continuous jobs with the <code>-once</code> option as well, so that they can be managed reliably with the <code>job</code> and <code>jstop</code> utilities.

For convenience, there is a utility to start a continuous bot with reasonable default options:
<code>'''jstart ''script'''''</code>
(which is equivalent to <code>jsub -once -continuous ''script''</code> and accepts the same options). This starts the ''script'' program in continuous mode if it is not already running, and makes certain that it is kept running.
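
For example, keeping a bot running from its tool account might look like this (the job name and script path are purely illustrative):

 local-''toolname''@tools-login:~$ '''jstart -N mybot python /data/project/''toolname''/bot/mybot.py'''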

==== Job Status ====

Once your job has been submitted to the grid using one of the above commands, you will receive output similar to the example below, which includes the job id and job name.

Your job 120 ("xbot") has been submitted

You can see the status of all your running and pending jobs with the <code>qstat</code> command. If you know that your job can only have one instance running (such as when you use the <code>-once</code> option when starting it) you can also use the <code>job</code> command to get its job id (or a more verbose status with <code>-v</code>). The latter is particularly useful from scripts or web services.

{{collapse top|example output for '''qstat''' and '''job'''}}
For instance:
local-xbot@tools-login:~$ '''qstat'''
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
120 0.50000 xbot local-xbot r 04/01/2013 21:00:00 continuous@tools-exec-01.pmtpa 1

This reports that you have one job, with id 120 and name '''xbot''', that is currently running (<code>r</code> in the state column). You could also have used <code>job</code>:

local-xbot@tools-login:~$ '''job xbot'''
120
local-xbot@tools-login:~$ '''job -v xbot'''
Job 'xbot' has been running since 2013-04-01T21:00:00 as id 120

{{collapse bottom}}

Once you have the job id, you can find out more about a job by passing its number to the <code>qstat</code> command with the <code>-j</code> parameter.

{{collapse top|example output for qstat -j 990}}

local-toolname@tools-login:~$ qstat -j 990
==============================================================
job_number: 990
exec_file: job_scripts/990
submission_time: Wed Apr 13 08:32:39 2013
owner: local-toolname
uid: 40005
group: local-toolname
gid: 40005
sge_o_home: /data/project/toolname/
sge_o_log_name: local-toolname
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin
sge_o_shell: /bin/bash
sge_o_workdir: /data/project/toolname
sge_o_host: tools-login
account: sge
stderr_path_list: NONE:NONE:/data/project/toolname//taskname.err
hard resource_list: h_vmem=256m
mail_list: local-toolname@tools-login.pmtpa.wmflabs
notify: FALSE
job_name: epm
stdout_path_list: NONE:NONE:/data/project/toolname//taskname.out
jobshare: 0
hard_queue_list: task
env_list:
script_file: /data/project/toolname/taskname.py
usage 1: cpu=00:21:08, mem=158.09600 GBs, io=0.00373, vmem=127.719M, maxvmem=127.723M

{{collapse bottom}}

==== Stopping a running job ====
To stop a running job (or prevent it from running if it has not already started), you can use the <code>'''qdel ''job_number'''''</code> command. The job number was output when the job was started, or you can get it from the <code>qstat</code> command.

If you started the job with the <code>jstart</code> command, or you know there is only one job with that name, then you can use the <code>'''jstop ''jobname'''''</code> utility command.
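
Continuing the earlier example (job id 120, name '''xbot'''), either of the following stops the job:

 local-xbot@tools-login:~$ '''qdel 120'''
 local-xbot@tools-login:~$ '''jstop xbot'''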

== Web services ==

Every tool account has a web interface made available (though, in the case of bots with no web interactivity, you may simply wish to have a static page that describes the tool, or a simple status report). User accounts do not and cannot have a web directory in <code>/home</code>.

=== Published directories ===

Each tool account has two corresponding URIs that are automatically made available from two directories in its home:
;http://tools.wmflabs.org/toolname/ :which maps to the tool's <code>~/public_html/</code>
;http://tools.wmflabs.org/toolname/cgi-bin :which maps to the tool's <code>~/cgi-bin/</code>

The latter directory will attempt to run the files it contains as CGI scripts rather than display them. In addition, files in either directory that end with the <code>.php</code> or <code>.php5</code> extensions will be run as PHP CGI scripts.

All CGI scripts must be marked ''executable'', and are run with the permissions of '''''the user account that owns the script'''''. In almost all cases, you want to make certain that they are owned by the tool account.
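
As a minimal sketch (the file name <code>hello.sh</code> is only an example), a CGI script placed in the tool's <code>~/cgi-bin/</code> could look like this:

 #! /bin/bash
 # Minimal CGI: print the HTTP header, a blank line, then the body.
 echo "Content-Type: text/plain"
 echo
 echo "Hello from this tool"

Create it as the tool account (for instance after <code>become ''toolname''</code>) so it is owned by it, and mark it executable:

 local-''toolname''@tools-login:~$ '''chmod +x ~/cgi-bin/hello.sh'''

It should then be reachable at <code>http://tools.wmflabs.org/''toolname''/cgi-bin/hello.sh</code>.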

The web server allows overrides of <code>AuthConfig FileInfo Options=IncludesNOEXEC</code> from <code>.htaccess</code>.

=== Cookies ===

Since all tools in the Labs reside under the same domain, you should prefix the name of any cookie you set with your tool's name. In addition, you should be aware that cookies you set may be read by every other web tool your users visit.

Accordingly, you should avoid storing privacy-related or security information in cookies. A simple workaround is to store session information in a database, and use the cookie as an opaque key to that information. Additionally, you can explicitly set a path in a cookie to limit its applicability to your tool; most clients ''should'' obey the Path directive properly.

=== Logs ===

The access logs for your tool's web interface are placed in the tool account's <code>~/access.log</code>, in the Apache ''common'' log format. Please note that the web logs are anonymized such that the user's IP address appears to be that of the local host. In general, the privacy policy will not allow logging of personally identifiable information (including IP addresses) by tool maintainers; special permission from Foundation legal counsel would be required to get that information.

Error logs, because of limitations of the Apache web server, are not made directly available to tool maintainers. There is a workaround in place for PHP, which allows per-user logging (PHP error logs are placed in the tools account's <code>~/php_error.log</code>), but until a newer version of Apache can be deployed it is recommended that you use your language's facilities to log errors to a file under the tool account's home.
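
For example, to watch both logs while testing a tool:

 local-''toolname''@tools-login:~$ '''tail -f ~/access.log ~/php_error.log'''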

In particular, however, this means that if you have a CGI script which is unable to start, you will not be able to see the error preventing it from running without help from a Tool Labs admin. There are a few common errors you can check for which cover most cases (a quick check sequence is sketched after this list):
* The CGI's file is not owned by the tool account
* The CGI's file does not have its execute bits set
*:(You can use the <code>chmod</code> command to set the script as executable)
* The CGI is a script and does not start with a Unix "shebang" invocation, or it points to the wrong path:
*:A unix "shebang" is the first line of a script that specifies the program meant to execute that script. It has the form
*::<code>'''#! /path/to/interpreter'''</code>
*:Where the path is, for instance, <code>/usr/bin/perl</code> for perl scripts. You can check the path to a language interpreter by using the <code>which</code> command:
*::<code>maintainer@tools-login:~$ '''which python'''</code>
*:would output the path to the python interpreter.
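
A quick way to check all three points at once, using the illustrative <code>hello.sh</code> script from above:

 local-''toolname''@tools-login:~$ '''ls -l ~/cgi-bin/hello.sh'''    # owner and execute bits
 local-''toolname''@tools-login:~$ '''head -1 ~/cgi-bin/hello.sh'''  # shebang line
 local-''toolname''@tools-login:~$ '''which python'''                # interpreter path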

== Database access ==

Every tool account automatically gets a database for general usage on the project itself. The information and credentials to that database are put automatically in the account's <code>~/.my.cnf</code> on creation. Full control over that database is granted to the tool account (with grant option).
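
A minimal sketch of using those credentials, assuming <code>~/.my.cnf</code> contains everything <code>mysql</code> needs to connect (the database name is whatever was assigned to your tool; check it with <code>SHOW DATABASES;</code>):

 local-''toolname''@tools-login:~$ '''mysql'''
 mysql> SHOW DATABASES;
 mysql> USE ''databasename'';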

Normally, tools do not have access to create new databases dynamically; please consult with a project admin if you require that functionality.

''(Soon)'' Tool accounts will also be granted access to the production database replicas, and may create and manage databases there as needed. Because of technical limitations, actual revision text is not available on the replicas (but it can be fetched fairly efficiently with the API).
