To avoid scanning the replay open list on every call to
ptlrpc_free_committed(), the fix for LU-2613 (4322e0f9) changed
ptlrpc_free_committed() to skip the open list unless the import
generation has changed. That introduced a race which could cause
a closed open to be replayed:
1. The application calls ll_close_inode_openhandle() -> mdc_close()
to close the file; rq_replay is cleared, but the open request is
still on the imp_committed_list.
2. Before md_clear_open_replay_data() is called for the close, the
client starts replay, and the closed open is replayed by
mistake.
3. The open replay interpret callback (mdc_replay_open) can then race
with mdc_clear_open_replay_data() at the end.
This patch fixes ptlrpc_free_committed() to make sure the open
list is scanned on recovery, which prevents the closed open request
from being replayed.
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5507
Reviewed-on: http://review.whamcloud.com/12667
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some customers' systems, the kernel was compiled with HZ defined as
100 instead of 1000. This improves performance for HPC applications.
However, to use these systems with Lustre, customers have to rebuild
Lustre for the kernel because Lustre uses the HZ constant
directly.
Since kernel 2.6.21, some non-HZ-dependent timing APIs have become
non-inline functions, which can be used in Lustre code to replace
direct HZ access.
These kernel APIs include:
jiffies_to_msecs()
jiffies_to_usecs()
jiffies_to_timespec()
msecs_to_jiffies()
usecs_to_jiffies()
timespec_to_jiffies()
And here are some samples of the replacement:
HZ -> msecs_to_jiffies(MSEC_PER_SEC)
n * HZ -> msecs_to_jiffies(n * MSEC_PER_SEC)
HZ / n -> msecs_to_jiffies(MSEC_PER_SEC / n)
n / HZ -> jiffies_to_msecs(n) / MSEC_PER_SEC
n / HZ * 1000 -> jiffies_to_msecs(n)
This patch replaces direct HZ access in the Lustre modules.
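The sample replacements above can be sanity-checked with a small
user-space model. This is only a sketch: model_msecs_to_jiffies() and
model_jiffies_to_msecs() are hypothetical stand-ins that take HZ as a
run-time parameter and assume that HZ divides MSEC_PER_SEC evenly. They
are not the kernel implementations.

```c
#define MSEC_PER_SEC 1000UL

/*
 * Hypothetical user-space model of msecs_to_jiffies().
 * Assumption: hz divides MSEC_PER_SEC evenly (e.g. 100, 250, 1000).
 */
static unsigned long model_msecs_to_jiffies(unsigned long msecs,
					    unsigned long hz)
{
	unsigned long msec_per_jiffy = MSEC_PER_SEC / hz;

	/* Round up so a non-zero timeout never becomes zero jiffies. */
	return (msecs + msec_per_jiffy - 1) / msec_per_jiffy;
}

/* Hypothetical user-space model of jiffies_to_msecs(), same assumption. */
static unsigned long model_jiffies_to_msecs(unsigned long j,
					    unsigned long hz)
{
	return j * (MSEC_PER_SEC / hz);
}
```

With this model, "HZ -> msecs_to_jiffies(MSEC_PER_SEC)" yields 100 jiffies
when hz is 100 and 1000 jiffies when hz is 1000, so the same source line
gives a one second timeout on either kernel.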
Signed-off-by: Jian Yu <jian.yu@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5443
Reviewed-on: http://review.whamcloud.com/12052
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Extend the llite layer to support specifying individual target
OSTs. Only regular files are supported; directory support will be
implemented later in a separate project. With this change, a file
could have, for example, an OST index layout of 2,4,5,9,11.
Duplicate indices are eliminated automatically. Calculate the max
easize from ld_active_tgt_count instead of ld_tgt_count. However,
this may introduce problems when the OSTs are in recovery, because
an insufficient buffer may be allocated to store the EA.
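As an illustration of the duplicate elimination only, a minimal sketch
follows. dedup_ost_indices() is a hypothetical helper, not the function
used by llite, and keeping the first occurrence of each index in its
original order is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: remove duplicate OST indices in place, keeping
 * the first occurrence of each index. Returns the new element count.
 * Quadratic, which is acceptable for short index lists.
 */
static size_t dedup_ost_indices(uint32_t *idx, size_t count)
{
	size_t kept = 0;

	for (size_t i = 0; i < count; i++) {
		size_t j;

		for (j = 0; j < kept; j++)
			if (idx[j] == idx[i])
				break;
		if (j == kept)		/* not seen before, keep it */
			idx[kept++] = idx[i];
	}
	return kept;
}
```

Given the input 2,4,4,5,9,2,11 this leaves 2,4,5,9,11 in the first five
slots.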
Signed-off-by: Jian Yu <jian.yu@intel.com>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4665
Reviewed-on: http://review.whamcloud.com/9383
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Originally, linkea was only used by the Lustre server code,
so it was removed from the upstream client. Now it needs
to be restored for client work that uses this infrastructure.
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pool-related code is inconsistent about the length of pool
names. Creating a pool with a name of length 16 and setting it on
a directory will succeed. However, creating a file under
that directory will fail.
This patch rejects any pool name whose length is greater than or
equal to 16. It also changes LOV_MAXPOOLNAME from 16 to 15, which
might invalidate some LLOG records of OST pools with 16-byte names.
This is not a problem, since invalid LLOG records are simply ignored,
and OST pools with 16-byte names do not work well on the old
versions anyway. There will be an inconsistency if only some of
the servers have this patch. But it is safe to assume that this
is not a normal configuration.
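The length rule can be sketched as below. pool_name_check() is a
hypothetical validator written for this message, and returning
-ENAMETOOLONG is an assumption about the error code; only the
LOV_MAXPOOLNAME value of 15 comes from this patch.

```c
#include <errno.h>
#include <stddef.h>

#define LOV_MAXPOOLNAME 15	/* value after this patch */

/*
 * Hypothetical validator: returns 0 if the name is at most
 * LOV_MAXPOOLNAME characters long, -ENAMETOOLONG otherwise.
 */
static int pool_name_check(const char *name)
{
	size_t len = 0;

	/* Stop one character past the limit; never scan a long name fully. */
	while (len <= LOV_MAXPOOLNAME && name[len] != '\0')
		len++;
	return len > LOV_MAXPOOLNAME ? -ENAMETOOLONG : 0;
}
```

A 15-character name passes, while a 16-character name is refused up
front instead of failing later at file creation time.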
Signed-off-by: Li Xi <lixi@ddn.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5054
Reviewed-on: http://review.whamcloud.com/10306
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Testing multi-threaded single shared file write performance has shown
the inode mutex to be a limiting factor when using the
generic_file_write_iter function. To work around this bottleneck, this
change replaces the locked version of that call with the lockless
version, specifically, __generic_file_write_iter.
In order to maintain POSIX consistency, Lustre must now employ its
own locking mechanism in the higher layers. Currently writes are
protected using the lli_write_mutex in the ll_inode_info structure.
To protect against simultaneous write and truncate operations, since
we no longer take the inode mutex during writes, we must down the
lli_trunc_sem semaphore.
Unfortunately, this change by itself does not garner any performance
benefits. Using FIO on a single machine with 32 GB of RAM, write
performance tests were run with and without this change applied; the
results are below:
+---------+-----------+---------+--------+--------+
|     fio v2.0.13     |  Write Bandwidth (KB/s)   |
+---------+-----------+---------+--------+--------+
| # Tasks | GB / Task |  Test 1 | Test 2 | Test 3 |
+---------+-----------+---------+--------+--------+
|    1    |    64     |  452446 | 454623 | 457653 |
|    2    |    32     |  850318 | 565373 | 602498 |
|    4    |    16     | 1058900 | 463546 | 529107 |
|    8    |     8     | 1026300 | 468190 | 576451 |
|   16    |     4     | 1065500 | 503160 | 462902 |
|   32    |     2     | 1068600 | 462228 | 466963 |
|   64    |     1     |  991830 | 556618 | 557863 |
+---------+-----------+---------+--------+--------+
* Test 1: Lustre client running 04ec54f. File per process write
workload. This test was used as a baseline for what we
_could_ achieve in the single shared file tests if the
bottlenecks were removed.
* Test 2: Lustre client running 04ec54f. Single shared file
workload, each task writing to a unique region.
* Test 3: Lustre client running 04ec54f + this patch. Single shared
file workload, each task writing to a unique region.
In order to garner any real performance benefits out of a single
shared file workload, the lli_write_mutex needs to be broken up into a
range lock. That would allow write operations to unique regions of a
file to be executed concurrently. This work is left to be done in a
follow-up patch.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-1669
Reviewed-on: http://review.whamcloud.com/6672
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Testing has shown the ll_inode_info's lli_write_mutex to be a
limiting factor for single shared file write performance when using
many writing threads on a single machine. Even if each thread is
writing to a unique portion of the file, the lli_write_mutex
prevents more than a single thread from writing to the file at any
one time.
This change attempts to remove this bottleneck by replacing the
mutex with a range lock. This should allow multiple threads to write
to a single file simultaneously iff the threads are writing to unique
regions of the file.
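The idea behind a range lock can be sketched with a simplified,
single-threaded model. Everything here is hypothetical: the names, the
fixed-size table and the try-lock semantics are illustrative
assumptions, and the model is not the range lock added by this patch,
which must block on conflict and be safe under concurrency.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_HELD 8	/* arbitrary capacity for this sketch */

struct byte_range {
	unsigned long start;	/* first byte covered */
	unsigned long end;	/* last byte covered, inclusive */
};

struct range_table {
	struct byte_range held[MAX_HELD];
	size_t count;
};

/* Two inclusive ranges overlap if each starts before the other ends. */
static bool ranges_overlap(const struct byte_range *a,
			   const struct byte_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* Take [start, end]; fails on overlap with a held range or a full table. */
static bool range_trylock(struct range_table *t, unsigned long start,
			  unsigned long end)
{
	struct byte_range want = { start, end };

	if (t->count == MAX_HELD)
		return false;
	for (size_t i = 0; i < t->count; i++)
		if (ranges_overlap(&t->held[i], &want))
			return false;
	t->held[t->count++] = want;
	return true;
}

/* Release a previously taken range; returns false if it was not held. */
static bool range_unlock(struct range_table *t, unsigned long start,
			 unsigned long end)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->held[i].start == start && t->held[i].end == end) {
			/* Fill the hole with the last entry. */
			t->held[i] = t->held[--t->count];
			return true;
		}
	}
	return false;
}
```

In this model, writers to [0, 4095] and [4096, 8191] both succeed, while
a writer to [4000, 5000] is refused until both of those ranges are
released. A single mutex would have serialized all three.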
Performance testing shows this change to garner a significant
performance boost to write bandwidth. Using FIO on a single machine
with 32 GB of RAM, write performance tests were run with and without
this change applied; the results are below:
+---------+-----------+---------+--------+--------+--------+
|     fio v2.0.13     |       Write Bandwidth (KB/s)       |
+---------+-----------+---------+--------+--------+--------+
| # Tasks | GB / Task |  Test 1 | Test 2 | Test 3 | Test 4 |
+---------+-----------+---------+--------+--------+--------+
|    1    |    64     |  452446 | 454623 | 457653 | 463737 |
|    2    |    32     |  850318 | 565373 | 602498 | 733027 |
|    4    |    16     | 1058900 | 463546 | 529107 | 976284 |
|    8    |     8     | 1026300 | 468190 | 576451 | 963404 |
|   16    |     4     | 1065500 | 503160 | 462902 | 830065 |
|   32    |     2     | 1068600 | 462228 | 466963 | 749733 |
|   64    |     1     |  991830 | 556618 | 557863 | 710912 |
+---------+-----------+---------+--------+--------+--------+
* Test 1: Lustre client running 04ec54f. File per process write
workload. This test was used as a baseline for what we
_could_ achieve in the single shared file tests if the
bottlenecks were removed.
* Test 2: Lustre client running 04ec54f. Single shared file
workload, each task writing to a unique region.
* Test 3: Lustre client running 04ec54f + I0023132b. Single shared
file workload, each task writing to a unique region.
* Test 4: Lustre client running 04ec54f + this patch.
Single shared file workload, each task writing to a unique
region.
Direct IO does not use the page cache like normal IO, so
concurrent direct IO reads of the same pages are not safe.
As a result, direct IO reads must take the range lock
in ll_file_io_generic, otherwise they will attempt to work
on the same pages and hit assertions like:
(osc_request.c:1219:osc_brw_prep_request())
ASSERTION( i == 0 || pg->off > pg_prev->off ) failed:
i 3 p_c 10 pg ffffea00017a5208 [pri 0 ind 2771] off 16384
prev_pg ffffea00017a51d0 [pri 0 ind 2256] off 16384
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Patrick Farrell <paf@cray.com>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-1669
Reviewed-on: http://review.whamcloud.com/6320
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6227
Reviewed-on: http://review.whamcloud.com/14385
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Alexander Boyko <alexander.boyko@seagate.com>
Reviewed-by: Hiroya Nozaki <nozaki.hiroya@jp.fujitsu.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Earlier, a bunch of interval handling code was removed since it
wasn't used by the upstream client. Now some of it is needed again
for the client code, so this patch restores what is needed.
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While clients will resend client->server RPCs, servers would not
resend server->client RPCs such as LDLM callbacks (blocking
or completion callbacks/ASTs). This could result in clients being
evicted from the server if blocking callbacks were dropped by the
network (a failed router or lossy network) and the client did not
cancel the requested lock in time.
In order to fix this problem, this patch adds the ability to resend
LDLM callbacks from the server, which gives the client a chance to
respond within the timeout period before it is evicted:
- resend BL AST within lock callback timeout period;
- still do not resend CANCEL_ON_BLOCK;
- regular resend for CP AST without BL AST embedded;
- prolong lock callback timeout on resend;
Some fixes:
- recovery-small test_10 now actually evicts the client
with dropped BL AST;
- ETIMEDOUT is returned if the send limit is expired;
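The resend behaviour listed above can be pictured with a very rough
model. All names here (cb_send_fn, cb_timer, send_blocking_ast,
drop_first_n) are hypothetical and exist only for this sketch. It
deliberately does not model the timeout prolongation or the CP AST
case, and it treats the send limit as a simple deadline, which is an
assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transport hook: returns true if the client replied. */
typedef bool (*cb_send_fn)(void *arg);

struct cb_timer {
	long now;	/* current time, arbitrary ticks */
	long deadline;	/* end of the lock callback timeout period */
	long interval;	/* ticks spent waiting for each reply */
};

/*
 * Send a blocking AST and resend it until the client replies or the
 * deadline passes. CANCEL_ON_BLOCK callbacks get a single attempt.
 * Returns 0 on reply, -ETIMEDOUT otherwise.
 */
static int send_blocking_ast(cb_send_fn send, void *arg,
			     struct cb_timer *tm, bool cancel_on_block)
{
	while (tm->now < tm->deadline) {
		if (send(arg))
			return 0;
		if (cancel_on_block)
			break;	/* never resent */
		tm->now += tm->interval;
	}
	return -ETIMEDOUT;
}

/* Demo stub: drops the first *arg sends, then replies. */
static bool drop_first_n(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

With two dropped sends and a deadline of 10 ticks at 3 ticks per
attempt, the third send gets through and the client is not evicted. The
same losses with CANCEL_ON_BLOCK set end in -ETIMEDOUT after the first
attempt.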
Signed-off-by: Vitaly Fertman <vitaly_fertman@xyratex.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5520
Reviewed-by: Alexey Lyashkov <Alexey_Lyashkov@xyratex.com>
Reviewed-by: Andriy Skulysh <Andriy_Skulysh@xyratex.com>
Xyratex-bug-id: MRP-417
Reviewed-on: http://review.whamcloud.com/9335
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The statahead thread should wait for in-flight stat RPCs to finish,
because the statahead RPC callback may access data allocated in the
statahead thread context.
ll_sa_entry_fini() should keep the old entry if the stat RPC has not
finished yet.
Simplify sai refcounting:
* a newly allocated sai holds one refcount, which is put
after starting the statahead thread.
* the statahead thread holds one refcount.
* the agl thread holds one refcount.
* the stat process calls do_statahead_enter(), which tries to get
the sai and, if it is valid, revalidates from the statahead cache
and puts the refcount after use.
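The refcounting rules above can be walked through with a small
user-space model. sai_model and its helpers are hypothetical names for
this sketch; they are not the llite structures or functions, and the
model only tracks the counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the statahead info; only the counter. */
struct sai_model {
	atomic_int refcount;
};

/* A newly allocated object starts with one reference (the creator's). */
static struct sai_model *sai_model_alloc(void)
{
	struct sai_model *sai = malloc(sizeof(*sai));

	if (sai)
		atomic_init(&sai->refcount, 1);
	return sai;
}

static void sai_model_get(struct sai_model *sai)
{
	atomic_fetch_add(&sai->refcount, 1);
}

/* Drop one reference; frees the object and returns true on the last put. */
static bool sai_model_put(struct sai_model *sai)
{
	if (atomic_fetch_sub(&sai->refcount, 1) != 1)
		return false;
	free(sai);
	return true;
}
```

In this model the creator allocates (1), the statahead and agl threads
each take a reference (3), and the creator puts its own (2). A stat
process gets and puts around its cache lookup, and the object is freed
only when the last thread puts its reference.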
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-3270
Reviewed-on: http://review.whamcloud.com/9663
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>