The client_obd::cl_default_mds_easize field should track the largest
observed EA size advertised by the MDT, subject to a reasonable upper
bound. The MDC uses cl_default_mds_easize to calculate the initial
size of request buffers. The default value should be small enough to
avoid wasted memory and excessive use of vmalloc(), yet large enough
to accommodate the common use case.
In the current code, the default value is only updated if
client_obd::cl_max_mds_easize is strictly less than
mdt_body::mbo_max_mdsize. This condition is almost never met, because
client_obd::cl_max_mds_easize is computed at client mount-time based
on the number of OSTs in the filesystem, so the MDT won't ever observe
and advertise an EA size larger than that.
As a result, client_obd::cl_default_mds_easize indefinitely retains
its initial value, which is computed at client mount-time based on
the filesystem's default stripe width. Any getattr() requests for
widely striped files will consequently allocate a request buffer
that is too small, forcing reallocations on both the client and
server side. To avoid this, update client_obd::cl_default_mds_easize
independently of the value of client_obd::cl_max_mds_easize.
In addition, this patch includes these changes:
- Add comments to the client_obd structure to clarify what the
cl_{default,max}_mds_{cookie,ea}size values mean.
- Prevent mdc_get_info() from storing uninitialized data in
client_obd::cl_max_mds_cookiesize.
- Use 4096 as an upper bound for the default values. The former
bound of PAGE_CACHE_SIZE is too large on 64k-page platforms
(i.e. PPC), so it fails to prevent the vmalloc() spinlock
contention described in LU-3338. The new value was chosen to
be large enough to accommodate common use cases while staying
well below the 16k threshold at which allocations start using
vmalloc().
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5549
Reviewed-on: http://review.whamcloud.com/11614
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allow default_easize to be tuned via /sysfs. A system administrator
might want this if a rare access to widely striped files drives up the
value on a filesystem where narrowly striped files are the more common
case. In practice, however, this is wanted primarily to facilitate
a test case for LU-5549.
- Plumb the necessary interfaces through the LMV and MDC layers
to expose write access to this value by higher layers.
- Add block comments to modified functions.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5549
Reviewed-on: http://review.whamcloud.com/13112
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add indexing option to default dirstripe EA. If MDT find
out the client send the create req to the wrong MDT because
of default stripeEA, it will return -EREMOTE, then client
will retrieve default stripeEA through xattr cache, and
re-create the object.
Also merged patch for LU-6341 to resolve the following problem.
Use ll_dir_getstripe to get default stripeEA in ll_new_node(),
Because ll_getxattr_common requires admin rights for retrieving
default LMVEA (because of trusted- prefix), which might cause
mkdir (from normal user) failure.
If parent does not have default stripeEA, then child should always
be in the same MDT for mkdir. Otherwise MDT should return -EREMOTE,
then client will refresh the default stripe index, and recreate
the object.
Signed-off-by: wang di <di.wang@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5523
Reviewed-on: http://review.whamcloud.com/13360
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6341
Reviewed-on: http://review.whamcloud.com/13990
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch fixes the issue seen on the client with growing request
timeout which occurred after the server side patch landed for
LU-5079. While commit itself is correct, it reveals another
issue. If request is being processed for a long time on server
then client adaptive timeouts will adapt to that after receiving
reply and new requests will have bigger timeout. Another problem
is that server AT history is corrupted by recovery request
processing time which not pure service time but includes
also waiting time for clients to recover.
Patch prevents the AT stats update from early replies on client and
from recovering requests processing time on server.
The ptlrpc_at_recv_early_reply() still updates the current request
timeout as asked by server, but don't include this into AT stats.
The real reply will bring that data from server after all.
Signed-off-by: Mikhail Pershin <mike.pershin@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6084
Reviewed-on: http://review.whamcloud.com/13520
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jian Yu <jian.yu@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a parameter is permanently changed on the MGS the
MGS send a changelog packet to the proper nodes that
are affected by the change. Once the nodes receive the
change they then call the userland utility lctl to
change its local value. When calling a userland
application from the kernel you specify a flag to
control the interaction with the application. Originally
by default the flag was set to 0 which is UMH_NO_WAIT
which meant lctl was being called asynchronously. In
older kernels this was fine since UHM_NO_WAIT and
UHM_WAIT_PROC had nearly the same logic. This changed
with newer kernels which broke updating our parameters.
Plus doing a UHM_NO_WAIT doesn't report back a error
if something goes wrong with lctl. The fix is to set
the flag to UHM_WAIT_PROC so kernel space waits until
lctl has finished and we get a proper error code if
something does go wrong with lctl.
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6063
Reviewed-on: http://review.whamcloud.com/13677
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After DNE phase 2 has been added to client it sends
create request to slave MDT. DNT1-only server doesn't
expect request to slave MDT from client. It expects
only cross-mdt request from master MDT. Thus if DNE2
client tries to "mkdir -i 1" on DNE1 server, then
LBUG happened.
This patch adds OBD_CONNECT_DIR_STRIPE connection
flag check on client side. If striped directories are not
supported by server, then create requrest is sent to
master MDT.
Signed-off-by: Artem Blagodarenko <artem_blagodarenko@xyratex.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6071
Xyratex-bug-id: MRP-2319
Reviewed-on: http://review.whamcloud.com/13189
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: wang di <di.wang@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ll_lookup_it() may trigger IT_OPEN RPC to open a file by name.
But at that time, the client does not know the target file's GID,
so it cannot pack the necessary supplementary group ID in the RPC.
Because of missing the supplementary group ID, the RPC maybe fail
for open permission check on the MDS. Under such case, MDS should
return the target file's GID, if the current thread on the client
in the right group (according to the file's GID), the client will
try the IT_OPEN RPC again with the right supplementary group ID.
This patch is also helpful if some other(s) changed the file's GID
after current RPC sent to the MDS with the suppgid as the original
GID by race.
Signed-off-by: Fan Yong <fan.yong@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5423
Reviewed-on: http://review.whamcloud.com/12476
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add comment blocks on lprocfs_stats_lock() and lprocfs_stats_unlock().
Move common NOPERCPU code out of the switch() statements to reduce
code size and complexity, since it doesn't depend on the opc at all.
Replace switch() in lprocfs_stats_unlock() with a simple if/else,
since the lock opc was already checked in lprocfs_stats_lock().
Add an enum for the lprocfs_stats_lock() operations to make it clear
what the valid values are and allow compiler checking.
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5946
Reviewed-on: http://review.whamcloud.com/12872
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Buggy code at ptlrpc_connect_interpret()
finish:
rc = ptlrpc_import_recovery_state_machine(imp);
...
Set import connection flags
When import has FULL state ptlrpc_import_recovery_state_machine()
wakeup all waiters on import and all delayed request, which was
resented. And it could happened that request was send without
updated flags and AT is disabled. If such request is in progress
on the server, server drop the new instance, and could do early reply
for it. But this early reply confuse client, cause it wait real
reply(no AT for this request). Client try to touch buffer outside
reply and got EPROTO error.
The same bug existed for initital connect too. Import became FULL
before import connection flags was set.
Signed-off-by: Alexander Boyko <alexander_boyko@xyratex.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5528
Xyratex-bug-id: MRP-2034
Reviewed-on: http://review.whamcloud.com/11723
Reviewed-by: Li Wei <wei.g.li@intel.com>
Reviewed-by: Alexander Boyko <alexander.boyko@seagate.com>
Reviewed-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change return type and size argiments of lustre_msg_hdr_size(),
lustre_msg_buf{len,count}() and req_capsule_*_size() to __u32.
Change type of req_format->rf_idx and req_format->rf_fields.nr
to size_t. Also return zero for incorrect message magic instead
of -EINVAL. This will be more robust because of few of them after
LASSERTF(0, "...") and will not be returned. In the rest places
it return zero size instead of huge number after implicit
unsigned conversion.
Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5577
Reviewed-on: http://review.whamcloud.com/12475
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
llog_process_thread() can be called from llog_cat_process_cb with an
index already out of bound, leading to the following crash:
LustreError: 3773:0:(llog.c:310:llog_process_thread())
ASSERTION(index <= last_index + 1 ) failed:
LustreError: 3773:0:(llog.c:310:llog_process_thread()) LBUG
#0 [ffff8801144bf900] machine_kexec at ffffffff81038f3b
#1 [ffff8801144bf960] crash_kexec at ffffffff810c5d82
#2 [ffff8801144bfa30] panic at ffffffff8152798a
#3 [ffff8801144bfab0] lbug_with_loc at ffffffffa02f8eeb [libcfs]
#4 [ffff8801144bfad0] llog_process_thread at ffffffffa0413fff [obdclass]
#5 [ffff8801144bfb80] llog_process_or_fork at ffffffffa041585f [obdclass]
#6 [ffff8801144bfbd0] llog_cat_process_cb at ffffffffa0418612 [obdclass]
#7 [ffff8801144bfc30] llog_process_thread at ffffffffa0413c22 [obdclass]
#8 [ffff8801144bfce0] llog_process_or_fork at ffffffffa041585f [obdclass]
#9 [ffff8801144bfd30] llog_cat_process_or_fork at ffffffffa0416b9d [obdclass]
If index is too big, simply return success.
Signed-off-by: frank zago <fzago@cray.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5635
Reviewed-on: http://review.whamcloud.com/12161
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The changes for LRU lock policy was introduced by commit bfae5a4e,
where I was trying to revise the policy to pick locks for canceling.
However, this caused two problems as mentioned in LU-5727. The first
problem is that the lock can only be picked for canceling only if
the number of LRU locks is over preset LRU number AND it's aged; the
second problem is that mdc_cancel_weight() tends to not cancel OPEN
locks, therefore open locks can be kept forever and finally exhausts
memory on the MDT side.
The commit 7b2d26b0 ("revert changes to ldlm_cancel_aged_policy") fixed
the first problem. This patch will revert the rest of changes related
to LRU policy revise.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5727
Reviewed-on: http://review.whamcloud.com/12733
Reviewed-by: Niu Yawei <yawei.niu@intel.com>
Reviewed-by: Bobi Jam <bobijam@gmail.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To avoid scanning the replay open list every time in the
ptlrpc_free_committed(), the fix of LU-2613 (4322e0f9) changed
the ptlrpc_free_committed() to skip the open list unless the
import generation is changed. That introduced a race which could
make a closed open being replayed:
1. Application calls ll_close_inode_openhandle()-> mdc_close(),
to close file, rq_replay is cleared, but the open request is
still on the imp_committed_list;
2. Before the md_clear_open_replay_data() is called for close,
client start replay, and that closed open will be replayed
mistakenly;
3. Open replay interpret callback (mdc_replay_open) could race
with the mdc_clear_open_replay_data() at the end;
This patch fix the ptlrpc_free_committed() to make sure the
open list is scanned on recovery to prevent the closed open request
from being replayed.
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5507
Reviewed-on: http://review.whamcloud.com/12667
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>