[38899] in Kerberos

Re: Concurrency issues with FILE ccache

daemon@ATHENA.MIT.EDU (Osipov, Michael (LDA IT PLM))
Fri Apr 9 11:38:40 2021

To: Greg Hudson <ghudson@mit.edu>, <kerberos@mit.edu>
From: "Osipov, Michael (LDA IT PLM)" <michael.osipov@siemens.com>
Message-ID: <283ec56f-84fc-bd5c-43c6-773202505e38@siemens.com>
Date: Fri, 9 Apr 2021 17:35:26 +0200
MIME-Version: 1.0
In-Reply-To: <87bb255b-0092-6e72-bd43-3d35149dac82@mit.edu>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: kerberos-bounces@mit.edu

On 2021-04-06 at 19:28, Greg Hudson wrote:
> On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:
>> gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure.  Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)
> 
> This is not expected, and bears investigation.  It suggests an EINVAL,
> EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
> operations performed by fcc_store(), none of which are expected.  If
> you're building libkrb5, you could try modifying interpret_errno() to
> pass those error codes through in order to find out which one is happening.
> 
> Getting multiple cache entries for a service is normal when multiple
> threads or processes initiate contexts to the same (new) service within
> a short window.
> 

Hi Greg,

I was able to compile and install 1.19.1 in the GitLab
Runner and verified that py-gssapi picks it up from LD_LIBRARY_PATH.
Unfortunately, 1.19.1 still suffers from the same problem as 1.17. I
tried to narrow it down with strace, but that changes the runtime
behavior of the application and the error disappears. I then patched the
fcc_store() function:
> $ git diff
> diff --git a/src/lib/krb5/ccache/cc_file.c b/src/lib/krb5/ccache/cc_file.c
> index 9a9b45a6e..7f604c0f4 100644
> --- a/src/lib/krb5/ccache/cc_file.c
> +++ b/src/lib/krb5/ccache/cc_file.c
> @@ -1000,8 +1000,9 @@ fcc_store(krb5_context context, krb5_ccache id, krb5_creds *creds)
>      if (ret)
>          goto cleanup;
>      nwritten = write(fileno(fp), buf.data, buf.len);
> -    if (nwritten == -1)
> +    if (nwritten == -1) {
>          ret = interpret_errno(context, errno);
> +        printf("errno: %d, ret: %d\n", errno, ret); }
>      if ((size_t)nwritten != buf.len)
>          ret = KRB5_CC_IO;

but the output did not appear. Then I patched interpret_errno()
directly for the internal error:
> @@ -1293,6 +1294,7 @@ interpret_errno(krb5_context context, int errnum)
>      case EWOULDBLOCK:
>  #endif
>          ret = KRB5_FCC_INTERNAL;
> +        printf("errnum: %d, ret: %d\n", errnum, ret);
>          break;
>      /*
>       * The rest all map to KRB5_CC_IO.  These errnos are listed to
I had exactly one failure in the job and received exactly this:
> errnum: 17, ret: -1765328188
which maps to EEXIST.
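
For temporary instrumentation like the printf() calls above, a minimal sketch of a trace helper may be handy. It prints the symbolic name next to the number, writes to stderr (which is typically not fully buffered, so the line is less likely to get lost in a CI job), and restores errno afterwards. The names errno_symbol(), format_trace() and trace_errno() are hypothetical and not part of libkrb5; the list of symbols is just the set Greg named for the internal error.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Symbolic name for the errno values named in the quoted reply as
 * candidates for the internal ccache error; anything else is "other". */
const char *errno_symbol(int errnum)
{
    if (errnum == EINVAL) return "EINVAL";
    if (errnum == EEXIST) return "EEXIST";
    if (errnum == EFAULT) return "EFAULT";
    if (errnum == EBADF) return "EBADF";
    if (errnum == EWOULDBLOCK) return "EWOULDBLOCK";
    return "other";
}

/* Hypothetical helper: format one trace line into buf. */
int format_trace(char *buf, size_t len, const char *where,
                 int errnum, long ret)
{
    return snprintf(buf, len, "%s: errnum %d (%s), ret %ld",
                    where, errnum, errno_symbol(errnum), ret);
}

/* Hypothetical helper: emit the trace line on stderr and leave errno
 * as it was, so the instrumentation does not disturb the caller. */
void trace_errno(const char *where, int errnum, long ret)
{
    char buf[128];
    int saved = errno;

    format_trace(buf, sizeof(buf), where, errnum, ret);
    fprintf(stderr, "%s\n", buf);
    errno = saved;
}
```

Inside interpret_errno() this would be called as trace_errno("interpret_errno", errnum, ret) in place of the printf().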

I am quite sure that this is a race condition: stat() is performed and
the file does not exist, then open() for writing is performed, but in
parallel the file has already been created, so the later call fails
with EEXIST.
I assumed the call site to be fcc_initialize() and added a printf() there:
> fcc_initialize()
> errnum: 17, ret: -1765328188
> fcc_initialize()
> errnum: 17, ret: -1765328188
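
The suspected interleaving can be replayed outside of libkrb5. The following is only a sketch under assumptions: it assumes the cache file is created with an exclusive create (O_CREAT | O_EXCL), which is the usual way for open() to return EEXIST, and it replays the steps of two callers sequentially in one process instead of using real threads, so the result is deterministic. The function name replay_create_race() and the demo path are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Replays the suspected interleaving in a single process:
 *   1. caller A checks with stat() and sees no file,
 *   2. caller B creates the file,
 *   3. caller A attempts an exclusive create and loses.
 * Returns the errno from step 3, 0 if step 3 unexpectedly succeeded,
 * or -1 if the setup itself failed.
 */
int replay_create_race(const char *path)
{
    struct stat st;
    int fd_a, fd_b, err;

    unlink(path);                        /* start from a clean state */

    /* Step 1: A must observe that the file does not exist. */
    if (stat(path, &st) == 0 || errno != ENOENT)
        return -1;

    /* Step 2: B gets there first and creates the file. */
    fd_b = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd_b == -1)
        return -1;

    /* Step 3: A acts on its stale stat() result. */
    fd_a = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    err = (fd_a == -1) ? errno : 0;

    if (fd_a != -1)
        close(fd_a);
    close(fd_b);
    unlink(path);
    return err;
}
```

Calling replay_create_race("/tmp/krb5cc_race_demo") from a small main() and comparing the result with EEXIST shows whether the platform behaves as assumed. It does not prove that fcc_initialize() follows this exact sequence.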

What now?

Michael
________________________________________________
Kerberos mailing list           Kerberos@mit.edu
https://mailman.mit.edu/mailman/listinfo/kerberos
