A technical intro to ILM–Identity Lifecycle Manager

3/4/2009 Edited list of out-of-box management agents

So recently I was able to take a training class on ILM. This post contains some of the core concepts and info I learned in that class, along with a few interesting bits about our ILM implementation here. I’ve written once before about ILM here, in Identity Lifecycle Manager: Directory synchronization nirvana, Delegated Group Management, Certificate Provisioning, and more, mostly as an intro without any technical depth. This post will dive much further into the technical details.

In general, ILM is a directory synchronization tool and a certificate management and deployment tool. Since directories commonly hold identity information, this set of functionality neatly rolls up into the product name. ILM is commonly used to provision user accounts, and has some capability to manage password synchronization.

In a short time, Microsoft will release a new version of ILM, currently code-named ILM2. This product has a web portal front-end to it (more on that in a second), workflow functionality, and some very cool cross-product tie-ins. This combination of new features enables some very cool new scenarios, such as provisioning groups whose membership depends on something else. Imagine you know you need to be in a certain group to get access to something. ILM can capture your request via its web portal, or even via an Outlook snap-in. It then sends an email request along to the designated admin for the group, asking if you should be added. If the admin has the Outlook snap-in, they can approve the request from within Outlook. If they don’t use Outlook, they can follow a link in the email to the web portal to approve/deny. I don’t know whether we’ll ever see this feature set of the product in use here at the UW, but it is fun to imagine. 🙂

Anyhow, let’s move back to the core product.

ILM is a state-based synchronization product. This means that it reads in the state of all the various data sources, compares them to determine what has changed, and then acts on just the changes. As you might imagine, this has both advantages and disadvantages:

Advantages:

  • it isn’t dependent on a specific order of events
  • it isn’t reliant on installing agent code at each data source to send data to it, which avoids the reliability and security issues such agent code might bring with it
  • it can enforce that the state be kept as expected

Disadvantages:

  • Requires processing power to evaluate state differences
  • the timeliness of changes isn’t necessarily as good as with an event-based process, i.e. there is some synchronization latency

ILM comes out of the box with a wide variety of management agents for common directory and data source products. An ILM management agent is responsible for managing the flow of data to and from a data source.

These include:

  • AD
  • ADAM
  • AD Global Address List
  • Attribute-value pair (AVP) file-based
  • Delimited text file
  • DSML
  • Exchange 5.5
  • Exchange 5.5 (bridgehead)
  • Extensible connectivity
  • Fixed-width text file
  • IBM DB2 Universal Database
  • IBM Directory Server
  • LDIF file-based
  • Lotus Notes
  • Novell eDirectory
  • Oracle
  • SQL
  • Sun and Netscape directory servers
  • Windows NT 4.0

There are also 3rd party management agents.

Conceptually, ILM has two object spaces you need to understand. Each management agent has its own connector space (CS). This includes all the data source objects for that management agent. And then there is a single metaverse space (MV). The metaverse space represents those connector space objects which have been projected or joined (more on what those two new words mean in a minute). In other words, the metaverse is the space where things come together.

Each management agent defines which objects in its connector space should be projected or joined to the metaverse. To project means that the resulting object in the metaverse will consider this management agent’s object as authoritative for that object. And if you have no management agents projecting, then your metaverse will be empty. Put another way, projecting is the way to provision objects into the metaverse, and each metaverse object has a special relationship with the management agent or agents that projected it. To join means what you might imagine; it connects an object in this management agent’s connector space with objects in the metaverse. Both projection and joining are dependent upon filters and rules that determine which objects should do what, and in what way. A projection filter determines which objects should project. A projection rule determines which connector space attributes should map to which metaverse attributes. A join filter determines which objects should attempt to connect to which metaverse objects. A join rule determines which connector space attributes should be compared to which metaverse attributes to find a match. Both join rules and projection rules can go either direction. In order to achieve some synchronization, you’ll need rules that go both in and out of the metaverse.

This leads us to schema, i.e. the definition of what kinds of objects there are and what kind of data can be associated with each object. Each data source comes with its own schema. Each management agent (and therefore its connector space) has a schema (which may or may not match the data source). And the metaverse has a schema that somehow melds all of this together.

Without any special skills, one can directly map an attribute in one connector space to an attribute in the metaverse. If you need to make any sort of transformation, or have any kind of logical dependency, then you need to use an extended rule, which involves writing some code. That code is not especially hard to write; a sketch follows.
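
To make that concrete, here is a minimal sketch of what a rules extension can look like, assuming the Microsoft.MetadirectoryServices rules-extension API that ships with MIIS/ILM 2007 (the assembly has to be referenced from the ILM install). The flow rule name, object type, and attribute names below are hypothetical, not our production rules.

```csharp
// Sketch of an ILM rules extension: a projection filter plus one import attribute flow.
using Microsoft.MetadirectoryServices;

public class SampleRulesExtension : IMASynchronization
{
    public void Initialize() { }
    public void Terminate() { }

    // Projection filter logic: decide whether a CS object should project,
    // and as what metaverse object type.
    public bool ShouldProjectToMV(CSEntry csentry, out string MVObjectType)
    {
        MVObjectType = "person";
        return csentry.ObjectType == "user";   // hypothetical filter
    }

    // Import attribute flow: build a metaverse displayName from two CS attributes.
    public void MapAttributesForImport(string FlowRuleName, CSEntry csentry, MVEntry mventry)
    {
        switch (FlowRuleName)
        {
            case "cd.user:displayName":        // hypothetical flow rule name
                string first = csentry["givenName"].IsPresent ? csentry["givenName"].Value : "";
                string last = csentry["sn"].IsPresent ? csentry["sn"].Value : "";
                mventry["displayName"].Value = (first + " " + last).Trim();
                break;
            default:
                throw new EntryPointNotImplementedException();
        }
    }

    // The remaining IMASynchronization members aren't needed for this example.
    public void MapAttributesForExport(string FlowRuleName, MVEntry mventry, CSEntry csentry)
        { throw new EntryPointNotImplementedException(); }
    public bool FilterForDisconnection(CSEntry csentry)
        { throw new EntryPointNotImplementedException(); }
    public void MapAttributesForJoin(string FlowRuleName, CSEntry csentry, ref ValueCollection values)
        { throw new EntryPointNotImplementedException(); }
    public bool ResolveJoinSearch(string joinCriteriaName, CSEntry csentry, MVEntry[] rgmventry,
        out int imventry, ref string MVObjectType)
        { imventry = -1; throw new EntryPointNotImplementedException(); }
    public DeprovisionAction Deprovision(CSEntry csentry)
        { throw new EntryPointNotImplementedException(); }
}
```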

Moving back to concepts, the way in which data is moved around and synchronized is all tied to the management agents. Each management agent has a set of run profiles.

Each run profile can include one or more of the following actions:

  • Delta import (staged). Evaluate only changed objects in data source, and make changes only to CS.
  • Full import (staged). Evaluate all objects in data source, and make changes only to CS.
  • Delta import and delta sync. Evaluate only changed objects in data source, make changes to CS, then evaluate projection/join rules for only those CS objects which changed, make resulting changes to MV, follow any external attribute flow rules making changes to other management agent’s CS.
  • Full import and delta sync. Evaluate all objects in data source, make changes to CS, then evaluate projection/join rules for only those CS objects which changed, make resulting changes to MV, follow any external attribute flow rules making changes to other management agent(s)’s CS(s).
  • Full import and full sync. Evaluate all objects in data source, make changes to CS, then evaluate projection/join rules for all CS objects, make resulting changes to MV, follow any external attribute flow rules making changes to other management agent’s CS. A full sync is needed after a rules change to apply these new rules to CS and MV objects which didn’t change during a delta.
  • Delta sync. Evaluate projection/join rules for only those CS objects which changed, make resulting changes to MV, follow any external attribute flow rules making changes to other management agent’s CS.
  • Full sync. Evaluate projection/join rules for all CS objects, make resulting changes to MV, follow any external attribute flow rules making changes to other management agent’s CS.
  • Export. Push *all* CS objects to data source per external attribute mapping.

So imports move data into the connector space.

Syncs move data from one connector space into the metaverse and back out to other connector spaces. Note that if you have many management agents, you need to run a sync on each of them to achieve complete synchronization.

Exports move data from the connector space back to the data sources. Exports are the only way to achieve some kind of change in the world outside ILM; without them, you are just playing around in an ILM universe.
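
In practice, chaining run profiles into a full cycle is usually scripted against the ILM (MIIS) WMI provider. Here is a rough sketch, assuming the MIIS_ManagementAgent WMI class; the management agent and run profile names are hypothetical.

```csharp
// Sketch: execute run profiles in sequence via the MIIS/ILM WMI provider.
using System;
using System.Management;

class RunCycle
{
    static void Run(string maName, string profileName)
    {
        var scope = new ManagementScope(@"\\.\root\MicrosoftIdentityIntegrationServer");
        var query = new ObjectQuery(
            "SELECT * FROM MIIS_ManagementAgent WHERE Name = '" + maName + "'");

        using (var searcher = new ManagementObjectSearcher(scope, query))
        {
            foreach (ManagementObject ma in searcher.Get())
            {
                // Execute() blocks until the run finishes and returns a result string.
                string result = (string)ma.InvokeMethod("Execute", new object[] { profileName });
                Console.WriteLine("{0} / {1}: {2}", maName, profileName, result);
            }
        }
    }

    static void Main()
    {
        // Import and sync each management agent, then export the results.
        Run("AD MA", "Delta Import and Delta Sync");
        Run("PDS MA", "Delta Import and Delta Sync");
        Run("AD MA", "Export");
        Run("PDS MA", "Export");
    }
}
```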

We rolled ILM out depending heavily on a Microsoft Consulting engagement to get things in place so the UW Exchange deployment could have an address book which didn’t look rotten.

Our existing implementation of ILM connects an AD with an OpenLDAP directory, or more specifically UWWI with PDS. As you might have noticed, OpenLDAP wasn’t one of the out of the box management agents listed above. At the time of our implementation, there was a 3rd party management agent, but it had known problems which made it not acceptable for our deployment. Since then, an open-source openldap management agent which addresses those problems has been released, but we haven’t had a chance to evaluate it.

Our deployment uses an extensible management agent built around the LDIF management agent to connect to openldap. Our openldap directory dumps change logs to ILM, and ILM reads those in delta imports.

In the existing implementation, AD projects to the MV, and PDS joins. And only objects from PDS which are of class uwPerson are joined. The attribute flow is mostly online already, so I won’t go into that. Our AD has about 450,000 users, while PDS has about 1.5 million objects.

One of the gotchas in our existing implementation is the high number of disconnectors. A disconnector is any connector space object which is not related to a metaverse object (i.e. joined or projected). The problem with lots of disconnectors is that every time you do a sync, *all* disconnectors are re-evaluated to see if they now join up. I’m in the process of investigating whether some design changes wouldn’t eliminate that problem in our implementation, and allow us to go from a 3 hour sync cycle to something much shorter. Currently the delta imports take about 5 minutes, while the delta sync takes 90 minutes.

Another gotcha is that while the MS consultant refined the rules a couple times before he left, it doesn’t appear that he ran a full sync after those rule changes. This means that only objects which have changed since then have had those updated rules applied. So there is inconsistency in UWWI in terms of what should be there. Now I may be the only one who has come across examples of this, but it really bugged me until I found out why.

The final notable gotcha in this space revolves around the “name” attributes. And there’s a complex story here, which for the sake of my sanity, I am not going to go into details about. The problems here are:

  • the mapping logic is inconsistent depending on the initial state,
  • not all UWWI objects get joined to a PDS object (b/c of the uwPerson join rule),
  • the input validation and formatting from the source systems to the PDS attribute we use is non-existent,
  • and our mapping logic code falls short of addressing all cases (but there’s no way you can address all possible cases given no input validation and inconsistent formatting).

Oh … and only employees have any real ability to change their name.

We’ve got an imagined fix for all of this, where the ‘manage your uw netid’ page would allow *every* uw netid to manage their name information in a consistent format, with input validation that would then flow through, but it hasn’t gotten enough priority to be resourced.

Windows SIG getting started

A Windows SIG is kicking off! Details below.

Nathan & I will be presenting, and as a teaser, take a look at this new architectural picture. I’m hoping to have some UWWI stats done in time for the presentation too.

Windows Admin SIG Meeting:  “UWWI:  What’s in it for me?”

What: Windows Administration Sig Meeting

Where: Allen Auditorium, Allen Library

When: Wednesday, January 28, 2009

Time: 3:00PM-5:00PM

Please RSVP to coston@u.washington.edu

This will be our official kick-off meeting, so make sure you’re there! 

 

To kick things off right, Brian Arkills and Nathan Dors will be presenting on UW Windows Infrastructure (UWWI) and the UW Groups service.  This is intended to be a highly interactive discussion, so bring your questions and use cases! 

More information about the UW Windows Administration SIG is available at https://sig.washington.edu/itsigs/SIG_Windows_Administration.

Join the discussion by subscribing to our Mail List at http://mailman2.u.washington.edu/mailman/listinfo/mswinadmins.

A Dev environment for UWWI and Random Password Generation

Since the UW Exchange project, UWWI has had a development environment, but it’s been at best a very poor facsimile of the real thing.

More recently, I’ve been pouring time into making it a more realistic environment for testing our core components.

From time to time, we must make changes to the core infrastructure components:

  • the domain controllers themselves, whether that’s the operating system, or other significant changes
  • fuzzy_kiwi, the account provisioning agent
  • slurpee, the group provisioning agent
  • subman, the service lifecycle agent
  • ilm, the directory synchronization agent

Testing those changes has, until now, been hard, and has involved finding test cases within the production environment.

Of course, some changes are of a nature that you can’t test them in production–which is why a development environment is required.

In fact, there are a slew of ilm changes queued up in my task list which are blocked because I currently have no safe way to test them before implementation.

We plan to upgrade the UWWI DCs to Windows Server 2008 sometime in the coming months, but first, we wanted to test that WS2008 didn’t cause any problems with our core infrastructure components. Would fuzzy_kiwi run on WS2008?

Getting fuzzy_kiwi installed and running in a separate domain instance was an adventure of its own, because there was no existing documentation on getting it running (there is now). But I’ll skip that story. 🙂

However, there are a couple interesting things here: I’ve made some key changes to fuzzy_kiwi so that it is now self-aware of where it is running. If it detects that it is running in our development domain, then it does things differently. Otherwise, it acts normally. In our development domain, fuzzy_kiwi creates accounts disabled and ignores the password it is given. Instead, it asserts its own very long, random password–and each instance has a new random password. That was a trick I’ll get to later. There is also a new feature in fuzzy_kiwi where some accounts can be ‘untouchable’ by kiwi requests. This is needed especially in dogfood where you want the administrators to have different passwords on their accounts and you don’t want those accounts to be disabled. It wouldn’t do to have the only folks who can make changes locked out of that domain. 🙂
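
As a rough illustration of the “self-aware” bit (this is not the actual fuzzy_kiwi code, and the domain name is hypothetical), detecting which domain the agent is running in can be as simple as:

```csharp
using System;
using System.DirectoryServices.ActiveDirectory;

class DomainCheck
{
    // Hypothetical name for the development domain.
    const string DevDomain = "dev.example.washington.edu";

    static bool RunningInDevDomain()
    {
        // Ask AD which domain this machine is joined to.
        string current = Domain.GetComputerDomain().Name;
        return string.Equals(current, DevDomain, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        if (RunningInDevDomain())
            Console.WriteLine("Dev domain: create accounts disabled, assert a random password.");
        else
            Console.WriteLine("Production domain: act normally.");
    }
}
```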

Getting back to the random password in our dev environment feature, I made a few interesting discoveries about coding such a thing. I knew from my own math and computer science coursework background that generating truly random numbers was a very difficult thing. And generating a random password is at its heart all about generating random numbers. Without going into the details of the algorithm I used (that wouldn’t be very smart, now would it?), I do want to make a few remarks about some of the building blocks.

Within the .net framework, I came across the System.Random class and its System.Random.Next method as a way of generating random numbers. The class and method are very easy to use, and even give you a way to specify a lower and upper bound on the random integer returned. It wasn’t until I started looking at what it generated that I saw a significant problem with the class: by default, it generates exactly the same sequence of “random” numbers on successive runs (within a suitably short period of time). This is because the algorithm used behind the class focuses on randomness within the sequence it generates–not randomness of what is used to initially generate the sequence. By default, the class “seeds” the sequence by using the tick count of the time you instantiate it. But in practice, that means that there is often duplication on subsequent runs. You can supply your own “seed”, but then you are stuck with a circular problem: generating a random seed so you can get a random number.
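
Here is a tiny demonstration of that problem, as it behaves on the .NET Framework of that era (where the default seed comes from the tick count):

```csharp
using System;

class RandomSeedDemo
{
    static void Main()
    {
        // Two Random instances created within the same tick share a seed,
        // so they produce identical "random" sequences.
        var a = new Random();
        var b = new Random();
        for (int i = 0; i < 5; i++)
        {
            // Very likely prints matching pairs on each line.
            Console.WriteLine("{0} {1}", a.Next(0, 100), b.Next(0, 100));
        }
    }
}
```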

So I set off looking for something better. And I came across the System.Security.Cryptography.RNGCryptoServiceProvider class and its System.Security.Cryptography.RNGCryptoServiceProvider.GetBytes method. This class is a bit harder to use than System.Random. It requires you to pass in an array of bytes, which it fills with the output. Each output is a random number from 0-255 (it’s a byte after all), and there’s a slight variant method, .GetNonZeroBytes, which outputs random numbers from 1-255. This class doesn’t allow you to ask for lower and upper bounds, so you end up performing modulo operations on the output (assuming you want something smaller than 256) and addition/subtraction to fit your needs. From what I’ve seen the numbers generated are pretty random, and there isn’t duplication on successive runs like with System.Random.
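
And here is a minimal sketch of the crypto-based approach (this is not our actual algorithm; the character set and length are arbitrary):

```csharp
using System;
using System.Security.Cryptography;

class PasswordGenerator
{
    // Map cryptographically random bytes onto a character set.
    static string Generate(int length)
    {
        const string charset =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*";
        var bytes = new byte[length];
        using (var rng = new RNGCryptoServiceProvider())
        {
            rng.GetBytes(bytes);   // fills the array with random bytes, 0-255
        }

        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            // Modulo maps 0-255 down to an index into the charset. Note this
            // introduces a slight bias unless the charset length divides 256 evenly.
            chars[i] = charset[bytes[i] % charset.Length];
        }
        return new string(chars);
    }

    static void Main()
    {
        Console.WriteLine(Generate(32));
    }
}
```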

This post is probably beyond what most folks are interested in, but you never know what will be useful, or be a catalyst to generate that critical feedback loop that brings in vitally relevant information. I’m likely to have more technically detailed and arcane posts in a vein similar to this in the future. In an effort to balance this likely trend, I’ll try to keep the more technically heavy content later in the post, and keep the more widely useful and relevant information near the top.

Fun with Two Hop Kerberos

So as part of a recent Datawarehouse initiative here at the UW, there’s been quite a bit of activity around Windows authentication delegation, sometimes better known as Kerberos two-hop authentication. I know the Law School has been using two-hop authentication for a while now, and recently had a problem with it, so I think this post is likely relevant to quite a few folks.

To explain what two hop authentication is, we’ll need to jump back to Windows authentication basics to make sure we are all on the same page. If you already understand it, then jump down to “The Meat”.

So when you log in, you give your password (or some other credentials) to the lsass.exe process on what is usually a (physically) local (to you) computer. The lsass.exe process on your computer hashes some other info (a timestamp) using the password to create the hash, then sends that hash over the wire to a domain controller for verification. Note that the info on the wire doesn’t contain any form of your password. The domain controller compares that hash to what it expects, and if successful, passes back a login token that can be used. Depending on the details of the authentication scenario, that login token might have additional stuff (usually domain local & local groups) added to it before you receive it. Then you can use that token to access stuff on the local computer and over the network. The reason you can use it to access stuff off the local computer is because the token itself has been marked as re-usable, and the local lsass.exe process considers that mark as inviolable.

Sometimes you access stuff over the network, and you are challenged for your credentials. When that happens, you actually do send your password over the wire, and the lsass.exe process on the remote computer takes your password, does the same dance with the domain controller, *except* this time the token doesn’t get the re-usable mark. This is because that remote computer doesn’t need your login token except for resources local to it. It uses that token, and we say the remote computer is impersonating you to gain access to resources (local to itself) on your behalf. In Windows terminology this is called Impersonation. Impersonation can also happen without a password challenge, and in that case, your local lsass.exe which has a re-usable copy of your login token passes that token to the remote computer, which then uses that token to ask the domain controllers for a non-reusable token. You might also think of this scenario as one-hop, as the login token is one hop removed from where the user physically is.
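
For the curious, here is roughly what that second scenario looks like in code on the remote server, using the classic Win32 LogonUser plus .NET Framework WindowsIdentity.Impersonate pattern (the credentials below are placeholders; a real service would receive them from the caller):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security.Principal;

class ImpersonationDemo
{
    // Win32 LogonUser turns credentials into a logon token.
    [DllImport("advapi32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern bool LogonUser(string user, string domain, string password,
        int logonType, int logonProvider, out IntPtr token);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    const int LOGON32_LOGON_NETWORK = 3;    // the kind of logon a server does for a remote caller
    const int LOGON32_PROVIDER_DEFAULT = 0;

    static void Main()
    {
        IntPtr token;
        // Placeholder credentials; a real service gets these from the caller.
        if (!LogonUser("someuser", "EXAMPLEDOM", "password",
                       LOGON32_LOGON_NETWORK, LOGON32_PROVIDER_DEFAULT, out token))
            throw new Exception("LogonUser failed: " + Marshal.GetLastWin32Error());

        try
        {
            // While impersonating, access to *local* resources happens as the caller.
            // Without delegation, this token won't get you a second network hop.
            using (WindowsImpersonationContext ctx = new WindowsIdentity(token).Impersonate())
            {
                Console.WriteLine("Now running as: " + WindowsIdentity.GetCurrent().Name);
            }   // disposing the context reverts to the original identity
        }
        finally
        {
            CloseHandle(token);
        }
    }
}
```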

Now, say that the remote computer needed to access network resources that aren’t local to it, as you. That’s the scenario we are concerned with here. If it helps, imagine a web service that needs to access a SQL service as you to provide the right data. In this two hop scenario, you pass your creds (either password or login token) to the first remote server. That remote server has something special about it. The user account that is running the network service process has been granted a special ability called delegation. The user account might be SYSTEM, in which case the user account is the computer object in Active Directory, or it might be some specific service account. Using delegation, the first remote server can take the creds you’ve provided to get a login token that is re-usable. It then can reach out to the 2nd remote server, and provide a non-reusable token to access whatever it needs on that 2nd remote server. There are two levels of delegation: unconstrained delegation and constrained delegation. With unconstrained, the 1st remote server can get a re-usable login token that can be used to access *any* network services that token has access to. With constrained, the 1st remote server is limited so that the re-usable login token can only be used locally and with specific network services. Obviously constrained delegation is more secure and therefore preferable to unconstrained.

A few relevant factoids about delegation:

  • Delegation relies on Kerberos authentication. If you can’t do Kerberos to the 1st remote server, then you can’t use delegation to achieve the second hop.
  • Kerberos authentication relies on a bunch of pre-requisites, so it can sometimes be tricky to achieve.
  • You can have as many hops as you’d like, as long as each server in the chain has delegation privileges to the next server in the chain.
  • Granting the delegation privilege is practically an all or nothing thing. If you grant it to user serviceX, it means that *every* user who passes creds to serviceX will have a re-usable login token available to serviceX. If serviceX is insecure or not trustworthy, then really bad things can happen. Aside from the constrained level, there is one check on this privilege–you can mark certain user accounts as being “sensitive”. This means that they cannot be used via delegation at all. You will want to mark all your domain admin accounts as sensitive, and likely quite a few others too (a sketch of setting this flag follows this list).
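
Here is a sketch of setting that “sensitive and cannot be delegated” flag programmatically with System.DirectoryServices (the account DN is hypothetical; in practice you can just check the box on the Account tab in Active Directory Users and Computers):

```csharp
using System;
using System.DirectoryServices;

class MarkSensitive
{
    // userAccountControl bit for "Account is sensitive and cannot be delegated".
    const int ADS_UF_NOT_DELEGATED = 0x100000;

    static void Main()
    {
        // Hypothetical DN; substitute the admin account you want to protect.
        using (var entry = new DirectoryEntry("LDAP://CN=Some Admin,OU=Admins,DC=example,DC=edu"))
        {
            int uac = (int)entry.Properties["userAccountControl"].Value;
            entry.Properties["userAccountControl"].Value = uac | ADS_UF_NOT_DELEGATED;
            entry.CommitChanges();
        }
    }
}
```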

So that’s the basics, and now we’ll move onto the more interesting stuff.

The Meat

So I was saying that the datawarehouse project here has chosen an architecture design that relies on two-hop authentication. The primary components which do this are SQL servers that, via the SQL linked server functionality, bring data from many SQL servers together into a view. Complicating this picture is the fact that our user accounts and the SQL servers themselves are in two different forests.

We had a lot of problems getting this to work correctly. For Kerberos to work correctly, you have to make sure you have all the service principal names registered correctly. You also have to ensure all the computers’ clocks are within a given skew of each other. And that they all are trying to use Kerberos. And that you have a forest trust, not a domain trust.
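
The usual tool for the SPN legwork is setspn.exe, but here is a hedged sketch of checking an SPN from code with System.DirectoryServices (the SPN and domain DN are hypothetical). Zero results means the SPN isn’t registered, and more than one means a duplicate; either breaks Kerberos.

```csharp
using System;
using System.DirectoryServices;

class SpnCheck
{
    static void Main()
    {
        // Hypothetical SPN for a SQL instance.
        string spn = "MSSQLSvc/sqlhost.example.washington.edu:1433";

        // Search the whole domain for any account holding this SPN.
        using (var root = new DirectoryEntry("LDAP://DC=example,DC=washington,DC=edu"))
        using (var searcher = new DirectorySearcher(root))
        {
            searcher.Filter = string.Format("(servicePrincipalName={0})", spn);
            searcher.PropertiesToLoad.Add("sAMAccountName");

            SearchResultCollection results = searcher.FindAll();
            if (results.Count == 0)
                Console.WriteLine("SPN not registered: " + spn);
            else
                foreach (SearchResult r in results)
                    Console.WriteLine("{0} registered on {1}",
                        spn, r.Properties["sAMAccountName"][0]);
        }
    }
}
```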

In the course of all these problems, we finally asked PSS to come help us. They sent two consultants on site on two separate occasions, but both left stumped. We were left with all kinds of additional (undocumented) claims from the two different consultants, in some cases contradictory to each other.

We eventually figured out both the problems which were ailing us.

The most serious problem was that in marking a wide variety of accounts as sensitive, we (actually it was me) accidentally marked a special built-in Windows account as sensitive. That account is the KrbTgt account. This account has a very special function. It issues *every* login token for your Windows domain. So, obviously, it’s very important. By marking the KrbTgt account sensitive, apparently every login token it issues is also marked sensitive. This is undocumented behavior, but from a logical perspective it makes sense. So for the span of about 3 months I can say with definitive authority that there was absolutely no delegation used from that domain–because every account was effectively marked as sensitive. Fortunately, not many folks are using delegation from that domain as of yet.

Note that for some domains this might be desired behavior, and that it’s really a shame that this is undocumented behavior. I’d imagine that quite a few Windows security organizations might want to add this to their locked down configuration guides.

We also had a sporadic problem on certain servers with hostnames where the DNS suffix of those servers’ hostnames corresponds to a MIT Kerberos realm which happens to have a Kerberos trust from one (and only one) of the forests involved. That problem happens because Kerberos uses mutual authentication–meaning one computer verifies that any other computer it talks to is who it claims to be. For this, it uses what are called servicePrincipalNames (SPNs). But you have to find the right authority for a given SPN, and of course, the Windows logic assumes that the DNS suffix on a SPN is meaningful even though that isn’t necessarily true. It turns out that if the servers involved have the registry keys needed to resolve the KDCs for a MIT Kerberos realm in this scenario, then Windows works as you’d like. In other words, if it can find the MIT Kerberos realm, then it can check it for the SPNs, find out that they aren’t there, and then look elsewhere for the SPNs. But if it can’t find the KDCs for that MIT Kerberos realm, then it gets stuck. Putting the registry keys for resolving the MIT Kerberos realm on all relevant computers is one fix, another is not using that DNS suffix in any server hostnames.

Put another way:

Windows domain blah.doodoo.com has a Kerberos realm trust with jojo.com. Windows domain blah.doodoo.com has a server named sql1.jojo.com in it. Out of the box Windows clients in blah.doodoo.com *can’t* negotiate Kerberos with sql1. Windows clients with the appropriate KDC registry keys referencing the Kerberos realm jojo.com *can* negotiate Kerberos with sql1.

In other words, because you have that Kerberos realm trust, you can’t plan on having Kerberos auth to any computers with a DNS suffix that matches that realm unless all your clients have got the KDC reg keys to that realm. Somewhere in the background it’s likely that there’s an error happening which won’t give up and allow the local Windows KDC to issue a TGS for a host with that DNS suffix, unless it can contact the external Kerberos realm KDCs to see if they have a more authoritative SPN.

If you do want to read up on this technology, my favorite blog site, the MS Directory Services blog, has a very useful post that you can add to your reading list:

http://blogs.technet.com/askds/archive/2008/06/13/understanding-kerberos-double-hop.aspx