Last week someone asked a question on ActiveDir, and many folks appreciated my response. I’ve summarized the original question and included my response here for future reference.
Where should “prefix” words (Tussenvoegsels) in a name be stored in Active Directory? And how do I reflect cultural differences in names? For example, “Johannes van der Waals” vs. “Johannes Van Der Waals”.
The problem space here is bigger than you’ve outlined. It’s my understanding that there are cultures where the surname is presented first. There are also important name suffixes such as Jr., Sr., and II (the second), as well as less important ones, and name prefixes that you may consider important, like Dr. Some regulations may also limit your ability to publish name data at all: in my sector, FERPA restricts our ability to publish a student’s name when the student has blocked that. Case sensitivity is also a relevant factor in this space. And then there’s the issue of legal name vs. preferred name, which can be a highly political issue when transgender issues are taken into account.
I’d note that the LDAP specification includes support for attribute options. The idea is that an attribute can hold different values depending on the context desired. It’s like a multi-valued attribute, but each value carries a “label” that says when that value applies, and LDAP clients can request the value for the context they want. In the context of this problem, imagine clients could request the “Belgian” option of the displayName, or the “Dutch” option.
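To make the idea concrete, here is a minimal sketch of how a client might choose a contextual value in a directory that does support attribute options. It assumes the entry comes back as a dictionary keyed by attribute description (attribute type, then `;option`), in the style of LDAP language-tag options. The helper names, the sample entry, and which spelling belongs to which option are all assumptions for illustration.

```python
def split_description(description):
    """Split 'displayName;lang-nl-BE' into (type, options), lowercased
    because attribute descriptions are compared case-insensitively."""
    attr_type, *options = description.lower().split(";")
    return attr_type, frozenset(options)


def pick_value(entry, attr_type, wanted_option=None):
    """Return the values tagged with wanted_option, falling back to the
    untagged values when that option is not present."""
    fallback = None
    for description, values in entry.items():
        found_type, options = split_description(description)
        if found_type != attr_type.lower():
            continue
        if wanted_option and wanted_option.lower() in options:
            return values
        if not options:
            fallback = values
    return fallback


# Hypothetical entry, as a non-AD LDAP server might return it.
entry = {
    "displayName": ["Johannes van der Waals"],
    "displayName;lang-nl-BE": ["Johannes Van Der Waals"],
    "displayName;lang-nl-NL": ["Johannes van der Waals"],
}

belgian = pick_value(entry, "displayName", "lang-nl-BE")
default = pick_value(entry, "displayName", "lang-fr")  # no such option stored
```

The point of the sketch is that the context lives in the attribute description itself, so no delimiter has to be invented inside the value.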
From a data modeling perspective, that’s much cleaner. It would also have been a much better way to store a number of things in AD than what various product teams ended up with. Take, for example, the displaySpecifier excerpt I sent earlier this week:
attributeDisplayNames (85): co,Country/Region; c,Country/Region Abbreviation; mSDS-PhoneticCompanyName,Phonetic Company Name; mSDS-PhoneticDepartment,Phonetic Department; mSDS-PhoneticDisplayName,Phonetic Display Name; mSDS-PhoneticFirstName,Phonetic First Name; mSDS-PhoneticLastName,Phonetic Last Name; distinguishedName,Distinguished Name; <snip>
An attribute option would have been a cleaner way to represent this data, instead of asserting an arbitrary comma delimiter to denote the data relationship. The Exchange product has a *lot* of examples in AD where this would be a better data modeling choice too, for example the proxyAddresses attribute.
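As a rough illustration of what the delimiter approach costs, every consumer of attributeDisplayNames has to re-implement parsing along these lines. The sample values are taken from the excerpt above; the function name is hypothetical, and splitting on the first comma only is my assumption about the safest way to read the format.

```python
def parse_attribute_display_names(values):
    """Map each LDAP display name to its friendly label, given strings
    in the 'ldapDisplayName,Friendly Name' form."""
    mapping = {}
    for value in values:
        # Split on the first comma only, so a comma inside the friendly
        # label (if one ever appears) stays part of the label.
        ldap_name, _, friendly = value.partition(",")
        mapping[ldap_name.strip()] = friendly.strip()
    return mapping


values = [
    "co,Country/Region",
    "c,Country/Region Abbreviation",
    "mSDS-PhoneticDisplayName,Phonetic Display Name",
    "distinguishedName,Distinguished Name",
]
display_names = parse_attribute_display_names(values)
```

Nothing in the directory enforces that convention, which is the weakness: a value with no comma, or with the comma in the wrong place, is accepted on write and only fails when some client tries to parse it.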
But AD doesn’t support attribute options.
So why did I bring it up? Well, I know a lot of places that use other LDAP directories to master their name data. And those that have tackled these harder issues leverage attribute options to accurately represent the name data in all these various formats.
And using an arbitrary delimiter, mimicking what Microsoft has done in a couple of cases, isn’t really a good option here.
There’s another AD limitation too. Per the RFC, sn is supposed to be multi-valued. But in AD, it’s single-valued! So with that attribute, you can’t even invent an arbitrary scheme to represent the various versions of the data you’d like to store. Eight to ten years ago, when I pressed Microsoft PMs hard on this issue, I was told that yes, they recognized they screwed up, but there was no way they could fix it because it’d cause too many problems.
I’ve just talked about a bunch more challenges and not provided any help moving forward, but I think that’s actually helpful, because context is important.
From a high level, I think you need to recognize that your best outcome is providing something pragmatically useful, because the underlying product hasn’t given you the flexibility required to solve all these issues ideally. You do have the option of creating custom attributes, recognizing that none of the MS products that integrate with AD user data will leverage your custom attributes (I think Exchange is partially an exception here). Maybe that’s OK, and you use the displayName attribute to roll up all the bits of name data, both from the standard schema and from your custom schema. But you’ll need to find a pragmatic compromise for the displayName value and the sn value, as those are single-valued and you really have no flexibility with them.
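The roll-up could look something like the sketch below. The custom attribute names `contosoNamePrefix` and `contosoNameSuffix` are hypothetical stand-ins for whatever schema extensions you define, and the ordering and comma-before-suffix formatting are just one possible compromise, not a recommendation.

```python
def build_display_name(person):
    """Compose a single displayName from the name parts on a person record.

    `person` is a plain dict of attribute name -> value; missing or blank
    parts are skipped.
    """
    parts = [
        person.get("givenName"),
        person.get("contosoNamePrefix"),  # hypothetical custom attribute, e.g. "van der"
        person.get("sn"),
    ]
    name = " ".join(part.strip() for part in parts if part and part.strip())
    suffix = (person.get("contosoNameSuffix") or "").strip()  # hypothetical, e.g. "Jr."
    return f"{name}, {suffix}" if suffix else name


dutch = build_display_name(
    {"givenName": "Johannes", "contosoNamePrefix": "van der", "sn": "Waals"}
)
junior = build_display_name(
    {"givenName": "John", "sn": "Smith", "contosoNameSuffix": "Jr."}
)
```

Keeping the parts in separate attributes and generating displayName from them means the formatting decision lives in one function that you can change later and re-run.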
Diving deeper into the mechanics of which parts of the name data you store in which attributes, I’d bet the choices vary widely from one AD deployment to the next. The key here will be to make a decision for your implementation, document it, and then implement it.
Another thing to keep in mind when doing the data design is the ANR (ambiguous name resolution) attribute set. By default, the ANR attribute set includes: givenName, sn, displayName, RDN, legacyExchangeDN, physicalDeliveryOfficeName, and proxyAddresses. ANR is a common search built into a lot of the MS integrated products, and this may influence where you decide to stick data. One thing to keep in mind is that if the ANR search term has a space in it, the search will split the term at the space and try the two substrings against givenName and sn, in both orders (in addition to trying the whole term against all the other attributes). In other words, the ANR search makes it less important how you choose to format the displayName value. ANR may also influence your data design in another way: if there’s name data you want to store but most folks wouldn’t want to search on it, you’d want to avoid putting it in that set of attributes. On the other side of the coin, maybe you’ll want to change the ANR attribute set from the default so more attributes are searched.
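To reason about where a piece of name data will be found, it can help to write out roughly what an ANR search expands to. The sketch below builds an approximate equivalent LDAP filter string on the client side. It is only an approximation: the real expansion happens inside the domain controller and differs in its details, the attribute list simply mirrors the one in the text (with `name` standing in for the RDN), and the function names are hypothetical.

```python
ANR_ATTRIBUTES = [
    "givenName", "sn", "displayName", "name",
    "legacyExchangeDN", "physicalDeliveryOfficeName", "proxyAddresses",
]


def escape(value):
    """Escape the characters that are special inside an LDAP filter value."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def approximate_anr_filter(term, attributes=ANR_ATTRIBUTES):
    """Return a filter that roughly mimics (anr=term): a prefix match on each
    ANR attribute, plus both givenName/sn orderings when the term has a space."""
    term = term.strip()
    clauses = [f"({attr}={escape(term)}*)" for attr in attributes]
    first, space, rest = term.partition(" ")
    rest = rest.strip()
    if space and rest:
        a, b = escape(first), escape(rest)
        clauses.append(f"(&(givenName={a}*)(sn={b}*))")
        clauses.append(f"(&(givenName={b}*)(sn={a}*))")
    return "(|" + "".join(clauses) + ")"


two_part = approximate_anr_filter("Johannes Waals")
one_part = approximate_anr_filter("Waals")
```

Reading the expanded filter makes the trade-off visible: a search for "Johannes Waals" can match on givenName plus sn even when displayName is formatted as "Waals, Johannes van der", while anything stored outside these attributes is invisible to ANR.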
My AD has some name data standardization applied to it. But I have a mess when it comes to the source systems that supply the name data, and when it comes to merging all those data sources. The source systems gave very little thought to these kinds of issues: in some cases they hold case-insensitive data, there is a mix of legal and preferred names to choose from, and I have to deal with the FERPA issue I mentioned earlier. I’ve got a bunch of code to wrangle all of that, but to be honest, maintaining it and having to live with the limitations is the worst technical part of my job. For the past 5+ years, I’ve been pitching a different approach which ditches all of the source systems and creates a new single source of preferred name data. That new single source allows users to assert their preferred name, but applies a set of name data rules at the point of input to ensure consistency and to prevent the preferred name from deviating too far from what the various source systems hold. This replaces the code I maintain with input validation, and eliminates the complexity associated with wrangling all the various source systems.
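As a sketch of what “name data rules at the point of input” might look like, the function below normalizes a submitted preferred name and reports rule violations. The specific rules (whitespace collapsing, a length limit, an allowed character set, and rejecting all-caps or all-lowercase input) are hypothetical examples of consistency rules, not the rules I have actually proposed, and a check against the source systems’ legal name is left out.

```python
import unicodedata

ALLOWED_PUNCTUATION = " '-."


def validate_preferred_name(preferred, max_length=64):
    """Return (cleaned_name, problems). An empty problems list means the
    name passed every rule in this sketch."""
    # Normalize Unicode composition and collapse runs of whitespace.
    cleaned = " ".join(unicodedata.normalize("NFC", preferred).split())
    if not cleaned:
        return cleaned, ["name is empty"]

    problems = []
    if len(cleaned) > max_length:
        problems.append("name is too long")
    if not all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in cleaned):
        problems.append("name contains unexpected characters")
    # Reject input that carries no usable case information.
    if cleaned.isupper() or cleaned.islower():
        problems.append("name should use mixed case")
    return cleaned, problems


ok = validate_preferred_name("  Johannes   van der Waals ")
shouting = validate_preferred_name("JOHANNES WAALS")
digits = validate_preferred_name("Jo3")
```

Because the rules run when the user types the name, a rejected value never enters the directory, so there is nothing to clean up downstream.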
I hope all of that is helpful in some part.