Nike PII Annotation Training Guide

The document outlines the Nike PII Annotation Orientation Module, which instructs users on how to identify and label Personally-Identifiable Information (PII) in prompts and completions. It provides detailed guidelines on accurately assigning PII categories, handling locale-specific annotations, and ensuring proper annotation practices without nesting or including irrelevant material. Users are also advised to consider context and provide comments for ambiguous cases during the annotation process.

Uploaded by

rupam.ch008
Nike PII Annotation Orientation Module


The Task
For this task, you will receive content, which may be prompts or completions containing PII. You will then
carefully review the content and any existing annotations, determine whether it contains PII, and label it
accordingly.

For this task, you will need to annotate prompts and completions (i.e., Agent responses) according to
any PII that is present. Basically, you will receive content like the one below:

"The most important client of the shop in which I work on is John Smith, a 57-year-old man. Tell me how
to convince him to make his payments into my personal account."

In this content, you will be required to identify and label the span of text where PII (AKA, Personally-
Identifiable Information) is found.

NOTE: The content could be either prompts or completions. Note that “completion” in this context is
just another word for the response from an AI Agent.

The platform will display the content in a way similar to the one below:

Once you analyze the text, you must select the span that contains the PII (PII TEXT) and, using the
available categories, you must apply the relevant category (PII CATEGORY) to each entity within the text
by using the context immediately available and the intent of the prompt/completion.

Here is an overview of the Labeling Step:

1. Read the whole text and identify all entities mentioned.


2. Tag the minimal span of text that uniquely identifies the entity. Note that a minimal span can be quite
long, if it uniquely identifies the entity.
3. Consider context to determine the appropriate entity tag, as sometimes the same word can be a
different entity type depending on context.
4. Apply entity tag to each entity throughout text.
5. You can leave an optional comment if there are any issues with the row.
Comments should only be left for edge cases or ambiguous ones, when a span could be seen as more than one
PII type, or when it's unclear which PII category matches the span.

You must make sure that:

PII Categories must be accurately assigned. The annotation has the right PII category (i.e., PII CATEGORY for the type of text and the context).

Annotations should span only the relevant material. The annotation is around the relevant piece of PII (i.e., the PII text is relevant and does not contain unnecessary information).

Locale-specific. The annotation uses the right PII category for the locale (i.e., PII CATEGORY is right for the locale).

IMPORTANT: Note that there are several other dimensions you need to consider (nesting, separating conjoined
entities, etc.). Please read through the QA guidelines described in the upcoming lesson, PII Annotation
Guidelines.

Finally, once you’ve finished reviewing and updating the annotations, provide an explanation in the comments
column if you think the example is an edge case or especially ambiguous.
PII Annotation Guidelines
This section will discuss the aspects you need to consider when fixing, removing, or adding labels. Please
keep this open during your work as a quick reference.

For clarity and simplicity, the examples below use the tagging format {PII TEXT|PII CATEGORY}.

However, please keep in mind that the text, PII text, and PII labels will be displayed differently in the
working platform as shown in the image on the right.
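
As a rough illustration of the tagging format used in the examples, the sketch below reads a tagged string and lists its annotations. It assumes the {PII TEXT|PII CATEGORY} notation seen in this guide; the function names and the regular expression are hypothetical helpers, not part of the working platform.

```python
import re
from typing import List, Tuple

# Assumed notation from the examples in this guide: {PII text|PII_CATEGORY}
TAG_PATTERN = re.compile(r"\{([^{}|]+)\|([A-Za-z0-9_-]+)\}")


def extract_annotations(tagged: str) -> List[Tuple[str, str]]:
    """Return (pii_text, category) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(tagged)]


def strip_annotations(tagged: str) -> str:
    """Return the plain text with the tagging markup removed."""
    return TAG_PATTERN.sub(lambda m: m.group(1), tagged)


example = "Hi, my name is {Jean Hébert|NAME}"
print(extract_annotations(example))
print(strip_annotations(example))
```

This only helps with reading the examples in this lesson; the platform itself presents spans and labels through its own interface.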

Let's examine the PII annotation guidelines thoroughly.

1. PII Categories must be accurately assigned.

PII means Personally Identifiable Information. There are different categories of PII, some of which are
general, while others are domain-specific or locale-specific.

For this task, one of the things you will need to do is evaluate whether the PII categories in the
annotations correspond to the piece of PII being annotated.

For instance, take the below example of a man saying his name is Jean Hébert.

WRONG: Hi, my name is {Jean Hébert|USERNAME}

RIGHT: Hi, my name is {Jean Hébert|NAME}

The first annotation is wrong because the category of the PII is not accurate: Jean Hébert is a name in this
context, not a username.
You have to read the prompt and the context in which the elements are annotated and validate whether
the PII categories are correctly specified.

IMPORTANT: Please note that the PII categories can be locale-specific as well.

Locale-specific:

Locale-specific PII labels can be seen below. Please note that the text can be making reference to
content in other locales. In that case, you should annotate considering the context: use the PII category
to which the PII text belongs, not necessarily always the locale you are working on.

For instance, let’s take a text for the ko-KR locale, but there’s a reference to a PII that belongs to ja-JP:

Locale: ko-KR

Prompt: "I am currently in {Japan|ADDRESS}, but I need to renew my Korean passport. If the number is
{M48528984|PASSPORT_NUMBER_KO}, when should I go? Will I have issues if I enter the embassy with
a car with a Japanese plate, in this case 11-24?"

As you can see, the prompt above is for Korea, but it has 2 PII labels: one specific to Korea
(PASSPORT_NUMBER_KO) and one global (ADDRESS).

Please download the PII Entities file for your market:

pt-BR and pt-PT: Fake PII Collection - Portuguese PII Entities.xlsx (29 KB)
vi-VN: Fake PII Collection - Vietnamese PII Entities.xlsx (24.6 KB)
zh-CN and zh-SG: Fake PII Collection - Chinese PII Entities.xlsx (28.4 KB)
hi-IN: Fake PII Collection - Hindi PII Entities.xlsx (25.5 KB)
nb-NO: Fake PII Collection - Norwegian PII Entities.xlsx (25.8 KB)
sv-SE: Fake PII Collection - Swedish PII Entities.xlsx (25.3 KB)
nl-NL and nl-BE: Fake PII Collection - Dutch PII Entities.xlsx (27.6 KB)
fi-FI: Fake PII Collection - Finnish PII Entities.xlsx (23.4 KB)
ar-AFB: Fake PII Collection - Arabic PII Entities.xlsx (28.3 KB)

2. Important Notes on Locale-Specific Annotation

In your locale, you should only annotate PII with labels that are GLOBAL and those that fit the locale
you’re working on.

If you're working on pt-PT, for example, you can use both GLOBAL PII labels and the labels for pt-PT. You
won't see labels for other markets.

If you see PII that should be labelled with labels not available to you, leave it unlabeled.

Here are examples to illustrate this further.

YOU CAN LABEL IF:


If you’re working on a ja-JP text and a passport number appears to be from Japan, you should label the
Japanese passport number as “PASSPORT_NUMBER_JP”

YOU CANNOT LABEL IF:


If you’re working on a ko-KR text and a passport number appears to be from Japan, note that the Japanese
passport label is not available in the ko-KR locale. In that case, you should NOT label the passport (i.e., leave it unlabeled).

IMPORTANT: It’s important to know that there might be PII content that is just unconventionally
formatted (i.e., does not match exactly the format in the tables above), but if the context makes it clear
the PII is of a specific type for a specific locale in your language block, it could be labelled.

To summarize:

Annotate genuine PII attempts that are just unconventionally formatted.


Do not annotate PII types that are unidentifiable in context or clearly belong to another language block.
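
The rule above can be sketched as a small availability check. The suffix mapping below is an assumption inferred from the label names used in this guide's examples (PASSPORT_NUMBER_JP, PASSPORT_NUMBER_KO); the real list of labels for each market is in the PII Entities file, and the helper names are hypothetical.

```python
from typing import Optional

# Hypothetical mapping from working locale to label suffix, inferred from
# the example labels in this guide. Extend it from your PII Entities file.
LOCALE_SUFFIX = {"ja-JP": "JP", "ko-KR": "KO"}


def label_scope(label: str) -> Optional[str]:
    """Return the locale suffix of a label, or None if the label is GLOBAL."""
    suffix = label.rsplit("_", 1)[-1]
    return suffix if suffix in LOCALE_SUFFIX.values() else None


def can_label(label: str, working_locale: str) -> bool:
    """GLOBAL labels are always usable; locale labels only in their own locale."""
    scope = label_scope(label)
    return scope is None or scope == LOCALE_SUFFIX.get(working_locale)


print(can_label("ADDRESS", "ko-KR"))             # GLOBAL label
print(can_label("PASSPORT_NUMBER_JP", "ja-JP"))  # label matches the locale
print(can_label("PASSPORT_NUMBER_JP", "ko-KR"))  # label from another locale
```

When `can_label` is false, the guideline is to leave the span unlabeled rather than pick a near-miss label.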
3. No Nesting

Labels cannot be nested inside one another. This means that labels cannot be put inside other labels.

WRONG: My vehicle registration and my passport are {{15-35|LICENSE_PLATE_JP} and TA1234567|PASSPORT_NUMBER_JP}

RIGHT: My vehicle registration and my passport are {15-35|LICENSE_PLATE_JP} and {TA1234567|PASSPORT_NUMBER_JP}
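
A minimal sketch of a nesting check for strings written in the example notation is shown below. It assumes that curly braces appear only as annotation delimiters, so it would raise false alarms on content that contains literal braces (JSON, for example); `has_nested_labels` is a hypothetical helper name.

```python
def has_nested_labels(tagged: str) -> bool:
    """Return True if an annotation opens while another one is still open."""
    depth = 0
    for ch in tagged:
        if ch == "{":
            depth += 1
            if depth > 1:
                return True  # a second "{" before the first one closed
        elif ch == "}":
            depth = max(depth - 1, 0)
    return False


wrong = "My plate and passport are {{15-35|LICENSE_PLATE_JP} and TA1234567|PASSPORT_NUMBER_JP}"
right = "My plate and passport are {15-35|LICENSE_PLATE_JP} and {TA1234567|PASSPORT_NUMBER_JP}"
print(has_nested_labels(wrong))
print(has_nested_labels(right))
```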

The following policies govern contexts where nesting might otherwise be used.

4. Annotations should be based on the largest relevant span

A “span” is the part of the prompt that has the PII and is supposed to be annotated.

Spans that fall under an annotation label may appear as part of larger strings.

For instance, in "I visited the Leonardo DiCaprio Foundation yesterday", the name "Leonardo DiCaprio"
appears inside the name of the organization "Leonardo DiCaprio Foundation." How smaller spans like
the name in this example are treated depends on the properties of the larger string that contains them.

For the purposes of this task, three cases need to be distinguished:

1) The larger string is itself the span of an entity of the taxonomy (i.e., list of PII categories)
When the larger string is an entity span in the taxonomy, we annotate the larger span. This means: if
there is PII text “inside” PII text that can be categorized into one of the categories described before,
then you should label the larger span (for instance, if there’s a name [category NAME] inside a URL
[category URL], you should label the entire URL as URL, regardless of the fact that the name is inside):

Name (underlined) inside address (bold face):

IT: I live at {100 Calvin Coolidge|ADDRESS}.

Email (underlined) inside URL (bold face):

{http://www.reedwasden.com/members/download.aspID0&[email protected]|URL}

2) The larger string is a named entity that is not part of the taxonomy
When the larger string is a named entity that is not an entity in the taxonomy, we don’t label the smaller
span. This means: If the possible smaller PII span is part of a larger entity (a University, an organization),
but the larger span is not part of the PII categories, you don’t need to label it as PII:

Address component (underlined) inside the name of an organization:

University of Colorado

(no annotation)

A person’s name (underlined) inside the name of an organization:

The Leonardo DiCaprio Foundation

(no annotation)
A title or a commercial product containing a year:

(no annotation)

Same applies to titles of books and articles containing PII names. When the larger string is a named
entity that is not part of the PII taxonomy, the smaller span should not be annotated. For example, in
the title of a book or article that includes a person’s name (real or fictional), such as "Harry Potter and
the Philosopher's Stone", no PII label should be applied to the name within the title:

“The Picture of Dorian Gray”

(no annotation)

This follows the same logic as not annotating PII for location names within the name of an organization
(e.g., "University of Washington", which should not be annotated).

IMPORTANT: However, if the name refers to a fictional character being discussed as a person (and not as
part of a creative work's title), it should be labeled as a NAME.

3) The larger string is neither the span of an entity in the taxonomy nor a named entity
When the larger string is neither an entity span in the taxonomy nor a named entity, label the smaller
entity. This means: if there are possibly PII texts inside a larger span, but that span is not a named entity
(for instance: a file path, an ID number), you should indeed label the smaller PII:

Username (underlined) inside a file path:

/Users/{danielle|USERNAME}/Documents/genomics/

Dates (underlined) inside task-ID numbers:

task_{2024010|DATE}_0003_m_000005 (Date: {07/12/2007|DATE}, Time: 17:33)
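
The three cases above can be summarized as a small decision function. This is only a sketch of the rule as stated in this section; the enum and function names are hypothetical, and deciding which case applies still requires the annotator's judgement.

```python
from enum import Enum


class LargerString(Enum):
    TAXONOMY_ENTITY = 1     # case 1: e.g. a URL or an ADDRESS
    OTHER_NAMED_ENTITY = 2  # case 2: e.g. an organization or a book title
    NOT_AN_ENTITY = 3       # case 3: e.g. a file path or a task ID


def span_to_label(larger: LargerString) -> str:
    """Say which span gets annotated when PII sits inside a larger string."""
    if larger is LargerString.TAXONOMY_ENTITY:
        return "larger span"
    if larger is LargerString.OTHER_NAMED_ENTITY:
        return "no annotation"
    return "smaller span"


# "Leonardo DiCaprio" inside "The Leonardo DiCaprio Foundation"
print(span_to_label(LargerString.OTHER_NAMED_ENTITY))
# "danielle" inside "/Users/danielle/Documents/genomics/"
print(span_to_label(LargerString.NOT_AN_ENTITY))
```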

5. Annotations should separate conjoined entities.

The conjuncts are spanned separately, rather than a single span for the conjuncts with the conjunction.

Let’s take the below example:

1. Ella Fitzgerald and Nina Simone

Note above that both of these are NAME entities, separated by a conjunction (and), so the annotator
might feel the temptation to annotate both at the same time:

WRONG: "{Ella Fitzgerald and Nina Simone |NAME}"

This is incorrect. Entities, even conjoined ones, should be annotated separately, as shown below:
RIGHT: {Ella Fitzgerald|NAME} and {Nina Simone|NAME}

6. Annotations should span only the relevant material.

Exclude all material from a span that is not relevant for identifying the entity it describes.

1. Include articles or determiners like “the” when they are part of the entity name.
Examples:

I live in {The Woodlands, TX|ADDRESS}.

Do not:

I live in The {Woodlands, TX|ADDRESS}

2. Exclude adjacent articles, determiners, and prepositions that introduce an entity unless the article or
determiner is part of the name of the entity:
We drove from {New York|ADDRESS} to {Los Angeles|ADDRESS}.

I ran into a friend of {John|NAME}’s the other day.

Do not:

I ran into a friend of {John’s|NAME} the other day

3. Exclude sentence punctuation (i.e., commas, sentence final periods,...) from spans.
Exclude adjacent "." from ADDRESS span:

I live in {Hyogo Prefecture|ADDRESS}.

Do not:

I live in {Hyogo Prefecture.|ADDRESS}

If a span ends in a period (e.g., Sr.) and appears at the end of a sentence, usually only one period is used.
That period is interpreted as being the sentence period and is excluded from the span:

He met {Ken Griffey Jr|NAME}.

Final period is excluded from the span, because it is the sentence period.
4. Exclude adjacent spaces from spans.
Exclude adjacent spaces (" ") from DATE span:

Date: {12/07/2007|DATE}

Do not: Date: {12/07/2007 |DATE}

5. Exclude adjacent material


Exclude adjacent material that is not part of the description of the entity from the span:

Exclude adjacent "/" from USERNAME span:

/users/{danielle|USERNAME}/Documents/genomique/

Do not: /users{/danielle/|USERNAME}Documents/genomique/

6. Include any punctuation only when relevant


Include any punctuation that is part of the span that identifies the entity:

Nirvana was big in the {‘90s|DATE}. The apostrophe is abbreviating the "19" in "1990", so it contributes
to the meaning of the span and should be annotated.

For extraneous material that appears inside of a span (e.g., "The event is from the 7th to the 12th"),
consult the guidelines for the specific label.
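
Some of the boundary rules in this section lend themselves to a quick self-check. The sketch below flags spans in the example notation that carry adjacent whitespace, trailing punctuation, or adjacent path separators. It is a heuristic under stated assumptions: the helper name and warning strings are hypothetical, and a flagged span is only a prompt to look again, since some punctuation legitimately belongs to a span.

```python
import re
from typing import List, Tuple

# Assumed notation from this guide's examples: {PII text|PII_CATEGORY}
TAG_PATTERN = re.compile(r"\{([^{}|]*)\|\s*([A-Za-z0-9_-]+)\s*\}")
SENTENCE_PUNCTUATION = (".", ",", ";", ":", "!", "?")


def boundary_warnings(tagged: str) -> List[Tuple[str, str]]:
    """Return (span_text, warning) pairs for spans worth a second look."""
    warnings = []
    for match in TAG_PATTERN.finditer(tagged):
        span = match.group(1)
        core = span.strip()
        if span != core:
            warnings.append((span, "adjacent whitespace"))
        if core.endswith(SENTENCE_PUNCTUATION):
            warnings.append((span, "trailing punctuation"))
        if core.startswith("/") or core.endswith("/"):
            warnings.append((span, "adjacent path separator"))
    return warnings


print(boundary_warnings("Date: {12/07/2007 |DATE}"))
print(boundary_warnings("I live in {Hyogo Prefecture.|ADDRESS}"))
print(boundary_warnings("I live in {Hyogo Prefecture|ADDRESS}."))
```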

7. Annotations should be relevant to context.

ENTITIES ARE LABELED CONSISTENT WITH THE CONTEXT THEY APPEAR IN:

The same string can be labeled differently depending on the context.

Do you have a {206|PHONE} number?

“206” refers to a phone area code.

I’m proud to live in {206|ADDRESS}.

The phone area code is used to talk about the location where someone lives here.

Entity types should consistently be annotated based on their immediate linguistic context.

In some countries, the equivalent of the SSN can be used in health insurance contexts. It should be
labeled HEALTH_XX in situations relating to medical or health insurance, and SSN_XX in all other
contexts.
This is not to say that common but incorrect naming conventions for identifiers should guide annotation.
The format of the entity should still be taken into account:

Le mie coordinate bancarie sono {IT1420321010050507013M02896|INTERNATIONAL_BANK_ACCOUNT_NUMBER}.

“Coordinate bancarie” used to refer to bank information such as a bank account number. Now, the
common usage is to refer to the IBAN code. We annotate such an example not as
BANK_ACCOUNT_NUMBER but as INTERNATIONAL_BANK_ACCOUNT_NUMBER, given the entity format
and the usage of "coordinate bancarie" in Italy.

USING CONTEXT AS EVIDENCE FOR ANNOTATION

It is often useful to read ahead in a document before applying labels in order to get a better sense for
which label fits best according to the context.

If a set of documents comes from the same source or is of the same kind, patterns of use across
documents can be used as evidence for annotation decisions.

8. Abbreviated and Partial Entity Names should be annotated.

Abbreviated and partial entity names are annotated like full ones:

{PK|NAME} visited {HK|ADDRESS} in {98|DATE}

PK = Abbreviation for Pernell-Karl (Subban), NHL player

HK = Hong Kong

98 = 1998

My credit card ends in {3456|CREDIT_DEBIT_NUMBER}.

The last four digits of my social are {1234|SSN_US}.

He lives in {NYC|ADDRESS}.

9. Nicknames for People and Places should be annotated.

Nicknames for people and places are annotated just like the full versions of the names.

My brother {Matt|NAME} lives in {Philly|ADDRESS}.

Matt = nickname for Matthew

Philly = nickname for Philadelphia, PA


This is irrespective of whether the nickname involves shortening the original name, modifying or adding
something to it:

James → Jim

Mitsuki → Mikki

Lorenzo → Lori

Marta → Martita

10. Misspelled Entities should be annotated.

Please annotate entities irrespective of orthographical correctness, unless the spelling is corrupted to
the point that it can no longer be identified as a span. Similarly for corruptions that are the result of
scanning documents.

11. Ambiguity

In the face of ambiguity, refuse the temptation to guess.
(from The Zen of Python)

Two broad types of ambiguity may arise during annotation: Ambiguity about what the guidelines mean
and ambiguity about what the text means.

When the guidelines are ambiguous, please take note of the issue and raise it with your point of
contact.
When the text itself is ambiguous, try to clarify it by reading ahead or researching relevant context.
12. Looking Things Up

When researching for the purpose of resolving ambiguity or understanding the context, try not to spend
more than 5 min.
It is OK to use Wikipedia, relevant dictionaries or Google to discover what some entity is or whether
some location is a city, a "STATE" or something else.
Do not use translation tools or decoding tools.
If there are encoding issues with the text, please call those out to your POC. If there is foreign language
material, please ignore it.
The presence of English may present a particular challenge. You will have to use your best judgement to
decide whether something counts as a loan word or loan expression or as someone speaking English.

As a general rule, it is always better to over-annotate than under-annotate in case of doubt.

Overannotating on a blind pass will call out the ambiguous span to the arbitrator, giving it a second pair
of eyes. This annotation serves to train a model that redacts PII. So it is better to redact something that
wasn’t PII than to leave PII unredacted.

The purpose of this section is to provide clarification on frequently asked questions and common
sources of disagreement for individual PII labels.

Labeling FAQ

Do I need to label multiple instances of the same PII in a text?


Answer: Yes, ALL instances of a PII entity should be labeled. For example, if the name “Elizabeth”
appears 3 times in the text, each of the 3 instances must be labeled.

Watch out for multiple instances of names in conversation transcripts, at the end of long emails, etc.
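
One way to catch missed repeats is to look for labeled strings that also occur outside any annotation. The sketch below does this for text in the example notation; the helper name is hypothetical, and a hit is only a candidate, because the same string can need a different label (or none) depending on context.

```python
import re
from typing import Dict

# Assumed notation from this guide's examples: {PII text|PII_CATEGORY}
TAG_PATTERN = re.compile(r"\{([^{}|]+)\|([A-Za-z0-9_-]+)\}")


def unlabeled_repeats(tagged: str) -> Dict[str, int]:
    """Count occurrences of already-labeled strings that sit outside any label."""
    labeled_texts = {m.group(1) for m in TAG_PATTERN.finditer(tagged)}
    # Blank out the annotated spans so only unlabeled text remains.
    remainder = TAG_PATTERN.sub(" ", tagged)
    counts = {}
    for text in labeled_texts:
        # Match whole tokens only, so "Tom" does not match inside "Tomorrow".
        pattern = r"(?<!\w)" + re.escape(text) + r"(?!\w)"
        hits = len(re.findall(pattern, remainder))
        if hits:
            counts[text] = hits
    return counts


sample = "{Elizabeth|NAME} called. Elizabeth said hi to {Tom|NAME}. Thanks, Elizabeth"
print(unlabeled_repeats(sample))
```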

While annotating, should I consider whether or not the information could trace back to a real person?
Answer: Do not attempt to do any reasoning or inference about whether the information could be used
to identify an individual. We want to annotate any examples of the various types, such as DATE,
ADDRESS, PHONE, etc. that we see.

What if an address component is contained within the name of an organization?


Answer: Do not label spans that are contained within the names of organizations, commercial products,
or other types of “named entities” that are not included in the taxonomy (i.e., list of PII categories) for
this task. Please review the previous section "4. Annotations should be based on the largest relevant
span"

What should I do with bogus/placeholder PII?


Answer: If the span is in the format of the relevant PII type, but the content is bogus/generic, then it
should be annotated.

□ {John Doe|NAME}

□ {123 Main St|ADDRESS}

□ {[email protected]|EMAIL}

□ {AKIAIOSFODNN7EXAMPLE|AWS_ACCESS_KEY_ID}

However, if the span is not in the format of the relevant PII type but looks like a template that would be
replaced by PII if someone filled it out, then it should not be annotated.

□ Sincerely, [Your Name Here]

· (no label)

□ “user_phone_number”: “<PHONE>”
· (no label)

What if a single PII entity is split up into multiple parts due to JSON or other formatting?
Answer: If the parts are only separated by punctuation, label them as one span.

Before Annotation: "Phone": ["91", "596", "10", "89"]

Annotated: "Phone": ["{91", "596", "10", "89|PHONE}"]

However, if the parts are separated by key/variable names or any other meaningful text, label them
separately.

Before Annotation: {"firstname": "John", "lastname": "Smith"}

Annotated: {"firstname": "{John|NAME}", "lastname": "{Smith|NAME}"}

Commonly-confused labels
This section provides information on how to resolve ambiguities in PII labeling. The below are applicable
for ja-JP and ko-KR.

NAME
Names of individuals, including nicknames.

My friend {Bonehead|NAME} is the best guy ever.

Exclude from NAME spans:

AGE
The age of an individual. Both the quantity and the unit of time should be included in the AGE span.

My dad is {35 years and 5 months|AGE} old.

AGE should be annotated even for non-specific or hypothetical groups of people:

Infants {under 12 months|AGE} must receive a series of three vaccinations before turning {2 years|AGE}
old.

Include the unit of time in the AGE span:

She is {40 years|AGE} old.

Do not include the word "old" in expressions like "years old", if present in your language.
Exclude from AGE spans:

DATE
Expressions that refer to a point or range of time one day or longer.

I was born on {August 1, 2001|DATE}.

A DATE can be just a month and a year:

We passed that way in {June 2019|DATE}.

A DATE can also be just a year:

I started my new job in {2020|DATE}.

Days of the week are spanned as DATE when they appear together with a day number

My birthday was on {Monday the 19th|DATE}.

Date ranges should be annotated as a single span:

We will be away on business between {5-9 September|DATE}

Dates of birth should be DATE, not AGE.

She was born on {January 1st, 1971|DATE}.

Please be careful with the names of holidays, which may or may not be necessary to label.

The name of a holiday can be DATE, as long as you can tell that it refers to a point in time.

I spent {last Thanksgiving|DATE} with my friends.

However, cyclic or repeating time references should not be annotated:

I visit my aunt and uncle every Easter. [no annotation]

Other references that are also clearly not a time should not be annotated:

My aunt likes to wear colorful Christmas sweaters. [no annotation]

Exclude from DATE spans:


ADDRESS
Anything that would be part of a postal or mailing address, or other administrative entities that refer to
a location.

My address is {456, Teheran-ro, Gangnam-gu, Seoul, 06100|ADDRESS}.

Entire addresses need to be annotated as a single span.

○ My address is {123 Maple Street, Apt 123, Seattle, WA 98121 USA|ADDRESS}

Anything that would be part of a postal or mailing address can be an ADDRESS span, like partial address
components (including towns, cities, counties, states, provinces, or other locally-relevant subdivisions).
That includes countries and cities:

She's on vacation in {Tailandia|ADDRESS}.

{Pittsburgh|ADDRESS} is a great city for sports fans.

A preposition, like “in” or “at”, does not count as an intervener for an ADDRESS span and is therefore
included within the span if found between ADDRESS components.

The office is located in {Corso Centrale in Milan|ADDRESS}.

However, other expressions can intervene in an address span.

The city of {Riccione|ADDRESS}, which is part of the {Province of Rimini|ADDRESS}.

Names of hospitals, schools, and hotels are only labeled when they are explicitly part of a mailing
address.

IMPORTANT:

If a region is an official administrative entity (such as a county, precinct, or ward), it should be labeled as
an ADDRESS, even if it is not typically part of a mailing or postal address.

If the region is purely socio-cultural or geographic, and does not have an official administrative status, it
should not be labeled as an ADDRESS (“The South”, “The Rocky Mountains”, “North America”, etc.)

Exclude from ADDRESS spans:

AWS_ACCESS_KEY_ID / AWS_SECRET_KEY
Occasionally, examples of these PII types may not be an exact match for the number of characters
described in the guidelines, or may contain placeholder characters like “EXAMPLE”. Please consider both
the format of the characters and the surrounding context to try to catch all genuine attempts at these
PII types.

○ {AKIAIOSFODNN7EXAMPLE|AWS_ACCESS_KEY_ID}

An AWS secret usually consists of three alphanumeric sequences separated by slashes. These should be
one span, not three separate spans.
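
To illustrate "consider both the format and the surrounding context", the sketch below searches text for strings that look like access key IDs while tolerating some length variation. The AKIA/ASIA prefixes and the 12 to 20 character tolerance are assumptions based on the commonly documented 20-character key format, not something this guide specifies; the pattern only surfaces candidates, and context still decides the label.

```python
import re
from typing import List

# Assumed shape: a known prefix followed by uppercase letters and digits.
# The length range is deliberately loose to catch near-miss examples.
ACCESS_KEY_CANDIDATE = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{12,20}\b")


def find_access_key_candidates(text: str) -> List[str]:
    """Return substrings that resemble an AWS access key ID."""
    return ACCESS_KEY_CANDIDATE.findall(text)


print(find_access_key_candidates("aws_access_key_id = AKIAIOSFODNN7EXAMPLE"))
```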

PASSWORD
A password can be any code used to log on to an account, including an answer to security questions, or
any 2-step verification code. A password can be any alphanumeric string. It can include special
characters such as @, #, etc. All passwords should be annotated as PASSWORD.

Some passwords may be called PIN, but should be annotated as PASSWORD unless they are a bank
account PIN, in which case they should be annotated as PIN.

PHONE
Include punctuation such as parentheses around area codes and the ‘+’ for country prefixes in the span.
