This dataset contains anonymized user profiles collected from xing.com in response to 57 queries. It was used in Jan-Feb 2017 to study gender biases in the returned ranked search results given each user's profile details.
//\ date and commit
The results can be found in ~/data as JSON files.
Each file contains information of the first 40 profiles as seen on the first two result pages of the respective query.
The information we processed was:
- duration of job experiences
- duration of education
- sex
The file naming convention has three parts:
- SHAano: anonymized dataset (names, hyperlinks and pictures belonging to a profile have been removed or replaced with a hash value)
- ##: ordered number of the query as in the list below
- start#-end#: result number, e.g. 1001-1040
{
  "category":
  "dominantSexXing":
  "profiles": [
    {
      "profile": [
        {
          "sex":
          "memberSince_Hits":
          "currJobDescr":
          "jobs": [
            {
              "jobTitle":
              "company":
              "company_url":
              "jobDuration":
              "jobDates":
            }
          ]
        }
      ],
      "languages": [
        {
        }
      ],
      "education": [
        {
          "institution":
          "url":
          "degree":
          "eduDuration":
        }
      ],
      "awards": [
        {
        }
      ]
    }
  ]
}
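As a minimal sketch of how this structure might be traversed, the following Python function flattens one parsed result file into one row per profile. The function name flatten_result and the row keys are illustrative assumptions. The field names are taken from the outline above; the value types (for example of jobDuration) are not specified here, so the values are passed through unchanged.

```python
def flatten_result(result):
    """Return one summary row per profile of a parsed result file.

    Assumes the layout outlined above: each entry of "profiles" holds a
    "profile" list plus sibling lists such as "education".
    """
    rows = []
    for rank, entry in enumerate(result.get("profiles", []), start=1):
        # "profile" is a list in the outline; assume its first element
        # carries the person's details.
        details = (entry.get("profile") or [{}])[0]
        jobs = details.get("jobs", [])
        education = entry.get("education", [])
        rows.append({
            "category": result.get("category"),
            "rank": rank,  # position in the result list, starting at 1
            "sex": details.get("sex"),
            "job_durations": [job.get("jobDuration") for job in jobs],
            "edu_durations": [edu.get("eduDuration") for edu in education],
        })
    return rows


# Hand-made placeholder values, not real data from the dataset.
example = {
    "category": "Auditor",
    "profiles": [
        {
            "profile": [{"sex": "f", "jobs": [{"jobDuration": "24"}]}],
            "education": [{"eduDuration": "36"}],
        }
    ],
}
for row in flatten_result(example):
    print(row)
```

Missing keys yield None or an empty list rather than an error, which seems a reasonable default given that some profiles are incomplete.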
The following queries were used, with reference to these statistics 1, 2, targeting a diversified collection of specific job titles in the respective career field while excluding jobs underrepresented on XING such as construction worker, farmer, etc.
The order of the queries matches the query number used in the file naming convention:
- Administrative Assistant
- Auditing Clerk
- Auditor
- accountant
- bank teller
- treasurer
- actuary
- budget analyst
- economist
- mathematician
- statistician
- Events Coordinator
- Office Manager
- Secretary
- Dental Assistant
- Medical Assistant
- Receptionist
- Audiologist
- Daycare
- lawyer
- legal advisor
- Application Developer
- Building Inspector
- Application Support Analyst
- Civil Engineer
- Back end Developer
- Chemical Engineer
- Construction Engineer
- Data Analyst
- Contract Administrator
- Database Administrator
- Field Engineer
- Front End Developer
- Mechanical Engineer
- Safety Manager
- Software Engineer
- Superintendent
- System Administrator
- Technical Support Specialist
- Account Coordinator
- Account Executive
- Advertising Director
- Art Director
- Brand Assistant*
- Brand Manager*
- Brand Strategist*
- Copywriter
- creative director
- Internet Marketing Coordinator
- Market Research Analyst
- Marketing Associate
- Online Product Manager
- Public Relations Representative
- Public Relations Specialist
- SEO Manager
- Social Media Marketing Coordinator
- Architect
- Searches were performed in English without logging in, to ensure that the ordering of the results is not tailored to a specific profile. After generating the results, each profile was parsed in full detail (while logged in).
- The sex of a person was manually derived from the profile name and picture since it is not given on the profile. This helped us filter irrelevant information such as fake profiles or profiles with misleading information (e.g. containing details about a company instead of a person).
- 19 queries returned duplicate entries. In most cases the duplicates appeared one position apart. In such cases the later entry was removed, so a few result files contain fewer than 40 profiles.
- Details such as company or institution names were anonymized using SHA-256, so that it is still possible to differentiate between people who worked or studied at the same place, or to find other patterns, without exposing the names.
- currJob is always equal to the first element in pastJobs
- If a profile was found to be employed or studying at the time the data was collected, we replaced the date.
- Profiles with incomplete data, in particular with missing dates, have been handled as follows: //\ add it to code instead?
  - If a job or education entry has no name, it counts for an average of 3 months.
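The anonymization and duplicate-removal steps described in the notes above could look roughly like the following sketch. The function names, the key argument used to recognise duplicates, and the UTF-8 encoding are assumptions made for illustration; whether the original processing normalized or salted the strings before hashing is not stated here.

```python
import hashlib


def anonymize(value):
    """Replace a string by its SHA-256 hex digest.

    Equal inputs give equal digests, so entries that share a company or
    institution can still be matched without revealing the name.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def drop_later_duplicates(profiles, key):
    """Keep the first occurrence of each profile and drop later repeats."""
    seen = set()
    unique = []
    for profile in profiles:
        identifier = key(profile)
        if identifier in seen:
            continue  # a repeat further down the ranking is discarded
        seen.add(identifier)
        unique.append(profile)
    return unique


# Hand-made placeholder data, not taken from the dataset.
ranked = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}]
print(drop_later_duplicates(ranked, key=lambda p: p["id"]))
print(anonymize("Example GmbH") == anonymize("Example GmbH"))
```

Because the earlier entry is kept, the relative order of the remaining profiles is preserved, and a result list with duplicates simply becomes shorter.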
The code in src/ reads the information from all JSON files into a Python dataframe that can be used later on. Currently it is simply dumped to disk. To use it, you can execute these commands:
//\
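Independently of the code in src/, the loading step can be sketched with the standard library alone. This is not the repository's implementation: the function name load_results and the *.json file pattern are assumptions, and the directory is passed in by the caller.

```python
import json
from pathlib import Path


def load_results(data_dir):
    """Parse every *.json file in data_dir, keyed by file name."""
    results = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = json.load(handle)
    return results
```

Calling load_results with the path of the data directory returns a dictionary of parsed files; the values can then be flattened into rows and handed to a dataframe library if one is installed.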
If you use this dataset, please cite:
Zehlike, Meike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. "FA*IR: A fair top-k ranking algorithm." In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569-1578. ACM, 2017.
BibTeX Entry:
@inproceedings{zehlike2017fair,
title={{FA*IR}: A fair top-k ranking algorithm},
author={Zehlike, Meike and Bonchi, Francesco and Castillo, Carlos and Hajian, Sara and Megahed, Mohamed and Baeza-Yates, Ricardo},
booktitle={Proceedings of the 2017 ACM on Conference on Information and Knowledge Management},
pages={1569--1578},
year={2017},
organization={ACM}
}
The authors are not affiliated with XING in any way.