Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H. and Crawford, K. (2018) ‘Datasheets for Datasets’, Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning.
Gebru, Morgenstern, Vecchione, Vaughan, Wallach, Daumé III and Crawford’s “Datasheets for Datasets” argues that machine-learning datasets require systematic documentation because data are not inert inputs but foundational infrastructures that shape model behaviour, social impact, and downstream harm. Their central claim is that the machine-learning community lacks a standard mechanism for explaining why a dataset was created, what it contains, how it was collected, what it should or should not be used for, and what ethical or legal risks it carries. By analogy with electronics, where components are accompanied by datasheets specifying operating characteristics, limitations, and safe use, the authors propose datasheets for datasets: a documentation practice organised around motivation, composition, collection, preprocessing, distribution, maintenance, and legal and ethical considerations. This framework matters because biased or poorly documented datasets can propagate through systems like faulty components, producing discriminatory outcomes in hiring, criminal justice, facial recognition, finance, and infrastructure.

Their case studies of Labeled Faces in the Wild (LFW) and the Movie Review Polarity dataset show how datasheets can expose hidden assumptions, demographic imbalance, consent problems, sampling limitations, preprocessing decisions, and unsuitable uses. The LFW example is especially revealing: a dataset widely used for face recognition was assembled from public images scraped from news sources, with uneven demographic representation, little or no subject consent, and potential compliance issues, all of which bear on responsible deployment. In conclusion, the article reframes dataset documentation as an ethical and epistemic obligation: transparency about data provenance, limits, and risks is not bureaucratic excess but a necessary condition for accountable machine learning.
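To make the section structure of the proposed datasheets concrete, the sketch below encodes them as a machine-readable template in Python. This is an illustrative assumption on my part, not an artifact from the paper: the field names, the `Datasheet` class, and the example answers are hypothetical, and the paper itself specifies the sections as lists of free-text questions rather than any particular schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Datasheet:
    """Illustrative template mirroring the sections proposed by Gebru et al.

    Each field holds free-text answers to that section's questions; the
    structure and field names here are assumptions for illustration only.
    """
    motivation: str = ""          # why and by whom the dataset was created
    composition: str = ""         # what instances it contains, labels, splits
    collection_process: str = ""  # how and from where the data were gathered
    preprocessing: str = ""       # cleaning, labelling, and filtering decisions
    distribution: str = ""        # how it is shared, licences, access terms
    maintenance: str = ""         # who maintains it, update and erratum policy
    legal_ethical: str = ""       # consent, privacy, regulatory considerations

    def to_json(self) -> str:
        """Serialise the datasheet so it can ship alongside the dataset files."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical usage: a partial datasheet for a scraped face-image corpus.
sheet = Datasheet(
    motivation="Benchmark for unconstrained face verification research.",
    collection_process="Images scraped from public news sources; no subject consent obtained.",
    legal_ethical="Demographic imbalance and consent gaps limit appropriate deployment uses.",
)
print(sheet.to_json())
```

Encoding the sections as a single structured record is one plausible way to keep the documentation versioned and distributed together with the data, in the spirit of the electronics-datasheet analogy.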