The term FAIR, coined in 2016, stands for findability, accessibility, interoperability, and reusability. During my omics integration project, I realized how important FAIR data management is in scientific fields. Today, not only do storage methods vary among databases, but dataset formats are often inconsistent even within the same database. Such inconsistency creates many obstacles for programmatic applications, especially machine learning, which requires datasets to be organized in a specific way. Here, let’s talk about how FAIR data can reduce the human involvement needed to prepare data for machine learning models.
Findability and accessibility mean that (meta)data are assigned unique identifiers and can be retrieved using those identifiers. These two characteristics are fundamental for machine learning projects, which often require enormous amounts of data. They are also relatively simple to achieve: most databases nowadays provide application programming interfaces (APIs) that let users acquire data programmatically.
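To make identifier-based retrieval concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the base URL `https://example.org/api/records/` and the function names are hypothetical, and I assume the service returns JSON. A real database documents its own endpoints and response formats.

```python
import json
from typing import Callable
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base URL (an assumption for this sketch, not a real service).
BASE_URL = "https://example.org/api/records/"


def record_url(identifier: str, base_url: str = BASE_URL) -> str:
    """Build the retrieval URL for one identifier, escaping unsafe characters."""
    return base_url + quote(identifier, safe="")


def http_get(url: str) -> str:
    """Default fetcher: a plain HTTP GET returning the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def fetch_record(identifier: str, get: Callable[[str], str] = http_get) -> dict:
    """Retrieve and parse the JSON (meta)data stored under one identifier."""
    return json.loads(get(record_url(identifier)))
```

Because every record is addressed by its identifier alone, a program can loop over a list of identifiers and call `fetch_record` for each one without a person browsing the database by hand. The `get` parameter is there so the fetcher can be swapped out, for example with a stub during testing.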
Interoperability requires (meta)data to use a shared vocabulary and language for knowledge representation and to record references to related (meta)data. A shared vocabulary and language not only offer human-readable information but also give machines a clearly defined search space. In my omics integration project, dealing with disease synonyms was a major difficulty. Scientists in the related fields can handle various synonyms in data, but machines struggle to match that flexibility, so they either return unrelated search results or miss related information. Many search engines today can handle synonyms to some degree, through regularly updated dictionaries or natural language processing (NLP), but notable flaws remain in many cases. With interoperable data, retrieval can achieve higher precision and sensitivity, and adding new data in the future becomes more efficient.
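The synonym problem can be illustrated with a small Python sketch. This is not the method I used in my project; it is a simplified example that assumes a tiny controlled vocabulary, and the `EX:` concept identifiers, the `disease` field, and the function names are all made up for illustration.

```python
from typing import List, Optional

# Hypothetical controlled vocabulary: every synonym maps to one concept ID.
# The "EX:" identifiers are invented for this example.
SYNONYM_TO_CONCEPT = {
    "heart attack": "EX:0001",
    "myocardial infarction": "EX:0001",
    "high blood pressure": "EX:0002",
    "hypertension": "EX:0002",
}


def to_concept(term: str) -> Optional[str]:
    """Map a free-text term to its concept ID, ignoring case and extra spaces."""
    key = " ".join(term.casefold().split())
    return SYNONYM_TO_CONCEPT.get(key)


def search(records: List[dict], query: str) -> List[dict]:
    """Return records whose disease label shares a concept with the query."""
    concept = to_concept(query)
    if concept is None:
        return []  # unknown term: return nothing rather than guess
    return [r for r in records if to_concept(r["disease"]) == concept]
```

With this mapping, a query for "myocardial infarction" also finds records labeled "Heart attack", while records about hypertension are left out. When every dataset already stores the shared concept identifier, the lookup table is no longer needed at search time.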
Last but not least, reusability requires data to be described with rich, detailed attributes and consistent formats. Clearer attributes also support the interoperability described above, because search engines have more information to go through when searching for a keyword. In my opinion, reusability matters for machine learning for two reasons. First, in some domains, such as the life sciences, collecting data takes a lot of effort, and data that cannot be reused represent a huge waste of resources. Because many machine learning algorithms need abundant data, data that can be reused over time are especially valuable. Second, machine learning algorithms often require data to be formatted and organized in a particular way. If attributes vary widely from one dataset to another, the preprocessing effort increases. Reusable data are therefore crucial for machine learning projects to proceed efficiently.
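To show what inconsistent attributes cost during preprocessing, here is a minimal Python sketch that maps differently named attributes onto one schema. The attribute names, the `CANONICAL_NAMES` table, and the required fields are hypothetical examples, not taken from any real database.

```python
from typing import Dict

# Hypothetical mapping from attribute names seen in different sources
# to one canonical schema (keys are lowercase for case-insensitive lookup).
CANONICAL_NAMES = {
    "sample_id": "sample_id",
    "sampleid": "sample_id",
    "sample": "sample_id",
    "organism": "species",
    "species": "species",
    "dose_mg": "dose_mg",
    "dose (mg)": "dose_mg",
}
REQUIRED = ("sample_id", "species", "dose_mg")


def harmonize(record: Dict[str, object]) -> Dict[str, object]:
    """Rename known attributes to the canonical schema and check completeness."""
    clean: Dict[str, object] = {}
    for name, value in record.items():
        canonical = CANONICAL_NAMES.get(name.strip().casefold())
        if canonical is not None:  # unrecognized attributes are dropped here
            clean[canonical] = value
    missing = [field for field in REQUIRED if field not in clean]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    clean["dose_mg"] = float(clean["dose_mg"])  # one numeric type for all sources
    return clean
```

Every new naming variant has to be added to the table by hand, which is exactly the kind of effort that consistent attributes and formats at the source would remove.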
Machine learning is now ubiquitous in many areas, which makes FAIR data management more and more important. Recently, the NIH/NIEHS Superfund Research Program (SRP) started the End-User Computing (EUC) Data Supplement Project, which practices FAIR principles in environmental health sciences. Prof. Ilias Tagkopoulos’ lab will support the UCD SRP team in building the Toxicology Integrated Platform (TIP). It will be exciting to see how machine learning and data science facilitate FAIR data management!
- Featured Image. Retrieved from https://www.paperlesslabacademy.com/2019/06/12/fair-principles/
- Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. doi:10.1038/sdata.2016.18