What are the issues around Data Science?
Data Literacy and Education
Academia, businesses, and government agencies have all seen a sharp increase in data-driven approaches in recent years. As a result, the demand for workers trained in collecting and analyzing data continues to grow (LinkedIn’s 2020 Emerging Jobs Report registered 37% annual growth in demand for data scientists in the US).
Teaching the skills necessary to collect, manage and analyze data for practical purposes should be a priority. This demand is not restricted to technical skills, but also includes a broad need for an ability to understand and correctly interpret the information that can be extracted from data. Due to the speed at which demand for data science skills has increased, educational institutions at every level have lagged behind, and educators are struggling to figure out how to best contribute to the training of this new workforce.
Simultaneously, there is a lingering lack of consensus on the fundamental principles, expertise, skills, and knowledge base needed to define an academic discipline for this specific purpose. The challenge is not solely to train skilled data scientists, but also to educate policy-makers, industry leaders, and the general public in how to best comprehend concepts that have to date generally been poorly understood.
One particularly important aspect of this dynamic to consider is that the knowledge already accumulated from centuries of experience in data analysis - delivered mainly through the discipline of statistics - can be revamped to play a part in meeting this new educational demand. At present, for example, the average high school mathematics curriculum focuses on building the necessary knowledge base to prepare students to understand calculus.
While calculus is an extremely important element of math, originally developed in the 17th century in part to solve problems in astronomy, in a digital and data-driven world it might be more appropriate to prepare students to understand complex statistical concepts, to be able to reason under uncertainty, and to solve problems via data analysis.
The current statistics curriculum needs to be updated to catch up with advances in computing; however, the main priority should be to bring practical applications to the forefront - the people tasked with developing data science courses should therefore not only have statistical training, but also have experience analyzing data with the objective of solving real-world problems in responsible and sustainable ways.
Data Generation, Processing, and Curation
In a world where volumes of both helpful information and potentially harmful misinformation are continuously increasing, the ability to identify high-quality data sources and well-curated, clearly documented data is critical for making the best possible evidence-based decisions.
This is equally true for anyone making personal or political decisions, administrators running institutions, the leaders of international organizations, or heads of state. High-quality, well-curated data does not necessarily mean complete data - but it does require an understanding of how the data were generated and for what purpose, what is included and what is missing, clear descriptions or documentation of features and variables, an assessment of privacy and access restrictions, and - whenever possible - comparisons with other, similar data sets.
Providing this kind of information about data sets helps verify their validity and supports proper reuse. High-quality data is easier to attain when field-specific or community standards are rigorously implemented at the time of collection or generation - and when adequate planning is made for documenting the data transformations and processes applied at each step, alongside careful consideration of privacy, access, and stewardship.
When any new data are needed, the collection process should take into account interoperability with existing standards, in order to make it comparable with other data. Although it may be tempting to collect massive amounts of data - sometimes even without a clear purpose - it is important to first evaluate what questions need answering through the use of the data, and focus collection only on the most appropriate type for a specific purpose.
In other contexts, automated pipelines are used to collect, clean, process, and aggregate data - though they carry the risk of obscuring these processes and enabling misuse of the resulting data.
Automated processes and software tools can be used to facilitate and document the transformation from data generation to data analysis - in terms of research, for example, computational and lab notebooks and workflow tools can help capture the entire research data lifecycle. In terms of data credibility and validity, it is important that the related processes and tools are transparent, and that documentation about the data's provenance and transformations is shared.
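As a minimal sketch of what such documentation can look like in practice - the cleaning step, field names, and file name below are purely illustrative assumptions - a pipeline can record provenance metadata alongside every transformation it applies:

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

def checksum(df: pd.DataFrame) -> str:
    """Fingerprint a dataframe so later users can verify they hold the same data."""
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()

def drop_missing_ages(df: pd.DataFrame, log: list) -> pd.DataFrame:
    """Example cleaning step: remove rows with a missing 'age' value and log what was done."""
    cleaned = df.dropna(subset=["age"])
    log.append({
        "step": "drop_missing_ages",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rows_in": len(df),
        "rows_out": len(cleaned),
        "output_checksum": checksum(cleaned),
    })
    return cleaned

# Toy input standing in for a collected dataset
raw = pd.DataFrame({"age": [34, None, 52], "income": [40_000, 38_000, 61_000]})

provenance = [{"step": "ingest",
               "source": "survey_2023.csv (hypothetical)",
               "output_checksum": checksum(raw)}]
clean = drop_missing_ages(raw, provenance)

# The provenance log travels with the data itself
print(json.dumps(provenance, indent=2))
```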
Data Analysis and Uncertainty Assessment
Data science is not just about extracting information, but also about quantifying its quality. The fundamental aim of any constructive data analysis is to extract the information encoded in the data, and use it to both update our understanding of the world and guide our collective behavior in a positive way. This process of extracting information and analyzing its quality sits at the core of data science - in many cases, this analysis will reduce large datasets to a few key summary statistics, reveal hidden patterns or relationships, or involve implementing methods of rendering that aid human interpretation.
There is rarely a single correct methodology or answer; data scientists often adopt a variety of techniques, and compare the results for consistency and new insights. While applications vary from one field to another, the fundamental tools of data analysis - mathematical modelling, statistics, numerical analysis, optimization, and computer science - are commonly shared and reflect the disciplines and industries that have all contributed to the data analyst's toolkit. These tools can in turn be applied to everything from predictions made in the form of political polls to the visualizations used to bolster news reporting.
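As a toy illustration of this reduction - the dataset below is synthetic and the variables invented - a few lines of analysis can condense thousands of records into a handful of summary statistics and a single measure of association:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical raw data: 10,000 records of study hours and exam scores
hours = rng.uniform(0, 10, size=10_000)
scores = 50 + 4 * hours + rng.normal(0, 8, size=10_000)
df = pd.DataFrame({"hours": hours, "score": scores})

# Reduce 10,000 rows to a few key summary statistics...
summary = df.describe().loc[["mean", "std", "min", "max"]]

# ...and one number describing the relationship hidden in the raw records
correlation = df["hours"].corr(df["score"])

print(summary)
print(f"Correlation between hours studied and score: {correlation:.2f}")
```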
It is of special importance to gain a quantitative understanding of the quality of the information extracted from data. Conclusions drawn from data should always be accompanied by an estimate of their error or uncertainty; quantifying the size of that uncertainty helps an audience ascertain how much confidence it should place in the results. One key element of uncertainty assessment is understanding a dataset's limitations.
For example, confidence in the results of a political poll will depend in part on its size - the statistical error associated with its results decreases as the sample size increases. However, the accuracy of a poll also relies on surveying a representative sample of the population, a factor that can be even more critical than sample size.
If a particular demographic is under-represented in the polling, it must be acknowledged and corrected for. A failure to do so may result in a biased estimate of voter intentions. On the other hand, the process of correcting an unrepresentative survey can introduce a new source of uncertainty.
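To make these trade-offs concrete, the sketch below uses made-up numbers: the standard margin-of-error formula for a proportion shows how statistical error shrinks with sample size, while a basic reweighting step shows how correcting for an under-represented group changes the estimate:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Statistical error shrinks as the sample grows
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}: ±{margin_of_error(0.5, n):.3f}")

# But a large, unrepresentative sample can still mislead.
# Suppose a group makes up 30% of the population but only 10% of respondents,
# and support for a candidate differs between groups (invented numbers).
support = {"group_a": 0.40, "group_b": 0.60}
sample_share = {"group_a": 0.10, "group_b": 0.90}
population_share = {"group_a": 0.30, "group_b": 0.70}

unweighted = sum(support[g] * sample_share[g] for g in support)
weighted = sum(support[g] * population_share[g] for g in support)

print(f"Unweighted estimate: {unweighted:.2f}")  # biased towards group_b
print(f"Weighted estimate:   {weighted:.2f}")    # corrected, but the weights add their own uncertainty
```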
Data and Algorithm Ethics
Predictive modelling is one of the most common, and promising, applications of modern data science. In many disciplines and endeavors we can apply models that support or automate decisions that have traditionally been performed by human experts - who are often expected to follow ethical principles. Responsible data science means articulating ethical principles and developing the tools to enforce them.
When it comes to medical diagnosis, for example, physicians are expected to follow the principles of beneficence (balancing benefits against risks) and non-maleficence (avoiding harm). These principles are enforced on multiple fronts, from professional training, to fiduciary duties, to regulation.
On the research front, this requires articulating ethical principles tailored to each application, and developing tools to help facilitate and enforce them. To return to the example of medical diagnosis, there is a pressing need to articulate a precise definition of what it means for a diagnostic to be "fair" when applied to many different types of patients.
Responsible data science involves developing and deploying predictive models that are subject to the same - or even greater - degree of ethical scrutiny as their human counterparts. So far, we have yet to develop either the verification tools necessary to check that a diagnostic is actually fair, or methods to learn fair diagnostics from data. Within the US justice system, there has been considerable debate about how to evaluate the fairness of methods for predicting the recidivism of convicted criminals - as well as about whether it is even appropriate to apply predictive models to such an issue.
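There is no single agreed definition of fairness, but one common formalization - sometimes called equalized odds - compares error rates across groups. The sketch below, using entirely synthetic labels and predictions, illustrates the kind of check such a verification tool might perform:

```python
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """False-negative and false-positive rates per group for a binary diagnostic."""
    rates = {}
    for g in np.unique(group):
        mask = group == g
        positives = (y_true == 1) & mask
        negatives = (y_true == 0) & mask
        fnr = np.mean(y_pred[positives] == 0) if positives.any() else float("nan")
        fpr = np.mean(y_pred[negatives] == 1) if negatives.any() else float("nan")
        rates[g] = {"false_negative_rate": fnr, "false_positive_rate": fpr}
    return rates

# Synthetic example: true condition, model prediction, and patient group
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

for g, r in group_error_rates(y_true, y_pred, group).items():
    print(g, r)

# Large gaps in these rates between groups would flag the diagnostic as violating
# this particular (and contestable) notion of fairness.
```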
When it comes to the provision of public aid such as housing, principles must be developed to provide recourse for people unfairly affected by algorithmic decisions. Ultimately, ethical principles should also inform both the design of regulations that protect people, and of methods that facilitate their enforcement.
On the educational front, universities and other institutions are integrating ethics into their data science curricula, while on the professional front companies that use models are developing frameworks and best practices. These efforts will be essential for ensuring the responsible development and application of data science - though ethics are no substitute for regulation.
Data Communication and Visualization
Well before writing systems emerged, visualizations such as maps and agricultural record-keeping systems were used as means of communication. More recent varieties have included a wide array of data visualizations for scientific exploration and new discovery, as well as for communicating the information carried in data to large audiences.
Data visualization is a powerful means of exploration and communication. Whether it is an animation tracking the planets in our solar system or a chart illustrating COVID-19's death toll, it has the power to transcend language and cultural barriers - and to speak effectively to people all over the world.
As visualization techniques continue to evolve, they will expand to new display devices, present increasingly complex data to the public, and integrate more seamlessly into our everyday lives. Data visualizations used for communication and education often serve to distil complex data into more accessible and actionable information. For example, graphs depicting the exponential growth of death rates during the COVID-19 pandemic have helped to elicit responses from every level of society, including concerned individuals, responsible government agencies, and private companies.
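As a small illustration of why such charts work - the numbers below are invented, not real case counts - plotting exponential growth on a logarithmic scale turns it into a straight line, making the trend legible at a glance:

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(60)
cases = 10 * np.exp(0.1 * days)  # invented exponential series, not real data

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(8, 3))

ax_lin.plot(days, cases)
ax_lin.set_title("Linear scale")
ax_lin.set_xlabel("Day")

ax_log.plot(days, cases)
ax_log.set_yscale("log")  # exponential growth appears as a straight line
ax_log.set_title("Logarithmic scale")
ax_log.set_xlabel("Day")

plt.tight_layout()
plt.show()
```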
Another example of data-driven visualization compelling action is the depiction of lives lost to gun violence, which helps bring widespread awareness to the issue. When used to support the scientific exploration of data, visualization becomes a powerful tool that combines the nuance of human reasoning with the precision of computer algorithms. In the field of genetics, for example, visualizing someone's ancestry alongside a detailed clinical history enables experts to better understand diseases such as autism.
Similarly, visualizations of the connectivity of neurons in the human brain support researchers exploring cognitive functions. In order to further improve the power of visualizations to foster communication and discovery, we need to increase visualization literacy and enable real-world applications through novel techniques for handling larger and more complex data.
Future systems will automatically generate, recommend, and critique visualizations, while new devices such as wearables, ubiquitous large-scale displays, augmented reality glasses, and virtual reality goggles will promote more seamless integration of data visualizations into our everyday lives - providing them in ways that make data more accessible to everyone.
Data Governance and Sharing
The sharing of data, of the software necessary to generate and process it, and of the models trained from it is becoming a key element of any research process. Ready access to data and its responsible use are necessary to inform research and evidence-based policy. Any form of data sharing should include guarantees that owners retain their rights to whatever is shared, and that the data are shared responsibly - with the aid of privacy-preserving methods or access controls when needed.
Data sharing enables the verification of published scientific results and the reuse of data - something that is ideally put into practice by governments, companies, and academic researchers in order to accelerate discovery and make timely, informed decisions. Ultimately, sharing data with an electorate, shareholders, and the scientific community provides greater accountability and transparency. It is already generally understood that data associated with publicly-funded research should be made available to the public, whenever possible. But there should also be incentives for private sector entities to share more of their data, in order to help advance related research and bolster accountability.
The methods employed for data sharing should focus on providing greater privacy, fairness, and utility. We should not fall into a false dichotomy that holds that data must be either fully open, or not shared at all. The infrastructure, technologies, methods, and policies for responsible, privacy-preserving data sharing are in continuous development, and their use should be encouraged by anyone or any institution involved in related processes.
In order to advance artificial intelligence and automated pipelines for discovery, for example, data must be findable, accessible, interoperable, and reusable - and not only by humans, but also by machines. This is in line with the “FAIR” principles (findability, accessibility, interoperability, and reusability), an international effort to provide guidelines for data sharing and stewardship.
These principles have been endorsed and implemented by a growing number of data repositories. Given that research- and evidence-based decision making is increasingly international and collaborative, an open, distributed network of FAIR repositories and services that support quality control and the sharing of data, publications, and other digital assets has become a necessity.
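What machine-actionability can mean in practice is suggested by the minimal, hypothetical metadata record below; the field names are illustrative rather than a formal standard, though real repositories use richer schemas such as DataCite or schema.org/Dataset:

```python
# Hypothetical, simplified metadata record attached to a shared dataset.
dataset_metadata = {
    "identifier": "doi:10.1234/example-dataset",  # findable: a persistent identifier (made-up DOI)
    "title": "Example household survey (synthetic)",
    "access_url": "https://repository.example.org/datasets/example",  # accessible
    "format": "text/csv",                          # interoperable: an open, documented format
    "license": "CC-BY-4.0",                        # reusable: clear terms of reuse
    "provenance": "Derived from raw survey files; cleaning steps logged separately.",
    "variables": ["age", "income", "region"],
}

REQUIRED_FIELDS = {"identifier", "access_url", "format", "license", "provenance"}

missing = REQUIRED_FIELDS - dataset_metadata.keys()
print("FAIR-style record complete." if not missing else f"Missing fields: {missing}")
```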
DATA SCIENCE
The era of data is upon us. It is proliferating at an unprecedented pace, reflecting every aspect of our lives and circulating from satellites in space through the phones in our pockets.
The data revolution creates endless opportunities to confront the grand challenges of the 21st century. Yet, as the scale and scope of data grow, so must our ability to analyze and contextualize it. With the mass online migration of global businesses, 2021 saw many Data Science trends gather pace in the data technology industry. Many of these trends - such as cloud and scalable AI, graph analytics, and blockchain in analytics - began before 2021 and will have additional impact in the future.
Drawing genuine insights from data requires training in statistics and computer science, as well as subject-area knowledge. Data Science enables countries to use modern methods, such as machine learning and distributed data processing, to exploit new and alternative data sources.
These data sources could include social media, mobile phone data, and data from the Internet of Things. The disaggregated data from these sources can then be compiled into summaries, typically used for public reporting or statistical analysis.
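As a minimal sketch - the mobile-usage records below are entirely made up - such disaggregated, record-level data can be compiled into summaries suitable for public reporting with a simple grouped aggregation:

```python
import pandas as pd

# Made-up record-level data standing in for raw mobile phone activity
records = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "week": [1, 1, 1, 2, 2],
    "data_used_mb": [320, 150, 410, 275, 390],
})

# Compile the disaggregated records into a summary by region and week
summary = (
    records
    .groupby(["region", "week"], as_index=False)
    .agg(users=("data_used_mb", "size"), total_mb=("data_used_mb", "sum"))
)
print(summary)
```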
For instance, to promote sustainable industrialization as laid out in SDG 9, small and medium-sized enterprises can use data analytics to improve production, create new goods and services, and refine processes and marketing strategies. Responding to some SDG indicators related to sustainable cities and communities (goal 11), climate change (goal 13), and zero hunger (goal 2), for example, requires data from mobile phone devices and satellite imagery.
Putting insights into action requires a careful understanding of the potential ethical consequences - for both individuals and entire societies. To achieve the targets of the SDGs, Data Science can and will play a large role in developing disaggregated indicators, ensuring that those at risk of disadvantage because of their characteristics, location, or socio-economic status are recognized.