Points of View

Alumni data scientists share perspectives

Nana Banerjee, PhD ’96, and Beth Plale, PhD ’97. Image credit: Jonathan Cohen.

April 18, 2019

Nana Banerjee, PhD ’96

President and CEO of McGraw-Hill

Banerjee earned his PhD in systems science, specializing in applied mathematics, while studying under James Geer, now adjunct professor emeritus of mechanical engineering. “My work was on identification of cracks in vibrating bodies such as airplanes, and was partly funded by NASA,” Banerjee says. “To me, it was the most exciting work in the world.”

Is it unusual to choose a data scientist to lead McGraw-Hill?

NB: It absolutely is. Any time you bring in the CEO from an outside sector, it is atypical. For a data scientist and technologist to come into the role was a clear and bold signal from the board to our stakeholders that McGraw-Hill is serious and committed to its vision of being the leader in next-gen technology and analytically backed education solutions.

What is data science?

NB: It can mean many different things, but some common themes apply:

There must be scaled means of capturing and storing data. The capturing of data at scale [i.e., at a size large enough to solve the problem] often requires sensors to track temperature, movement, images, stimuli-response, etc.
Automated and algorithmic analysis of the data, often not requiring a mathematician or statistician to implement at scale.
Quick and meaningful inferences of the analysis that lead to better outcomes or actions, such as the generalized profiling or analysis of biomarkers from a blood test that then leads to easier identification of a specific disease.

How can data science help education?

NB: An ideal education setting is one in which the student is provided highly bespoke and curated attention to drive deep immersion and retention of knowledge. Use of data science and technologies allows us to create such an environment at scale. The software can be so well optimized that it literally changes the questions and experience of what a student is reading and practicing. You and I might be in the same classroom taking the same math class, but the practice problems being shown to us might be completely different based on our individual states of understanding and what we need to improve on.

That is where analytics is at its best, and these are the kinds of solutions that McGraw-Hill is building now.
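The idea of two students in the same class seeing different practice problems can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of McGraw-Hill's software: the mastery scores, the problem bank and the function name `next_problem` are all hypothetical, and a real system would estimate mastery from many observed answers.

```python
def next_problem(mastery, problem_bank):
    """Pick a practice problem for the skill with the lowest estimated mastery.

    mastery: dict mapping skill name -> estimated mastery between 0 and 1
    problem_bank: dict mapping skill name -> list of problem statements
    Both structures are hypothetical and exist only for this sketch.
    """
    weakest_skill = min(mastery, key=mastery.get)
    return weakest_skill, problem_bank[weakest_skill][0]


# Invented example data: one small problem bank shared by the whole class.
problem_bank = {
    "fractions": ["Add 1/3 and 1/6."],
    "linear_equations": ["Solve 2x + 3 = 11."],
}

# Two students in the same class with different (made-up) states of understanding.
student_a = {"fractions": 0.9, "linear_equations": 0.4}
student_b = {"fractions": 0.3, "linear_equations": 0.8}

skill_a, problem_a = next_problem(student_a, problem_bank)
skill_b, problem_b = next_problem(student_b, problem_bank)
print(skill_a, "->", problem_a)
print(skill_b, "->", problem_b)
```

Here each student is routed to the skill where the estimated mastery is lowest, so the same class produces two different practice sessions.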

Should every student have some data experience?

NB: Yes, but only to the extent that they have a rudimentary appreciation of what it means and what it can do for them. One shouldn’t be required to be a data scientist to know what data science can do.

For those who want to specialize in data science, there are so many options available.

My path started with a strong interest in mathematics, based on a deep love for number theory and differential equations, and an innate desire to simulate and solve complex, real-world problems.

Can anyone use data science?

NB: Yes, but the complexity can vary. A simple form of data science can be the cataloguing of observations, such as weight on a weighing machine, and the visual search for a trend or pattern in the dataset, ultimately leading to an extrapolated inference of what the future might hold. If you wish to solve a problem, you will often use data science as an aid. It’s not the only thing, but it does find a way to be helpful.
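The simple case described here (record observations, look for a trend, extrapolate) can be sketched in Python. The weigh-in numbers below are invented for illustration, and a straight-line least-squares fit is just one assumed way of capturing a trend; the function names `fit_trend` and `extrapolate` are hypothetical.

```python
def fit_trend(xs, ys):
    """Fit a straight line y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(slope, intercept, x):
    """Use the fitted line to estimate y at a point outside the observed range."""
    return slope * x + intercept


# Hypothetical weekly weigh-ins (week number, weight in kg), made up for this sketch.
weeks = [1, 2, 3, 4, 5]
weights_kg = [82.0, 81.5, 81.0, 80.5, 80.0]

slope, intercept = fit_trend(weeks, weights_kg)
forecast_week_8 = extrapolate(slope, intercept, 8)
print(f"trend: {slope:+.2f} kg per week, forecast for week 8: {forecast_week_8:.1f} kg")
```

An extrapolation like this assumes the trend continues unchanged, which is exactly the kind of inference that should be treated as an aid rather than an answer.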

Beth Plale, PhD ’97

Professor of Intelligent Systems Engineering at Indiana University Bloomington and currently serving as Science Advisor for Public Access at the National Science Foundation (NSF)

Plale studied computer science, specifically operating systems, distributed systems and databases, in the Thomas J. Watson School of Engineering and Applied Science. The growth in Watson’s research-oriented faculty is important, she says: “This has created opportunities for PhD students to be much stronger and go on and have great research careers.”

What goes on “behind the scenes” in data science?

BP: When data science first came into being, it was equated with machine learning. I always thought of data science as more than that because data has a life cycle: It’s born from a sensor, an instrument or simply someone walking city streets taking down responses to a survey. Then it moves through analysis, storage, preservation, eventual reuse, and is later discarded. If you regard data science as just machine learning, there’s an assumption that the data arrived very clean and marked up in a way that algorithms can instantly deal with it — and all the steps to get it there, to produce results and to visualize it are minimized or forgotten.

Increasingly, funding agencies are recognizing that data collected to answer one research question has value in helping other researchers answer different questions. How that happens is addressed by open science. This is not the remote satellite data that I use to produce my paper; this is the data that I’ve collected and used, and then curate and make available to other people to use.

What is open science?

BP: The objective of agencies like the NSF in public access is to show that open science equates to good science and, in the end, benefits both science and society. Open science is today’s data for tomorrow’s discoveries.

Open science urges scientists to be more attentive to research processes, to give more thought to the subsequent uses of data and more thought to the reproducibility and replicability of their data.

The FAIR principles, published in 2016, are guidelines for making data Findable, Accessible, Interoperable and Reusable.

Openness and transparency are critical to moving research forward. Publication is a starting point to reproducibility, and the more we know how something is done, what tools are used, the easier it is to try it again. If we don’t commit to this process of reproducing to increase the credibility of hypotheses, then we are not sufficiently advancing science.

Can anyone do data science?

BP: One of the exciting developments in data science is community interest.

The California Safe Drinking Water Data Challenge in 2018 demonstrated the power of data to solve one of California’s most pressing issues. People in academia and industry, and people just interested in science, came together for a couple of days to look at water-related information that the state of California made available. People basically hacked into the data with tools they had, tools they were learning about. They focused on what they could find in the data and on making that information visible in some way.

One of the organizers said what came of that gave the state justification to make even more data available.

So, data science has gotten big; the public is interested in data. You don’t need to know computer science programming languages to get at it. You need a little bit of Python, Jupyter notebooks and the R suite of tools, and you can manipulate data. It’s an exciting time.

