September 26, 2017

Getting serious about research ethics: AI and machine learning

[This blog post is a continuation of our series about research ethics in computer science.]

The widespread deployment of artificial intelligence, and specifically of machine learning algorithms, raises concerns about fundamental societal issues such as employment, privacy, and discrimination. While these algorithms promise to optimize social and economic processes, research in this area has exposed some major deficiencies in the social consequences of their operation. Some consequences may be invisible or intangible, such as erecting computational barriers to social mobility through a variety of unintended biases, while others may be directly life-threatening. At the CITP’s recent conference on computer science ethics, Joanna Bryson, Barbara Engelhardt, and Matt Salganik discussed how their research led them to work on machine learning ethics.

Joanna Bryson has made a career of researching artificial intelligence and machine learning and understanding their consequences for society. She has found that people tend to identify with the perceived consciousness of artificially intelligent artifacts, such as robots, which then complicates meaningful conversations about the ethics of their development and use. By equating artificially intelligent systems to humans or animals, people infer a moral status for these systems and can overlook their engineered nature.

While the cognitive power of AI systems can be impressive, Bryson argues they do not equate to humans and should not be regulated as such. On the one hand, she demonstrates the power of an AI system to replicate societal biases in a recent paper (co-authored with CITP’s Aylin Caliskan and Arvind Narayanan) by letting systems trained on a corpus of text from the World Wide Web learn the implicit biases around the gender of certain professions. On the other hand, she argues that machines cannot ‘suffer’ in the same way as humans do, which is one of the main deterrents for humans in current legal systems. Bryson proposes we understand both AI and ethics as human-made artifacts. It is therefore appropriate to rely on ethics – rather than science – to determine the moral status of artificially intelligent systems.

Barbara Engelhardt’s work focuses on machine learning in computational biology, specifically genomics and medicine. Her main area of concern is the reliance on recommendation systems, such as we encounter on Amazon and Netflix, to make decisions in other domains such as healthcare, financial planning, and career decisions. These machine learning systems rely on data as well as social networks to make inferences.

Engelhardt describes examples where using patient records to inform medical decisions can lead to erroneous recommendation systems for diagnosis as well as harmful medical interventions. For example, the symptoms of heart disease differ substantially between men and women, and so do their appropriate treatments. Most data collected about this condition was from men, leaving a blind spot for the diagnosis of heart disease in women. Bias, in this case, is useful and should be maintained for correct medical interventions. In another example, however, data was collected from a variety of hospitals in somewhat segregated poor and wealthy areas. The data appear to show that cancers in Hispanic and Caucasian children develop differently. However, inferences based on this data fail to take into account the biasing effect of economic status in determining at which stage of symptoms different families decide to seek medical help. In turn, this determines the stage of development at which the oncological data is collected. A recommendation system with this type of bias confuses race with economic barriers to medical help, which will lead to harmful diagnoses and treatments.
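The confounding effect in the second example can be made concrete with a small sketch. The records below are synthetic numbers invented purely for illustration (abstract groups "A" and "B", not data from Engelhardt's work), and the function name `mean_stage_by` is hypothetical. The point is only that a difference between groups can appear in the aggregate and vanish once the records are split by economic status.

```python
from collections import defaultdict
from statistics import mean

# Synthetic records invented for illustration:
# (group, income, stage_at_diagnosis). Not real patient data.
records = (
    [("A", "low", 3)] * 3 + [("A", "high", 1)]
    + [("B", "low", 3)] + [("B", "high", 1)] * 3
)


def mean_stage_by(keys):
    """Average stage at diagnosis, grouped by the given record fields."""
    buckets = defaultdict(list)
    for group, income, stage in records:
        row = {"group": group, "income": income}
        buckets[tuple(row[k] for k in keys)].append(stage)
    return {key: mean(stages) for key, stages in buckets.items()}


# Grouping by group alone suggests the groups differ.
by_group = mean_stage_by(["group"])

# Splitting further by income shows identical stages within each income level.
by_group_and_income = mean_stage_by(["group", "income"])
```

In this toy data, the apparent gap between the two groups comes entirely from how many records in each group fall into the low-income bucket, which is the kind of mix-up described above.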

Matt Salganik proposes that the machine learning community draw some lessons from ethics procedures in social science. Machine learning is a powerful tool that can be used responsibly or inappropriately. He proposes that it can be the task of ethics to guide researchers, engineers, and developers to think carefully about the consequences of their artificially intelligent inventions. To this end, Salganik proposes a hope-based and principle-based approach to research ethics in machine learning. This is opposed to a fear-based and rule-based approach in social science, or the more ad hoc ethics culture that we encounter in data and computer science. For example, machine learning ethics should include pre-research review through forms that are reviewed by third parties to avoid groupthink and encourage researchers’ reflexivity. Given the fast pace of development, though, the field should avoid a compliance mentality typically found at institutional review boards of universities. Any rules to be complied with are unlikely to stand the test of time in the fast-moving world of machine learning, which would result in burdensome and uninformed ethics scrutiny. Salganik develops these themes in his new book Bit By Bit: Social Research in the Digital Age, which has an entire chapter about ethics.

See a video of the panel here.

BlockSci: a platform for blockchain science and exploration

The Bitcoin blockchain — currently 140GB and growing — contains a massive amount of data that can give us insights into the Bitcoin ecosystem, including how users, businesses, and miners operate. Today we’re announcing BlockSci, an open-source software tool that enables fast and expressive analysis of Bitcoin’s and many other blockchains, and an accompanying working paper that explains its design and applications. Our Jupyter notebook demonstrates some of BlockSci’s capabilities.

Current tools for blockchain analysis depend on general-purpose databases that have full support for transactions. But that’s unnecessary for blockchain analysis where the data structures are append-only. We take advantage of this observation in the design of our custom in-memory blockchain database as well as an analysis library.
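As a rough illustration of why append-only data makes transactional machinery unnecessary, here is a minimal Python sketch. The names `TxRecord` and `AppendOnlyLog` are hypothetical and are not part of BlockSci, whose actual storage format is described in the working paper. The sketch only shows that when records are never modified or deleted, a record's position can serve as a permanent identifier, and readers need no rollback or update logic.

```python
from typing import Iterator, List, NamedTuple, Tuple


class TxRecord(NamedTuple):
    """Hypothetical, simplified transaction record (not BlockSci's format)."""
    block_height: int
    fee_satoshi: int


class AppendOnlyLog:
    """In-memory store that only ever grows; existing entries never change."""

    def __init__(self) -> None:
        self._records: List[TxRecord] = []

    def append(self, record: TxRecord) -> int:
        """Add a record and return its index, which never changes afterwards."""
        self._records.append(record)
        return len(self._records) - 1

    def get(self, index: int) -> TxRecord:
        """Look up a record by the index returned from `append`."""
        return self._records[index]

    def __len__(self) -> int:
        return len(self._records)

    def scan_from(self, start: int = 0) -> Iterator[Tuple[int, TxRecord]]:
        """Sequentially yield (index, record) pairs from `start` onward."""
        for i in range(start, len(self._records)):
            yield i, self._records[i]


log = AppendOnlyLog()
first = log.append(TxRecord(block_height=1, fee_satoshi=500))
second = log.append(TxRecord(block_height=1, fee_satoshi=700))
total_fees = sum(rec.fee_satoshi for _, rec in log.scan_from(0))
```

Because nothing is ever overwritten, an index handed out once can be stored in other structures and reused indefinitely, which is the property a general-purpose transactional database does not assume.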

BlockSci’s core infrastructure is written in C++ and optimized for speed. (For example, traversing every transaction input and output on the Bitcoin blockchain takes only 10.3 seconds on our r4.2xlarge EC2 machine.) To make analysis more convenient, we provide Python bindings and a Jupyter notebook interface. This interface is slower, but is ideal for exploratory analyses and allows users to quickly iterate when developing new queries.

The code below shows the convenience of traversing the blockchain using straightforward Python idioms, built-in currency conversion using historical exchange-rate data, and the use of pandas DataFrames for analysis and visualization.

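import blocksci
import pandas

# Assumes `chain` is an already-loaded blocksci.Blockchain object.
fees = [sum(block.fees) for block in chain.range('2017')]  # sum of the fees in each 2017 block
times = [block.time for block in chain.range('2017')]  # timestamp of each 2017 block
converter = blocksci.CurrencyConverter()
df = pandas.DataFrame({"Fee": fees}, index=times)
df = converter.satoshi_to_currency_df(df, chain)  # convert satoshi amounts using historical rates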

When plotted, it results in the following graph showing the average transaction fee per block:

BlockSci uses a custom data format; it comes with a parser that generates this data from the serialized blockchain format recorded by cryptocurrency nodes such as bitcoind. The parser supports incremental updates when new blocks are received, making it easy to stay up to date with the latest version of the blockchain. We’ve used BlockSci to analyze Bitcoin, Bitcoin Cash, Litecoin, Namecoin, Dash, and ZCash; many other cryptocurrencies make no changes to the blockchain format, and so should be supported with no changes to BlockSci.
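The idea behind incremental updates can be sketched in a few lines of Python. This is not BlockSci's parser: the class `IncrementalParser`, its methods, and the dictionary-based block layout are all hypothetical stand-ins, and the sketch ignores complications such as chain reorganizations. It only illustrates remembering how far parsing has progressed and processing just the blocks that arrived since.

```python
from typing import Dict, List


class IncrementalParser:
    """Hypothetical parser that remembers how far it has already parsed."""

    def __init__(self) -> None:
        self.parsed_blocks: List[Dict] = []

    @property
    def next_height(self) -> int:
        """Height of the first block that has not been parsed yet."""
        return len(self.parsed_blocks)

    def update(self, node_blocks: List[Dict]) -> int:
        """Parse only blocks not seen before; return how many were added."""
        new_blocks = node_blocks[self.next_height:]
        for raw in new_blocks:
            self.parsed_blocks.append(self._parse(raw))
        return len(new_blocks)

    @staticmethod
    def _parse(raw: Dict) -> Dict:
        # Stand-in for the real deserialization work.
        return {"height": raw["height"], "tx_count": len(raw["txs"])}


# Toy stand-in for the block data a node has on disk.
node_blocks = [{"height": 0, "txs": ["a"]}, {"height": 1, "txs": ["b", "c"]}]
parser = IncrementalParser()
added_first = parser.update(node_blocks)   # initial run parses both blocks

node_blocks.append({"height": 2, "txs": ["d"]})
added_second = parser.update(node_blocks)  # later run parses only the new block
```

The second call does work proportional to the number of new blocks rather than the length of the whole chain, which is what makes frequent updates cheap.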

In our working paper, we present four analyses that show BlockSci’s usefulness for answering research questions. We show how multisignatures unfortunately weaken privacy and confidentiality; we apply the cluster intersection attack to Dash, a privacy-focused altcoin; we analyze inefficiencies in the usage of block space; and we present improved methods for estimating how often coins change possession as opposed to just being shuffled around.

Here’s an illustrative example. Exploratory graph analysis using BlockSci allowed us to discover a behavioral pattern in the usage of multisignatures that weakens security. Multisignatures are a security-enhancing mechanism that distributes control of an address over a number of different public keys. Surprisingly, we found that users often negate this security by moving their funds from a multisig address to a regular address and then back again after a period of a few hours to days. We think this happens when users are changing the access control policy on their wallet, although it is unclear why they transfer their funds to a regular address in the interim, and not directly to the new multisig address. This pattern of behavior has left over $12 million insecure over the course of more than 22,000 transactions. What users may not appreciate is that the temporary weakening of security is advertised to potential attackers on the blockchain.
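To make the pattern concrete, here is a minimal Python sketch of how such multisig-to-regular-to-multisig hops could be searched for. It does not use BlockSci's API and is not the method from the working paper: the `Transfer` record, the function `find_multisig_round_trips`, the address labels, and the 72-hour window are all assumptions chosen for illustration, and real transactions with multiple inputs and outputs would need more care.

```python
from typing import List, NamedTuple, Tuple


class Transfer(NamedTuple):
    """Hypothetical, simplified view of a transaction (not BlockSci's API)."""
    time_hours: float
    src: str
    src_is_multisig: bool
    dst: str
    dst_is_multisig: bool


def find_multisig_round_trips(
    transfers: List[Transfer], max_gap_hours: float
) -> List[Tuple[Transfer, Transfer]]:
    """Find multisig -> regular -> multisig hops within `max_gap_hours`."""
    ordered = sorted(transfers, key=lambda t: t.time_hours)
    matches = []
    for i, out in enumerate(ordered):
        # Step 1: funds leave a multisig address for a regular address.
        if not (out.src_is_multisig and not out.dst_is_multisig):
            continue
        for back in ordered[i + 1:]:
            if back.time_hours - out.time_hours > max_gap_hours:
                break  # later transfers are even further away in time
            # Step 2: the same regular address pays into a multisig address.
            if back.src == out.dst and back.dst_is_multisig:
                matches.append((out, back))
                break
    return matches


# Toy transfers invented for illustration.
transfers = [
    Transfer(0.0, "msigA", True, "regX", False),
    Transfer(5.0, "regX", False, "msigB", True),
    Transfer(6.0, "msigC", True, "regY", False),
    Transfer(500.0, "regY", False, "msigD", True),
]
round_trips = find_multisig_round_trips(transfers, max_gap_hours=72)
```

In the toy data only the first pair qualifies, since the second pair's return trip falls outside the chosen window. The interval between the two transfers in each match is the period during which the funds sit behind a single key.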

There’s far more to explore on public blockchains. BlockSci is publicly available now, and we hope you’ll find it useful. It is easy to get started using the EC2 image we’ve released, which includes the Bitcoin blockchain data in addition to the tool. BlockSci is open-source, and we welcome contributions. This is an alpha release; we’re continuing to improve it and the interface may change a bit in future releases. We look forward to working with the community and to hearing about other creative uses of the data and the tool.