We introduce the novel Wino-X benchmark to investigate whether translation models can perform coreference resolution that requires commonsense knowledge and whether multilingual language models are capable of commonsense reasoning across multiple languages. Our findings indicate that models are prone to biases and often fail to identify disambiguating information.