
Should I Care About My Data?


Personal Data in the Era of Big Data

Gabriella Nolan


For years, voice assistants relied on pre-recorded audio clips to sound human. Today, that limitation is gone. Advances in voice synthesis mean that a voice, including that of a celebrity, can be artificially produced with impressive accuracy. In fact, Scarlett Johansson recently criticised OpenAI for allegedly using her voice without consent to develop its “Sky” assistant.


But the bigger question this raises is: if technology can replicate a celebrity’s voice without permission, what does that mean for you and me? How did we reach a point where data, whether voice, image, or behaviour, can be captured, processed, and repurposed without our explicit control? And what are the implications when this data powers systems that directly influence and affect us?


To understand why this matters, let’s look at how these technologies work, and why your data is central to them.


How the Tech Works


Voice imitation technologies rely on advanced machine learning and speech synthesis to generate audio that sounds convincingly human. These systems learn from large, diverse datasets, capturing variations in tone, emotion, volume, and context, such as how speech changes when someone is happy, stressed, or speaking in public versus private. The broader and richer the training data, the better the model becomes at mapping subtle patterns between sound and meaning. This ability to replicate nuance is what makes synthetic voices sound natural, unlike the rigid, robotic voices of earlier generations.

 

This, in a nutshell, is how most generative AI works. Celebrities, constantly in the public eye with vast amounts of their media already available in digital form and in large quantities, provide the perfect foundation for building such systems. These systems can be built with little regard for individual preferences. While laws outline what organisations should and shouldn’t do with data, they don’t change the fact that, technically, almost anything is possible with minimal constraints. Structured or unstructured, data can be aggregated, analysed, and repurposed by anyone with the right tools. Given the sheer volume of digital data, the potential is enormous.


Many of these technologies were unimaginable when celebrities and everyday people first began navigating the web. Actions that once seemed harmless now form the foundation for powerful AI systems. Just as some celebrities have seen their voices cloned without consent, we must ask: what have our past and current behaviours exposed our future to?

 

The data-driven present is the result of a rapid evolution in how we collect and process information at scale.


A Short History of (Big) Data


Before the era of Big Data, information was stored on a much smaller scale: punch cards, filing cabinets, and isolated local databases. These systems were slow, fragmented, and difficult to share, which naturally limited how data could be circulated. Processing and digitising large volumes of information was costly and time-consuming, creating a built-in safeguard against mass exploitation.


Today, those barriers have disappeared. Big Data refers to massive and complex datasets that are difficult to process and manage with traditional tools. It’s characterised by three key traits: large volume, high velocity, and variety. Big Data helps organisations make “data-driven” decisions, power advanced analytics, and enable machine learning.


Without realising it, your data is probably part of this ecosystem. Every time you visit a website, stream a movie, shop online, or use your phone, information about your behaviour and habits is being captured and stored.


But the concept is relatively young: it began to take shape in the 1990s, with the modern phase accelerating through the 2000s alongside the growth of mass social platforms. Its scale shifted dramatically with the rise of the web, cloud computing, and new distributed systems.


What counts as your data?


Personal data is traditionally defined as any information that can identify an individual, such as your name, email address, phone number, IP address, or location. Non-personal data, by contrast, refers to information that cannot directly identify someone, like aggregated statistics, anonymised datasets, weather information, or generic usage patterns.

 

But data can be “about you” without being classified as personal. Likewise, personal data may include details that seem trivial to others, while non-personal data, such as behavioural patterns or preferences, can be far more valuable and, in many cases, more intrusive. This broader category of “your data” is what powers predictive systems and influences decisions about you, often without your awareness.


Emerging privacy proposals suggest that once data is anonymised, or even pseudonymised, it may fall outside the scope of personal data regulations. While intended to simplify compliance, these changes risk creating loopholes that weaken protections and introduce ambiguity. If anonymisation becomes a gateway to sidestep accountability, the future of personal data governance looks increasingly uncertain.


Even without unrestricted access to personal data, these systems already exert considerable influence over our lives. So how safe is this ecosystem? Or is personal data simply the missing piece in erasing any remaining sense of autonomy? The question then becomes: how effective is anonymisation, and is it truly future-proof? As datasets multiply and cross-referencing becomes more sophisticated, patterns will inevitably re-emerge that link back to individuals. Advanced analytics and AI accelerate this risk, turning weak anonymisation into little more than a temporary illusion of privacy.
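To make the re-identification risk concrete, here is a minimal sketch of a classic linkage attack: joining an “anonymised” dataset with a public auxiliary dataset on shared quasi-identifiers (age and postcode). All records, names, and fields below are invented for illustration.

```python
# Hypothetical example: an "anonymised" health dataset with names removed,
# and a separate public register. Joining the two on quasi-identifiers
# (age + postcode) re-attaches identities to sensitive records.

anonymised_health = [
    {"age": 34, "postcode": "SW1A", "diagnosis": "asthma"},
    {"age": 52, "postcode": "M1", "diagnosis": "diabetes"},
]

public_register = [
    {"name": "A. Smith", "age": 34, "postcode": "SW1A"},
    {"name": "B. Jones", "age": 52, "postcode": "M1"},
]

def reidentify(anon_rows, aux_rows):
    """Match records whose quasi-identifiers agree in both datasets."""
    matches = []
    for anon in anon_rows:
        for aux in aux_rows:
            if anon["age"] == aux["age"] and anon["postcode"] == aux["postcode"]:
                # The "anonymous" diagnosis is now tied to a name.
                matches.append({**aux, "diagnosis": anon["diagnosis"]})
    return matches

print(reidentify(anonymised_health, public_register))
```

When quasi-identifier combinations are rare, even a handful of attributes can single someone out, which is why stripping names alone is rarely enough.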


The Road Ahead

For most people, being turned into a voice assistant might seem far-fetched (though you’d probably be annoyed if someone cloned your voice without consent and profited from it). But the real risks go much deeper.

 

Your data is already being used to train algorithms that influence decisions about credit scores, insurance premiums, and even job prospects. Now imagine applying for a role and being scored not just on what you say, but how you say it: tone, pace, stress markers. Now add another dataset, e.g. the system tracks website traffic to gauge your interest in researching the company. Low engagement? Low score.
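The scenario above can be sketched in a few lines. This is a purely hypothetical illustration of how two unrelated data sources might be blended into one applicant score; the weights, scale, and signals are all invented, not taken from any real hiring system.

```python
# Hypothetical composite applicant score: a (made-up) voice-analysis score
# is blended with a web-engagement signal. Both inputs end up on a 0..1 scale.

def composite_score(voice_score: float, site_visits: int) -> float:
    """Blend an invented voice score with site-visit engagement."""
    engagement = min(site_visits / 10, 1.0)  # cap engagement at 10 visits
    # Invented weights: 70% voice analysis, 30% research engagement.
    return 0.7 * voice_score + 0.3 * engagement

# A strong speaker who never visited the company site still loses points:
print(composite_score(0.9, 0))
print(composite_score(0.9, 10))
```

The point is not the arithmetic but the opacity: an applicant penalised for “low engagement” would have no way of knowing that signal was ever in the mix.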


A major concern is accountability, or rather, the lack of it. With so many players involved in collecting, processing, and using data, responsibility often gets pushed from one side to the other. If an algorithm makes a decision that harms you or others, who should be held accountable? The developer who built the model? The company that deployed it? The data broker who sold your information? The regulator who failed to keep pace? Or even you? And beyond accountability, how comfortable are you knowing that your own data may have contributed to that outcome? In a world where your data fuels every decision, how long before your information and identity become indistinguishable?

 



