Introduction – What is Generative AI?

A major portion of the discourse surrounding emerging technologies this year has concerned the advent of generative artificial intelligence (“Generative AI”). The integration of Generative AI into daily tasks has been rapid but permanent. Businesses around the world have commenced integrating Generative AI into their workflows and customer services to offer increasingly creative solutions within their existing services. Generative AI is generally considered an offshoot of language models, which have been in use for a long time now. Language models have seen continuous business use, from automated search results on search engines such as Google to word suggestions in Microsoft Word.

Language models are generally trained using two types of learning: supervised learning and unsupervised learning. In supervised learning, the system is trained on a dataset in which each input is paired with a matching labelled outcome. By adjusting its parameters based on the difference between its predictions and the true labels, the algorithm develops the ability to map inputs to outputs. As a result, the model can generalise and make accurate predictions on previously unseen data. Unsupervised learning, by contrast, entails tasks in which the algorithm investigates the intrinsic structure of unlabelled data without specific output direction. It frequently comprises tasks such as clustering or dimensionality reduction, with the goal of discovering patterns, correlations, or descriptions in data without predetermined target categories.
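The distinction can be sketched in a few lines of Python. This is a toy illustration with made-up one-dimensional data; the function names are hypothetical and the routines are simplistic stand-ins for real training algorithms.

```python
# Supervised learning: a perceptron-style rule that adjusts parameters
# by the gap between the model's prediction and the true label.
def train_supervised(samples, labels, epochs=50, lr=0.1):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if w * x + b > 0 else 0
            w += lr * (y - pred) * x
            b += lr * (y - pred)
    return lambda x: 1 if w * x + b > 0 else 0

# Unsupervised learning: 1-D k-means discovers structure with no labels.
def cluster_unsupervised(samples, k=2, iters=20):
    centroids = [min(samples), max(samples)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            groups[nearest].append(x)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

# Supervised: inputs paired with labelled outcomes.
classify = train_supervised([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
print(classify(1.5), classify(8.5))  # the learned mapping generalises

# Unsupervised: the same inputs, no labels; two clusters are inferred.
print(sorted(cluster_unsupervised([1.0, 2.0, 8.0, 9.0])))
```

The supervised routine only ever improves by comparing against labels, while the clustering routine never sees a label at all, which is precisely the distinction drawn above.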

An important milestone has been the introduction of ChatGPT, a generative AI natural language processing tool that can answer users’ questions in a conversational manner. ChatGPT was a novel innovation because, unlike most other natural language processing machine learning models, it utilised a combination of neural network architecture, supervised learning, and unsupervised learning to be trained to provide responses to questions posed to it. ChatGPT was first allowed to learn from vast text-based sources on the internet, and then learned with human supervision on specific correlations to offer a more fine-tuned experience. Essentially, this meant that ChatGPT was one of the first language models able to learn to produce correct answers to questions without needing explicit instructions on correlations between specific questions and correct answers.

Legal Hurdle

In March 2023, Italy’s Garante became the first authority to raise concerns regarding the operation of ChatGPT while assessing its compliance with the European Union’s General Data Protection Regulation (“GDPR”). In its investigation, the data protection authority found that OpenAI, the entity offering ChatGPT, did not comply with some fundamental aspects of the GDPR. For instance, the authority stated that there was no legal basis for the collection of massive amounts of personal data to “train” its models. It also stated that individuals did not have the right to object to or stop the processing of their personal data. Further, ChatGPT did not offer any protection to children, meaning that children under the age of thirteen could access and utilise the platform with no parental consent requirements or other checks and balances. It was also found that ChatGPT could provide misleading or inaccurate information about individuals, leading to a further violation of the GDPR.

With this, questions surrounding Generative AI’s ability to actually comply with global data protection regulations came to the fore. This analysis aims to provide an understanding of Generative AI’s interaction with data protection laws in India at the initial stage of training, to assist businesses in navigating the intricacies of present and future data protection law concerns.

Data Protection and Generative AI – A Perspective As Per Privacy Laws in India

Training Data – Right to Use

One of the first steps in the creation of a Generative AI tool is finding datasets large enough to train the Generative AI tool. It is almost inevitable that such training data also includes large volumes of personal data. This is because training data is usually obtained by scraping different parts of the internet, leading to the use of personal data available on the internet, to train the Generative AI tool.

In the case of supervised learning, human intervention can ensure that personal data is not processed by the Generative AI tool without compliance with applicable laws; this is, however, not always possible with large language models and unsupervised learning techniques. Accordingly, the use of personal data within training datasets ought to be heavily regulated. Training data is usually also obtained through third-party sources, and creators of Generative AI tools have no interface with individuals at the stage of collecting personal data for use within the training datasets of the Generative AI model. The scraping of various internet data sources to generate training data poses an additional risk: while India does not have laws specifically governing data scraping, the liability of data scrapers increases when the terms of use of specific websites prohibit data scraping activities.

The present Indian data protection law, the Information Technology (Reasonable Security Practices and Procedures and Sensitive Personal Data or Information) Rules, 2011 (“SPDI Rules”), requires consent only for the processing of sensitive personal data or information – a subset of personal information which includes passwords, biometric information, medical information, sexual orientation, and financial information. Accordingly, under the SPDI Rules, Generative AI creators need not rely on a specific ground for processing personal information that is not SPDI.

However, other obligations under the SPDI Rules, relating to the provision of privacy policies, the implementation of reasonable security practices and procedures, and data subject rights, would continue to subsist. Generative AI models must also be trained not to include SPDI within their scope. Given the manner in which training data is collected in most instances, it is usually not possible for those compiling training data to obtain consent from individuals. Considering that the SPDI Rules do not offer other grounds for the processing of SPDI, Generative AI models in India must ensure that sensitive data does not form part of the training datasets unless consent has been obtained from the individuals concerned.

Several technical solutions can be used where Generative AI training data incorporates SPDI. Data pre-treatment techniques such as redaction and anonymisation guarantee that identifying information is removed or replaced before it enters the training loop. By replacing sensitive tokens with substitutes, tokenisation together with masking can prevent the algorithm from learning precise details relating to SPDI. Other privacy-centric methods include synthetic data creation, privacy-preserving models (such as federated learning), and adversarial training against SPDI reconstruction. These methods ensure that Generative AI (i) automatically discards SPDI from its learning process (as with adversarial learning), or (ii) does not access SPDI relating to any one specific individual even though such datasets continue to exist within training data models (as in the case of tokenisation or federated learning). Furthermore, enforcing strong access controls and permissions restricts SPDI access, while frequent audits assure continuing compliance with privacy standards and legislation, supporting the responsible and safe growth of Generative AI.
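A minimal sketch of such a pre-treatment pass is shown below. The patterns and masking tokens are illustrative only; a production system would need far more robust detection of SPDI categories (for example, financial or medical identifiers) before text reaches the training corpus.

```python
import re

# Illustrative patterns for categories of SPDI; each match is replaced
# with a masking token so the model cannot memorise the identifier.
SPDI_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[CARD]": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "[PHONE]": re.compile(r"\b\d{10}\b"),
}

def redact_spdi(text: str) -> str:
    """Replace sensitive spans with masking tokens before training."""
    for token, pattern in SPDI_PATTERNS.items():
        text = pattern.sub(token, text)
    return text

record = "Reach Asha at asha@example.com or 9876543210."
print(redact_spdi(record))  # -> "Reach Asha at [EMAIL] or [PHONE]."
```

Because the substitution happens before the training loop, the model only ever sees the placeholder tokens, which is the property the redaction and masking techniques above aim for.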

The Digital Personal Data Protection Act, 2023 (“DPDPA”), India’s yet-to-be-implemented new data protection law, may offer Generative AI trainers respite. While the DPDPA expands its scope of application to all categories of personal data and not just SPDI, it also contains an important exemption: the DPDPA does not apply to personal data which has been made public by the individual to whom it relates. However, further guidance is expected from the government and the incoming Data Protection Board of India to lay down various standards under the DPDPA, including standards on what would be considered data that has been “made public” by individuals. This may include guidance on whether posting on social media accounts, communication platforms, or other forms of internet-based interaction would fall within the scope of “public data”.

Further, where businesses can obtain consent from individuals (and the use of personal data is necessary), consent may be obtained prior to the implementation of the DPDPA. This is because standards of consent under the existing SPDI Rules are largely undefined and include consent obtained in a physical or electronic form, whereas under the DPDPA consent must be free, specific, informed, unambiguous, unconditional, and expressed through a clear affirmative action. The DPDPA provides respite to entities which have already obtained consent prior to its commencement by allowing such entities to merely provide a privacy notice informing data subjects of the personal data in their possession, the purposes of processing, the manner in which they may withdraw consent, and the manner in which they may approach the Data Protection Board of India.

Data Scraping – The Indian Legal Thicket

Most of the training datasets used during the machine learning process may arise out of data scraped from the internet. Data scraping is not presently a regulated subject in India. Neither the SPDI Rules nor their enabling legislation, the Information Technology Act, 2000, contain specific provisions that prevent or heavily regulate the act of data scraping. The risk that Generative AI creators face in this regard accordingly arises from individual websites’ terms and conditions, which may regulate data scraping activities.

Most internet sources rich with the varied datasets suited to training Generative AI models require either that no scraping of data occur on their websites, or that permission be obtained from the website owners prior to any scraping activities. Generative AI creators ought to pay attention to such website terms and conditions, since they would be binding in India even where no explicit affirmative action is required of users to accept them. This is because electronic modes of contract acceptance, such as click-wrap and browse-wrap contracts, are generally considered valid in India, although judicial interpretation of such means of contract acceptance has not been exhaustive.
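Reviewing a website’s terms and conditions is a legal exercise, but a complementary technical check that scrapers commonly automate is the site’s robots.txt file, a machine-readable signal of whether automated collection is tolerated on a given path. The sketch below uses Python’s standard robots.txt parser against a hypothetical policy; it is an illustration of the check, not a substitute for reading the terms themselves.

```python
from urllib import robotparser

# A hypothetical robots.txt policy, parsed directly from text so the
# example needs no network access.
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Allow: /
""".splitlines())

def may_fetch(url: str, agent: str = "genai-trainer") -> bool:
    """True if the (hypothetical) policy permits fetching this path."""
    return rp.can_fetch(agent, url)

print(may_fetch("https://example.com/articles/1"))  # allowed path
print(may_fetch("https://example.com/private/x"))   # disallowed path
```

A scraper that respects such signals, alongside the legal review described above, is better placed to avoid the “unwarranted data scraping activities” the terms of many websites prohibit.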

Generative AI creators must therefore be sensitive to the data scraping activities they undertake: prior to the collection of data from a website, a short analysis of the website’s terms and conditions must be carried out to ensure that no unwarranted data scraping activities occur. Additionally, Generative AI creators may want to consider a collaborative approach, reaching out to the owners of data-rich websites to gain permission for scraping. While the success of such a collaborative approach is presently untested in India, it ought to offer respite to Generative AI creators by ensuring contractual protections for their scraping activities and displacing liability onto website owners (for allowing the use of data on their websites by the Generative AI creators), rather than placing liability on Generative AI creators (for using data available on websites).

Solely from a data protection perspective, the position stated in the previous section would presently continue to subsist. The SPDI Rules do not regulate data scraping activities and have no specific exemption in this regard. The use of SPDI would demand that consent be obtained, implying that this liability would remain with Generative AI creators when such activities are carried on independently, or would need to be passed on to website owners when Generative AI creators follow a collaborative approach. When dealing with the DPDPA in the future, the same exemption pointed to in the previous section provides respite. An open interpretation of this provision is that data scraping is technically legal and unregulated by the DPDPA, since personal data obtained through the ethical scraping of publicly available websites would not have to go through the processes laid down under the DPDPA, including relying on specific grounds of processing, displaying privacy notices, enforcing data subject rights, or undertaking additional obligations in relation to children’s data.

Children’s Data – The New “SPDI”

While the present data privacy regime in India, the SPDI Rules, does not specifically govern the processing of the personal data or SPDI of children and imposes the same level of protection on the personal data and SPDI of children and adults, the DPDPA requires additional protections for processing personal data relating to children. The DPDPA requires that entities obtain verifiable parental consent prior to the processing of children’s personal data, and also refrain from specific activities in relation to children’s personal data, including tracking or behavioural monitoring of children and targeted advertising directed at children. This could prove to be an issue for Generative AI creators. While Generative AI models may not create profiles of specific children, one of the fundamental characteristics of a Generative AI model is to track and monitor the manner in which specific data interacts, and to be able to correlate and mimic such data to provide answers to specific questions.

Two issues therefore arise in relation to the processing of children’s data: reliance on verifiable parental consent for the processing of children’s data, and ensuring that other duties placed upon entities in relation to children are not violated. To ensure DPDPA compliance, Generative AI models must prioritise anonymisation or pseudonymisation of children’s data throughout the modelling phase. Anonymisation is the process of removing personally identifiable information from data so that unique identities cannot be deduced from it. Pseudonymisation, on the other hand, assists by replacing distinguishing features with fictitious indicators, enabling the data to be used for study and instruction without being directly linked to specific persons. These approaches serve as strong protections, considerably lowering the risk of DPDPA enforcement. While this would allow Generative AI models to study information about a type of individual (i.e., someone below the legal age of majority), such data would not be termed “children’s data” for the purposes of the DPDPA, since it cannot be linked to a specific individual child.
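One common way to implement the “fictitious indicator” of pseudonymisation is a keyed hash: the identifier is replaced with a deterministic token, so records belonging to the same user can still be grouped for training, while the key that would link the token back to a person is stored separately from the dataset. The key name and record shape below are illustrative only.

```python
import hashlib
import hmac

# Illustrative secret; in practice this would be rotated and stored
# outside the training dataset (e.g. in a key management system).
SECRET_KEY = b"rotate-me-and-store-separately"

def pseudonymise(identifier: str) -> str:
    """Replace an identifier with a deterministic fictitious indicator."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"user": "child-42", "age_band": "under-18", "query": "homework help"}
safe_record = {**record, "user": pseudonymise(record["user"])}
print(safe_record["user"])  # a stable token, not the original identifier
```

Because the mapping is keyed rather than a plain hash, someone holding only the dataset cannot feasibly reverse or re-derive the tokens, which is what distinguishes pseudonymisation from simply renaming fields.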

Alongside anonymisation or pseudonymisation, Generative AI models should deploy a number of other tactics to ensure DPDPA compliance while using children’s data for learning objectives. Explicit and informed notices to parents or legal guardians about processing activities are critical to guarantee that those with decision-making power over the child are fully conscious of, and accept, the data processing activities. The principle of purpose limitation must also be strictly maintained, with children’s data being used only for the defined training purposes. Data minimisation practices must also be emphasised, with only the information required for the intended purpose being collected, decreasing the possible impact on privacy.

To further protect the confidentiality and integrity of the data, stringent data protection mechanisms such as encryption and access restrictions may be applied, reducing the danger of unauthorised access. Open dialogue regarding the model’s data consumption, defined data retention and deletion policies, and frequent audits may all help to ensure transparency and continuous compliance. Taken together, these safeguards could form a complete framework that not only protects children’s privacy but also assures the ethical and legal use of data in the construction of Generative AI models.

Mechanisms to Implement

Data Subject Rights

Under the SPDI Rules, individuals have the right to review and correct personal information provided to entities, and to withdraw consent to the processing of SPDI. When the DPDPA comes into force, a different set of data subject rights must be provided by entities. These include the right to access information about the personal data being processed; the right to review, correct, and erase personal data; the right of grievance redressal; and the right to nominate another person to act on one’s behalf in case of mental incapacity or death. While Generative AI creators may not collect personal data directly from the individuals to whom such data relates, they will be expected to adhere to any personal data requests raised directly with them, or through the partners from whom such data has been obtained. Generative AI creators must create mechanisms that allow such rights to be facilitated, including user interfaces to acknowledge and automate data subject rights, and standard operating procedures on how to handle data subject rights with both internal and external stakeholders.

With internal stakeholders, it is also important to isolate specific points of contact and create processes to identify the exact locations of personal data and their processing activities, and to identify processes which can facilitate the specific rights exercised by individuals. This often involves a combination of automated and manual processes. While automated processes can take over the identification and the actual facilitation of the data-subject-facing process, human intervention will still be required to decide whether to actually give effect to a data principal’s right. Generative AI creators will have to carefully evaluate the extent to which they may actually be able to give effect to data subject rights relating to correction and erasure, especially in the context of training data. While a specific dataset may be removed or altered in the training database, it may not always be possible to cause the Generative AI tool to “forget” or alter one particular dataset. This is because Generative AI models based on large language models “predict” text based on the next most probable output. Therefore, Generative AI models may still give an output that contains inaccurate or “erased” information through prediction. Suitable disclosures will be required to ensure that individuals are aware of such intricacies, and that attempts are made to correct such inaccurate outputs and erase information as may be necessary.
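The mix of automated location and human decision-making described above can be sketched as a simple request-handling flow. The class names, statuses, and data map shape are hypothetical, intended only to show how an automated step can find where a subject’s data lives while leaving the final decision to a reviewer.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DSRRequest:
    """Illustrative intake record for a data subject rights request."""
    subject_id: str
    right: str                      # "access" | "correct" | "erase"
    received: date
    locations: list = field(default_factory=list)  # where data was found
    status: str = "open"

def handle(request: DSRRequest, data_map: dict) -> DSRRequest:
    """Automated step: locate the subject's data across known stores.
    A human reviewer then decides whether the right can actually be
    given effect (e.g. erasure from a trained model may not be feasible)."""
    request.locations = [store for store, subjects in data_map.items()
                         if request.subject_id in subjects]
    request.status = ("pending_human_review" if request.locations
                      else "no_data_found")
    return request

# A toy data map: which subjects appear in which internal stores.
data_map = {"training_corpus": {"u1", "u2"}, "logs": {"u2"}}
req = handle(DSRRequest("u2", "erase", date.today()), data_map)
print(req.status, sorted(req.locations))
```

Routing every located request to human review, rather than deleting automatically, reflects the point above that erasure from a trained model is not always technically possible and may instead require disclosures and output corrections.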

Data Mapping, Appointments, and Grievance Redressal Processes

As a first step towards compliance, it will be important for Generative AI creators to create a comprehensive map of all training data, containing information about the different categories of personal data they can identify and use, identifying the de-identification processes that such data undergoes, and creating dynamic, localised data maps to build a comprehensive understanding of internal and external processes.
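One possible shape for such a data map is sketched below. The source names, field names, and review rule are all hypothetical; the point is that each training source records what personal data it contains and which de-identification steps it has undergone, so gaps can be flagged automatically.

```python
# Hypothetical data map: one entry per training data source.
data_map = [
    {
        "source": "web_corpus_2023",
        "personal_data_categories": ["name", "email"],
        "spdi_present": False,
        "de_identification": ["regex_redaction", "tokenisation"],
        "jurisdiction": "IN",
    },
    {
        "source": "support_tickets",
        "personal_data_categories": ["name", "phone", "financial"],
        "spdi_present": True,
        "de_identification": ["pseudonymisation"],
        "jurisdiction": "IN",
    },
]

def sources_needing_review(entries):
    """Flag sources containing SPDI that have not been anonymised,
    so a privacy lead can assess whether consent was obtained."""
    return [e["source"] for e in entries
            if e["spdi_present"]
            and "anonymisation" not in e["de_identification"]]

print(sources_needing_review(data_map))  # -> ['support_tickets']
```

Kept current, a map like this also feeds the data subject rights processes discussed earlier, since it records exactly where each category of personal data resides.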

In this regard, it will also be necessary for Generative AI creators to make personnel appointments, such as grievance redressal officers, data protection officers, and internal privacy leads, to ensure the integration of privacy practices into the working of the Generative AI product. By establishing a transparent framework, creators must provide explicit mechanisms for individuals to raise objections about data processing, algorithmic decisions, or privacy violations. Clear and easily available avenues for communication are critical to creating a successful grievance redressal system, allowing for the timely filing and resolution of concerns. Generative AI creators must also endeavour to ensure that grievance redressal processes allow individuals to have sovereignty over their data and to understand the working dynamics of the Generative AI systems. Considering the nature of Generative AI models and their ability to learn and predict, regular audits and evaluations should be performed with the goal of detecting and mitigating possible privacy issues. Concurrently, the grievance redressal processes must, in context, attempt to fully educate users on their data protection rights and notify them of any changes in data processing practices.

Conclusion

The requirement for Generative AI systems to comply with India’s data protection legislation is emphasised not just to ensure legal compliance, but also to place it within the larger panorama of ethical and legal issues in the fast-growing artificial intelligence sector. The dynamism and complexity inherent in Generative AI necessitate a strong framework that assures compliance with legal regulations, such as India’s upcoming DPDPA. Artificial intelligence creators may need to foster an environment of trust and confidence among users by including mechanisms that prioritise user consent, transparency, and accountability, alleviating concerns about Generative AI’s practices regarding data privacy and security.