# Uploading datasets

Please read [Downloading datasets](./downloading.ipynb) first, as it explains the general setup.

We connect to SciCat and a file server using a [Client](../generated/classes/scitacean.Client.rst):
```python
from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer
client = Client.from_token(url="https://scicat.ess.eu/api/v3",
                           token=...,
                           file_transfer=SFTPFileTransfer(
                               host="login.esss.dk"
                           ))
```
This code is identical to the one used for [downloading](./downloading.ipynb).
As with the downloading guide, we use a fake client instead of the real one shown above.

In [None]:
from scitacean.testing.docs import setup_fake_client

client = setup_fake_client()

This is especially useful here as datasets cannot be deleted from SciCat by regular users, and we don't want to pollute the database with our test data.

First, we need to generate some data to upload:

In [None]:
from pathlib import Path

path = Path("data/witchcraft.dat")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    f.write("7.9 13 666")

## Create a new dataset

With the totally realistic data in hand, we can construct a dataset.

In [None]:
from scitacean import Dataset

dset = Dataset(
    name="Spellpower of the Three Witches",
    description="The spellpower of the maiden, mother, and crone.",
    type="raw",

    owner_group="wyrdsisters",
    access_groups=["witches"],

    owner="Nanny Ogg",
    principal_investigator="Esme Weatherwax",
    contact_email="nogg@wyrd.lancre",

    creation_location="lancre/whichhut",
    data_format="space-separated",
    source_folder="/somewhere/on/remote",
)

There are many more fields that can be filled in as needed.
See [scitacean.Dataset](../generated/classes/scitacean.Dataset.rst).

Some fields require an explanation:

- `type` is either `raw` or `derived`. The main difference is that derived datasets point to one or more input datasets.
- `owner_group` and `access_groups` correspond to users/usergroups on the file server and determine who can access the files.

Now we can attach our file:

In [None]:
dset.add_local_files("data/witchcraft.dat")

Now, let's inspect the dataset.

In [None]:
dset

In [None]:
len(list(dset.files))

In [None]:
dset.size  # in bytes
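As a rough cross-check of the number above, the sketch below computes the on-disk size of a file with the same content using only the standard library. It assumes that `dset.size` reports the combined size of all attached files; `total_size` is a hypothetical helper and not part of Scitacean.

```python
import tempfile
from pathlib import Path


def total_size(paths):
    """Return the combined size in bytes of the given files (hypothetical helper)."""
    return sum(Path(p).stat().st_size for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    # Same content as the file created earlier, written to a throwaway folder.
    path = Path(tmp) / "witchcraft.dat"
    path.write_text("7.9 13 666")
    size_in_bytes = total_size([path])

print(size_in_bytes)
```

The content consists of ten ASCII characters, so the file occupies 10 bytes.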

In [None]:
file = list(dset.files)[0]
print(f"{file.remote_access_path(dset.source_folder) = }")
print(f"{file.local_path = }")
print(f"{file.size = } bytes")

The file has a `local_path` but no `remote_access_path`, which means that it exists on the local file system (where we put it earlier) but not on the remote file server accessible by SciCat.
The location can also be queried using `file.is_on_local` and `file.is_on_remote`.

Likewise, the dataset only exists in memory on our local machine and not on SciCat.
Nothing has been uploaded yet.
So we can freely modify the dataset or bail out by deleting the Python object if we need to.

## Upload the dataset

Once the dataset is ready, we can upload it:

In [None]:
finalized = client.upload_new_dataset_now(dset)

<div class="alert alert-warning">
    <b>WARNING:</b>

This action cannot be undone by a regular user!
Contact an admin if you uploaded a dataset accidentally.

</div>

[scitacean.Client.upload_new_dataset_now](../generated/classes/scitacean.Client.rst#scitacean.Client.upload_new_dataset_now) uploads the dataset (i.e. metadata) to SciCat and the files to the file server.
It does so in a way that always creates a new dataset and new files without overwriting any existing (meta)data.
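To illustrate what "never overwrite" means in practice, here is a minimal sketch on a local file system. It is not how Scitacean or its file transfers are implemented (their internals are not described here); `write_new_file` is a hypothetical helper that relies on Python's exclusive-creation mode.

```python
import tempfile
from pathlib import Path


def write_new_file(path, data: bytes) -> None:
    """Create `path` with `data`; fail instead of overwriting (hypothetical helper)."""
    # Mode "xb" means exclusive creation: open() raises FileExistsError
    # if the file already exists, so existing content is never touched.
    with open(path, "xb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "witchcraft.dat"
    write_new_file(target, b"7.9 13 666")
    try:
        write_new_file(target, b"overwritten?")
        second_write_succeeded = True
    except FileExistsError:
        second_write_succeeded = False
    content_after = target.read_bytes()

print(second_write_succeeded, content_after)
```

The second write is rejected and the original content stays intact, which is the guarantee the upload aims to provide for existing datasets and files.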

It returns a new dataset that is a copy of the input with some updated information generated by SciCat and the file transfer.
For example, it has been assigned a new ID:

In [None]:
finalized.pid

And the remote access path of our file has been set:

In [None]:
list(finalized.files)[0].remote_access_path(finalized.source_folder)

## Location of uploaded files

All files associated with a dataset will be uploaded to the same folder.
This folder may be at the path we specify when making the dataset, i.e. `dset.source_folder`.
However, the folder is ultimately determined by the file transfer (in this case `SFTPFileTransfer`), which may override the `source_folder` that we set.
This allows facilities to enforce a specific structure on their file server through Scitacean's file transfers.
In this example, since we don't tell the file transfer otherwise, it respects `dset.source_folder` and uploads the files to that location.
See the [File transfer](../reference/index.rst#file-transfer) reference for information on how to control this behavior.
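The precedence described above can be sketched in plain Python. This is only an illustration of the decision, not Scitacean's implementation: `resolve_upload_folder`, its parameters, the `{owner_group}` placeholder syntax, and the `/facility/...` path are all assumptions made for this example.

```python
from pathlib import PurePosixPath


def resolve_upload_folder(dataset_folder, transfer_folder=None, **fields):
    """Pick the remote folder for an upload (hypothetical helper).

    dataset_folder: the source_folder set on the dataset.
    transfer_folder: optional override configured on the file transfer;
        may contain placeholders such as "{owner_group}" (assumed syntax).
    fields: values used to fill in the placeholders.
    """
    if transfer_folder is None:
        # No override configured: respect the dataset's own source_folder.
        return PurePosixPath(dataset_folder)
    # Override configured: the facility's layout wins over the dataset's setting.
    return PurePosixPath(transfer_folder.format(**fields))


default_folder = resolve_upload_folder("/somewhere/on/remote")
facility_folder = resolve_upload_folder(
    "/somewhere/on/remote",
    transfer_folder="/facility/{owner_group}/upload",
    owner_group="wyrdsisters",
)
print(default_folder)
print(facility_folder)
```

Without an override, the dataset's folder is used unchanged. With one, the dataset's `source_folder` is ignored in favor of the facility's layout.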

In any case, we can find out where files were uploaded by inspecting the finalized dataset that was returned by `client.upload_new_dataset_now`:

In [None]:
finalized.source_folder

Or by looking at each file individually as shown in the section above.

## Attaching images to datasets

It is possible to attach *small* images to datasets.
In SciCat, this is done by creating 'attachment' objects which contain the image.
Scitacean handles those via the `attachments` property of `Dataset`.
For our locally created dataset, the property is an empty list and we can add an attachment like this:

In [None]:
dset.add_attachment(
    caption="Scitacean logo",
    thumbnail="./logo.png",
)
dset.attachments[0]

`Dataset.add_attachment` can load an image from a file and properly encode it for SciCat.
We could also use a more manual approach and construct `scitacean.Attachment` and `scitacean.Thumbnail` objects ourselves and append them to `dset.attachments`.
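To give an idea of what "encoding" an image involves, the sketch below turns raw image bytes into a base64 data URL using only the standard library. It assumes that thumbnails are stored in this data-URL form; `encode_thumbnail` is a hypothetical helper and does not reproduce what `Dataset.add_attachment` does internally.

```python
import base64
import mimetypes


def encode_thumbnail(data: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. "image/png".
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type of {filename!r}")
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


# The 8-byte PNG signature serves as stand-in data instead of a real image.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_thumbnail(png_signature, "logo.png")
print(encoded)
```

Because base64 inflates the data by roughly a third and the result is stored alongside the metadata, this approach is only suitable for small images.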

When we then upload the dataset, the client automatically uploads all attachments as well.
Note that this creates a new dataset in SciCat.
If you want to add attachments to an existing dataset after upload, you need to use the lower-level API through `client.scicat.create_attachment_for_dataset` or the web interface directly.

In [None]:
finalized = client.upload_new_dataset_now(dset)

In order to download the attachments again, we can pass `attachments=True` when downloading the dataset:

In [None]:
downloaded = client.get_dataset(finalized.pid, attachments=True)
downloaded.attachments[0]

In [None]:
# This cell is hidden.
# It should remove *only* files and directories created by this notebook.
import shutil

shutil.rmtree("data", ignore_errors=True)