Anonymous API access, or universal token

lucw · June 9, 2022, 12:58am

I am developing a product on top of Baserow which you can think of like a Language Learning spreadsheets. People will enter text in a certain language, and there will be new field types which automatically translate into another language.

My thought about how to design this whole thing is to keep a fairly standard Baserow deployment, and associate it with an external REST API which will listen for changes using webhooks.

The main issue i’m hitting is how will that external component publish changes back to Baserow. It seems the API assumes every connection will be authenticated using a token associated with a real user. Is there a way of setting up a “god” user which has access to all workspaces, all tables ?

I started looking into how i’d implement that in Baserow and it seems fairly involved, I’d have to touch a large number of places.

Another option would be to embed the language translation logic into Baserow. That way I wouldn’t need an external component, however the changes to Baserow might be even more elaborate. There wouldn’t be any authentication issues.

Which one of the two options might be preferable for me ?

petrs · June 9, 2022, 3:08pm

There are two different tokens in Baserow, JWT (user-related), and database tokens. You can create a database token in Settings → API tokens. It allows you to have a time-unrestricted token with configurable permissions. The main endpoints work with this token, but it is good to check for each API endpoint which tokens are accepted.

Would this be what you are looking for?

lucw · June 9, 2022, 11:32pm

I want a way to authenticate which allows write access to any user’s workspace and database on my baserow instance. This is because I have an external process which is contributing enrichments (language translations).
As far as I know this is not supported in baserow right now ? The database tokens you mention are still associated with a single user ?

lucw · June 10, 2022, 5:08am

@petrs to provide more context on what i’m trying to do, I essentially want to build a Formula Field but which does language translation (like English to French).

After reading through the base FieldType implementation and I think I convinced myself I need to implement this through a custom FieldType , and compute the translation from a member function like row_of_dependency_updated. Baserow has a much better understanding of dependencies and I think my product will work better, rather than trying to go the webhook + API call route. I also won’t have any authentication issues because all of this will happen internally.

I essentially need a Formula Field Type, but which involves a REST API call (500ms-1500ms in duration) to compute the result. What would be a good place to make this REST call ? directly inside row_of_dependency_updated, or should I do this asynchronously ?

nigel · June 10, 2022, 8:41am

@lucw Just to confirm, your new field would reference another field and then show the translated results?

I think you are definitely on the right track if so. Here are some extra pointers:

Calling a rest API in `row_of_dependency_updated`

If I was you to get started I would call the API in a blocking fashion in this hook (and the others below if it makes sense). If that ends up being too slow, you could attempt to run this async using the bundled celery workers. But that will be more complex and I wouldn’t resort to it unless you really encounter issues with just blocking and waiting.

`FieldType.get_field_dependencies`

Your custom field type will need to implement and return FieldType.get_field_dependencies which tells the field dependency system what fields a particular field of your type depends on.

This area is being refactored/changed on this MR Fix bugs with field dependency system (!752) · Merge requests · Bram Wiepjes / baserow · GitLab .

Currently this method should return a list of strings where the string is the user defined name of the field that your new field depends on (the field you are translating) eg:

return [field_instance.the_field_being_translated.name]

The change in the MR above which will affect you once it has been merged and released, and then you merge those new changes into your fork. Once that happens this is the change you need to deal with:
FieldType.get_field_dependencies should then return a list of the FieldDependency model itself. So in your situation this would look something like:

return [FieldDependency(dependant=field_instance, dependency=field_instance.the_field_being_translated)]

FieldType.field_dependency_updated

This is another hook called like the row_of_dependency_updated hook you already found. But this hook is called when a field of your new type, depends on another field, and that other field has been updated somehow. For example someone could Edit->Change Field Type->Click change on the field being translated and change it into a boolean field. You’ll then have to handle this change in this hook.

In the formula field we set the field into an “error” state which is shown to the user if the dependency changes in some invalid way. I’d imagine you’d have to do the same thing as you can’t translate a boolean field. Alternatively you could just blank out all the cells in this situation, don’t do any translations and no need to put in the work to show an error if these fields are always going to be internal perhaps.

FieldType.field_dependency_deleted/created

There are two other hooks called when a field_dependency is deleted or created. One interesting thing to note, in what situations does it ever make sense for FieldType.field_dependency_created to be called? I’ll try to explain:

Field A depends on a field called field “B” in the same table
Field B is deleted
The field dependency system will do something special to the dependency between Field A and B. It will change it to a special state where Field A depends on an non-exsistant field called “B”.
The user makes a new field called field “B”, renames another field to be called field “B” or restores the original field “B”
The field dependency system now can see this new field will fix the old broken dependency. Hooks up field A to this new field B and then calls FieldType.field_dependency_created.

Now all this complexity really exists for a smooth formula UX. For your field you are either going to have to deal with the possibility that the field you are translated is deleted and put your field into some error state. Or just delete your translation field immediately if the source field is deleted. Or some other way not sure.

Crazy alternative idea

I was thinking, what if you instead added this translation functionality as a new function in the formula langauge. By doing it this way you don’t need to worry about making a new field type or doing anything with the field dependency system. Instead you just need to figure out how to do your translation using PostgreSQL as this is what formulas ultimately compile down to and execute as.

However theres a big problem doing it this way, its hard and a bit mad to call a rest api from SQL. You would have to do something like: rest - Calling RESTful Web Services from PostgreSQL procedure/function - Stack Overflow . By using some embedded language inside a postgresql function you’ll also need to enable and install that extension. This is perhaps fine for you however if you never intend to ship this new field type as a Baserow plugin for others to use. However we are currently reworking the plugin system and this repo is an example of the new type of Baserow plugin Nigel Gott / baserow_geo_plugin · GitLab which also installs a postgres extension automatically for users etc, so perhaps this is an option for you also.

Making a super API user

If you ever want to try out using webhooks again, I believe you could easily change baserow.core.models.Group.has_user to always allow users which are a django superuser. Then you could use a django superuser to make changes to all tables etc.

lucw · June 10, 2022, 12:44pm

Hi @nigel , you’ve capture exactly what I’m trying to achieve. Many thanks for your detailed explanation. This and the detailed comments in the code explaining the various methods of FieldType have been the final confirmation for me that this is a high quality project I need to support, I just became a github sponsor, it’s not much but if i’m able to scale up revenue i’ll be able to contribute more. Huge thanks for the effort you and the team are putting into this project. I’ll no doubt have more questions for you after I experiment with the various approaches.

lucw · June 16, 2022, 3:53am

@nigel are there any tricks to updating a field’s value in row_of_dependency_updated ? With the following code, I end up with the N-1 value. For example if I type “hello 1” under the source field, I get "translation: " (using the previously blank source value) in the target field, and entering “hello 2” will result in “translation: hello 1”. Basically translation output is always one version behind. With the logging statements, I was able to verify that source_value is indeed the most up to date version, it’s really the output which is one version behind.

The mysterious part for me is the update statement. What is it exactly ? Is it an SQL statement ? I just plug in the actual string I want to have in that field, it seems to work but with this weird N-1 delay. Maybe i’m just not using the right way of updating the field within my TranslationFieldType's row_of_dependency_updated method. I realized that FieldFormulaType may not be a perfect example for me due to its reliance on PostgresSQL syntax.

Edit 1: note that if I F5 on the grid view, i get the right value on the translation field, so maybe there’s something i’m not doing right with notifying that my field has changed (the webhooks also have this N-1 value)

    def row_of_dependency_updated(
        self,
        field,
        starting_row,
        update_collector,
        via_path_to_starting_table,
    ):
        logger.info(f'row_of_dependency_updated, row: {starting_row}, {type(starting_row)}')

        # source_value = getattr(starting_row, field.source_field)
        source_value = getattr(starting_row, 'field_2034')
        translated_value = 'translation: ' + source_value
        
        logger.info(f'translated_value: {translated_value}')

        update_collector.add_field_with_pending_update_statement(
            field,
            translated_value,
            via_path_to_starting_table=via_path_to_starting_table,
        )        
        ViewHandler().field_value_updated(field)        

        super().row_of_dependency_updated(
            field,
            starting_row,
            update_collector,
            via_path_to_starting_table,
        )

lucw · June 16, 2022, 6:53am

Found it, I need:

class TranslationTextField(models.TextField):
    requires_refresh_after_update = True

and in my TranslationFieldType class, I need:

    def get_model_field(self, instance, **kwargs):
        return TranslationTextField(
            default=None,
            blank=True, 
            null=True, 
            **kwargs
        )

Then my translation field works, and this N-1 lag goes away.

lucw · June 18, 2022, 6:06am

@nigel the “source field” has a “field id”, like field_1992 and a name like “English”. What terminology do you use to describe these two identifiers ?
Assuming:

field_1992 is the field ID
“English” is the field name

it looks like get_field_dependencies needs to return ["English"] (return an array of field names), while as row_of_dependency_updated will need to know the field ID, in order to do source_value = getattr(starting_row, 'field_1992')

here’s my question:

should I store the field ID or the field name in my TranslationField instance ?
if I need to store the field ID, how do I convert it to field name within get_field_dependencies ?

lucw · June 19, 2022, 1:58am

Looks like the following is working for me, though i’m open to comments.
The TranslationFieldType will store the source field ID, such as field_1992

    def get_field_dependencies(self, field_instance: Field, field_lookup_cache: FieldCache):
        # logger.info(f'get_field_dependencies')
        table_model = field_lookup_cache.get_model(field_instance.table)
        candidates = [field for field in table_model._meta.fields if field.name == field_instance.source_field]
        if len(candidates) != 1:
            raise Exception(f'could not find {field_instance.source_field} in table {table_model}')
        field = candidates[0]
        result = [field.verbose_name]
        logger.info(f'result: {result}')
        return result


    def row_of_dependency_updated(
        self,
        field,
        starting_row,
        update_collector,
        via_path_to_starting_table,
    ):
        # logger.info(f'row_of_dependency_updated, row: {starting_row} vars: {vars(starting_row)}')
        source_value = getattr(starting_row, field.source_field)

        # add translation logic here:
        translated_value = 'translation: ' + source_value

        update_collector.add_field_with_pending_update_statement(
            field,
            translated_value,
            via_path_to_starting_table=via_path_to_starting_table,
        )        

        super().row_of_dependency_updated(
            field,
            starting_row,
            update_collector,
            via_path_to_starting_table,
        )

lucw · June 19, 2022, 1:34pm

Continuing to share my progress. I found that it’s not difficult at all to integrate with the celery worker in baserow. There is a downside for single row updates, because doing a row update from the celery worker may create a race condition where the ultimate value gets overwritten (it’s due to the ordering of the signals which emit the update on the websocket). So for single row updates, I may still synchronously process the translation from within row_of_dependency_updated. It’s nice and clean, and results in a single webhook / websocket update.
However if the user changes the translation field’s settings, using celery is a must, because a table may have thousands of fields, and I probably won’t want to to perform those synchronously from after_update.

nigel · June 20, 2022, 8:26am

Hey @lucw , just got back from holiday, your progress looks great! Here are some comments:

You can use the db_column field attr when accessing the starting_row to get the cell value:

source_value = getattr(starting_row, field.source_field.db_column)

The starting_row parameter in your row_of_dependency_updated is potentially not safe to access in your current way. When a row is updated, it triggers these row_of_dependency_updated hooks to be called in dependency order across the field dependency graph. There is no guarantee that the field parameter is present in the starting_row as these hooks might have followed a field dependency into a new table. The starting row is always the initial row/rows that was modified in the starting table. Additionally starting_row is badly named atm and can be a list of rows when a bulk_update API call occurs.

Instead, I was thinking we could add the ability to the update_collector itself to take a lambda/callable that is given the rows in the table of the updated field, and then your function could modify those rows for them to be bulk_updated.

Firstly here is a patch which I hope adds the ability to register a lambda/function with the update_collector:

add_python_row_update_funcs_to_update_collector.patch (11.2 KB)

The idea being you can now do something like this in your row_of_dependency_updated:

def row_of_dependency_updated(
    self,
    field,
    starting_row,
    update_collector,
    via_path_to_starting_table,
):
    def translate_rows(rows):
        for row in rows:
            translated_value = "translation: " + getattr(row, field.db_column)
            setattr(row, field.db_column, translated_value)

    update_collector.add_field_with_pending_update_function(
        field,
        update_function=translate_rows,
        via_path_to_starting_table=via_path_to_starting_table,
    )

    ViewHandler().field_value_updated(field)

    super().row_of_dependency_updated(
        field,
        starting_row,
        update_collector,
        via_path_to_starting_table,
    )

Hm, it’s unfortunate you have to use a custom model field with requires_refresh_after_update. This attribute was intended for situations where the new cell values are computed in SQL itself and so you are forced to do a UPDATE + SELECT to get the new values when responding with the new row in the API. In your situation we don’t do the translation itself in SQL and so we do already know the new translated value in python. However your solution of requires_refresh_after_update=True even though it runs a potentially needless SELECT it still the simplest way I can see of fixing this N-1 for your situation.

the “source field” has a “field id”, like field_1992 and a name like “English”. What terminology do you use to describe these two identifiers ?
Assuming:

So we haven’t 100% nailed down the terminology for field_1992 identifiers vs English. Right now I refer to the first one as the “internal field name” and the second the “user field name”. You are 100% correct that get_field_dependencies on your fork needs to return the list of “user field name”'s, however when accessing a row instance/model you should instead be usingg the internal field name. To answer your question, I would store a FK to the source field in your TranslationField instance:

class TranslationField(models.Field):
    source_field = models.ForeignKey(
        "Field",
        on_delete=models.CASCADE,
        help_text="The field to translate.",
        null=True,
        blank=True,
    )

And then in get_field_dependencies you would:

    def get_field_dependencies(self, field_instance, field_lookup_cache):

        if field_instance.source_field is not None:
            return [field_instance.source_field.name]
        else:
            return []

No need to get a model/loop over the fields in your model like in your example. You can just access the user field name directly on a field by doing .name

Makes sense that using celery is a must. We’ve been considering for a while moving all formula/lookup/maybe even all field updates off into an async celery task, or doing so based on some sort of heuristic. If you are comfortable with sharing your github publicly I’d love to see how it ends up working/looking. Hopefully my suggested patch above can also improve your performance as you can hopefully do some sort of mass translate single API call given the list of all rows that have changed + the bulk_update usage.

Let me know if you have any other questions / would like a full MR review at some point later on Would you consider offering your new field type as a publicly available plugin in the future also perhaps?

lucw · June 21, 2022, 2:11pm

Hi @nigel , thank you for your detailed answer. Not sure how long I can abuse your kindness this way but here are a few more questions:

class TranslationField(Field):
    source_field = models.ForeignKey(
        "Field",
        on_delete=models.CASCADE,
        help_text="The field to translate.",
        null=True,
        blank=True,
        related_name='+'
    )

I had to add related_name='+', without it, I run into this error:

ERRORS:
database.TranslationField.source_field: (fields.E305) Reverse query name for 'database.TranslationField.source_field' clashes with reverse query name for 'database.TranslationField.field_ptr'.
        HINT: Add or change a related_name argument to the definition for 'database.TranslationField.source_field' or 'database.TranslationField.field_ptr'.

note that I don’t fully understand what this does, it’s just something I got on stack overflow. But I am going to study how foreign keys work on django.

Second, when creating the field using the frontend, I get the following error in the devtools console. I guess i’m going to have to write some additional code to build this foreign key based on the internal field ID.

error	"ERROR_REQUEST_BODY_VALIDATION"
detail	Object { source_field: […] }
source_field	[ {…} ]
0	Object { error: "Incorrect type. Expected pk value, received str.", code: "incorrect_type" }
error	"Incorrect type. Expected pk value, received str."
code	"incorrect_type"

All my code will be open source, however in some cases i’ll use APIs that require payments (for example translation API from google/azure). Some part of my product will be free and people will be able to install it and test it out. Some of those APIs are pretty long calls exceeding a few seconds, so i’ll have to use asynchronous logic in those cases. And it looks like Celery is well equipped to monitor queues, maybe alert if they get too long, so i’ll be able to monitor performance. In general if people want to add “calculated / derived fields” computed using external REST APIs, I expect they should be able to copy my approach.

nigel · June 22, 2022, 9:45am

No abuse at all, we love to see more users/extensions/plugins for Baserow and I think your new field type will be incredibly useful for lots of people!

The related name error makes sense / is expected and your fix is also sensible. You might want to change

source_field = models.ForeignKey(
        "Field",

to

source_field = models.ForeignKey(
        "database.Field",

To be clearer about which app that model is coming from.

That frontend error is also expected. In your component Form in the frontend you need to make sure you are sending the integer field.id value for the source_field.

You might find some inspiration in the modules/database/components/field/FieldLookupSubForm.vue which also shows dropdowns for other fields in the table and sends them to the backend (note you don’t need to do any extra http queries, the lookup sub form does a http api request to get the fields of another table, I believe for your source field the only choices should be fields in the same table, which should already be present in the field store, see the linkRowFieldsInThisTable property in that component for something i’d imagine you’d have to do).

Additionally the baserow.contrib.database.fields.field_types.LookupFieldType shows how to have a field in Baserow which has an FK to another field. It does much more than you need to as it also lets you set the target/through fields via name ,where I think you only need to be able to set your source field using the internal field.id (unless you also want to set it via its user name then have a look at how through_field_name works instead of through_field_id).

lucw · August 8, 2022, 2:37pm

hi @petrs , @nigel , I’ve been making good progress on my project and everything is working beautifully. I am hitting a tiny glitch in the front-end when doing large number of updates and i opened another thread to discuss: Large number of row updates, how to notify front-end
but mainly i wanted to say things are working really well for me, I find baserow is really well designed !

lucw · September 4, 2022, 4:43am

@nigel is it possible to merge your add_python_row_update_funcs_to_update_collector.patch for 1.12 ? Now that i’ve converted by app to a plugin, it’s harder to apply patches to the core baserow code.

And in the mean time, would you mind giving me an updated patch which can be applied cleanly on 1.11? I wasn’t able to apply the patch.

lucw · September 8, 2022, 2:43pm

@nigel could I ask for your help on the above? It’s preventing me from making progress on my project. Soon after that I’d like to share my plug-in code for your review as you gracefully offered previously.

nigel · September 9, 2022, 8:59am

Hey @lucw , i’ll get you an updated patch today. Plugins making it harder to modify Baserow internal code is both a feature and bug atm. You can literally patch code using the plugin’s build.sh scripts, but this is probably a dangerous and tough road to go down.

One alternative that you could go with is not using the update_collector at all. Instead in your row_of_dependency_updated field type hook methods you could just directly update the rows yourself like so:

def row_of_dependency_updated(
    self,
    field,
    starting_row,
    update_collector,
    via_path_to_starting_table,
):

    model = field.table.get_model()
    rows_to_bulk_update = []
    for row in model.objects.all():
        translated_value = "translation: " + getattr(row, field.db_column)
        setattr(row, field.db_column, translated_value)
        rows_to_bulk_update.append(row)
    model.objects.bulk_update(rows, updated_fields=[field.db_column])

    ViewHandler().field_value_updated(field)

    super().row_of_dependency_updated(
        field,
        starting_row,
        update_collector,
        via_path_to_starting_table,
    )

Edit:
One caveat with the above code is that the update collector does a bunch of extra work for you, it makes sure you only end up updating rows that need to be updated etc, so i’d say the above solution is more inefficient temporary fix until the patch can be release in Baserow.

lucw · September 9, 2022, 10:31am

Great, this works for now, I can use the enhanced updatecollector later when you get a chance to integrate it into main baserow. thank you.

lucw · January 19, 2023, 1:43pm

Hi @nigel , did you eventually integrate the “Enhanced Update Collector” into the main codebase ? the alternative way works fine for me though.